2013 8th International Conference on Computer Engineering & Systems (ICCES)
DOI: 10.1109/icces.2013.6707225
Web-based Arabic/English duplicate record detection with nested blocking technique

Abstract: Data accuracy and quality affect the success of any business intelligence and data mining solution. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset; this operation becomes more complicated when entities are identified by a string value, as in the case of person names. These data-inaccuracy problems arise from misspellings and a wide range of typographical variations, especially in non-Latin languages such as Arabic. Up t…
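The paper's nested blocking technique is not detailed in the truncated abstract. As background, a minimal sketch of standard blocking for duplicate record detection is shown below; the blocking key (first three lowercased letters) and all names are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch of standard blocking for duplicate record detection.
# The blocking key below (first 3 letters, lowercased) is a hypothetical
# choice; the paper's "nested blocking" variant is not described here.
from collections import defaultdict
from itertools import combinations

def blocking_key(name: str) -> str:
    """Cheap key used to group records before pairwise comparison."""
    return name.strip().lower()[:3]

def candidate_pairs(records):
    """Compare only records that share a blocking key, not all pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

names = ["Ahmed Ali", "Ahmad Ali", "Sara Omar", "Ahmed Aly"]
print(candidate_pairs(names))
```

Blocking reduces the quadratic cost of comparing every record with every other one: only records falling in the same block are compared by the (more expensive) string-similarity step.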

Cited by 10 publications (4 citation statements); references 14 publications.
“…The system used ten features to classify the character models. Also used [34]. There is a set of cleaning tools available for data cleaning.…”
Section: Related Work
confidence: 99%
“…In this step, datasets are converted to the Unicode system to support the Arabic language. Several works (Yousef, 2015; Higazy et al., 2013; El-Shishtawy, 2013; Yousef, 2013) used a set of standardization rules for Arabic datasets. These rules consist of replacing a set of characters with their equivalent character.…”
Section: Preprocessing
confidence: 99%
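The standardization rules described above can be sketched as a simple character-replacement table. The specific rule set below (normalizing alef variants, teh marbuta, and alef maqsura) is a common choice in Arabic text normalization and is an assumption here, not the exact set used by the cited works.

```python
# Illustrative Arabic standardization rules of the kind described above:
# each character in a variant set is mapped to a canonical equivalent.
# The exact rule set used by the cited works is an assumption.
ARABIC_RULES = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0649": "\u064A",  # alef maqsura          -> yeh
})

def standardize(text: str) -> str:
    """Replace character variants with their canonical equivalents."""
    return text.translate(ARABIC_RULES)

print(standardize("\u0623\u062D\u0645\u062F"))  # "أحمد" -> "احمد"
```

Applying such rules before matching makes spelling variants of the same name hash to the same form, which also improves blocking-key agreement.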
“…Several ER frameworks have been developed for datasets in Latin and in particular English language such as Febrl (Christen, 2008), TAILOR (Elfeky, Verykios, & Elmagarmid, 2002) and BigMatch (Yancey, 2002). These frameworks do not recognize non-Latin characters and in particular Arabic characters because they do not use Unicode system (Higazy, El Tobely, Yousef, & Sarhan, 2013). On the other hand, developed approaches to support ER in Arabic datasets (Gueddah, Yousfi, & Belkasmi, 2012; Ghafour, El-Bastawissy, & Heggazy, 2011; El-Shishtawy, 2013; Yousef, 2013; Aqeel, Beitzel, Jensen, Grossman, & Frieder, 2006) require matching rules or training sets developed by an expert.…”
Section: Introduction
confidence: 99%
“…Duplicate detection tools such as the Febrl system, TAILOR, and BigMatch were also used in cleaning data. However, Febrl has usability limitations such as slowness, unclear error messages, and complicated installations [17][18][19][20]. The listed programs are rather complex to the average users who do not have experience with programming and language functions.…”
Section: Introduction
confidence: 99%