Arabic Text Correction Using Dynamic Categorized Dictionaries: A Statistical Approach. By Adnan Yahya, Ali Salhi, Department of Computer Systems Engineering, Birzeit University. Presented at the CITALA’12 Conference, Rabat, Morocco, May 2-3, 2012.
Outline • Introduction. • Spelling Dictionaries and Data Collection. • Spelling and Ranking Variables. • Categorization Method. • Spelling Algorithm. • Conclusions.
Introduction • With the increase of Arabic content awareness and creation, there is a great need for tools to overcome the many challenges in processing and retrieving Arabic web content. • One of the major challenges is ensuring the correctness of data through spelling and correction. • We propose a technique providing multiple spelling/ranking variables controlled to give customized results based on the detected/declared category of the processed text.
Introduction (Cont …) • The technique depends on dynamic dictionaries selected based on the input text categorization. • The dynamic dictionaries are built using a statistical/corpus-based approach from data in the Arabic Wikipedia and local newspapers. • Output: databases of expressions of different dimensionalities and their frequencies: single, double and triple word expressions.
Introduction (Cont …) • The suggested spelling approach is a mix of: • context-dependent (using categorization), and • context-independent (using customized dictionaries) error correction. • It has three components: • Misspelled word detection: compare the input with the elements of a predefined Arabic dictionary. • Dynamic dictionaries: the selection of the dictionary depends on the category of the input text. • Correction/ranking metric: based on a number of correction and ranking variables, possibly with different weights.
Spelling Dictionaries and Data Collection • The spelling algorithm depends on several dictionaries: • General Dictionary : depends on a statistical/Corpus approach based on contemporary data we obtained from various sources. General Dictionary Statistics
Spelling Dictionaries and Data Collection (Cont…) • Dynamic Dictionaries: based on Wikipedia categorization: • Wikipedia data is collected using an automated process that connects related articles based on the categorization done manually by the Wikipedia editors. • A predefined list of collected Arabic Wikipedia topics is available as data rows of the form <title, content>. • Sets of related articles are generated. How? • Wikipedia editors use manual tagging/categorization. • E.g. the tags found in an article about القدس:
Spelling Dictionaries And Data Collection (Cont…) • Based on Wikipedia categories we can link articles together based on the shared categories. • There is a possible relation between categories appearing jointly in articles. • Example: Article “A” categorized under : قوانين نيوتن , ميكانيكا Article “B” categorized under : ميكانيكا , طاقة حركية We can conclude that the three categories are related because Article “A” and “B” have ميكانيكا
Spelling Dictionaries and Data Collection (Cont…) • Going too deep in this relation analysis may end up connecting all categories in the Wikipedia (Not Good) • Solution: related categories approach: • Start with a predefined category, say فيزياء • Parse articles to select ones having category فيزياء. • For each such article, add categories found to the queue. • Move to the next in queue … • For each category seen, have a variable indicating the number of articles in the category.
Spelling Dictionaries and Data Collection (Cont…) • A number N is introduced to control how many articles to process. We set N = 50: It means that when the number of processed articles reaches N = 50 the operation stops and the categories stored in the queue are considered for manual inspection. A manual phase is needed here to make sure that the categories in the queue (from the 50 articles) are truly related and do not cause major problems in categorization.
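The related categories approach can be sketched as a breadth-first walk over categories that stops after N articles. This is only a minimal sketch: the function name `related_categories`, the `(title, categories)` article layout and the return values are assumptions made for illustration, not the authors' implementation. The categories it returns would still go through the manual inspection phase described above.

```python
from collections import deque


def related_categories(articles, seed, max_articles=50):
    """Collect categories related to `seed` (hypothetical helper).

    `articles` is assumed to be a list of (title, set_of_categories) pairs.
    Processing stops once `max_articles` articles have been processed (N = 50).
    """
    queue = deque([seed])
    discovered = [seed]   # categories in the order they were first seen
    processed = set()     # titles of the articles already processed
    while queue and len(processed) < max_articles:
        category = queue.popleft()
        for title, cats in articles:
            if category not in cats or title in processed:
                continue
            if len(processed) >= max_articles:
                break
            processed.add(title)
            # Add every category found in this article to the queue.
            for other in sorted(cats):
                if other not in discovered:
                    discovered.append(other)
                    queue.append(other)
    # For each category seen, record the number of articles in it.
    counts = {c: sum(1 for _, cats in articles if c in cats) for c in discovered}
    return discovered, counts
```

With the two-article example from the earlier slide, starting from ميكانيكا yields all three categories, since both articles share it.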
Spelling Dictionaries And Data Collection (Cont…) • Some categories with their top 10 related categories, using the related categories approach. • Using this technique we built a Wikipedia based categorized corpus.
Spelling Dictionaries And Data Collection (Cont…) Wikipedia Categories Current Statistics
SPELLING VARIABLES • V1: Levenshtein distance (Lev) • Levenshtein distance is the minimum number of single-letter edits (insertions, deletions, substitutions) needed to convert string A into string B. For example, for A: الوظن and B: الوطن the Levenshtein distance between A and B is 1. • The Levenshtein function outputs a normalized score based on the following equation: Lev(A,B) = 1 – [#ofDifferentLetters / min(Length A, Length B)], so the example above gives 1 – 1/5 = 0.8.
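A minimal sketch of V1, assuming that #ofDifferentLetters is the standard Levenshtein edit distance. The function names `edit_distance` and `lev_score` are hypothetical, and the handling of empty strings is an added assumption.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lev_score(a, b):
    """Lev(A,B) = 1 - distance / min(len(A), len(B)), as on the slide."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        # Assumption: the slide does not define the empty-string case.
        return 1.0 if a == b else 0.0
    return 1 - edit_distance(a, b) / shortest
```

Note that with this normalization the score can become negative when the distance exceeds the length of the shorter string.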
SPELLING VARIABLES (Cont …) • V2: Letter Pair Similarity (LPS) Measures how many adjacent character pairs are shared by two strings A and B. LPS(A, B) = (2 * |pairs(A) ∩ pairs(B)|) / (|pairs(A)| + |pairs(B)|) Example: String A: الوطن العربي String B: العربي الوطن Strings A and B are divided into pairs of letters, word by word: P1 {ال, لو, وط, طن, ال, لع, عر, رب, بي} P2 {ال, لع, عر, رب, بي, ال, لو, وط, طن} According to the equation the similarity is (2*9)/(9+9) = 1.00, which ranks the sentence high in the list of possible matches. The score is unaffected by word order.
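A minimal sketch of V2, assuming pairs are taken inside each word (which matches the nine pairs listed in the example) and that repeated pairs are counted as a multiset. The names `letter_pairs` and `lps` are hypothetical.

```python
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs within each whitespace-separated word."""
    pairs = Counter()
    for word in text.split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def lps(a, b):
    """LPS(A,B) = 2 * |pairs(A) & pairs(B)| / (|pairs(A)| + |pairs(B)|)."""
    pa, pb = letter_pairs(a), letter_pairs(b)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        return 0.0  # assumption: no pairs at all means no measurable similarity
    shared = sum((pa & pb).values())  # multiset intersection
    return 2 * shared / total
```

Because only the bag of pairs is compared, swapping the two words in the example leaves the score at 1.00.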
SPELLING VARIABLES (Cont …) • V3: Shape Similarity • A function that measures the similarity in shape between two strings A and B. • Example: الوظن vs. الوطن vs. الوزن ShapeSimilarity(A, B) = #OfRelatedLetters / max(length A, length B) Letters are considered related if they have similar shapes (say, differing only in dots). Compared with الوطن, the result is 1.00 for الوظن and 0.8 for الوزن.
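A minimal sketch of V3. The shape groups below are an illustrative assumption (letters that differ only in dots), not the table used by the authors. Letters are compared position by position, and identical letters count as related, which is what the 1.00 and 0.8 values in the example imply.

```python
# Illustrative, incomplete groups of letters with the same base shape (assumption).
SHAPE_GROUPS = ["بتث", "جحخ", "دذ", "رز", "سش", "صض", "طظ", "عغ", "فق"]
SHAPE_CLASS = {ch: i for i, group in enumerate(SHAPE_GROUPS) for ch in group}


def shape_similarity(a, b):
    """#related letters / max(len(A), len(B)), compared position by position."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    related = sum(
        1
        for x, y in zip(a, b)
        if x == y or (x in SHAPE_CLASS and SHAPE_CLASS[x] == SHAPE_CLASS.get(y))
    )
    return related / longest
```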
SPELLING VARIABLES (Cont …) • V4: Location Parameter • The function compares two strings to measure how close their letters are on the keyboard. • For example, the letter ا has the following related group: {ت ل ف غ ع ة ى لآ أ لألإ إ ‘ ـ آ}, which are the adjacent letters on the keyboard (possibly with Shift). • This can also be extended to restore Arabic text entered in Latin characters because the user failed to switch the input language. • LetterLocation(A,B) = (#equalLetters + #relatedLetters) / max(length A, length B) Example: مدرسة vs. كدؤسة (see the keyboard figure): equalLetters = 3 and relatedLetters = 2, so LetterLocation = (3+2)/5 = 1
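A minimal sketch of V4. The adjacency table `KEY_NEIGHBOURS` is a tiny, hypothetical excerpt that covers only the letters of the example; a real table would list the neighbours of every key of the layout in use, including Shift variants.

```python
# Hypothetical partial adjacency map for an Arabic keyboard layout (assumption).
KEY_NEIGHBOURS = {
    "م": {"ن", "ك"},
    "ك": {"م", "ط"},
    "ر": {"ؤ"},
    "ؤ": {"ر", "ء"},
}


def letter_location(a, b):
    """(#equal letters + #keyboard-adjacent letters) / max(len(A), len(B))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    equal = sum(1 for x, y in zip(a, b) if x == y)
    related = sum(
        1
        for x, y in zip(a, b)
        if x != y and (y in KEY_NEIGHBOURS.get(x, ()) or x in KEY_NEIGHBOURS.get(y, ()))
    )
    return (equal + related) / longest
```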
SPELLING VARIABLES (Cont …) • V5: Soundex Function The soundex function works much like the shape similarity function, but measures the similarity in sound between two strings (ذ،ظ) (س،ص) (د،ض) or even perceived as related: (ذ،ز،ظ). Soundex (A, B) = (#relatedLetters / max (length A, length B)) Example: الصف , السف will have a value of 1 , which means that both are fully related since س , ص are in the same group.
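A minimal sketch of V5 that maps every letter to a representative of its sound group and then compares the strings position by position. The groups are the ones named on the slide, with ذ، ز، ظ merged into one group; this is not a complete table, and the function names are hypothetical.

```python
# Sound groups taken from the slide; the merged group (ذ، ز، ظ) is used here.
SOUND_GROUPS = ["ذزظ", "سص", "دض"]
REPRESENTATIVE = {ch: group[0] for group in SOUND_GROUPS for ch in group}


def sound_key(text):
    """Replace each letter by the representative of its sound group."""
    return [REPRESENTATIVE.get(ch, ch) for ch in text]


def soundex_similarity(a, b):
    """#related letters / max(len(A), len(B)); equal letters count as related."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    related = sum(1 for x, y in zip(sound_key(a), sound_key(b)) if x == y)
    return related / longest
```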
SPELLING VARIABLES (Cont …) • V6: String Ranking and Frequency (R&F) • String ranking refers to the position of a string in a dictionary according to frequency. • Frequency refers to the number of appearances of a string over the sum of all frequencies in the candidate match list. R&F(s) = [Rank(s)/TotalRank] * [Freq(s)/TotalDicFreq] Example: a dictionary consisting of six strings and their dictionary frequencies {A:100, B:75, C:75, D:30, E:28, F:10}. String B rank = 2/5 (not 6, because B and C have the same frequency and share one rank). String B frequency = 75/(100+75+75+30+28+10)
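A minimal sketch of V6, assuming that strings with equal frequency share a rank (dense ranking), as in the example where B and C both get rank 2 out of 5. The function name and the dictionary layout are hypothetical.

```python
def rank_and_frequency(s, freqs):
    """R&F(s) = [Rank(s)/TotalRank] * [Freq(s)/TotalDicFreq].

    `freqs` is assumed to map each string to its dictionary frequency.
    """
    distinct = sorted(set(freqs.values()), reverse=True)  # one rank per distinct frequency
    rank = distinct.index(freqs[s]) + 1
    total_rank = len(distinct)
    total_freq = sum(freqs.values())
    return (rank / total_rank) * (freqs[s] / total_freq)
```

For the six-string dictionary of the example, string B gets (2/5) * (75/318).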
CATEGORIZING METHOD: Percentage and Difference Categorization Algorithm • The algorithm uses the relation between the ratios of words in the input text and their counterparts in the reference text (Wikipedia categories) to decide on the category of the input text. • This is done as follows: • Calculate the percentage of each word A in the input (word frequency / total words). • Compare the input percentage of A with the percentage of A in each category (if it exists) and find the difference between the two values. • Assign A to the category with the smallest difference. • The category assigned most often is considered the best match for the entire input text.
CATEGORIZING METHOD: Percentage and Difference Categorization Algorithm (Cont …) Example: Word A has an input frequency of 7 and the input text size is 300 words. The percentage of A in the input is 7/300 = 0.023333. Next, the percentage of A is calculated in each categorized dictionary (if A exists in it): the frequency of A in dictionary X is 500 and the total size of X is 10000, so A in X has a percentage of 500/10000 = 0.05. The relation between A in X and A in the input text is the absolute difference |0.023333 – 0.05| = 0.026667. This is done for all dictionaries, and the dictionary (category) with the minimum difference is assigned to A.
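The percentage and difference steps can be sketched as follows. This is a minimal sketch under stated assumptions: the `categorize` name, the `{category: {word: frequency}}` layout, taking a dictionary's size as the sum of its frequencies, and giving each distinct input word one vote are all choices made for illustration.

```python
from collections import Counter


def categorize(words, category_dicts):
    """Return the category assigned to the most input words, or None.

    `category_dicts` is assumed to look like {category: {word: frequency}}.
    """
    counts = Counter(words)
    total = len(words)
    sizes = {cat: sum(d.values()) for cat, d in category_dicts.items()}
    votes = Counter()
    for word, n in counts.items():
        share = n / total  # percentage of the word in the input
        best, best_diff = None, None
        for cat, d in category_dicts.items():
            if word not in d or sizes[cat] == 0:
                continue  # the word does not exist in this category
            diff = abs(share - d[word] / sizes[cat])
            if best_diff is None or diff < best_diff:
                best, best_diff = cat, diff
        if best is not None:
            votes[best] += 1  # assumption: one vote per distinct word
    return votes.most_common(1)[0][0] if votes else None
```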
CATEGORIZING METHOD Testing the Percentage and Difference Categorization Algorithm The testing was done on a sample of 380 files, distributed among different categories.
SPELLING ALGORITHM Our spelling algorithm depends on: • Single, double and triple dimensionality general and categorized dictionaries. • Spelling Variables: • V1: Levenshtein distance and • V2: Letter Pair Similarity . • Ranking system based on four parameters: • V3: Shape Similarity • V4: Location Parameter • V5: Soundex Function • V6: String Ranking and Frequency (R&F)
SPELLING ALGORITHM (Cont …) Why Single, Double and Triple Dictionaries? • Take the following misspelling as an example: الوظن الغربي • The word الوظن is wrong and the word الغربي is correct. • If each word is spelled in isolation, the word الوظن will be corrected to either الوطن or maybe الوزن • When الوظن الغربي is considered as a single block, the output will be الوطن العربي; that is, both entered words are treated as misspelled. • The spelling system follows the user input while typing; for word (w) in a text, it checks (when possible): • {w || w, w-1 || w, w-1, w-2} • {w || w, w+1 || w, w+1, w+2} • This is done to make sure that the right dictionary is used (single, double or triple).
SPELLING ALGORITHM (Cont …) The spelling algorithm works as follows: • Select the dimensionality of the dictionary (single, double or triple) and its type (general or dynamic/categorized). • Calculate the Levenshtein distance (V1) and Letter Pair Similarity (V2) between each expression in the input text and each expression in the dictionary. • Assign each expression a value W = V1*V2. • Select the N (where N = 20) words with the highest W values for the next phase.
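The backward and forward checks around a word can be sketched as below. This is a minimal sketch: the `context_expressions` name and the choice to return the expressions as space-joined strings in reading order are assumptions made for illustration.

```python
def context_expressions(words, i):
    """Build the single, double and triple expressions around words[i].

    Returns (backward, forward): expressions ending at the word and
    expressions starting at it, skipping those that fall outside the text.
    """
    backward = [" ".join(words[i - k:i + 1]) for k in range(3) if i - k >= 0]
    forward = [" ".join(words[i:i + k + 1]) for k in range(3) if i + k < len(words)]
    return backward, forward
```

Each expression would then be looked up in the dictionary of the matching dimensionality (single, double or triple).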
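The candidate selection phase can be sketched as follows. This is a minimal sketch: `select_candidates` is a hypothetical name, and the two scoring functions are passed in as parameters `v1` and `v2` so that any implementation of the Levenshtein score and Letter Pair Similarity can be plugged in.

```python
import heapq


def select_candidates(expression, dictionary, v1, v2, n=20):
    """Return the n dictionary entries with the highest W = V1 * V2.

    `v1` and `v2` are callables taking (input_expression, dictionary_entry).
    The result is a list of (W, entry) tuples, best first.
    """
    scored = ((v1(expression, entry) * v2(expression, entry), entry)
              for entry in dictionary)
    return heapq.nlargest(n, scored)
```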
SPELLING ALGORITHM (Cont …) • The selected words are ranked based on the ranking variables using the following equation for word Wi : V (Wi) = A*V3 + B*V4 + C*V5+ D*V6 • V3: Shape Similarity • V4: Location Parameter • V5: Soundex Function • V6: String Ranking and Frequency (R&F) • A, B, C, D are weights (percentages with summation of 100% ) • Consider A = 0.5 and B = 0.20 and C= 0.25 and D = 0.05 • The chosen values for the weights are not necessarily the best. They are based on experimentation and thus need more testing to decide the best range (or values) for them.
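The ranking equation can be sketched as a weighted sum with the example weights from this slide as defaults. This is a minimal sketch: the function names and the `(word, (V3, V4, V5, V6))` candidate layout are assumptions made for illustration.

```python
def rank_score(v3, v4, v5, v6, a=0.5, b=0.20, c=0.25, d=0.05):
    """V(Wi) = A*V3 + B*V4 + C*V5 + D*V6 with weights that sum to 1."""
    return a * v3 + b * v4 + c * v5 + d * v6


def rank_candidates(candidates, **weights):
    """Sort (word, (v3, v4, v5, v6)) pairs by descending ranking value."""
    return sorted(candidates,
                  key=lambda item: rank_score(*item[1], **weights),
                  reverse=True)
```

Other weightings can be tried by passing keyword arguments, for example `rank_candidates(cands, a=0.2, b=0.25, c=0.05, d=0.5)`.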
SPELLING ALGORITHM: Samples • The following table shows some sample results from our algorithm compared with Microsoft Speller (2010) and another open source speller “AyaSpell”
SPELLING ALGORITHM: Weights • We also built a technique for customizing the weights (A, B, C and D) in the ranking equation. • The following table explains the process:
SPELLING ALGORITHM: Test Results Initial Tests: • An initial test was done using 100 of the 380 articles used in testing the categorizing algorithm. • The articles were tested twice: • By manually introducing mistakes in each article. • By randomly changing letters to generate errors in each article. • No auto-learning technique is used in the tests. • The tests were made with the weight values A=0.2, B=0.25, C=0.05, D=0.5.
Conclusions • Presented a spellchecking approach supported by categorized dictionaries and customizable ranking, using categorization; an auto-learning spelling mechanism can be added. • Employed a statistical/corpus-based approach, with data obtained from Wikipedia and local newspapers. • Based on corpus statistics we constructed • databases of expressions and their frequencies (single, double and triple word expressions) and • categorized dictionaries based on automated filtering of Wikipedia articles. • The spelling technique modifies earlier work by incorporating new spelling variables and dynamic dictionaries. • Need better data structures to improve performance and allow transparent integration into systems. • Need more testing.
Thank You Questions ?
SPELLING ALGORITHM (Cont …) Complexity and response time results for our algorithms are shown in the following table: (1) Average time was calculated by timing the 380 test files. (2) Average machine: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26 GHz, 2 GB of RAM, OS: Microsoft Windows XP(TM). (3) Average time was calculated by testing 100 misspelled words.
CATEGORIZING METHOD: Percentage and Difference Categorization Algorithm • Skipped. • May talk about this if time permits. • Good results.