Arabic Text Categorization Based on Arabic Wikipedia • Presenter: CHANG, SHIH-JIE • Authors: ADNAN YAHYA and ALI SALHI • 2014, ACM TALIP
Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments
Motivation • Categorization is a challenge due to the correlation between certain subcategories and the overlap between main categories. EX:
Objectives • To solve this, we use the Percentage and Difference Categorization (PDC) algorithm and further adopt two approaches.
CATEGORIZATION CORPORA - Training Data • Related Tags Approach
Testing Data • 10 categories with 40 documents in each category (400 documents in total)
Methodology - PREPROCESSING TECHNIQUES • Root Extraction (RE) • Light Stemming (LS) • Special Expressions Extraction
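To make the Light Stemming step concrete, here is a minimal sketch of affix stripping. It is an illustration only, not the tool used in the paper: the prefix and suffix lists, the `min_len` guard, and the function name `light_stem` are all assumptions chosen for the example.

```python
# Hypothetical affix lists for illustration; real light stemmers use
# carefully tuned lists and extra normalization steps.
PREFIXES = ["وال", "بال", "كال", "فال", "لل", "ال"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=2):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break  # remove only the first matching prefix
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break  # remove only the first matching suffix
    return word


stem_a = light_stem("والكتاب")   # conjunction + article + "book"
stem_b = light_stem("المعلمون")  # article + "teachers"
```

Unlike root extraction, this kind of stemming leaves the word's internal pattern untouched, so the result is a stem rather than a three-letter root.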
Methodology - CATEGORIZATION PROCESS • Categorize the input text in two phases. • Phase one: categorize the text into one of the main categories. • Phase two: further categorize the input text based on subcategories.
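The two phases can be sketched as two successive "pick the best score" steps. This is a toy illustration under stated assumptions: the taxonomy, the keyword sets, and the helper names (`keyword_score`, `categorize_two_phase`) are hypothetical, the scoring is a plain keyword overlap rather than the paper's algorithm, and phase two is assumed to look only at the subcategories of the main category chosen in phase one.

```python
# Hypothetical toy taxonomy: main category -> list of subcategories.
TAXONOMY = {
    "sports": ["football", "tennis"],
    "science": ["physics", "biology"],
}

# Hypothetical keyword sets standing in for real category profiles.
KEYWORDS = {
    "sports": {"match", "goal", "player"},
    "science": {"atom", "cell", "energy"},
    "football": {"goal", "penalty"},
    "tennis": {"serve", "racket"},
    "physics": {"atom", "energy"},
    "biology": {"cell", "gene"},
}


def keyword_score(text, label):
    """Count how many of the label's keywords occur in the text."""
    return len(set(text.lower().split()) & KEYWORDS[label])


def categorize_two_phase(text, taxonomy, score_fn):
    # Phase one: choose the best-scoring main category.
    main = max(taxonomy, key=lambda label: score_fn(text, label))
    # Phase two: choose among that main category's subcategories.
    subs = taxonomy[main]
    sub = max(subs, key=lambda label: score_fn(text, label)) if subs else None
    return main, sub


result = categorize_two_phase("The match ended with a late goal", TAXONOMY, keyword_score)
```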
Methodology - Percentage and Difference Categorization (PDC) Algorithm • EX: … has frequency 7 in the 300-word …
Methodology - Percentage and Difference Categorization (PDC) Algorithm • The category with the highest sum of flag values is considered to be the best match for the input text.
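A rough sketch of the "highest sum of flag values" rule follows. The slides do not state how flags are assigned, so everything about the flag rule is an assumption: a word's percentage in the input text is compared with its percentage in a category profile, and the thresholds `tight` and `loose` and all function names are hypothetical. The example reuses the figure from the slides, where a frequency of 7 in a 300-word text gives 7 / 300 × 100 ≈ 2.33%.

```python
from collections import Counter


def word_percentages(words):
    """Percentage of the text that each distinct word accounts for."""
    counts = Counter(words)
    total = len(words)
    return {word: 100.0 * count / total for word, count in counts.items()}


def flag(diff, tight=0.5, loose=1.5):
    """Assumed flag rule: smaller percentage difference -> larger flag."""
    if diff <= tight:
        return 1.0
    if diff <= loose:
        return 0.5
    return 0.0


def pdc_score(text_pct, category_pct):
    """Sum the flags over words shared by the text and the category profile."""
    return sum(
        flag(abs(pct - category_pct[word]))
        for word, pct in text_pct.items()
        if word in category_pct
    )


def pdc_categorize(words, profiles):
    """Return the category with the highest sum of flag values, plus all scores."""
    text_pct = word_percentages(words)
    scores = {name: pdc_score(text_pct, prof) for name, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy input: one word occurring 7 times in a 300-word text.
words = ["goal"] * 7 + ["filler"] * 293
profiles = {"sports": {"goal": 2.0}, "science": {"atom": 2.0}}  # hypothetical profiles
best, scores = pdc_categorize(words, profiles)
```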
Methodology – Enhancing Main/Subcategories Grouping • Problem: the possible high correlation between subcategories of different main categories. (1) Overlapping Main Categories for Phase Two
Methodology – Enhancing Main/Subcategories Grouping (2) Replacing Main Categories by Groups of Related Categories
Methodology - Word Filtration Techniques within Categories
Modified PDC with N Scales • Flag values: 1, 0.5, 0 • Define a scaling of: 1, 0.75, 0.5, 0.25, 0
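Both sets of flag values on this slide are evenly spaced between 1 and 0, so an N-value scale can be generated as k / (N - 1). The sketch below shows that and one possible way of mapping a percentage difference onto the scale. The mapping is an assumption, not taken from the paper: equal-width buckets up to a hypothetical `max_diff`, with anything at or above `max_diff` getting a flag of 0.

```python
def scale_values(n):
    """N evenly spaced flag values from 1 down to 0."""
    if n < 2:
        raise ValueError("a scale needs at least two values")
    return [k / (n - 1) for k in range(n - 1, -1, -1)]


def n_scale_flag(diff, max_diff, n):
    """Assumed mapping: split [0, max_diff] into n - 1 equal buckets;
    a smaller difference earns a higher flag."""
    step = max_diff / (n - 1)
    bucket = min(int(diff // step), n - 1)
    return scale_values(n)[bucket]


three_scale = scale_values(3)
five_scale = scale_values(5)
```

A finer scale lets near-matches contribute partial credit instead of being rounded to 0 or 1.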
Further Testing on the PDC Algorithm • Tools: Root Extraction • Light Stemming & Light10 • Double Words • Expressions Extraction
Conclusions • The first method uses training and testing data from the same source by splitting the corpus into test and training components. This consistently gives better results. • However, we believe that the second method (different sources) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.
Comments • Advantages • To. • Applications • Arabic Text Categorization.