Automatic Language Identification – A Syntactic Approach Mahesh Soundalgekar CFILT, IIT Bombay
The Road Map
• Introduction
• System Architecture
• Classification Approaches
• Experimental Results
• Summary and Future Work
Introduction
• Goal: efficiently crawl Web pages in a given language (Marathi in our case)
• Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi
• It is therefore necessary to distinguish one language from the others accurately
• We take a syntactic approach to this problem, which has given us excellent results with 2 MB of training data and 10 MB of test data
System Architecture
HTML documents in different encodings such as Xdvng, DV-TTYogesh → HTML to ASCII → Plain Text + Font Information → Appropriate Encoding Converter → Plain Text in ISCII Encoding → Classifier → Classification Results
Classification Approaches
• Most frequently occurring common words
  • e.g. English: the, an, is, at, a, etc.
• N-grams (most frequent character sequences)
  • Bi-grams: th, ’s, re, en
  • Tri-grams: the, ing, ion
  • Quad-grams: tion, as in classification, association, gratification, etc.
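The idea of "most frequent character sequences" can be sketched as counting character n-grams and ranking them by frequency. The following is a minimal Python sketch, assuming whitespace-separated words and no padding; the function name `top_ngrams` and the sample sentence are illustrative and not taken from the talk.

```python
from collections import Counter

def top_ngrams(text, n, k):
    """Return the k most frequent character n-grams found inside the words of text."""
    counts = Counter()
    for word in text.lower().split():
        # Slide a window of width n over each word (no padding in this sketch).
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical sample text built from the words on the slide.
sample = "the classification of the association"
print(top_ngrams(sample, 4, 2))
```

With this sample, the quad-gram "tion" ranks among the most frequent sequences because it occurs in both "classification" and "association".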
Important Factors
• Size of the training data: important for capturing the syntactic essence of a language
• Domains of the training data: usage varies from domain to domain and from author to author
• Size of the test data: small test data may not contain enough information for classification
• The common words approach requires linguistic knowledge
Classifier Architecture
• Training Samples → Generate Profiles → Category Profiles
• Test Document → Generate Profile → Document Profile
• Category Profiles + Document Profile → Measure Profile Distances → Find Minimum Distance → Identify Category
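The flow on this slide (generate profiles, measure distances, pick the minimum) can be sketched as a small driver function. This is only an illustration under stated assumptions: `identify_category` and its two pluggable arguments are hypothetical names, and the profile and distance functions below are deliberately simple stand-ins, not the n-gram profiles or distance measure used in the talk.

```python
from collections import Counter

def identify_category(test_document, training_samples, make_profile, distance):
    """Return the category whose profile is closest to the test document's profile."""
    # Training side: one profile per category.
    category_profiles = {
        category: make_profile(text) for category, text in training_samples.items()
    }
    # Test side: a single document profile.
    document_profile = make_profile(test_document)
    # Measure the distance to every category profile and keep the minimum.
    distances = {
        category: distance(profile, document_profile)
        for category, profile in category_profiles.items()
    }
    return min(distances, key=distances.get)

def char_rank_profile(text):
    """Stand-in profile (assumption): letters ranked by descending frequency."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def missing_items_distance(category_profile, document_profile):
    """Stand-in distance (assumption): number of document items unseen in training."""
    known = set(category_profile)
    return sum(1 for item in document_profile if item not in known)

# Tiny hypothetical training samples, for illustration only.
training = {
    "english": "the quick brown fox jumps over the lazy dog",
    "marathi": "हे पुस्तक चांगले आहे आणि ते मला आवडते",
}
print(identify_category("the dog is lazy", training,
                        char_rank_profile, missing_items_distance))
```

Passing the profile and distance functions as arguments keeps the architecture separate from the choice of features, so a common-words or n-gram profile can be swapped in without changing the driver.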
Common Words Approach
• List of selected common words
• Matched with the test documents
• Closest match will give the language of the document
• Advantages:
  • Intuitive
  • Computationally efficient
  • Space efficient
Top 5 Marathi Common Words
• व
• आणि
• आहे
• या
• ते
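The common words approach can be sketched as counting how many tokens of a document appear in each language's word list and choosing the language with the most hits. The sketch below is a minimal Python illustration; the function names are hypothetical, and both word lists are small illustrative assumptions rather than the lists used in the talk.

```python
# Illustrative word lists (assumptions): five very frequent words per language.
COMMON_WORDS = {
    "marathi": {"व", "आणि", "आहे", "या", "ते"},
    "hindi": {"है", "के", "में", "की", "और"},
}

def count_common_words(text, word_list):
    """Count whitespace-separated tokens of text that appear in word_list."""
    return sum(1 for token in text.split() if token in word_list)

def identify_by_common_words(text, common_words=COMMON_WORDS):
    """Return (best_language, scores), where scores maps language to hit count."""
    scores = {
        language: count_common_words(text, words)
        for language, words in common_words.items()
    }
    # The closest match is the language with the most common-word hits.
    return max(scores, key=scores.get), scores

language, scores = identify_by_common_words("हे पुस्तक चांगले आहे आणि ते मला आवडते")
print(language, scores)
```

This sketch does no punctuation stripping and resolves ties arbitrarily, which hints at why short test documents are harder to classify with this technique.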
N-Grams Approach
• JAVA
  • Bi-grams: _J, JA, AV, VA, A_
  • Tri-grams: _JA, JAV, AVA, VA_, A__
  • Quad-grams: _JAV, JAVA, AVA_, VA__, A___
• मदत
  • Bi-grams: _म, मद, दत, त_
  • Tri-grams: _मद, मदत, दत_, त__
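The JAVA example suggests a padding convention of one leading underscore and n-1 trailing underscores, so that every word of length L yields L+1 n-grams. A minimal Python sketch under that assumption follows; the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(word, n, pad="_"):
    """Return the padded character n-grams of a single word.

    Padding convention (inferred from the JAVA example): one pad character
    before the word and n-1 pad characters after it.
    """
    padded = pad + word + pad * (n - 1)
    # Slide a window of width n over the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (2, 3, 4):
    print(n, word_ngrams("JAVA", n))
```

The sketch slides over Python string characters, so for Devanagari text it works on individual code points unless the word is first split into larger units.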
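Measuring Distances
Out_of_Place(): both profiles are sorted in descending order of frequency, and each n-gram in the test profile is scored by how far its rank is from its rank in the category profile.
Rank   Category profile   Test profile   Out-of-place value
1      A                  AR             max_value
2      ER                 AND            2
3      ING                ER             1
4      AND                ED             max_value
5      ON                 ON             0
Distance = 3 + 2 * max_value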
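The out-of-place measure from this slide can be sketched in a few lines of Python. This is a minimal sketch: the function name `out_of_place` is hypothetical, and the value chosen for `max_value` is an assumption, since the slide does not state what value was used.

```python
def out_of_place(category_profile, test_profile, max_value):
    """Sum of rank differences between a test profile and a category profile.

    Both profiles are lists of n-grams sorted in descending order of frequency.
    An n-gram missing from the category profile contributes max_value.
    """
    category_rank = {ngram: rank for rank, ngram in enumerate(category_profile)}
    total = 0
    for test_rank, ngram in enumerate(test_profile):
        if ngram in category_rank:
            total += abs(category_rank[ngram] - test_rank)
        else:
            total += max_value
    return total

category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]
MAX_VALUE = 5  # assumption: the profile length is used as the penalty
print(out_of_place(category, test, MAX_VALUE))  # 3 + 2 * MAX_VALUE = 13
```

AR and ED are absent from the category profile and each cost `max_value`, while AND, ER and ON contribute 2, 1 and 0, which reproduces the slide's total of 3 + 2 * max_value.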
Extensions to N-Grams Method
• Lowest Granularity
  • आदित्य = अ + ा + ि + द + त + य
• Letter Granularity
  • आदित्य = आ + दि + त + य
• Conjunct Granularity
  • आदित्य = आ + दि + त्य
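The three granularities can be approximated on Unicode text as follows. This is a sketch under clear assumptions: it works on Unicode Devanagari rather than the legacy font or ISCII encodings used in the talk, so the lowest level is the code point and the exact units differ from the glyph-level split on the slide. The function names and the example word आदित्य are illustrative.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant), which joins consonants

def code_point_units(word):
    """Lowest granularity in this sketch: one unit per Unicode code point."""
    return list(word)

def letter_units(word):
    """Letter granularity: attach each combining mark to the preceding base character."""
    units = []
    for ch in word:
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def conjunct_units(word):
    """Conjunct granularity: merge a unit ending in virama with the unit that follows."""
    units = []
    for unit in letter_units(word):
        if units and units[-1].endswith(VIRAMA):
            units[-1] += unit
        else:
            units.append(unit)
    return units

WORD = "\u0906\u0926\u093f\u0924\u094d\u092f"  # आदित्य
print(code_point_units(WORD))
print(letter_units(WORD))
print(conjunct_units(WORD))
```

N-grams can then be built over these unit lists instead of over raw characters, which is one way to read the extension described on the slide.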
Experimental Training Setup
Category Profiles Generated through Training
Classification Results
Summary and Future Work
• Good results have been obtained through syntactic classification
• The common words technique is computationally the most efficient, but less accurate
• Our extensions to N-grams give the desired accuracy
• The N-grams technique is robust to syntax errors
• The N-grams technique does not require linguistic knowledge
• We will use language identification techniques to identify a good starting set of pages for crawling for a general-purpose search engine