Rubryx: Document Classification Technology
Authors: V.N. Polyakov, V.V. Sinitsin
State of the Art
• The classification task is part of the information retrieval (IR) task
• There are some successful solutions
• There are benchmarks (the most popular is the Reuters-21578 text categorization test collection)
• The best reported values of the F1 measure range from 0.753 to 0.92 (Sebastiani, 1999)
• Existing machine learning technologies are not low-cost (a large volume of manual work is needed)
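For readers unfamiliar with the F1 measure quoted above, the following is a minimal sketch of how it is usually computed from classification counts. The function name and the count-based interface are illustrative assumptions and are not taken from Rubryx or from Sebastiani (1999).

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (the standard F1 definition).

    Illustrative helper, not part of Rubryx. Returns 0.0 when there are
    no true positives, which avoids division by zero.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

A value of 1 means that every assigned category is correct and no correct category is missed.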
Rubryx Technology
• General Features
• Method Description
• Formal Task Description
• Machine Learning Technology
• Dictionary Development Technology
• Examples Selection Technology
• Test Results and New Heuristics
• Applications and Tools
General Features
Rubryx can be characterized as follows:
• is based on a controlled dictionary;
• uses collocations in ranking texts;
• uses machine learning technology;
• uses hard classification;
• uses multi-label text categorization;
• uses both category-pivoted and document-pivoted text categorization.
One more characteristic feature of the program can be added to the list, which has not been widely used so far but is highly promising: the lexical-meaning-based approach.
Method Description
1. Compile a directory and a general thematic dictionary.
2. For every rubric, an expert selects sample texts for the category (five documents).
3. Generate a micro-dictionary in a special format for each category (rubric), based on the frequency of occurrence of terms from the general dictionary in the sample texts. Set a threshold for every rubric.
4. Carry out a complete classification under the category.
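The final step can be sketched as follows. This is a minimal illustration under stated assumptions, not the Rubryx implementation: the slides do not give the scoring formula, so the weighted sum used here, the tokenizer, and the names `count_terms`, `score` and `classify` are all hypothetical.

```python
import re


def count_terms(text, terms):
    """Occurrences per 1000 words of each one- to three-word term in text."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenizer
    found = {}
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            term = " ".join(tokens[i:i + n])
            if term in terms:
                found[term] = found.get(term, 0) + 1
    scale = 1000.0 / max(len(tokens), 1)
    return {term: count * scale for term, count in found.items()}


def score(text, micro_dictionary):
    """Assumed scoring rule: sum of term weight times normalized frequency."""
    freqs = count_terms(text, micro_dictionary)
    return sum(micro_dictionary[term] * f for term, f in freqs.items())


def classify(text, rubrics):
    """rubrics maps a rubric name to (micro_dictionary, threshold).

    Hard, multi-label decision: a document receives every rubric whose
    score reaches that rubric's threshold, and no others.
    """
    return [name for name, (micro, threshold) in rubrics.items()
            if score(text, micro) >= threshold]
```

Because each rubric is tested against its own threshold, a document may receive several rubrics or none, which matches the hard, multi-label setting described in the general features.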
Machine Learning Technology
1. Compile a directory.
2. For every rubric, an expert selects sample texts for the category (five documents).
3. Generate a micro-dictionary.
4. Set a threshold for every rubric.
5. After these four steps, Rubryx is ready for use.
Dictionary Development Technology
1. We use an electronic terminological dictionary for the whole directory in special formats: three files, for one-word, two-word and three-word terms respectively.
2. For every sample, we determine the list of terms in the format used, together with their frequency of occurrence.
3. A term is placed in the micro-dictionary if it occurs in at least M samples.
4. The final micro-dictionary can be corrected by an expert.
Remarks:
1. Using collocations gives us lexical meaning disambiguation.
2. Frequencies are normalized to a text size of 1000 words.
3. Usually M = 2.
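The steps above can be sketched in a few lines. This is an illustration under assumptions rather than the Rubryx file format or code: the general dictionary is represented as a plain set of strings, the tokenizer is a simple regular expression, and storing the mean normalized frequency as the term weight is an assumed choice that the slides do not specify. All function names are hypothetical.

```python
import re
from collections import Counter


def term_frequencies(text, dictionary, max_len=3):
    """Count dictionary terms of 1..max_len words, per 1000 words of text."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenizer
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            term = " ".join(tokens[i:i + n])
            if term in dictionary:
                counts[term] += 1
    scale = 1000.0 / max(len(tokens), 1)
    return {term: count * scale for term, count in counts.items()}


def build_micro_dictionary(samples, dictionary, m=2):
    """Keep terms found in at least m samples.

    The stored weight is the mean normalized frequency over all samples
    (an assumption made for this sketch).
    """
    per_sample = [term_frequencies(s, dictionary) for s in samples]
    sample_count = Counter(term for freqs in per_sample for term in freqs)
    return {
        term: sum(freqs.get(term, 0.0) for freqs in per_sample) / len(samples)
        for term, k in sample_count.items()
        if k >= m
    }
```

Note that in this simplified version a one-word term is also counted when it appears inside a longer collocation; a fuller implementation might prefer the longest match.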
Examples Selection Technology
Samples are the documents most relevant to each rubric.
1. Samples are selected by an expert.
2. Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by conventional machine learning technologies.
3. Machine learning in Rubryx also depends on the expert's qualifications, but requires less manual work.
Preliminary Results of Rubryx Testing on the Reuters-21578 text categorization test collection
• F1 = 0.85 on the “places” and “topic” categories
• F1 = 1 on the “exchanges” category
• The “people” and “org” categories require new dictionaries of proper names to be developed
• Some new heuristics were devised to improve results in the “places” and “topic” categories: taking into account the position of terms in a clause, the grouping of terms in the text, and proper names
Summary of Advantages and Know-How
• Lexical-meaning-based approach
• Using collocations gives us lexical meaning disambiguation
• We use an electronic terminological dictionary and micro-dictionaries in special formats: three files, for one-word, two-word and three-word terms respectively
• Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by conventional machine learning technologies
• Classification quality comparable to other methods, with low-cost machine learning
Applications and Tools
• Rubryx – text classification program (versions 1 and 2, see www.sowsoft.com/rubryx)
• DicTools – utility for dictionary development
• Spider – application program for text collection from the Internet with preliminary classification
• Dictionaries
Rubryx – text classification program
Status: Completed application
DicTools – utility for dictionary development
Status: Completed application
Spider – application program for text collection from the Internet with preliminary classification
Starting from a given web address, the application collects all pages relevant to the rubric of interest.
1. We input a category and a starting URL.
2. Spider recursively follows all links and loads pages. All pages are classified, and uninteresting link paths are cut.
3. As a result, we obtain a substantial saving of traffic and time.
Status: Evaluation and testing
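The crawl-and-prune idea can be sketched as follows. This is a minimal illustration, not the Spider application itself: `fetch` and `is_relevant` are hypothetical callables supplied by the caller (page download and the per-rubric classifier), `max_pages` is an assumed safety limit, and the traversal uses a queue instead of literal recursion.

```python
from collections import deque


def crawl(start_url, fetch, is_relevant, max_pages=100):
    """Collect URLs of relevant pages reachable from start_url.

    fetch(url) must return (page_text, list_of_links); is_relevant(text)
    stands in for the classifier. Links are followed only from pages
    judged relevant, so uninteresting link paths are cut early.
    """
    seen = {start_url}
    queue = deque([start_url])
    relevant = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        fetched += 1
        if not is_relevant(text):
            continue  # prune: do not follow links from an irrelevant page
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant
```

The traffic saving comes from the pruning step: pages that are reachable only through irrelevant pages are never downloaded. A page that could be reached through another, relevant path is still visited.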
English Dictionaries
• Natural Language Processing (7775 terms)
• Geography (5941 terms)
• Metallurgy (4946 terms)
• Polytechnical (37488 terms)
• Economics (1806 terms)
• Names of market exchanges (69080 terms)
Publications
• V.N. Polyakov, V.V. Sinitsin. “Method Automatic Classification of Web-resource by Patterns”. In: Text Processing and Cognitive Technologies. Paper Collection. Issue 6. Edited by V.D. Solovyev, V.N. Polyakov. Kazan, Otechestvo, 2001, pp. 120-126. (Article in Russian with abstract in English)
• V.N. Polyakov, V.V. Sinitsin. “Rubryx: Technology of Text Classification Using Lexical Meaning Based Approach”. In: Proc. of International Conference Speech and Computer, SPECOM-2003. Moscow, MSLU, 2003, pp. 137-143.
Contact Information
Vladimir N. Polyakov, Moscow State Linguistic University, vladimir_polyakov@yahoo.com
Vladimir V. Sinitsyn, Moscow State Steel and Alloys Institute (Technological University), sowsoft@land.ru
Rubryx home pages (shareware): www.sowsoft.com/rubryx/, www.rubryx.narod.ru