Rubryx



Presentation Transcript


  1. Rubryx: Document Classification Technology. Authors: V.N. Polyakov, V.V. Sinitsin

  2. State of the Art
     • The classification task is part of the information retrieval (IR) task.
     • Some successful solutions exist.
     • Benchmarks exist; the most popular is the Reuters-21578 text categorization test collection.
     • The best reported values of the F1 measure range from 0.753 to 0.92 (Sebastiani, 1999).
     • Existing machine learning technologies are not low-cost: they require a large volume of manual work.
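     For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name and the counts are illustrative assumptions, not figures from the Rubryx tests.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct assignments: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one category: 85 correct assignments,
# 15 wrong assignments, 15 missed documents.
print(round(f1_score(85, 15, 15), 2))  # prints 0.85
```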

  3. Rubryx Technology
     • General Features
     • Method Description
     • Formal Task Description
     • Machine Learning Technology
     • Dictionary Development Technology
     • Examples Selection Technology
     • Test Results and New Heuristics
     • Applications and Tools

  4. General Features
     Rubryx can be characterized as follows:
     • it is based on a controlled dictionary;
     • it uses collocations in ranking texts;
     • it uses machine learning technology;
     • it uses hard classification;
     • it uses multi-label text categorization;
     • it uses both category-pivoted and document-pivoted text categorization.
     One more characteristic feature of the program can be added to the list: a lexical-meaning-based approach, which has not yet been widely used but is highly promising.
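     The terms hard, multi-label, category-pivoted and document-pivoted can be illustrated with a small sketch. It assumes that each document already has a numeric score per rubric and that each rubric has its own threshold; the score values, rubric names and function names are hypothetical and do not come from Rubryx itself.

```python
def rubrics_for_document(doc_id, scores, thresholds):
    """Document-pivoted: given one document, list every rubric it belongs to."""
    return [rubric for rubric, value in scores[doc_id].items()
            if value >= thresholds[rubric]]


def documents_for_rubric(rubric, scores, thresholds):
    """Category-pivoted: given one rubric, list every document that belongs to it."""
    return [doc_id for doc_id, row in scores.items()
            if row[rubric] >= thresholds[rubric]]


# Hypothetical scores and per-rubric thresholds.
scores = {
    "doc1": {"metallurgy": 310.0, "economics": 120.0},
    "doc2": {"metallurgy": 0.0, "economics": 95.0},
    "doc3": {"metallurgy": 40.0, "economics": 10.0},
}
thresholds = {"metallurgy": 100.0, "economics": 80.0}

print(rubrics_for_document("doc1", scores, thresholds))       # two rubrics: multi-label
print(rubrics_for_document("doc3", scores, thresholds))       # no rubric at all
print(documents_for_rubric("economics", scores, thresholds))
```

     Each decision is a yes/no answer (hard classification) rather than a ranked list, and a document may receive several rubrics or none (multi-label).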

  5. Method Description
     1. Compile a directory and a general thematic dictionary.
     2. For every rubric, an expert selects sample texts for the category (five documents).
     3. Generate a micro-dictionary in a special format for the category (rubric), based on the frequency of occurrence of terms from the general dictionary in the sample texts. Set a threshold for every rubric.
     4. Carry out the complete classification for the category.
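     The final classification step can be sketched as follows. The slides do not give the scoring formula, so this sketch assumes the score is the sum of the weights of matched micro-dictionary terms, normalized to 1000 words, and that a text belongs to the rubric when the score reaches the threshold. The weights, threshold and function names are illustrative assumptions.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(text, micro_dictionary, max_len=3):
    """Sum the weights of matched one- to three-word terms, per 1000 words.

    The scoring rule is an assumption; the slides do not specify it.
    """
    words = tokenize(text)
    total = 0.0
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            total += micro_dictionary.get(" ".join(words[i:i + n]), 0.0)
    return total * 1000 / max(len(words), 1)


def belongs_to_rubric(text, micro_dictionary, threshold):
    """Hard decision: the text is in the rubric if its score reaches the threshold."""
    return score(text, micro_dictionary) >= threshold


# Hypothetical micro-dictionary (term -> weight) and threshold for one rubric.
metallurgy = {"steel": 1.0, "blast furnace": 2.0}
threshold = 100.0

doc_a = "The blast furnace at the plant produces steel for export."
doc_b = "The stock exchange closed higher on Friday."
print(belongs_to_rubric(doc_a, metallurgy, threshold))  # True
print(belongs_to_rubric(doc_b, metallurgy, threshold))  # False
```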

  6. Formal Task Description

  7. Machine Learning Technology
     1. Compile a directory.
     2. For every rubric, an expert selects sample texts for the category (five documents).
     3. Generate a micro-dictionary.
     4. Set a threshold for every rubric.
     5. After these four steps, Rubryx is ready for use.

  8. Dictionary Development Technology
     1. We use an electronic terminological dictionary for the whole directory in a special format: three files, for one-word, two-word and three-word terms respectively.
     2. For every sample, we determine the list of terms in this format together with their frequency of occurrence.
     3. A term is placed in the micro-dictionary if it occurs in at least M samples.
     4. The final micro-dictionary can be corrected by an expert.
     Remarks:
     1. Using collocations provides lexical meaning disambiguation.
     2. Frequencies are normalized to a text size of 1000 words.
     3. Usually M = 2.
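     A minimal sketch of this procedure is given below. It counts one- to three-word dictionary terms in each sample, normalizes the counts to 1000 words, and keeps the terms found in at least M samples. Storing the average normalized frequency as the term weight is an assumption, as are the function names and the toy data; the actual Rubryx file formats are not reproduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text, dictionary, max_len=3):
    """Count dictionary terms of 1..max_len words, normalized to 1000 words."""
    words = tokenize(text)
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            term = " ".join(words[i:i + n])
            if term in dictionary:
                counts[term] += 1
    scale = 1000 / max(len(words), 1)
    return {term: count * scale for term, count in counts.items()}


def build_micro_dictionary(samples, dictionary, min_samples=2):
    """Keep terms that occur in at least min_samples (M) sample texts.

    The weight is the average normalized frequency over all samples,
    which is an assumption made for this sketch.
    """
    per_sample = [term_frequencies(text, dictionary) for text in samples]
    sample_counts = Counter(term for freqs in per_sample for term in freqs)
    return {
        term: sum(freqs.get(term, 0.0) for freqs in per_sample) / len(samples)
        for term, n_samples in sample_counts.items()
        if n_samples >= min_samples
    }


# Hypothetical general dictionary and sample texts for one rubric.
dictionary = {"steel", "blast furnace", "stock exchange"}
samples = [
    "The blast furnace produces steel.",
    "Steel output rose as the blast furnace restarted.",
    "The stock exchange closed higher.",
]
micro = build_micro_dictionary(samples, dictionary, min_samples=2)
print(sorted(micro))  # ['blast furnace', 'steel']
```

     In this toy run "stock exchange" appears in only one sample, so with M = 2 it is left out of the micro-dictionary.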

  9. Examples Selection Technology
     Samples are the documents most relevant to each rubric.
     1. Samples are selected by an expert.
     2. Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
     3. Machine learning in Rubryx also depends on the expert's qualification, but requires less manual work.

  10. Preliminary Results of Rubryx Testing on the Reuters-21578 text categorization test collection
     • F1 = 0.85 on the “places” and “topic” categories.
     • F1 = 1 on the “exchanges” category.
     • The “people” and “org” categories require new dictionaries of proper names to be developed.
     • Some new heuristics were devised to improve results in the “places” and “topic” categories: taking into account the position of terms in a clause, the grouping of terms in the text, and proper names.

  11. Summary of Advantages and Know-how
     • Lexical-meaning-based approach.
     • Using collocations provides lexical meaning disambiguation.
     • We use an electronic terminological dictionary and micro-dictionaries in special formats: three files, for one-word, two-word and three-word terms respectively.
     • Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
     • Classification quality is comparable, while the machine learning is low-cost.

  12. Applications and Tools
     • Rubryx – text classification program (versions 1 and 2; see www.sowsoft.com/rubryx)
     • DicTools – utility for dictionary development
     • Spider – application program for text collection from the Internet with preliminary classification
     • Dictionaries

  13. Rubryx – text classification program Status: Completed application

  14. DicTools – utility for dictionary development Status: Completed application

  15. Spider – application program for text collection from the Internet with preliminary classification
     Starting from a given web address, the application collects all pages relevant to the rubric of interest.
     1. The user enters a category and a starting URL.
     2. Spider recursively follows all links and loads the pages. Every page is classified, and link paths of no interest are cut.
     3. As a result, traffic and time are saved substantially.
     Status: Evaluation and testing
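     The pruning idea can be sketched as a crawl that follows links only from pages classified as relevant. This is not the Spider implementation: the `fetch` and `is_relevant` callbacks, the toy site and the keyword test are assumptions, and the sketch uses a queue instead of recursion.

```python
from collections import deque


def crawl(start_url, fetch, is_relevant):
    """Collect relevant pages reachable from start_url, pruning irrelevant branches.

    fetch(url) must return (page_text, list_of_links); is_relevant(text)
    stands in for the classifier. Both are hypothetical callbacks.
    """
    seen = {start_url}
    queue = deque([start_url])
    relevant = []
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if not is_relevant(text):
            continue  # cut this link path: its outgoing links are never loaded
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant


# Toy in-memory "site": url -> (page text, outgoing links).
SITE = {
    "/": ("steel news index", ["/steel", "/sports"]),
    "/steel": ("steel output rose", ["/furnace"]),
    "/sports": ("football results", ["/hidden-steel"]),
    "/furnace": ("blast furnace steel report", ["/"]),
    "/hidden-steel": ("steel prices", []),
}

result = crawl("/", SITE.__getitem__, lambda text: "steel" in text)
print(result)  # ['/', '/steel', '/furnace']
```

     The page "/hidden-steel" is never loaded because it is reachable only through the irrelevant page "/sports"; this is the source of the traffic savings, at the cost of missing relevant pages behind irrelevant ones.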

  16. English Dictionaries
     • Natural Language Processing (7775 terms)
     • Geography (5941 terms)
     • Metallurgy (4946 terms)
     • Polytechnical (37488 terms)
     • Economics (1806 terms)
     • Names of market exchanges (69080 terms)

  17. Publications
     • V.N. Polyakov, V.V. Sinitsin. “Method Automatic Classification of Web-resource by Patterns.” In: Text Processing and Cognitive Technologies. Paper Collection. Issue 6. Edited by V.D. Solovyev, V.N. Polyakov. Kazan, Otechestvo, 2001, pp. 120-126. (Article in Russian with abstract in English.)
     • V.N. Polyakov, V.V. Sinitsin. “Rubryx: Technology of Text Classification Using Lexical Meaning Based Approach.” In: Proc. of International Conference Speech and Computer, SPECOM-2003. Moscow, MSLU, 2003, pp. 137-143.

  18. Contact Information
     Vladimir N. Polyakov, Moscow State Linguistic University, vladimir_polyakov@yahoo.com
     Vladimir V. Sinitsyn, Moscow State Steel and Alloys Institute (Technological University), sowsoft@land.ru
     Rubryx home pages (shareware): www.sowsoft.com/rubryx/ and www.rubryx.narod.ru
