Document Categorization

Document Categorization • Problem: given • a collection of documents, and • a taxonomy of subject areas • Classification: Determine the subject area(s) most pertinent to each document • Indexing: Select a set of keywords / index terms appropriate to each document

Classification Techniques • Manual (a.k.a. Knowledge Engineering) • typically, rule-based expert systems • Machine Learning • Probabilistic (e.g., Naïve Bayesian) • Decision Structures (e.g., Decision Trees) • Profile-Based • compare document to profile(s) of subject classes • similarity rules similar to those employed in I.R. • Support Vector Machines (SVM)
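As an illustration of the profile-based approach listed above, here is a minimal sketch. It assumes plain term-frequency vectors and cosine similarity as the I.R.-style similarity rule; the function names and toy documents are hypothetical and not taken from the presentation.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profiles(labeled_docs):
    """Sum the term vectors of each class's documents into one profile per class."""
    profiles = {}
    for label, text in labeled_docs:
        profiles.setdefault(label, Counter()).update(term_vector(text))
    return profiles

def classify(text, profiles):
    """Return the class whose profile is most similar to the document."""
    vec = term_vector(text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))

# Toy training data (hypothetical), using two subject fields as labels.
training = [
    ("Aviation Technology", "helicopter rotor aircraft wing"),
    ("Aviation Technology", "aircraft fighter bomber"),
    ("Chemistry", "acid reaction molecule"),
    ("Chemistry", "molecule polymer reaction catalyst"),
]
profiles = build_profiles(training)
predicted = classify("fighter aircraft wing design", profiles)
print(predicted)
```

A production system would typically weight terms (for example with TF-IDF) and remove stop words before building the profiles.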

Machine Learning Procedures • Usually train-and-test • Exploit an existing collection in which documents have already been classified • a portion used as the training set • another portion used as a test set • permits measurement of classifier effectiveness • allows tuning of classifier parameters to yield maximum effectiveness • Single- vs. multi-label • can 1 document be assigned to multiple categories?
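The train-and-test procedure above can be sketched as follows. This is only an illustration: the split fraction, the seed, and the toy parity "classifier" are assumptions made for the example, not details from the presentation.

```python
import random

def split_train_test(labeled_docs, train_fraction=0.75, seed=0):
    """Shuffle an already-classified collection and split it into two portions."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = round(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(classifier, test_set):
    """Fraction of test documents whose predicted label matches the known label."""
    correct = sum(1 for doc, label in test_set if classifier(doc) == label)
    return correct / len(test_set)

# Hypothetical collection of (document, label) pairs.
collection = [(f"doc{i}", "even" if i % 2 == 0 else "odd") for i in range(20)]
train_set, test_set = split_train_test(collection, train_fraction=0.75)

def parity_classifier(doc):
    """Stand-in for a trained classifier; a real one would be fit on train_set."""
    return "even" if int(doc[3:]) % 2 == 0 else "odd"

score = accuracy(parity_classifier, test_set)
print(len(train_set), len(test_set), score)
```

Tuning classifier parameters then amounts to repeating the measurement for each candidate setting and keeping the one with the best score, ideally on data held out from the final test set.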

Automatic Indexing • Assign to each document up to k terms drawn from a controlled vocabulary • Typically reduced to a multi-label classification problem • each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
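One way to picture the reduction to multi-label classification: run one binary scorer per controlled-vocabulary keyword and keep the best-scoring accepted keywords, up to k. In the sketch below the cue-word scorers are hypothetical stand-ins for trained per-keyword classifiers, and the threshold of 0 is an assumption.

```python
def index_document(doc, keyword_scorers, k=3, threshold=0.0):
    """Assign up to k keywords whose binary scorer accepts the document."""
    scored = [(scorer(doc), keyword) for keyword, scorer in keyword_scorers.items()]
    accepted = [(s, keyword) for s, keyword in scored if s > threshold]
    accepted.sort(reverse=True)  # highest score first
    return [keyword for _, keyword in accepted[:k]]

def cue_scorer(cues):
    """Hypothetical scorer: counts cue words, offset so zero hits scores negative."""
    def score(doc):
        words = doc.lower().split()
        return sum(words.count(c) for c in cues) - 0.5
    return score

keyword_scorers = {
    "Helicopters": cue_scorer(["helicopter", "rotor"]),
    "Bombers": cue_scorer(["bomber"]),
    "Aerodynamics": cue_scorer(["lift", "drag", "airflow"]),
    "Chemistry": cue_scorer(["acid", "molecule"]),
}

terms = index_document(
    "rotor lift and drag on a helicopter rotor in forward flight",
    keyword_scorers,
    k=2,
)
print(terms)
```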

Case Study: SVM categorization • Document Collection from DTIC • 10,000 documents • previously classified manually • Taxonomy of • 25 broad subject fields, divided into a total of • 251 narrower groups • Document lengths average 2705 ± 1464 words, 623 ± 274 significant unique terms • Collection has 32,457 significant unique terms

Document Collection

Sample: Broad Subject Fields 01--Aviation Technology 02--Agriculture 03--Astronomy and Astrophysics 04--Atmospheric Sciences 05--Behavioral and Social Sciences 06--Biological and Medical Sciences 07--Chemistry 08--Earth Sciences and Oceanography

Sample: Narrow Subject Groups Aviation Technology 01 Aerodynamics 02 Military Aircraft Operations 03 Aircraft 0301 Helicopters 0302 Bombers 0303 Attack and Fighter Aircraft 0304 Patrol and Reconnaissance Aircraft

Distribution among Categories

Baseline • Establish baseline for conventional techniques • classification • training SVM for each subject area • “off-the-shelf” document modelling and SVM libraries

Why SVM? • Prior studies have suggested good results with SVM • relatively immune to “overfitting” – fitting to coincidental relations encountered during training • low dimensionality of model parameters

Machine Learning: Support Vector Machines • Binary Classifier • Finds the hyperplane with the largest margin separating the two classes of training samples • Subsequently classifies items based on which side of the hyperplane they fall on • (Figure: separating hyperplane and its margin; labels: Font size, Line number)
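For a linear SVM, the separating hyperplane can be written as w·x + b = 0; an item is classified by the sign of w·x + b, and the margin between the two classes has width 2/‖w‖. The sketch below only evaluates a given hyperplane. The weights and points are hypothetical, and training (finding the maximum-margin w and b) is left to a library.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Binary decision: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the margin, 2 / ||w||, for a hyperplane in canonical form."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane in a two-feature space.
w, b = [1.0, -1.0], 0.0

label_a = classify(w, b, [3.0, 1.0])   # decision value  2.0
label_b = classify(w, b, [1.0, 3.0])   # decision value -2.0
width = margin_width(w)
print(label_a, label_b, width)
```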

SVM Evaluation

Baseline SVM Evaluation • Training & Testing process repeated for multiple subject categories • Determine accuracy • overall • positive (ability to recognize new documents that belong in the class the SVM was trained for) • negative (ability to reject new documents that belong to other classes) • Explore Training Issues
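The three accuracy figures named above can be computed from predicted and true labels as in this sketch. The function name and the example labels are illustrative only.

```python
def evaluate(predictions, truths):
    """Overall, positive, and negative accuracy for a binary classifier.

    positive: share of true class members the classifier recognized
    negative: share of non-members the classifier rejected
    """
    on_positives = [p for p, t in zip(predictions, truths) if t]
    on_negatives = [p for p, t in zip(predictions, truths) if not t]
    true_pos = sum(1 for p in on_positives if p)
    true_neg = sum(1 for p in on_negatives if not p)
    return {
        "overall": (true_pos + true_neg) / len(truths),
        "positive": true_pos / len(on_positives),
        "negative": true_neg / len(on_negatives),
    }

# Hypothetical test run: 4 documents in the class, 4 outside it.
truths      = [True, True, True, True, False, False, False, False]
predictions = [True, True, False, False, False, False, False, True]

result = evaluate(predictions, truths)
print(result)
```

Reporting the positive and negative figures separately matters because a classifier that rejects everything can still score well overall when negatives dominate the test set.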

SVM “Out of the Box” • 16 broad categories with 150 or more documents • Lucene library for model preparation • LibSVM for SVM training & testing • no normalization or parameter tuning • Training set of 100/100 (positive/negative samples) • Test set of 50/50
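Drawing 100 positive training samples and 50 positive test samples requires 150 positives, which would be consistent with the 150-document threshold above. A minimal sketch of such balanced sampling follows; the document structure (a `fields` set per document), the seed, and the toy collection are assumptions for illustration.

```python
import random

def sample_sets(docs, category, n_train=100, n_test=50, seed=0):
    """Draw disjoint balanced train/test sets for one category's binary classifier."""
    rng = random.Random(seed)
    positives = [d for d in docs if category in d["fields"]]
    negatives = [d for d in docs if category not in d["fields"]]
    # Sample without replacement so training and test documents never overlap.
    pos = rng.sample(positives, n_train + n_test)
    neg = rng.sample(negatives, n_train + n_test)
    train = [(d, +1) for d in pos[:n_train]] + [(d, -1) for d in neg[:n_train]]
    test = [(d, +1) for d in pos[n_train:]] + [(d, -1) for d in neg[n_train:]]
    return train, test

# Hypothetical collection: 150 documents in field "07", 250 in field "08".
docs = [{"id": i, "fields": {"07"} if i < 150 else {"08"}} for i in range(400)]

train, test = sample_sets(docs, "07")
print(len(train), len(test))
```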

“OOtB” Interpretation • Reasonable performance on broad categories given modest training set size • Related experiment showed that with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%

Training Set Size

Training Set Size • accuracy plateaus for training set sizes well under the number of terms in the document model

Training Issues • Training Set Size • Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject • Possible Solution: collection may have few positive examples, but has many, many negative examples • Positive/Negative Training Mixes • effects on accuracy

Increased Negative Training

Training Set Composition • experiment performed with 50 positive training examples • OotB SVM training • increasing the number of negative training examples has little effect on overall accuracy • but positive accuracy reduced

Interpretation • may indicate a weakness in SVM • or simply further evidence of the importance of optimizing SVM parameters • may indicate unsuitability of treating SVM output as simple boolean decision • might do better as “best fit” in a multi-label classifier
