Document Categorization • Problem: given • a collection of documents, and • a taxonomy of subject areas • Classification: Determine the subject area(s) most pertinent to each document • Indexing: Select a set of keywords / index terms appropriate to each document
Classification Techniques • Manual (a.k.a. Knowledge Engineering) • typically, rule-based expert systems • Machine Learning • Probabilistic (e.g., Naïve Bayesian) • Decision Structures (e.g., Decision Trees) • Profile-Based • compare document to profile(s) of subject classes • similarity rules similar to those employed in I.R. • Support Vector Machines (SVM)
Machine Learning Procedures • Usually train-and-test • Exploit an existing collection in which documents have already been classified • a portion used as the training set • another portion used as a test set • permits measurement of classifier effectiveness • allows tuning of classifier parameters to yield maximum effectiveness • Single- vs. multi-label • can 1 document be assigned to multiple categories?
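A minimal sketch of the train-and-test split described above, assuming the collection is a list of (document id, categories) pairs. The function name `train_test_split`, the 67% training fraction, and the toy collection are illustrative assumptions, not details from the study.

```python
import random

def train_test_split(labeled_docs, train_fraction=0.67, seed=0):
    """Shuffle already-classified documents and split them into two disjoint portions."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)          # fixed seed keeps the split repeatable
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]              # (training set, test set)

# Hypothetical pre-classified collection: (document id, set of assigned categories).
collection = [(f"doc{i}", {"Chemistry"} if i % 2 else {"Aircraft"}) for i in range(12)]
train, test = train_test_split(collection)
```

Storing the categories as a set leaves room for the multi-label case, where one document carries several categories.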
Automatic Indexing • Assign to each document up to k terms drawn from a controlled vocabulary • Typically reduced to a multi-label classification problem • each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
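The reduction to multi-label classification can be sketched as follows, assuming one binary classifier per controlled-vocabulary keyword has already produced a score for the document. The function `assign_index_terms`, the zero threshold, and the example scores are hypothetical.

```python
def assign_index_terms(scores, k, threshold=0.0):
    """Pick up to k controlled-vocabulary terms whose classifier score exceeds threshold."""
    accepted = [(term, s) for term, s in scores.items() if s > threshold]
    accepted.sort(key=lambda pair: pair[1], reverse=True)   # strongest match first
    return [term for term, _ in accepted[:k]]

# Hypothetical per-keyword classifier scores for a single document.
scores = {"helicopters": 1.4, "aerodynamics": 0.3, "agriculture": -0.9, "bombers": -0.2}
terms = assign_index_terms(scores, k=3)
```

Here only two keywords clear the threshold, so the document receives fewer than k terms, which matches the "up to k" wording.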
Case Study: SVM categorization • Document Collection from DTIC • 10,000 documents • previously classified manually • Taxonomy of • 25 broad subject fields, divided into a total of • 251 narrower groups • Document lengths average 2705 ± 1464 words, 623 ± 274 significant unique terms. • Collection has 32457 significant unique terms
Sample: Broad Subject Fields 01--Aviation Technology 02--Agriculture 03--Astronomy and Astrophysics 04--Atmospheric Sciences 05--Behavioral and Social Sciences 06--Biological and Medical Sciences 07--Chemistry 08--Earth Sciences and Oceanography
Sample: Narrow Subject Groups Aviation Technology 01 Aerodynamics 02 Military Aircraft Operations 03 Aircraft 0301 Helicopters 0302 Bombers 0303 Attack and Fighter Aircraft 0304 Patrol and Reconnaissance Aircraft
Baseline • Establish baseline for conventional techniques • classification • training SVM for each subject area • “off-the-shelf” document modelling and SVM libraries
Why SVM? • Prior studies have suggested good results with SVM • relatively immune to “overfitting” – fitting to coincidental relations encountered during training • low dimensionality of model parameters
Machine Learning: Support Vector Machines • Binary Classifier • Finds the hyperplane with the largest margin separating the two classes of training samples • Subsequently classifies items based on which side of the hyperplane they fall [Figure: two classes of samples separated by a hyperplane and its margin; labels: Font size, Line number]
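For reference, the largest-margin idea is usually written as the following optimization problem. This is the standard textbook hard-margin form, not something taken from the slides; the symbols $w$, $b$, $x_i$, and $y_i \in \{-1, +1\}$ are introduced here for illustration.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n

\text{margin width} = \frac{2}{\lVert w \rVert},
\qquad
\hat{y}(x) = \operatorname{sign}(w \cdot x + b)
```

Minimizing $\lVert w \rVert$ maximizes the margin, and the sign of $w \cdot x + b$ says which side of the hyperplane a new item falls on.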
Baseline SVM Evaluation • Training & Testing process repeated for multiple subject categories • Determine accuracy • overall • positive (ability to recognize new documents that belong in the class the SVM was trained for) • negative (ability to reject new documents that belong to other classes) • Explore Training Issues
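The three accuracy figures can be computed as in the sketch below, assuming boolean truth and prediction lists for a single category and at least one positive and one negative test document. The name `accuracy_report` and the sample data are illustrative.

```python
def accuracy_report(truth, predicted):
    """Overall, positive, and negative accuracy for one binary classifier."""
    pos = [p for t, p in zip(truth, predicted) if t]        # predictions for in-class documents
    neg = [p for t, p in zip(truth, predicted) if not t]    # predictions for out-of-class documents
    correct_pos = sum(1 for p in pos if p)                  # in-class documents recognized
    correct_neg = sum(1 for p in neg if not p)              # out-of-class documents rejected
    return {
        "overall": (correct_pos + correct_neg) / len(truth),
        "positive": correct_pos / len(pos),
        "negative": correct_neg / len(neg),
    }

# Hypothetical test results: four in-class and four out-of-class documents.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, True, True]
report = accuracy_report(truth, predicted)
```

Reporting the positive and negative figures separately matters because a classifier can reach high overall accuracy while failing on the smaller class.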
SVM “Out of the Box” • 16 broad categories with 150 or more documents • Lucene library for model preparation • LibSVM for SVM training & testing • no normalization or parameter tuning • Training set of 100/100 (positive/negative samples) • Test set of 50/50
“OotB” Interpretation • Reasonable performance on broad categories given modest training set size. • Related experiment showed that with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%
Training Set Size • accuracy plateaus for training set sizes well under the number of terms in the document model
Training Issues • Training Set Size • Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject • Possible Solution: collection may have few positive examples, but has many, many negative examples • Positive/Negative Training Mixes • effects on accuracy
Training Set Composition • experiment performed with 50 positive training examples • OotB SVM training • increasing the number of negative training examples has little effect on overall accuracy • but positive accuracy reduced
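One way such positive/negative mixes could be assembled is sketched below, holding the positive examples at 50 and varying the negatives. The function `build_training_set`, the document ids, and the particular negative counts are assumptions for illustration; the slides do not list the mixes that were actually tested.

```python
import random

def build_training_set(positives, negatives, n_pos, n_neg, seed=0):
    """Sample n_pos in-class and n_neg out-of-class documents and label them +1 / -1."""
    rng = random.Random(seed)
    chosen_pos = rng.sample(positives, n_pos)
    chosen_neg = rng.sample(negatives, n_neg)
    examples = [(d, +1) for d in chosen_pos] + [(d, -1) for d in chosen_neg]
    rng.shuffle(examples)                      # avoid presenting all positives first
    return examples

# Hypothetical pools: few positive documents, many negative ones.
positives = [f"pos{i}" for i in range(60)]
negatives = [f"neg{i}" for i in range(500)]
mixes = {n_neg: build_training_set(positives, negatives, 50, n_neg)
         for n_neg in (50, 100, 200, 400)}
```

Each entry in `mixes` would then be used to train a separate SVM, so that positive and negative accuracy can be compared across compositions.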
Interpretation • may indicate a weakness in SVM • or simply further evidence of the importance of optimizing SVM parameters • may indicate unsuitability of treating SVM output as simple boolean decision • might do better as “best fit” in a multi-label classifier
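The contrast between a boolean decision and a "best fit" choice can be illustrated as follows, assuming each per-category SVM exposes a signed decision value for the document. Both function names and the decision values are hypothetical.

```python
def boolean_labels(decision_values):
    """Accept every category whose SVM output is on the positive side."""
    return {c for c, v in decision_values.items() if v > 0}

def best_fit_labels(decision_values, top_n=1):
    """Accept the top_n categories ranked by SVM output, whatever their sign."""
    ranked = sorted(decision_values, key=decision_values.get, reverse=True)
    return ranked[:top_n]

# Hypothetical outputs where every category's SVM narrowly rejects the document.
decision_values = {
    "Chemistry": -0.15,
    "Earth Sciences and Oceanography": -0.40,
    "Agriculture": -1.20,
}
as_boolean = boolean_labels(decision_values)
as_best_fit = best_fit_labels(decision_values)
```

Under the boolean reading the document gets no category at all, while the best-fit reading still assigns the least-rejected one, which is the behavior the last bullet speculates might work better.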