An Introduction To Categorization Soam Acharya, PhD soamdev@yahoo.com 1/15/2003
What is Categorization? • {c1 … cm}: set of predefined categories • {d1 … dn}: set of candidate documents • Fill a decision matrix with values {0,1}: entry (i, j) records whether document dj belongs under category ci • Categories are symbolic labels
Uses • Document organization • Document filtering • Word sense disambiguation • Web • Internet directories • Organization of search results • Clustering
Categorization Techniques • Knowledge systems • Machine Learning
Knowledge Systems • Manually build an expert system that makes categorization judgments • Sequence of rules per category: if <boolean condition> then category • Example: if document contains “buena vista home entertainment” then document category is “Home Video”
Knowledge System Issues • Scalability • Build • Tune • Requires Domain Experts • Transferability
Machine Learning Approach • Build a classifier for a category • Training set • Hierarchy of categories • Submit candidate documents for automatic classification • Expend effort in building a classifier, not in knowing the knowledge domain
Machine Learning Process • [Flow diagram: documents drawn from a DB pass through document pre-processing to form the training set; the training set and a taxonomy feed the classifier, which then labels candidate documents]
Training Set • Initial corpus can be divided into: • Training set • Test set • Role of workflow tools
Document Preprocessing • Document Conversion: converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text • Tokenizing/Parsing • Stemming • Document vectorization • Dimension reduction
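To make the pre-processing step concrete, here is a minimal Python sketch; the regex tokenizer and the toy suffix-stripping stemmer are illustrative stand-ins, not the specific tools the talk assumes.

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    # Toy stemmer: strip a few common English suffixes.
    # A real pipeline would use something like the Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Text extracted from .doc/.ppt/.pdf etc. is assumed to arrive as a string.
    return [stem(t) for t in tokenize(text)]

print(preprocess("Categorizing documents requires parsing and stemming."))
# -> ['categoriz', 'document', 'requir', 'pars', 'and', 'stemm']
```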
Document Vectorization • Convert document text into a “bag of words” • Each document is a vector of n weighted terms • Example term weights: federal express 3, severe 3, flight 2, Y2000-Q3 1, mountain 2, exactly 1, simple 5
Document Vectorization • Use the tfidf function for term weighting: tfidf(tk, dj) = #(tk, dj) · log(|Tr| / #(tk)) • #(tk, dj): number of times tk occurs in dj • #(tk): number of documents in which tk occurs at least once • |Tr|: cardinality of the training set • tfidf values may be normalized so that all vectors are of equal length and weights fall in [0,1]
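A short Python sketch of the tfidf weighting and normalization just described; the document representation (token lists) is an assumption carried over from the preprocessing step.

```python
import math
from collections import Counter

def tfidf_vectors(training_docs):
    """training_docs: list of token lists (the training set Tr)."""
    n = len(training_docs)
    # df[t] = number of documents in which term t occurs at least once, i.e. #(t)
    df = Counter(t for doc in training_docs for t in set(doc))
    vectors = []
    for doc in training_docs:
        tf = Counter(doc)  # tf[t] = #(t, d)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Cosine-normalize so every vector has unit length and weights fall in [0, 1].
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

docs = [["flight", "severe", "severe"], ["mountain", "flight"]]
print(tfidf_vectors(docs))
```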
Dimension Reduction • Reduce dimensionality of the vector space • Why? • Reduce computational complexity • Address the “overfitting” problem: overtuning the classifier to the training set • How? • Feature selection • Feature extraction
Feature Selection • Also known as “term space reduction” • Remove “stop” words • Identify the “best” words for categorizing each topic • Document frequency of terms: keep terms that occur in the highest number of documents • Other measures: chi-square, information gain
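A sketch of term space reduction by document frequency (the chi-square and information-gain measures are omitted for brevity), with a deliberately tiny stop-word list:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # tiny illustrative list

def select_terms(training_docs, keep=1000):
    """Keep the terms that occur in the highest number of documents."""
    df = Counter(t for doc in training_docs for t in set(doc) if t not in STOP_WORDS)
    return {t for t, _ in df.most_common(keep)}

def project(doc, vocabulary):
    # Drop every term outside the reduced term space.
    return [t for t in doc if t in vocabulary]
```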
Feature Extraction • Synthesize new features from existing features • Term clustering: use clusters/centroids instead of terms • Co-occurrence and co-absence • Latent Semantic Indexing: compresses vectors into a lower-dimensional space
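A minimal sketch of the LSI idea using a plain SVD from NumPy (a library choice of this write-up, not the talk): the term-document matrix is projected onto its top k latent dimensions.

```python
import numpy as np

def lsi_compress(term_doc_matrix, k=2):
    """Project a (terms x docs) count matrix onto its top-k latent dimensions."""
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Each column of the result is one document in the compressed k-dim space.
    return np.diag(s[:k]) @ vt[:k, :]

# 4 terms x 3 documents of raw term counts
A = np.array([[3.0, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 2]])
print(lsi_compress(A, k=2).shape)  # (2, 3): 3 docs, now 2 dimensions each
```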
Creating a Classifier • Define a function, the Categorization Status Value (CSV), such that for a document d: • CSVi: D -> [0,1] • CSVi(d) is the confidence that d belongs in ci • May be Boolean, a probability, or a vector distance
Creating a Classifier • Define a threshold, thresh, such that if CSVi(d) > thresh(i) then categorize d under ci otherwise, don’t • CSV thresholding • Fixed value across all categories • Vary per category • Optimize via testing
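Thresholding itself is simple once the CSV scores exist; a sketch with hypothetical category names and hand-picked per-category thresholds:

```python
def categorize(doc_scores, thresholds):
    """doc_scores: {category: CSV value in [0, 1]} for one document.
    thresholds: {category: cutoff}, either fixed or tuned per category via testing."""
    return [c for c, csv in doc_scores.items() if csv > thresholds[c]]

scores = {"Home Video": 0.82, "Finance": 0.40, "Travel": 0.55}
thresh = {"Home Video": 0.7, "Finance": 0.5, "Travel": 0.6}
print(categorize(scores, thresh))  # -> ['Home Video']
```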
Naïve Bayes Classifier • P(ci | dj) = P(ci) · P(dj | ci) / P(dj): probability that document dj belongs in category ci • Training-set terms/weights present in dj are used to calculate the probability of dj belonging to ci
Naïve Bayes Classifier • If wkj is binary {0,1} and pki is short for P(wkx = 1 | ci), then P(dj | ci) = ∏k pki^wkj · (1 − pki)^(1 − wkj) • After further derivation, the original equation becomes: log P(ci | dj) = log P(ci) + Σk wkj log [pki / (1 − pki)] + Σk log (1 − pki) − log P(dj) • log P(ci), Σk log (1 − pki), and (across categories) log P(dj) are constants for all docs, so the result can be used as a CSV
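A sketch of the derived log-domain score for one category, assuming the pki estimates (which need smoothing in practice to avoid log(0)) are already computed from the training set:

```python
import math

def nb_csv(doc_terms, vocab, p_ki, prior):
    """Log-domain CSV for one category under the binary independence model.
    doc_terms: set of terms present in the document (w_kj = 1).
    p_ki: {term: P(w_k = 1 | c_i)}, estimated from the training set (smoothed).
    prior: P(c_i). The log P(d_j) term is dropped: constant across categories."""
    score = math.log(prior)
    for t in vocab:
        w = 1 if t in doc_terms else 0
        p = p_ki[t]
        score += w * math.log(p / (1 - p)) + math.log(1 - p)
    return score

vocab = {"flight", "severe", "mountain"}
p = {"flight": 0.6, "severe": 0.3, "mountain": 0.1}  # made-up estimates
print(nb_csv({"flight", "severe"}, vocab, p, prior=0.4))
```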
Naïve Bayes Classifier • Independence assumption • Feature selection can be counterproductive
k-NN Classifier • Compute closeness between candidate documents and category documents: CSVi(dj) = Σ over dz in Trk(dj) of RSV(dj, dz) · ciz • CSVi(dj): confidence score indicating whether dj belongs to category ci • RSV(dj, dz): similarity between dj and training set document dz • ciz ∈ {0,1}: whether training document dz belongs to ci
k-NN Classifier • k nearest neighbors • Find k nearest neighbors from all training documents and use their categories • K can also indicate the number of top ranked training documents per category to compare against • Similarity computation can be: • Inner product • Cosine coefficient
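A sketch of the k-NN scoring with cosine similarity over the sparse tfidf vectors built earlier; representing the training set as a list of (vector, category) pairs is an assumption of this write-up.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(doc, training, k=3):
    """Sum the similarities of the k nearest training docs, per category."""
    ranked = sorted(training, key=lambda dz: cosine(doc, dz[0]), reverse=True)
    scores = defaultdict(float)
    for vec, cat in ranked[:k]:
        scores[cat] += cosine(doc, vec)
    return dict(scores)

train = [({"flight": 1.0}, "Travel"), ({"severe": 1.0, "flight": 0.5}, "Weather")]
print(knn_csv({"flight": 1.0, "severe": 0.2}, train, k=2))
```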
Support Vector Machines • A “decision surface” (hyperplane) that best separates the data points of two classes • Support vectors are the training docs that best define the hyperplane • [Diagram: two classes of points separated by the optimal hyperplane at maximum margin]
Support Vector Machines • Training process involves finding the support vectors • Only care about support vectors in the training set, not other documents
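For completeness, a hedged end-to-end sketch using scikit-learn, a library that postdates this talk; the training texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "buena vista home entertainment dvd release",
    "quarterly earnings and revenue forecast",
    "new dvd and video releases this week",
    "stock market closes higher on earnings",
]
train_labels = ["Home Video", "Finance", "Home Video", "Finance"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)  # tfidf document vectors

clf = LinearSVC()  # finds the maximum-margin separating hyperplane
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["dvd video release"])))
# expected: ['Home Video']
```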
Neural Networks • Train a net to learn a mapping from input words to a category • One neural net per category: too expensive • One network overall • Perceptron approach (no hidden layer) • Three-layered (one hidden layer)
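A sketch of the hidden-layer-free perceptron variant for a single category; input documents are term-weight dicts and labels are 1 (in category) or 0 (not):

```python
def train_perceptron(examples, vocab, epochs=10, lr=0.1):
    """examples: list of (term-weight dict, label in {0, 1}) for one category."""
    w = {t: 0.0 for t in vocab}
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(w[t] * x.get(t, 0.0) for t in vocab) + bias
            pred = 1 if activation > 0 else 0
            if pred != y:
                # Standard perceptron update: nudge weights toward the correct label.
                for t in vocab:
                    w[t] += lr * (y - pred) * x.get(t, 0.0)
                bias += lr * (y - pred)
    return w, bias
```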
Classifier Committees • Combine multiple classifiers • Majority voting • Category specialization • Mixed results
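Majority voting, at least, is trivial to express; a sketch where each committee member is any function from document to category label:

```python
from collections import Counter

def majority_vote(classifiers, doc):
    """classifiers: list of callables mapping a document to a category label."""
    votes = Counter(clf(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]  # ties broken arbitrarily

# e.g. majority_vote([naive_bayes, knn, svm], doc) with three trained classifiers
```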
Classification Performance • Category ranking evaluation • Recall = (categories found and correct) / (total categories correct) • Precision = (categories found and correct) / (total categories found) • Micro- and macro-averaging over categories
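A sketch of the two averaging styles over per-category counts; the triple layout (found-and-correct, total found, total correct) is an assumed bookkeeping format:

```python
def micro_macro(per_category):
    """per_category: {category: (found_and_correct, total_found, total_correct)}."""
    tp = sum(f for f, _, _ in per_category.values())
    found = sum(g for _, g, _ in per_category.values())
    correct = sum(c for _, _, c in per_category.values())
    # Micro-averaging: pool the counts across categories, then divide.
    micro_p, micro_r = tp / found, tp / correct
    # Macro-averaging: compute per category, then average the ratios.
    precisions = [f / g for f, g, _ in per_category.values() if g]
    recalls = [f / c for f, _, c in per_category.values() if c]
    macro_p = sum(precisions) / len(precisions)
    macro_r = sum(recalls) / len(recalls)
    return micro_p, micro_r, macro_p, macro_r

stats = {"Home Video": (8, 10, 12), "Finance": (3, 4, 5)}
print(micro_macro(stats))
```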
Classification Performance • Comparison is hard • Two studies: • Yiming Yang, 1997 • Yiming Yang and Xin Liu, 1999 • SVM, kNN >> Neural Net > Naïve Bayes • Performance converges for common categories (those with many training docs)
Computational Bottlenecks • Quiver • # of topics • # of training documents • # of candidate documents
Categorization and the Internet • Classification as a service • Standardizing vocabulary • Confidentiality • Performance • Use of hypertext in categorization • Augment existing classifiers to take advantage of links
Hypertext and Categorization • An already categorized document links to documents within the same category • Neighboring documents are in a similar category • Hierarchical nature of categories • Metatags
Augmenting Classifiers • Inject anchor text for a document into that document • Treat anchor text as separate terms • Depends on dataset • Mixed experimental results • Links may be noisy • Ads • Navigation
Topics and the Web • Topic distillation • Analysis of hyperlink graph structure • Authorities: popular pages • Hubs: pages that link to authorities • [Diagram: hub pages pointing to authority pages]
Topic Distillation • Kleinberg’s HITS algorithm • Start with an initial set of pages: the root set • Use this to create an expanded set • Weight propagation phase: each node carries an authority score and a hub score • Alternately update: • Authority = sum of the current hub weights of all nodes pointing to it • Hub = sum of the authority scores of all pages it points to • Normalize node scores and iterate until convergence • Output is a set of hubs and authorities
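A compact sketch of the HITS iteration as outlined above; the graph is an adjacency map over the expanded page set, with invented page names.

```python
import math

def hits(graph, iterations=50):
    """graph: {page: set of pages it links to}. Returns (authority, hub) dicts."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority = sum of current hub weights of all nodes pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # Hub = sum of authority scores of all pages it points to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize both score vectors so the iteration converges.
        for d in (auth, hub):
            norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
            for p in d:
                d[p] /= norm
    return auth, hub

g = {"a": {"c"}, "b": {"c", "d"}, "c": {"d"}}
auth, hub = hits(g)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # top authority, top hub
```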
Conclusion • Why Classify? • The Classification Process • Various Classifiers • Which ones are better? • Other applications