Text categorization (updated 11/1/2006)
Performance measures – binary classification

Contingency table (a, b, c, d are document counts):

                     ground truth: yes   ground truth: no
  classifier: yes            a                   b
  classifier: no             c                   d

• Accuracy: acc = (a+d)/(a+b+c+d)
• Precision: p = a/(a+b)
• Recall: r = a/(a+c)
• F_β = (β²+1)pr/(β²p + r); usually one uses F1 = 2pr/(p+r)
• Break-even point: the point, found by varying the decision threshold, at which precision equals recall
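As a concrete illustration (mine, not from the slides; the function name and the example counts are made up), a minimal Python sketch computing these measures from the four contingency counts:

# Minimal sketch: binary-classification measures from the contingency
# counts a (true positives), b (false positives), c (false negatives),
# d (true negatives).
def binary_measures(a, b, c, d, beta=1.0):
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    b2 = beta * beta
    f_beta = ((b2 + 1) * precision * recall / (b2 * precision + recall)
              if precision + recall else 0.0)
    return accuracy, precision, recall, f_beta

# Hypothetical counts: 80 true pos, 20 false pos, 40 false neg, 860 true neg.
acc, p, r, f1 = binary_measures(80, 20, 40, 860)
print(f"acc={acc:.3f}  p={p:.3f}  r={r:.3f}  F1={f1:.3f}")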
Performance measures – multiple categories
• Micro averaging: pool the per-category contingency tables into one global table, then compute the measures from the pooled counts (large categories dominate); see the sketch below.
• Macro averaging: compute the measure for each category separately, then take the unweighted average over categories (small categories count as much as large ones).
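A small sketch contrasting the two orders of aggregation (the per-category counts are hypothetical):

# Sketch of micro vs. macro averaging over per-category contingency
# counts (a, b, c) = (true positives, false positives, false negatives).
def f1(a, b, c):
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

per_category = {"earn": (800, 50, 60), "wheat": (5, 2, 20)}

# Macro: F1 per category first, then the unweighted mean.
macro_f1 = sum(f1(*t) for t in per_category.values()) / len(per_category)

# Micro: pool the counts over categories, then compute F1 once.
A = sum(t[0] for t in per_category.values())
B = sum(t[1] for t in per_category.values())
C = sum(t[2] for t in per_category.values())
micro_f1 = f1(A, B, C)

print(f"macro={macro_f1:.3f}  micro={micro_f1:.3f}")

On these toy counts the micro average tracks the large 'earn' category, while the macro average is pulled down by the small, poorly classified one.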
Reuters 21578
• The Reuters collection contains 9603 training articles and 3299 test articles.
• The articles were sent over the Reuters newswire in 1987.
• It contains about 100 categories such as 'mergers and acquisitions', 'interest rates', 'wheat', 'silver', etc.
• The distribution of articles among categories is highly non-uniform:
  – 'earning' contains 2709 docs
  – 75 categories contain fewer than 10 docs each.
Example of a Reuters news story from category 'earning'

<DATE>26-FEB-1987 15:18:59.34</DATE>
<TOPICS><D>earn</D></TOPICS>
<TEXT>
<TITLE>COBANCO INC <CBCO> YEAR NET</TITLE>
<DATELINE> SANTA CRUZ, Calif., Feb 26 - </DATELINE>
<BODY>Shr 34 cts vs 1.19 dlrs
Net 807,000 vs 2,858,000
Assets 510.2 mln vs 479.7 mln
Deposits 472.3 mln vs 440.3 mln
Loans 299.2 mln vs 327.2 mln
Note: 4th qtr not available. Year includes 1985 extraordinary gain
from tax carry forward of 132,000 dlrs, or five cts per shr.
Reuter
</BODY></TEXT>
</REUTERS>
Categorization methods
• Decision trees
• Naïve Bayes
• K-nearest neighbors (KNN)
• Neural networks
• Support Vector Machines (SVM)
Representation of documents
• The most popular representation is 'bag of words', which ignores all structure of the document.
• Document i is represented by a vector x_i ∈ R^n (n is the number of word types), where the j-th coordinate is the number of times word w_j appears in the document (the so-called term frequency, tf_j).
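A minimal sketch of this representation on a toy two-document corpus (the corpus and variable names are illustrative):

# Sketch: bag-of-words term-frequency vectors over a toy vocabulary.
from collections import Counter

docs = ["net profit rose vs last year", "wheat prices fell"]
vocab = sorted({w for d in docs for w in d.split()})   # word types

def tf_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]   # j-th coord = tf of vocab[j]

for d in docs:
    print(tf_vector(d))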
Decision trees

Example: deciding whether a document is in category 'earnings'. Each node shows the fraction of its documents that belong to the category:

Earnings? (root: 2301/7681 = 0.30 of all docs)
• contains “cents” ≥ 2 times: 1607/1704 = 0.943
  – contains “versus” ≥ 2 times: 1398/1403 = 0.996 → “yes”
  – contains “versus” < 2 times: 209/301 = 0.694
• contains “cents” < 2 times: 694/5977 = 0.116
  – contains “net” ≥ 1 time: 422/541 = 0.780
  – contains “net” < 1 time: 272/5436 = 0.050 → “no”
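Read as nested rules, the tree is straightforward to apply. A small sketch (mine, not from the slides) that returns the leaf probability for a document given a dict of its term frequencies:

# Sketch: the 'earnings' tree above as nested rules; thresholds and
# leaf probabilities are the ones shown on the slide. tf maps a word
# to its term frequency in the document.
def p_earnings(tf):
    if tf.get("cents", 0) >= 2:
        return 0.996 if tf.get("versus", 0) >= 2 else 0.694
    return 0.780 if tf.get("net", 0) >= 1 else 0.050

print(p_earnings({"cents": 3, "versus": 2}))   # -> 0.996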
Building decision trees
• Information gain: at each node, choose the split that most reduces the entropy of the category label, gain(split) = H(parent) − Σ_k (n_k/n)·H(child_k); see the sketch below.
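As an illustration, a sketch that scores a binary split by its information gain, evaluated on the root split of the tree above:

# Sketch: information gain of a candidate split for a binary category.
import math

def entropy(p):
    # Binary entropy in bits; 0 by convention at p = 0 or 1.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(pos, n, children):
    # children: list of (pos_k, n_k) pairs for the child nodes.
    parent = entropy(pos / n)
    remainder = sum(n_k / n * entropy(pos_k / n_k) for pos_k, n_k in children)
    return parent - remainder

# Root split on 'contains "cents" >= 2 times' from the slide:
# 2301/7681 positive overall; children 1607/1704 and 694/5977.
print(information_gain(2301, 7681, [(1607, 1704), (694, 5977)]))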
Naïve Bayes
• Multivariate Bernoulli model: a document is a binary vector of word occurrences.
• Multinomial model: a document is a bag of word tokens, so term frequencies matter.
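A sketch of the multinomial model with add-one (Laplace) smoothing; the training data is a toy stand-in:

# Sketch: multinomial Naive Bayes with Laplace smoothing.
import math
from collections import Counter, defaultdict

train = [("earn", "net profit rose"), ("earn", "net loss vs profit"),
         ("wheat", "wheat prices fell")]

class_docs = defaultdict(int)       # docs per class (for the prior)
word_counts = defaultdict(Counter)  # per-class term frequencies
vocab = set()
for c, text in train:
    class_docs[c] += 1
    word_counts[c].update(text.split())
    vocab.update(text.split())

def log_score(c, text):
    # log P(c) + sum over tokens of log P(w | c), with add-one smoothing.
    total = sum(word_counts[c].values())
    s = math.log(class_docs[c] / len(train))
    for w in text.split():
        s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return s

print(max(class_docs, key=lambda c: log_score(c, "net profit")))   # 'earn'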
Neural networks
• Perceptrons
• Multi-layer perceptrons
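A sketch of the classic perceptron update rule on term-frequency vectors (the data and names are hypothetical):

# Sketch: perceptron training; x is a term-frequency vector,
# y is the label, +1 (in the category) or -1 (not in it).
def train_perceptron(data, n_features, epochs=10, lr=1.0):
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in data:
            # Update only on misclassified (or boundary) examples.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

data = [([2, 0, 1], +1), ([0, 3, 0], -1)]   # toy tf vectors with labels
print(train_perceptron(data, 3))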
Reuters 21578 – comparison*

*Yiming Yang and Xin Liu, “A re-examination of text categorization methods”, SIGIR ’99.