Corpora and Statistical Methods: Lecture 13. Albert Gatt
Part 2 Supervised methods
General characterisation
• Training set:
  • documents labeled with one or more classes
  • encoded using some representation model
• Typical data representation model:
  • every document is represented as a vector of real-valued measurements plus a class
  • the vector may represent word counts
  • the class is given, since this is supervised learning
Data sources: Reuters collection
• http://www.daviddlewis.com/resources/testcollections/reuters21578/
• Large collection of Reuters newswire texts, categorised by topic.
• Topics include:
  • earn(ings)
  • grain
  • wheat
  • acq(uisitions)
  • …
Reuters dataset

Text 1
<REUTERS>
<DATE>17-MAR-1987 11:07:22.82</DATE>
<TOPICS><D>earn</D></TOPICS>
<TEXT>
<TITLE>AMRE INC <AMRE> 3RD QTR JAN 31 NET</TITLE>
<DATELINE> DALLAS, MArch 17 - </DATELINE>
<BODY>
Shr five cts vs one ct
Net 196,986 vs 37,966
Revs 15.5 mln vs 8,900,000
Nine mths
Shr 52 cts vs 22 cts
Net two mln vs 874,000
Revs 53.7 mln vs 28.6 mln
Reuter
</BODY>
</TEXT>
</REUTERS>

Text 2
<REUTERS>
<DATE>17-MAR-1987 11:26:47.36</DATE>
<TOPICS><D>acq</D></TOPICS>
<TEXT>
<TITLE>DEVELOPMENT CORP OF AMERICA <DCA> MERGED</TITLE>
<DATELINE>HOLLYWOOD, Fla., March 17 -</DATELINE>
<BODY>
Development Corp of America said its merger with Lennar Corp <LEN> was completed and its stock no longer existed.
Development Corp of America, whose board approved the acquisition last November for 90 mln dlrs, said the merger was effective today and its stock now represents the right to receive 15 dlrs a share.
The American Stock Exchange said it would provide further details later.
Reuter
</BODY>
</TEXT>
</REUTERS>
Representing documents: vector representations
• Suppose we select k = 20 keywords that are diagnostic of the earnings category.
  • This can be done using chi-square, topic signatures, etc.
• Each document d_j is represented as a vector containing a term weight s_ij for each of the k terms:

$$ s_{ij} = \mathrm{round}\left(10 \times \frac{1 + \log(tf_{ij})}{1 + \log(l_j)}\right) $$

  • tf_ij = number of times term i occurs in doc j
  • l_j = length of doc j
Why use a log weighting scheme?
• A formula like 1 + log(tf) dampens the actual frequency.
• Example:
  • let d be a document of 89 words
  • profit occurs 6 times, so tf(profit) = 6
    • 10 × [(1 + log(tf(profit))) / (1 + log(89))] ≈ 6 (base-10 logs, rounded)
  • cts (“cents”) occurs 3 times, so tf(cts) = 3
    • 10 × [(1 + log(tf(cts))) / (1 + log(89))] ≈ 5
  • we avoid overestimating the importance of profit relative to cts (profit is more important than cts, but not twice as important)
• Log weighting schemes are common in information retrieval.
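The weighting scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the function name `term_weight` and the decision to give absent terms a weight of 0 are assumptions, and base-10 logarithms are used because they reproduce the slide's values of 6 and 5.

```python
import math

def term_weight(tf: int, doc_len: int, scale: int = 10) -> int:
    """Scaled, rounded log weight for a term occurring tf times in a doc of doc_len words."""
    if tf == 0:
        return 0  # assumption: a term that does not occur gets weight 0
    # Base-10 logs are an assumption that matches the slide's worked example.
    return round(scale * (1 + math.log10(tf)) / (1 + math.log10(doc_len)))

profit_weight = term_weight(6, 89)  # "profit" occurs 6 times in an 89-word document
cts_weight = term_weight(3, 89)     # "cts" occurs 3 times in the same document
print(profit_weight, cts_weight)
```

Doubling the raw frequency (3 to 6) moves the weight by only one point, which is the dampening effect the slide describes.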
Form of a decision tree
• Example: the probability of belonging to category “earnings” given that s(cts) < 2 is 0.116 (node 2)

node 1: 7681 items, p(c|n1) = 0.3, split: cts, value: 2
  cts < 2 → node 2: 5977 items, p(c|n2) = 0.116, split: net, value: 1
    net < 1 → node 3
    net ≥ 1 → node 4
  cts ≥ 2 → node 5: 1704 items, p(c|n5) = 0.9, split: vs, value: 2
    vs < 2 → node 6
    vs ≥ 2 → node 7
Form of a decision tree
• Equivalent to a formula in disjunctive normal form:
  • (cts < 2 ∧ net < 1 ∧ …) ∨ (cts ≥ 2 ∧ vs ≥ 2 ∧ …)
• a complete path from the root to a leaf is a conjunction
• (same tree as on the previous slide)
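To make the "path = conjunction" reading concrete, here is a small sketch that encodes the example tree as nested tuples, follows one path for a document, and lists every root-to-leaf path as a conjunction. The tuple encoding, the function names and the sample document are illustrative assumptions; only the split features, thresholds and node names come from the slide.

```python
# Internal node: (feature, threshold, left_subtree, right_subtree); leaf: node name.
TREE = ("cts", 2,
        ("net", 1, "node 3", "node 4"),
        ("vs", 2, "node 6", "node 7"))

def find_leaf(tree, doc):
    """Follow one path: go left when the term weight is below the threshold."""
    while not isinstance(tree, str):
        feature, threshold, left, right = tree
        tree = left if doc.get(feature, 0) < threshold else right
    return tree

def paths(tree, conditions=()):
    """Yield (conjunction, leaf) pairs; their disjunction is the DNF formula."""
    if isinstance(tree, str):
        yield " & ".join(conditions), tree
        return
    feature, threshold, left, right = tree
    yield from paths(left, conditions + (f"{feature} < {threshold}",))
    yield from paths(right, conditions + (f"{feature} >= {threshold}",))

doc = {"cts": 3, "net": 0, "vs": 1}  # hypothetical term weights for one document
leaf = find_leaf(TREE, doc)
all_paths = list(paths(TREE))
for conjunction, node in all_paths:
    print(f"({conjunction}) -> {node}")
```

Each printed line is one disjunct; a document satisfies exactly one of them, which is the leaf it ends up in.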
How to grow a decision tree
• Typical procedure:
  • grow a very large tree
  • prune it
• Pruning avoids overfitting the training data.
  • e.g. a tree can contain several branches which are based on accidental properties of the training set
  • e.g. only 1 document in category earnings contains both “dlrs” and “pct”
Growing the tree
• Splitting criterion:
  • identifies the feature a, and the value of a, on which a node is split
• Stopping criterion:
  • determines when to stop splitting
  • e.g. stop splitting when all elements at a node have an identical representation (equal vectors for all keywords)
Growing the tree: Splitting criterion
• Information gain:
  • do we reduce uncertainty if we split node n into two when attribute a has value y?
  • let t be the distribution of n
  • this is equivalent to comparing:
    • entropy of t vs entropy of t given a
    • i.e. entropy of t vs entropy of its child nodes if we split

$$ G(a, y) = H(t) - H(t \mid a) = H(t) - \bigl(p_L H(t_L) + p_R H(t_R)\bigr) $$

  • the subtracted term is the sum of the entropies of the child nodes, weighted by the proportions p_L and p_R of items from n in each child (left and right)
Information gain example
• at node 1:
  • P(c|n1) = 0.3
  • H = 0.611
• at node 2:
  • P(c|n2) = 0.116
  • H = 0.359
• at node 5:
  • P(c|n5) = 0.9
  • H = 0.22
• weighted sum of the entropies of nodes 2 and 5 (5977 and 1704 items respectively) = 0.328
• gain = 0.611 - 0.328 = 0.283
• (entropies computed with natural logarithms; same tree as on the “Form of a decision tree” slide)
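The gain calculation can be checked with a short sketch. The function names are assumptions, entropies are in nats (natural log), and the child weights are taken from the child item counts. One further assumption is labelled in the code: node 5 is given p = 0.943, because that value reproduces the slide's H = 0.22, whereas the rounded 0.9 shown on the slide would give H ≈ 0.325.

```python
import math

def entropy(p: float) -> float:
    """Binary entropy (in nats) of a node where a fraction p of items is in the category."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def information_gain(p_parent, left, right):
    """left and right are (number_of_items, p(c|child)) pairs for the two children."""
    n_left, p_left = left
    n_right, p_right = right
    total = n_left + n_right
    # Child entropies weighted by the proportion of the parent's items in each child.
    children = (n_left / total) * entropy(p_left) + (n_right / total) * entropy(p_right)
    return entropy(p_parent) - children

# Node 5 probability 0.943 is an assumption chosen to match the slide's H = 0.22.
gain = information_gain(0.3, (5977, 0.116), (1704, 0.943))
print(round(entropy(0.3), 3), round(gain, 3))
```

In a full tree-growing procedure this function would be evaluated for every candidate (feature, value) pair at a node, and the split with the largest gain would be kept.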
Leaf nodes
• Suppose n3 has:
  • 1500 “earnings” docs
  • other docs in other categories
• How do we classify a new doc d that ends up at this leaf?
  • e.g. estimate P(c|n3) using MLE with add-one smoothing

node 1: 7681 items, p(c|n1) = 0.3, split: cts, value: 2
  cts < 2 → node 2: 5977 items, p(c|n2) = 0.116, split: net, value: 1
    net < 1 → node 3: 5436 items, p(c|n3) = 0.050
    net ≥ 1 → node 4: 541 items, p(c|n4) = 0.649
  cts ≥ 2 → node 5: 1704 items, p(c|n5) = 0.9, split: vs, value: 2
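A minimal sketch of the add-one estimate at a leaf, assuming a two-way decision (“earnings” vs. not). The function name is hypothetical, and the count of 351 “earnings” documents at node 4 is an assumption derived from the tree's figures (541 items, p ≈ 0.649) rather than a number given on the slide.

```python
def smoothed_leaf_probability(n_in_category: int, n_total: int, n_classes: int = 2) -> float:
    """Add-one (Laplace) estimate of P(c | leaf) from the training counts at that leaf."""
    return (n_in_category + 1) / (n_total + n_classes)

# Assumed counts for node 4: about 351 of its 541 training documents are "earnings".
p_earnings = smoothed_leaf_probability(351, 541)
label = "earnings" if p_earnings > 0.5 else "not earnings"
print(round(p_earnings, 3), label)
```

The smoothing matters most for small leaves: a leaf with no training documents at all gets 0.5 instead of an undefined estimate.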
Pruning the tree
• Pruning proceeds by removing leaf nodes one by one, until the tree is empty.
• At each step, remove the leaf node expected to be least helpful.
• Needs a pruning criterion.
  • i.e. a measure of “confidence” indicating what evidence we have that the node is useful.
• Each pruning step gives us a new tree (old tree minus one node), for a total of n trees if the original tree had n nodes.
• Which of these trees do we select as our final classifier?
Pruning the tree: held-out data
• To select the best tree, we can use held-out data.
• At each pruning step, test the resulting tree against the held-out data and record its success rate.
• Since holding out data reduces the training set, it is better to perform cross-validation.
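The selection step can be sketched as follows, assuming each pruned tree is available as a function from a document to a True/False “earnings” decision. The three toy classifiers and the four held-out documents are invented purely for illustration.

```python
def accuracy(classifier, held_out):
    """Fraction of held-out (doc, label) pairs the classifier gets right."""
    correct = sum(1 for doc, label in held_out if classifier(doc) == label)
    return correct / len(held_out)

def select_tree(candidates, held_out):
    """Pick the candidate tree with the highest held-out accuracy."""
    return max(candidates, key=lambda tree: accuracy(tree, held_out))

# Hypothetical sequence of trees produced by successive pruning steps.
def full_tree(doc):
    return doc["cts"] >= 2 and doc["net"] >= 1

def pruned_tree(doc):
    return doc["cts"] >= 2

def empty_tree(doc):
    return False

# Hypothetical held-out documents: (term weights, is "earnings").
held_out = [({"cts": 3, "net": 0}, True), ({"cts": 0, "net": 2}, False),
            ({"cts": 2, "net": 1}, True), ({"cts": 1, "net": 0}, False)]

best = select_tree([full_tree, pruned_tree, empty_tree], held_out)
print(best.__name__)
```

With cross-validation, the same comparison would be repeated over several train/held-out splits and the tree size with the best average accuracy would be chosen.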
When are decision trees useful?
• Some disadvantages:
  • A decision tree is a complex classification device
    • many parameters
    • splits the training data into very small chunks
    • small sets will display regularities that don’t generalise (overfitting)
• Main advantage:
  • very easy to understand!
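A reminder from lecture 9
• MaxEnt distribution
  • a log-linear model:

$$ p(d, c) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(d, c)} $$

  • the probability of a category c and document d is computed as a weighted product of feature values (normalised by a constant Z)
  • each feature f_i imposes a constraint on the model: its expected value under the model p must equal its expected value in the training data (empirical distribution p̃):

$$ E_{p} f_i = E_{\tilde{p}} f_i $$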
A reminder from lecture 9
• The MaxEnt principle dictates that we find the simplest (maximum-entropy) model p* satisfying the constraints:

$$ p^* = \arg\max_{p \in P} H(p) $$

  • where P is the set of possible distributions with

$$ P = \{\, p \mid E_{p} f_i = E_{\tilde{p}} f_i,\ 1 \le i \le K \,\} $$

• p* is unique and has the log-linear form given earlier
• Weights for features can be found using Generalised Iterative Scaling
Application to text categorisation
• Example:
  • we’ve identified 20 keywords which are diagnostic of the “earnings” category in Reuters
  • each keyword is a feature
“Earnings” features (from M&S '99)
• [Table of feature weights not reproduced. Annotations: very salient/diagnostic features have higher weights; less important features have lower weights.]
Classifying with the maxent model
• Recall that:

$$ p(d, c) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(d, c)} $$

• As a decision criterion we can use:
  • classify a new document d as “earnings” if P(“earnings”|d) > P(¬“earnings”|d)
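The decision rule can be sketched in code, assuming the log-linear form p(d, c) ∝ ∏ α_i^{f_i(d, c)}. Everything specific here is hypothetical: the three weights, the binary feature vectors, and the convention that keyword features fire only for the “earnings” class. Because the normalising constant Z is the same for both classes, it cancels when the two scores are compared.

```python
import math

def joint_score(weights, features):
    """Unnormalised p(d, c): product of alpha_i ** f_i(d, c), computed in log space."""
    return math.exp(sum(f * math.log(a) for a, f in zip(weights, features)))

def p_class_given_doc(weights, feats_pos, feats_neg):
    """P(c | d) for the positive class, normalising over the two classes only."""
    pos = joint_score(weights, feats_pos)
    neg = joint_score(weights, feats_neg)
    return pos / (pos + neg)

# Hypothetical weights alpha_i for three keyword features.
alphas = [2.0, 1.5, 0.5]
# Hypothetical feature values: keywords 1 and 2 occur in d; features fire only for "earnings".
f_earnings = [1, 1, 0]
f_not_earnings = [0, 0, 0]

p_earnings = p_class_given_doc(alphas, f_earnings, f_not_earnings)
decision = "earnings" if p_earnings > 1 - p_earnings else "not earnings"
print(round(p_earnings, 3), decision)
```

Weights above 1 push the document towards the category when the feature fires; weights below 1 push it away.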
Rationale
• Simple nearest neighbour (1NN):
  • Given: a new document d
  • Find: the document in the training set that is most similar to d
  • Classify d with the same category
• Generalisation (kNN):
  • compare d to its k nearest neighbours
• The crucial thing is the similarity measure.
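Example: 1NN + cosine similarity
• Given: document d
• Goal: categorise d based on training set T
• Define (with sim the cosine similarity between document vectors):

$$ \mathrm{sim}_{max}(d) = \max_{d' \in T} \mathrm{sim}(d, d') $$

• Find the subset T’ of T s.t.:

$$ T' = \{\, d' \in T \mid \mathrm{sim}(d, d') = \mathrm{sim}_{max}(d) \,\} $$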
Generalising to k>1 neighbours
• Choose the k nearest neighbours and weight them by similarity.
• Compute the similarity to d for each neighbour, as in the 1NN case.
• Decide on a classification based on the majority class for these neighbours.
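A minimal sketch of kNN classification with cosine similarity and a similarity-weighted vote. The function names, the three-dimensional term-weight vectors and their labels are invented for illustration; with k = 1 the procedure reduces to the 1NN case.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=3):
    """training is a list of (vector, label); each neighbour votes with its similarity."""
    neighbours = sorted(training, key=lambda item: cosine(doc, item[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vector, label in neighbours:
        votes[label] += cosine(doc, vector)
    return max(votes, key=votes.get)

# Hypothetical training vectors (weights for three keywords) and their categories.
training = [([5, 4, 0], "earnings"), ([6, 5, 1], "earnings"),
            ([0, 1, 6], "acq"), ([1, 0, 5], "acq")]
new_doc = [4, 5, 0]

predicted_3nn = knn_classify(new_doc, training, k=3)
predicted_1nn = knn_classify(new_doc, training, k=1)
print(predicted_3nn, predicted_1nn)
```

Weighting votes by similarity means a distant neighbour that only just makes it into the top k has little influence on the decision.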