EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP
Università di Venezia, 1 October 2003
The rise of empiricism
CL was, up until the 1980s, primarily a theoretical discipline
Experimental methodology now receives much more attention
Empirical methodology & evaluation
• Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP
  • DARPA Speech initiative
  • MUC
  • TREC
• GOOD: much easier for the community (& for researchers themselves) to understand which proposals are real improvements
• BAD:
  • too much focus on small improvements
  • researchers cannot afford to try an entirely new technique (it may not yield improvements for a couple of years!)
Training set and test set
Models are estimated / systems are developed using a TRAINING SET
The training set should be:
- representative of the task
- as large as possible
- well-known and understood
The test set
Estimated models are evaluated using a TEST SET
The test set should be:
- disjoint from the training set
- large enough for results to be reliable
- unseen
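A minimal sketch of how a disjoint, held-out test set might be carved off. The data format (a list of examples such as tagged sentences), the 90/10 ratio, and the fixed seed are illustrative assumptions, not something prescribed by the slides.

```python
import random

def train_test_split(items, test_fraction=0.1, seed=42):
    """Shuffle the data once and carve off a disjoint, unseen test set.

    `items` is assumed to be a list of examples (e.g. tagged sentences);
    the 90/10 split and the fixed seed are illustrative choices.
    """
    rng = random.Random(seed)
    shuffled = items[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]     # held out, never used for estimation
    training_set = shuffled[n_test:]
    return training_set, test_set
```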
Possible problems with the training set
Too small: performance drops
OVERFITTING: can be reduced using
- cross-validation (large variance may mean the training set is too small)
- large priors
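As a rough illustration of cross-validation, one might partition the data into k folds, train on k-1 of them and evaluate on the remaining one, and then look at the spread of the fold scores. Here `train` and `evaluate` are hypothetical callables standing in for whatever estimator and metric are being used.

```python
import statistics

def cross_validate(items, train, evaluate, k=10):
    """k-fold cross-validation: each fold serves once as the held-out set.

    `train(data) -> model` and `evaluate(model, data) -> score` are
    hypothetical callables supplied by the user.  A large variance across
    the fold scores suggests the training set may be too small.
    """
    folds = [items[i::k] for i in range(k)]    # k roughly equal partitions
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return statistics.mean(scores), statistics.stdev(scores)
```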
Possible problems with the test set
Are results obtained on the test set believable?
- results might be distorted if the test set is too easy / too hard
- training set and test set may be too different (language is non-stationary)
Evaluation
Two types:
- BLACK BOX (the system as a whole)
- WHITE BOX (components evaluated independently)
Typically QUANTITATIVE (but QUALITATIVE evaluation is needed as well)
Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
- e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank
ERROR: percentage wrong
- ERROR REDUCTION is the most typical metric in ASR
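A small sketch of these two metrics, assuming predictions and gold-standard tags are given as parallel lists; the toy tags at the bottom are made up for illustration.

```python
def accuracy(predicted, gold):
    """Proportion of predictions that match the gold standard."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

def error_rate(predicted, gold):
    """Proportion wrong; error reduction compares two systems' error rates."""
    return 1.0 - accuracy(predicted, gold)

# e.g. a tagger evaluated against gold tags (toy data, illustrative only):
print(accuracy(["DT", "NN", "VBZ"], ["DT", "NN", "VB"]))   # 0.666...
```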
A more general form of evaluation: precision & recall
[Figure: diagram of selected items vs. target items]
Positives and negatives
- TRUE POSITIVES (TP): target items that the system selected
- FALSE POSITIVES (FP): non-target items that the system selected
- FALSE NEGATIVES (FN): target items that the system missed
- TRUE NEGATIVES (TN): non-target items that the system correctly left out
Precision and recall
PRECISION: the proportion of SELECTED items that are correct = TP / (TP + FP)
RECALL: the proportion of correct (target) items that are selected = TP / (TP + FN)
The tradeoff between precision and recall
Easy to get high precision: select almost nothing, only the items you are most certain about
Easy to get high recall: return everything
Really need to report BOTH, or the F-measure (a weighted harmonic mean of the two); see the sketch below
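A minimal sketch of precision, recall, and the F-measure computed from true/false positive and negative counts; the counts in the example at the bottom are invented purely for illustration.

```python
def precision(tp, fp):
    """Proportion of selected items that are actually correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Proportion of correct (target) items that were selected."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Toy counts (illustrative only): 70 TP, 10 FP, 30 FN
p, r = precision(70, 10), recall(70, 30)
print(p, r, f_measure(p, r))    # 0.875  0.7  ~0.778
```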
Simple vs. multiple runs
A single run may be lucky:
- do multiple runs
- report averaged results
- report the degree of variation
- do SIGNIFICANCE TESTING (cf. t-test, etc.); see the sketch below
A lot of people are lazy and just report single runs.
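A sketch of reporting over multiple runs plus a paired t-test between two systems. The per-run accuracies are invented for illustration, and SciPy is assumed to be available; any other significance test could be substituted.

```python
import statistics
from scipy import stats

# Hypothetical accuracies of two systems over the same 5 runs / test splits.
system_a = [0.912, 0.905, 0.918, 0.909, 0.915]
system_b = [0.921, 0.917, 0.925, 0.916, 0.923]

# Report averaged results and the degree of variation, not a single run.
print(f"A: mean={statistics.mean(system_a):.3f}  sd={statistics.stdev(system_a):.3f}")
print(f"B: mean={statistics.mean(system_b):.3f}  sd={statistics.stdev(system_b):.3f}")

# Paired t-test: is B's improvement over A significant?
result = stats.ttest_rel(system_b, system_a)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```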
Interpreting results
A 97% accuracy may look impressive … but not so much if 98% of the items have the same tag: you need a BASELINE
An F-measure of .7 may not look very high, unless you are told that humans only achieve .71 on this task: you need an UPPER BOUND
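A rough sketch of the kind of majority-class baseline a reported accuracy should be compared against; the input is assumed to be the list of gold-standard tags.

```python
from collections import Counter

def majority_baseline(gold_tags):
    """Accuracy of always predicting the most frequent tag in the gold data."""
    most_common_tag, count = Counter(gold_tags).most_common(1)[0]
    return most_common_tag, count / len(gold_tags)

# If 98% of tokens carry the same tag, this baseline already scores 0.98,
# so a system at 0.97 is actually below the trivial baseline.
```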
Confusion matrices
Once you have evaluated your model, you may want to do some ERROR ANALYSIS
This is usually done with a CONFUSION MATRIX; a small sketch follows
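A minimal sketch of building a confusion matrix for error analysis, assuming parallel lists of gold and predicted tags; the toy tag lists are made up for illustration.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each gold tag was labelled as each predicted tag."""
    return Counter(zip(gold, predicted))

# Toy example: which tags does the tagger confuse?
gold      = ["NN", "NN", "VB", "VB", "JJ"]
predicted = ["NN", "JJ", "VB", "NN", "JJ"]
for (g, p), n in sorted(confusion_matrix(gold, predicted).items()):
    print(f"gold={g:3} predicted={p:3} count={n}")
```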
Readings • Manning and Schütze, chapter 8.1