I256: Applied Natural Language Processing Marti Hearst Sept 27, 2006
Evaluation Measures
• Precision: the proportion of items you labeled X that the gold standard says really are X
  • #correctly labeled by the algorithm / all labels assigned by the algorithm
  • #True Positive / (#True Positive + #False Positive)
• Recall: the proportion of items labeled X in the gold standard that you actually labeled X
  • #correctly labeled by the algorithm / all possible correct labels
  • #True Positive / (#True Positive + #False Negative)
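The slides give only the formulas; as a minimal sketch (not from the original lecture), here they are as plain Python functions, assuming the true-positive, false-positive, and false-negative counts are already available:

```python
def precision(tp, fp):
    # Of everything the algorithm labeled X, what fraction really is X?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of everything that really is X, what fraction did the algorithm label X?
    return tp / (tp + fn) if (tp + fn) else 0.0
```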
F-measure
• You can "cheat" on precision by labeling (almost) nothing with X.
• You can "cheat" on recall by labeling everything with X.
• Improving precision tends to hurt recall, and vice versa.
• The F-measure is a balance between the two:
  • 2 * precision * recall / (recall + precision)
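Continuing the sketch above, the balanced F-measure is the harmonic mean of the two scores:

```python
def f_measure(p, r):
    # Harmonic mean of precision and recall (balanced F, i.e. F1).
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example usage with the functions sketched earlier:
#   f_measure(precision(tp, fp), recall(tp, fn))
```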
Evaluation Measures
• Accuracy: the proportion you got right
  • (#True Positive + #True Negative) / N, where N = TP + TN + FP + FN
• Error:
  • (#False Positive + #False Negative) / N
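For completeness, the same counts give accuracy and error in the sketch's style:

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all decisions that were correct.
    n = tp + tn + fp + fn
    return (tp + tn) / n if n else 0.0

def error(tp, tn, fp, fn):
    # Fraction of all decisions that were wrong (1 - accuracy).
    n = tp + tn + fp + fn
    return (fp + fn) / n if n else 0.0
```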
Prec/Recall vs. Accuracy/Error
• When to use Precision/Recall?
  • Useful when there are only a few positives and many, many negatives
  • Also good for ranked orderings, e.g., search results ranking
• When to use Accuracy/Error?
  • When every item has to be judged, and it's important that every item be correct
  • Error is better when the differences between algorithms are very small; it lets you focus on small improvements
  • Example: speech recognition
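To see why accuracy can mislead when positives are rare, here is a tiny illustration with made-up counts (invented for illustration, not data from the course), reusing the functions sketched above:

```python
# Hypothetical counts: 10 positives, 990 negatives, and a classifier
# that simply labels everything negative.
tp, fp, tn, fn = 0, 0, 990, 10

print(accuracy(tp, tn, fp, fn))   # 0.99 -- looks excellent
print(recall(tp, fn))             # 0.0  -- it never finds a single positive
```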
Evaluating Partial Parsing • How do we evaluate it?
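The slide does not reproduce the evaluation code; one plausible sketch, assuming NLTK and its CoNLL-2000 chunking corpus are available (after `nltk.download('conll2000')`), scores a simple NP-chunking rule against the gold-standard chunks:

```python
import nltk
from nltk.corpus import conll2000

# A simple NP rule: an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

# Score the chunker against the gold-standard NP chunks.
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))   # reports precision, recall, and F-measure
```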
Testing our Simple Rule • Let's see where we missed:
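The slide's own code is not preserved here; as a hypothetical stand-in that builds on the previous sketch (reusing `cp` and `test_sents`), this prints the gold NP chunks the rule fails to find:

```python
def np_spans(chunked_sent):
    # Collect the NP chunks of a chunked sentence as tuples of words.
    return {tuple(w for w, t in subtree.leaves())
            for subtree in chunked_sent.subtrees()
            if subtree.label() == 'NP'}

for gold in test_sents[:20]:
    guess = cp.parse(gold.leaves())          # re-chunk the tagged words
    missed = np_spans(gold) - np_spans(guess)
    if missed:
        print("Missed:", missed)
```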
Incorrect vs. Missed • Add code to print out which chunks were incorrect and which were missed
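Extending the same hypothetical sketch, two set differences separate the chunks the rule proposed wrongly from the gold chunks it missed:

```python
for gold in test_sents[:20]:
    guess = cp.parse(gold.leaves())
    gold_nps, guess_nps = np_spans(gold), np_spans(guess)
    incorrect = guess_nps - gold_nps   # proposed, but not in the gold standard
    missed = gold_nps - guess_nps      # in the gold standard, but not proposed
    if incorrect or missed:
        print("Incorrect:", incorrect, "| Missed:", missed)
```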
Next Time • Summarization