I256: Applied Natural Language Processing Marti Hearst Sept 27, 2006
Evaluation Measures
• Precision: the proportion of items you labeled X that the gold standard says really are X
  • #correctly labeled by the algorithm / all labels assigned by the algorithm
  • #True Positive / (#True Positive + #False Positive)
• Recall: the proportion of items labeled X in the gold standard that you actually labeled X
  • #correctly labeled by the algorithm / all possible correct labels
  • #True Positive / (#True Positive + #False Negative)
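The slides give only the formulas; as a minimal sketch (not from the original lecture), here they are as plain Python functions, assuming the true-positive, false-positive, and false-negative counts are already available:

```python
def precision(tp, fp):
    # Of everything the algorithm labeled X, what fraction really is X?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of everything that really is X, what fraction did the algorithm label X?
    return tp / (tp + fn) if (tp + fn) else 0.0
```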
F-measure
• You can "cheat" on precision by labeling (almost) nothing with X.
• You can "cheat" on recall by labeling everything with X.
• Improving precision tends to hurt recall, and vice versa.
• The F-measure is a balance between the two:
  • 2 * precision * recall / (recall + precision)
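Continuing the sketch above, the balanced F-measure is the harmonic mean of the two scores:

```python
def f_measure(p, r):
    # Harmonic mean of precision and recall (balanced F, i.e. F1).
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example usage with the functions sketched earlier:
#   f_measure(precision(tp, fp), recall(tp, fn))
```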
Evaluation Measures
• Accuracy: the proportion you got right
  • (#True Positive + #True Negative) / N, where N = TP + TN + FP + FN
• Error:
  • (#False Positive + #False Negative) / N
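For completeness, the same counts give accuracy and error in the sketch's style:

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all decisions that were correct.
    n = tp + tn + fp + fn
    return (tp + tn) / n if n else 0.0

def error(tp, tn, fp, fn):
    # Fraction of all decisions that were wrong (1 - accuracy).
    n = tp + tn + fp + fn
    return (fp + fn) / n if n else 0.0
```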
Prec/Recall vs. Accuracy/Error
• When to use Precision/Recall?
  • Useful when there are only a few positives and many, many negatives
  • Also good for ranked orderings, e.g., search results ranking
• When to use Accuracy/Error?
  • When every item has to be judged, and it's important that every item be correct
  • Error is better when the differences between algorithms are very small; it lets you focus on small improvements
  • Example: speech recognition
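To see why accuracy can mislead when positives are rare, here is a tiny illustration with made-up counts (invented for illustration, not data from the course), reusing the functions sketched above:

```python
# Hypothetical counts: 10 positives, 990 negatives, and a classifier
# that simply labels everything negative.
tp, fp, tn, fn = 0, 0, 990, 10

print(accuracy(tp, tn, fp, fn))   # 0.99 -- looks excellent
print(recall(tp, fn))             # 0.0  -- it never finds a single positive
```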
Evaluating Partial Parsing • How do we evaluate it?
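The slide does not reproduce the evaluation code; one plausible sketch, assuming NLTK and its CoNLL-2000 chunking corpus are available (after `nltk.download('conll2000')`), scores a simple NP-chunking rule against the gold-standard chunks:

```python
import nltk
from nltk.corpus import conll2000

# A simple NP rule: an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

# Score the chunker against the gold-standard NP chunks.
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))   # reports precision, recall, and F-measure
```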
Testing our Simple Rule • Let's see where we missed:
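The slide's own code is not preserved here; as a hypothetical stand-in that builds on the previous sketch (reusing `cp` and `test_sents`), this prints the gold NP chunks the rule fails to find:

```python
def np_spans(chunked_sent):
    # Collect the NP chunks of a chunked sentence as tuples of words.
    return {tuple(w for w, t in subtree.leaves())
            for subtree in chunked_sent.subtrees()
            if subtree.label() == 'NP'}

for gold in test_sents[:20]:
    guess = cp.parse(gold.leaves())          # re-chunk the tagged words
    missed = np_spans(gold) - np_spans(guess)
    if missed:
        print("Missed:", missed)
```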
Incorrect vs. Missed • Add code to print out which chunks were incorrect and which were missed
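Extending the same hypothetical sketch, two set differences separate the chunks the rule proposed wrongly from the gold chunks it missed:

```python
for gold in test_sents[:20]:
    guess = cp.parse(gold.leaves())
    gold_nps, guess_nps = np_spans(gold), np_spans(guess)
    incorrect = guess_nps - gold_nps   # proposed, but not in the gold standard
    missed = gold_nps - guess_nps      # in the gold standard, but not proposed
    if incorrect or missed:
        print("Incorrect:", incorrect, "| Missed:", missed)
```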
Next Time • Summarization