You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) – A Qualitative Evaluation of Association Measures for Collocation and Term Extraction. Joachim Wermter and Udo Hahn, Jena University. ACL 2006 Regular Conference Paper.
Objective • Compare the performance of four ranking methods, frequency, t-test, LSM (limited syntagmatic modifiability) and LPM (limited paradigmatic modifiability), on collocation extraction (CE) and domain-specific automatic term recognition (ATR)
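The two purely statistical baselines can be illustrated with a minimal sketch. It uses the textbook form of the t-test association score for a word pair, which may differ in detail from the variant evaluated in the paper; the function names and the toy counts are hypothetical.

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """Textbook t-test association score for a word pair (x, y).

    f_xy: co-occurrence frequency, f_x / f_y: word frequencies,
    n: total number of pairs in the corpus. The variance is
    approximated by the observed mean (assumption: rare events).
    """
    if f_xy == 0:
        return 0.0
    observed = f_xy / n                # observed pair probability
    expected = (f_x / n) * (f_y / n)   # probability under independence
    return (observed - expected) / math.sqrt(observed / n)

def rank(candidates, key):
    """Return candidate names sorted by descending score."""
    return [name for name, _ in sorted(candidates.items(),
                                       key=lambda kv: key(kv[1]),
                                       reverse=True)]

# Hypothetical counts: (f_xy, f_x, f_y); corpus size N is also made up.
N = 1_000_000
candidates = {
    "kick bucket": (20, 200, 100),      # rare but strongly associated
    "of house":    (50, 50_000, 1_000), # frequent, close to independence
}

by_frequency = rank(candidates, key=lambda c: c[0])
by_t_test = rank(candidates, key=lambda c: t_score(*c, N))
```

In this toy setting the frequency ranking puts the frequent pair first, while the t-test ranking prefers the strongly associated pair.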
Collocation Extraction • Extract idioms • “kick the bucket”
Domain-Specific Term Extraction • Extract domain-specific phrases • “mitochondrial inheritance”
LSM • A “linguistic knowledge-based” method for collocation extraction proposed by the same authors in another paper • Assumes that idioms are less modifiable by supplements • e.g. “kick the beautiful bucket” is an unlikely variant of “kick the bucket” • Probability of a PNV (preposition-noun-verb) triple having supplement Supp_k: • f(x): frequency of x
LSM • Modifiability of a PNV triple • Probability of a PNV triple • Collocation Score
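The intuition behind LSM can be sketched as follows. The exact formulas are not preserved in these slides, so the code below is only an illustration under a stated assumption: the probability of a supplement is taken to be its relative frequency among all occurrences of the triple. The function names, the use of the empty string for "no supplement", and all counts are hypothetical.

```python
def supplement_probabilities(supplement_counts):
    """Relative frequency of each supplement observed with one triple.

    supplement_counts maps a supplement string to its frequency;
    the empty string stands for "no supplement" (an assumption).
    """
    total = sum(supplement_counts.values())
    return {supp: count / total for supp, count in supplement_counts.items()}

def unmodified_share(supplement_counts):
    """Share of occurrences in which the triple appears without a supplement."""
    return supplement_probabilities(supplement_counts).get("", 0.0)

# Hypothetical counts for an idiom and for a free word combination.
idiom = {"": 18, "old": 2}
free_combination = {"": 5, "red": 5, "big": 5, "heavy": 5}

idiom_share = unmodified_share(idiom)
free_share = unmodified_share(free_combination)
```

Under this assumption an idiom concentrates its probability mass on very few supplements, while a free combination spreads it over many. How the authors combine these probabilities into the final collocation score is not shown here.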
LPM • A “linguistic knowledge-based” method for automatic term recognition proposed by the same authors in another paper • Assumes that words in a phrase are less interchangeable • e.g. variants of “mitochondrial inheritance” such as “mitochondrion inheritance” or “money inheritance” • Modifiability of a phrase: • mod_k(n-gram): replace k words • sel_i: a particular replacement
LPM • Phrase Score:
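The substitution idea behind LPM can be sketched as follows. The phrase score formula is not preserved in these slides, so this only illustrates one plausible reading: replace k words of an n-gram with a wildcard and compare the n-gram's frequency with the frequency of everything matching that pattern. All function names and counts are hypothetical.

```python
from itertools import combinations

def slot_patterns(ngram, k):
    """Yield every pattern obtained by replacing k words with a wildcard (None)."""
    for slots in combinations(range(len(ngram)), k):
        yield tuple(None if i in slots else word for i, word in enumerate(ngram))

def pattern_frequency(pattern, counts):
    """Total frequency of all n-grams in counts that match the pattern."""
    return sum(
        freq for gram, freq in counts.items()
        if len(gram) == len(pattern)
        and all(p is None or p == w for p, w in zip(pattern, gram))
    )

def substitution_ratios(ngram, counts, k=1):
    """For each pattern, the share of matching occurrences taken by the n-gram itself."""
    f = counts[ngram]
    return [f / pattern_frequency(p, counts) for p in slot_patterns(ngram, k)]

# Hypothetical bigram counts.
counts = {
    ("mitochondrial", "inheritance"): 8,
    ("maternal", "inheritance"): 2,
    ("mitochondrial", "dna"): 8,
    ("large", "house"): 2,
    ("large", "garden"): 6,
    ("small", "house"): 8,
}

term_ratios = substitution_ratios(("mitochondrial", "inheritance"), counts)
free_ratios = substitution_ratios(("large", "house"), counts)
```

A ratio close to 1 means the slot is rarely filled by anything else, which matches the assumption that words in a term are less interchangeable. How the ratios are aggregated over k and over selections into a single phrase score is left open here.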
Evaluation Criteria • Compared to the baseline frequency ranking, a good ranking function should have the following four characteristics: • (1) Keep the true positives (TPs) in the upper portion of the list • (2) Keep the true negatives (TNs) in the lower portion of the list • (3) Demote true negatives from the upper portion • (4) Promote true positives from the lower portion
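The four criteria above can be turned into simple counts, as in the sketch below. It assumes that the "upper portion" is a fixed fraction of the list (one half by default) and that a gold-standard set of true positives is available. The function name, the cut-off, and the toy data are hypothetical and not taken from the paper.

```python
def criterion_counts(baseline, candidate, positives, cut=0.5):
    """Count how a candidate ranking moves items relative to a baseline ranking.

    baseline, candidate: the same items, each list in ranked order.
    positives: set of items that are true collocations or terms.
    cut: fraction of the list treated as the upper portion (assumption).
    """
    k = int(len(baseline) * cut)
    items = set(baseline)
    positives = set(positives) & items
    negatives = items - positives
    base_upper, cand_upper = set(baseline[:k]), set(candidate[:k])
    base_lower, cand_lower = items - base_upper, items - cand_upper
    return {
        "tp_kept_upper": len(positives & base_upper & cand_upper),  # criterion 1
        "tn_kept_lower": len(negatives & base_lower & cand_lower),  # criterion 2
        "tn_demoted": len(negatives & base_upper & cand_lower),     # criterion 3
        "tp_promoted": len(positives & base_lower & cand_upper),    # criterion 4
    }

# Hypothetical rankings over six candidates; "a", "c", "e" are true positives.
frequency_ranking = ["a", "b", "c", "d", "e", "f"]
other_ranking = ["a", "c", "e", "b", "d", "f"]
result = criterion_counts(frequency_ranking, other_ranking, {"a", "c", "e"})
```

Larger counts on all four keys indicate a ranking that improves on the frequency baseline in the sense of the criteria.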
Observations • CE Criterion 1 • t-test and frequency methods have similar performance • LSM promotes some TPs to top 1/6 • ATR Criterion 1 • t-test and frequency methods have similar performance • LPM promotes a few TPs to top 1/6
Observations • CE Criterion 2 • LSM promotes many more TNs to the upper portion than the t-test does (an undesirable effect) • ATR Criterion 2 • Same pattern, with LPM in place of LSM
Observations • CE Criterion 3 • LSM demotes many more TNs to the lower portion than the t-test does • ATR Criterion 3 • Same pattern, with LPM in place of LSM
Observations • CE Criterion 4 • LSM promotes more TPs to the upper portion than the t-test does • ATR Criterion 4 • Same pattern, with LPM in place of LSM
Conclusion • LSM and LPM rank candidates better than the t-test and frequency methods • Purely statistical methods perform worse than linguistic knowledge-based methods