
Joachim Wermter and Udo Hahn Jena University ACL 2006 Regular Conference Paper

You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) – A Qualitative Evaluation of Association Measures for Collocation and Term Extraction


Presentation Transcript


  1. You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) – A Qualitative Evaluation of Association Measures for Collocation and Term Extraction • Joachim Wermter and Udo Hahn • Jena University • ACL 2006 Regular Conference Paper

  2. Objective • Compare the performance of four ranking methods (frequency, t-test, LSM, and LPM) on two tasks: collocation extraction and domain-specific automatic term recognition
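As a reference point for the two purely statistical methods named above, here is a minimal sketch of a frequency baseline and the textbook t-score for a two-word candidate. It assumes the standard formulation from the collocation literature (observed pair probability against the product of the word probabilities); the slides do not say how the authors adapted the test to triples or longer n-grams, and the function names are illustrative only.

```python
import math


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Textbook t-score for a word pair (x, y) in a corpus of n pairs."""
    observed = f_xy / n                    # relative frequency of the pair
    expected = (f_x / n) * (f_y / n)       # expectation if x and y were independent
    variance = observed * (1 - observed)   # Bernoulli variance of the pair indicator
    return (observed - expected) / math.sqrt(variance / n)


def rank_by_frequency(pair_counts: dict) -> list:
    """Frequency baseline: most frequent candidates first."""
    return sorted(pair_counts, key=pair_counts.get, reverse=True)
```

A pair that co-occurs about as often as chance predicts gets a score near zero, while a pair that co-occurs far more often than chance gets a clearly positive score.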

  3. Collocation Extraction • Extract idioms • “kick the bucket”

  4. Domain-Specific Term Extraction • Extract domain-specific phrases • “mitochondrial inheritance”

  5. Corpus

  6. LSM (limited syntagmatic modifiability) • A “linguistic knowledge-based” method for collocation extraction proposed by the same authors in another paper • Assumes that idioms are less modifiable by supplements • e.g., “kick the beautiful bucket” is an unlikely variant • Probability of a PNV triple having supplement Supp_k • f(x): frequency of x

  7. LSM • Modifiability of a PNV triple • Probability of a PNV triple • Collocation score
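The formulas for these three quantities did not survive in the transcript. The sketch below is one plausible reading, not the authors' definition. It assumes that (a) the probability of a supplement is its relative frequency among all supplements observed with the triple, (b) the modifiability value is the product of the k largest supplement probabilities, so it is high when the triple is rarely varied, and (c) the collocation score multiplies that value by the triple's relative frequency. All names and the toy counts are hypothetical.

```python
def supplement_probabilities(supp_counts: dict) -> dict:
    """Assumed: relative frequency of each supplement seen with one triple."""
    total = sum(supp_counts.values())
    return {supp: count / total for supp, count in supp_counts.items()}


def modifiability(supp_counts: dict, k: int = 1) -> float:
    """Assumed: product of the k largest supplement probabilities."""
    probs = sorted(supplement_probabilities(supp_counts).values(), reverse=True)
    value = 1.0
    for p in probs[:k]:
        value *= p
    return value


def lsm_score(triple, triple_counts: dict, supps_by_triple: dict, k: int = 1) -> float:
    """Assumed: relative frequency of the triple times its modifiability value."""
    p_triple = triple_counts[triple] / sum(triple_counts.values())
    return p_triple * modifiability(supps_by_triple[triple], k)


# Invented toy counts: the empty string stands for "no supplement".
rigid = {"": 9, "beautiful": 1}                      # almost never modified
flexible = {"": 2, "a": 2, "b": 2, "c": 2, "d": 2}   # modified in many ways
```

Under these assumptions the rarely modified triple gets the larger modifiability value, which matches the slide's premise that idioms resist supplements.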

  8. LPM (limited paradigmatic modifiability) • A “linguistic knowledge-based” method for automatic term recognition proposed by the same authors in another paper • Assumes that the words in a term are less interchangeable • e.g., “mitochondrial inheritance” → “money inheritance” • Modifiability of a phrase • mod_k(n-gram): replace k words • sel_i: a particular replacement

  9. LPM • Phrase score
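The phrase score formula is missing from the transcript. As a hedged sketch of how mod_k and sel_i could fit together, the code below assumes that each selection sel_i replaces k word slots with wildcards, that mod_k is the product over all such selections of f(n-gram) divided by the frequency of the wildcard pattern, and that the phrase score is the product of mod_k for k = 1..n. These are assumptions to check against the paper; the function names and counts are invented for illustration.

```python
from itertools import combinations


def pattern_frequency(ngram: tuple, slots: set, counts: dict) -> int:
    """Total frequency of same-length n-grams that match `ngram` outside `slots`."""
    return sum(
        count
        for other, count in counts.items()
        if len(other) == len(ngram)
        and all(o == w for i, (o, w) in enumerate(zip(other, ngram)) if i not in slots)
    )


def mod_k(ngram: tuple, k: int, counts: dict) -> float:
    """Assumed: product over all k-slot selections of f(ngram) / f(pattern)."""
    value = 1.0
    for slots in combinations(range(len(ngram)), k):
        value *= counts[ngram] / pattern_frequency(ngram, set(slots), counts)
    return value


def lpm_score(ngram: tuple, counts: dict) -> float:
    """Assumed phrase score: product of mod_k for k = 1..n."""
    score = 1.0
    for k in range(1, len(ngram) + 1):
        score *= mod_k(ngram, k, counts)
    return score


# Invented toy bigram counts.
toy_counts = {
    ("mitochondrial", "inheritance"): 4,
    ("money", "inheritance"): 1,
    ("mitochondrial", "dna"): 5,
}
```

With these toy counts, `lpm_score(("mitochondrial", "inheritance"), toy_counts)` comes out higher than the score for `("money", "inheritance")`, because the first bigram accounts for a larger share of every wildcard pattern it matches.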

  10. Evaluation Criteria • Compared to the baseline frequency ranking, a good ranking function should have four characteristics: • Criterion 1: keep the true positives (TPs) in the upper portion of the list • Criterion 2: keep the true negatives (TNs) in the lower portion of the list • Criterion 3: demote TNs from the upper portion • Criterion 4: promote TPs from the lower portion
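The four criteria can be made concrete by comparing where each candidate sits in the frequency baseline and in the ranking under test. The sketch below is a simplification that splits each list into an upper and a lower portion at a single cut point; the later slides mention a "top 1/6", so the original evaluation evidently used finer portions. The function name, the `cut` parameter, and the toy data are hypothetical.

```python
def portion_moves(baseline: list, ranked: list, is_tp: dict, cut: float = 0.5) -> dict:
    """Count candidates per criterion, relative to the baseline ranking."""
    n_upper = int(len(baseline) * cut)
    base_upper = set(baseline[:n_upper])
    new_upper = set(ranked[:n_upper])
    moves = {"tp_kept_upper": 0, "tn_kept_lower": 0, "tn_demoted": 0, "tp_promoted": 0}
    for cand in baseline:
        was_up, is_up = cand in base_upper, cand in new_upper
        if is_tp[cand] and was_up and is_up:
            moves["tp_kept_upper"] += 1       # criterion 1
        elif not is_tp[cand] and not was_up and not is_up:
            moves["tn_kept_lower"] += 1       # criterion 2
        elif not is_tp[cand] and was_up and not is_up:
            moves["tn_demoted"] += 1          # criterion 3
        elif is_tp[cand] and not was_up and is_up:
            moves["tp_promoted"] += 1         # criterion 4
    return moves


# Invented example: four candidates, two of them true positives.
gold = {"a": True, "b": False, "c": False, "d": True}
result = portion_moves(["a", "b", "c", "d"], ["a", "d", "b", "c"], gold)
```

Candidates that move in the undesirable direction (TPs demoted, TNs promoted) fall through all four branches here; counting them separately would be a natural extension.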

  11. Collocation Extraction Results

  12. Automatic Term Recognition Results

  13. Observations • CE (collocation extraction), Criterion 1 • The t-test and frequency methods perform similarly • LSM promotes some TPs into the top 1/6 • ATR (automatic term recognition), Criterion 1 • The t-test and frequency methods perform similarly • LPM promotes a few TPs into the top 1/6

  14. Observations • CE Criterion 2 • LSM promotes many more TNs into the upper portion than the t-test does (an undesirable effect) • ATR Criterion 2 • Same pattern as for CE

  15. Observations • CE Criterion 3 • LSM demotes many more TNs into the lower portion than the t-test does • ATR Criterion 3 • Same pattern as for CE

  16. Observations • CE Criterion 4 • LSM promotes more TPs into the upper portion than the t-test does • ATR Criterion 4 • Same pattern as for CE

  17. Conclusion • The LSM and LPM methods outperform the t-test and frequency methods • Purely statistical methods perform worse than linguistic knowledge-based methods
