
Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences


Presentation Transcript


  1. Comparative Analysis of Automatic Term and Collocation Extraction Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec Faculty of Humanities and Social Sciences, Department of Information Sciences Faculty of Electrical Engineering and Computing INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  2. Overview • Introduction • Reasons for extraction • Research • Resources & tools • Extracted lists • Evaluation • Precision, recall, F-measure • Conclusion INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  3. I. Introduction • Monolingual and multilingual resources • Helpful • Integrated • Require human intervention • EU pre-accession activities • Speed up + consistency • Used in further research and practice INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  4. List: • Terms (Member State, European Union) • Collocations (adopt a/the resolution, decided as follows) • Multi-word units (depend on, well-being) • Term extraction process: • Term extraction (term acquisition) - identification • Term recognition - verification INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  5. II. Research • Resources • 10 documents – legislation, Cro-Eng • Tools • TermeX tool (FER) – list A • SDL Multi Term Extract + NooJ (FF) – list B • Reference list • Evaluation – reference list INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  6. Reference list • 470 terms and collocations • Exclude unigrams • Balance between lexical coverage, adequacy, practicality • terms (NPs: 346/470) • collocations (VPs) INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  7. Reference list • Contains: • Terms (acquiring company, applicant country) • Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to) • Names and abbreviations (Economic and Monetary Union EMU, European Union EU) • Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures). INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  8. List B • Language-independent, statistically based SDL Multi Term Extract tool • Frequency threshold set to 4 • Filtered by the list of stop-words -> 369 candidates • Language-dependent NooJ tool • 36 local grammars -> 512 candidates INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
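A minimal sketch of the statistical filtering described for list B, assuming a toy stop-word set and a placeholder corpus file (neither is the actual SDL Multi Term Extract implementation or the study's resources): candidates are n-grams that occur at least 4 times and contain no stop-words.

from collections import Counter

def extract_candidates(tokens, stopwords, n=2, min_freq=4):
    # Collect n-gram candidates, drop those containing stop-words,
    # and keep only candidates at or above the frequency threshold.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ng for ng in ngrams
                     if not any(t.lower() in stopwords for t in ng))
    return {ng: c for ng, c in counts.items() if c >= min_freq}

# Illustrative stop-word list and hypothetical corpus file:
stopwords = {"the", "of", "and", "to", "a"}
tokens = open("corpus.txt", encoding="utf-8").read().split()
candidates = extract_candidates(tokens, stopwords, n=2, min_freq=4)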

  9. List A • TermeX • Lexical association measures (AMs) • 14 AMs (PMI, Dice, Chi-square, …) • Lemmatization • POS filtering • Frequency threshold set to? INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  10. List A • Extracted terms ranked by AM value • 1816 candidates • AMs used: • 2-grams – PMI • 3-grams, 4-grams – heuristic extensions • Noun phrases only INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
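As an illustration of the association-measure ranking used for list A, here is a minimal pointwise mutual information (PMI) scorer for bigrams, PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). This is a sketch in the spirit of TermeX, not its actual code, and it omits lemmatization, POS filtering, and the heuristic extensions to 3- and 4-grams.

import math
from collections import Counter

def pmi_ranked_bigrams(tokens):
    # Estimate unigram and bigram probabilities from raw counts,
    # score each bigram with PMI, and return candidates ranked by score.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), c_xy in bigrams.items():
        p_xy = c_xy / (n - 1)
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)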

  11. Results • Evaluation • F1-measure (precision, recall) • True positives calculated by taking into account inflection (suffix stripping) INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
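A sketch of this evaluation step, using a crude word-truncation rule as a stand-in for the suffix stripping mentioned above (the exact normalization applied in the study is not reproduced here):

def normalize(term, keep=5):
    # Crude inflection handling: truncate every word to its first `keep`
    # characters so that inflected variants of the same term match.
    return " ".join(w.lower()[:keep] for w in term.split())

def evaluate(extracted, reference):
    # Precision, recall and F1 of an extracted list against the reference list.
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in reference}
    tp = len(ext & ref)                      # true positives
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1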

  12. Results • List A unsatisfactory • Low recall – verb phrases and terms consisting of more than 4 words are missed • Low precision – ranked list; can be improved with a cut-off (true positives are ranked higher) • List B modest • Can be improved with lemmatization, upper/lower-case handling, and more detailed local grammars INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
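To make the cut-off remark concrete, a small illustrative helper (not from the paper) that measures precision over only the top-k entries of an AM-ranked candidate list; when true positives are ranked higher, precision at a suitable k exceeds precision over the full list, at the cost of recall.

def precision_at_k(ranked_candidates, reference, k):
    # Precision computed over the top-k entries of a ranked candidate list.
    top_k = ranked_candidates[:k]
    tp = sum(1 for cand in top_k if cand in reference)
    return tp / k if k else 0.0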

  13. Conclusion • Comparison of two hybrid approaches to term extraction • Human-created lists differ from extracted lists • Human knowledge, experience and intuition • Room for improvement – automatic extraction combined with human intervention INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  14. Thank you! INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

  15. INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
