Comparative Analysis of Automatic Term and Collocation Extraction • Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec • Faculty of Humanities and Social Sciences, Department of Information Sciences • Faculty of Electrical Engineering and Computing • INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
Overview • Introduction • Reasons for extraction • Research • Resources & tools • Extracted lists • Evaluation • Precision, recall, F-measure • Conclusion
I. Introduction • Monolingual and multilingual resources • Helpful • Integrated • Require human intervention • EU pre-accession activities • Speed up + consistency • Used in further research and practice
List: • Terms (Member State, European Union) • Collocations (adopt a/the resolution, decided as follows) • Multi-word units (depend on, well-being) • Term extraction process: • Term extraction (term acquisition) – identification • Term recognition – verification
II. Research • Resources • 10 documents – legislation, Croatian-English • Tools • TermeX tool (FER) – list A • SDL MultiTerm Extract + NooJ (FF) – list B • Reference list • Evaluation – reference list
Reference list • 470 terms and collocations • Exclude unigrams • Balance between lexical coverage, adequacy, practicality • terms (NPs: 346/470) • collocations (VPs)
Reference list • Contains: • Terms (acquiring company, applicant country) • Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to) • Names and abbreviations (Economic and Monetary Union EMU, European Union EU) • Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures).
List B • Language-independent, statistically based SDL MultiTerm Extract tool • Frequency threshold set to 4 • Filtered by the list of stop-words -> 369 candidates • Language-dependent NooJ tool • 36 local grammars -> 512 candidates
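The statistical step for list B (frequency threshold plus stop-word filtering) can be sketched as follows. This is a minimal illustration, not the SDL MultiTerm Extract implementation: the function name `extract_candidates`, the sample stop-word set, and the rule of rejecting n-grams that begin or end with a stop word are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "of", "a", "an", "to", "in", "and"}


def extract_candidates(tokens, n=2, min_freq=4, stop_words=STOP_WORDS):
    """Count n-grams and keep those that reach the frequency threshold
    and neither start nor end with a stop word (assumed filtering rule)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {
        " ".join(gram): freq
        for gram, freq in ngrams.items()
        if freq >= min_freq
        and gram[0] not in stop_words
        and gram[-1] not in stop_words
    }


# Toy corpus: the same phrase repeated four times.
tokens = ("the member state of the european union " * 4).split()
candidates = extract_candidates(tokens)
print(candidates)
```

With the threshold of 4 from the slide, only bigrams that occur at least four times and have content words at both ends survive in this toy corpus.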
List A • TermeX • Lexical association measures (AMs) • 14 AMs (PMI, Dice, Chi-square,…) • Lemmatization • POS filtering • Frequency threshold set to?
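For reference, the three association measures named on this slide have the following standard textbook definitions for a bigram $xy$. The exact variants implemented in TermeX are not stated in the slides, so the notation here ($f$ for corpus frequency, $N$ for the number of bigrams, $O_{ij}$ and $E_{ij}$ for observed and expected counts in the 2×2 contingency table) is an assumption.

```latex
\begin{align*}
\mathrm{PMI}(x, y) &= \log_2 \frac{P(xy)}{P(x)\,P(y)} \\[4pt]
\mathrm{Dice}(x, y) &= \frac{2\, f(xy)}{f(x) + f(y)} \\[4pt]
\chi^2(x, y) &= \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{\left(\sum_{j'} O_{ij'}\right)\left(\sum_{i'} O_{i'j}\right)}{N}
\end{align*}
```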
List A • Extracted terms ranked by AM value • 1816 candidates • AMs used: • 2-grams – PMI • 3-grams, 4-grams – heuristic extensions • Noun phrases only
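Ranking 2-gram candidates by PMI can be sketched as below. This is an illustrative implementation of plain pointwise mutual information with maximum-likelihood probabilities, not TermeX code; the function name `rank_bigrams_by_pmi` and the `min_freq` parameter are hypothetical, and the lemmatization and POS filtering mentioned for list A are omitted.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_freq=1):
    """Return (bigram, PMI) pairs sorted from highest to lowest PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), f12 in bigrams.items():
        if f12 < min_freq:
            continue  # skip rare bigrams, whose PMI is unreliable
        p12 = f12 / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy token sequence in which "x y" co-occurs more often than chance.
ranked = rank_bigrams_by_pmi(["x", "y", "x", "y", "z", "z"])
for bigram, score in ranked:
    print(bigram, round(score, 3))
```

PMI tends to overrate rare pairs, which is one reason a frequency threshold is usually applied before ranking.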
Results • Evaluation • F1-measure (precision, recall) • True positives calculated by taking into account inflection (suffix stripping)
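The evaluation can be sketched as follows: both lists are normalized by a crude suffix stripper so that inflected forms match, and precision, recall and F1 are computed over the normalized sets. The stripping rules and the names `crude_stem`, `normalize` and `evaluate` are assumptions made for illustration; the slides do not say which suffixes were stripped.

```python
# Hypothetical English suffix rules: (suffix, replacement), tried in order.
SUFFIX_RULES = [("ies", "y"), ("s", "")]


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 2 characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word


def normalize(term):
    """Lower-case a multi-word term and stem each of its words."""
    return " ".join(crude_stem(w) for w in term.lower().split())


def evaluate(extracted, reference):
    """Return (precision, recall, F1) of an extracted list against a reference list."""
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in reference}
    tp = len(ext & ref)  # true positives after suffix stripping
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


extracted = ["Member States", "crime prevention bodies", "the said"]
reference = ["member state", "crime prevention body",
             "applicant country", "entry into force"]
precision, recall, f1 = evaluate(extracted, reference)
print(precision, recall, f1)
```

Here "Member States" and "crime prevention bodies" count as true positives even though the reference list holds the singular forms.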
Results • List A unsatisfactory • Low recall – verb phrases and terms of more than 4 words are not covered • Low precision – ranked list, can be improved with a cut-off (true positives are ranked higher) • List B modest • can be improved with lemmatization, handling of upper/lower case, more detailed local grammars
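The effect of a cut-off on a ranked list can be illustrated with precision at rank k. This is a minimal sketch assuming exact string matching against the reference list; the function name `precision_at_k` and the toy data are hypothetical.

```python
def precision_at_k(ranked_candidates, reference, k):
    """Share of the top-k ranked candidates that appear in the reference list."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    ref = set(reference)
    hits = sum(1 for candidate in top if candidate in ref)
    return hits / len(top)


# Toy ranked list in which the true positives are ranked first.
ranked_list = ["european union", "member state", "of the", "in a"]
reference_terms = ["european union", "member state"]

for k in (2, 4):
    print(k, precision_at_k(ranked_list, reference_terms, k))
```

When true positives cluster near the top of the ranking, a smaller k gives higher precision at the cost of recall.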
Conclusion • Comparison of two hybrid approaches to term extraction • Human-created lists differ from extracted lists • human knowledge, experience and intuition • Space for improvement – automatic extraction combined with human intervention
Thank you!