Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing
INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

Overview
• Introduction
• Reasons for extraction
• Research
• Resources & tools
• Extracted lists
• Evaluation
• Precision, recall, F-measure
• Conclusion

I. Introduction
• Monolingual and multilingual resources
• Helpful
• Integrated
• Require human intervention
• EU pre-accession activities
• Speed up + consistency
• Used in further research and practice

List:
• Terms (Member State, European Union)
• Collocations (adopt a/the resolution, decided as follows)
• Multi-word units (depend on, well-being)
• Term extraction process:
• Term extraction (term acquisition) - identification
• Term recognition - verification

II. Research
• Resources
• 10 documents – legislation, Cro-Eng
• Tools
• TermeX tool (FER) – list A
• SDL Multi Term Extract + NooJ (FF) – list B
• Reference list
• Evaluation – reference list

Reference list
• 470 terms and collocations
• Unigrams excluded
• Balance between lexical coverage, adequacy, practicality
• Terms (NPs: 346/470)
• Collocations (VPs)

Reference list
• Contains:
• Terms (acquiring company, applicant country)
• Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
• Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
• Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

List B
• Language-independent, statistically based SDL Multi Term Extract tool
• Frequency threshold set to 4
• Filtered by the list of stop-words -> 369 candidates
• Language-dependent NooJ tool
• 36 local grammars -> 512 candidates
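The statistical step for list B (a frequency threshold followed by a stop-word filter) can be illustrated with a minimal sketch. The slides do not describe how SDL Multi Term Extract works internally, so everything here is an assumption: the function names, the tiny stop-word list, and the rule that a candidate is dropped when it starts or ends with a stop-word are all hypothetical. Only the threshold value of 4 comes from the slide.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "to", "in", "and"}


def candidate_ngrams(tokens, n_min=2, n_max=4):
    """Yield every n-gram (as a tuple) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def extract_candidates(tokens, min_freq=4, stop_words=STOP_WORDS):
    """Keep n-grams seen at least min_freq times that neither start
    nor end with a stop-word (an assumed filtering rule)."""
    counts = Counter(candidate_ngrams(tokens))
    return {
        " ".join(gram): freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in stop_words
        and gram[-1] not in stop_words
    }
```

Called on a tokenised, lower-cased text, `extract_candidates(tokens)` returns a dictionary from candidate string to frequency. Unigrams are skipped by default, in line with the reference list, which excludes them.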

List A
• TermeX
• Lexical association measures (AMs)
• 14 AMs (PMI, Dice, Chi-square, …)
• Lemmatization
• POS filtering
• Frequency threshold set to?

List A
• Extracted terms ranked by AM value
• 1816 candidates
• AMs used:
• 2-grams – PMI
• 3-grams, 4-grams – heuristic extensions
• Noun phrases only
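To make the 2-gram ranking concrete, here is a minimal sketch of pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw counts. It is not TermeX's implementation: it omits lemmatization, POS filtering and the heuristic extensions to 3- and 4-grams, and the function name and the `min_freq` parameter are hypothetical.

```python
import math
from collections import Counter


def pmi_ranked_bigrams(tokens, min_freq=1):
    """Return (bigram, PMI) pairs sorted by descending PMI.

    Probabilities are maximum-likelihood estimates from the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of bigram positions

    scored = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue
        p_xy = freq / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

PMI is known to favour rare pairs, which is one reason a frequency threshold is usually applied before ranking; `min_freq` plays that role in this sketch.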

Results
• Evaluation
• F1-measure (precision, recall)
• True positives calculated by taking into account inflection (suffix stripping)
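The evaluation can be sketched as follows: precision = TP / |extracted|, recall = TP / |reference|, and F1 = 2PR / (P + R), with both lists normalised by suffix stripping before matching so that inflected variants count as true positives. The suffix list below is a hypothetical, English-only stand-in; the slides do not say which suffixes were stripped, and all function names are assumptions.

```python
# Hypothetical suffix list, longest first; illustration only.
SUFFIXES = ("ies", "es", "s", "y", "e")
MIN_STEM = 3  # never strip a word down to fewer than 3 characters


def strip_suffix(word):
    """Remove the first matching suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word


def normalize(term):
    """Lower-case a multi-word term and strip the suffix of each word."""
    return " ".join(strip_suffix(w) for w in term.lower().split())


def evaluate(extracted, reference):
    """Return (precision, recall, F1) over suffix-stripped term sets."""
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in reference}
    tp = len(ext & ref)
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With this normalisation, "Member States" and "member state" both reduce to the same key, so the extracted plural form is counted as a hit against the singular reference entry.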

Results
• List A unsatisfactory
• Low recall – verb phrases, terms consisting of more than 4 words
• Low precision – ranked list, can be improved with cut-off (true positives are better ranked)
• List B modest
• Can be improved with lemmatization, definition of upper/lower cases, more detailed local grammar
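The effect of a cut-off on a ranked list can be illustrated with a short sketch: if true positives are concentrated near the top, keeping only the first k candidates raises precision at the cost of recall. The function name is hypothetical, matching is exact (no suffix stripping), and the ranked list is assumed to contain no duplicates.

```python
def precision_recall_at_k(ranked, reference, k):
    """Return (precision, recall) for the top-k entries of a ranked list."""
    top = ranked[:k]
    ref = set(reference)
    tp = sum(1 for term in top if term in ref)
    precision = tp / len(top) if top else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall
```

Sweeping k over the list and plotting the two values would show where a cut-off gives an acceptable trade-off for a ranked list such as list A.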

Conclusion
• Comparison of two hybrid approaches to term extraction
• Human-created lists differ from extracted lists
• Human knowledge, experience and intuition
• Space for improvement – automatic extraction combined with human intervention

Thank you!
