Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine LINA University of Nantes GSCM_LRIT University of Rabat

A multi-word term extraction program for Arabic language • LREC, 28-30 May 2008, Marrakech

Outline • Multi-word term • Motivation • Approach • Comparing statistical methods • Conclusion and future work

Terms • Refer to a defined concept ... (ISO 704). • Represent a limited number of part of speech: nouns, verbs, adjectives, and adverbs. • Given subject domain

Multi-word terms • تتكون أكاسيد النيتروجين كناتج لجميع عمليات الاحتراق التي تتم في درجات الحرارة العالية [Wikipedia] • « Nitrogen oxides are formed as a product of all combustion processes that take place at high temperatures » • MWTs extracted • أكاسيد النيتروجين « nitrogen oxides » • درجات الحرارة العالية « high temperatures » • عمليات الاحتراق « combustion processes »

Motivation • Frequent MWTs • Application • for building index from unstructured documents • for enhancing document retrieval system

MWT extraction system (concept extraction) • Corpus • Identification of term candidates: linguistic filtering (shallow parsing) • Filtering of term candidates: statistical significance (LLR, FLR, MI3, t-score) • Candidate list

MWT evaluation • Unithood: measures the strength of association between the constituents of a multi-word unit (MWU) • United Nations [environment domain]: unithood • Termhood: measures relatedness to existing domain-specific concepts • Soil degradation [environment domain]: termhood and unithood

MWT patterns

MWT variations • Multiple forms for the same concept • Variation types • Inflexional morphology • Number • N1 N2 / N1 N2 + suffix (ات, ون) • تلوث المحيط « ocean pollution » • تلوث المحيطات « oceans pollution » • Definite form • N Adj / prefix (ال) + N prefix (ال) + Adj • تلوث كيميائي « chemical pollution » • التلوث الكيميائي « the chemical pollution » • Derivational morphosyntactic phenomena • N1 ADJ / N1 PREP N2 • بئر نفطي => بئر من النفط « oil well » • Syntactic variation (postposed modification) • N1 N2 / N1 N2 ADJ • درجة الحرارة « degree of temperature » • درجة الحرارة العالية « high degree of temperature »

Comparing statistical filtering • Mutual Information (MI3) (Daille, 1994) as baseline • Log-likelihood (Dunning, 1993) • t-score (Church, 1991) • FLR (Nakagawa and Mori, 2003)
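To make the comparison concrete, here is a minimal sketch of three of the measures listed above, computed from the 2x2 contingency counts of a word pair. The function name and the count layout (a, b, c, d) are assumptions made for illustration; the MI3 line uses one common probability-based formulation of the cubic association ratio, which may differ from the variant used in the experiments by a constant that does not affect ranking. FLR is left out because it needs corpus-wide counts of left and right neighbours.

```python
import math

def association_scores(a, b, c, d):
    """Score a word pair (w1, w2) from its 2x2 contingency counts.

    a: w1 followed by w2          b: w1 followed by another word
    c: another word, then w2      d: pairs with neither w1 first nor w2 second
    Assumes a > 0 and non-empty row and column totals.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    # Expected counts under independence of w1 and w2.
    expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

    # Cubic mutual information: log2( P(w1,w2)^3 / (P(w1) P(w2)) ).
    mi3 = math.log2((a / n) ** 3 / ((rows[0] / n) * (cols[0] / n)))
    # t-score: (observed - expected) / sqrt(observed).
    t_score = (a - expected[0][0]) / math.sqrt(a)
    # Log-likelihood ratio: 2 * sum O * ln(O / E) over non-empty cells.
    llr = 2.0 * sum(
        observed[i][j] * math.log(observed[i][j] / expected[i][j])
        for i in range(2)
        for j in range(2)
        if observed[i][j] > 0
    )
    return {"mi3": mi3, "t_score": t_score, "llr": llr}
```

Candidates are then sorted by the chosen score, highest first, before evaluation.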

Experiment data • Arabic domain-specific corpus on the environment • Compiled from the web: “Al-Khat Alakhdar” and “Akhbar Albiae”, 2004-2006 • 475,148 words • Motivation • The lack of available Arabic domain-specific corpora

Gold standard • Reference list • Arabic environment terminology: AGROVOC • Total: 65,000 unique known terms (single-word terms and MWTs) • Dynamic search • Eurodicautom

Preprocessing • Removing diacritics • Buckwalter’s transliteration • Diab’s parsing (Diab, 2004) • Input • wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA' SHyHp Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw. • Output • w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN qbl/NN Al/DT ysAndrw/NNP ./PUNC
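As an illustration of how the tagged output can feed the linguistic filter, the sketch below reads the token/TAG format shown above and collects two-word candidates. The function names, the choice of only two patterns (N N and N ADJ), and the decision to skip the split-off determiner (Al/DT) are all simplifying assumptions, not the authors' implementation; plural and proper-noun tags are ignored.

```python
def parse_tagged(line):
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in line.split()]

# Assumed patterns for illustration only: N N and N ADJ.
PATTERNS = {("NN", "NN"), ("NN", "JJ")}

def candidate_bigrams(tagged):
    """Return adjacent token pairs whose tags match one of PATTERNS.

    Determiners are dropped first so that definite and indefinite
    forms of the same sequence yield the same candidate.
    """
    content = [(tok, tag) for tok, tag in tagged if tag != "DT"]
    return [
        (t1, t2)
        for (t1, g1), (t2, g2) in zip(content, content[1:])
        if (g1, g2) in PATTERNS
    ]

sample = (
    "w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ sAndwr/NNP bwl/NNP "
    "rklp/NN jzA'/NN SHyHp/JJ Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN "
    "mn/IN qbl/NN Al/DT ysAndrw/NNP ./PUNC"
)
print(candidate_bigrams(parse_tagged(sample)))
# → [('Hkm', 'mjry'), ('rklp', "jzA'"), ("jzA'", 'SHyHp')]
```

The sentence is general-domain, so none of these candidates would be expected to survive the statistical filter on an environment corpus.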

Evaluation and results • For each association score • Examine the top-ranked candidate terms • Compute precision (termhood) over the first 100 candidate terms • Precision (termhood) is the number of attested MWTs divided by the number of extracted sequences • Log-likelihood is the best measure
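The precision figure described above can be sketched as follows. The function name and the toy identifiers are hypothetical; in the experiment the reference set would be the gold-standard term list and the ranking would come from one association score.

```python
def precision_at_k(ranked_candidates, reference_terms, k=100):
    """Share of the top-k candidates that are attested in the reference set."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    attested = sum(1 for term in top if term in reference_terms)
    return attested / len(top)

# Toy data: placeholders, not results from the paper.
ranked = ["cand_1", "cand_2", "cand_3", "cand_4"]
gold = {"cand_1", "cand_3"}
print(precision_at_k(ranked, gold, k=4))  # → 0.5
```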

Summary & future work • Develop MWT extraction for Arabic • Define MWT patterns and variations • Obtain better results than for European languages • Improvement of the system • Add new variations • Improve lemmatisation

Introduction • MWT’s are sufficiently informative to help human readers get a feel of the essential topics • Use in many text related applications: • Text clustering • Document similarity • Document summarization

Related Work • Linguistic Approach • Based on linguistic pre-processing and annotations (result of taggers, shallow parsers) • Detect recurrent syntactic term formation patterns • Noun + Noun • (Adj | Noun) + Noun

Systems based on linguistics • Ananiadou, S. (1994) recognises single-word terms from the domain of immunology based on morphological analysis of term formation patterns (internal term make-up) • Justeson & Katz (1995, TERMS) extract complex terms based on two characteristics (which distinguish them from non-terms) • the syntactic patterns are restricted • terms appear with the same form throughout the text; omissions of modifiers are avoided

Systems based on linguistics • The text is tagged; a filter is applied to extract terms: ((A|N)+ | ((A|N)* (N P)?) (A|N)*) N • Matching sequences: AN / NN / AAN / ANN / NAN / NNN / NPN • Filtering based on simple POS patterns • A pattern must occur above a certain threshold to be considered a valid term pattern • Recall: 71% • Precision: 71% to 96% • LEXTER (Bourigault, 1994) • Extracts French compound terms based on surface syntactic analysis and text heuristics • Terms are identified according to certain syntactic patterns

Uses a boundary method to identify the extent of terms • categories or sequences of categories that are not found in term patterns form the boundaries e.g. verbs, any preposition (except de and à) followed by a determiner. Non productive sequences become boundaries. • Precision: 95% although tests have shown that lots of noise is generated
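The Justeson & Katz tag filter quoted earlier can be tried out directly as a regular expression over a string of one-letter tags. This is a sketch under stated assumptions: the mapping of tagger output to the letters A (adjective), N (noun) and P (preposition) is left to the caller, the function name is hypothetical, and the minimum length of two words is added explicitly because the expression as written would also accept a lone N.

```python
import re

# The filter as a regex over one-letter tags: A, N, P.
JK_FILTER = re.compile(r"((A|N)+|((A|N)*(NP)?)(A|N)*)N")

def is_term_pattern(tags):
    """Return True if the tag string is a multi-word term pattern.

    tags: e.g. "AN" for adjective + noun, "NPN" for noun + preposition + noun.
    """
    return len(tags) >= 2 and JK_FILTER.fullmatch(tags) is not None

for tags in ["AN", "NN", "NPN", "NA", "N"]:
    print(tags, is_term_pattern(tags))
```

Every accepted sequence ends in a noun, which suits English; a language with postposed adjectives, such as Arabic, would need a different pattern set.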

Approaches using statistical information • Main measures used: • Frequency of occurrence • Mutual Information • C/NC value • Experiments also with the log-likelihood coefficient [Dunning, 1993]

Frequency of occurrence • Simplest and most popular method • Domain independent; requires no external resources • Some filtering is applied in the form of syntactic patterns • Systems using frequency of occurrence • Dagan & Church (TERMIGHT, 1994) • Enguehard & Pantera (1994) • Lauriston (TERMINO, 1996)

Mutual Information • ‘The amount of information provided by the occurrence of the event represented by yi about the occurrence of the event represented by xk is defined as’ $I(x_k, y_i) = \log \frac{P(x_k, y_i)}{P(x_k)\,P(y_i)}$ Fano (1961:27-28) • This measure captures how much one word tells us about the other • Problems for MI come from data sparseness • Damerau (1993) and Daille (1994) used MI for the extraction of candidate terms (only for two-word candidate terms)
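A short worked example may help show why data sparseness is a problem for MI. The counts below are invented for illustration, the logarithm is assumed to be base 2, and probabilities are estimated as relative frequencies in a corpus of $N = 10^6$ word pairs.

```latex
% Rare pair: the pair and both of its words occur exactly once.
\[
I = \log_2 \frac{1/N}{(1/N)(1/N)} = \log_2 N \approx 19.9
\]
% Frequent pair: the pair occurs 100 times, each word 1000 times.
\[
I = \log_2 \frac{100/N}{(1000/N)^2} = \log_2 \frac{100\,N}{10^6} = \log_2 100 \approx 6.6
\]
```

The pair seen only once scores far higher than the pair seen 100 times, although the evidence for it is much weaker. Raising the joint probability to the third power, as in MI3, is one way to give more weight to frequent pairs.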

C/NC value (Frantzi & Ananiadou) • C-value takes into account: • total frequency of occurrence of the string in the corpus • frequency of the string as part of longer candidate terms • number of these longer candidate terms • length of the string (in number of words)
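The four ingredients listed above combine in the C-value formula as commonly stated by Frantzi & Ananiadou: the frequency of a candidate is discounted by the average frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of its length. The sketch below is a simplified illustration with hypothetical names; it takes the frequencies of the longer candidates as given and does not implement the full iterative procedure.

```python
import math

def c_value(length, freq, longer_freqs=()):
    """C-value of a candidate string.

    length:       number of words in the candidate (2 or more)
    freq:         total frequency of the candidate in the corpus
    longer_freqs: frequencies of the longer candidates that contain it
    """
    weight = math.log2(length)
    if not longer_freqs:
        # Not nested in any longer candidate: no discount.
        return weight * freq
    # Nested: subtract the mean frequency of the containing candidates.
    return weight * (freq - sum(longer_freqs) / len(longer_freqs))

print(c_value(2, 10))          # → 10.0
print(c_value(2, 10, [3, 5]))  # → 6.0
```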

NC value • NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a) • a is the candidate term • C-value(a) is the C-value of the candidate term a • CF(a) is the context factor of the candidate term a • CF(a) is obtained by summing the weights of the term context words of a, each multiplied by its frequency of appearing with a
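The NC-value combination above can be sketched in a few lines. The function names, the dictionary layout and the example weights are assumptions made for illustration; how the context-word weights are derived is outside the scope of this sketch.

```python
def context_factor(context_freqs, weights):
    """CF(a): sum over context words of weight * frequency with the candidate.

    context_freqs: {context word: times it appears with the candidate}
    weights:       {context word: weight as a term context word}
    """
    return sum(freq * weights.get(word, 0.0) for word, freq in context_freqs.items())

def nc_value(c_val, context_freqs, weights):
    """NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a)."""
    return 0.8 * c_val + 0.2 * context_factor(context_freqs, weights)

# Hypothetical context words and weights.
score = nc_value(10.0, {"w1": 2, "w2": 1}, {"w1": 0.5, "w2": 0.25})
print(score)
```

With these toy values CF(a) is 2 * 0.5 + 1 * 0.25 = 1.25, so the score is 0.8 * 10 + 0.2 * 1.25 = 8.25.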

Hybrid approaches • Combination of linguistic information (filters), shallow parsing results and statistical measures • Daille, B., Frantzi & Ananiadou

Thank You
