A multi-word term extraction program for Arabic language • LREC, 28-30 May 2008, Marrakech • Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine • LINA, University of Nantes; GSCM_LRIT, University of Rabat
Outline • Multi-word term • Motivation • Approach • Comparing statistical methods • Conclusion and future work
Terms • Refer to a defined concept ... (ISO 704) • Belong to a limited number of parts of speech: nouns, verbs, adjectives, and adverbs • Are tied to a given subject domain
Multi-word terms • Example sentence [Wikipedia]: تتكون أكاسيد النيتروجين كناتج لجميع عمليات الاحتراق التي تتم في درجات الحرارة العالية « Nitrogen oxides are formed as a product of all combustion processes that take place at high temperatures » • MWTs extracted • أكاسيد النيتروجين « nitrogen oxides » • درجات الحرارة العالية « high temperatures » • عمليات الاحتراق « combustion processes »
Motivation • MWTs are frequent • Applications • building indexes from unstructured documents • enhancing document retrieval systems
MWT extraction system: concept extraction • Corpus → Identification of term candidates: linguistic filtering (shallow parsing) → Filtering of term candidates: statistical significance (LLR, FLR, MI3, t-score) → Candidate list
MWT evaluation • Unithood: measures the strength of association between the constituents of a multi-word unit (MWU) • Example: United Nations [environment domain] has unithood only • Termhood: measures relatedness to existing domain-specific concepts • Example: Soil degradation [environment domain] has both termhood and unithood
MWT variations
• Multiple forms for the same concept
• Variation types
  • Inflectional morphology
    • Number: N1 N2 / N1 N2 + suffix (ات, ون)
      • تلوث المحيط « ocean pollution »
      • تلوث المحيطات « pollution of the oceans »
    • Definite form: N ADJ / prefix (ال) + N, prefix (ال) + ADJ
      • تلوث كيميائي « chemical pollution »
      • التلوث الكيميائي « the chemical pollution »
  • Derivational morphosyntactic phenomena: N1 ADJ / N1 PREP N2
    • بئر نفطي => بئر من النفط « oil well »
  • Syntactic variation (post-posed modification): N1 N2 / N1 N2 ADJ
    • درجة الحرارة « degree of temperature »
    • درجة الحرارة العالية « high degree of temperature »
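As a rough illustration of how the definite-form variation could be conflated, the sketch below strips a leading ال from each word of a candidate so that the definite and indefinite forms map to the same key. The function name and the minimum-length guard are assumptions made for this example, not part of the presented system; a naive prefix test like this will also damage words whose stem happens to begin with ال, so a real system would rely on a morphological analyser.

```python
DEFINITE_ARTICLE = "ال"  # Arabic definite article, written as a prefix

def strip_definite_article(term: str, min_len: int = 5) -> str:
    """Naively remove the definite article from every word of a term.

    Hypothetical helper: words shorter than `min_len` characters are left
    untouched to limit (not eliminate) false positives.
    """
    words = []
    for word in term.split():
        if word.startswith(DEFINITE_ARTICLE) and len(word) >= min_len:
            word = word[len(DEFINITE_ARTICLE):]
        words.append(word)
    return " ".join(words)

# Both surface forms of "chemical pollution" reduce to the same key.
variants = ["تلوث كيميائي", "التلوث الكيميائي"]
keys = {strip_definite_article(v) for v in variants}
```

Grouping candidates by such a key before counting lets the frequencies of all variants of one term accumulate on a single entry.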
Comparing statistical filters • Mutual Information (MI3) (Daille, 1994) as baseline • Log-likelihood ratio (LLR) (Dunning, 1993) • t-score (Church, 1991) • FLR (Nakagawa and Mori, 2003)
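To make three of the measures above concrete, here is a minimal sketch that computes them from a 2x2 contingency table for a word pair (w1, w2): a = count of w1 followed by w2, b = w1 without w2, c = w2 without w1, d = neither. The formulas are the common textbook formulations and are an assumption about what the system computes; FLR is omitted because it needs left and right neighbour statistics rather than a single table.

```python
import math

def llr(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood ratio: 2 * sum(observed * ln(observed / expected))."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = (a, b, c, d)
    expected = (rows[0] * cols[0] / n, rows[0] * cols[1] / n,
                rows[1] * cols[0] / n, rows[1] * cols[1] / n)
    # Cells with a zero count contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def mi3(a: int, b: int, c: int, d: int) -> float:
    """Cubic mutual information, log2(P(xy)^3 / (P(x) P(y))); requires a > 0."""
    n = a + b + c + d
    return math.log2(a ** 3 / (n * (a + b) * (a + c)))

def t_score(a: int, b: int, c: int, d: int) -> float:
    """t-score: (observed - expected) / sqrt(observed); requires a > 0."""
    n = a + b + c + d
    return (a - (a + b) * (a + c) / n) / math.sqrt(a)
```

With a table that shows no association (a = b = c = d), both `llr` and `t_score` return 0, which is a quick sanity check. Constant factors such as the corpus size in `mi3` shift every score equally and so do not change the ranking of candidates.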
Experimental data • Arabic domain-specific corpus on the environment • Compiled from the web (“Al-Khat Alakhdar” and “Akhbar Albiae”), 2004-2006 • 475,148 words • Motivation • No Arabic domain-specific corpora were available
Gold standard • Reference list • Arabic environment terminology: AGROVOC • Total: 65,000 unique known terms (single-word terms and MWTs) • Dynamic search • Eurodicautom
Preprocessing • Removing diacritics • Buckwalter’s transliteration • Diab’s parsing (Diab, 2004) • Input: wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA' SHyHp Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw. • Output: w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN qbl/NN Al/DT ysAndrw/NNP ./PUNC
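Two of the preprocessing steps lend themselves to a short sketch: removing diacritics from raw Arabic text, and reading the tagger's `word/TAG` output into (word, tag) pairs for the later pattern matching. The function names are hypothetical, and the diacritic range used (U+064B to U+0652, the short-vowel and related marks) is an assumption about what the system removes.

```python
import re

# Arabic short-vowel marks, tanween, shadda and sukun (U+064B..U+0652).
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")

def remove_diacritics(text: str) -> str:
    """Delete Arabic diacritic marks, leaving the base letters unchanged."""
    return ARABIC_DIACRITICS.sub("", text)

def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    rsplit on the last '/' keeps tokens such as './PUNC' intact.
    """
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

tagged = parse_tagged("w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ ./PUNC")
```

The resulting list of pairs is the natural input for a part-of-speech pattern filter.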
Evaluation and results • For each association score • Examine the top-ranked term candidates • Compute precision (termhood) over the first 100 term candidates • Precision (termhood) is the ratio of attested MWTs to all extracted sequences • Log-likelihood is the best measure
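The precision computation described above can be sketched as follows, assuming the candidates are already sorted by decreasing association score and that "attested" means present in the reference list. The function and variable names are illustrative only.

```python
def precision_at_k(ranked_candidates: list[str], gold_terms: set[str], k: int = 100) -> float:
    """Share of the top-k candidates that appear in the gold standard."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    attested = sum(1 for candidate in top if candidate in gold_terms)
    return attested / len(top)

# Toy data, not taken from the experiment.
ranked = ["soil degradation", "last year", "ocean pollution", "other hand"]
gold = {"soil degradation", "ocean pollution"}
score = precision_at_k(ranked, gold, k=4)
```

Running the same function once per association measure gives directly comparable figures, since each measure is judged on the same number of top candidates.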
Summary & future work • Developed an MWT extraction program for Arabic • Defined MWT patterns and variations • Obtained better results than for European languages • System improvements • Add new variation types • Improve lemmatisation
Introduction • MWTs are informative enough to give human readers a feel for the essential topics • Used in many text-related applications: • Text clustering • Document similarity • Document summarization
Related Work • Linguistic approach • Based on linguistic pre-processing and annotations (output of taggers and shallow parsers) • Detects recurrent syntactic term-formation patterns • Noun + Noun • (Adj | Noun) + Noun
Systems based on linguistics • Ananiadou, S. (1994) recognises single-word terms in the domain of immunology, based on morphological analysis of term-formation patterns (internal term make-up) • Justeson & Katz (1995, TERMS) extract complex terms based on two characteristics that distinguish them from non-terms • the syntactic patterns are restricted • terms appear in the same form throughout the text; omission of modifiers is avoided
Systems based on linguistics • Justeson & Katz: the text is tagged, then a filter is applied to extract terms: ((A|N)+ | ((A|N)*(N P)?)(A|N)*) N, which admits AN / NN / AAN / ANN / NAN / NNN / NPN • Filtering is based on simple POS patterns • A pattern must occur above a certain threshold to be considered a valid term pattern • Recall: 71%; precision: 71% to 96% • LEXTER (Bourigault, 1994) • Extracts French compound terms based on surface syntactic analysis and text heuristics • Terms are identified according to certain syntactic patterns
LEXTER uses a boundary method to identify the extent of terms • Categories or sequences of categories that are not found in term patterns form the boundaries, e.g. verbs, or any preposition (except de and à) followed by a determiner; non-productive sequences become boundaries • Precision: 95%, although tests have shown that a lot of noise is generated
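The Justeson & Katz style part-of-speech filter mentioned earlier can be sketched with a regular expression over a string of coarse tags. The mapping from Penn-style tags to A (adjective), N (noun) and P (preposition) and the function names are assumptions for this example; the pattern follows English word order, so an Arabic system would need its own patterns (for instance noun before adjective).

```python
import re

# ((A|N)+ | ((A|N)*(NP)?)(A|N)*) N, written over one character per token.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def coarse_tag(tag: str) -> str:
    """Map a Penn-style tag to A, N, P, or 'x' for everything else."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "A"
    if tag == "IN":
        return "P"
    return "x"

def extract_candidates(tagged: list[tuple[str, str]]) -> list[list[str]]:
    """Return word sequences of length >= 2 whose tags match the pattern."""
    tag_string = "".join(coarse_tag(tag) for _, tag in tagged)
    candidates = []
    for match in TERM_PATTERN.finditer(tag_string):
        if match.end() - match.start() >= 2:
            candidates.append([word for word, _ in tagged[match.start():match.end()]])
    return candidates

sentence = [("the", "DT"), ("linear", "JJ"), ("function", "NN"), ("has", "VBZ"),
            ("degrees", "NNS"), ("of", "IN"), ("freedom", "NN")]
found = extract_candidates(sentence)
```

Because each token contributes exactly one character to the tag string, match offsets map straight back to token positions. A frequency threshold would then be applied to the extracted candidates.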
Approaches using statistical information • Main measures used: • Frequency of occurrence • Mutual Information • C/NC value • Experiments also with the log-likelihood coefficient [Dunning, 1993]
Frequency of occurrence • Simplest and most popular method • Domain-independent; requires no external resources • Some filtering is applied in the form of syntactic patterns • Systems using frequency of occurrence • Dagan & Church (TERMIGHT, 1994) • Enguehard & Pantera (1994) • Lauriston (TERMINO, 1996)
Mutual Information • ‘The amount of information provided by the occurrence of the event represented by $y_i$ about the occurrence of the event represented by $x_k$ is defined as’ $I(x_k, y_i) = \log \frac{P(x_k, y_i)}{P(x_k)\,P(y_i)}$, Fano (1961:27-28) • This measure captures how much one word tells us about the other • Problems for MI come from data sparseness • Damerau (1993) and Daille (1994) used MI for the extraction of candidate terms (two-word candidate terms only)
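A small worked example helps show why data sparseness is a problem for MI. The sketch below estimates the probabilities in the definition above by relative frequencies in a corpus of n word pairs; the function name and the toy counts are assumptions made for illustration.

```python
import math

def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information, log2(P(x, y) / (P(x) P(y))),
    with probabilities estimated as counts divided by n."""
    return math.log2((f_xy * n) / (f_x * f_y))

# A pair seen once, whose words never occur elsewhere.
rare_pair = pmi(f_xy=1, f_x=1, f_y=1, n=1000)
# A pair seen 100 times, whose words also occur in other contexts.
frequent_pair = pmi(f_xy=100, f_x=200, f_y=200, n=1000)
```

The single-occurrence pair scores far higher than the well-attested one, even though one observation is very weak evidence. This is the behaviour that variants such as MI3, which cubes the joint frequency, are meant to counteract.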
C/NC value (Frantzi & Ananiadou) • The C-value is computed from: • total frequency of occurrence of the string in the corpus • frequency of the string as part of longer candidate terms • number of these longer candidate terms • length of the string (in number of words)
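The four ingredients listed above combine as follows in the commonly cited form of the C-value: log2 of the length times the frequency, minus the average frequency of the longer candidates that contain the string when there are any. The sketch below implements that simplified form over a dictionary of candidate frequencies; the data structure and function names are assumptions, and the published algorithm additionally adjusts the nested counts while processing candidates from longest to shortest.

```python
import math

Term = tuple[str, ...]

def is_nested(shorter: Term, longer: Term) -> bool:
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))

def c_value(freqs: dict[Term, int]) -> dict[Term, float]:
    """Simplified C-value for every candidate in `freqs`."""
    scores = {}
    for term, freq in freqs.items():
        containing = [f for other, f in freqs.items()
                      if len(other) > len(term) and is_nested(term, other)]
        if containing:
            # Discount by the mean frequency of the longer candidates.
            freq = freq - sum(containing) / len(containing)
        scores[term] = math.log2(len(term)) * freq
    return scores

# Toy counts, not taken from the corpus.
scores = c_value({("soil", "degradation"): 10,
                  ("severe", "soil", "degradation"): 4})
```

Note that log2 of a length of 1 is 0, so this form only ranks candidates of two or more words.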
NC value • NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a), where a is the candidate term, C-value(a) is the C-value of a, and CF(a) is the context factor of a • CF(a) is obtained by summing the weights of the term context words of a, each multiplied by its frequency of occurrence with a
Hybrid approaches • Combination of linguistic information (filters), shallow parsing results and statistical measures • Daille; Frantzi & Ananiadou
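The NC-value formula above translates almost directly into code. In this sketch the context-word weights are taken as given; how they are estimated is not covered here, and the dictionaries and function names are assumptions made for the example.

```python
def context_factor(context_freqs: dict[str, int], weights: dict[str, float]) -> float:
    """CF(a): sum over context words of (weight * frequency with the candidate).

    Words without a weight are not term context words and contribute 0.
    """
    return sum(freq * weights.get(word, 0.0) for word, freq in context_freqs.items())

def nc_value(c_val: float, cf: float) -> float:
    """NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a)."""
    return 0.8 * c_val + 0.2 * cf

# Toy numbers: 'prevent' is a weighted context word, 'the' is not.
cf = context_factor({"prevent": 3, "the": 5}, {"prevent": 0.5})
score = nc_value(c_val=6.0, cf=cf)
```

The fixed 0.8 / 0.2 split means the C-value dominates, with context information acting as a smaller correction.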