Arabic WordNet: Semi-automatic Extensions using Bayesian Inference. H. Rodríguez1, D. Farwell1, J. Farreres1, M. Bertran1, M. Alkhalifa2, M.A. Martí2. 1 TALP Research Center, UPC, Barcelona, Spain; 2 UB, Barcelona, Spain. LREC 2008
Index of the talk • The AWN project • Semi-automatic Extensions of AWN • Intuitive basis • Previous work using heuristics • Using Bayesian Networks • Empirical evaluation • Conclusions
The AWN project • Funded by the US REFLEX program (2005-2007) • Partners: • Universities: Princeton, Manchester, UPC, UB • Companies: Articulate Software, Irion • Description: Black et al., 2006; Elkateb et al., 2006; Rodríguez et al., 2008
The AWN project • Objectives • 10,000 synsets, including some amount of domain-specific data • linked to PWN 2.0, and finally to PWN 3.0 • linked to SUMO • plus 1,000 named entities (NE) • manually built (or revised) • vowelized entries • including the root of each entry
The AWN project • Current figures • Named entities (tables not reproduced)
Semi-automatic Extensions of AWN • Intuitive basis • In Arabic (and other Semitic languages), many words sharing a common root (typically a sequence of three consonants) have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules
Semi-automatic Extensions of AWN • Lexical rules • regular verbal derivative forms • regular nominal and adjectival derivative forms • masdar (verbal noun) • masculine and feminine active and passive participles • inflected verbal forms
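To make the idea concrete, here is a minimal sketch of root-and-pattern derivation, assuming a toy set of templates in Buckwalter transliteration; the actual generation step relied on the LOGOS verb database and a much fuller rule set, so the templates and helper below are only illustrative.

```python
# Toy illustration of root-and-pattern derivation for a triliteral root.
# The templates are simplified and written in Buckwalter transliteration;
# C1, C2, C3 stand for the three root consonants. The real AWN procedure
# used the LOGOS database of conjugated verbs and a fuller set of rules.

TEMPLATES = {
    "form_I_verb":        "C1aC2aC3a",    # e.g. kataba 'he wrote'
    "masdar":             "C1iC2AC3ap",   # e.g. kitAbap 'writing'
    "active_participle":  "C1AC2iC3",     # e.g. kAtib 'writer'
    "passive_participle": "maC1C2uwC3",   # e.g. maktuwb 'written'
}

def derive(root):
    """Expand a three-consonant root into candidate derived word forms."""
    c1, c2, c3 = root
    return {name: template.replace("C1", c1).replace("C2", c2).replace("C3", c3)
            for name, template in TEMPLATES.items()}

if __name__ == "__main__":
    for name, form in derive(("k", "t", "b")).items():
        print(f"{name:20s} {form}")
```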
Semi-automatic Extensions of AWN • Procedure for generating a set of likely <Arabic word, English synset, score> triples (a toy illustration of the flow follows): • produce an initial list of candidate word forms • filter out the less likely candidates from this list • generate an initial list of attachments • score the reliability of these candidates • manually review the best-scored candidates and include the valid associations in AWN
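Purely to illustrate the control flow of these five steps, a toy end-to-end version follows; the data, helper names and scores are invented, and the real system used the resources and Bayesian scoring described on the next slides.

```python
# Toy end-to-end version of the five-step procedure; all data are invented.

# 1. produce an initial list of candidate Arabic word forms
candidates = ["kAtib", "maktuwb", "ktb#bad_form"]

# 2. filter out the less likely candidates (here: forms unattested in a toy corpus)
corpus_vocabulary = {"kAtib", "maktuwb"}
candidates = [w for w in candidates if w in corpus_vocabulary]

# 3. generate an initial list of <Arabic word, English synset> attachments
bilingual = {"kAtib": ["writer_n_1", "clerk_n_1"], "maktuwb": ["written_a_1"]}
attachments = [(aw, syn) for aw in candidates for syn in bilingual.get(aw, [])]

# 4. score the reliability of these candidates (here: a made-up score table)
scores = {("kAtib", "writer_n_1"): 0.81, ("kAtib", "clerk_n_1"): 0.33,
          ("maktuwb", "written_a_1"): 0.74}
scored = [(aw, syn, scores[(aw, syn)]) for aw, syn in attachments]

# 5. keep only the best-scored triples for manual review
THRESHOLD = 0.5
for triple in sorted(scored, key=lambda t: -t[2]):
    if triple[2] >= THRESHOLD:
        print(triple)
```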
Semi-automatic Extensions of AWN • Resources • PWN • AWN • LOGOS database of conjugated Arabic verbs • NMSU bilingual Arabic-English lexicon • Arabic Gigaword Corpus • UN (2000-2002) bilingual Arabic-English Corpus
Semi-automatic Extensions of AWN • Score the reliability of the candidates • build a graph representing the words, synsets and their associations • synset-synset associations: explicit in WN 2.0 or path-based • apply a set of heuristic rules that directly use the structure of the graph (GWC 2008) • apply Bayesian inference (LREC 2008)
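A sketch of the kind of graph this step builds, assuming a plain adjacency-list representation with typed edges (English word to Arabic word, English word to synset, synset to synset); the node labels, relation names and weights are illustrative, not taken from the project.

```python
# Minimal sketch of the word/synset association graph with typed edges.
# Node identifiers, relation names and weights are illustrative only; the real
# graph combines PWN 2.0 relations, path-based synset links and bilingual
# lexicon / translation-model entries.
from collections import defaultdict

class AssociationGraph:
    def __init__(self):
        # edges[node] -> list of (neighbour, relation, weight)
        self.edges = defaultdict(list)

    def add_edge(self, src, dst, relation, weight=1.0):
        self.edges[src].append((dst, relation, weight))

    def neighbours(self, node, relation=None):
        return [(n, r, w) for n, r, w in self.edges[node]
                if relation is None or r == relation]

g = AssociationGraph()
g.add_edge("ew_write", "aw_kataba", "EW-AW", 0.62)   # from the bilingual resources
g.add_edge("ew_write", "syn_write_1", "EW-S")        # explicit in PWN 2.0
g.add_edge("syn_write_1", "syn_compose_1", "S-S")    # explicit WN relation or path-based
print(g.neighbours("ew_write"))
```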
Using Bayesian Inference (figure slides, content not reproduced)
Using Bayesian Inference • Building the CPT for each node in the BN • edges EW-AW: probabilities from statistical word-to-word translation models built from the UN corpus using GIZA++, filtered to discard pairs whose Arabic expressions have invalid Buckwalter encodings; the whole probability mass is redistributed over the pairs occurring in the BN • other edges (EW-S, S-S): linear distribution on priors • noisy-OR model
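The following sketch shows how a noisy-OR CPT entry can be computed for a node with several parents, given one link probability per incoming edge; for EW-AW edges these would be the renormalized GIZA++ word-translation probabilities, for the other edge types the priors mentioned above. The concrete numbers are made up.

```python
# Noisy-OR CPT construction for a node with parents p1..pn and per-edge link
# probabilities q1..qn: P(node on | parents) = 1 - prod over active parents (1 - qi).
# Link probabilities would come from GIZA++ word-translation tables (EW-AW edges,
# renormalized over pairs occurring in the BN) or from priors (EW-S, S-S edges);
# the values used below are invented.
from itertools import product

def noisy_or(active_link_probs):
    """Probability the child is on, given the link probabilities of its active parents."""
    p_off = 1.0
    for q in active_link_probs:
        p_off *= (1.0 - q)
    return 1.0 - p_off

def build_cpt(edge_probs):
    """Full CPT: one entry per on/off assignment of the parents."""
    parents = list(edge_probs)
    cpt = {}
    for assignment in product([False, True], repeat=len(parents)):
        active = [edge_probs[p] for p, on in zip(parents, assignment) if on]
        cpt[assignment] = noisy_or(active)
    return cpt

# Illustrative: a synset node whose parents are two English word nodes.
for row, p in build_cpt({"ew_write": 0.62, "ew_compose": 0.35}).items():
    print(row, round(p, 3))
```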
Using Bayesian Inference • Performing Bayesian inference in the BN • Assign probability 1 to the nodes in layer 1 • Infer the probabilities of the nodes in layer 3 • For each word in layer 1, select as candidates the synsets in layer 3 connected to it whose probability is over a threshold • Score each candidate pair with this probability • Select the candidates scored over the threshold
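A crude sketch of this step under the same assumptions: evidence nodes in layer 1 are clamped to probability 1 and probabilities are pushed forward layer by layer with the noisy-OR combination above. This forward pass is only an approximation of proper inference in the BN, and the tiny network, link probabilities and threshold are invented.

```python
# Layer-by-layer forward propagation over a tiny invented three-layer network.
# This is a crude approximation of exact Bayesian inference, for illustration only.

def noisy_or(parent_probs_and_links):
    p_off = 1.0
    for p_parent, q_link in parent_probs_and_links:
        p_off *= (1.0 - p_parent * q_link)
    return 1.0 - p_off

# parents[node] = list of (parent node, link probability); names are illustrative
parents = {
    "aw_kataba":     [("ew_write", 0.62), ("ew_compose", 0.20)],   # layer 2
    "syn_write_1":   [("ew_write", 0.55), ("aw_kataba", 0.40)],    # layer 3
    "syn_compose_1": [("ew_compose", 0.15)],                       # layer 3
}

prob = {"ew_write": 1.0, "ew_compose": 1.0}   # layer-1 nodes clamped to 1

for node in ["aw_kataba", "syn_write_1", "syn_compose_1"]:
    prob[node] = noisy_or((prob[p], q) for p, q in parents[node])

THRESHOLD = 0.5
candidates = [(n, round(p, 3)) for n, p in prob.items()
              if n.startswith("syn_") and p >= THRESHOLD]
print(candidates)   # synsets kept as candidates for the clamped words
```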
Empirical Evaluation • 10 verbs randomly selected from AWN, plus درس (darasa, 'to study')
Empirical Evaluation • Results (table not reproduced)
Conclusions • The BN approach doubles the number of candidates of the previous HEU approach (554 vs. 272) • The sample is clearly insufficient • Combining HEU and BN, where they overlap, seems to improve the results • An analysis of the errors shows that a substantial number were due to the missing shadda diacritic or the feminine ending (ta marbuta, ة)
Further work • Repeat the entire procedure, relying where possible on dictionaries containing diacritics • Refine the scoring procedure by assigning different weights to the different relations • Include additional relations (e.g. path-based) • Use additional knowledge sources for weighting the relations: • related entries already included in AWN • SUMO • Magnini's domain codes
Thank you for your attention