
Arabic WordNet: Semi-automatic Extensions using Bayesian Inference

H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2). (1) TALP Research Center, UPC, Barcelona, Spain; (2) UB, Barcelona, Spain.


Presentation Transcript


  1. Arabic WordNet: Semi-automatic Extensions using Bayesian Inference. H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2). (1) TALP Research Center, UPC, Barcelona, Spain; (2) UB, Barcelona, Spain. LREC 2008 AWN

  2. Index of the talk
     • The AWN project
     • Semi-automatic Extensions of AWN
       • Intuitive basis
       • Previous work using heuristics
       • Using Bayesian Networks
     • Empirical evaluation
     • Conclusions

  3. The AWN project
     • Funded by the USA REFLEX program (2005-2007)
     • Partners:
       • Universities: Princeton, Manchester, UPC, UB
       • Companies: Articulate Software, Irion
     • Description:
       • Black et al., 2006
       • Elkateb et al., 2006
       • Rodríguez et al., 2008

  4. The AWN project
     • Objectives
       • 10,000 synsets, including some domain-specific data
       • linked to PWN 2.0, and finally to PWN 3.0
       • linked to SUMO
       • plus 1,000 named entities (NE)
       • manually built (or revised)
       • vowelized entries
       • including the root of each entry

  5. The AWN project
     • Current figures
     • Named entities: [table not reproduced in this transcript]

  6. Semi-automatic Extensions of AWN
     • Intuitive basis
       • In Arabic (and other Semitic languages), many words that share a common root (i.e. a sequence of typically three consonants) have related meanings and can be derived from a base verbal form by means of a small set of lexical rules

  7. Semi-automatic Extensions of AWN

  8. Semi-automatic Extensions of AWN
     • Lexical rules
       • regular verbal derivative forms
       • regular nominal and adjectival derivative forms
         • masdar (verbal noun)
         • masculine and feminine active and passive participles
       • inflected verbal forms
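
The idea of deriving word forms from a root by lexical rules can be illustrated with a minimal sketch. It assumes a triliteral root and a small, hypothetical table of patterns written in a Buckwalter-style transliteration, where the digits 1, 2, 3 mark the slots for the root consonants. The rule labels, the pattern table and the function names are illustrative assumptions, not the rule set used in the AWN work.

```python
# Toy root-and-pattern derivation (illustrative only, not the AWN rule set).
# Patterns use Buckwalter-style transliteration; 1, 2, 3 are root-consonant slots.

def apply_pattern(root: str, pattern: str) -> str:
    """Fill the digit slots 1, 2, 3 of a pattern with the root consonants."""
    if len(root) != 3:
        raise ValueError("this sketch only handles triliteral roots")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)

# Hypothetical rule table: rule label -> pattern.
PATTERNS = {
    "base verb (form I)": "1a2a3a",
    "masdar": "1a23",
    "active participle": "1aA2i3",
    "noun of place": "ma12a3ap",
}

def derive(root: str) -> dict:
    """Apply every pattern in the table to the given root."""
    return {label: apply_pattern(root, p) for label, p in PATTERNS.items()}

# Root d-r-s (the root of درس, used in the evaluation slide).
forms = derive("drs")
for label, form in forms.items():
    print(f"{label}: {form}")
```

Real derivation is less regular than this: a given root does not take every pattern, and the masdar of a form I verb is not predictable from the root alone, which is one reason the candidate forms are filtered and manually reviewed.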

  9. Semi-automatic Extensions of AWN
     • Procedure for generating a set of likely <Arabic word, English synset, score> triples:
       • produce an initial list of candidate word forms
       • filter out the less likely candidates from this list
       • generate an initial list of attachments
       • score the reliability of these candidates
       • manually review the best-scored candidates and include the valid associations in AWN
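
The automatic steps of this procedure can be sketched as a small pipeline. Everything below is a toy under stated assumptions: the corpus counts, the bilingual lexicon, the synset labels (written in a `word.pos.nn` style purely for readability) and the scoring function are hypothetical stand-ins for the real resources and for the heuristic or Bayesian scorers.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    arabic_word: str
    synset: str
    score: float

def filter_candidates(forms, corpus_counts, min_count=1):
    """Filtering step: drop forms that are not attested in the corpus."""
    return [f for f in forms if corpus_counts.get(f, 0) >= min_count]

def attach(forms, bilingual_lexicon, english_synsets):
    """Attachment step: link each form to the synsets of its English translations."""
    pairs = set()
    for form in forms:
        for english in bilingual_lexicon.get(form, []):
            for synset in english_synsets.get(english, []):
                pairs.add((form, synset))
    return sorted(pairs)

def propose(forms, corpus_counts, bilingual_lexicon, english_synsets, scorer, top_n=10):
    """Scoring step: score every pair and keep the best ones for manual review."""
    kept = filter_candidates(forms, corpus_counts)
    pairs = attach(kept, bilingual_lexicon, english_synsets)
    scored = [Candidate(w, s, scorer(w, s)) for w, s in pairs]
    scored.sort(key=lambda c: c.score, reverse=True)
    return scored[:top_n]

# Hypothetical toy resources.
FORMS = ["darasa", "dars", "madrasap", "dirAsap"]
COUNTS = {"darasa": 12, "dars": 7, "madrasap": 30}  # "dirAsap" unseen in this toy corpus
LEXICON = {"darasa": ["study", "learn"], "dars": ["lesson"], "madrasap": ["school"]}
SYNSETS = {
    "study": ["study.v.01", "learn.v.01"],
    "learn": ["learn.v.01"],
    "lesson": ["lesson.n.01"],
    "school": ["school.n.01"],
}

def toy_scorer(word, synset):
    """Share of the word's translations that belong to the synset (placeholder scorer)."""
    translations = LEXICON[word]
    hits = sum(1 for e in translations if synset in SYNSETS.get(e, []))
    return hits / len(translations)

review_list = propose(FORMS, COUNTS, LEXICON, SYNSETS, toy_scorer)
for cand in review_list:
    print(cand)
```

The final step, manual review, is deliberately left outside the code: the output is only a ranked list for a lexicographer to accept or reject.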

  10. Semi-automatic Extensions of AWN
      • Resources
        • PWN
        • AWN
        • LOGOS database of conjugated Arabic verbs
        • NMSU bilingual Arabic-English lexicon
        • Arabic Gigaword Corpus
        • UN (2000-2002) bilingual Arabic-English Corpus

  11. Semi-automatic Extensions of AWN
      • Score the reliability of the candidates
        • build a graph representing the words, the synsets and their associations
          • synset-synset associations:
            • explicit in WN 2.0
            • path-based
        • apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
        • apply Bayesian inference (LREC 2008)
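
One possible reading of the distinction between explicit and path-based synset-synset associations is sketched below: explicit associations are direct relations (one hop), while path-based ones connect synsets through a short chain of relations. The relation table, the synset labels and the hop limit are hypothetical; the slides do not specify how paths were defined or bounded.

```python
from collections import deque

def synsets_within(relations, start, max_hops):
    """Breadth-first search: return {synset: hop distance} for every synset
    reachable from `start` in at most `max_hops` relation steps."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in relations.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    del dist[start]  # the start synset is not associated with itself
    return dist

# Hypothetical synset relation graph (adjacency lists).
RELATIONS = {
    "study.v.01": ["learn.v.01"],
    "learn.v.01": ["study.v.01", "memorize.v.01"],
    "memorize.v.01": ["learn.v.01"],
}

explicit = synsets_within(RELATIONS, "study.v.01", max_hops=1)
path_based = synsets_within(RELATIONS, "study.v.01", max_hops=2)
print(explicit)
print(path_based)
```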

  12. Using Bayesian Inference

  13. Using Bayesian Inference

  14. Using Bayesian Inference
      • Building the CPT (conditional probability table) for each node in the BN (Bayesian network)
        • EW-AW edges
          • probabilities from statistical translation models built from the UN corpus using GIZA++ (word-to-word probabilities), filtered to remove pairs whose Arabic expression has an invalid Buckwalter encoding
          • all the probability mass is distributed among the pairs occurring in the BN
        • other edges (EW-S, S-S)
          • linear distribution on priors
          • noisy-OR model
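
Two of the ingredients named here can be sketched in a few lines: redistributing the translation probability mass over the pairs that actually occur in the network, and combining several parents with a noisy-OR. This is a minimal sketch under assumptions: the probabilities are invented, normalising per first element of the pair is a guess (the slides do not say which side is conditioned on), and the noisy-OR is shown in its soft form, which treats the parents as independent.

```python
from collections import defaultdict

def renormalize(translation_probs, kept_pairs):
    """Keep only the pairs present in the network and rescale them so that,
    for each first element, the kept probabilities sum to 1."""
    totals = defaultdict(float)
    for (src, tgt), p in translation_probs.items():
        if (src, tgt) in kept_pairs:
            totals[src] += p
    return {
        (src, tgt): p / totals[src]
        for (src, tgt), p in translation_probs.items()
        if (src, tgt) in kept_pairs and totals[src] > 0
    }

def noisy_or(parent_probs, weights):
    """Noisy-OR: the child is inactive only if every parent fails to activate it.
    P(child) = 1 - prod_i (1 - w_i * p_i), assuming independent parents."""
    inactive = 1.0
    for p, w in zip(parent_probs, weights):
        inactive *= 1.0 - w * p
    return 1.0 - inactive

# Invented word-to-word probabilities; the third pair is not in the network.
PROBS = {("study", "darasa"): 0.3, ("study", "dars"): 0.1, ("study", "other"): 0.6}
KEPT = {("study", "darasa"), ("study", "dars")}

rescaled = renormalize(PROBS, KEPT)
combined = noisy_or([1.0, 1.0], [0.5, 0.5])
print(rescaled)
print(combined)
```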

  15. Using Bayesian Inference
      • Performing Bayesian inference in the BN
        • Assign probability 1 to the nodes in layer 1
        • Infer the probabilities of the nodes in layer 3
        • For each word in layer 1, select as candidates the synsets in layer 3 that are connected to it and have a probability over a threshold
        • Score each candidate pair with this probability
        • Select the candidates scored over a threshold
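
The selection steps above can be sketched on a toy three-layer network. Note the assumptions: layer 1 is taken to hold Arabic words, layer 2 English words and layer 3 synsets; the edge weights are invented; and the inference is a simple forward pass with a noisy-OR at each node, which treats parents as independent and is therefore only an approximation of exact inference in a general Bayesian network. A single threshold stands in for the two selection steps.

```python
def noisy_or(values):
    """Combine independent activation probabilities: 1 - prod(1 - v)."""
    inactive = 1.0
    for v in values:
        inactive *= 1.0 - v
    return 1.0 - inactive

def propagate(layer1, edges12, edges23):
    """Forward pass. Edges are given as {child: {parent: weight}}."""
    p1 = {word: 1.0 for word in layer1}  # evidence: layer 1 nodes get probability 1
    p2 = {child: noisy_or(w * p1.get(par, 0.0) for par, w in parents.items())
          for child, parents in edges12.items()}
    p3 = {child: noisy_or(w * p2.get(par, 0.0) for par, w in parents.items())
          for child, parents in edges23.items()}
    return p2, p3

def select_candidates(layer1, edges12, edges23, threshold):
    """Return (word, synset, score) for connected synsets scored over the threshold."""
    _, p3 = propagate(layer1, edges12, edges23)
    selected = []
    for word in layer1:
        # Layer 2 nodes that have this word as a parent.
        mids = {mid for mid, parents in edges12.items() if word in parents}
        for synset, parents in edges23.items():
            if mids.intersection(parents) and p3[synset] > threshold:
                selected.append((word, synset, p3[synset]))
    return sorted(selected, key=lambda item: -item[2])

# Toy network with invented weights.
LAYER1 = ["darasa"]
EDGES12 = {"study": {"darasa": 0.5}, "learn": {"darasa": 0.5}}
EDGES23 = {"study.v.01": {"study": 0.5}, "learn.v.01": {"study": 0.5, "learn": 0.5}}

strict = select_candidates(LAYER1, EDGES12, EDGES23, threshold=0.3)
loose = select_candidates(LAYER1, EDGES12, EDGES23, threshold=0.2)
print(strict)
print(loose)
```

In this toy, the synset reachable through two English words ends up with a higher score than the one reachable through a single word, which is the behaviour the noisy-OR is meant to capture.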

  16. Empirical Evaluation • 10 verbs randomly selected from AWN + درس

  17. Empirical Evaluation • Results

  18. Conclusions
      • The BN approach roughly doubles the number of candidates of the previous HEU approach (554 vs. 272).
      • The sample is clearly insufficient.
      • The overlap of the HEU and BN candidates seems to improve the results.
      • An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or of the feminine ending (ta marbuta, ة).

  19. Further work
      • Repeat the entire procedure relying, where possible, on dictionaries that contain diacritics
      • Refine the scoring procedure by assigning different weights to the different relations
      • Include additional relations (e.g. path-based)
      • Use additional knowledge sources for weighting the relations:
        • related entries already included in AWN
        • SUMO
        • Magnini's domain codes

  20. Thank you for your attention
