
Automatically Extending NE coverage of Arabic WordNet using Wikipedia

This research paper discusses the use of Wikipedia to collect named entities (NEs) for Arabic WordNet, resulting in an improved coverage of NEs. The approach includes manual validation and automatic evaluation.


Presentation Transcript


  1. Automatically Extending NE coverage of Arabic WordNet using Wikipedia. Musa Alkhalifa (2), Horacio Rodríguez (1). (1) TALP Research Center, UPC, Barcelona, Spain; (2) UB, Barcelona, Spain. Citala 2009

  2. Index of the presentation • Introduction & motivation • AWN • NEs • Wikipedia • Collecting NEs in AWN • Collecting NEs from Wikipedia • Our system • Empirical evaluation • Conclusions

  3. Introduction & motivation: AWN • Funded by the USA REFLEX program (2005-2007) • Partners: • Universities: Princeton, Manchester, UPC, UB • Companies: Articulate Software, Irion • Description: Black et al., 2006; Elkateb et al., 2006; Rodríguez et al., 2008a; Rodríguez et al., 2008b

  4. Introduction & motivation: AWN • Objectives • 10,000 synsets, including some domain-specific data • linked to PWN 2.0 • finally to PWN 3.0 • linked to SUMO • plus 1,000 NEs • manually built (or revised) • vowelized entries • including the root of each entry

  5. Introduction & motivation: AWN • Current figures Named entities:

  6. Introduction & motivation: NEs • Importance of NEs for NLP tasks & applications • Mention detection, Coreference resolution, Textual Entailment, ... • IR, Q&A, Summarization, ... • Lack of sufficient coverage in WN (and AWN) • Additional sources • The Web • Wikipedia

  7. Introduction & motivation: Wikipedia • Importance of Wikipedia • Size • English: 2 683 000+ articles • Deutsch: 847 000+ • Español: 431 000+ • Français: 746 000+ • Italiano: 527 000+ • Português: 449 000+ • ... • > 200 languages • Collaborative effort • Exponential growth

  8. Introduction & motivation: Wikipedia • The Arabic version (AWP) • has over 65,000 articles (about 1% of the total size of WP) • Arabic ranks 29th among all language editions, just above Serbian and Slovenian. • AWP is growing very fast (more than 100% over the last year)

  9. Collecting NEs in AWN • Objectives • 1,000 synsets • variety of types (locations, persons, organizations, ... ) • Approach • Selection of the candidates • Manual validation.

  10. Collecting NEs in AWN • Selection of the candidates • sources • GEONAMES • FAO • NMSU Arabic/English lexicon

  11. Collecting NEs in AWN • Selection of the candidates • Identifying synsets corresponding to instances • Obtaining the generic types • 371 generic types • such as 'capitals', 'cities', 'countries', 'inhabitants' or 'politicians' • Filtering out those not linked to AWN • Obtaining NMSU entries corresponding to the variants in instance synsets • Formatting and merging the results of the three sources

  12. A fragment of the GEONAMES database

  13. Collecting NEs in AWN • Manual validation • Deciding whether to accept or reject each pair. • Modifying the Arabic form if needed. • Adding diacritics. • Completing attachments to PWN 2.0 if possible.

  14. Collecting NEs in AWN • Results • 1,147 synsets • 1,659 variants • 31 generic types.

  15. Collecting NEs in AWN

  16. Collecting NEs from Wikipedia • Using Wikipedia for NLP tasks • see the tutorial on my page: • http://www.lsi.upc.edu/horacio • ... • multilingual tasks • using interwiki links • Richman and Schone, 2008 • Ferrández et al., 2007 • ... • software • Iryna Gurevych's (U. Darmstadt) JWPL system • ...

  17. Collecting NEs from Wikipedia • Crude approach: • English NE -> Arabic interwiki link -> Arabic NE • But ... • Which English NEs should be looked up? • How to deal with polysemy? • How to handle vowelization (recovering diacritics)?
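
The crude approach on slide 17 amounts to a lookup chain. The following is a minimal sketch, assuming the interwiki links have already been extracted into a dictionary; the `INTERWIKI_EN_AR` table and the function `crude_arabic_ne` are hypothetical names for illustration and are not part of the authors' system.

```python
from typing import Optional

# Hypothetical toy table: English Wikipedia article title -> Arabic interwiki title.
# In practice such a table would be extracted from a Wikipedia dump.
INTERWIKI_EN_AR = {
    "Barcelona": "برشلونة",
    "Cairo": "القاهرة",
}

def crude_arabic_ne(english_ne: str) -> Optional[str]:
    """Follow the English title's interwiki link; return None when no link exists."""
    return INTERWIKI_EN_AR.get(english_ne)
```

The sketch makes the open questions visible: it does not say which English NEs to query, it cannot tell apart several articles sharing a name, and the Arabic title it returns carries no diacritics.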

  18. Collecting NEs from Wikipedia • Our approach: • Which English NEs should be looked up? • Same approach used in building AWN • How to deal with polysemy? • use of disambiguation pages when available in EWP • comparing, using a Vector Space Model, with: • the set of variants (senses) of each generic type • the set of words occurring in the gloss (after removing stopwords and examples) • the topic signature • vowelization (recovering diacritics) • comparison with other interwiki links
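
The Vector Space Model comparison on slide 18 can be illustrated with a bag-of-words cosine similarity between a synset gloss and each candidate article. This is only a sketch under stated assumptions: the slides do not give the weighting scheme, so raw term counts are used here, and the stopword list, `bag_of_words`, `cosine` and `pick_article` are all hypothetical.

```python
import math
from collections import Counter

# Toy stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "that"}

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stopwords, count the remaining terms."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pick_article(gloss, candidates):
    """Return the candidate title whose text is most similar to the gloss.

    candidates maps article titles to article text.
    """
    gloss_vec = bag_of_words(gloss)
    return max(candidates, key=lambda t: cosine(gloss_vec, bag_of_words(candidates[t])))
```

For example, given the gloss "capital city of Georgia state" and two candidates listed on a disambiguation page, one describing a city and one describing a magazine, `pick_article` selects the city article because it shares more content words with the gloss.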

  19. Our approach

  20. Results • Our approach: • We started with 16,873 English NEs occurring as instances in PWN 2.0 • Of these, 14,904 also occur in EWP as article titles, a coverage of 88% • 3,854 Arabic words corresponding to 2,589 English synsets were recovered following our approach. The coverage (26%) is high given the small size of AWP • Of the recovered synsets, only 496 belonged to the set of NEs already included in AWN.

  21. Results • Our approach: • Automatic evaluation • Of the 496 synsets included in both sets, 464 were the same and 32 differed • 93.4% accuracy • Manual validation • Of the 3,854 proposed assignments, 3,596 (93.3%) were considered correct, 67 (1.7%) were considered wrong and 191 (5%) were unknown
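
The percentages on slides 20 and 21 can be recomputed from the reported counts. The sketch below is a plain arithmetic check; the helper `pct` and the variable names are illustrative. It assumes the 26% figure relates the 3,854 Arabic words to the 14,904 EWP titles, which the slides do not state explicitly. Recomputed this way, 464/496 comes out at about 93.5%, marginally above the 93.4% given on the slide.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts as reported on the slides.
title_coverage = pct(14_904, 16_873)   # PWN 2.0 instances found as EWP titles
arabic_coverage = pct(3_854, 14_904)   # assumed reading of the 26% figure
auto_accuracy = pct(464, 496)          # agreement with NEs already in AWN

manual_counts = {"correct": 3_596, "wrong": 67, "unknown": 191}
manual_total = sum(manual_counts.values())  # matches the 3,854 proposed assignments
manual_pct = {label: pct(n, manual_total) for label, n in manual_counts.items()}
```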

  22. Conclusions • We have presented an approach for automatically attaching Arabic NEs to English NEs using AWN, PWN, AWP and EWP as knowledge sources • The system is fully automatic, quite accurate, and has been used to substantially enrich the NE set in AWN

  23. Thank you for your attention
