Automatically Extending NE coverage of Arabic WordNet using Wikipedia

Musa Alkhalifa (2), Horacio Rodríguez (1) • (1) TALP Research Center, UPC, Barcelona, Spain • (2) UB, Barcelona, Spain • Citala 2009

Index of the presentation • Introduction & motivation • AWN • NEs • Wikipedia • Collecting NEs in AWN • Collecting NEs from Wikipedia • Our system • Empirical evaluation • Conclusions

Introduction & motivation: AWN • Funded by the USA REFLEX program (2005-2007) • Partners: • Universities • Princeton, Manchester, UPC, UB • Companies • Articulate Software, Irion • Description: • Black et al., 2006 • Elkateb et al., 2006 • Rodríguez et al., 2008a • Rodríguez et al., 2008b

Introduction & motivation: AWN • Objectives • 10,000 synsets, including some domain-specific data • linked to PWN 2.0 • eventually to PWN 3.0 • linked to SUMO • plus 1,000 NEs • manually built (or revised) • vowelized entries • including the root of each entry

Introduction & motivation: AWN • Current figures, including named entities [table not preserved]

Introduction & motivation: NEs • Importance of NEs for NLP tasks & applications • Mention detection, Coreference resolution, Textual Entailment, ... • IR, Q&A, Summarization, ... • Lack of sufficient coverage in WN (and AWN) • Additional sources • The Web • Wikipedia

Introduction & motivation: Wikipedia • Importance of Wikipedia • Size • English: 2 683 000+ articles • Deutsch: 847 000+ • Español: 431 000+ • Français: 746 000+ • Italiano: 527 000+ • Português: 449 000+ • ... • > 200 languages • Collaborative effort • Exponential growth

Introduction & motivation: Wikipedia • The Arabic version (AWP) • has over 65,000 articles (about 1% of the total size of WP) • Among all languages, Arabic ranks 29th, just above Serbian and Slovenian. • AWP is growing very fast (by more than 100% over the last year)

Collecting NEs in AWN • Objectives • 1,000 synsets • variety of types (locations, persons, organizations, ... ) • Approach • Selection of the candidates • Manual validation.

Collecting NEs in AWN • Selection of the candidates • sources • GEONAMES • FAO • NMSU Arabic/English lexicon

Collecting NEs in AWN • Selection of the candidates • Identifying synsets corresponding to instances • Obtaining the generic types • 371 generic types • such as 'capitals', 'cities', 'countries', 'inhabitants' or 'politicians' • Filtering out those not linked to AWN • Obtaining NMSU entries corresponding to the variants in instance synsets • Formatting and merging the results of the three sources
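
The filtering and merging steps on this slide can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(english, arabic, generic_type)`, the function name `merge_candidates` and the toy data are hypothetical and are not taken from the actual GEONAMES, FAO or NMSU formats.

```python
from collections import defaultdict

def merge_candidates(sources, awn_linked_types):
    """Merge candidate pairs from several sources.

    sources: iterable of record lists; each record is a hypothetical
             (english, arabic, generic_type) tuple.
    awn_linked_types: generic types that are linked to AWN; records
             with any other type are filtered out.
    Returns a dict mapping (english, generic_type) to the set of
    Arabic forms proposed by any source.
    """
    merged = defaultdict(set)
    for records in sources:
        for english, arabic, generic_type in records:
            if generic_type in awn_linked_types:
                merged[(english, generic_type)].add(arabic)
    return dict(merged)

# Toy data standing in for the three sources (assumed, not real extracts).
geonames = [("Cairo", "القاهرة", "capitals")]
fao = [("Egypt", "مصر", "countries")]
nmsu = [("Cairo", "القاهرة", "capitals"), ("Nile", "النيل", "rivers")]

candidates = merge_candidates([geonames, fao, nmsu], {"capitals", "countries"})
```

Keeping a set of Arabic forms per English entry leaves the choice among competing forms to the manual validation step.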

[Figure: a fragment of the GEONAMES database]

Collecting NEs in AWN • Manual validation • Deciding the acceptance or rejection of the pair. • Modifying Arabic form if needed. • Adding diacritics. • Completing attachments to PWN2.0 if possible.

Collecting NEs in AWN • Results • 1,147 synsets • 1,659 variants • 31 generic types.

Collecting NEs in AWN

Collecting NEs from Wikipedia • Using Wikipedia for NLP tasks • see the tutorial on my page: • http://www.lsi.upc.edu/horacio • ... • multilingual tasks • using Interwiki links • Richman and Schone, 2008 • Ferrández et al., 2007 • ... • software • Iryna Gurevych's (U. Darmstadt) JWPL system • ...

Collecting NEs from Wikipedia • Crude approach: • English NE -> Arabic interwiki link -> Arabic NE • But ... • Which English NEs should be looked up? • How to deal with polysemy? • How to handle vowelization (recovering diacritics)?
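
The crude approach amounts to a single lookup per English NE. A minimal sketch, assuming the interwiki links have already been extracted into an in-memory mapping (the dictionary `INTERWIKI_AR` and the function `crude_arabic_ne` are hypothetical names used only for illustration):

```python
# Hypothetical stand-in for the English -> Arabic interwiki links of EWP.
# A real system would read these from a Wikipedia dump or similar source.
INTERWIKI_AR = {
    "Barcelona": "برشلونة",
    "Cairo": "القاهرة",
}

def crude_arabic_ne(english_ne, interwiki=INTERWIKI_AR):
    """Return the Arabic title linked from the English article, or None."""
    return interwiki.get(english_ne)

arabic = crude_arabic_ne("Cairo")       # Arabic title of the linked article
missing = crude_arabic_ne("Atlantis")   # None: no interwiki link in the toy data
```

The lookup alone says nothing about which sense of an ambiguous title is meant, and the returned title carries no diacritics, which is why the open questions on this slide remain.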

Collecting NEs from Wikipedia • Our approach: • Which English NEs should be looked up? • Same approach used in building AWN • How to deal with polysemy? • use of disambiguation pages when available in EWP • comparing (using a Vector Space Model) with: • the set of variants (senses) of each generic type • the set of words occurring in the gloss (after removing stopwords and examples) • the topic signature • How to handle vowelization (recovering diacritics)? • comparison with other interwiki links
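
The Vector Space Model comparison used for polysemy can be illustrated with plain bag-of-words vectors and cosine similarity. This is a sketch under assumptions, not the system's actual implementation: the stopword list, the tokenization, the raw term-frequency weighting and the example texts are all illustrative choices.

```python
import math
from collections import Counter

# Illustrative stopword list (assumed; the real list is not given).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "that"}

def bag_of_words(text):
    """Lowercase, strip simple punctuation, drop stopwords, count terms."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def best_candidate(sense_context, candidates):
    """Pick the candidate article whose text is closest to the sense context.

    sense_context: words describing the synset (variants, gloss words, ...).
    candidates: dict mapping article title -> article text.
    """
    ctx = bag_of_words(sense_context)
    return max(candidates,
               key=lambda title: cosine(ctx, bag_of_words(candidates[title])))

pages = {
    "Paris": "Paris is the capital city of France",
    "Paris (mythology)": "Paris is a prince of Troy in Greek mythology",
}
choice = best_candidate("Paris capital and largest city of France", pages)
```

In this toy case the gloss shares four content words with the first page and only one with the second, so the first page is selected.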

Our approach

Results • Our approach: • We started with 16,873 English NEs occurring as instances in PWN2.0 • Of these, 14,904 also occur in EWP as article titles, a very good coverage (88%) • 3,854 Arabic words corresponding to 2,589 English synsets were recovered following our approach. This coverage (26%) is high considering the small size of AWP • Of the recovered synsets, only 496 belonged to the set of NEs already included in AWN.

Results • Our approach: • Automatic evaluation • Of the 496 synsets included in both sets, 464 were the same and 32 differed • 93.5% accuracy (464/496) • Manual validation • Of the 3,854 proposed assignments, 3,596 (93.3%) were judged correct, 67 (1.7%) were judged wrong and 191 (5%) could not be judged
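
The percentages on the two results slides can be rechecked from the raw counts. The short sketch below does only that; the helper `pct` is a hypothetical name, and reading the 26% coverage as 3,854 recovered Arabic words over 14,904 EWP titles is an inference from the numbers, not something the slides state.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

ewp_coverage = pct(14904, 16873)   # PWN2.0 instances found as EWP titles
awp_coverage = pct(3854, 14904)    # assumed basis of the reported 26%
auto_accuracy = pct(464, 496)      # agreement with NEs already in AWN
manual_correct = pct(3596, 3854)   # manual validation: correct
manual_wrong = pct(67, 3854)       # manual validation: wrong
manual_unknown = pct(191, 3854)    # manual validation: not known

# The three manual-validation classes partition the proposed assignments.
assert 3596 + 67 + 191 == 3854
```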

Conclusions • We have presented an approach for automatically attaching Arabic NEs to English NEs using AWN, PWN, AWP and EWP as knowledge sources • The system is fully automatic and quite accurate, and has been applied to substantially enrich the NE set in AWN

Thank you for your attention
