This research paper discusses the use of Wikipedia to collect named entities (NEs) for Arabic WordNet, resulting in an improved coverage of NEs. The approach includes manual validation and automatic evaluation.
Automatically Extending NE coverage of Arabic WordNet using Wikipedia Musa Alkhalifa2, Horacio Rodríguez1 1 Talp Research Center, UPC, Barcelona, Spain 2 UB, Barcelona, Spain Citala 2009
Index of the presentation • Introduction & motivation • AWN • NEs • Wikipedia • Collecting NEs in AWN • Collecting NEs from Wikipedia • Our system • Empirical evaluation • Conclusions
Introduction & motivation: AWN • Funded by the USA REFLEX program (2005-2007) • Partners: • Universities • Princeton, Manchester, UPC, UB • Companies • Articulate Software, Irion • Description: • Black et al, 2006 • Elkateb et al, 2006 • Rodríguez et al, 2008a • Rodríguez et al, 2008b
Introduction & motivation: AWN • Objectives • 10,000 synsets, including some domain-specific data • linked to PWN 2.0 • finally to PWN 3.0 • linked to SUMO • + 1,000 NEs • manually built (or revised) • vowelized entries • including the root of each entry
Introduction & motivation: AWN • Current figures Named entities:
Introduction & motivation: NEs • Importance of NEs for NLP tasks & applications • Mention detection, Coreference resolution, Textual Entailment, ... • IR, Q&A, Summarization, ... • Lack of sufficient coverage in WN (and AWN) • Additional sources • The Web • Wikipedia
Introduction & motivation: Wikipedia • Importance of Wikipedia • Size • English: 2 683 000+ articles • Deutsch: 847 000+ • Español: 431 000+ • Français: 746 000+ • Italiano: 527 000+ • Português: 449 000+ • ... • > 200 languages • Collaborative effort • Exponential growth
Introduction & motivation: Wikipedia • The Arabic version (AWP) • has over 65,000 articles (about 1% of the total size of WP) • Among all languages, Arabic ranks 29th, just above Serbian and Slovenian. • AWP is growing very fast (more than 100% over the last year)
Collecting NEs in AWN • Objectives • 1,000 synsets • variety of types (locations, persons, organizations, ... ) • Approach • Selection of the candidates • Manual validation.
Collecting NEs in AWN • Selection of the candidates • sources • GEONAMES • FAO • NMSU Arabic/English lexicon
Collecting NEs in AWN • Selection of the candidates • Identifying synsets corresponding to instances • Obtaining the generic types • 371 generic types • such as 'capitals', 'cities', 'countries', 'inhabitants' or 'politicians' • Filter out those not linked to AWN • Obtaining NMSU entries corresponding to the variants in instance synsets • Formatting and merging the results of the three sources
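The selection steps on this slide can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the `INSTANCE_OF` table and the `AWN_LINKED` set are made-up stand-ins for the PWN instance relations and for the generic types that already have an AWN counterpart.

```python
from collections import defaultdict

# Hypothetical toy data: PWN instance -> its generic type (class synset).
INSTANCE_OF = {
    "Cairo": "capital",
    "Damascus": "capital",
    "Nile": "river",
    "Averroes": "philosopher",
}

# Hypothetical set of generic types that are linked to AWN.
AWN_LINKED = {"capital", "river"}


def group_by_generic_type(instance_of):
    """Group instance names under their generic type."""
    groups = defaultdict(list)
    for instance, generic in instance_of.items():
        groups[generic].append(instance)
    return dict(groups)


def filter_linked(groups, linked_types):
    """Drop generic types that are not linked to AWN; sort the instances."""
    return {g: sorted(names) for g, names in groups.items() if g in linked_types}


candidates = filter_linked(group_by_generic_type(INSTANCE_OF), AWN_LINKED)
```

In this toy run 'philosopher' is filtered out because it has no AWN link, so only the instances under 'capital' and 'river' remain as candidates.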
A fragment of GEONAMES database
Collecting NEs in AWN • Manual validation • Deciding the acceptance or rejection of the pair. • Modifying Arabic form if needed. • Adding diacritics. • Completing attachments to PWN2.0 if possible.
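When a validator adds diacritics, it can help to check that the vowelized form still has the same consonantal skeleton as the unvowelized candidate. The sketch below is an assumption about how such a check could look, not something described in the slides. It relies on the fact that the Arabic short-vowel signs are Unicode combining marks (category `Mn`); the helper names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn), which include
    the Arabic short-vowel signs, shadda and sukun."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def same_skeleton(a, b):
    """True if two forms are identical once diacritics are removed."""
    return strip_diacritics(a) == strip_diacritics(b)


# "Muhammad" with diacritics (damma, fatha, shadda + fatha) and without.
vowelized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
plain = "\u0645\u062D\u0645\u062F"
```

Note that this also strips combining marks in other scripts, which is acceptable here because the input is assumed to be an Arabic word form.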
Collecting NEs in AWN • Results • 1,147 synsets • 1,659 variants • 31 generic types.
Collecting NEs in AWN
Collecting NEs from Wikipedia • Using Wikipedia for NLP tasks • see the tutorial on my page: • http://www.lsi.upc.edu/horacio • ... • multilingual tasks • using Interwiki links • Richman and Schone, 2008 • Ferrández et al, 2007 • ... • software • Iryna Gurevych's (U. Darmstadt) JWPL system • ...
Collecting NEs from Wikipedia • Crude approach: • English NE -> Arabic interwiki link -> Arabic NE • But ... • Which English NEs have to be looked for? • How to deal with polysemy? • vowelization (recovering diacritics)
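The crude approach amounts to a lookup along the interwiki link. A minimal sketch, assuming the English Wikipedia has already been reduced to an in-memory table; the `INTERWIKI` dictionary and both function names are hypothetical, and no real Wikipedia API is used:

```python
# Hypothetical miniature of EWP: article title -> {language code: linked title}.
INTERWIKI = {
    "Cairo": {"ar": "القاهرة", "es": "El Cairo"},
    "Nile": {"ar": "النيل"},
    "Ohio": {"es": "Ohio"},  # no Arabic link in this toy table
}


def arabic_title(english_ne, interwiki=INTERWIKI):
    """Return the Arabic title linked from the English article, or None."""
    return interwiki.get(english_ne, {}).get("ar")


def collect(english_nes, interwiki=INTERWIKI):
    """English NE -> Arabic interwiki link -> Arabic NE, skipping misses."""
    pairs = {}
    for ne in english_nes:
        ar = arabic_title(ne, interwiki)
        if ar is not None:
            pairs[ne] = ar
    return pairs


pairs = collect(["Cairo", "Ohio", "Nile", "Atlantis"])
```

The lookup silently drops NEs with no article or no Arabic link, and it does nothing about the three open questions on the slide: which NEs to look up, polysemous titles, and missing diacritics.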
Collecting NEs from Wikipedia • Our approach: • Which English NEs have to be looked for? • Same approach used in building AWN • How to deal with polysemy? • use of disambiguation pages when available in EWP • comparing (using the Vector Space Model) with: • the set of variants (senses) of each generic type • the set of words occurring in the gloss (after removing stopwords and examples) • the topic signature • vowelization (recovering diacritics) • comparison with other interwiki links
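The polysemy step compares each candidate article with a context built from the synset. A minimal Vector Space Model sketch is given below, assuming plain bag-of-words vectors and cosine similarity. The slides do not give the actual weighting scheme, so the stopword list, the function names and the example texts are all illustrative assumptions.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"the", "of", "a", "in", "and", "is", "on"}


def bag(text):
    """Lowercased bag of words with stopwords removed."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_sense(context, candidates):
    """Pick the candidate article whose text is closest to the context."""
    ctx = bag(context)
    return max(candidates, key=lambda title: cosine(ctx, bag(candidates[title])))


gloss = "capital city of Egypt on the Nile"
candidates = {
    "Cairo": "Cairo is the capital city of Egypt on the Nile river",
    "Cairo, Illinois": "Cairo is a city in Illinois United States",
}
chosen = best_sense(gloss, candidates)
```

The same `cosine` function could be applied to the other two contexts listed on the slide (the variants of the generic type and the topic signature) by swapping the text passed as `context`.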
Our approach
Results • Our approach: • We started with 16,873 English NEs occurring as instances in PWN2.0 • Of these, 14,904 also occur in EWP as article titles, a good coverage (88%) • 3,854 Arabic words corresponding to 2,589 English synsets were recovered following our approach. The coverage (26%) is high given the small size of AWP • Of the recovered synsets, only 496 belonged to the set of NEs already included in AWN.
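The percentages on this slide can be re-derived from the counts. In the sketch below, reading the 26% figure as recovered Arabic words over matched EWP titles is an assumption (it is the ratio that reproduces the reported value), and the count of synsets not yet in AWN is derived here rather than stated in the slides.

```python
pwn_instances = 16_873    # English NEs occurring as instances in PWN2.0
ewp_titles = 14_904       # of those, also found as EWP article titles
arabic_words = 3_854      # Arabic words recovered
english_synsets = 2_589   # English synsets they correspond to
already_in_awn = 496      # recovered synsets already present in AWN

ewp_coverage = ewp_titles / pwn_instances       # rounds to 88%
awp_coverage = arabic_words / ewp_titles        # rounds to 26% (assumed reading)
new_synsets = english_synsets - already_in_awn  # synsets not yet in AWN

print(f"EWP coverage: {ewp_coverage:.0%}")
print(f"AWP coverage: {awp_coverage:.0%}")
print(f"Synsets not yet in AWN: {new_synsets}")
```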
Results • Our approach: • Automatic evaluation • Of the 496 synsets included in both sets, 464 were the same and 32 differed • 93.4% accuracy • Manual validation • Of the 3,854 proposed assignments, 3,596 (93.3%) were considered correct, 67 (1.7%) were considered wrong and 191 (5%) were not known
Conclusions • We have presented an approach for automatically attaching Arabic NEs to English NEs using AWN, PWN, AWP and EWP as knowledge sources • The system is fully automatic, quite accurate, and has been applied to a substantial enrichment of the NE set in AWN
Thank you for your attention