150 likes | 300 Views
Identification of bilingual named entities from Wikipedia using a pair Hidden Markov Model. Peter Nabende Alfa Informatica, Center for Language and Cognition Groningen, University of Groningen email: p.nabende@rug.nl. Introduction.
E N D
Identification of bilingual named entities from Wikipedia using a pair Hidden Markov Model Peter Nabende Alfa Informatica, Center for Language and Cognition Groningen, University of Groningen email: p.nabende@rug.nl The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 2009)
Introduction • Growing availability of comparable text on Wikipedia has led to the development of techniques to automatically extract new bilingual terms including named entities from Wikipedia (Erdmann et al., 2008) • Extraction of matching bilingual named entities is important for enriching bilingual dictionaries and providing lexicons for various NLP applications including MT, CLIR, and CLIE • Apart from extracting named entity pairs from Wikipedia Link structure (Adafre and Rijke (2006); Bouma et al. (2006); Erdmann et al., 2008), unlinked article texts in Wikipedia are also used as a major source for extracting bilingual named entities The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Introduction • A pair-HMM is used to simplify similarity estimation between candidate named entities for languages using different alphabets, hence enabling simplified extraction of matching transliterations • Added applications for this work include: • Identification of transliteration pairs for linking in Wikipedia. A third of the pairs from article pairs on “Yoweri Museveni” were not linked • Identification of confusable drug names (Kondrak and Dorr, 2004) The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Proposed Identification Approach Fig. 1: Approach for identification of bilingual named entities from Wikipedia Inter language link Source Language (SL) Wikipedia Article Corresponding Target Language (TL) Article SL candidate named entities TL candidate named entities Similarity Estimation Extract first n or n% highest ranking pairs Bilingual named entities The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
xi X 1-εx- τx –λx εx xi yj δx λx τx M τm END λy τy 1-δx- δx-τm δy Y 1-εy- τY –λy εy yj Similarity Estimation • Figure 2 shows the pair-HMM used for similarity estimation Fig.2 pair-HMM for similarity estimation The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
f n j e f e r s o M Y M Y M M M M M M M н а д ж е ф ф е р с о Similarity Estimation • Example illustrating an alignment from which similarity is estimated using the pair-HMM between the English name “jefferson” and its Russian transliteration “джефферсона” Fig.3:Illustration of alignment for an english name “jefferson” and its Russian transliteration following the pair-HMM concept The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Similarity Estimation parameters • Two sets of parameters are estimated for the pair-HMM: • Transition parameters • Emission parameters • Starting parameters are derived from the parameters associated with moving from the match state • The Baum-Welch algorithm is used for estimating the parameters of the pair-HMM The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Experimental Setup • Training Data • Language pair: English-Russian • Source of the data: English Wikipedia database dump download on 8th Dec 2008 • Training data was extracted using the Wikipedia inter-language links. The search was restricted to only person names existing in the English Wikipedia infoboxes • In total 6596 pairs were extracted person names with their corresponding Russian representations. • Alphabets used by the pair-HMM were obtained from the extracted collection of person names for each language. For English, 95 characters were used and for Russian, 72 characters were used The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Experimental Setup • Test Data • Five online English Wikipedia articles with the titles “Abraham Lincoln”, “Dennis Bergkamp”, “Guus Hiddink”, “Marat Safin”, “Yoweri Museveni” and corresponding (Interlanguage) Russian Wikipedia articles were identified • Transliteration candidates were then extracted from the article pairs and prepared for input to the pair-HMM. In total, 2101 English candidates and 1310 candidates were used • Matching transliterations from each of the article pairs were hand-picked to form a gold standard set with 200 transliteration pairs, a third of these were variations The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Experiment • Testing the pair-HMM • Two methods were investigated for computing the similarity scores: the forward algorithm and forward log-odds algorithm • The forward log-odds algorithm normalizes the estimation made by the forward algorithm through comparison to a random model • Results • The measure used for evaluating the pair-HMM is Precision, p The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Results • Table 1 illustrates sample results for precision at the 1st rank Table 1: Sample results for precision at 1st rank The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 2009)
Results Table 2: Precision values after forward and forward log-odds algorithm for N = 200 • Forward log-odds algorithm performs better than the forward algorithm • Forward log-odds did well even for differently ordered named entities at higher ranks The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
Conclusion • Precision values show a promising application of the pair-HMM in extracting bilingual named entities most importantly for languages that use different alphabets • There is need to evaluate other structures of the pair HMM, for example by either increasing or reducing the number of nodes in the model to determine if there are improvements • For named entities that are ordered differently between two languages, a more sophisticated model or added component for representing such a problem will be required The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
THANKS ! Questions? The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)
References • Maike Erdmann, Kotaro Nakayama, Takahiro Hara and Shojiro Nishio. 2008. An Approach for Extracting Bilingual Terminology from Wikipedia. In J.R. Haritos, R Kotagiri, and V. Pudi (Eds.): DAFSAA 2008, LNCS 4947, pp. 380-392. • S.F. Adafre and de M. Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the EACL Workshop on NEW TEXT Wikis and blogs and other dynamic text sources, pp. 62-69. • G. Bouma, I. Fahmi, J.Mur, van G. Noord, van der L. Plas, and J. Tiedemann: The University of Groningen at QA@CLEF 2006 Using Syntactic Knowledge for QA, In Working Notes for the Cross Language Evaluation Forum Workshop. • T. Declerck, A.G. Perez, O. Vela, Z. Ganter, and D. Manzano-Macho: Multilingual Lexical Semantic Resources for Ontology Translation, In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 1492-1495. • Jorg Tiedemann. 1999. Automatic construction of Weighted String Similarity Measures. In EMNLP-VLC, pages 213-219. • Grzegorz Kondrak and Bonnie J. Dorr. 2004. Identification of Confusable Drug Names: A new Approach and Evaluation Methodology. In Proceedings of COLING 2004, pages 952-958. The 19th Meeting of Computational Linguistics in the Netherlands (CLIN 19)