Pair Hidden Markov Model for Named Entity Matching

Peter Nabende, Jörg Tiedemann, John Nerbonne
Department of Computational Linguistics, Center for Language and Cognition Groningen, University of Groningen, Netherlands
{p.nabende, j.tiedemann, j.nerbonne}@rug.nl

Introduction
• There are three types of named entities: entity names, temporal expressions, and number expressions
• Entity names refer to organization, person, and location names
• Many entity names exist across different languages, and they need proper handling
• Bilingual lexicons cover only a very small percentage of entity names
• An MT system would perform poorly on unseen entity names that have translations or transliterations in a target language
• In addition to MT, similarity measurement between cross-lingual entity names is important for cross-language information retrieval (CLIR) and cross-language information extraction (CLIE) applications

Recent Work on Named Entity Matching
• Recent work divides into approaches that consider phonetic information and those that do not
• Lam et al. (2007) argue that many NE translations involve both semantic and phonetic clues
• Their approach is formulated as a bipartite weighted graph matching problem
• Hsu et al. (2006) measure the similarity between two transliterations by comparing physical sounds
• They use a Character Sound Comparison (CSC) method that involves constructing a speech sound similarity database, followed by a recognition stage
• Pouliquen et al. (2006) compute similarity between pairs of names using letter n-gram similarity, without using phonetic transliterations
• We propose the use of a pair-HMM, which has been used successfully for Dutch dialect similarity measurement by Wieling et al. (2007) and for word similarity by Mackay and Kondrak (2005)

pair-HMM
• The pair-HMM belongs to a family of models called Hidden Markov Models (HMMs)
• The pair-HMM originates from work on biological sequence analysis by Durbin et al. (1998)
• The difference from standard HMMs lies in the observation of a pair of sequences, or a pairwise alignment, instead of a single sequence (Fig. 1 and Fig. 2)
[Figure: a state sequence q_{t-1}, q_t, q_{t+1} with one observation o_{t-1}, o_t, o_{t+1} per state]
Fig. 1: An instantiation of a standard HMM
[Figure: a state sequence q_{t-1}, q_t, q_{t+1} with a pair of observations (o^1_t, o^2_t) per state]
Fig. 2: An instantiation of a pair-HMM

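pair-HMM used in previous NLP Tasks
[Figure: state diagram with a match state M (emitting the pair x_i, y_j), gap states X (emitting x_i) and Y (emitting y_j), and an END state; transitions are labelled with the parameters δ, ε, λ, τ_M, and τ_XY]
Fig. 3: pair-HMM used in previous work (Wieling et al. (2007); Mackay and Kondrak (2005))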

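Proposed pair-HMM
[Figure: the same state diagram with states M, X, Y, and END, but with separate transition parameters for each gap state: δx, εx, λx, τx for X; δy, εy, λy, τy for Y; and τm for the transition from M to END]
Fig. 4: Diagram illustrating proposed modifications to parameters of the pair-HMM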
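The proposed parameterization can be read as a transition table in which each row sums to one. The sketch below is only an illustration under stated assumptions: the assignment of parameters to edges is inferred from the labels in Fig. 4 (in particular, the M self-loop is assumed to be 1 - δx - δy - τm), the function name `proposed_transitions` is hypothetical, and any numeric values passed in are made up rather than trained.

```python
def proposed_transitions(dx, dy, ex, ey, lx, ly, tx, ty, tm):
    """Build a transition table for states M, X, Y (END is absorbing).

    Assumed reading of Fig. 4, not taken from the authors' software:
      M -> X: dx, M -> Y: dy, M -> END: tm, M -> M: remainder
      X -> X: ex, X -> Y: lx, X -> END: tx, X -> M: remainder
      Y -> Y: ey, Y -> X: ly, Y -> END: ty, Y -> M: remainder
    """
    trans = {
        "M": {"M": 1 - dx - dy - tm, "X": dx, "Y": dy, "END": tm},
        "X": {"M": 1 - ex - lx - tx, "X": ex, "Y": lx, "END": tx},
        "Y": {"M": 1 - ey - ly - ty, "X": ly, "Y": ey, "END": ty},
    }
    # Every entry must be a valid probability; the remainders guarantee
    # that each row sums to one, so only negativity needs checking.
    for state, row in trans.items():
        if min(row.values()) < 0:
            raise ValueError(f"parameters for state {state} exceed total mass 1")
    return trans
```

Setting dx = dy, ex = ey, lx = ly, and tx = ty collapses this table back to the symmetric model of Fig. 3.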

Name Matching using pair-HMM
• The pair-HMM is used to compute a similarity score for a pair of input strings
• The similarity scores can be used for different purposes
• In this paper, they are used to identify pairs of highly similar strings
• The model uses the values of initial, transition, and emission parameters, which are determined through a training process
• The example on the next slide illustrates the parameters required for computing a similarity score

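Name Matching using pair-HMM
TABLE 1: Illustration of an alignment between representations of the same name in different languages [table not preserved; the alignment implied by the equation below is p:п, e:ё, t:т, e:_, r:р]
• The equation illustrates the parameters needed to calculate the score for this alignment, where M0 is the initial probability of the match state, e(a:b) is the emission of the symbol pair a:b, and S1-S2 is the transition from state S1 to state S2:
score = P(M0) * P(e(p:п)) * P(M-M) * P(e(e:ё)) * P(M-M) * P(e(t:т)) * P(M-X) * P(e(e:_)) * P(X-M) * P(e(r:р)) * P(M-END)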
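The product above can be spelled out in a few lines of code. This is a minimal sketch, not the authors' implementation: the function name `alignment_score`, the dictionary layout, and every probability value are hypothetical and chosen only to make the arithmetic visible.

```python
def alignment_score(alignment, initial, trans, emit):
    """Score one fixed alignment as initial * emissions * transitions.

    alignment: list of (state, source_symbol, target_symbol); "_" marks a gap.
    initial:   probability of starting in each state.
    trans:     probability for each (from_state, to_state), including "END".
    emit:      probability for each (source_symbol, target_symbol) pair.
    """
    states = [state for state, _, _ in alignment]
    score = initial[states[0]]
    for i, (state, x, y) in enumerate(alignment):
        score *= emit[(x, y)]
        # The last aligned position transitions into END.
        nxt = states[i + 1] if i + 1 < len(states) else "END"
        score *= trans[(state, nxt)]
    return score


# Alignment from the slide: p:п, e:ё, t:т, e:_, r:р (values below are made up).
peter = [("M", "p", "п"), ("M", "e", "ё"), ("M", "t", "т"),
         ("X", "e", "_"), ("M", "r", "р")]
initial = {"M": 0.9}
trans = {("M", "M"): 0.8, ("M", "X"): 0.05, ("X", "M"): 0.7, ("M", "END"): 0.1}
emit = {("p", "п"): 0.05, ("e", "ё"): 0.01, ("t", "т"): 0.06,
        ("e", "_"): 0.02, ("r", "р"): 0.05}
peter_score = alignment_score(peter, initial, trans, emit)
```

Because the score is a product of many small probabilities, it shrinks quickly with name length, which is one reason log versions of the scoring algorithms are useful.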

Parameter estimation for pair-HMMs
• Arribas-Gil et al. (2006) reviewed different parameter estimation approaches for pair-HMMs:
• numerical maximization approaches, and
• the Expectation Maximization (EM) algorithm with its variants (Stochastic EM, Stochastic Approximation EM)
• An EM approach using the Baum-Welch algorithm had already been implemented, and it is retained for the pair-HMMs adapted in this work

pair-HMM training software
• The pair-HMM training software of Wieling et al. (2007) was adapted
• The software was modified to support a different alphabet for each language
• Alphabets are generated automatically from the data set that is to be used for training
• For the English-Russian dataset, we obtained 76 symbols for the English alphabet and 61 symbols for the Russian alphabet
• For the English-French dataset, we obtained 57 symbols for both languages
• A further modification to Wieling's version of the training software reduced the number of files needed to hold the names used for training
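Generating the alphabets from the training data amounts to collecting the distinct symbols on each side of the name pairs. The sketch below assumes one symbol per Unicode character and no case folding; the function name `build_alphabets` is hypothetical, and the adapted software may tokenize differently.

```python
def build_alphabets(name_pairs):
    """Collect the distinct symbols seen on each side of the name pairs.

    name_pairs: iterable of (source_name, target_name) strings.
    Returns two sorted lists: the source alphabet and the target alphabet.
    """
    source_symbols, target_symbols = set(), set()
    for source_name, target_name in name_pairs:
        source_symbols.update(source_name)   # adds each character
        target_symbols.update(target_name)
    return sorted(source_symbols), sorted(target_symbols)
```

For example, `build_alphabets([("peter", "пётр"), ("anna", "анна")])` yields six Latin symbols and six Cyrillic symbols. Keeping upper-case letters, diacritics, and punctuation as separate symbols would be one way an alphabet ends up larger than the basic letter inventory of a language.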

pair-HMM Training Data
• Training data comprises pairs of names from two different languages
• English-French and English-Russian name pairs were obtained from the GeoNames data dump and the Wikipedia data dump
• Full names containing spaces were not used as single items; they were split, and each part was paired with its corresponding match in the other language
• For the entity name matching task, 850 distinct English-French name pairs and 5902 distinct English-Russian name pairs were extracted
• For English-French, 600 pairs were used for training (282 iterations)
• For English-Russian, 4500 pairs were used for training (848 iterations)

pair-HMM scoring algorithms
• Two algorithms implemented in the pair-HMM have been used for scoring:
• Forward algorithm: takes all possible alignments into account to calculate the probability of the observation sequence given the model
• Viterbi algorithm: considers only the best alignment when calculating the probability of the observation sequence given the model
• There are also log versions of the two algorithms, which compute the log of the probability of the observation sequence
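The two algorithms share the same dynamic program and differ only in how they combine incoming paths: the Forward algorithm sums over them, while Viterbi keeps the maximum. The sketch below follows the standard pair-HMM recurrences described by Durbin et al. (1998), adapted to use explicit initial probabilities; it is not the authors' code, and the function name `pair_hmm_score` and the dictionary layout are assumptions made for illustration.

```python
def pair_hmm_score(x, y, init, trans, emit_m, emit_x, emit_y, combine=sum):
    """Score strings x and y; combine=sum gives Forward, combine=max gives Viterbi.

    init:   start probability per state ("M", "X", "Y").
    trans:  trans[from_state][to_state], with "END" as a possible target.
    emit_m: probability of an aligned pair (a, b) in state M.
    emit_x: probability of a symbol from x against a gap (state X).
    emit_y: probability of a symbol from y against a gap (state Y).
    """
    n, m = len(x), len(y)
    states = ("M", "X", "Y")
    # How many symbols of x and y each state consumes.
    step = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}
    f = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in states}
    for i in range(n + 1):
        for j in range(m + 1):
            for s in states:
                pi, pj = i - step[s][0], j - step[s][1]
                if pi < 0 or pj < 0:
                    continue  # state s cannot end at cell (i, j)
                if s == "M":
                    e = emit_m.get((x[i - 1], y[j - 1]), 0.0)
                elif s == "X":
                    e = emit_x.get(x[i - 1], 0.0)
                else:
                    e = emit_y.get(y[j - 1], 0.0)
                if pi == 0 and pj == 0:
                    incoming = init[s]  # first emission of the alignment
                else:
                    incoming = combine(f[r][pi][pj] * trans[r][s] for r in states)
                f[s][i][j] = e * incoming
    # Finish by moving from the last emitting state into END.
    return combine(f[s][n][m] * trans[s]["END"] for s in states)
```

Calling the function with `combine=sum` and `combine=max` on the same inputs gives the Forward and Viterbi scores respectively, and the Forward score can never be smaller than the Viterbi score. A log version would replace the products with sums of log-probabilities and the sum with a log-sum-exp.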

Evaluation Measures
• Two measures have been considered for evaluating the pair-HMM algorithms: Average Reciprocal Rank (ARR) and Cross Entropy (CE)
Average Reciprocal Rank (ARR)
• Equations for Average Rank (AR) and ARR follow Voorhees and Tice (2000) [equations not preserved]
TABLE 2: Ranking example after using the Forward-log algorithm [table not preserved]
• The computation of ARR, however, depends on the complexity of the evaluation set
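Since the slide's equations are not preserved, the sketch below uses the commonly cited definitions, which is an assumption about what the slide showed: AR is the mean of the ranks at which the correct match appears, and ARR is the mean of the reciprocals of those ranks. The function names are hypothetical.

```python
def average_rank(ranks):
    """Mean rank of the correct match over all test names (lower is better)."""
    return sum(ranks) / len(ranks)


def average_reciprocal_rank(ranks):
    """Mean of 1/rank over all test names (1.0 means always ranked first)."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For three test names whose correct matches are ranked 1, 2, and 4, AR is 7/3 and ARR is (1 + 0.5 + 0.25) / 3. ARR rewards top-ranked matches heavily and is barely affected by how far down a badly ranked match falls, whereas AR is sensitive to exactly that.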

Evaluation Measures
Cross Entropy (CE)
• Used to compare the effectiveness of different language models
• Useful when we do not know the actual probability distribution that generated some data
• For the pair-HMM, CE is specified by one equation and then approximated by a second [equations not preserved]
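The slide's two equations are not preserved, so the following is only a sketch of a common empirical approximation, assumed rather than taken from the source: the cross entropy is estimated as the average negative log-probability that the model assigns to the held-out name pairs. The function name `approx_cross_entropy`, the base-2 logarithm, and the per-pair normalization are all assumptions; the authors may normalize differently, for example per symbol.

```python
import math


def approx_cross_entropy(pair_probs):
    """Average negative log2 probability of the test pairs under the model.

    pair_probs: model probabilities m(x_i, y_i) for each held-out pair,
    for example Forward or Viterbi scores; each must be greater than zero.
    """
    return -sum(math.log2(p) for p in pair_probs) / len(pair_probs)
```

A lower value means the model assigns higher probability to the held-out pairs. For instance, probabilities 0.5 and 0.25 give (1 + 2) / 2 = 1.5 bits per pair.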

Results
ARR Results
TABLE 3: ARR results for English-French data [table not preserved]
TABLE 4: ARR results for English-Russian data [table not preserved]

Results
• The ARR results show no significant difference between the accuracy of the two algorithms
Cross Entropy Results
TABLE 5: Cross entropy results for English-Russian data [table not preserved]
• The cross entropy results likewise show no significant difference in the accuracy of the Viterbi and Forward algorithms

Conclusion
• A pair-HMM has been introduced for matching entity names
• The evaluation carried out so far is not sufficient to draw firm conclusions about the performance of the pair-HMM
• The results show no significant differences between the Viterbi and Forward algorithms
• However, the ARR results from the experiments are encouraging
• It is feasible to use the pair-HMM in the generation of transliterations

Future Work
• It would be interesting to create other structures associated with the pair-HMM, for example to incorporate contextual information
• The pair-HMM needs to be evaluated against other models
• Alignment-based discriminative string similarity, as proposed by Bergsma and Kondrak (2007) for cognate identification, will be considered for this task

Thanks! Questions?

References
• W. Lam, S-K. Chan and R. Huang, “Named Entity Translation Matching and Learning: With Application for Mining Unseen Translations,” ACM Transactions on Information Systems, vol. 25, issue 1, article 2, 2007.
• C-C. Hsu, C-H. Chen, T-T. Shih and C-K. Chen, “Measuring Similarity between Transliterations against Noise and Data,” ACM Transactions on Asian Language Information Processing, vol. 6, issue 2, article 5, 2005.
• M. Wieling, T. Leinonen and J. Nerbonne, “Inducing Sound Segment Differences using Pair Hidden Markov Models,” in J. Nerbonne, M. Ellison and G. Kondrak (eds.), Computing and Historical Phonology: 9th Meeting of ACL Special Interest Group for Computational Morphology and Phonology Workshop, Prague, pp. 48-56, 2007.
• W. Mackay and G. Kondrak, “Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models,” Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL), pp. 40-47, Ann Arbor, Michigan, 2005.
• R. Durbin, S.R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Protein and Nucleic Acids. Cambridge University Press, 1998.
• A. Arribas-Gil, E. Gassiat and C. Matias, “Parameter Estimation in Pair-hidden Markov Models,” Scandinavian Journal of Statistics, vol. 33, issue 4, pp. 651-671, 2006.
• E.M. Voorhees and D.M. Tice, “The TREC-8 Question Answering Track Report,” in Eighth Text Retrieval Conference (TREC-8), 2000.
• C-J. Lee, J.S. Chang and J-S.R. Juang, “A Statistical Approach to Chinese-to-English Back Transliteration,” in Proceedings of the 17th Pacific Asia Conference, 2003.
• S. Bergsma and G. Kondrak, “Alignment-Based Discriminative String Similarity,” in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 656-663, Prague, Czech Republic, June 2007.
