Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations

Advisor: Dr. Hsu • Graduate: Kuo-min Wang • Authors: Wai Lam, Ruizhang Huang, Pik-Shan Cheung • 2004 ACM

Outline • Motivation • Objective • Introduction • Named entity matching model • Phonetic matching model • Learning phonetic similarity • Experiments on named entity matching model • Mining new entity translations from news • Experiments on mining new translations • Conclusions • Personal Opinion

Motivation • Many existing systems dealing with cross-language documents make use of bilingual dictionaries. • In all these systems, a fixed dictionary is used throughout the process, implying that only terms that exist in the dictionary can be handled.

Objective • We propose a novel named entity matching model which considers both semantic and phonetic clues. • We also develop a mining framework for discovering new, unseen named entity translations from online daily web news.

Introduction • Many existing systems dealing with cross-language documents encounter difficulties when they process new or unseen terms, which are especially common among named entities. • Our approach exploits similarity at the phoneme level. • We investigate three learning algorithms for obtaining the similarity information of basic phoneme units based on a set of training data.

Introduction (cont.) • This framework processes comparable news in different languages based on an unsupervised learning technique using an existing bilingual dictionary. • A major advantage of our proposed approach is that it analyzes both semantic and phonetic information and formulates the problem as a number of optimization models.

Named entity matching model • The objective of our named entity matching model is to compute the similarity between two given named entities written in two languages. • [Figure: named entity matching model. Input: a pair of entities. Components: tokenization process, phonetic matching model, hybrid semantic and phonetic matching. Tokenization steps: 1. look up the English terms in the LDC dictionary; 2. generate a set of Chinese entity translations; 3. scan to find the matched Chinese entity; 4. group adjacent terms that are not involved in the dictionary mapping.]

Named entity matching model (cont.) • Problem nature • Given a pair of named entities that are translations of each other, it is common for part of the entity to be matched on semantic clues and the remaining part on phonetic clues. • Example • English entity “University of Akron”, Chinese entity “阿克倫大學” • Semantic clues: we can match the term “University” with “大學” • Phonetic clues: we can match the term “Akron” with “阿克倫”

Named entity matching model (cont.) • Matching model investigation • An English entity E represented by terms <t1,…,tm0> and a Chinese entity C represented by Chinese characters <s1,…,sn0> • Let the matched word segments be represented as • Let the phonetically matched word segments be represented as

Named entity matching model (cont.) • The objective is to find a set of mappings between English terms and Chinese word segments such that the total weight is maximized. • Example • English terms E • <“Palo”, “Alto”, “Chamber”, “of”, “Commerce”> • Chinese entity C • <“帕”, “洛”, “阿”, “爾”, “托”, “商”, “會”>

Named entity matching model (cont.) • Tokenization • If the degree of this maximal matching reaches or exceeds a certain threshold, the word segment is treated as a separate token. • Example • “Commerce” matches the segment “商” with degree p = 商 (1 character) / 商業 (2 characters) = 0.5
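The degree computation in the example can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the degree is the length of the longest piece of a dictionary translation found in the Chinese entity, divided by the length of that translation, and the threshold value `THRESHOLD` is hypothetical.

```python
def longest_common_piece(entity: str, translation: str) -> str:
    """Return the longest substring of `translation` that occurs in `entity`."""
    best = ""
    for start in range(len(translation)):
        for end in range(start + 1, len(translation) + 1):
            piece = translation[start:end]
            if piece in entity and len(piece) > len(best):
                best = piece
    return best


def matching_degree(entity: str, translation: str):
    """Return (matched segment, degree), where degree is the covered fraction."""
    if not translation:
        return "", 0.0
    segment = longest_common_piece(entity, translation)
    return segment, len(segment) / len(translation)


THRESHOLD = 0.5  # hypothetical value; the slides do not state the threshold

# "商業" is taken to be the dictionary translation of "Commerce" (assumption
# read off the slide's example).
segment, degree = matching_degree("帕洛阿爾托商會", "商業")
is_separate_token = degree >= THRESHOLD
print(segment, degree, is_separate_token)
```

With this reading, “商” covers one of the two characters of “商業”, which reproduces the p = 0.5 of the slide.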

Named entity matching model (cont.) • Tokenization • Group adjacent terms that are not involved in the dictionary mapping. • Example: 帕洛阿爾托商會 • “Palo” and “Alto” are grouped into “Palo Alto” • “帕洛阿爾托” is a single token • Chinese tokens: “帕洛阿爾托”, “商”, “會” • English tokens: “Palo Alto”, “Chamber”, “of”, “Commerce”
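The grouping step can be sketched as below. This is an illustrative sketch only: the function name `group_unmapped` and the idea of passing the dictionary-mapped terms as a set are assumptions, and which terms count as mapped is chosen here to reproduce the slide's example.

```python
def group_unmapped(terms, mapped, joiner=" "):
    """Merge each run of adjacent terms that are not dictionary-mapped.

    `mapped` is the set of terms that took part in a dictionary mapping
    (a simplification: repeated terms are not distinguished by position).
    """
    tokens, run = [], []
    for term in terms:
        if term in mapped:
            if run:  # close the current run of unmapped terms
                tokens.append(joiner.join(run))
                run = []
            tokens.append(term)
        else:
            run.append(term)
    if run:
        tokens.append(joiner.join(run))
    return tokens


english = group_unmapped(
    ["Palo", "Alto", "Chamber", "of", "Commerce"], {"Chamber", "Commerce"}
)
chinese = group_unmapped(list("帕洛阿爾托商會"), {"商", "會"}, joiner="")
print(english)
print(chinese)
```

English terms are joined with a space and Chinese characters with the empty string, so the same routine serves both sides.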

Named entity matching model (cont.) • Hybrid Semantic and Phonetic Matching Algorithm • We can formulate the matching problem via an undirected bipartite weighted graph with vertex set V and edge set L. • Let the English entity, E, be represented as tokens <e1,…,em> and the Chinese entity, C, be represented as tokens <c1,…,cn> • V is set to {VE ∪ VC}, where VE = {e1,…,em} and VC = {c1,…,cn} • Edge construction process • It starts by considering the semantic mapping • Next, we consider the phonetic mapping between tokens

Named entity matching model (cont.) • Hybrid Semantic and Phonetic Matching Algorithm • [Figure: bipartite graph between English tokens and Chinese tokens with edge weights u(ei,cj)] • The edge construction process • First, we consider the semantic mapping as described in the tokenization process. • Next, we consider the phonetic mapping between tokens.
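To make the bipartite formulation concrete, the sketch below searches for the one-to-one mapping with maximum total edge weight. It is a brute-force illustration under stated assumptions: the edge weights in `edges` are invented for the example, and exhaustive search is only practical for a handful of tokens (an assignment algorithm would be the usual choice at scale). The slides do not say which solver the authors use.

```python
from itertools import permutations


def best_matching(english, chinese, weight):
    """Exhaustively find the one-to-one token mapping with maximum total weight.

    `weight` maps (english_token, chinese_token) to the edge weight u(ei, cj);
    pairs that are absent have no edge. The shorter side is padded with None
    so that tokens may stay unmatched.
    """
    size = max(len(english), len(chinese))
    left = list(english) + [None] * (size - len(english))
    right = list(chinese) + [None] * (size - len(chinese))
    best_total, best_pairs = -1.0, []
    for order in permutations(right):
        pairs = [
            (e, c)
            for e, c in zip(left, order)
            if e is not None and c is not None and weight.get((e, c), 0.0) > 0.0
        ]
        total = sum(weight[pair] for pair in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


# Hypothetical edge weights: two semantic edges and one phonetic edge.
edges = {
    ("Chamber", "會"): 1.0,
    ("Commerce", "商"): 0.5,
    ("Palo Alto", "帕洛阿爾托"): 0.75,
}
total, pairs = best_matching(
    ["Palo Alto", "Chamber", "of", "Commerce"], ["帕洛阿爾托", "商", "會"], edges
)
print(total, pairs)
```

In this toy graph “of” has no edge and is left unmatched, while the remaining three English tokens each take their only available partner.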

Phonetic matching model • [Figure: processing flow. Generate a phonetic representation for each term, using the Pin-Yin table for Mandarin (北京話) and the Jyut-Ping table for Cantonese (廣東話); generate basic phoneme units; compare them using the English-Mandarin PPS table and the English-Cantonese PPS table.] • Example • “港” = “gang3” • “爸” = “baa1” • “Beckham” → “bE kx m” • “貝克漢姆” → “bei ke kx m” • A basic phoneme unit consists of a consonant followed by a vowel. • If there is no consonant-vowel pattern, we extract the consonant. If there is no consonant, the vowel is extracted.
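The extraction rule for basic phoneme units can be sketched as follows. The sketch assumes the phonetic representation is already available as a sequence of single phoneme symbols, and the vowel inventory `VOWELS` is hypothetical (it treats “E” and “x” as vowels so that the “Beckham” example works out); neither detail is given in the slides.

```python
VOWELS = {"a", "e", "i", "o", "u", "E", "x"}  # hypothetical vowel inventory


def basic_phoneme_units(phonemes):
    """Split a phoneme sequence into basic units.

    A consonant followed by a vowel forms one unit; a consonant with no
    following vowel, or a vowel with no preceding consonant, stands alone.
    """
    units = []
    i = 0
    while i < len(phonemes):
        current = phonemes[i]
        has_next = i + 1 < len(phonemes)
        if current not in VOWELS and has_next and phonemes[i + 1] in VOWELS:
            units.append(current + phonemes[i + 1])  # consonant-vowel unit
            i += 2
        else:
            units.append(current)  # lone consonant or lone vowel
            i += 1
    return units


beckham_units = basic_phoneme_units(["b", "E", "k", "x", "m"])
print(beckham_units)
```

Under these assumptions the sequence for “Beckham” splits into the three units shown on the slide.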

Phonetic matching model (cont.) • Phonetic Matching Algorithm • We prepare a phoneme pronunciation similarity (PPS) table capturing the pronunciation similarity value between each possible English-Chinese basic phoneme unit pair. • Suppose an English term, A, is represented by the basic phoneme unit sequence <a1,…,ama> and a Chinese term, B, is represented by the basic phoneme unit sequence <b1,…,bmb>
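The slides do not show how the two unit sequences are scored against the PPS table, so the following is only one plausible sketch: an order-preserving alignment, computed by dynamic programming, that maximizes the summed PPS values and normalizes by the longer sequence. The function name, the normalization, the Chinese unit sequence, and the table values are all assumptions made for illustration.

```python
def phonetic_similarity(a_units, b_units, pps):
    """Score two basic phoneme unit sequences with a PPS lookup table.

    best[i][j] holds the highest total similarity obtainable by aligning the
    first i units of A with the first j units of B without reordering.
    """
    rows, cols = len(a_units), len(b_units)
    if rows == 0 or cols == 0:
        return 0.0
    best = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = pps.get((a_units[i - 1], b_units[j - 1]), 0.0)
            best[i][j] = max(
                best[i - 1][j],             # skip a unit of A
                best[i][j - 1],             # skip a unit of B
                best[i - 1][j - 1] + pair,  # align the two units
            )
    return best[rows][cols] / max(rows, cols)


# Hypothetical PPS entries and a hypothetical Chinese unit sequence.
pps_table = {("bE", "bei"): 0.75, ("kx", "ke"): 0.75, ("m", "mu"): 0.5}
score = phonetic_similarity(
    ["bE", "kx", "m"], ["bei", "ke", "han", "mu"], pps_table
)
print(score)
```

Normalizing by the longer sequence penalizes units that find no partner, here the unit “han”.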

Learning phonetic similarity • We investigate several learning algorithms for obtaining the similarity values in the PPS table using a set of training data. • The goal is to obtain PPS table values V such that the phonetic similarity score is as high as possible for each correct name pair.

Learning phonetic similarity (cont.) • The Widrow-Hoff Algorithm • Consider the difference between the computed similarity score Yk and the actual one Zk for the k-th name pair. • If the performance of the latest trained PPS table has not improved for three full iterations, the terminating condition is met.
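As a rough illustration, the textbook Widrow-Hoff (least mean squares) rule moves each weight against the gradient of the squared error (Yk − Zk)². The sketch below applies that rule to PPS table entries, assuming the computed score is a weighted sum of the entries used by a name pair; the feature encoding, learning rate, and values are hypothetical, and the exact update used in the paper is not shown in the slides.

```python
def widrow_hoff_step(table, features, target, rate=0.25):
    """Apply one Widrow-Hoff update for a single training name pair.

    `features` maps a PPS table key to how often that entry is used when
    scoring the pair; `target` is the actual score Zk.
    Returns the score Yk computed before the update.
    """
    predicted = sum(table.get(key, 0.0) * count for key, count in features.items())
    error = predicted - target
    for key, count in features.items():
        # Gradient of (Yk - Zk)^2 with respect to this entry is 2 * error * count.
        table[key] = table.get(key, 0.0) - 2.0 * rate * error * count
    return predicted


pps = {("bE", "bei"): 0.5, ("kx", "ke"): 0.5}
used = {("bE", "bei"): 1.0, ("kx", "ke"): 1.0}
before = widrow_hoff_step(pps, used, target=1.5)
after = sum(pps[key] * count for key, count in used.items())
print(before, after)
```

In this toy case the score is too low before the update, so both entries used by the pair are raised.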

Learning phonetic similarity (cont.) • The Exponentiated-Gradient Algorithm • It processes one training name pair at a time and updates the PPS table entries immediately. • Let • We define as: • The updating formula is given by :
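The formulas on this slide were not preserved. For orientation only, the sketch below shows the standard exponentiated-gradient update, which rescales each weight multiplicatively and then renormalizes so the weights sum to one. Whether the paper uses exactly this form is an assumption, as are the feature encoding, learning rate, and table keys.

```python
import math


def exponentiated_gradient_step(table, features, target, rate=0.5):
    """Apply one exponentiated-gradient update for a single name pair.

    Each entry is multiplied by exp(-2 * rate * error * feature) and the
    table is renormalized to sum to one.
    Returns the score computed before the update.
    """
    predicted = sum(weight * features.get(key, 0.0) for key, weight in table.items())
    error = predicted - target
    scaled = {
        key: weight * math.exp(-2.0 * rate * error * features.get(key, 0.0))
        for key, weight in table.items()
    }
    total = sum(scaled.values())
    for key in table:
        table[key] = scaled[key] / total
    return predicted


weights = {"pair_a": 0.5, "pair_b": 0.5}  # hypothetical PPS entries
predicted = exponentiated_gradient_step(weights, {"pair_a": 1.0}, target=1.0)
print(predicted, weights)
```

Unlike the additive Widrow-Hoff rule, the multiplicative update keeps every entry positive, and the renormalization keeps the table on a fixed scale.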

Learning phonetic similarity (cont.) • The Genetic Algorithm • One way to view the learning problem is to formulate it as an optimization problem as follows: • Each gene in a chromosome corresponds to a particular element in the table. …

Experiments on named entity matching model (cont.) • The first set of experiments is to evaluate the phonetic similarity learning. • The second set of experiments is to evaluate the performance of the overall named entity matching model. • The average reciprocal rank (ARR) is used to measure the performance as follows:
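The ARR formula itself did not survive on this slide. The sketch below uses the common definition, the mean of 1/rank of the correct answer over all test cases, on the assumption that the paper follows it; the ranks are made-up inputs.

```python
def average_reciprocal_rank(ranks):
    """Mean of 1/rank over all test cases.

    `ranks` holds the 1-based rank at which the correct translation appears
    for each test case.
    """
    if not ranks:
        return 0.0
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Hypothetical ranks for three test name pairs.
arr = average_reciprocal_rank([1, 2, 4])
print(arr)
```

A value of 1.0 means the correct translation is always ranked first, and lower values indicate it tends to appear further down the list.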

Mining new entity translations from news • System architecture

Mining new entity translations from news (cont.) • News Preprocessing • Let S be a news story. The story representation comprises four components, namely, the people name component Rp(S), the place name component Rl(S), the organization name component Ro(S), and the content term component Rc(S).

Mining new entity translations from news (cont.) • Gloss Translation • For each Chinese term, we look up a bilingual lexicon for the English translation. • The translated English terms replace the original Chinese terms to represent the story. • Term weights are computed so that more likely translated terms will receive higher weights.

Mining new entity translations from news (cont.) • Event Discovery • An event is also represented by a four-dimensional vector similar to the story representation. Nearest neighbor clustering is used for processing the stories. • We use a kind of cosine similarity measure to compute the similarity between an event and a story.
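The event discovery step can be sketched as below. The slides only say that “a kind of cosine similarity” and nearest neighbor clustering are used, so the details here are assumptions: the similarity is taken as a weighted sum of per-component cosines, an event is represented simply by its first story, and the component weights and threshold are hypothetical.

```python
import math

COMPONENTS = ("people", "places", "organizations", "content")


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def story_event_similarity(story, event, weights):
    """Weighted sum of the cosine similarities of the four components."""
    return sum(
        weights[name] * cosine(story.get(name, {}), event.get(name, {}))
        for name in COMPONENTS
    )


def assign_story(story, events, weights, threshold):
    """Attach the story to its nearest event, or start a new event.

    Returns the index of the event the story belongs to.
    """
    best_index, best_score = None, 0.0
    for index, event in enumerate(events):
        score = story_event_similarity(story, event, weights)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score >= threshold:
        return best_index
    events.append(story)  # simplification: the first story represents the event
    return len(events) - 1


equal_weights = {name: 0.25 for name in COMPONENTS}  # hypothetical weights
events = []
story_a = {"people": {"beckham": 1.0}, "content": {"football": 1.0}}
story_b = {"people": {"beckham": 1.0}, "content": {"football": 1.0}}
story_c = {"content": {"election": 1.0}}
labels = [assign_story(s, events, equal_weights, 0.4) for s in (story_a, story_b, story_c)]
print(labels)
```

The second story shares its people and content terms with the first and joins the same event, while the third shares nothing and opens a new one.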

Mining new entity translations from news (cont.) • Named entity cognate generation • The cognate weight is designed to reflect the importance of the name in the corresponding event. • For a particular English cognate G, the cognate weight, n(pl), of each people name pl in the English named entity cognate is calculated as follows:

Mining new entity translations from news (cont.) • Entity Matching • The matching makes use of the named entity matching model as well as the cognate weight. • For a given Chinese name, the corresponding English names in the cognate are returned, sorted in descending order of their final similarity scores.

Experiments on mining new translations

Conclusions • We have developed a novel named entity matching model which considers both semantic and phonetic information. • The experimental results show that our hybrid model can handle named entity matching in a more flexible and comprehensive way. • We have also applied our named entity matching model to mining new, unseen named entity translations. Translations not found in the dictionary can be effectively discovered.

Personal Opinion • …
