Name Ethnicity Classification and Ethnicity Sensitive Name Matching Pucktada Treeratpituk and C. Lee Giles College of Information Sciences and Technology Penn State University
Outline • Name-Matching & Name-Ethnicity • Problem Definition • Motivation • Previous Work • Ethnicity-Sensitive Name-Matching Framework • Name-Ethnicity Classification • Ethnicity Sensitive Name-Matching • Evaluation • Conclusion
Name Matching • Name matching • Pairwise people disambiguation based only on personal names • Problem: Can name1 and name2 refer to the same person? • Bill Gates = William Henry Gates ? • Mao Zedong = Mao Tse-Tung ? • Lots of applications • NLP, Information Integration, Social Network Analysis, etc. • Name matching is a special case of string matching • In string matching, the objects to match can be • product names, institution names, street addresses • Name matching focuses on just personal names • We want to exploit what makes personal names differ from other types of names to improve the disambiguation result
Name and Ethnicity • What makes personal names different from other types of names? • Personal names are very cultural (ethnicity-dependent) • Ethnicities are often identifiable from names • More importantly, for name matching, the valid variations of a name depend on its ethnicity • English names • Use of nicknames and middle names • William Henry Gates • = Bill Gates, William H. Gates, William Gates
Name and Ethnicity (Cont) • Middle Eastern names • Extensive use of ancestral names • Khalid Bin Hasan Bin Ahmad al-Fulan • Khalid, Son of Hasan, Son of Ahmad, of Fulan family • Khalid Bin Hasan Bin Ahmad al-Fulan • = Khalid Bin Hasan al-Fulan (drop the grandfather's name) • = Khalid al-Fulan (drop both ancestral names) • != Khalid Bin Ahmad al-Fulan (cannot drop only the father's name) • Spanish names • Use composite given names and two surnames (paternal and maternal) • Pedro Juan López Rodríguez = Pedro López (can drop the maternal surname) • Juan Morales Garcia = Juan Morales • != Juan Garcia • William Henry Gates (Bill Gates) • != William Henry (17th-century chemist, Henry’s Law) • For English names, the last surname cannot similarly be dropped
Name and Ethnicity (Cont) • Chinese names • Multiple transliteration standards • Mao Zedong = Mao Tse-tung • Reverse ordering • Li Ming ~ Ming Li (this kind of error is more likely than for English names) • Western nicknames that are close to the original Chinese names are often used • Heung-Yeung Shum = Harry Shum • Segmentation • Heung-Yeung Shum = Heungyeung Shum = HY Shum = H Shum • Li Ka Shing != Li Shing (Ka is not a middle name, thus cannot be dropped) • So a name matching algorithm should be ethnicity sensitive!
Previous Work • Name-Matching • Phonetic-based, e.g. Soundex, Metaphone • Convert name strings to phonetic codes, then compare • Edit-distance-like similarity • Winkler, Jaro-Winkler, Levenshtein, Smith-Waterman • Name-Ethnicity Classification • Frequency-based method (dictionary-based) • Certain names are more common in some ethnic groups, e.g. Rodriguez is a common Spanish last name • LDA-based model using US Census [ICWSM10] • HMM + Decision Tree [KDD09]
Ethnicity-Sensitive Name Matching: Framework • 1. Identify the name-ethnicities e1, e2 of Name 1 and Name 2 with the name-ethnicity classifier • 2. Compute the optimal alignment between the names using an ethnicity-dependent distance function • 3. Generate the feature vector of the alignment profile, f = <x1, x2, …, x7> • 4. Use an ethnicity-dependent name matching model Me1,e2 to compute the match probability based on the alignment profile • Example: Juan Gines Sanchez Moreno and G Lopez Moreno => match probability p = 0.78
Name-Ethnicity Classification • Goal: To infer one’s ethnicity from one’s name • Input: personal name, e.g. Juan Gines Sanchez Moreno • Feature vector F = <f1, f2, f3, …> with 4 types of features • sequences of characters • sequences of phonetic sounds, … • Multiclass classifier: multinomial logistic regression • Output: ethnicity (Chinese, British, German, etc.)
Name-Ethnicity Classification: 4 Feature Types • nonASCII: diacritic characters • Mineichirō Adachi => ō • Adriana Muñoz => ñ • charNgram: character n-grams • Pad token boundaries with ‘$’, and the last name’s boundaries with ‘+’ • 2-grams, 3-grams, and 4-grams • soundex: phonetic encoding • Steven, Stephen, Stevenson => S315 • Steeve => S310 • dmpNgram: double metaphone n-grams • Double Metaphone is designed to better handle non-English words and to deal with phonetic ambiguity • Schmidt => XMT and SMT • Steven, Stephen => STF; Stevenson => STFNSN • Uses a padding scheme similar to charNgram
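The charNgram and soundex feature types can be sketched in a few lines. This is only an illustration, not the authors' feature extractor: the function names are hypothetical, the last whitespace-separated token is assumed to be the last name, and the Soundex variant is the common American one (h and w are skipped, vowels separate repeated codes).

```python
from typing import List

def char_ngrams(name: str, n_values=(2, 3, 4)) -> List[str]:
    """Character n-grams; '$' pads ordinary tokens, '+' pads the last token.

    Assumption: the last token of the name is the last name.
    """
    tokens = name.lower().split()
    grams: List[str] = []
    for idx, token in enumerate(tokens):
        pad = "+" if idx == len(tokens) - 1 else "$"
        padded = pad + token + pad
        for n in n_values:
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# Soundex digit for each consonant group; vowels and other letters have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word: str) -> str:
    """Four-character Soundex code (first letter plus three digits)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is emitted again
    return (code + "000")[:4]

print(char_ngrams("Li Ming", n_values=(2,)))
print([soundex(w) for w in ("Steven", "Stephen", "Stevenson", "Steeve")])
```

The Soundex codes printed for the four example names agree with the ones listed above; Double Metaphone is considerably more involved and is not sketched here.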
Multinomial Logistic Regression • Logistic regression generalized to multiple classes • The set of coefficients {β_{k,0}, β_k}, k = 1…K-1, is estimated through an iterative process • {y_k}, k = 1…K, is the set of ethnicities
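For reference, the textbook form of multinomial logistic regression with class K as the reference class (which matches the K-1 coefficient sets above) is written out below. The symbol x for the name's feature vector is an assumption; the slides call the feature vector F.

```latex
P(y = y_k \mid \mathbf{x}) =
  \frac{\exp\!\left(\beta_{k,0} + \boldsymbol{\beta}_k^{\top}\mathbf{x}\right)}
       {1 + \sum_{j=1}^{K-1} \exp\!\left(\beta_{j,0} + \boldsymbol{\beta}_j^{\top}\mathbf{x}\right)},
  \qquad k = 1, \dots, K-1

P(y = y_K \mid \mathbf{x}) =
  \frac{1}{1 + \sum_{j=1}^{K-1} \exp\!\left(\beta_{j,0} + \boldsymbol{\beta}_j^{\top}\mathbf{x}\right)}
```

The predicted ethnicity is the class with the highest probability, and the K probabilities sum to 1 by construction.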
Ethnicity-Sensitive Name Matching • Name-ethnicity classifier (e1, e2 for Name 1 and Name 2): done ✔ • Remaining steps: optimal alignment => alignment profile f = <x1, x2, …, x8> => name matching model M => match probability • Example: Juan Gines Sanchez Moreno and G Lopez Moreno => p = 0.78
Compute Optimal Alignment • Modify the Smith–Waterman algorithm to find the optimal alignment between two names • Smith–Waterman algorithm • DNA sequence matching, e.g. between ‘ACAT’ and ‘AGCA’ • Use dynamic programming to calculate the scoring matrix H • Character alignment: A = a1a2…aM and B = b1b2…bN • H(i, j) = the maximum similarity score between a1…ai and b1…bj • Match/mismatch score: W(ai, bj) = 1 if ai = bj, and 0 otherwise • Gap score: W(ai, -) = W(-, bj) = 0
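The dynamic program described above can be sketched as follows, using exactly the scores given (match 1, mismatch 0, gap 0). The function names are hypothetical. Because none of these scores is negative, the best score sits in the bottom-right cell and the traceback starts there; this simplification is an assumption that would not hold for the general Smith–Waterman algorithm with negative penalties.

```python
from typing import List, Tuple

def w(x: str, y: str) -> int:
    """Match/mismatch score from the slide: 1 if equal, 0 otherwise."""
    return 1 if x == y else 0

GAP = 0  # gap score from the slide

def score_matrix(a: str, b: str) -> List[List[int]]:
    """H[i][j] = maximum similarity score between a[:i] and b[:j]."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # align a_i with b_j
                          H[i - 1][j] + GAP,                        # a_i against a gap
                          H[i][j - 1] + GAP)                        # b_j against a gap
    return H

def traceback(a: str, b: str, H: List[List[int]]) -> Tuple[str, str]:
    """Walk back from the bottom-right cell and rebuild one optimal alignment."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 and j > 0:
        if H[i][j] == H[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + GAP:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    while i > 0:  # leftover prefix of a
        top.append(a[i - 1]); bottom.append("-"); i -= 1
    while j > 0:  # leftover prefix of b
        top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

H = score_matrix("ACAT", "AGCA")
print(H[-1][-1])                      # alignment score
print(traceback("ACAT", "AGCA", H))   # one optimal alignment
```

For 'ACAT' and 'AGCA' this gives a score of 3 with the alignment A-CAT / AGCA-. Other alignments with the same score exist; which one is returned depends on the tie-breaking order in the traceback.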
Smith–Waterman: example • 1. Fill the scoring matrix H using dynamic programming • 2. Use the traceback procedure to find the optimal path • 3. Extract the optimal alignment
Extending Smith–Waterman • 1. Word match: align words, P = (p1, p2, …, pM) and Q = (q1, q2, …, qN), instead of characters, using a word similarity score • 2. Fuzzy match: Edward = E., Kathy = Katharine • Can use an ethnicity-dependent nickname dictionary and transliteration rules • 3. Span match: Al Hashim = Alhashim, De Félice = DeFélice, Zhao Hui Wu = Zhaohui Wu • Addresses the word-segmentation problem • 4. Shift (None, Left, Right): Min Seo Kim = Kim Min Seo • Find the optimal alignment for all 3 permutations
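The fuzzy, span, and shift extensions can be illustrated with three small helpers. These are hypothetical building blocks, not the authors' implementation: the helper names, the rule that an initial matches any word starting with it, the nickname dictionary format, and the reading of "shift" as a rotation of the token list are all assumptions.

```python
from typing import FrozenSet, Iterable, List, Optional, Set

def fuzzy_word_match(p: str, q: str,
                     nicknames: Optional[Set[FrozenSet[str]]] = None) -> bool:
    """True if two words are equal, one is an initial of the other,
    or the pair appears in a (possibly ethnicity-specific) nickname set."""
    p, q = p.lower().rstrip("."), q.lower().rstrip(".")
    if p == q:
        return True
    if (len(p) == 1 and q.startswith(p)) or (len(q) == 1 and p.startswith(q)):
        return True
    return frozenset((p, q)) in (nicknames or set())

def span_match(span: Iterable[str], word: str) -> bool:
    """True if several consecutive words concatenate to a single word."""
    return "".join(span).lower() == word.lower()

def shifts(tokens: List[str]) -> List[List[str]]:
    """The three orderings to align: no shift, left rotation, right rotation."""
    return [list(tokens), tokens[1:] + tokens[:1], tokens[-1:] + tokens[:-1]]

# Illustrative nickname dictionary (assumed format: a set of unordered pairs).
nick = {frozenset(("kathy", "katharine"))}

print(fuzzy_word_match("Edward", "E."))
print(fuzzy_word_match("Kathy", "Katharine", nick))
print(span_match(["Zhao", "Hui"], "Zhaohui"))
print([" ".join(s) for s in shifts("Min Seo Kim".split())])
```

In a full matcher, these checks would feed the word-level scoring matrix in place of the character comparison, and the best-scoring alignment across the three shifts would be kept.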
Example • [Figure: traceback and the resulting alignment]
Alignment Profile • Define an alignment profile as a vector of 7 features • fa = (0, 0, 1, 0, 0, 0, 0.91) [figure labels: 0.96 x 0.95, <skip>] • fb = (1, 0, 0, 0, 2, 0, 0.95) [figure labels: <skip>, <con>, 0.95]
Match Probability • So far, we have converted a <name1, name2> pair to an alignment profile f = <x1,…,x8> • Now we need a function ΘE: f => [0,1] that converts an alignment profile to a probability • P = probability that name1 and name2 match = ΘE(f) • Let D1,…, D7 be the discounting factors for different types of misalignment • Logistic regression: if we assume that the odds ratio P/(1-P) is proportional to …, then the log odds ratio can be rewritten in the form of a simple logistic regression
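The expression for the odds ratio did not survive in the text above. One form under which the final bullet holds is shown below purely as an illustration: it assumes each misalignment feature x_i multiplies the odds by its discounting factor D_i once per occurrence. The product form, the constant c, and the coefficient names β_i are assumptions, not the authors' stated model, and how any remaining component of f (such as a similarity score) enters is not specified here.

```latex
\frac{P}{1-P} \;=\; c \prod_{i=1}^{7} D_i^{\,x_i}
\quad\Longrightarrow\quad
\log\frac{P}{1-P} \;=\; \beta_0 + \sum_{i=1}^{7} \beta_i x_i,
\qquad \beta_0 = \log c,\;\; \beta_i = \log D_i

P \;=\; \Theta_E(f) \;=\; \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{i=1}^{7} \beta_i x_i\right)}
```

Under this reading, fitting a logistic regression on labeled name pairs estimates the log of each discounting factor, and a separate set of coefficients can be fitted per ethnicity model.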
Outline • Name-Matching & Name-Ethnicity • Problem Definition • Motivation • Previous Work • Ethnicity-Sensitive Name-Matching Framework • Name-Ethnicity Classification • Ethnicity Sensitive Name-Matching • Evaluation • Name-Ethnicity Classification (via Wikipedia) • Ethnicity Sensitive Name-Matching (via DBLP data) • Conclusion
Evaluation: Name-Ethnicity Classification • Use Wikipedia as the data source • More fine-grained than the US Census, which has only 6 types of ethnic groups • White, African American, Hispanic, Asian+Pacific Islander, Multi-nationality, Others • Automatically crawl names of various nationalities from Wikipedia categories • Use breadth-first search starting from “<nationality> people” pages, up to a depth of 4 • Manually curate the results with some heuristics • E.g. names under ‘British people of Indian descent’ are more likely to be of Indian ethnicity than of British ethnicity
Wikipedia Data • 19 Nationalities • 12 Ethnic groups • 70/30 split for training and testing
Accuracy and Confusion Matrix • 85% overall accuracy, dropping slightly to 84% when nonASCII features are ignored • High confusion between MEA and IND, and between ENG, FRN, and GER (observation: countries with high immigration rates) • Asian names are fairly easy to identify, especially JAP
Top Identifiable Features • Top features (without diacritics) for each name-ethnicity class according to the coefficients in the logistic regression models, e.g. • The ‘bh’ sequence is mostly unique to Indian names, while names with ‘sch’ are likely to be German names • Names ending with ‘ng’ are mostly Chinese names
Top Identifiable Features (Full) • Top features (including diacritics features) for each name-ethnicity class • While many diacritics features are highly ranked (especially for European names), removing them hurts the accuracy only slightly
Evaluation: Ethnicity Sensitive Name Matching • Data: DBLP10K person data set (10,000 pairs) • Manually labeled data from DBLP’s correction requests and heuristically detected errors • Lange, D., and Naumann, F. Frequency-aware Similarity Measures: Why Arnold Schwarzenegger is Always a Duplicate. CIKM 2011 • Select only the paper reference pairs from the same author with different name aliases (2,500 pairs) • Compare with 4 baselines (2 Basic and 2 Level2) • Basic • Levenshtein, Jaro-Winkler • Level2 [Monge and Elkan, KDD96] • Recursive matching scheme for multi-field strings (last names, forenames) • L2 Levenshtein, L2 Jaro-Winkler • Ethnicity-Sensitive Name-Matching (4 Models) • Middle Eastern (MEA), Spanish (SPA), East Asian (CHI, JAP, KOR, VIE), and Default (ALL)
Experiment Result • N x N comparison (N ~ 2,500) • Levenshtein: F1=0.70 (R=0.6, P=0.81) • Jaro-Winkler: F1=0.75 (R=0.7, P=0.81) • L2 Levenshtein: F1=0.77 (R=0.8, P=0.74) • L2 Jaro-Winkler: F1=0.80 (R=0.7, P=0.93) • Our algorithm: F1=0.94 (+0.14), R=0.89 (+0.19), P=0.99 (+0.06) • Error cases • Maria-Florina Popa / Maria-Florina Balcan • Hedvig Sidenbladh / Hedvig Kjellstrom
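As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The helper name below is hypothetical; the inputs are the rounded P and R values reported in the results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the results.
reported = {
    "Jaro-Winkler": (0.81, 0.7),
    "L2 Levenshtein": (0.74, 0.8),
    "L2 Jaro-Winkler": (0.93, 0.7),
    "Our algorithm": (0.99, 0.89),
}
for method, (p, r) in reported.items():
    print(f"{method}: F1 = {f1_score(p, r):.2f}")
```

These four reproduce the reported F1 values to two decimals. The plain Levenshtein row comes out at about 0.69 rather than 0.70 when computed from R=0.6, which is plausibly an effect of the recall being reported to only one decimal place.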
EthnicSeer http://singularity.ist.psu.edu/ethnicity
Conclusion & Future Work • Name-ethnicity classification • 85% accuracy on 12 ethnicities on Wikipedia • Shows that character/phonetic n-grams together with a logistic regression model can effectively identify name-ethnicity • Ethnicity-sensitive name-matching • Improves performance, F1=0.94 (+14%), P=0.99 (+6%), on the DBLP hard data set over the best baselines • Future Work • Expand to more ethnicities and to finer-grained classification (French in Quebec vs. in France) • Incorporate frequency knowledge + more syntactic knowledge • Ethnicity trends & prediction • Use a finer-grained name-ethnicity distance function • Naming conventions in Spain and in Latin America differ somewhat