Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung
Alias Definition • Alias of names • Dubya = G.W. Bush • Usama = Osama • G.W. Bush = the President • Osama bin Laden = the Emir, the Prince • Misspelled words • Unintentional (typos) • Intentional: mortgage = m0rtg@ge (spam)
In What Context Do Aliases Occur? • Newspaper articles • Web pages • Spam emails • Any collection of text
Link Data Set • A way to represent the context • Composed of a set of names and a set of links • Names are extracted from the text • Different names can refer to the same entity (“Dubya” and “G.W. Bush”) • Each link is a collection of names and represents a relationship among those names
Example • “Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan.” • (Osama bin Laden, Ayman al-Zawahri, al-Qaeda) • (Pakistan, Osama bin Laden) • (Afghanistan, Osama bin Laden)
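One way to hold the example links in memory is sketched below. This is a minimal illustration, not the thesis implementation: the tuple-per-link layout and the `build_graph` helper are assumptions made for this example.

```python
from collections import defaultdict

# The three links from the example slide; each link is a tuple of names.
links = [
    ("Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"),
    ("Pakistan", "Osama bin Laden"),
    ("Afghanistan", "Osama bin Laden"),
]

def build_graph(links):
    """Map each name to the set of names it shares at least one link with.

    Hypothetical helper: the slides do not say how the graph is stored.
    """
    graph = defaultdict(set)
    for link in links:
        for name in link:
            graph[name].update(other for other in link if other != name)
    return dict(graph)

graph = build_graph(links)
print(sorted(graph["Osama bin Laden"]))
```

Storing neighbors as sets discards how often two names co-occur; the counts matter later for the semantic measures, so a real implementation would likely keep them.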
Graph Representation • [Figure: graph of the example links, with nodes Pakistan, Afghanistan, al-Qaeda, Osama, and Ayman]
Advantages • A link data set is easily understood by computers • Mimics the way intelligence communities gather data
Alias Detection • Given two names in a link data set, are they aliases (i.e. do they refer to the same entity?) • How to measure their alias-ness? • Semi-supervised learning
Orthographic Measures • String edit distance • Minimum number of insertions, deletions, and substitutions required to transform one name into the other • SED(Osama, Usama) = 2 • SED(Osama, Bush) = 7 • Intuitive measure
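A minimal sketch of string edit distance by dynamic programming follows. The `sub_cost` parameter is an assumption added for this example: the values on the slide (2 and 7) are what the computation yields when a substitution is charged as one deletion plus one insertion (`sub_cost=2`), whereas with unit substitution cost SED(Osama, Usama) would be 1. The slides do not state which cost scheme the thesis uses.

```python
def string_edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insertion/deletion cost.

    sub_cost is the price of replacing one character with another;
    it is a parameter of this sketch, not something the slides define.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(string_edit_distance("Osama", "Usama", sub_cost=2))
print(string_edit_distance("Osama", "Bush", sub_cost=2))
```

The comparison is case-sensitive, which is also an assumption of the sketch.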
Some Orthographic Measures • String edit distance • Normalized string edit distance • Discretized string edit distance
Semantic Measures • But what about aliases such as the Prince and Osama, whose spellings have little in common? • Define the friends of Osama as the names that occur in the same links as Osama • From the link data set, count how many times each friend co-occurs with Osama • Intuition: the friends of the Prince look like the friends of Osama • Treat each friends list as a probability vector
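The friends definition above can be sketched as a co-occurrence count followed by normalization. The toy links and the helper names `friends_of` and `to_probabilities` are hypothetical; they are not taken from the thesis.

```python
from collections import Counter

def friends_of(name, links):
    """Count how often each other name shares a link with `name`."""
    counts = Counter()
    for link in links:
        if name in link:
            counts.update(other for other in link if other != name)
    return counts

def to_probabilities(counts):
    """Scale friend counts so they sum to 1 (empty input gives an empty dict)."""
    total = sum(counts.values())
    return {friend: n / total for friend, n in counts.items()} if total else {}

# Toy links invented for this illustration.
links = [
    ("Osama", "al-Qaeda", "CNN"),
    ("Osama", "al-Qaeda"),
    ("the Prince", "al-Qaeda"),
    ("the Prince", "CNN", "Islam"),
]
osama = friends_of("Osama", links)
probs = to_probabilities(osama)
print(osama, probs)
```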
Example of Friends • [Figure: Osama linked to each friend, labeled with co-occurrence counts] • al-Qaeda: 10 • CNN: 2 • Islam: 5
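Comparing Two Friends Lists • [Figure: the Prince and Osama, each linked to its friends with co-occurrence counts] • The Prince: al-Qaeda 2, CNN 8, Music 50 • Osama: al-Qaeda 10, CNN 2, Islam 5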
Some Semantic Measures • Dot product: 10 * 2 + 2 * 8 = 36 • Normalized dot product • Common friends: 2 (CNN, al-Qaeda) • KL distance: [formula missing from the transcript]
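The four measures can be sketched on the two friends lists as follows. Several choices here are assumptions, since the slides do not give the definitions: the normalized dot product is taken to be cosine similarity, and the KL distance uses additive smoothing (`eps`) so that friends missing from one list do not produce a division by zero. The Islam and Music counts are read from the figure residue and do not affect the dot product.

```python
import math

# Friend counts as read from the friends-list slides.
osama = {"al-Qaeda": 10, "CNN": 2, "Islam": 5}
prince = {"al-Qaeda": 2, "CNN": 8, "Music": 50}

def dot_product(p, q):
    """Sum of count products over friends present in both lists."""
    return sum(p[k] * q[k] for k in p.keys() & q.keys())

def normalized_dot_product(p, q):
    """Assumption: normalize by vector lengths (cosine similarity)."""
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot_product(p, q) / norm if norm else 0.0

def common_friends(p, q):
    """Number of friends that appear in both lists."""
    return len(p.keys() & q.keys())

def kl_distance(p, q, eps=1e-6):
    """KL divergence D(p || q) over the union of friends.

    Assumption: eps is added to every count before normalizing,
    which keeps the logarithm finite when a friend is missing.
    """
    keys = p.keys() | q.keys()
    p_total = sum(p.values()) + eps * len(keys)
    q_total = sum(q.values()) + eps * len(keys)
    total = 0.0
    for k in keys:
        pk = (p.get(k, 0) + eps) / p_total
        qk = (q.get(k, 0) + eps) / q_total
        total += pk * math.log(pk / qk)
    return total

print(dot_product(osama, prince), common_friends(osama, prince))
```

KL divergence is not symmetric, so `kl_distance(osama, prince)` and `kl_distance(prince, osama)` generally differ; whether the thesis symmetrizes it is not stated in the slides.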
Classifier • So we have a link data set • We have some measures of what aliases are • We can easily hand-pick some examples of aliases • Let’s build a classifier!
Classifier Training Set • Positive examples: hand-picked pairs of names in the link data set that are known aliases • Negative examples: randomly picked pairs of names from the same link data set • Calculate the measures for every pair and insert them as attributes in the training set
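A sketch of assembling such a training set is given below. The function names, the `toy_measures` attributes, and the decision to exclude known alias pairs from the random negatives are all assumptions for illustration; the slides only say that negatives are picked at random.

```python
import itertools
import random

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (rows, labels): one attribute row per pair, label 1 for aliases.

    `measures(a, b)` is any callable returning a list of attribute values.
    Enumerating all pairs is quadratic in the number of names, which is
    acceptable for a sketch but not for a large data set.
    """
    known = {frozenset(pair) for pair in alias_pairs}
    candidates = [pair for pair in itertools.combinations(sorted(names), 2)
                  if frozenset(pair) not in known]
    rng = random.Random(seed)
    negatives = rng.sample(candidates, min(n_negative, len(candidates)))
    rows = ([measures(a, b) for a, b in alias_pairs]
            + [measures(a, b) for a, b in negatives])
    labels = [1] * len(alias_pairs) + [0] * len(negatives)
    return rows, labels

def toy_measures(a, b):
    """Hypothetical attributes: length difference and shared letters."""
    return [abs(len(a) - len(b)), len(set(a.lower()) & set(b.lower()))]

names = ["Osama", "Usama", "Dubya", "G.W. Bush", "CNN"]
aliases = [("Osama", "Usama"), ("Dubya", "G.W. Bush")]
rows, labels = build_training_set(names, aliases, toy_measures, n_negative=3)
print(labels)
```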
Classifier: Cross-Validation • Experimented with Decision Trees, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Logistic Regression • Logistic Regression performed the best
Prediction • Given a query name in the link data set with known aliases • Pair query name with ALL other names • Calculate attributes for all pairs • Run each pair through the classifier and obtain a score (how likely are they to be aliases?)
Prediction • Use the score to sort the pairs from most likely to be an alias to least likely • See where the true aliases lie in the sorted list and produce a ROC curve • Evaluate classifier based on ROC curve
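The prediction steps can be sketched as scoring every candidate pair and sorting by score. The weights, bias, and single shared-letters attribute below are invented for illustration; in the described pipeline they would come from the trained logistic regression and the real orthographic and semantic measures.

```python
import math

def logistic_score(attributes, weights, bias):
    """Logistic regression output in (0, 1) for one attribute row."""
    z = bias + sum(w * x for w, x in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(query, names, measures, weights, bias):
    """Pair the query with every other name and sort by descending score."""
    scored = [(logistic_score(measures(query, name), weights, bias), name)
              for name in names if name != query]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

def shared_letters(a, b):
    """Hypothetical single attribute: number of distinct letters in common."""
    return [len(set(a.lower()) & set(b.lower()))]

names = ["Osama", "Usama", "Bush", "CNN"]
ranked = rank_candidates("Osama", names, shared_letters,
                         weights=[1.0], bias=-1.5)
print(ranked)
```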
Summary • Training: true alias pairs (excluding the query name) and random pairs → calculate attributes → train logistic regression • Prediction: query name → calculate attributes → run classifier → ROC curve
ROC Curve • Start from (0,0) on the graph • Go down the sorted list • If the name on the list is a true alias, move up one unit along y • If the name on the list is not a true alias, move right one unit along x
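The construction above translates directly into a short loop. In this sketch the sorted list is represented as booleans (True where the ranked name is a true alias); that representation is an assumption of the example.

```python
def roc_points(ranked_is_alias):
    """Walk down the ranked list and return the (x, y) points of the curve."""
    x = y = 0
    points = [(0, 0)]
    for is_alias in ranked_is_alias:
        if is_alias:
            y += 1  # true alias: move up
        else:
            x += 1  # not an alias: move right
        points.append((x, y))
    return points

print(roc_points([True, True, False]))
```

A perfect ranking puts every true alias first, so the curve climbs straight up before moving right.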
Perfect ROC Example • [Figure: ROC curve plotted on axes running from 0 to 3]
ROC Example • [Figure: ROC curve plotted on axes running from 0 to 3]
ROC: Normalize • Balance positive and negative examples • Area under curve (AUC) = 5/9 • Able to average multiple curves • [Figure: normalized ROC curve plotted on axes running from 0 to 1]
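A sketch of the normalized area computation: count the unit squares under the staircase curve and divide by (number of true aliases) × (number of non-aliases). The ranking used below is hypothetical, chosen only because it yields 5/9 with three aliases and three non-aliases; the actual ranking behind the slide's figure cannot be recovered from the transcript.

```python
def normalized_auc(ranked_is_alias):
    """Area under the ROC staircase, scaled to lie in [0, 1]."""
    positives = sum(1 for flag in ranked_is_alias if flag)
    negatives = len(ranked_is_alias) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one alias and one non-alias")
    area = 0
    height = 0
    for is_alias in ranked_is_alias:
        if is_alias:
            height += 1      # curve moves up
        else:
            area += height   # curve moves right; add the column under it
    return area / (positives * negatives)

example = [True, False, True, False, False, True]  # hypothetical ranking
print(normalized_auc(example))
```

Because every curve is rescaled to the unit square, results from queries with different numbers of aliases become comparable and can be averaged.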
Empirical Results • Tested on one web page link data set and two spam link data sets • Hand-picked aliases for each set
Empirical Results • Choose an alias from the set of hand-picked aliases as the query name • Build the classifier from the other alias pairs, excluding those that are aliases of the query name • Run prediction and obtain a ROC curve • Repeat for each alias in the set of hand-picked aliases • Average all ROC curves on normalized axes
Evaluation • We want to know how significant each group of attributes is • Train one classifier with just orthographic attributes • Train another with just semantic attributes • Train a third with both sets of attributes • Compare the curves and the area under the curve (AUC)
Terrorist Data Set • Manually extracted from public web pages • News and articles related to terrorism • Names mentioned in the articles are subjectively linked • Used 919 alias pairs for training
Spam Data Set • Collection of spam emails • HTML tags are filtered out • All words are converted to tokens, with whitespace as the boundary • Common tokens are filtered out (e.g. “the”, “a”) • Each email represents a link • Each link contains the tokens from the corresponding email
Example Subject:Mortgage rates as low as 2.95% Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to as low as 2.<sppyjukbywvbqc>95% Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or b<skaavzibaenix>uy the <br> ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br> • Filtered to: (mortgage, rates, low, refinance, today, save, thousands, dollars, home, dreams)
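The filtering illustrated above can be sketched with a regular expression and a stop-word set. The stop-word list here is hypothetical and much smaller than whatever the thesis used, and stripping punctuation from token edges is an assumption (the slides mention only whitespace boundaries), so this sketch is not expected to reproduce the slide's filtered output exactly.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # anything between angle brackets
STOP_WORDS = {"the", "a", "of", "to", "as", "or"}  # hypothetical list

def email_to_link(text, stop_words=STOP_WORDS):
    """Turn one email body into a link: a list of filtered tokens."""
    stripped = TAG_RE.sub("", text)  # rejoin words split by junk tags
    tokens = [tok.strip(".,!?:;%").lower() for tok in stripped.split()]
    return [tok for tok in tokens if tok and tok not in stop_words]

snippet = ("Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of "
           "dol<scarqdscpvibyw>l<sklhxmxbvdr>ars")
print(email_to_link(snippet))
```

Removing each tag without inserting a space is what reassembles words such as "thousands" that the spammer split with junk tags.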
Conclusion • Orthographic measures work well • Semantic measures are sometimes better and sometimes worse than orthographic measures • Combining the two produces the best results • Future work includes adding other measures, such as phonetic string edit distance • Larger question: many aliases to many names