Large-Scale Cost-sensitive Online Social Network Profile Linkage
Background & Motivation • Footprints in different social networks. • User identification in social analysis. • Privacy & security • Commercial & government applications
Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work
Problem Definition • Terminology • Identity: Person • Profile/User: Your footprint on social media • Profile Linkage: Link your footprints together • Input & Output • Input: profiles of one site as QUERY and profiles of the other site as TARGET. • Output: all profile pairs classified as matches.
Characteristics of profiles • Name (semi-structured vs. structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free text description • Bio, About me, Tags • Multilingualism
Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users register under different locales • Transliteration • 昊辰 => Haochen
Feature Acquisition • Network communication is time-consuming. • Web services impose usage limits. • 1,000 invocations per day for the Google Maps API • Higher computational complexity than string similarity. • Image processing algorithms.
Local Features • Username • Jaro-Winkler similarity • Language • Jaccard similarity • Description, URL • Cosine similarity with TF×IDF • Popularity • Defined as the number of friends a user has. • Adopt the following metric
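The Jaccard and TF×IDF cosine features listed above could be computed roughly as follows. This is a minimal sketch, not the authors' implementation: the function names, the smoothed IDF formula, and the choice to score two empty sets as 0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets carry no evidence of a match
    return len(a & b) / len(a | b)


def idf_table(docs):
    """Smoothed inverse document frequency from a list of token lists."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # assumption: log((1 + n) / (1 + df)) + 1 smoothing, one common variant
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity of two token lists under TF x IDF weighting."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_a).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

For example, `jaccard(["en", "zh"], ["en"])` compares the language sets of two profiles, while `tfidf_cosine` would be applied to tokenized descriptions or URLs with an IDF table built over all profiles.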
External Features • Geographic Location • Values are diverse and of different types. • Google Maps API: • string-represented location => geographic information • Spherical distance between two locations as the feature • Avatar • χ² dissimilarity of the avatar’s gray-scale histogram.
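Once a location string has been geocoded and an avatar reduced to a gray-scale histogram, the two external features might be computed as in the sketch below. The haversine formula is used as one standard way to obtain a spherical distance, and the χ² form shown is one common variant; the slides do not say which exact formulas were used, so both are assumptions, as are the function names. The geocoding call and the image loading are left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def chi2_dissimilarity(hist_a, hist_b):
    """Chi-square dissimilarity of two histograms with the same number of bins.

    Assumption: histograms are normalized to sum to 1 and the symmetric form
    0.5 * sum((p - q)^2 / (p + q)) is used, which ranges from 0 to 1.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    sa, sb = sum(hist_a), sum(hist_b)
    if sa == 0 or sb == 0:
        raise ValueError("histograms must not be empty")
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        p, q = a / sa, b / sb
        if p + q > 0:  # skip bins that are empty in both histograms
            total += (p - q) ** 2 / (p + q)
    return 0.5 * total
```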
Classification: learning • Probabilistic model derived from naïve Bayes • Independent feature assumption
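Under the independence assumption, a naïve Bayes style score can be accumulated one feature at a time: the posterior odds of a match are the prior odds multiplied by one likelihood ratio per feature. The sketch below illustrates that general idea only; the slides do not give the model's exact form, so the function name and the input representation are assumptions.

```python
import math


def posterior_match_probability(prior, likelihoods):
    """Combine a prior with per-feature likelihoods under feature independence.

    prior: P(match) before any feature is observed, strictly between 0 and 1.
    likelihoods: list of (P(value | match), P(value | non-match)) pairs,
                 one per observed feature, all strictly positive.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for p_match, p_nonmatch in likelihoods:
        # independence lets each feature contribute an additive log ratio
        log_odds += math.log(p_match / p_nonmatch)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Working in log-odds keeps the update additive, which is what makes it natural to add features one by one and stop early.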
Classification: learning • Iterative inference • Terminate if S_n is discriminative. • For each feature, set the threshold from a chosen error rate on the training set to determine whether S_n is discriminative. • Order of the features
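The early-termination idea can be sketched as a loop that acquires features in a fixed order and stops as soon as the running score S_n crosses a per-step threshold. This is an illustrative sketch under stated assumptions: the score is treated as an additive log-odds value, each step carries its own lower and upper thresholds (how they are derived from training error rates is not shown), and `iterative_inference` and its argument layout are hypothetical names.

```python
def iterative_inference(initial_score, steps, final_threshold=0.0):
    """Acquire features in order and stop once the score is discriminative.

    initial_score: starting value S_0 (assumed to be a log-odds style score).
    steps: ordered list of (name, acquire, lower, upper); acquire() returns the
           feature's additive contribution and is only called when reached.
    Returns (label, score, names_of_acquired_features).
    """
    score = initial_score
    acquired = []
    for name, acquire, lower, upper in steps:
        score += acquire()  # the acquisition cost is paid only here
        acquired.append(name)
        if score <= lower:
            return "non-match", score, acquired
        if score >= upper:
            return "match", score, acquired
    # assumption: fall back to a single threshold when no step was decisive
    label = "match" if score >= final_threshold else "non-match"
    return label, score, acquired
```

Placing cheap local features first means the expensive external ones (geocoding, avatar comparison) are requested only for pairs that are still ambiguous.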
Classification: learning • Initial value • Estimate by the prior that two profiles sharing rarer tokens are more likely to be matched. • as the initial value
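The formula for the initial value did not survive in this transcript, so the following is only a hypothetical illustration of the stated prior, namely that sharing a rarer token makes a match more likely. It assumes name tokens, counts how many target profiles contain each token, and uses 1 / (1 + frequency of the rarest shared token) as the starting value; none of these choices is taken from the slides.

```python
from collections import Counter


def token_frequencies(profiles):
    """Number of profiles in which each name token occurs."""
    return Counter(tok for tokens in profiles for tok in set(tokens))


def initial_match_prior(query_tokens, target_tokens, freq, floor=1e-6):
    """Hypothetical initial value: higher when the rarest shared token is rarer."""
    shared = set(query_tokens) & set(target_tokens)
    if not shared:
        return floor  # assumption: a small constant when no token is shared
    rarest = min(freq.get(tok, 0) for tok in shared)
    return max(floor, 1.0 / (1.0 + rarest))
```

With this heuristic, two profiles sharing "haochen" would start from a higher value than two profiles sharing only "li", since the latter token appears in far more profiles.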
Dataset of experiment • Data source • 152,294 Twitter users • 154,379 LinkedIn users • Ground truth: 9,750 identities • 4,779 identities with both accounts. • 3,339 identities with only Twitter account. • 1,632 identities with only LinkedIn account.
Experiment: Performance on overall linkage • I-Acc (Identity Accuracy) • correctly identified identities / all identities in ground truth • Outperforms the naïve learning method because of the adopted prior. • Performance differs across learning methods.
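I-Acc could be computed along the following lines. The slides give only the ratio, so what counts as a "correctly identified" identity is an assumption here: an identity with both accounts is correct when exactly that pair is predicted, and an identity with a single account is correct when that account is left unlinked. The data layout and function name are also hypothetical.

```python
def identity_accuracy(ground_truth, predicted_links):
    """Fraction of ground-truth identities that are handled correctly.

    ground_truth: list of (twitter_id or None, linkedin_id or None), one per identity.
    predicted_links: set of (twitter_id, linkedin_id) pairs output by the classifier.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    linked_twitter = {tw for tw, _ in predicted_links}
    linked_linkedin = {li for _, li in predicted_links}
    correct = 0
    for tw, li in ground_truth:
        if tw is not None and li is not None:
            ok = (tw, li) in predicted_links  # both accounts: pair must be found
        elif tw is not None:
            ok = tw not in linked_twitter     # Twitter only: must stay unlinked
        else:
            ok = li not in linked_linkedin    # LinkedIn only: must stay unlinked
        correct += ok
    return correct / len(ground_truth)
```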
Experiment: Cost-sensitive feature acquisition • 5% improvement in F1 at the cost of 148,743 external feature acquisitions. • Different orders of external features. • Rank by cost • Rank by distinguishability • Three sections divided by two inflection points.
Discussion: dataset construction • Dataset construction • Connections • Cannot correctly reflect the web-scale setting. • Name is too significant. • People search • Difficult to construct the ground truth. • Solution?
Discussion: people search task • Query LinkedIn by the Twitter user’s name • An average of 10 results per query
Discussion: feature dependency • Features are compared independently. • 2 people at Tsinghua with the same name, Li Peng • 2 people at NUS with the same name, Li Peng • Construct a separate IDF table for names in each locale. • Not generally applicable • Not significantly effective
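The per-locale IDF idea discussed above (which the slide reports as not significantly effective) might look like the sketch below. The locale codes, the input layout, and the smoothed IDF formula are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def locale_idf_tables(profiles):
    """Build one IDF table for name tokens per locale.

    profiles: iterable of (locale, name_tokens) pairs.
    Returns {locale: {token: idf}}.
    """
    doc_freq = defaultdict(Counter)
    sizes = Counter()
    for locale, tokens in profiles:
        sizes[locale] += 1
        doc_freq[locale].update(set(tokens))
    return {
        locale: {
            # assumption: log((1 + n) / (1 + df)) + 1 smoothing
            tok: math.log((1 + sizes[locale]) / (1 + c)) + 1.0
            for tok, c in df.items()
        }
        for locale, df in doc_freq.items()
    }
```

The intent is that a token such as "li" receives a low weight in a locale where it is common and a higher weight where it is rare, so the same name contributes different evidence depending on where the profile is registered.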
Conclusion • We proposed a supervised probabilistic model to solve the identity linkage problem effectively. • The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach. • Iterative inference is able to reduce unnecessary feature acquisitions.