
Large-Scale Cost-sensitive Online Social Network Profile Linkage


hasad

Presentation Transcript


  1. Large-Scale Cost-sensitive Online Social Network Profile Linkage

  2. Background & Motivation • Footprints in different social networks. • User identification in social analysis. • Privacy & security • Commercial & government applications

  3. Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work

  4. Problem Definition • Terminology • Identity: a person • Profile/User: a person's footprint on social media • Profile Linkage: linking the footprints of the same person together • Input & Output • Input: profiles of one site as QUERY and profiles of the other site as TARGET. • Output: all profile pairs classified as matches.

  5. Characteristics of profiles • Name (semi-structured vs. structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free text description • Bio, About me, Tags • Multilingualism

  6. Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users register under different locales • Transliteration • 昊辰 => Haochen

  7. Feature Acquisition • Network communication takes too much time. • Usage limits of web services. • 1000 invocations per day for Google Maps API • Higher computational complexity than string similarity. • Image processing algorithms.

  8. Overview of approach

  9. Canopy: design

  10. Canopy: efficiency

  11. Local Features • Username • Jaro-Winkler similarity • Language • Jaccard similarity • Description, URL • Cosine similarity with TF×IDF • Popularity • Defined as the number of friends of a user. • Adopt the following metric
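
The set-based and token-based similarities named on slide 11 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the whitespace tokenization, and the smoothed IDF formula are all assumptions. Jaro-Winkler is omitted for brevity.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    """Jaccard similarity of two collections, e.g. the languages of two profiles."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # assumption: two empty sets carry no evidence of a match
    return len(set_a & set_b) / len(set_a | set_b)


def idf_table(documents):
    """Smoothed inverse document frequency per token (smoothing is an assumption)."""
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc.lower().split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_cosine(text_a, text_b, idf):
    """Cosine similarity between the TF-IDF vectors of two free-text fields."""
    def vectorize(text):
        tf = Counter(text.lower().split())
        # Tokens missing from the IDF table get weight 0 and are ignored.
        return {tok: count * idf.get(tok, 0.0) for tok, count in tf.items()}

    vec_a, vec_b = vectorize(text_a), vectorize(text_b)
    dot = sum(w * vec_b.get(tok, 0.0) for tok, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch the IDF table would be built once over all description fields in the corpus and then reused for every profile pair, so only the per-pair dot product is computed at comparison time.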

  12. External Features • Geographic Location • Values are diverse with different types. • Google Maps API: • string-represented location => geographic information • Spherical distance between two locations as the feature • Avatar • χ² dissimilarity of the avatar’s gray-scale histogram.
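
Once the external data has been fetched, the two features on slide 12 reduce to short numeric computations. The sketch below assumes the locations are already geocoded to latitude/longitude in degrees and that avatars are already reduced to gray-scale histograms of equal length. The haversine formula and the symmetric, normalized form of the χ² statistic are assumptions; the slides do not state which variants were used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-square dissimilarity between two histograms, in [0, 1]."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")
    score = 0.0
    for a, b in zip(hist_a, hist_b):
        p, q = a / total_a, b / total_b  # normalize so image size does not matter
        if p + q > 0:
            score += (p - q) ** 2 / (p + q)
    return 0.5 * score
```

Normalizing each histogram before comparison makes the score independent of avatar resolution, which matters when the two sites serve images at different sizes.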

  13. Classification: learning • Probabilistic model derived from naïve Bayes • Independent feature assumption
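
Under the independent feature assumption, evidence from each feature can be combined by adding log-likelihood ratios to the prior log-odds. The sketch below shows that generic naïve Bayes combination; it is not the exact model from the slides, and the function name and the likelihood-ratio representation of each feature are assumptions.

```python
import math


def match_posterior(prior, likelihood_ratios):
    """P(match | features) from a prior P(match) and per-feature likelihood ratios.

    Each ratio is P(feature value | match) / P(feature value | non-match),
    assumed to be estimated from training data and strictly positive.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        # Independence lets each feature contribute an additive term.
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, a neutral prior of 0.5 combined with a single feature that is three times as likely under a match yields a posterior of about 0.75.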

  14. Classification: learning • Iterative inference • Terminate if S_n is discriminative. • Set the threshold from each feature's error rate on the training set to decide whether S_n is discriminative • Order of the features
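
The iterative inference on slide 14 can be read as a loop that acquires features one at a time and stops as soon as the running score is decisive, so that costly external features are fetched only for ambiguous pairs. The sketch below assumes the score is a posterior probability and uses fixed placeholder thresholds `lower` and `upper`; in the slides the threshold is derived from per-feature training error rates. The `acquire` callables are hypothetical stand-ins for feature computation.

```python
import math


def iterative_inference(prior, features, lower=0.05, upper=0.95):
    """Acquire features in order until the score is discriminative.

    features: ordered list of (name, acquire) pairs, where acquire() returns the
    likelihood ratio of that feature and may be expensive (e.g. a web API call).
    Returns the final score and the names of the features actually acquired.
    """
    log_odds = math.log(prior / (1.0 - prior))
    score = prior
    acquired = []
    for name, acquire in features:
        if score <= lower or score >= upper:
            break  # S_n is already discriminative: skip the remaining features
        log_odds += math.log(acquire())
        score = 1.0 / (1.0 + math.exp(-log_odds))
        acquired.append(name)
    return score, acquired
```

Placing cheap local features first and external ones last means the external `acquire` calls run only when the local evidence leaves the pair undecided.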

  15. Classification: learning • Initial value • Estimated from the prior that two profiles sharing rarer tokens are more likely to be matched. • as the initial value
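
The exact estimator for the initial value is not given in the transcript. One way to encode the stated prior is to score a pair by its rarest shared name token: if a token occurs in n_q query profiles and n_t target profiles, a random pair sharing it is a true match with probability at most about 1 / (n_q × n_t). The sketch below uses that heuristic; the formula, the function names, and the whitespace tokenization are all assumptions.

```python
from collections import Counter


def token_frequencies(names):
    """Number of profiles whose name contains each token."""
    return Counter(tok for name in names for tok in set(name.lower().split()))


def initial_match_score(name_a, name_b, freq_query, freq_target):
    """Heuristic initial score: rarer shared tokens give a higher value."""
    shared = set(name_a.lower().split()) & set(name_b.lower().split())
    if not shared:
        return 0.0
    # The rarest shared token provides the strongest evidence of a match.
    return max(1.0 / max(freq_query[tok] * freq_target[tok], 1) for tok in shared)
```

With this scoring, two profiles that share only a very common family name start with a low score, while a pair sharing a rare given name starts close to 1.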

  16. Experiment dataset • Data source • 152,294 Twitter users • 154,379 LinkedIn users • Ground truth: 9,750 identities • 4,779 identities with both accounts. • 3,339 identities with only a Twitter account. • 1,632 identities with only a LinkedIn account.

  17. Experiment: Performance on overall linkage • I-Acc (Identity Accuracy) • correctly identified identities / all identities in ground truth • Outperforms the naïve learning method because it adopts the prior. • Performance differs across learning methods.
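
I-Acc as defined on slide 17 can be computed as below. The slides do not say when an identity counts as correctly identified, so this sketch assumes the following: an identity with both accounts is correct only if its two profiles are linked to each other and to nothing else, and an identity with a single account is correct only if that profile is linked to nothing. The data layout and function name are also assumptions.

```python
def identity_accuracy(ground_truth, predicted_pairs):
    """Fraction of ground-truth identities that are correctly identified.

    ground_truth: list of (twitter_id, linkedin_id); None marks a missing account.
    predicted_pairs: iterable of (twitter_id, linkedin_id) classified as matches.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    links_by_twitter, links_by_linkedin = {}, {}
    for tw, li in predicted_pairs:
        links_by_twitter.setdefault(tw, set()).add(li)
        links_by_linkedin.setdefault(li, set()).add(tw)

    correct = 0
    for tw, li in ground_truth:
        tw_links = links_by_twitter.get(tw, set()) if tw is not None else set()
        li_links = links_by_linkedin.get(li, set()) if li is not None else set()
        if tw is not None and li is not None:
            # Both accounts exist: they must be linked to each other and nothing else.
            is_correct = tw_links == {li} and li_links == {tw}
        else:
            # Single-account identity: any predicted link is an error.
            is_correct = not tw_links and not li_links
        correct += is_correct
    return correct / len(ground_truth)
```

Counting single-account identities in the denominator penalizes spurious links, which plain pairwise precision and recall over matched pairs would not capture per identity.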

  18. Experiment: Cost-sensitive feature acquisition • 5% improvement in F1 from 148,743 external feature acquisitions. • Different orders of external features. • Rank by cost • Rank by distinguishability • Three sections divided by two inflection points.

  19. Discussion: dataset construction • Dataset construction • Connections • Cannot correctly reflect the web-scale setting. • Name is too significant. • People search • Difficult to construct the ground truth. • Solution?

  20. Discussion: people search task • Query LinkedIn by the Twitter user’s name • An average of 10 results per query

  21. Discussion: feature dependency • Features are compared independently. • 2 people at Tsinghua with the same name, Li Peng • 2 people at NUS with the same name, Li Peng • Construct a different IDF table for names in each locale. • Not generally applicable • Not significantly effective

  22. Conclusion • We proposed a supervised probabilistic approach that solves the identity linkage problem effectively. • The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach. • Iterative inference reduces unnecessary feature acquisitions.

  23. Thank you
