Large-Scale Cost-sensitive Online Social Network Profile Linkage

Background & Motivation • Footprints in different social networks. • User identification in social analysis. • Privacy & security • Commercial & government applications

Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work

Problem Definition • Terminology • Identity: Person • Profile/User: Your footprint on social media • Profile Linkage: Link your footprints together • Input & Output • Input: profiles of one site as QUERY and profiles of the other site as TARGET. • Output: all profile pairs classified as matched.

Characteristics of profiles • Name (semi-structured vs. structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free text description • Bio, About me, Tags • Multilingualism

Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users register with different locales • Transliteration • 昊辰 => Haochen

Feature Acquisition • Network communication costs too much time. • Usage limits of the web service. • 1000 invocations per day for Google Maps API • High computational complexity compared to string similarity. • Image processing algorithm.

Overview of approach

Canopy: design

Canopy: efficiency

Local Features • Username • Jaro-Winkler similarity • Language • Jaccard similarity • Description, URL • Cosine similarity with TF×IDF • Popularity • Defined as the number of friends of a user. • Adopt the following metric
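The two set- and vector-based local features above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the smoothed IDF formula, and the assumption that profiles are already tokenized are all choices made here. Jaro-Winkler is left out to keep the sketch short.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # assumption: two empty attributes carry no evidence
    return len(sa & sb) / len(sa | sb)

def idf_table(documents):
    """Smoothed inverse document frequency per token (assumed smoothing)."""
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0
            for tok, count in df.items()}

def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity of two token lists under TF x IDF weighting."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_a).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

# Usage: languages compared with Jaccard, descriptions with TF x IDF cosine.
docs = [["data", "mining"], ["data", "science"], ["web", "search"]]
idf = idf_table(docs)
language_sim = jaccard(["en", "zh"], ["en"])
description_sim = tfidf_cosine(docs[0], docs[1], idf)
```

Rare tokens receive a larger IDF weight, so two descriptions that share an unusual word score higher than two that share only a common one.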

External Features • Geographic Location • Values are diverse with different types. • Google Maps API: • string-represented location => geographic information • Spherical distance between two locations as the feature • Avatar • χ² dissimilarity of the avatar’s gray-scale histogram.
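Once the costly acquisition step has produced coordinates and gray-scale histograms, the two external features reduce to short formulas. The sketch below assumes the haversine formula for the spherical distance and a symmetric, normalized χ² form for the histogram comparison; the slides name neither variant, and the geocoding and image-decoding steps are not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    # Clamp guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-squared dissimilarity of two histograms, in [0, 1]."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    if sum_a == 0 or sum_b == 0:
        raise ValueError("histograms must not be empty")
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        pa, pb = a / sum_a, b / sum_b  # normalize so image size cancels out
        if pa + pb > 0:
            total += (pa - pb) ** 2 / (pa + pb)
    return 0.5 * total
```

Normalizing each histogram before comparison keeps avatars of different resolutions comparable.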

Classification: learning • Probabilistic model derived from naïve Bayes • Independent feature assumption
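Under the independent feature assumption, a naïve Bayes style match probability can be computed by adding per-feature log likelihood ratios to the prior log-odds. The sketch below shows that generic calculation only; the function name and the likelihood-ratio parameterization are assumptions, and the exact model in the slides may differ.

```python
import math

def match_posterior(prior, likelihood_ratios):
    """P(match | features) assuming conditionally independent features.

    likelihood_ratios[i] is P(f_i | match) / P(f_i | non-match),
    estimated elsewhere (for example from training data).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)  # independence turns products into sums
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Working in log-odds means each newly acquired feature is a single addition, which is what makes incremental inference cheap.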

Classification: learning • Iterative inference • Terminate if S_n is discriminative. • Set the threshold from each feature's error rate on the training set to determine whether S_n is discriminative • Order of the features
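The early-termination idea can be sketched as a loop that acquires features in a fixed order and stops as soon as the running score is discriminative. As a simplification, this sketch uses one assumed pair of global thresholds (`lower`, `upper`) rather than per-feature thresholds derived from training error rates, and the names are hypothetical.

```python
def iterative_inference(initial_score, features, lower, upper):
    """Acquire features in order until the score S_n is discriminative.

    features: ordered list of (name, acquire) pairs, where acquire() performs
    the (possibly costly) acquisition and returns a log likelihood ratio.
    Returns (decision, final score, names of acquired features).
    """
    score = initial_score
    acquired = []
    for name, acquire in features:
        if score <= lower or score >= upper:
            break  # already discriminative: skip the remaining features
        score += acquire()
        acquired.append(name)
    if score >= upper:
        decision = "match"
    elif score <= lower:
        decision = "non-match"
    else:
        decision = "undecided"
    return decision, score, acquired

# Usage: the cheap username feature settles the pair, so the costly
# location lookup is never invoked.
calls = []

def username_feature():
    calls.append("username")
    return 3.0

def location_feature():
    calls.append("location")
    return 1.0

decision, score, used = iterative_inference(
    0.0, [("username", username_feature), ("location", location_feature)],
    lower=-2.0, upper=2.0)
```

Placing cheap features first means rate-limited services are called only for pairs that the cheap features leave ambiguous.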

Classification: learning • Initial value • Estimate by the prior that two profiles sharing rarer tokens are more likely to be matched. • as the initial value
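One way to make the rarer-token prior concrete is sketched below. It is an assumed illustration, not the expression used in the slides: if a token occurs in n_q query profiles and n_t target profiles, then n_q × n_t pairs share it and at most min(n_q, n_t) of them can be true matches, so 1 / max(n_q, n_t) bounds the match rate among those pairs. All names and the `floor` parameter are hypothetical.

```python
from collections import Counter

def token_counts(profiles):
    """Number of profiles in which each name token appears."""
    return Counter(tok for tokens in profiles for tok in set(tokens))

def initial_match_prior(query_tokens, target_tokens,
                        query_counts, target_counts, floor=1e-6):
    """Assumed initial value: rarer shared tokens give a higher prior."""
    shared = set(query_tokens) & set(target_tokens)
    best = floor  # pairs sharing no token start from a small constant
    for tok in shared:
        n_q = query_counts.get(tok, 1)
        n_t = target_counts.get(tok, 1)
        best = max(best, 1.0 / max(n_q, n_t))
    return best

# Usage with toy data: "haochen" is rarer than "zhang" on both sites.
queries = [["haochen", "zhang"], ["john", "zhang"], ["john", "smith"]]
targets = [["haochen", "zhang"], ["wei", "zhang"], ["john", "lee"]]
qc, tc = token_counts(queries), token_counts(targets)
rare = initial_match_prior(queries[0], targets[0], qc, tc)
common = initial_match_prior(queries[1], targets[1], qc, tc)
```

In this toy data the pair sharing "haochen" starts with a higher prior than the pair sharing only "zhang".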

Dataset of experiment • Data source • 152,294 Twitter users • 154,379 LinkedIn users • Ground truth: 9,750 identities • 4,779 identities with both accounts. • 3,339 identities with only Twitter account. • 1,632 identities with only LinkedIn account.

Experiment: Performance on overall linkage • I-Acc (Identity Accuracy) • correctly identified identities / all identities in ground truth • Better than the naïve learning method because of the adopted prior. • Different performance with different learning methods.
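The I-Acc ratio can be sketched as below. The slides do not spell out when an identity counts as correctly identified, so this sketch assumes that an identity with both accounts is correct when its pair is predicted, and an identity with a single account is correct when that account is left unlinked. The data layout (tuples with `None` for a missing account) is also an assumption.

```python
def identity_accuracy(ground_truth, predicted_pairs):
    """I-Acc = correctly identified identities / all identities.

    ground_truth: list of (twitter_id, linkedin_id); either may be None.
    predicted_pairs: iterable of (twitter_id, linkedin_id) predicted matches.
    """
    predicted = set(predicted_pairs)
    linked_twitter = {tw for tw, _ in predicted}
    linked_linkedin = {li for _, li in predicted}
    correct = 0
    for tw, li in ground_truth:
        if tw is not None and li is not None:
            correct += (tw, li) in predicted      # pair must be found
        elif tw is not None:
            correct += tw not in linked_twitter   # must stay unlinked
        elif li is not None:
            correct += li not in linked_linkedin  # must stay unlinked
    return correct / len(ground_truth) if ground_truth else 0.0

# Usage with four toy identities: one found, one wrongly linked,
# one correctly left alone, one missed.
truth = [("t1", "l1"), ("t2", None), (None, "l3"), ("t4", "l4")]
predictions = [("t1", "l1"), ("t2", "l9")]
i_acc = identity_accuracy(truth, predictions)
```

Unlike pairwise F1, this measure also rewards leaving single-account identities unlinked.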

Experiment: Cost-sensitive feature acquisition • 5% improvement in F1 from 148,743 external feature acquisitions. • Different orders of external features. • Rank by cost • Rank by distinguishability • Three sections divided by two inflection points.

Discussion: dataset construction • Dataset construction • Connections • Cannot correctly reflect the web-scale scenario. • Name is too significant. • People search • Difficult to construct the ground truth. • Solution?

Discussion: people search task • Query LinkedIn by the Twitter user’s name • On average 10 results per query

Discussion: feature dependency • Compare features independently. • 2 people at Tsinghua with the same name Li Peng • 2 people at NUS with the same name Li Peng • Construct a separate IDF table for names in each locale. • Not generally applicable • Not significantly effective

Conclusion • We proposed a supervised probabilistic model to solve the identity linkage problem effectively. • The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach. • Iterative inference is able to reduce unnecessary feature acquisitions.

Thank you
