Clickstream Models & Sybil Detection

Clickstream Models & Sybil Detection • Gang Wang (王刚) • UC Santa Barbara • gangw@cs.ucsb.edu

Modeling User Clickstream Events • User-generated events • E.g. profile load, link follow, photo browse, friend invite • Assume we have event type, userID, timestamp • Intuition: Sybil users act differently from normal users • Goal-oriented: focus on specific actions, fewer “extraneous” events • Time-limited: focused on efficient use of time, smaller gaps? • Forcing Sybil users to mimic normal users: a win?

System Overview • [Diagram; labels: Clickstream Log, Sequence Clustering, Cluster Coloring, Known Good Users, Incoming Clickstream, Legit? / Sybils]

Clickstream Models • Clickstream log • User clicks (click type) with timestamps • Modeling Clickstreams • Event-only Sequence Model: order of events • e.g. ABCDA • Time-based Model: sequence of inter-arrival times • e.g. {t1, t2, t3, …} • Hybrid Model: sequence of click events with time gaps • e.g. A(t1)B(t2)C(t3)D(t4)A
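The three models can be read as three views of the same per-user log. The sketch below is illustrative only: it assumes each log record is an (event type, timestamp) pair, and the function names are hypothetical rather than taken from the talk.

```python
from typing import List, Optional, Tuple

# Assumed record layout: (click type, timestamp in seconds).
Event = Tuple[str, float]

def event_sequence(events: List[Event]) -> List[str]:
    """Event-only model: click types in time order."""
    return [etype for etype, _ in sorted(events, key=lambda e: e[1])]

def inter_arrival_times(events: List[Event]) -> List[float]:
    """Time-based model: gaps between consecutive clicks."""
    ts = sorted(t for _, t in events)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]

def hybrid_sequence(events: List[Event]) -> List[Tuple[str, Optional[float]]]:
    """Hybrid model: each click paired with the gap to the next click.

    The last click has no following gap, so its gap is None.
    """
    ordered = sorted(events, key=lambda e: e[1])
    if not ordered:
        return []
    pairs = [(etype, nxt_t - t) for (etype, t), (_, nxt_t) in zip(ordered, ordered[1:])]
    pairs.append((ordered[-1][0], None))
    return pairs

log = [("A", 0.0), ("B", 2.0), ("C", 5.0), ("D", 6.0), ("A", 10.0)]
print("".join(event_sequence(log)))
print(inter_arrival_times(log))
print(hybrid_sequence(log))
```

With this toy log, the event-only view corresponds to the slide's ABCDA example, and the hybrid view attaches each gap to the click that precedes it.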

Clickstream Clustering • Similarity Graph • Vertices: users (or sessions) • Edges: weighted by the similarity score of two users' clickstreams • Clustering similar clickstreams together • Graph partitioning using METIS • Q: How to compare two clickstreams?
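A minimal sketch of the similarity graph, under stated assumptions: the `jaccard` function here is only a placeholder similarity (overlap of event-type sets), not the measure used in the talk, and `build_similarity_graph` and `min_weight` are hypothetical names. The resulting weighted adjacency map is the kind of input a graph partitioner would consume; the partitioning step itself is not shown.

```python
from itertools import combinations
from typing import Callable, Dict

def jaccard(a: str, b: str) -> float:
    """Placeholder similarity: overlap of the event types two users produced."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def build_similarity_graph(
    clickstreams: Dict[str, str],
    similarity: Callable[[str, str], float] = jaccard,
    min_weight: float = 0.0,
) -> Dict[str, Dict[str, float]]:
    """Vertices are users; an undirected edge carries the similarity score."""
    graph: Dict[str, Dict[str, float]] = {user: {} for user in clickstreams}
    for u, v in combinations(clickstreams, 2):
        w = similarity(clickstreams[u], clickstreams[v])
        if w > min_weight:  # drop edges at or below the threshold
            graph[u][v] = w
            graph[v][u] = w
    return graph

streams = {"u1": "AAB", "u2": "AAC", "u3": "DDD"}
print(build_similarity_graph(streams))
```

Comparing all pairs is quadratic in the number of users, so a real deployment would likely need some form of sparsification; that choice is outside what the slides describe.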

Distance Functions Of Each Model • Click Sequence (CS) Model • N-gram overlap • N-gram + count • Time-based Model • Compare the distributions of inter-arrival times • Kolmogorov-Smirnov (K-S) test • Hybrid Model • Bucketize inter-arrival times • Compute 5-grams (similar to the CS Model)
Example, n-gram overlap: S1 = AAB, S2 = AAC • ngram1 = {A, B, AA, AB, AAB} • ngram2 = {A, C, AA, AC, AAC}
Example, n-gram + count (Euclidean distance): S1 = AAB, S2 = AAC • ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)} • ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)} • V1 = (2,1,0,1,0,1,1,0)/6 • V2 = (2,0,1,1,1,0,0,1)/6 (dimensions ordered A, B, C, AA, AC, AB, AAB, AAC)
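The n-gram + count example can be reproduced with a short sketch. It assumes, as the V1 and V2 vectors suggest, that counts are normalized by the total number of n-grams before taking the Euclidean distance. The second function computes the two-sample K-S statistic (the largest gap between two empirical CDFs); using that statistic directly as a distance between inter-arrival time distributions is an assumption here, since the slides only name the test. Function names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter
from typing import Sequence

def ngram_counts(seq: str, max_n: int) -> Counter:
    """Count every contiguous n-gram of length 1..max_n."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def ngram_count_distance(s1: str, s2: str, max_n: int = 5) -> float:
    """Euclidean distance between normalized n-gram count vectors."""
    c1, c2 = ngram_counts(s1, max_n), ngram_counts(s2, max_n)
    t1, t2 = sum(c1.values()), sum(c2.values())
    if t1 == 0 or t2 == 0:
        raise ValueError("both sequences must be non-empty")
    keys = set(c1) | set(c2)  # shared vector space over all observed n-grams
    return math.sqrt(sum((c1[k] / t1 - c2[k] / t2) ** 2 for k in keys))

def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample K-S statistic: max |F_x(v) - F_y(v)| over observed values."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d

# S1 = AAB, S2 = AAC, n-grams up to length 3 as in the slide example.
print(ngram_count_distance("AAB", "AAC", max_n=3))
print(ks_statistic([1.0, 2.0, 2.5], [1.5, 8.0, 9.0]))
```

For AAB versus AAC the normalized vectors differ by 1/6 in six of the eight dimensions, so the distance works out to sqrt(6/36) = sqrt(1/6).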

Detection In A Nutshell • Inputs: • Trained clusters • Input sequences for testing • Methodology: given a test sequence A • K Nearest Neighbor: find the top-k nearest sequences in the trained clusters • Nearest Cluster: find the nearest cluster based on average distance to the sequences in the cluster • Nearest Cluster (Center): pre-compute the center(s) of each cluster, find the nearest cluster center
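Two of the detection methods above can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: `toy_distance` is a placeholder (one minus the overlap of event-type sets), the majority vote in `knn_label` is an assumed way to turn the top-k neighbors into a label, and the cluster labels "sybil" and "normal" stand in for the output of cluster coloring.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

def toy_distance(a: str, b: str) -> float:
    """Placeholder distance: 1 - Jaccard overlap of event-type sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - (len(sa & sb) / len(union) if union else 1.0)

def knn_label(
    test_seq: str,
    labeled: List[Tuple[str, str]],
    distance: Callable[[str, str], float] = toy_distance,
    k: int = 3,
) -> str:
    """K nearest neighbor: majority label among the k closest training sequences."""
    nearest = sorted(labeled, key=lambda item: distance(test_seq, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def nearest_cluster(
    test_seq: str,
    clusters: Dict[str, List[str]],
    distance: Callable[[str, str], float] = toy_distance,
) -> str:
    """Nearest cluster: smallest average distance to the cluster's sequences."""
    def avg_distance(members: List[str]) -> float:
        return sum(distance(test_seq, m) for m in members) / len(members)
    return min(clusters, key=lambda name: avg_distance(clusters[name]))

clusters = {"sybil": ["AAAA", "AAAB"], "normal": ["CDEF", "CDEG"]}
labeled = [(seq, name) for name, seqs in clusters.items() for seq in seqs]
print(knn_label("AAB", labeled, k=3))
print(nearest_cluster("AAB", clusters))
```

The center-based variant would replace the per-member average with a distance to one or more pre-computed representatives per cluster, trading some accuracy for fewer distance computations per test sequence.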

Clustering Sequences • How well can each method separate Sybils from legitimate users?

Detection Accuracy • Basics • Train on one group of users and test on the other group • Clusters trained using the Hybrid Model • Key takeaways • High accuracy with 50 clicks in the test sequence • Nearest Cluster (Center) method achieves high accuracy with minor computation overhead

Can The Model Be Effective Over Time? • Experiment method • Train the model on the first two weeks of data • Test on the following two weeks of data

Still Ongoing Work • With broad interest and applications • As a Sybil detection tool • Code being tested internally at Renren • Trained with 10K users (2-week log) • Tested on 1 million users (1-week log) • 5 Sybil clusters, 22K suspicious profiles • Further improvement • Training with longer clickstreams (half of the users have <5 clicks in 2 weeks) • More conservative labeling of Sybil clusters • As a user modeling tool • Code being tested by LinkedIn as a user profiler

Some Useful Tools • Graph Partitioning • METIS: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview • Community Detection • Louvain code: https://sites.google.com/site/findcommunities/

Other Ongoing Work/Ideas • Fighting against crowdturfing • Crowdturfing: real users are paid to spam • How to detect these malicious real users • User behavior model • Network-wide temporal anomaly detection • Information Dissemination • Content sharing via social edges • How often will users click on the content • How often will users comment on the content • Sybil detection, targeted ad placement

Thank You! Questions? http://current.cs.ucsb.edu
