Clickstream Models & Sybil Detection • Gang Wang (王刚) • UC Santa Barbara • gangw@cs.ucsb.edu
Modeling User Clickstream Events • User-generated events • E.g. profile load, link follow, photo browse, friend invite • Assume we have event type, userID, timestamp • Intuition: Sybil users act differently from normal users • Goal-oriented: focus on specific actions, fewer “extraneous” events • Time-limited: focused on efficient use of time, smaller gaps? • Is forcing Sybil users to mimic normal users a win?
System Overview • [Figure: system pipeline. Components: Clickstream Log, Sequence Clustering, Cluster Coloring, Known Good Users, Incoming Clickstream; output: Legit? / Sybils]
Clickstream Models • Clickstream log • User clicks (click type) with timestamps • Modeling clickstreams • Event-only Sequence Model: order of events • e.g. ABCDA • Time-based Model: sequence of inter-arrival times • e.g. {t1, t2, t3, …} • Hybrid Model: sequence of click events with time • e.g. A(t1)B(t2)C(t3)D(t4)A
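The three models can be illustrated with a minimal Python sketch. It assumes a hypothetical log of (userID, event type, timestamp in seconds) rows; the row layout, the sample data, and the helper names are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, event_type, timestamp_in_seconds).
events = [
    ("u1", "A", 0), ("u1", "B", 5), ("u1", "C", 65), ("u1", "A", 70),
    ("u2", "A", 10), ("u2", "A", 12),
]

def per_user_streams(rows):
    """Group rows by user and sort each user's clicks by timestamp."""
    streams = defaultdict(list)
    for user, event, ts in rows:
        streams[user].append((ts, event))
    return {user: sorted(clicks) for user, clicks in streams.items()}

def event_only(clicks):
    """Event-only sequence model: just the order of events."""
    return [event for _, event in clicks]

def time_based(clicks):
    """Time-based model: inter-arrival times between consecutive clicks."""
    return [b[0] - a[0] for a, b in zip(clicks, clicks[1:])]

def hybrid(clicks):
    """Hybrid model: each event followed by the gap to the next event."""
    gaps = time_based(clicks)
    out = []
    for i, (_, event) in enumerate(clicks):
        out.append(event)
        if i < len(gaps):
            out.append(gaps[i])
    return out

streams = per_user_streams(events)
print(event_only(streams["u1"]))  # ['A', 'B', 'C', 'A']
print(time_based(streams["u1"]))  # [5, 60, 5]
print(hybrid(streams["u1"]))      # ['A', 5, 'B', 60, 'C', 5, 'A']
```

The hybrid output mirrors the A(t1)B(t2)C(t3)D(t4)A notation: the last event carries no gap because nothing follows it.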
Clickstream Clustering • Similarity Graph • Vertices: users (or sessions) • Edges: weighted by the similarity score of the two users’ clickstreams • Clustering similar clickstreams together • Graph partitioning using METIS • Q: How to compare two clickstreams?
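A rough sketch of building the similarity graph as a weighted edge list follows. It assumes similarity = 1 - distance for a distance in [0, 1], and uses a placeholder distance (Jaccard over event types) purely for illustration; neither choice comes from the slides. The resulting edges would then be handed to a graph partitioner such as METIS, which is not called here.

```python
from itertools import combinations

def toy_distance(a, b):
    """Placeholder distance in [0, 1]: 1 - Jaccard overlap of event types."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)

def similarity_graph(streams, distance, min_similarity=0.0):
    """Return {(u, v): weight} for every user pair above the threshold."""
    edges = {}
    for u, v in combinations(sorted(streams), 2):
        similarity = 1.0 - distance(streams[u], streams[v])
        if similarity > min_similarity:
            edges[(u, v)] = similarity
    return edges

# Hypothetical event-only clickstreams, one string per user.
streams = {"u1": "ABCA", "u2": "ABBA", "u3": "DDD"}
edges = similarity_graph(streams, toy_distance)
print(edges)  # only (u1, u2) shares any event type
```

Dropping zero-similarity pairs keeps the graph sparse, which matters because the number of user pairs grows quadratically.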
Distance Functions Of Each Model • Click Sequence (CS) Model • Ngram overlap • Ngram+count • Time-based Model • Compare the distributions of inter-arrival times • K-S test • Hybrid Model • Bucketize inter-arrival times • Compute 5grams (similar to CS Model) • Example (ngram overlap): S1 = AAB, S2 = AAC • ngram1 = {A, B, AA, AB, AAB} • ngram2 = {A, C, AA, AC, AAC} • Example (ngram+count, Euclidean distance): S1 = AAB, S2 = AAC • ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)} • ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)} • V1 = (2,1,0,1,0,1,1,0)/6 • V2 = (2,0,1,1,1,0,0,1)/6
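The ngram+count example can be reproduced with a short Python sketch, alongside a hand-rolled two-sample K-S statistic for the time-based model. The function names are hypothetical, and the use of 1- to 3-grams simply matches the AAB/AAC example (the hybrid model uses 5grams). Treating the K-S statistic itself as the distance is an assumption.

```python
import math
from bisect import bisect_right
from collections import Counter

def ngram_counts(seq, max_n=3):
    """Count every contiguous ngram of length 1..max_n in a string."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def ngram_count_distance(s1, s2, max_n=3):
    """Euclidean distance between count vectors normalized by total count."""
    c1, c2 = ngram_counts(s1, max_n), ngram_counts(s2, max_n)
    t1 = sum(c1.values()) or 1  # avoid division by zero on empty input
    t2 = sum(c2.values()) or 1
    keys = set(c1) | set(c2)
    return math.sqrt(sum((c1[k] / t1 - c2[k] / t2) ** 2 for k in keys))

def ks_statistic(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    best = 0.0
    for p in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, p) / len(xs)
        fy = bisect_right(ys, p) / len(ys)
        best = max(best, abs(fx - fy))
    return best

print(ngram_counts("AAB"))                 # A:2, B:1, AA:1, AB:1, AAB:1
print(ngram_count_distance("AAB", "AAC"))  # sqrt(1/6), about 0.408
print(ks_statistic([1, 2], [10, 20]))      # 1.0, fully separated samples
```

For S1 = AAB and S2 = AAC the two normalized vectors differ by 1/6 in six positions, so the distance is sqrt(6/36) = sqrt(1/6).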
Detection In A Nutshell • Inputs: • Trained clusters • Input sequences for testing • Methodology: given a test sequence A • K nearest neighbor: find the top-k nearest sequences in the trained clusters • Nearest Cluster: find the nearest cluster based on average distance to the sequences in the cluster • Nearest Cluster (center): pre-compute the center(s) of each cluster, then find the nearest cluster center
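The three detection methods can be sketched in Python as follows. The placeholder distance, the sample clusters, and the choice of a medoid as the cluster "center" are all assumptions made for illustration; the slides do not say how the center is computed.

```python
from collections import Counter

def toy_distance(a, b):
    """Placeholder distance in [0, 1]: 1 - Jaccard overlap of event types."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def knn_label(test, clusters, distance, k=3):
    """K nearest neighbor: majority label among the k closest sequences."""
    labeled = [(seq, label) for label, seqs in clusters.items() for seq in seqs]
    nearest = sorted(labeled, key=lambda item: distance(test, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def nearest_cluster(test, clusters, distance):
    """Nearest Cluster: smallest average distance to a cluster's members."""
    def average(label):
        seqs = clusters[label]
        return sum(distance(test, s) for s in seqs) / len(seqs)
    return min(clusters, key=average)

def medoid(seqs, distance):
    """Assumed 'center': the member closest to all other members."""
    return min(seqs, key=lambda s: sum(distance(s, o) for o in seqs))

def nearest_center(test, centers, distance):
    """Nearest Cluster (center): compare only against precomputed centers."""
    return min(centers, key=lambda label: distance(test, centers[label]))

# Hypothetical trained clusters, already colored as Sybil or normal.
clusters = {
    "sybil": ["AAAA", "AAAB", "AABA"],
    "normal": ["CDEF", "DCFE", "CDGE"],
}
centers = {label: medoid(seqs, toy_distance) for label, seqs in clusters.items()}

print(knn_label("AABB", clusters, toy_distance))        # sybil
print(nearest_cluster("AABB", clusters, toy_distance))  # sybil
print(nearest_center("CDCD", centers, toy_distance))    # normal
```

The center variant needs one distance computation per cluster at test time instead of one per training sequence, which is where its low overhead comes from.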
Clustering Sequences • How well can each method separate Sybils from legitimate users?
Detection Accuracy • Basics • Train on one group of users and test on the other group • Clusters trained using the Hybrid Model • Key takeaways • High accuracy with 50 clicks in the test sequence • The Nearest Cluster (center) method achieves high accuracy with minor computation overhead
Can the Model Stay Effective Over Time? • Experiment method • Train the model on the first two weeks of data • Test on the following two weeks of data
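The train/test split over time might look like the sketch below. It assumes hypothetical log rows of (userID, event type, timestamp in seconds); the window handling is an illustration, not the experiment's actual code.

```python
TWO_WEEKS = 14 * 24 * 3600  # window length in seconds

def split_by_time(rows, start, window=TWO_WEEKS):
    """Train on [start, start + window), test on the next window."""
    train = [r for r in rows if start <= r[2] < start + window]
    test = [r for r in rows if start + window <= r[2] < start + 2 * window]
    return train, test

# Hypothetical rows: (user_id, event_type, timestamp_in_seconds).
rows = [
    ("u1", "A", 0),
    ("u1", "B", TWO_WEEKS - 1),
    ("u2", "A", TWO_WEEKS),
    ("u2", "C", 2 * TWO_WEEKS),  # falls outside both windows
]
train, test = split_by_time(rows, start=0)
print(len(train), len(test))  # 2 1
```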
Still Ongoing Work • With broad interest and applications • As a Sybil detection tool • Code being tested internally at Renren • Trained with 10K users (2-week log) • Testing on 1 million users (1-week log) • 5 Sybil clusters, 22K suspicious profiles • Further improvement • Training with longer clickstreams (half of the users have <5 clicks in 2 weeks) • Being more conservative in labeling Sybil clusters • As a user modeling tool • Code being tested by LinkedIn as a user profiler
Some Useful Tools • Graph Partitioning • METIS: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview • Community Detection • Louvain code: https://sites.google.com/site/findcommunities/
Other Ongoing Work/Ideas • Fighting against crowdturfing • Crowdturfing: real users are paid to spam • How to detect these malicious real users • User behavior model • Network-wide temporal anomaly detection • Information Dissemination • Content sharing via social edges • How often will users click on the content • How often will users comment on the content • Sybil detection, targeted ad placement
Thank You! Questions? http://current.cs.ucsb.edu