
Clickstream Models & Sybil Detection

Clickstream Models & Sybil Detection. Gang Wang (王刚), UC Santa Barbara, gangw@cs.ucsb.edu. Modeling User Clickstream Events. User-generated events, e.g. profile load, link follow, photo browse, friend invite. Assume we have event type, userID, timestamp.


Presentation Transcript


  1. Clickstream Models & Sybil Detection • Gang Wang (王刚) • UC Santa Barbara • gangw@cs.ucsb.edu

  2. Modeling User Clickstream Events • User-generated events • E.g. profile load, link follow, photo browse, friend invite • Assume we have event type, userID, timestamp • Intuition: Sybil users act differently from normal users • Goal-oriented: focus on specific actions, fewer “extraneous” events • Time-limited: focused on efficient use of time, smaller gaps? • Forcing Sybil users to mimic normal users → win?

  3. System Overview • [Diagram; labels: Clickstream Log, Sequence Clustering, Cluster Coloring, Known Good Users, Incoming Clickstream, Legit?, Sybils]

  4. Clickstream Models • Clickstream log • User clicks (click type) with timestamps • Modeling Clickstream • Event-only Sequence Model: order of events • e.g. ABCDA • Time-based Model: sequence of inter-arrival times • e.g. {t1, t2, t3, …} • Hybrid Model: sequence of click events with inter-arrival times • e.g. A(t1)B(t2)C(t3)D(t4)A
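The three representations on this slide can be sketched in a few lines of Python. This is only an illustration: the `Event` tuple layout and the function names are assumptions, not part of the presentation, and each event type is assumed to be encoded as a single letter.

```python
from typing import List, Optional, Tuple

# Hypothetical record for one user's log: (event_type, timestamp_in_seconds).
Event = Tuple[str, float]

def event_sequence(events: List[Event]) -> str:
    """Event-only model: event types in time order, e.g. 'ABCDA'."""
    ordered = sorted(events, key=lambda e: e[1])
    return "".join(etype for etype, _ in ordered)

def inter_arrival_times(events: List[Event]) -> List[float]:
    """Time-based model: gaps between consecutive events, {t1, t2, ...}."""
    times = sorted(ts for _, ts in events)
    return [later - earlier for earlier, later in zip(times, times[1:])]

def hybrid_sequence(events: List[Event]) -> List[Tuple[str, Optional[float]]]:
    """Hybrid model: each event paired with the gap to the next event.

    The last event has no following gap, so it is paired with None,
    matching the trailing 'A' in A(t1)B(t2)C(t3)D(t4)A.
    """
    ordered = sorted(events, key=lambda e: e[1])
    gaps: List[Optional[float]] = [
        nxt[1] - cur[1] for cur, nxt in zip(ordered, ordered[1:])
    ]
    gaps.append(None)
    return [(etype, gap) for (etype, _), gap in zip(ordered, gaps)]
```

For a log such as `[("A", 0.0), ("B", 2.0), ("C", 5.0), ("D", 6.0), ("A", 10.0)]`, the event-only model yields `ABCDA` and the time-based model yields the four gaps between the five clicks.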

  5. Clickstream Clustering • Similarity Graph • Vertices: users (or sessions) • Edges: weighted by the similarity score of two users’ clickstreams • Clustering similar clickstreams together • Graph partitioning using METIS • Q: How to compare two clickstreams?
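A minimal sketch of building the weighted similarity graph as an edge list follows. The slide does not say how a distance is turned into an edge weight, so `weight = 1 - distance` is an assumption, as are the function names, the `min_weight` cutoff, and the placeholder `toy_distance` (which assumes non-empty clickstreams). The resulting edge list would then be handed to a graph partitioner such as METIS; that step is not shown.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str, float]

def similarity_graph(
    clickstreams: Dict[str, str],
    distance: Callable[[str, str], float],
    min_weight: float = 0.0,
) -> List[Edge]:
    """Return weighted edges (user_a, user_b, similarity) for all user pairs.

    Assumption: similarity = 1 - distance, with distance in [0, 1].
    Pairs whose similarity is not above min_weight get no edge.
    """
    edges: List[Edge] = []
    for (u, su), (v, sv) in combinations(sorted(clickstreams.items()), 2):
        weight = 1.0 - distance(su, sv)
        if weight > min_weight:
            edges.append((u, v, weight))
    return edges

def toy_distance(a: str, b: str) -> float:
    """Placeholder distance: share of distinct event types not in common."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)
```

The all-pairs loop is quadratic in the number of users, which is fine for a sketch but would need pruning or sampling at the scale of a real clickstream log.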

  6. Distance Functions Of Each Model • Click Sequence (CS) Model • Ngram overlap • e.g. S1 = AAB, S2 = AAC: ngram1 = {A, B, AA, AB, AAB}, ngram2 = {A, C, AA, AC, AAC} • Ngram+count (Euclidean distance) • e.g. S1 = AAB, S2 = AAC: ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)}, ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)}; V1 = (2,1,0,1,0,1,1,0)/6, V2 = (2,0,1,1,1,0,0,1)/6 • Time-based Model • Compare the distributions of inter-arrival times • K-S test • Hybrid Model • Bucketize inter-arrival times • Compute 5-grams (similar to the CS Model)
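The distance functions on this slide might be sketched as below. Several details are assumptions because the slide does not spell them out: "Ngram overlap" is implemented here as a Jaccard distance over n-gram sets, the K-S comparison is reduced to the two-sample K-S statistic computed by hand (no p-value), and the bucket boundaries in `hybrid_tokens` are hypothetical. All function names are illustrative.

```python
import math
from bisect import bisect_right
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a sequence of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def overlap_distance(a, b, max_n=3):
    """Ngram overlap, taken here (as an assumption) to be Jaccard distance."""
    ga, gb = set(ngram_counts(a, max_n)), set(ngram_counts(b, max_n))
    union = ga | gb
    return 1.0 - len(ga & gb) / len(union) if union else 0.0

def count_distance(a, b, max_n=3):
    """Ngram+count: Euclidean distance between normalized count vectors."""
    ca, cb = ngram_counts(a, max_n), ngram_counts(b, max_n)
    ta, tb = sum(ca.values()) or 1, sum(cb.values()) or 1
    keys = set(ca) | set(cb)
    return math.sqrt(sum((ca[k] / ta - cb[k] / tb) ** 2 for k in keys))

def ks_statistic(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs.

    Assumes both samples are non-empty.
    """
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in set(xs) | set(ys):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap

def hybrid_tokens(hybrid, bounds=(1.0, 10.0, 60.0)):
    """Turn (event, gap) pairs into tokens like 'A0' or 'B2'.

    bounds are hypothetical bucket edges in seconds; a gap of None
    (the last event) produces the bare event type.
    """
    return [
        etype if gap is None else f"{etype}{bisect_right(bounds, gap)}"
        for etype, gap in hybrid
    ]
```

With the slide's example, `ngram_counts("AAB", 3)` has six n-gram occurrences in total, which is where the `/6` normalization in V1 and V2 comes from. For the hybrid model, the token list from `hybrid_tokens` can be passed to `count_distance` with `max_n=5` to compare bucketized clickstreams the same way as plain click sequences.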

  7. Detection In A Nutshell • Inputs: • Trained clusters • Input sequences for testing • Methodology: given a test sequence A • K nearest neighbors: find the top-k nearest sequences in the trained clusters • Nearest Cluster: find the nearest cluster based on average distance to the sequences in the cluster • Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
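The three detection methods can be sketched as follows. The slide does not say how the k neighbors are combined or how a cluster center is defined, so the majority vote in `knn_label` and the medoid used as the center are assumptions, as are all names and the placeholder `toy_dist` (which assumes non-empty sequences). Clusters are assumed to be already labeled, for example as "sybil" or "legit".

```python
from collections import Counter
from typing import Callable, Dict, List

Dist = Callable[[str, str], float]
Clusters = Dict[str, List[str]]  # cluster label -> training sequences

def knn_label(test: str, clusters: Clusters, dist: Dist, k: int = 3) -> str:
    """K nearest neighbors: majority label among the k closest sequences."""
    scored = sorted(
        (dist(test, seq), label)
        for label, seqs in clusters.items()
        for seq in seqs
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

def nearest_cluster(test: str, clusters: Clusters, dist: Dist) -> str:
    """Nearest Cluster: smallest average distance to a cluster's sequences."""
    def avg(label: str) -> float:
        seqs = clusters[label]
        return sum(dist(test, s) for s in seqs) / len(seqs)
    return min(clusters, key=avg)

def medoid(seqs: List[str], dist: Dist) -> str:
    """Assumed center: the member with the smallest total distance to the rest."""
    return min(seqs, key=lambda c: sum(dist(c, s) for s in seqs))

def nearest_center(test: str, centers: Dict[str, str], dist: Dist) -> str:
    """Nearest Cluster (center): compare against pre-computed centers only."""
    return min(centers, key=lambda label: dist(test, centers[label]))

def toy_dist(a: str, b: str) -> float:
    """Placeholder distance: share of distinct event types not in common."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)
```

The center variant does one distance computation per cluster at test time, while the other two touch every training sequence, which is the trade-off behind choosing between them.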

  8. Clustering Sequences • How well can each method separate Sybils from legitimate users?

  9. Detection Accuracy • Basics • Train on one group of users and test on the other group • Clusters trained using the Hybrid Model • Key takeaways • High accuracy with 50 clicks in the test sequence • Nearest Cluster (center) method achieves high accuracy with minor computation overhead

  10. Can the Model Be Effective Over Time? • Experiment method • Train the model on the first two weeks of data • Test on the following two weeks of data

  11. Still Ongoing Work • With broad interest and applications • As Sybil detection tool • Code being tested internally at Renren • Trained with 10K users (2-week log) • Testing on 1 million users (1-week log) • 5 Sybil clusters, 22K suspicious profiles • Further improvement • Training with longer clickstreams (half of the users have <5 clicks in 2 weeks) • More conservative in labeling Sybil clusters • As user modeling tool • Code being tested by LinkedIn as user profiler

  12. Some Useful Tools • Graph Partitioning • METIS http://glaros.dtc.umn.edu/gkhome/metis/metis/overview • Community Detection • Louvain code https://sites.google.com/site/findcommunities/

  13. Other Ongoing Work/Ideas • Fighting against crowdturfing • Crowdturfing: real users are paid to spam • How to detect these malicious real users • User behavior model • Network-wide temporal anomaly detection • Information Dissemination • Content sharing via social edges • How often will users click on the content • How often will users comment on the content • Sybil detection, targeted ad placement

  14. Thank You! Questions? http://current.cs.ucsb.edu
