
Analyzing Large, Dynamic Communication Graphs





  1. Analyzing Large, Dynamic Communication Graphs
     [Figure: example communication graph; node labels: Homer, Marge, Bart, Lisa, Maggie, Bank, Blog]
     Domains: Telecommunications, Information Networks
     • Application: who are the web spammers? Who is the message board troll?
     • Application: who is a repetitive debtor?
     We analyze communication graphs.
     Students: Smriti Bhagat, Irina Rozenbaum, Hongyi Xue, Yinhua Wu
     Researchers: Graham Cormode (AT&T) and S. Muthukrishnan (Rutgers)

  2. Summary of Work
     • Learning on communication graphs
       • Some blogs specify the age or location of the blogger; others do not. We have designed methods for inferring age, location and other profile information of blogs when they are not specified.
     • Signatures for communication
       • We designed signatures and signature properties for finding masqueraders.
     • Ongoing:
       • Profile browsers for blogs: point to a blog and determine its "community", i.e., the blogs that interact with it either explicitly via links or implicitly via shared topics and interests. We have built a preliminary browser, and will have a demonstrable version by summer 08.
       • Continuous monitoring algorithms for detecting changes, anomalies, etc. These algorithms minimize the communication between sensors and the central data collection and processing servers.
     Venues: ICWSM07, KDD07, ICDE08, SODA08
     More details next.

  3. Blogs and Trustworthiness
     "Blogs" are a rich type of open source data:
     • React rapidly to major news, defining opinion and identifying articles of interest
     • Raise problems of trustworthiness, finding leaders, classifying for expertise and bias
     • Intersect with web, email, chat data, social networks…

  4. Blog Analysis: Our Approach
     • Our approach: model the data as dynamic multigraphs, and study these problems as labeling on these structures
     • Motivating example: from links between blogs and a few known labels, can we accurately estimate the age of a blogger?
     • Cannot always trust the input data… apparently some people lie about their age
     [Figure: linked Blog, Blog Entry and Webpage nodes; known ages 31, 22, 33 and one unknown age (?)]

  5. Blog Analysis Status
     • Collected several million blogs and links from three blog hosts
     • Data anonymized for storage
     • Implemented multiple algorithms
       • Local: based on neighbors
       • Global: find similar structure elsewhere
     • Accuracy: 90% of assigned labels are within ±2 years of the true age
     • Ongoing work:
       • Analyzing the impact of geographic information ("metros")
       • Including additional features: comments, timestamps, etc.
       • Extending algorithms, applying to other problems
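The slide only says that the local algorithms are "based on neighbors", so the following is a minimal sketch of one such rule, not the authors' method. It assumes (hypothetically) that an unlabeled blog takes the median age of the labeled blogs it links to; the names `infer_age_local`, `links` and `known_ages`, and the sample ages, are illustrative.

```python
from statistics import median

def infer_age_local(links, known_ages, blog):
    """Estimate a blog's age label from its labeled neighbors.

    links:      dict mapping a blog to the blogs it links to
    known_ages: dict mapping some blogs to a known age label
    Returns None when no neighbor carries a label.
    (Assumed rule: median of neighbor ages; the slides do not specify one.)
    """
    neighbor_ages = [known_ages[n] for n in links.get(blog, ()) if n in known_ages]
    return median(neighbor_ages) if neighbor_ages else None

# Illustrative data: blog_x links to three labeled blogs and one unlabeled blog.
links = {"blog_x": ["a", "b", "c", "d"]}
known_ages = {"a": 31, "b": 22, "c": 33}
estimate = infer_age_local(links, known_ages, "blog_x")
```

The median is used here rather than the mean because the slides note that some stated ages cannot be trusted, and a median is less sensitive to a few implausible labels.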

  6. Communication Signatures
     • Very valuable to define signatures of users or people
       • Identify anomalies if the signature suddenly changes
       • Find the same person when they adopt a new identity
     • The nature of useful signatures varies across networks
       • Telecoms: the top 10 callees define a community of interest (mom, dad, cat…)
       • Web: the top 10 sites accessed are not a good signature (yahoo news, google, msn…)
     • Need a new methodology to find and evaluate signature schemes
     [Figure: node n1 with weighted edges w12, w13, w14, w15 to n2, n3, n4, n5; callee labels: wrong number, father, mother, sister]

  7. Signatures – Our Approach
     • Define 3 fundamental properties of signatures:
       • Persistence: my signature yesterday is the same today
       • Uniqueness: my signature is different from yours
       • Robustness: unchanged by small changes in behavior
     • Develop methods for testing these properties of a generic scheme for any application
     • The evaluation methodology is more important to us than any particular scheme
     [Figure: communication graph on nodes n1 to n8]
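The slides state persistence and uniqueness in words but do not say how they are scored. A minimal sketch, assuming signatures are sets of node ids and assuming Jaccard distance as the comparison (a stand-in chosen for illustration, not something the slides specify):

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def persistence(sig_prev, sig_now):
    """High when one node's signature is the same in two time windows."""
    return 1.0 - jaccard_distance(sig_prev, sig_now)

def uniqueness(sig_v, sig_u):
    """High when two different nodes have different signatures."""
    return jaccard_distance(sig_v, sig_u)

# Illustrative signatures.
same_day_to_day = persistence({5, 8}, {5, 8})
partly_changed = persistence({5, 8}, {5, 13})
fully_distinct = uniqueness({5, 8}, {2, 3})
```

Robustness could be probed with the same functions by perturbing the graph slightly, recomputing the signature, and checking that `persistence` between the two stays close to 1.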

  8. Signatures – Our Status
     • Defined signatures for a variety of settings
     • Example: internet activity
       • Find the most accessed sites, scaled by global popularity
       • Finds more "unexpected" sites to give better discrimination
     • Developed algorithms for testing properties of signatures
       • Working on streams of new data
     • Seeking new signature definitions and new applications
     [Figure: web graph; node n1 with weighted edges w12, w13, w14, w15 to n2, n3, n4, n5; site labels: Google, IMDB, Cute-pets, Cat-web]

  9. Analyzing Large, Dynamic Communication Graphs
     • We are interacting with the Intelligence Bureau of the New Jersey Office of Homeland Security and Preparedness (OHSP).
     • OHSP is concentrating on open source data and needs methods like ours.

  10. Appendix
      These are detailed slides, if needed.

  11. Signatures in Graphs
      • Input: a directed, weighted communication graph G=<V,E>
      • A communication graph signature s(v) for a node v is the subset of nodes with the top-k associated weights.
      • Example (k = 2): with w12 = 0.3, w13 = 0.5, w14 = 0.4, w15 = 0.8 and w18 = 0.7, s(1) = {5, 8}
      • Previous work focused on signatures for particular tasks:
        • Find repetitive debtors in Telecommunications Networks [Cortes et al.]
        • Identify authors in paper citation graphs [Hill et al.]
      • Our focus: a systematic study of the principles behind signatures for any task, via the signature properties desired by each application.
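The top-k definition on this slide can be sketched directly. The edge weights below are the ones shown on the slide for node 1; the function name `signature` and the dictionary layout are illustrative choices.

```python
import heapq

def signature(out_weights, k):
    """Return the k neighbors with the largest edge weights.

    out_weights: dict mapping neighbor id -> weight of the edge from v.
    """
    top = heapq.nlargest(k, out_weights.items(), key=lambda kv: kv[1])
    return {node for node, _ in top}

# Edge weights out of node 1, as given on the slide.
weights_from_1 = {2: 0.3, 3: 0.5, 4: 0.4, 5: 0.8, 8: 0.7}
s1 = signature(weights_from_1, k=2)   # {5, 8}, matching s(1) on the slide
```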

  12. Signature Properties
      • Persistence: my signature yesterday is the same as today's.
      • Uniqueness: my signature is different from yours.
      • Robustness: signatures remain unchanged by small changes in behavior.
      [Figure: four applications (Anomaly Detection, Multi-usage Detection, Label Reassignment, Label Masquerading) positioned along three axes, persistence, uniqueness and robustness, with levels from low or medium to high]

  13. One-hop Scheme: Top Talkers (TT)
      • Relevance measure: [formula not preserved in the transcript; the figure labels the edge from node 1 to node 8 as C[1, 8]]
      • Example: telecommunication networks
      • Signature: the set of people I called most
      [Figure: node 1 with edges to nodes 18, 8, 13, 20; callee labels: wrong number, father, mother, sister; s(1) = {8, 13, 20}]
      Scheme: TT
      Graph characteristics: edge weights, graph hop distance
      Associated properties: robustness, uniqueness

  14. One-hop Scheme: Unexpected Talkers (UT)
      • Relevance measure: [formula not preserved in the transcript]
      • Example: web traffic graphs
      • Signature: the set of web pages I am interested in reading.
      [Figure: node 1 with edges to nodes 18, 8, 13, 20; site labels: Google, IMDB, Cute-pets, Cat-web; s(1) = {8, 13, 20}]
      Scheme: UT
      Graph characteristics: edge weights, node degrees, graph hop distance
      Associated properties: uniqueness
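The UT relevance formula did not survive in the transcript, so the sketch below only illustrates the idea stated elsewhere in the deck: rank the sites a user accesses most, scaled by global popularity. It assumes (hypothetically) that popularity is the number of distinct users visiting a site and that relevance is the user's visit count divided by that number; all names and counts are made up for illustration.

```python
from collections import Counter

def unexpected_talkers(visits, user, k):
    """Top-k sites for `user`, with visit counts scaled down by site popularity.

    visits: dict mapping user -> {site: visit count}
    Popularity is assumed to be the number of distinct users of a site.
    """
    popularity = Counter()
    for counts in visits.values():
        for site in counts:
            popularity[site] += 1
    scores = {site: count / popularity[site] for site, count in visits[user].items()}
    # Sort by descending score; break ties by site name for determinism.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return set(ranked[:k])

visits = {
    "u1": {"google": 20, "imdb": 9, "cute-pets": 8, "cat-web": 6},
    "u2": {"google": 40, "imdb": 12},
    "u3": {"google": 60, "imdb": 5},
    "u4": {"google": 30},
}
ut_signature = unexpected_talkers(visits, "u1", k=2)
```

With these counts, a raw top-2 by visits for `u1` would be google and imdb, which nearly everyone visits; scaling by popularity surfaces cute-pets and cat-web instead, which is the discrimination effect the slides describe.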

  15. Multi-hop Scheme: Random Walk
      h-hop random walk with resets (RWRh)
      • RWR relevance measure: [formula not preserved in the transcript]
      • RWRh = RWR + h hops
      • Example: social networks
      • Signature: my interests
      [Figure: graph on nodes 1, 3, 5, 7, 8, 10, 11, 12, 13, 18; s(1) = {5, 8, 13}]
      Scheme          Graph characteristics                  Associated properties
      RWR and RWRh    edge weights, # of connecting paths    persistence, robustness
      RWRh            graph hop distance                     uniqueness
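Since the RWR relevance formula is missing from the transcript, the following is a sketch under stated assumptions rather than the authors' definition: the walker follows out-edges in proportion to their weights, jumps back to the start node with probability `reset` at every step, and the walk is cut off after `hops` steps. The function names, the reset value of 0.15 and the toy graph are all illustrative.

```python
def rwr_scores(graph, start, reset=0.15, hops=3):
    """Probability mass per node after `hops` steps of a walk with resets.

    graph: dict mapping node -> {neighbor: edge weight}
    """
    scores = {start: 1.0}
    for _ in range(hops):
        nxt = {start: reset}  # mass that resets to the start node
        for node, mass in scores.items():
            out = graph.get(node, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: send its mass back to the start node.
                nxt[start] += (1 - reset) * mass
                continue
            for nbr, w in out.items():
                nxt[nbr] = nxt.get(nbr, 0.0) + (1 - reset) * mass * w / total
        scores = nxt
    return scores

def rwr_signature(graph, start, k, reset=0.15, hops=3):
    """Top-k nodes by walk score, excluding the start node itself."""
    scores = rwr_scores(graph, start, reset, hops)
    scores.pop(start, None)
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return set(ranked[:k])

# Node 4 is two hops from node 1 and reachable along two paths.
graph = {1: {2: 1.0, 3: 1.0}, 2: {4: 1.0}, 3: {4: 1.0}, 4: {}}
two_hop_signature = rwr_signature(graph, 1, k=1, hops=2)
```

In this toy graph the single most relevant node for node 1 is node 4, which a one-hop scheme such as TT could never select. That illustrates how the number of connecting paths, listed on the slide as a graph characteristic, feeds into the score.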

  16. Application 1: Multi-usage Detection
      • Goal: to identify sets of individuals exhibiting unique behaviors
      • Uniqueness and robustness are needed for this application
      • Top Talkers (TT) outperforms UT and RWRh

  17. Application 2: Label Masquerading
      • Example: repetitive debtors
      • High persistence and uniqueness are desired for this application
      • On the assumption that such masquerades are relatively rare, RWRh outperforms the other two schemes

  18. Summary of Graph Modeling via Signatures
      • There is no single signature scheme that is good for all applications. Different signatures are needed, depending on what balance of the three properties they provide.
      • Signature properties provide insights into suitable signature schemes for desired applications.
      • Semi-streaming algorithms make use of existing sketching techniques (distinct counts, heavy hitters) to estimate signatures.
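The slides do not say which heavy-hitters sketch is used. As one well-known example of the technique, here is a minimal Misra-Gries summary, which keeps at most k - 1 counters over a stream and retains every item whose frequency exceeds n/k. Applying it to one caller's stream of callees, as below, is an assumed use for illustration.

```python
def misra_gries(stream, k):
    """Misra-Gries summary using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    remain in the result; stored counts are lower bounds on true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Illustrative stream of callee ids seen for one caller.
calls = [8] * 5 + [13] * 4 + [20, 18, 7]
summary = misra_gries(calls, k=3)
candidate_signature = set(summary)
```

The summary yields candidates for a top-talkers style signature in a single pass with bounded memory. Exact counts would need a second pass or a different sketch.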
