1 / 22

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

This paper proposes a spam filtering solution for peer-to-peer systems using an approximate signature scheme. The scheme aims to minimize false positives and effectively detect modified spam emails. The paper also explores the scalability and resilience of the signature repository in a structured peer-to-peer network.

Download Presentation

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Approximate Object Location and Spam Filtering on Peer-to-Peer Systems Feng Zhou, Li Zhuang, Ben Y. Zhao,Ling Huang, Anthony D. Josephand John D. Kubiatowicz University of California, Berkeley

  2. The Problem of Spam • Spam • Unsolicited, automated emails • Radicati Group: $20B cost in 2003, $198B in 2007 • Proposed solutions • Economic model for spam prevention • Attach cost to mass email distribution • Weakness: needs wide-spread deployment, prevent but not filter • Bayesian network / machine learning (independent) • “Train” mailer with spam, rely on recognizing words / patterns • Weakness: key words can be masked (images, invis. characters) • Collaborative filtering • Store / query for spam signatures on central repository • Other users query signatures to filter out incoming spam • Weakness: central repository limited in bandwidth, computation ravenben@eecs.berkeley.edu

  3. Our Contribution • Can signatures effectively detect modified spam? • Goals: • Minimize false positives (marking good email as spam) • Recognize modified/customized spam as same as original • Present signature scheme based on approx. fingerprints • Evaluate against random text and real email messages • Can we build a scalable, resilient signature repository • Leverage structured peer-to-peer networks • Constrain query latency and limit bandwidth usage • Orthogonal issues we do not address: • Preprocessing emails to extract content • Interpreting collective votes via reputation systems ravenben@eecs.berkeley.edu

  4. Outline • Introduction • An Approximate Signature Scheme • Evaluation using random text and real emails • Approximate object location • Similarity search on P2P systems • Constraining latency and bandwidth usage • Conclusion ravenben@eecs.berkeley.edu

  5. query response store signature Collaborative Spam Filtering 1. Spam sent to user A 2. Signatures stored 3. Spam sent to user B 4. B queries repository 5. B filters out spam ravenben@eecs.berkeley.edu

  6. Randomize Choose N Sort by value Vector An Approximate Signature Scheme • How to match documents with very similar content • Calculate checksums of all substrings of length L • Select deterministic set of N checksums • A matches B iff |sig(A)  sig(B)| > Threshold • Computation tput: 13MByte/s on P-III 1Ghz A B C D E F checksum Email Message ravenben@eecs.berkeley.edu

  7. 10000 random text documents, size = 5KB, calculate 10 signatures Compare signatures of before and after modifications Analytical results match experimental results Accuracy of Signature Vectors ravenben@eecs.berkeley.edu

  8. Compare pair-wise signatures between 10000 random docs None matched 3 of 10 signatures (100,000,000 pairs) Eliminating False positives ravenben@eecs.berkeley.edu

  9. Evaluation on Real Messages • 29631 Spam Emails from www.spamarchive.org • Processed visually by project members • 14925 (unique), 86% of spam ≤ 5K • Robustness to modification test • Most popular 39 msgs have 3440 modified copies • Examine recognition between copies and originals ravenben@eecs.berkeley.edu

  10. False Positive Test • Non-spam emails • 9589 messages: 50% newsgroup posts + 50% personal emails • Compare against 14925 unique spam messages • Sweet spot, using threshold of 3/10 signatures • Recognition rate > 97.5% • False positive rate < 1 in 140 million pairs ravenben@eecs.berkeley.edu

  11. query response store signature A Distributed Signature Repository? • How do we build a scalable distributed repository? • How do we limit bandwidth consumption and latency? ravenben@eecs.berkeley.edu

  12. Structured Peer-to-Peer Overlays • Storage / query via structured P2P overlay networks • Large sparse ID space N (160 bits: 0 – 2160) • Nodes in overlay network have nodeIDs  N • Given k  N, overlay deterministically maps kto its root node (a live node in the network) • E.g. Chord, Pastry, Tapestry, Kademlia, Skipnet, etc… • Decentralized Object Location and Routing (DOLR) • Objects identified by Globally Unique IDs (GUIDs)  N • Decentralized directory service for endpoints/objectsRoute messages to nearest available endpoint • Object location with locality:routing stretch (overlay location / shortest distance)  O(1) ravenben@eecs.berkeley.edu

  13. DOLR on Tapestry Routing Mesh Client Root(O) Object O Client ravenben@eecs.berkeley.edu Server

  14. Simpler Queries More Powerful QLs Databases DHT/DOLRs Search on Features More Than Just Unique Identifiers • Objects named by Globally Unique ID (GUID) • Application maps secondary characteristics to ID:versioning, modified replicas, app-specific info • Simplify the search problem • out of m search fields, or “features,” find objects matching at least n exactly ravenben@eecs.berkeley.edu

  15. approxPublish(FV,GUID) approxSearch(FV) route (FV, msg) Publish (GUID) route (GUID, msg) route (IPAddr, msg) A Layered Perspective ADOLR layer • Introduce naming mapping from feature vector to GUIDs • Rely on overlay infrastructure for storage • Abstraction of feature vectors as approximate names for object(s) Approximate DOLR/DHT Structured P2P Overlay Network (DOLR) IP Network Layer ravenben@eecs.berkeley.edu

  16. S1 E3 S1 E3, E2 X X S2 E2 S3  E2 Marking a New Spam Message • Signatures stored as inverted index (feature object) inside overlay • User on C gets spam E2, calculates signatures S: {S1, S2, S3} • For each feature in S, if feature object exists, add E2 • If no feature object exists, create one locally and publish node AE3 put (S1, E2) Overlay put (S2, E2) put (S3, E2) node CE2 ravenben@eecs.berkeley.edu

  17. S1 E3, E2 X S4 S1 S2 E2’ E2 S2 E2 S3  E2 Filtering New Emails for Spam • User at node D receives new email E2’ with signatures {S1, S2, S4} • Queries overlay for signatures, retrieve matching GUIDs for each • Threshold = 2/3, contact GUIDs that occur in 2 of 3 result sets • Contact E2 via overlay for any additional info (votes etc) node AE3 node D Overlay node CE2 ravenben@eecs.berkeley.edu

  18. Constraining Bandwidth and Latency • Need to constrain bandwidth and latency • Limit signature query to h overlay hops • Return null set if h hops reached without result • Simulation on transit-stub topologies • 5K nodes, 4K overlay nodes, diameter = 400ms • Each spam message only reaches small group • For each message: % of users seen and marked = mark rate • Measure tradeoff between latency and success rate of locating known spam, for different mark rates ravenben@eecs.berkeley.edu

  19. Simulation Result network diameter ravenben@eecs.berkeley.edu

  20. Feature-based Queries • Approximate Text Addressing • Text objects: features  hashed signatures • Applications: plagiarism detection, replica management • Other feature-based searching • Music similarity search • Extract musical characteristics • Signatures: {hash(field1=value1), hash(field2=value2)…} • E.g. Fourier transform values, # of wavetable entries • Image similarity search • Locate files with similar geometric properties, etc. ravenben@eecs.berkeley.edu

  21. Finally… • Status • ADOLR infrastructure implemented on Tapestry • SpamWatch: P2P spam filterimplemented, including Microsoft Outlook plug-inAvailable for download:http://www.cs.berkeley.edu/~zf/spamwatch • Tapestryhttp://www.cs.berkeley.edu/~ravenben/tapestry • Thank you…Questions? ravenben@eecs.berkeley.edu

  22. Backup Slides Follow ravenben@eecs.berkeley.edu

More Related