
Clustering Spam

MIT Spam Conference 2008, Phil Tom


Presentation Transcript


  1. Clustering Spam
     MIT Spam Conference 2008
     Phil Tom

  2. Simple Clustering Algorithm

     Expand clusters to include similar messages:
     • Identical originating IP addresses.
     • Identical subject lines.
     • Identical message bodies.

     Clustering pseudocode:

         for each cluster in clusters
             expand cluster
         for each message in unclustered messages
             create a new cluster
             add message to cluster
             expand cluster
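The pseudocode above can be sketched in memory as follows. This is only an illustration, not the presenter's implementation: the function name `cluster_messages`, the tuple layout, and the sample messages are hypothetical, the sketch starts with no pre-existing clusters, and it assumes that "expand cluster" repeats until no further message can join (the slide does not say whether expansion is iterated).

```python
from collections import defaultdict

def cluster_messages(messages):
    """messages: dict of message id -> (sender_ip, subject, body).
    Returns a dict of message id -> cluster id."""
    # Index message ids by each attribute value so matches are cheap to find.
    by_key = defaultdict(set)
    for mid, (ip, subject, body) in messages.items():
        by_key[("ip", ip)].add(mid)
        by_key[("subject", subject)].add(mid)
        by_key[("body", body)].add(mid)

    cluster_of = {}
    next_id = 1
    for seed in messages:
        if seed in cluster_of:
            continue  # already pulled into an earlier cluster
        # "create a new cluster, add message to cluster"
        cluster_id = next_id
        next_id += 1
        cluster_of[seed] = cluster_id
        # "expand cluster": pull in messages sharing an IP, subject, or body
        # (assumption: repeated until nothing new joins)
        frontier = [seed]
        while frontier:
            ip, subject, body = messages[frontier.pop()]
            for key in (("ip", ip), ("subject", subject), ("body", body)):
                for other in by_key[key]:
                    if other not in cluster_of:
                        cluster_of[other] = cluster_id
                        frontier.append(other)
    return cluster_of

# Made-up sample data: 1 and 2 share an IP, 2 and 3 share a subject.
sample = {
    1: ("10.0.0.1", "cheap watches", "body A"),
    2: ("10.0.0.1", "hello", "body B"),
    3: ("10.0.0.9", "hello", "body C"),
    4: ("10.0.0.7", "other", "body D"),
}
clusters = cluster_messages(sample)
```

Under these assumptions, messages 1 to 3 end up in one cluster through a chain of shared attributes, which is also how a single very large cluster can form.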

  3. Dimensional Model

  4. Expand Cluster By IP

         update sdbf_message
            set cluster_id = ?
          where (cluster_id <> ? or cluster_id is null)
            and sender_ip_id in (select sender_ip_id
                                   from sdbf_message
                                  where cluster_id = ?)
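As a rough way to try the statement above, the following sketch runs it against an in-memory SQLite table from Python. SQLite is only a stand-in here, since the slides do not name the database; the `message_id` column, the helper name `expand_cluster_by_ip`, and the sample rows are hypothetical, while `sdbf_message`, `sender_ip_id`, and `cluster_id` come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the fact table; message_id is a hypothetical key column.
conn.execute(
    "create table sdbf_message ("
    " message_id integer primary key,"
    " sender_ip_id integer,"
    " cluster_id integer)"
)
conn.executemany(
    "insert into sdbf_message (message_id, sender_ip_id, cluster_id)"
    " values (?, ?, ?)",
    [(1, 100, 7), (2, 100, None), (3, 100, 9), (4, 200, None)],
)

def expand_cluster_by_ip(conn, cluster_id):
    """Move every message that shares a sender IP with the cluster into it.
    Returns the number of messages that were reassigned."""
    cur = conn.execute(
        "update sdbf_message set cluster_id = ? "
        "where (cluster_id <> ? or cluster_id is null) "
        "and sender_ip_id in "
        "(select sender_ip_id from sdbf_message where cluster_id = ?)",
        (cluster_id, cluster_id, cluster_id),
    )
    return cur.rowcount

moved = expand_cluster_by_ip(conn, 7)
rows = conn.execute(
    "select message_id, cluster_id from sdbf_message order by message_id"
).fetchall()
```

Note that the `cluster_id <> ?` branch also reassigns messages that already belong to another cluster (message 3 here), so expansion can absorb parts of existing clusters as well as unclustered messages.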

  5. Expand Cluster By Body

         update sdbf_message m
            set cluster_id = ?
           from sdbd_body b
          where (m.cluster_id <> ? or m.cluster_id is null)
            and m.body_id in (select body_id
                                from sdbf_message
                               where cluster_id = ?)
            and m.body_id = b.body_id
            and b.size_in_bytes > 25

  6. Expand Cluster By Subject

         update sdbf_message m
            set cluster_id = ?
           from sdbd_subject s
          where (m.cluster_id <> ? or m.cluster_id is null)
            and m.subject_id in (select subject_id
                                   from sdbf_message
                                  where cluster_id = ?)
            and m.subject_id = s.subject_id
            and (s.word_count > 1 or length(s.subject) > 10)
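The body and subject expansions differ from the IP expansion by one extra condition each: bodies must be larger than 25 bytes, and subjects must have more than one word or more than 10 characters. Presumably this keeps very short, generic values from merging unrelated messages, although the slides do not state the reason. The two conditions can be sketched as predicates; the function names are hypothetical, and the whitespace word split and UTF-8 byte count are assumptions about how `word_count` and `size_in_bytes` were computed.

```python
def subject_is_distinctive(subject):
    """Mirror of: s.word_count > 1 or length(s.subject) > 10."""
    word_count = len(subject.split())  # assumption: words are whitespace-separated
    return word_count > 1 or len(subject) > 10

def body_is_distinctive(body):
    """Mirror of: b.size_in_bytes > 25."""
    return len(body.encode("utf-8")) > 25  # assumption: size measured as UTF-8 bytes

checks = [
    subject_is_distinctive("Hi"),            # one short word: not used for matching
    subject_is_distinctive("Hello there"),   # two words: used
    subject_is_distinctive("Unsubscribed"),  # one word, 12 characters: used
    body_is_distinctive("x" * 25),           # exactly 25 bytes: not used
    body_is_distinctive("x" * 26),           # 26 bytes: used
]
```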

  7. Test Data Set
     • Dec 22, 2007 - Dec 29, 2007
     • Single “Received:” header tag only
     • No multi-part messages
     • 1.7 million messages
     • Roughly 20%

  8. Cluster Results

  9. Messages per Cluster Size
     *Not including the big cluster

  10. Top Clusters by IPs

         cluster_id | messages | subject | bodies |    ips | networks | countries
         -----------+----------+---------+--------+--------+----------+----------
                  1 |  1436206 |   99836 | 330852 | 325660 |     8940 |       177
                 62 |    26623 |     451 |  25992 |   1313 |       57 |         2
                 59 |    11322 |      19 |     15 |    962 |        4 |         1
                 68 |     1065 |       2 |   1065 |    609 |       12 |         4
                 69 |     4476 |      59 |     85 |    514 |       17 |         1
              10477 |     5521 |       5 |      9 |    283 |        4 |         1
                953 |      722 |     149 |    333 |    275 |       16 |         1
                175 |      307 |       2 |    306 |    208 |      179 |        26
                379 |      240 |       7 |      9 |    184 |        4 |         1
              18219 |     5581 |      15 |   5212 |    153 |      119 |        26
               3924 |     2934 |      20 |   2934 |    150 |        1 |         1
                144 |      377 |      22 |    377 |    125 |        3 |         1
                242 |      307 |       4 |      3 |    124 |        5 |         1
                134 |     3399 |      48 |    169 |    114 |       17 |         1
                209 |      156 |       4 |    155 |    105 |       96 |        19
                198 |     1117 |     174 |   1100 |    101 |        4 |         1

  11. The Big One

     Cluster 1 summary

         messages | subject | bodies |    ips | networks | countries
         ---------+---------+--------+--------+----------+----------
          1436206 |   99836 | 330852 | 325660 |     8940 |       177

     Top 10 countries by IP count

         messages | subjects | bodies |   ips | networks | country_name
         ---------+----------+--------+-------+----------+----------------
           254948 |    30854 |  62772 | 27464 |     1453 | United States
            75969 |     5110 |  27366 | 27446 |      170 | Germany
           114328 |     6558 |  39312 | 26758 |      147 | Spain
            78378 |     4705 |  29291 | 25263 |       48 | Turkey
            91527 |     4624 |  29926 | 20930 |      209 | United Kingdom
            51708 |     3194 |  19983 | 16842 |       42 | Peru
            52652 |     2848 |  19644 | 15533 |      148 | Colombia
            39475 |     3059 |  13344 | 10129 |      152 | Chile
            34827 |     5063 |  12790 |  9664 |       12 | Brazil
            40144 |     4381 |  13368 |  9372 |      126 | Italy

  12. Clustering the Big One
     • Create clusters on subject and body

         messages | cluster_id |    ips | subjects | bodies |
         ---------+------------+--------+----------+--------+----------------------
           740447 |      34641 | 131024 |       34 |    136 | fake watches
           111122 |      34643 |  79419 |      330 |  59166 | penis enlargement
            76521 |      34642 |  59112 |       27 |  55129 | online casino
            55421 |      34644 |  44772 |       55 |  25023 | fake name brand goods
            27789 |      34653 |   7190 |       81 |  16225 | viagra
            26815 |      34646 |  11099 |       20 |  19680 | valium
            25679 |      34656 |   5990 |    14846 |  25644 | online pharmacy
            12953 |      34649 |   3391 |       45 |      5 | stock investment
            12924 |      34645 |   4149 |        3 |      5 | porn
            12919 |      34648 |   3483 |        9 |  12332 | software
            10071 |      34650 |   9240 |       17 |   9273 | russian dating

     1099737 messages, 284493 unique IPs

  13. Clustering the Big One (cont)
     Number of overlapping IPs between clusters
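The figure for this slide is not part of the transcript. As a hedged illustration of what "overlapping IPs between clusters" could mean, the sketch below counts the sender IPs shared by each pair of clusters. The function name `ip_overlap` is hypothetical and the IP sets are made up (documentation-range addresses); only the cluster ids are borrowed from the previous slide.

```python
from itertools import combinations

def ip_overlap(cluster_ips):
    """cluster_ips: dict of cluster id -> set of sender IPs.
    Returns a dict of (cluster_a, cluster_b) -> number of shared IPs."""
    return {
        (a, b): len(cluster_ips[a] & cluster_ips[b])
        for a, b in combinations(sorted(cluster_ips), 2)
    }

# Made-up IP sets; the real per-cluster IP lists are not in the slides.
toy = {
    34641: {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    34642: {"192.0.2.2", "192.0.2.3", "192.0.2.4"},
    34643: {"192.0.2.9"},
}
overlap = ip_overlap(toy)
```

A large overlap between two clusters would suggest that the same set of sending hosts is delivering both campaigns.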

  14. Am I Bot or Not?

         cluster_id | messages | subjects | bodies |   ips | networks | countries
         -----------+----------+----------+--------+-------+----------+----------
                 62 |    26623 |      451 |  25992 |  1313 |       57 |         2

         messages | subjects | bodies |   ips | networks | country_name
         ---------+----------+--------+-------+----------+---------------
             1246 |       87 |   1246 |     5 |        3 | Canada
            25377 |      443 |  24746 |  1308 |       54 | United States

     • Subject content widely varied
     • Many blocks of consecutive IPs
     • Some blocks are entire or most of a /24
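The observation about consecutive IP blocks can be checked mechanically. The following is a minimal sketch, assuming IPv4 sender addresses are available as strings; the function names, the 50% threshold for "most of a /24", and the sample addresses are all assumptions made for illustration.

```python
import ipaddress
from collections import defaultdict

def slash24_coverage(ips):
    """Return a dict of /24 network -> number of distinct sender addresses in it."""
    seen = defaultdict(set)
    for ip in ips:
        # strict=False masks the host bits, giving the enclosing /24.
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        seen[net].add(ipaddress.ip_address(ip))
    return {str(net): len(addrs) for net, addrs in seen.items()}

def dense_blocks(ips, min_fraction=0.5):
    """List the /24s where at least min_fraction of the 256 addresses sent mail.
    The 0.5 default is an arbitrary illustration of "most of a /24"."""
    coverage = slash24_coverage(ips)
    return sorted(net for net, n in coverage.items() if n / 256 >= min_fraction)

# Made-up senders: 200 consecutive addresses in one /24, plus one stray address.
senders = [f"198.51.100.{i}" for i in range(200)] + ["203.0.113.5"]
coverage = slash24_coverage(senders)
dense = dense_blocks(senders)
```

Densely populated /24s are more typical of dedicated sending infrastructure than of scattered infected hosts, which is the question the slide title raises.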

  15. Failure is Success

     Delivery Notification cluster:

         cluster_id | messages | subject | bodies |    ips | networks | countries
         -----------+----------+---------+--------+--------+----------+----------
                 68 |     1065 |       2 |   1065 |    609 |       12 |         4

     Subject Detail

         messages | subject
         ---------+------------------
              613 | Delivery failure
              452 | failure delivery

     • Delivery notification from legitimate mail servers
     • Not clustered with spam or sources of spam

  16. Chinese Spam

     Top 10 Chinese Clusters

         cluster_id | messages | subject | bodies |    ips | networks | countries
         -----------+----------+---------+--------+--------+----------+----------
                 59 |    11322 |      19 |     15 |    962 |        4 |         1
               3534 |     9987 |    1803 |      8 |     19 |        3 |         1
                 12 |     8054 |       9 |      8 |     26 |        1 |         1
              10477 |     5521 |       5 |      9 |    283 |        4 |         1
                 69 |     4476 |      59 |     85 |    514 |       17 |         1
                134 |     3399 |      48 |    169 |    114 |       17 |         1
                121 |     2347 |      10 |     10 |      1 |        1 |         1
                456 |     2187 |      21 |     73 |     41 |        6 |         1
                 56 |     2047 |      29 |     45 |     61 |       14 |         1
               4621 |     1944 |       3 |      4 |      5 |        1 |         1

     All Chinese messages

         messages |  ips | networks | clusters | country_name
         ---------+------+----------+----------+---------------
            92235 | 5179 |      197 |      922 | China
              139 |    2 |        1 |        2 | Thailand
               78 |   12 |        3 |        4 | United States
                5 |    4 |        1 |        2 | Germany

  17. Small Clusters
     • Varied subjects and bodies.
     • Manual clustering of “online pharmacy” spam

     Example subjects:

         Buy sugar pills online cheap!!!!11one
         Buy sugar pills online cheap!!!1cos(0)
         Buy sugar pills online cheap!111pi^0

     Coalesced clusters:

         messages |  ips | subjects | bodies | clusters
         ---------+------+----------+--------+----------
            30333 | 9685 |    19453 |  30298 |     3651
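The slide describes this coalescing as manual. Purely as an illustration of how the example subjects above could be grouped automatically, the sketch below normalizes a subject by cutting it at the first run of exclamation marks. The function names and the normalization rule are assumptions, not the presenter's method, and the rule is tailored to these three examples.

```python
import re
from collections import defaultdict

def normalize_subject(subject):
    """Drop everything from the first run of '!' onward, then lowercase
    and collapse whitespace. A rule chosen only to fit the slide's examples."""
    stem = re.split(r"!+", subject, maxsplit=1)[0]
    return " ".join(stem.lower().split())

def coalesce_by_subject(subjects):
    """Group subjects that share the same normalized form."""
    groups = defaultdict(list)
    for s in subjects:
        groups[normalize_subject(s)].append(s)
    return dict(groups)

examples = [
    "Buy sugar pills online cheap!!!!11one",
    "Buy sugar pills online cheap!!!1cos(0)",
    "Buy sugar pills online cheap!111pi^0",
]
groups = coalesce_by_subject(examples)
```

With exact-match clustering these three subjects land in three different clusters; after normalization they fall into one group, which is the effect the coalesced counts on the slide reflect.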

  18. What’s Next?
     • Improve the similarity metrics
     • Cluster a population or random sample
     • Add time to the analysis
