180 likes | 293 Views
Clustering Spam. MIT Spam Conference 2008 Phil Tom. Simple Clustering Algorithm. Clustering pseudocode. Expand clusters to include similar messages: Identical originating IP addresses. Identical subject lines. Identical message bodies. for each cluster in clusters expand cluster
E N D
Clustering Spam MIT Spam Conference 2008 Phil Tom
Simple Clustering Algorithm Clustering pseudocode Expand clusters to include similar messages: • Identical originating IP addresses. • Identical subject lines. • Identical message bodies. for each cluster in clusters expand cluster for each message in unclustered messages create a new cluster add message to cluster expand cluster
Expand Cluster By IP update sdbf_message set cluster_id = ? where (cluster_id <> ? or cluster_id is null) and sender_ip_id in (select sender_ip_id from sdbf_message where cluster_id = ?)
Expand Cluster By Body update sdbf_message m set cluster_id = ? from sdbd_body b where (m.cluster_id <> ? or m.cluster_id is null) and m.body_id in (select body_id from sdbf_message where cluster_id = ?) and m.body_id = b.body_id and b.size_in_bytes > 25
Expand Cluster By Subject update sdbf_message m set cluster_id = ? from sdbd_subject s where (m.cluster_id <> ? or m.cluster_id is null) and m.subject_id in (select subject_id from sdbf_message where cluster_id = ?) and m.subject_id = s.subject_id and (s.word_count > 1 or length(s.subject) > 10)
Test Data Set • Dec 22, 2007 - Dec 29, 2007 • Single “Received:” header tag only • No multi-part messages • 1.7 million messages • Roughly 20%
Messages per Cluster Size *Not including the big cluster
Top Clusters by IPs cluster_id | messages | subject | bodies | ips | networks | countries ------------+----------+---------+--------+--------+----------+----------- 1 | 1436206 | 99836 | 330852 | 325660 | 8940 | 177 62 | 26623 | 451 | 25992 | 1313 | 57 | 2 59 | 11322 | 19 | 15 | 962 | 4 | 1 68 | 1065 | 2 | 1065 | 609 | 12 | 4 69 | 4476 | 59 | 85 | 514 | 17 | 1 10477 | 5521 | 5 | 9 | 283 | 4 | 1 953 | 722 | 149 | 333 | 275 | 16 | 1 175 | 307 | 2 | 306 | 208 | 179 | 26 379 | 240 | 7 | 9 | 184 | 4 | 1 18219 | 5581 | 15 | 5212 | 153 | 119 | 26 3924 | 2934 | 20 | 2934 | 150 | 1 | 1 144 | 377 | 22 | 377 | 125 | 3 | 1 242 | 307 | 4 | 3 | 124 | 5 | 1 134 | 3399 | 48 | 169 | 114 | 17 | 1 209 | 156 | 4 | 155 | 105 | 96 | 19 198 | 1117 | 174 | 1100 | 101 | 4 | 1
The Big One Cluster 1 summary messages | subject | bodies | ips | networks | countries ----------+---------+--------+--------+----------+----------- 1436206 | 99836 | 330852 | 325660 | 8940 | 177 Top 10 countries by IP count messages | subjects | bodies | ips | networks | country_name ----------+----------+--------+-------+----------+--------------------- 254948 | 30854 | 62772 | 27464 | 1453 | United States 75969 | 5110 | 27366 | 27446 | 170 | Germany 114328 | 6558 | 39312 | 26758 | 147 | Spain 78378 | 4705 | 29291 | 25263 | 48 | Turkey 91527 | 4624 | 29926 | 20930 | 209 | United Kingdom 51708 | 3194 | 19983 | 16842 | 42 | Peru 52652 | 2848 | 19644 | 15533 | 148 | Columbia 39475 | 3059 | 13344 | 10129 | 152 | Chile 34827 | 5063 | 12790 | 9664 | 12 | Brazil 40144 | 4381 | 13368 | 9372 | 126 | Italy
Clustering the Big One • Create clusters on subject and body messages | cluster_id | ips | subjects | bodies ----------+------------+--------+----------+-------- 740447 | 34641 | 131024 | 34 | 136 fake watches 111122 | 34643 | 79419 | 330 | 59166 penis enlargement 76521 | 34642 | 59112 | 27 | 55129 online casino 55421 | 34644 | 44772 | 55 | 25023 fake name brand goods 27789 | 34653 | 7190 | 81 | 16225 viagra 26815 | 34646 | 11099 | 20 | 19680 valium 25679 | 34656 | 5990 | 14846 | 25644 online pharmacy 12953 | 34649 | 3391 | 45 | 5 stock investment 12924 | 34645 | 4149 | 3 | 5 porn 12919 | 34648 | 3483 | 9 | 12332 software 10071 | 34650 | 9240 | 17 | 9273 russian dating 1099737 messages 284493 unique IPs
Clustering the Big One (cont) Number of overlapping IPs between clusters
Am I Bot or Not? cluster_id | messages | subjects | bodies | ips | networks | countries ------------+----------+----------+--------+-------+----------+----------- 62 | 26623 | 451 | 25992 | 1313 | 57 | 2 messages | subjects | bodies | ips | networks | country_name ----------+----------+--------+-------+----------+--------------- 1246 | 87 | 1246 | 5 | 3 | Canada 25377 | 443 | 24746 | 1308 | 54 | United States • Subject content widely varied • Many blocks of consecutive IPs • Some blocks are entire or most of a /24
Failure is Success Delivery Notification cluster: cluster_id | messages | subject | bodies | ips | networks | countries ------------+----------+---------+--------+--------+----------+----------- 68 | 1065 | 2 | 1065 | 609 | 12 | 4 Subject Detail messages | subject ----------+------------------ 613 | Delivery failure 452 | failure delivery • Delivery notification from legitimate mail servers • Not clustered with spam or sources of spam
Chinese Spam Top 10 Chinese Clusters cluster_id | messages | subject | bodies | ips | networks | countries ------------+----------+---------+--------+--------+----------+----------- 59 | 11322 | 19 | 15 | 962 | 4 | 1 3534 | 9987 | 1803 | 8 | 19 | 3 | 1 12 | 8054 | 9 | 8 | 26 | 1 | 1 10477 | 5521 | 5 | 9 | 283 | 4 | 1 69 | 4476 | 59 | 85 | 514 | 17 | 1 134 | 3399 | 48 | 169 | 114 | 17 | 1 121 | 2347 | 10 | 10 | 1 | 1 | 1 456 | 2187 | 21 | 73 | 41 | 6 | 1 56 | 2047 | 29 | 45 | 61 | 14 | 1 4621 | 1944 | 3 | 4 | 5 | 1 | 1 All Chinese messages messages | ips | networks | clusters | country_name ----------+------+----------+----------+--------------- 92235 | 5179 | 197 | 922 | China 139 | 2 | 1 | 2 | Thailand 78 | 12 | 3 | 4 | United States 5 | 4 | 1 | 2 | Germany
Small Clusters • Varied subjects and bodies. • Manual clustering of “online pharmacy” spam Example subjects: Buy sugar pills online cheap!!!!11one Buy sugar pills online cheap!!!1cos(0) Buy sugar pills online cheap!111pi^0 Coalesced clusters: messages | ips | subjects | bodies | clusters ----------+------+----------+--------+---------- 30333 | 9685 | 19453 | 30298 | 3651
What’s Next? • Improve the similarity metrics • Cluster a population or random sample • Add time to the analysis