Spam Clustering using Wave Oriented K Means Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN
You’ll be hearing quite a lot about… • Spam signatures • Previous approaches • Spam Features • Clustering • K-Means • K-Medoids • Stream clustering • Constraints
But the essence is… "A nation that forgets its past is doomed to repeat it." Winston Churchill
Spam signatures • Strong relation with dentistry • Necessary evil? • Last resort
Spam signatures (2) • The most annoying problem is that they are labor-intensive • An extension of filtering email by hand • More automation is badly needed to make signatures work
Spam features • The ki of the spam business • Its DNA • Everything and yet nothing • Anything that has a constant value in a given spam wave
Email Layout • We noticed that although spammers tend to change everything in an email to conceal the fact that it is spam, they tend to preserve a certain layout. • We encoded the layout of a message as a string of tokens such as 141L2211. • This later evolved into a message summary such as BWWWLWWNWWE • To this day, message layout is our most effective feature • We also use variations of this feature for the MIME parts, the paragraph contents and so on (a sketch follows).
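To make this concrete, here is a minimal sketch of how such a layout summary could be computed. The slide does not spell out the token alphabet, so the meanings below (B/E for begin/end, W for a plain text line, L for a line containing a link, N for a line containing a number) are illustrative assumptions, not the authors' actual encoding.

```python
import re

def layout_summary(body: str) -> str:
    # Hypothetical alphabet: B/E mark begin/end, W = plain text line,
    # L = line containing a link, N = line containing a number.
    tokens = ["B"]
    for line in body.splitlines():
        if re.search(r"https?://", line):
            tokens.append("L")
        elif re.search(r"\d", line):
            tokens.append("N")
        elif line.strip():
            tokens.append("W")
    tokens.append("E")
    return "".join(tokens)

# A short message with a link and a line of digits would yield
# something like "BWWWLWWNWWE".
```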
Other Spam Features – headers • Subject length, the number of separators, the maximum length of any word • The number of Received fields (turned out we were drunk and high when we chose this one) • Whether it had a name in the From field • A quite nice example is the stripped date format • Take the Date field • Strip it of all alphanumeric characters • Store what’s left • “ , :: - ()” or “, :: +” or “, :: + ” • Any more suggestions?
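The stripped date format is easy to sketch. Assuming alphanumerics are removed and runs of whitespace are collapsed (the collapsing is our assumption, chosen so the output matches the examples above), a standard Date header reduces to its separator skeleton:

```python
import re

def stripped_date_format(date_header: str) -> str:
    # Drop every alphanumeric character, then collapse whitespace runs
    # so only the separator skeleton remains.
    skeleton = re.sub(r"[A-Za-z0-9]", "", date_header)
    return re.sub(r"\s+", " ", skeleton)

# stripped_date_format("Tue, 03 Jun 2008 11:05:30 +0300") -> ", :: +"
```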
Other Spam Features – body • Its length; the number of lines; whether it has long paragraphs or not; the number of consecutive blank lines • Basically any part of the email layout that we felt was more important than the average • The number of links / email addresses / phone numbers • Bayes poison • Attachments • Etc. (a sketch of a few of these follows)
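A rough sketch of what extracting a few of these body features could look like; the exact feature set and the regexes here are illustrative assumptions, not the paper's implementation.

```python
import re

def body_features(body: str) -> dict:
    lines = body.splitlines()
    # Longest run of consecutive blank lines.
    longest_blank, run = 0, 0
    for line in lines:
        run = run + 1 if not line.strip() else 0
        longest_blank = max(longest_blank, run)
    return {
        "length": len(body),
        "num_lines": len(lines),
        "longest_blank_run": longest_blank,
        "num_links": len(re.findall(r"https?://\S+", body)),
        "num_emails": len(re.findall(r"[\w.+-]+@[\w.-]+\.\w+", body)),
    }
```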
Combining features (1) • One stick is easy to break • The Roman fasces symbolized power and authority • A symbol of strength through unity, from the Roman Empire to the U.S. • The most obvious problem: our sticks are different • Strings, integers, Booleans • I’ll stress this later • fasces lictoriae (bundles of the lictors)
Combining features (2) • If it’s an A and at the same time a B, then it’s spam • The idea of combining features never died out • It started in its relaxed form: adding scores • If it has “Viagra” in it, increase its spam score by 10% • Evolution came naturally (a minimal sketch follows) • National Guard Bureau insignia
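The relaxed, score-adding form is simple enough to sketch; the rules and weights below are made up for illustration.

```python
# Hypothetical rules: each matched pattern adds its weight to the score.
RULES = [("viagra", 0.10), ("act now", 0.05), ("winner", 0.05)]

def spam_score(body: str) -> float:
    text = body.lower()
    return sum(weight for pattern, weight in RULES if pattern in text)

# A message is flagged once the accumulated score passes some threshold.
```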
Why cluster spam? • A “well, duh” kind of slide • To extract the patterns we want • How do we combine spam traits to get a reliable spam pattern? • And which are the traits that matter most? • Agglomerative clustering is just one of many options • Neural networks • ARTMap worked beautifully at separating ham from spam
So why agglomerative? • Because the problem as stated before is wrong • We don’t just want spam patterns • We want patterns for each spam wave alone • Most neural nets make a binary decision; we want a plurality of classes • Still, there are other options, like SVMs • They don’t handle string features well • We want something that accepts just about any feature, as long as you can compute a distance
K-means and K-medoids • So we chose the simplest of methods: the widely popular k-means • In a given feature space, each item to be classified is a point • The distance between two points indicates the resemblance of the original items • From a given set of instances to be clustered, it creates k classes based on their similarity • For spaces where the mean of two points cannot be computed there is a variant of k-means: k-medoids (see the sketch below) • This actually solves the different-sticks problem • As usual, by solving one problem we introduce a whole range of others • Combining them
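A minimal k-medoids sketch with a pluggable distance function. This is a simplified illustration (random initialization, fixed iteration cap), not the authors' implementation:

```python
import random

def k_medoids(items, k, dist, iters=20):
    medoids = random.sample(items, k)
    for _ in range(iters):
        # Assignment step: each item joins its nearest medoid.
        clusters = [[] for _ in range(k)]
        for x in items:
            i = min(range(k), key=lambda i: dist(x, medoids[i]))
            clusters[i].append(x)
        # Update step: the new medoid is the member minimizing the
        # total distance to the rest of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, y) for y in members))
            if members else medoids[i]
            for i, members in enumerate(clusters)
        ]
        if new_medoids == medoids:   # converged
            break
        medoids = new_medoids
    return medoids, clusters
```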
An Example • Is it a line or a square? • What about string features? (one possible answer below)
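One plausible answer for string features (our assumption here, since the method only requires that a distance be computable) is an edit distance, normalized and mixed with the numeric features:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic program, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def message_distance(m1, m2) -> float:
    # Hypothetical feature tuples: (layout string, subject length, has_from_name).
    layout = edit_distance(m1[0], m2[0]) / max(len(m1[0]), len(m2[0]), 1)
    subject = abs(m1[1] - m2[1]) / max(m1[1], m2[1], 1)
    from_name = float(m1[2] != m2[2])
    return layout + subject + from_name
```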
Our old model • We focused mainly on correctly defining some powerful spam features • We totally neglected the clustering part • So we used good old-fashioned k-means and k-medoids • And they have serious drawbacks • A fixed number of classes • They work only with an offline corpus • The results were... unpredictable • Luck played a major role
WOKM – Wave Oriented K-Means • By using simple k-means we could only cluster individual sets of emails • We now needed to cluster the whole incoming stream of spam • We also want to store a history of the clusters we extract • And use that information to detect spam on the user side • And also to help us classify better in the future • Remember Churchill?
WOKM – How does it work? • Take snapshots of the incoming spam stream • Take in only what is new • Train on those messages • Store the clusters for future reference (a sketch of this loop follows)
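Put together, one snapshot step could look like the sketch below. The names, the novelty threshold, and the reuse of `k_medoids` from the earlier sketch are all assumptions:

```python
def wokm_step(snapshot, history, dist, novelty_threshold, k):
    # Novelty detection: keep only messages no stored cluster explains.
    novel = [m for m in snapshot
             if not history
             or min(dist(m, c) for c in history) > novelty_threshold]
    if not novel:
        return history
    # Train only on the novel messages, then store the new clusters.
    medoids, _ = k_medoids(novel, min(k, len(novel)), dist)
    return history + medoids
```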
The spam corpus • All the changes originate here • Every message has an associated distance • The distance from it to the closest stored cluster in the cluster history • New clusters must be closer than old ones • Constrained K-Means • Wagstaff & Cardie, 2001 • “must-link” or “cannot-link” constraints • Ours is a history constraint (a sketch follows)
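In code, the history constraint could be a simple predicate checked during assignment; a sketch under our naming assumptions:

```python
def satisfies_history_constraint(msg, candidate, history, dist):
    # A message may only join a new cluster that explains it better than
    # every cluster already stored in the history.
    closest_old = min((dist(msg, h) for h in history), default=float("inf"))
    return dist(msg, candidate) < closest_old
```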
The training phase • While a solution has not been found: • Unassign all the given examples • Assign all examples • Create a given number of clusters • Assign what you can • Create some more and repeat the process • Recompute the centers • Merge adjacent (similar) clusters • This counters the cluster inflation brought by the assign phase • Test the solution (see the sketch after this list)
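A loose sketch of that loop. The control flow and parameter names are assumptions, kept deliberately simple:

```python
def wokm_train(examples, dist, radius, max_rounds=50):
    clusters = []                                 # current cluster centers
    for _ in range(max_rounds):
        # Unassign everything, then assign what falls within `radius`.
        members = [[] for _ in clusters]
        leftovers = []
        for x in examples:
            if clusters:
                i = min(range(len(clusters)),
                        key=lambda i: dist(x, clusters[i]))
                if dist(x, clusters[i]) <= radius:
                    members[i].append(x)
                    continue
            leftovers.append(x)
        if leftovers:
            # Create some more clusters and repeat the process.
            clusters.append(leftovers[0])
            continue
        # Recompute each center as the medoid of its members.
        new = [min(m, key=lambda c: sum(dist(c, y) for y in m))
               for m in members if m]
        # Merge similar clusters to counter the assign-phase inflation.
        merged = []
        for c in new:
            if all(dist(c, kept) > radius for kept in merged):
                merged.append(c)
        # Test the solution: stop once the centers are stable.
        if merged == clusters:
            return clusters
        clusters = merged
    return clusters
```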
What’s worth remembering • Accepts just about any kind of feature: Booleans, integers, and strings • K-means is limited because you have to know the number of classes a priori • WOKM determines the optimal number of classes automatically • New messages will not be assigned to clusters that are not considered close enough • It has a fast novelty-detection phase, so it can train itself only on new spam • It can use the triangle inequality to speed things up (sketched below) • (Future work) It lets us keep track of the changes spammers make in the design of their products • By watching clusters that are close to each other
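The triangle-inequality speed-up works because d(x, c_j) >= d(c_i, c_j) - d(x, c_i): whenever d(c_i, c_j) >= 2·d(x, c_i), cluster c_j cannot be closer than c_i and its distance never needs computing. A sketch, with our own naming:

```python
def nearest_center(x, centers, center_dists, dist):
    # center_dists[i][j] holds precomputed center-to-center distances.
    best, best_d = 0, dist(x, centers[0])
    for j in range(1, len(centers)):
        # If d(best, j) >= 2 * d(x, best), the triangle inequality gives
        # d(x, j) >= d(x, best), so j cannot win and is skipped.
        if center_dists[best][j] >= 2 * best_d:
            continue
        d = dist(x, centers[j])
        if d < best_d:
            best, best_d = j, d
    return best, best_d
```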
Results • Perhaps the most exciting result: the cross-language spam clusters
Results (2) • Then in Spanish • We were surprised to find that this is not an isolated case. YouTube, Microsoft, and Facebook fraud attempts were also found in multiple languages
Results (3) • Then again in French (a different one, though)