Fighting Spam: An Innovative Enhancement to Outlook Express
Zhengxiang Pan & Yuanbo Guo
Target: Outlook Express
• Current anti-spam functionality in OE:
  • Blocked senders list
  • Mail rules
• Limitations:
  • Limited rule-based filter
  • Rules are difficult to generate
  • Lack of flexibility
  • Not adaptive: spam mutates!
    • Free -> F r e e -> F*r*e*e
What did we design?
• An Intelligent Spam Identification Component (ISIC) that uses IDSS techniques, specifically case-based reasoning (CBR)
• Absorbs ideas from rule-based and statistical filters
• Features dynamic attribute selection and heuristic-guided case base maintenance
Case Representation
• Attribute-value pairs
  • Possible values: Yes and No
• Two sets of attributes
  • 51 predefined attributes
    • About specific properties of an email
    • Selected from http://www.spamassassin.org
  • 100 dynamically determined attributes
    • About word occurrences in the email
Dynamically Determined Attributes
• Attribute selection
  • Use the odds ratio as the indicator of a word's predictive power for the categories (spam, non-spam), and rank the words by it
  • Select the top 50 words from each of the spam and non-spam vocabularies as the attributes
• More details are in the paper
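The slides do not give the exact odds-ratio formula, so the following is only a minimal sketch using the common textbook form OR(w) = P(w|spam)(1 − P(w|non-spam)) / ((1 − P(w|spam)) P(w|non-spam)). The function name `select_word_attributes`, the document-frequency estimates, and the add-one smoothing are assumptions made for illustration, not details from the paper.

```python
from collections import Counter


def select_word_attributes(spam_docs, ham_docs, top_n=50):
    """Pick top_n words per class by odds ratio (hypothetical helper).

    spam_docs / ham_docs: lists of tokenized emails (lists of words).
    Returns (spam_words, ham_words, scores).
    """
    # Document frequency: in how many emails of each class a word occurs.
    spam_df = Counter(w for doc in spam_docs for w in set(doc))
    ham_df = Counter(w for doc in ham_docs for w in set(doc))

    def prob(count, total):
        # Add-one smoothing (an assumption) keeps estimates away from 0 and 1.
        return (count + 1) / (total + 2)

    scores = {}
    for w in set(spam_df) | set(ham_df):
        p_spam = prob(spam_df[w], len(spam_docs))
        p_ham = prob(ham_df[w], len(ham_docs))
        scores[w] = (p_spam * (1 - p_ham)) / ((1 - p_spam) * p_ham)

    # High odds ratio indicates spam; low odds ratio indicates non-spam.
    spam_words = sorted(spam_df, key=lambda w: (-scores[w], w))[:top_n]
    ham_words = sorted(ham_df, key=lambda w: (scores[w], w))[:top_n]
    return spam_words, ham_words, scores


spam = [["free", "debt"], ["free", "guaranteed"], ["free", "hello"]]
ham = [["hello", "meeting"], ["hello", "lunch"], ["meeting", "free"]]
spam_words, ham_words, scores = select_word_attributes(spam, ham, top_n=1)
print(spam_words, ham_words)
```

With a very small vocabulary the two lists can overlap; a real selector would need a rule for that case, which the slides do not describe.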
An Example Case
Case 1:
(predefined attributes)
…
CHARSET_FARAWAY = No
TO_EMPTY = Yes
FROM_AND_TO_SAME = Yes
LOTS_OF_CC_LINE = Yes
MISSING_HEADERS = Yes
…
(dynamically selected attributes)
Free = Yes
Guaranteed = Yes
Debt = Yes
Hello = No
…
(solution)
Spam = Yes
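One way such a case could be held in memory is sketched below. The `Case` class and the `word_attributes` helper are hypothetical names, the whitespace tokenizer is a simplification, and the predefined attribute values are simply copied from the example rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Case:
    attributes: dict  # attribute name -> True (Yes) / False (No)
    is_spam: bool     # the solution part of the case


def word_attributes(email_text, selected_words):
    """Yes/No word-occurrence attributes for the selected words."""
    # Naive whitespace tokenization; a real parser would do much more.
    tokens = set(email_text.lower().split())
    return {word: word.lower() in tokens for word in selected_words}


# Predefined attribute values taken from the example case, not computed here.
predefined = {"CHARSET_FARAWAY": False, "TO_EMPTY": True, "FROM_AND_TO_SAME": True}
dynamic = word_attributes(
    "Guaranteed way to get out of debt for free",
    ["Free", "Guaranteed", "Debt", "Hello"],
)
case = Case(attributes={**predefined, **dynamic}, is_spam=True)
print(case)
```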
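Similarity Measurement
• Simple Matching Coefficient (SMC), based on Hamming distance
$$\mathrm{SIM}_H(P, C) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{EQ}(X_i, Y_i)$$
$$\mathrm{EQ}(X_i, Y_i) = \begin{cases} 1 & \text{if } X_i = Y_i \\ 0 & \text{otherwise} \end{cases}$$
• Here $X_i$ and $Y_i$ are the values of the $i$-th attribute in $P$ and $C$, and $N$ is the number of attributes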
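The coefficient can be computed directly as the fraction of attribute positions on which two cases agree. This is a minimal sketch; the function name `smc` and the list-of-booleans encoding of Yes/No values are assumptions.

```python
def smc(p, c):
    """Simple Matching Coefficient between two equal-length Yes/No vectors."""
    if len(p) != len(c):
        raise ValueError("cases must have the same number of attributes")
    if not p:
        raise ValueError("cases must have at least one attribute")
    # EQ(X_i, Y_i) is 1 when the attribute values agree, 0 otherwise.
    matches = sum(1 for x, y in zip(p, c) if x == y)
    return matches / len(p)


new_email = [True, False, True, True]
stored_case = [True, True, True, False]
print(smc(new_email, stored_case))  # 2 of 4 attributes agree -> 0.5
```

Equivalently, the value is 1 minus the Hamming distance divided by N.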
Case Retrieval
• A K-nearest-neighbor-like algorithm
• For a new email P, calculate its similarity SIMH to each case in the case base, and pick the K cases with the largest SIMH values
• If the majority of the chosen cases are labeled as spam, the new email is classified as spam; otherwise as non-spam
• e.g. K = 5
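The retrieval step above can be sketched as a majority vote over the K most similar cases. The name `classify` and the representation of the case base as a list of `(vector, is_spam)` pairs are assumptions; tie-breaking among equally similar cases is not described in the slides and here simply follows list order.

```python
def classify(new_email, case_base, k=5):
    """Return True if the K most similar cases are mostly spam.

    case_base: list of (attribute_vector, is_spam) pairs (assumed layout).
    """
    def similarity(p, c):
        # Simple matching coefficient: share of agreeing attributes.
        return sum(x == y for x, y in zip(p, c)) / len(p)

    ranked = sorted(case_base,
                    key=lambda case: similarity(new_email, case[0]),
                    reverse=True)
    top_k = ranked[:k]
    spam_votes = sum(1 for _, is_spam in top_k if is_spam)
    return spam_votes > len(top_k) / 2


case_base = [
    ([True, True, True, True], True),
    ([True, True, True, False], True),
    ([True, True, False, False], True),
    ([False, False, False, False], False),
    ([False, False, False, True], False),
    ([False, True, False, False], False),
]
print(classify([True, True, True, False], case_base, k=3))
```

Choosing an odd K, as in the K = 5 example, avoids tied votes.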
Case Base Maintenance
• Initially, the spam and non-spam case bases each have 200 cases
• When the case base size reaches 300:
  • Bring the case base size back down using a mechanism that removes cases that are
    • Old (to keep the cases fresh, so that they reflect the trend)
    • Close to the “Center Case” (in an attempt to boost the variety of cases)
  • “Center Case” is a newly introduced concept, defined in the paper
  • Redo attribute selection based on the current cases
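The slides leave both the “Center Case” definition and the exact removal heuristic to the paper, so the sketch below is purely illustrative. It takes the center case as an input rather than defining it, and the scoring rule, the `age_weight` parameter, the 200-case target, and the dictionary layout of a case are all assumptions.

```python
def prune_case_base(cases, center, target_size=200, max_size=300, age_weight=0.5):
    """Drop old, center-like cases once the case base is full (hypothetical).

    cases: list of dicts with 'vector' (Yes/No list) and 'added'
           (a number, larger = more recent).
    center: attribute vector of the center case, supplied by the caller.
    """
    if len(cases) < max_size:
        return list(cases)

    newest = max(c["added"] for c in cases)
    oldest = min(c["added"] for c in cases)
    span = (newest - oldest) or 1  # avoid dividing by zero

    def removal_score(case):
        # Both terms lie in [0, 1]; higher means more likely to be removed.
        age = (newest - case["added"]) / span
        closeness = sum(x == y for x, y in zip(case["vector"], center)) / len(center)
        return age_weight * age + (1 - age_weight) * closeness

    # Keep the cases with the lowest removal scores.
    return sorted(cases, key=removal_score)[:target_size]


cases = [
    {"id": "a", "vector": [True, True], "added": 0},
    {"id": "b", "vector": [False, False], "added": 3},
    {"id": "c", "vector": [True, False], "added": 2},
    {"id": "d", "vector": [True, True], "added": 1},
]
kept = prune_case_base(cases, center=[True, True], target_size=2, max_size=4)
print([c["id"] for c in kept])
```

After pruning, attribute selection would be rerun on the remaining cases, as the last bullet says.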
Architecture
[Figure: components of the ISIC system: Outlook Express API, GUI, Case Base, Case Base Manager, Classifier, Attribute Selector, Parser, Email Repository Manager, Email Repository]
Use enhanced Outlook Express
• Same UI as OE
Conclusion
• Highlights:
  • Localized & easy to construct
  • Personalized
  • Easy to use
  • Adaptive
• Limitations:
  • Initial cases limit personalization
  • Not for standalone use: runs on top of current OE