130 likes | 314 Views
Characteristic Identifier Scoring and Clustering for Email Classification. By Mahesh Kumar Chhaparia. Email Clustering. Given a set of unclassified emails, the objective is to produce high purity clusters keeping the training requirements low. Outline:
E N D
Characteristic Identifier Scoring and Clustering for Email Classification By Mahesh Kumar Chhaparia
Email Clustering • Given a set of unclassified emails, the objective is to produce high purity clusters keeping the training requirements low. • Outline: • Characteristic Identifier Scoring and Clustering (CISC), • Identifier Set • Scoring • Clustering • Directed Training • Comparison of CISC with some of the traditional ideas in email clustering • Comparison of CISC with POPFile (Naïve-Bayes classifier), • Caveats • Conclusion
Evaluation • Evaluation on Enron Email Dataset for the following users (purity measured w.r.t the grouping already available):
CISC: Identifier Set • Sender and Recipients • Words from the subject starting with uppercase • Tokens from the message body • Word sequences with each word starting in uppercase (length [2,5] only) split about stopwords (excluding them) • Acronyms (length [2,5] only) • Words followed by an apostrophe and ‘s’ e.g. TW’s extracted to TW • Words or phrases in quotes e.g. “Trans Western” • Words where any character (excluding first is in uppercase) e.g. eSpeak, ThinkBank etc.
CISC: Scoring • Sender: • Initial idea: generate clusters of email addresses with frequency of communication above some threshold, • (+) Identifies “good” clusters of communication • (-) Difficult to score when an email has addresses spread across more than one cluster • (-) Fixed partitioning and difficult to update
CISC: Scoring (Contd…) • Sender: • Need a notion of soft clustering with both recipients and content • Generate a measure of its non-variability with respect to the addresses it co-occurs with or the content it discusses in emails • Example: • 1 {2,3} {3,4} {2,3,4} in Folder 1 • 2 {1} {3} {4} {1} {3} {1,3} in Folder 2 • Emphasizes social clusters {1,2,3} {1,3,4} • Classify 2 {1,3,4} • Traditionally: Folder 2 (address frequency based) • CISC: Folder 1 (social cluster based) • Difficult to say upfront which is better ! • Efficacy discussed later
CISC: Scoring (Contd…) • Words or Phrases: • Generate a measure of its importance • Using context captured through the co-occurring text • Sample scenarios for score generation: • Different functional groups in a company mentioning “Conference Room” Low score • A single shipment discussion for company “CERN” High score • Several different topic discussions (financial, operational etc.) for company “TW” Low score • Clustering: Pair with highest similarity message and merge clusters sharing atleast one message to produce disjoint clusters • Directed Training: • For each cluster, identify a message likely to belong to majority class • Suggest the user to classify this message
Efficacy of TF-IDF Cosine Similarity • Clustering using the traditional TF-IDF cosine similarity measure for emails not very effective ! Note: • Both TF-IDF and CISC figures with only word and phrase tokens • Number of clusters is different in both cases, but the purity figures indicate the discriminative capability of the respective algorithms
Efficacy of Social Cluster Based Scoring • Results
CISC vs. POPFile • Results • Purity may sometimes (marginally) decrease with increasing training set in POPFile !
Conclusion • Given a set of unclassified emails, the proposed strategy obtains higher clustering purity with lower training requirements than POPFile and TF-IDF based method. • Key differentiators: • Incorporates a combination of communication cluster and content variability based scoring for senders instead of the usual tf-idf scoring or naïve-bayes word model (POPFile), • Picks a set of high-selectivity features for final message similarity model than retaining most content of messages (i.e. all non-stopwords), • Observes and uses the fact that any email in a class may be “close” to only a small number of emails than to all in that class, • Finally, helps lower training requirements through “directed training” than indiscriminate training over as many emails as possible.
Future Work • Design and evaluation for non-corporate datasets • Tuning of message similarity scoring • Different weights for the score components • Different range normalization for different components to boost proportionally • Test feature score proportional to its length • Richer feature set • Phrases following ‘the’ • Test with substring-free collection e.g. “TW Capacity Release Report” and “TW” are replaced with “Capacity Release Report” and “TW” • Hierarchical word scoring to change granularity of clustering • Online classification using training directed feature extraction • Merging high purity clusters effectively to further reduce training requirements