
Adding Semantics to Email Clustering

Adding Semantics to Email Clustering. Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang. Microsoft Research Asia, Beijing, P.R. China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology (ICDM 2006). Date: 2008/10/16. Speaker: Cho Chin Wei.


Presentation Transcript


  1. Adding Semantics to Email Clustering • Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang • Microsoft Research Asia, Beijing, P.R. China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology (ICDM 2006) • Date: 2008/10/16 • Speaker: Cho Chin Wei • Advisor: Dr. Koh, JiaLing

  2. Abstract • This paper presents a novel algorithm to cluster emails according to their contents and the sentence styles of their subject lines. • The goals are to improve clustering performance and to provide meaningful descriptions of the resulting clusters.

  3. Introduction • Email has become an important medium of communication. • A user may receive tens or even hundreds of emails every day, and handling them takes a lot of time. • Automatic approaches are needed to relieve the burden of processing these emails.

  4. Introduction • Supervised methods need a predefined taxonomy, which users must update manually. • Preparing training data is time-consuming and expensive. • An unsupervised technique such as clustering is an attractive alternative.

  5. Introduction • Conventionally, email clustering is based on a bag-of-words representation. • This simplistic approach cannot take full advantage of the valuable linguistic features inherent in semi-structured emails, which may result in unsatisfactory performance.

  6. Method • In this paper, we present a novel technique to cluster emails according to the sentence patterns discovered from the subject lines of the emails. • Each subject line is treated as a sentence and parsed through natural language processing techniques.

  7. Method • Automatically generate meaningful generalized sentence patterns (GSPs) from the subjects of emails, using natural language processing and frequent itemset mining techniques. • Then put forward a novel unsupervised approach that treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, with no human labeling involved.

  8. What is GSP? • GSP stands for Generalized Sentence Pattern. • A GSP summarizes the subjects of a large number of similar emails, giving a semantic representation of the subject lines. • Ex: the GSP {“person”, “seminar”, “date”} means that someone (“person”) gives a “seminar” on some day (“date”).

  9. Generalization of Terms - NLPWin • Microsoft’s NLPWin tool, a natural language parser, takes a sentence as input and builds a syntactic tree for the sentence.

  10. Generalization of Terms - NLPWin • NLPWin tool can generate factoids for noun phrases, e.g. person names, date and place.

  11. Mine GSP • For each email subject: • Remove stop words. • Use NLPWin to generate its syntactic tree. • Add the factoids of the tree nodes to the email subject. • The resulting email subjects are called generalized sentences. • Subsets of GSPs are still GSPs. • Ex: {welcome, Bruce, Lee, person} and {welcome, Bob, Brill, person} are generalized sentences.
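The generalization step on this slide can be illustrated with a small sketch. NLPWin is not publicly available, so the lookup table `FACTOIDS`, the stop-word list `STOP_WORDS`, and the function name `generalize` are all hypothetical stand-ins; the real tool derives factoids from a syntactic parse rather than from a dictionary.

```python
# Hypothetical stand-ins: a tiny lookup table plays the role of the
# factoids that NLPWin would attach to noun phrases.
STOP_WORDS = {"to", "the", "a", "an", "of", "on", "for"}
FACTOIDS = {"bruce": "person", "lee": "person", "bob": "person", "brill": "person"}


def generalize(subject):
    """Return the generalized sentence (a set of terms) for one subject line."""
    words = [w.lower() for w in subject.split()]
    # Drop stop words first, as on the slide.
    terms = {w for w in words if w not in STOP_WORDS}
    # Add the factoid of every term that has one; the original terms are kept.
    terms |= {FACTOIDS[w] for w in terms if w in FACTOIDS}
    return frozenset(terms)


print(sorted(generalize("Welcome to Bruce Lee")))
# ['bruce', 'lee', 'person', 'welcome']
```

Representing a generalized sentence as a set matches the slide's notation and makes the subset tests used when mining GSPs straightforward.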

  12. Mine GSP • Definition: Generalized Sentence Pattern (GSP). Given a set of generalized sentences $S = \{s_1, s_2, \ldots, s_n\}$ and a generalized sentence $p$, the support of $p$ in $S$ is defined as $\mathrm{sup}(p) = |\{ s_i \mid s_i \in S \text{ and } p \subseteq s_i \}|$. Given a minimum support min_sup, if $\mathrm{sup}(p) > \mathit{min\_sup}$, then $p$ is called a generalized sentence pattern, or GSP for short. • Ex: {welcome, person} and {interview, date}. • Only closed GSPs are mined in our experiment.
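The definition above can be checked on a toy example. The sketch below is a brute-force illustration only, assuming generalized sentences are Python `frozenset`s; the function names `support`, `mine_gsps`, and `closed_only` are made up here, and a real system would use a dedicated frequent closed itemset miner instead of enumerating every subset.

```python
from itertools import combinations


def support(pattern, sentences):
    """sup(p): number of generalized sentences that contain every term of p."""
    return sum(1 for s in sentences if pattern <= s)


def mine_gsps(sentences, min_sup):
    """Enumerate all non-empty subsets of each sentence and keep those
    whose support is strictly greater than min_sup (exponential; demo only)."""
    candidates = set()
    for s in sentences:
        for r in range(1, len(s) + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(s), r))
    supports = {p: support(p, sentences) for p in candidates}
    return {p: sup for p, sup in supports.items() if sup > min_sup}


def closed_only(gsps):
    """Keep a GSP only if no proper superset has the same support."""
    return {p: sup for p, sup in gsps.items()
            if not any(p < q and sup == q_sup for q, q_sup in gsps.items())}


sentences = [
    frozenset({"welcome", "bruce", "lee", "person"}),
    frozenset({"welcome", "bob", "brill", "person"}),
    frozenset({"interview", "date"}),
]
gsps = mine_gsps(sentences, min_sup=1)
print(closed_only(gsps))  # only {welcome, person} with support 2 remains
```

Here {welcome} and {person} are GSPs too, but they are not closed because their superset {welcome, person} has the same support.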

  13. GSPs grouping and selection • Similar GSPs are grouped together to reduce the number of GSPs. • GSPs in the same group represent the same cluster. • The similarity between two GSPs $p$ and $q$ is defined based on their subset-superset relationship and support: $\mathrm{Sim}(p, q) = 1$ if $p \supset q$ and $\mathrm{sup}(p) / \mathrm{sup}(q) > \mathit{min\_conf}$; $\mathrm{Sim}(p, q) = 0$ otherwise. • Experiments suggest a min_conf between 0.5 and 0.8. • Too low: more GSPs are grouped, which introduces noise. • Too high: similar GSPs fail to be grouped.
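A direct transcription of the similarity rule, as a minimal sketch. It assumes GSPs are `frozenset`s and that `sup` is a dictionary mapping each GSP to its support; the support values in the example are invented for illustration.

```python
def sim(p, q, sup, min_conf):
    """Sim(p, q) = 1 if p is a proper superset of q and
    sup(p) / sup(q) > min_conf, else 0."""
    return 1 if p > q and sup[p] / sup[q] > min_conf else 0


welcome_person = frozenset({"welcome", "person"})
person = frozenset({"person"})
sup = {welcome_person: 8, person: 10}  # illustrative supports

print(sim(welcome_person, person, sup, min_conf=0.6))  # 1, since 8/10 > 0.6
print(sim(person, welcome_person, sup, min_conf=0.6))  # 0, not a superset
print(sim(welcome_person, person, sup, min_conf=0.9))  # 0, since 8/10 <= 0.9
```

Because a superset pattern can never have higher support than its subset, the ratio lies in (0, 1], so min_conf acts like a confidence threshold from association rule mining.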

  14. GSPs grouping and selection • The number of GSP groups can be several times larger than the actual number of clusters, so representative GSP groups are selected by heuristic rules: • Sort the GSP groups in descending order of length. • Sort them in descending order of support. • Select the first sp_num GSP groups for clustering. • The parameter sp_num controls how many GSP groups are selected for clustering. • A longer GSP is more confident than a shorter one.
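The selection heuristic might look like the sketch below. The slide does not say how the two sort orders combine or how the length and support of a whole group are measured, so this sketch assumes length is the primary key and support the tie-breaker, and that a group is scored by its longest pattern and its highest support. The name `select_groups` and the data are illustrative.

```python
def select_groups(groups, sp_num):
    """Pick sp_num representative GSP groups.

    Each group is a list of (pattern, support) pairs. Assumption: groups
    are ranked by longest pattern first, then by highest support.
    """
    def key(group):
        longest = max(len(pattern) for pattern, _ in group)
        best_support = max(sup for _, sup in group)
        return (longest, best_support)

    return sorted(groups, key=key, reverse=True)[:sp_num]


g1 = [(frozenset({"welcome", "person"}), 8), (frozenset({"person"}), 10)]
g2 = [(frozenset({"person", "seminar", "date"}), 4)]
g3 = [(frozenset({"interview", "date"}), 12)]

selected = select_groups([g1, g2, g3], sp_num=2)
print(selected == [g2, g3])  # True: the length-3 group first, then support 12
```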

  15. GSP as pseudo class label • Based on GSPs, we propose a novel clustering algorithm that forms a pseudo class from the emails matching the same GSP group, and then uses a discriminative variant of the Classification Expectation Maximization (CEM) algorithm to obtain the final clusters. • When CEM is applied to document clustering, the high dimensionality usually causes inaccurate model estimation and degrades efficiency. Here a linear SVM is used as the underlying classifier.

  16. GSP as pseudo class label • The proposed GSP-PCL algorithm uses GSP groups to construct initial pseudo classes. • Only the emails not matching any GSP group are classified.

  17. GSP as pseudo class label • A threshold controls whether an email is put into a pseudo class. • Only when the maximal posterior probability of an email is greater than the given threshold is the email put into the class with the maximal posterior probability. • Otherwise the email is put into a special class $D_{other}$.

  18. Algorithm: GSP-PCL • GSP-PCL($k$, GSP groups $G_1, G_2, \ldots, G_{sp\_num}$, email set $D$)
      1. Construct sp_num pseudo classes using the GSP groups: $D_i^0 = \{ d \mid d \in D \text{ and } d \text{ matches } G_i \}$, $i = 1, 2, \ldots, sp\_num$.
      2. $D' = D - \bigcup_{i=1}^{sp\_num} D_i^0$.
      3. Iterate until convergence. For the $j$-th iteration, $j > 0$:
         a) Train an SVM classifier based on $D_i^{j-1}$, $i = 1, \ldots, sp\_num$.
         b) For each email $d \in D'$, classify $d$ into $D_i^{j}$ if $P(D_i^{j-1} \mid d)$ is the maximal posterior probability and $P(D_i^{j-1} \mid d) > \mathit{min\_class\_prob}$.
      4. $D_{other} = D - \bigcup_{i=1}^{sp\_num} D_i^{j}$.
      5. Use basic k-means to partition $D_{other}$ into $(k - sp\_num)$ clusters.
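Steps 1 to 4 of the algorithm can be sketched in plain Python. This is an illustration under loud assumptions, not the authors' implementation: the linear SVM is replaced by a toy word-frequency scorer (`train_toy_classifier`) so the example runs without external libraries, emails are `frozenset`s of terms, an email "matches" a group if it contains all terms of at least one GSP in it, and all function names are hypothetical. Step 5 (k-means on the leftover emails) is omitted.

```python
from collections import Counter


def matches(email, group):
    """Assumption: an email matches a group if it contains some GSP of it."""
    return any(pattern <= email for pattern in group)


def train_toy_classifier(classes):
    """Stand-in for the linear SVM: one word-frequency profile per class."""
    profiles = [Counter(t for d in docs for t in d) for docs in classes]

    def posterior(email):
        # Add-one smoothed overlap scores, normalized to sum to 1.
        scores = [1 + sum(profile[t] for t in email) for profile in profiles]
        total = sum(scores)
        return [s / total for s in scores]

    return posterior


def gsp_pcl(groups, emails, min_class_prob, train=train_toy_classifier, max_iter=10):
    # Step 1: seed one pseudo class per GSP group.
    seeds = [[d for d in emails if matches(d, g)] for g in groups]
    # Step 2: D' holds the emails that match no GSP group.
    seeded = {d for docs in seeds for d in docs}
    rest = [d for d in emails if d not in seeded]
    classes = [list(docs) for docs in seeds]
    # Step 3: retrain and reassign D' until the classes stop changing.
    for _ in range(max_iter):
        posterior = train(classes)
        new_classes = [list(docs) for docs in seeds]
        for d in rest:
            probs = posterior(d)
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] > min_class_prob:
                new_classes[best].append(d)
        if new_classes == classes:
            break
        classes = new_classes
    # Step 4: everything still unassigned goes to D_other.
    assigned = {d for docs in classes for d in docs}
    other = [d for d in emails if d not in assigned]
    return classes, other


e1 = frozenset({"welcome", "bruce", "person"})
e2 = frozenset({"welcome", "bob", "person", "team"})
e3 = frozenset({"seminar", "date", "nlp"})
e4 = frozenset({"welcome", "team", "lunch"})
e5 = frozenset({"budget", "report"})
groups = [[frozenset({"welcome", "person"})], [frozenset({"seminar", "date"})]]

classes, other = gsp_pcl(groups, [e1, e2, e3, e4, e5], min_class_prob=0.6)
print(classes == [[e1, e2, e4], [e3]], other == [e5])  # True True
```

In this toy run, `e4` matches no GSP but is pulled into the first pseudo class because it shares vocabulary with it, while `e5` never clears the threshold and is left for the k-means step.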

  19. Experiments • 1. Enron email dataset: an email archive from many of the senior managers of Enron Corp. • 2. Private email dataset.

  20. Experiments • Let C={C1 …Cm} be a set of clusters generated by a clustering algorithm on a data set X ,and B={B1…Bn} be a set of predefined classes on X. • Each pair of two objects (Xi ,Xj) from the data set X belongs to one of the four possible cases:

  21. Experiments • Precision, recall and F1-measure are calculated as follows:
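The four cases and the formulas themselves did not survive in this transcript. The sketch below therefore assumes the usual pair-counting definitions: a pair of objects in the same cluster and the same class is a true positive, same cluster but different class is a false positive, and different cluster but same class is a false negative. The function name `pairwise_prf` and the labels are illustrative.

```python
from itertools import combinations


def pairwise_prf(clusters, classes):
    """Pair-counting precision, recall and F1.

    clusters[i] is the cluster id assigned to object i; classes[i] is its
    predefined class. Assumes the standard pairwise definitions.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


print(pairwise_prf([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
# (0.5, 0.5, 0.5)
```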

  22. Experiments

  23. Experiments

  24. Cluster Naming • In GSP-PCL, cluster names are generated as follows: • If the emails in a cluster match one or more GSP groups, the cluster is named by the GSP with the highest support. • Otherwise it is named by its top five words, ranked by a score in which $C_j$ denotes the cluster, $d_i$ is an email, $t_k$ is a word, and $t_{ki}$ is the weight of word $t_k$ in email $d_i$.

  25. Cluster Naming

  26. Cluster Naming

  27. Conclusions • In this paper, we proposed a novel approach to automatically extract embedded knowledge from email subjects to help improve email clustering. • Natural language processing and frequent closed itemset mining techniques are employed to generate generalized sentence patterns (GSPs) from email subjects, which can be used to assist clustering as well as serve as good cluster descriptors.
