This report explores a methodology for personalized email prioritization, leveraging social networks to determine the importance of messages. It introduces a supervised classification framework and proposes a semi-supervised importance propagation algorithm.
Mining Social Networks for Personalized Email Prioritization Shinjae Yoo, Yiming Yang, Frank Lin, Il-Chul Moon [KDD’09] Advisor: Dr. Koh Jia-Ling Reporter: Che-Wei, Liang Date: 2009/08/25
Outline • Introduction • Social Clustering • Measuring Social Importance • Semi-supervised Importance Propagation • Experiments • Conclusions and Future Work
Introduction • Email • One of the most prevalent personal and business communication tools • Asynchronous • Processing a large volume of email messages of differing importance is a burden
Introduction • Information overload problem • Need to develop systems that automatically • Learn personal priorities for each user • Identify personally interesting messages • Identify important messages that need the user’s attention
Introduction • Many statistical learning techniques have been studied in support of email-based prediction tasks • Spam identification, folder recommendation, recipient reminding, action-item identification, social group analysis • But personalized email prioritization • Remains an under-explored problem • Mainly due to privacy issues in collecting personal data
Introduction • This paper • Creates a new collection of anonymized personal email data with importance levels • Proposes a fully personalized methodology for technical development and evaluation • Develops a supervised classification framework • For modeling personal priorities over messages and predicting importance levels for new messages
Motivation • Sender information • One of the most indicative features • Messages sent by members of the same group tend to share a similar priority level • Capturing sender groups would be informative for predicting the importance of messages • If a sender does not have any labeled instances • Infer that sender’s importance from the other members of the group, based on unsupervised clustering
Personalized Social Network • For each user, a personalized social network is constructed by using the email data of that user • Practicality • Personalization • Email contact network • Represented by a graph G = (V, E) • V: email contacts (users) • E: message sending among users, unweighted ($E_{ij} = 1$ if there is a message from user i to user j, $E_{ij} = 0$ otherwise)
Clustering • Newman Clustering • Has been used successfully to find social structures • Defines edge betweenness • A link with a high score is a crucial link between the boundary nodes of two clusters • Deleting links with high edge-betweenness scores leaves disconnected components, which are taken as clusters [Figure: example contact graph with nodes A, B, C, D, E, F, G, H, I, J, L, R]
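The graph definition above can be sketched in a few lines. This is only an illustration: the function name `build_contact_graph` and the `(sender, recipients)` input format are assumptions, not something the slides specify.

```python
def build_contact_graph(messages):
    """Build the unweighted directed contact graph G = (V, E).

    `messages` is an iterable of (sender, recipients) pairs (assumed format).
    The result maps each contact i to the set of contacts j with E_ij = 1.
    """
    graph = {}
    for sender, recipients in messages:
        graph.setdefault(sender, set())        # every sender is a node in V
        for recipient in recipients:
            graph.setdefault(recipient, set())  # every recipient is a node in V
            graph[sender].add(recipient)        # repeated messages still give E_ij = 1
    return graph


messages = [("a", ["b", "c"]), ("b", ["a"]), ("a", ["b"])]
graph = build_contact_graph(messages)
```

Using sets keeps the edges unweighted: a second message from "a" to "b" does not change the graph.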
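A minimal sketch of the link-deletion idea, assuming an undirected graph without self-loops stored as a dict of neighbor sets. The stopping rule (stop once `k` components exist) and all function names are assumptions for illustration; the slides do not say how the number of clusters is chosen.

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an undirected, unweighted graph."""
    score = {}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    return score


def components(adj):
    """Connected components of an undirected graph."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def newman_clusters(adj, k):
    """Delete the highest-betweenness link until k components exist (assumed rule)."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        scores = edge_betweenness(adj)
        if not scores:
            break
        v, w = max(scores, key=scores.get)
        adj[v].discard(w)
        adj[w].discard(v)
    return components(adj)


# Two triangles joined by the single bridge c-d.
two_triangles = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
clusters = newman_clusters(two_triangles, 2)
```

In the example, every shortest path between the two triangles crosses c-d, so that link has the highest score and is deleted first.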
Measuring Social Importance • Link relations provide useful information about the centrality of each contact
Measuring Social Importance • In-degree centrality • Out-degree centrality • Total-degree centrality [Figure: example graph with nodes A, B, C, D, E]
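A small sketch of the three degree measures on a directed edge list. It returns raw counts; whether the paper normalizes them (for example by the number of contacts) is not stated on the slide, so no normalization is assumed. The function name is hypothetical.

```python
def degree_centralities(edges):
    """Return (in_degree, out_degree, total_degree) dicts of raw counts.

    `edges` is an iterable of (sender, receiver) pairs; duplicates are ignored
    because the contact graph is unweighted.
    """
    unique_edges = set(edges)
    nodes = {n for edge in unique_edges for n in edge}
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for src, dst in unique_edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    total = {n: in_deg[n] + out_deg[n] for n in nodes}
    return in_deg, out_deg, total


edges = [("A", "B"), ("A", "C"), ("B", "A"), ("D", "A")]
in_deg, out_deg, total = degree_centralities(edges)
```

Here contact A receives from B and D and sends to B and C, so its in-degree and out-degree are both 2.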
Measuring Social Importance • Clustering Coefficient • Measures connectivity among the neighborhood of the node • Clique Count • Clique: fully connected sub-graph • A large clique count of node v means • It connects to large and well-connected sub-graphs • It is located in the center of the sub-graphs [Figure: example graph with nodes A, B, C, D, E, F]
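One common way to compute the clustering coefficient, sketched under the assumption that the contact graph is treated as undirected for this measure (the slide does not say). The function name and the dict-of-sets input are illustrative.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves linked.

    `adj` maps each node to the set of its neighbors (undirected graph).
    Nodes with fewer than two neighbors get 0.0 by convention.
    """
    neighbors = adj[v] - {v}
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Triangle A-B-C with an extra contact D attached to A.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
cc_a = clustering_coefficient(adj, "A")
cc_b = clustering_coefficient(adj, "B")
cc_d = clustering_coefficient(adj, "D")
```

A has three neighbor pairs but only B-C is linked, so its coefficient is 1/3; B's two neighbors are linked, so its coefficient is 1.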
Measuring Social Importance • Betweenness centrality • Percentage of existing shortest paths, out of all possible paths, that go through node i • $\sigma_{jk}$: number of shortest paths between j and k • $\sigma_{jk}(i)$: number of shortest paths between j and k that go through i
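The formula on this slide did not survive extraction. A commonly used definition that matches the two symbols above is given here as an assumption; the paper's exact normalization may differ.

```latex
\[
BC(i) \;=\; \sum_{j \neq i \neq k} \frac{\sigma_{jk}(i)}{\sigma_{jk}}
\]
```

For a three-node path j, i, k, the only shortest path between j and k passes through i, so that pair contributes $\sigma_{jk}(i)/\sigma_{jk} = 1/1 = 1$ to $BC(i)$.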
Measuring Social Importance • HITS Authority • Hyperlink-Induced Topic Search, also known as Hubs and Authorities • Measures the global importance of a node • Definition: given the N-by-N adjacency matrix X, the authority scores can be calculated by • Finding the principal eigenvector r of the matrix $X^{T}X$, where • r satisfies $(X^{T}X)\,r = \lambda r$, • $\lambda$ is the largest eigenvalue
Measuring Social Importance • PCC Analysis • Pearson Correlation Coefficient • Compute the PCC of each social metric with the human-labeled importance levels of email messages • Indicates how useful each metric is for predicting the importance of email messages
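A sketch of how the authority vector can be approximated by power iteration, using the standard HITS formulation in which authority scores are the principal eigenvector of X^T X. The function name, the fixed iteration count, and normalizing by the sum of scores are choices made for this illustration.

```python
def hits_authority(X, iterations=100):
    """Approximate HITS authority scores by power iteration on X^T X.

    X is an N-by-N adjacency matrix as a list of lists, X[i][j] = 1 if i links to j.
    Scores are rescaled to sum to 1 after every step.
    """
    n = len(X)
    r = [1.0] * n
    for _ in range(iterations):
        # Hub step: h = X r
        h = [sum(X[i][j] * r[j] for j in range(n)) for i in range(n)]
        # Authority step: r = X^T h, so overall r <- X^T X r
        r = [sum(X[i][j] * h[i] for i in range(n)) for j in range(n)]
        total = sum(r)
        if total == 0:
            return r  # graph has no links
        r = [value / total for value in r]
    return r


# Contacts 0 and 1 both send to contact 2.
X = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
authority = hits_authority(X)
```

In the example only contact 2 receives any links, so all authority mass ends up on it.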
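The PCC itself is a standard statistic. A minimal sketch follows; the function name and the sample values are made up for illustration and are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical values: one metric score per message (the sender's score)
# against the importance label of that message.
metric_scores = [1, 2, 3]
importance = [2, 4, 6]
pcc = pearson(metric_scores, importance)
```

A value near +1 or -1 means the metric tracks the labels closely; a value near 0 means it says little about importance.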
Semi-supervised Importance Propagation • Semi-supervised Importance Propagation (SIP) • Propagates the importance values of labeled email messages (the training examples) to other messages and the corresponding contact persons
SIP Algorithm • Uses a bipartite graph to represent the interactions between email contacts and email messages • Let N = number of email contacts, M = number of messages • Two matrices represent the two types of edges: matrix A (N by M) and matrix B (N by M) • $A_{i,j} = 1$ if person i sends message j, and $A_{i,j} = 0$ otherwise • $B_{i,j} = 1$ if person i receives message j, and $B_{i,j} = 0$ otherwise
SIP Algorithm • Treat each importance label (1~5) as a category • Use a vector $x_k$ (M by 1) to indicate the labels of messages • $x_{k,i} = 1$ if message i belongs to category k, $x_{k,i} = 0$ otherwise • Importance propagation from messages to persons (receivers) is calculated as $y_k = B\,x_k$ • Importance propagation from persons (senders) to messages is calculated as $x_k = A^{T} y_k$
Propagation Example • Messages to persons (receivers) • Persons (senders) to messages [Figure: bipartite graph of persons and messages; message labels shown as ?, ?, ?, ?, ?, 4, 3, 2, ?, ?]
SIP Algorithm • Updating of the importance values for contact persons at each time step (t) is calculated by: $y_k(t+1) = B A^{T} y_k(t) = C\,y_k(t)$, where $C = B A^{T}$
SIP Algorithm • $y_k(t+1)$ is a linear transformation of $y_k(t)$ by the matrix C • If C is irreducible and t is large, $y_k$ stabilizes at the principal eigenvector of C • The irreducible property is not always guaranteed • Even when it holds, the principal eigenvector is insensitive to the starting vector
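SIP Algorithm • A linear interpolation • Define an importance vector for category k, and normalize it by the sum of its entries • Define an importance-sensitive matrix • Its columns are identical, each column equal to the normalized importance vector • Normalize matrix C to C’ • $\alpha \in [0, 1]$ is the interpolation weight • The interpolated matrix $E_k$ is irreducible and importance-sensitive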
SIP Algorithm • Finally, • The SIP method is defined iteratively as: $y_k(t+1) = E_k\,y_k(t)$ • $E_k$ is irreducible, so $y_k$ stabilizes when t is large • $y_k$ consists of the expected importance score of each person after iterative SIP
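The equations on the SIP slides were lost in extraction, so the following is only a sketch under stated assumptions, not the paper's exact formulation. Assumed: C = B A^T (receivers collect importance from senders through messages), C' is C with each column scaled to sum to 1, the importance vector is v = B x_k normalized by its sum, every column of the importance-sensitive matrix equals v, and E_k = alpha * C' + (1 - alpha) * U_k. Which term carries alpha is a guess. All names are hypothetical.

```python
def sip_scores(A, B, x_k, alpha=0.5, iterations=50):
    """Iterate y <- E_k y for one importance category k (assumed formulation).

    A[i][j] = 1 if person i sent message j; B[i][j] = 1 if person i received it.
    x_k[j] = 1 if message j is labeled with category k, else 0.
    """
    n, m = len(A), len(x_k)
    # Assumed C = B A^T: C[i][j] counts messages sent by j and received by i.
    C = [[sum(B[i][q] * A[j][q] for q in range(m)) for j in range(n)]
         for i in range(n)]
    # Assumed C': scale each column to sum to 1 (all-zero columns stay zero).
    col = [sum(C[i][j] for i in range(n)) for j in range(n)]
    Cn = [[C[i][j] / col[j] if col[j] else 0.0 for j in range(n)]
          for i in range(n)]
    # Assumed importance vector v = B x_k, normalized by its sum.
    v = [sum(B[i][q] * x_k[q] for q in range(m)) for i in range(n)]
    total = sum(v)
    if total == 0:
        return [0.0] * n  # no labeled message of this category was received
    v = [value / total for value in v]
    y = v[:]
    for _ in range(iterations):
        # Every column of U_k equals v, so U_k y = v * sum(y).
        s = sum(y)
        y = [alpha * sum(Cn[i][j] * y[j] for j in range(n))
             + (1 - alpha) * v[i] * s for i in range(n)]
        norm = sum(y)
        if norm == 0:
            break
        y = [value / norm for value in y]
    return y


# Person 0 sends message 0 to person 1; person 1 sends message 1 to person 2.
A = [[1, 0], [0, 1], [0, 0]]
B = [[0, 0], [1, 0], [0, 1]]
x_k = [1, 0]  # only message 0 is labeled with category k
y = sip_scores(A, B, x_k)
```

In the example, person 1 receives the labeled message directly and person 2 only receives mail from person 1, so person 1 ends with the higher score while person 0, who receives nothing, stays at zero.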
Experiments • Data • Recruited 25 experimental subjects • Each subject was asked to label non-spam messages • Preprocessing • Email address canonicalization • Word tokenization and stemming • Stop words were not removed from the title and body text
Experiments • Features • Basic features are tokens in the from, to, cc, title, and body text fields, represented by a v-dimensional vector • Social-network based features • An m-dimensional sub-vector represents the NC features • A 7-dimensional sub-vector represents the social importance (SI) features • A 5-dimensional sub-vector represents the five SIP scores per user
Experiments • Classifiers • Use five linear SVM classifiers to predict the importance level of each email message • Use the standard SVMlight software package • Metric • N = number of messages • $y_i$ = the true importance level of message i • $\hat{y}_i$ = the predicted importance level for that message
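The metric formula itself is missing from the slide. The three symbols listed fit a mean absolute error over messages, so the sketch below assumes that metric; the function name and the sample values are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over N messages (assumed metric)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical importance levels on the 1 to 5 scale.
true_levels = [5, 3, 1]
predicted_levels = [4, 3, 3]
mae = mean_absolute_error(true_levels, predicted_levels)
```

With these values the absolute errors are 1, 0, and 2, giving an average of 1 importance level per message.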
Conclusions and Future Work • Future work • Collection of more data • from a larger number of users in a longer time period • Comparative study on • different clustering algorithms, and • graph-mining techniques with respect to effectiveness