
Machine Learning for Personal Information Management

Machine Learning for Personal Information Management. William W. Cohen, Machine Learning Department and Language Technologies Institute, School of Computer Science, Carnegie Mellon University; with Vitor Carvalho, Einat Minkov, Tom Mitchell, and Andrew Ng (Stanford).


Presentation Transcript


  1. Machine Learning for Personal Information Management. William W. Cohen, Machine Learning Department and Language Technologies Institute, School of Computer Science, Carnegie Mellon University; with Vitor Carvalho, Einat Minkov, Tom Mitchell, Andrew Ng (Stanford), and Ramnath Balasubramanyan.

  2. ML for email [Cohen, AAAI Spring Symposium on ML and IR, 1996]
  • Starting point: Ishmail, an Emacs RMAIL extension written by Charles Isbell in summer '95 (largely for Ron Brachman).
  • Mailbox definitions and filtering rules could be written manually in Lisp.

  3. Foldering tasks. [Chart: a rule-learning method [Cohen, ICML 95] compared with Rocchio [Rocchio, 71]]

  4. Machine Learning in Email. Why study learning for email?
  • Email has more visible impact than anything else you do with computers.
  • Email is hard to manage: people get overwhelmed; people lose important information in email archives; people make horrible mistakes.

  5. Machine Learning in Email. Why study learning for email? For which tasks can learning help?
  • Foldering (search, don't sort!)
  • Spam filtering (important and well-studied)
  • Search: beyond keyword search
  • Recognizing errors ("Oops, did I just hit reply-to-all?")
  • Help for tracking tasks ("Dropping the ball")

  6. Learning to Search Email [SIGIR 2006, CEAS 2006, WebKDD/SNA 2007]. [Diagram: a CALO email graph linking terms ("graph", "proposal", "CMU"), dates (6/17/07, 6/18/07), the person "William", and einat@cs.cmu.edu via edges such as "term in subject" and "sent to"]

  7. Q: "what are Jason's email aliases?" Basic idea: learning to search email is learning to query a graph for information. [Diagram: messages Msg2, Msg5, Msg18 connected to the term "Jason", the contact Jason Ernst, and the addresses jernst@cs.cmu.edu and jernst@andrew.cmu.edu by typed edges such as "has term" (and its inverse), "sent to email", "sent from email", "email address of", and "similar to"]

  8.–12. How do you pose queries to a graph? (progressive build) An extended similarity measure via graph walks:
  • Propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths.
  • Fixed probability of halting the walk at every step, i.e., shorter connecting paths have greater importance (exponential decay).
  • In practice we can approximate this with a short finite graph walk, implemented with sparse matrix multiplication (see the sketch below).
  • The result is a list of nodes, sorted by "similarity" to an input node distribution (final node probabilities).
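For concreteness, here is a minimal sketch of such a truncated, exponentially decayed walk. It assumes a row-stochastic sparse transition matrix; all names and parameter values are illustrative, not the talk's actual code.

    # A is a row-stochastic sparse transition matrix over graph nodes;
    # start is the query distribution; halt is the per-step halting probability.
    import numpy as np
    from scipy.sparse import csr_matrix

    def graph_walk_scores(A, start, halt=0.5, max_steps=6):
        scores = np.zeros(A.shape[0])
        p = start.copy()
        for t in range(1, max_steps + 1):
            p = A.T @ p                                  # one sparse matrix-vector step
            scores += halt * (1.0 - halt) ** (t - 1) * p # exponential decay with path length
        return scores                                    # higher score = more "similar"

    # Illustrative usage on a tiny 3-node chain 0 -> 1 -> 2:
    A = csr_matrix(np.array([[0., 1., 0.],
                             [0., 0., 1.],
                             [0., 0., 1.]]))
    start = np.array([1., 0., 0.])                       # walk starts at node 0
    ranked_nodes = np.argsort(-graph_walk_scores(A, start))

Truncating at a small max_steps is what makes the walk cheap: each step is one sparse matrix-vector product, and the decay factor means the ignored tail contributes little.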

  13. A query language: Q = {Vq, "answer type"}, where the query nodes Vq act as the query "terms".
  • Email, contacts, etc.: a graph.
  • Graph nodes are typed; edges are directed and typed. Multiple edges may connect two given nodes.
  • Every edge type is assigned a fixed weight, which determines the probability of its being followed in a walk: e.g., uniform.
  • The query returns a list of nodes (of the requested answer type) ranked by the graph-walk probabilities.
  • Related work: random walk with restart, graph kernels, heat diffusion kernels, diffusion processes, Laplacian regularization, graph databases (BANKS, DbExplorer, …), graph mincut, associative Markov networks, …

  14. Tasks that are like similarity queries
  • Person name disambiguation: [term "andy", file msgId] → "person"
  • Threading: [file msgId] → "file". What are the adjacent messages in this thread? A proxy for finding "more messages like this one".
  • Alias finding: [term "Jason"] → "email-address". What are the email addresses of Jason?
  • Meeting attendees finder: [meeting mtgId] → "email-address". Which email addresses (persons) should I notify about this meeting?
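Read as queries, the tasks above differ only in their start nodes and answer type. A hedged illustration (the dictionary format and node names are invented for this sketch):

    # Each task = a start distribution plus an answer-type filter.
    alias_query  = {"start": ["term:jason"],             "answer_type": "email-address"}
    thread_query = {"start": ["file:msg18"],             "answer_type": "file"}
    person_query = {"start": ["term:andy", "file:msg5"], "answer_type": "person"}
    # Run the graph walk from the start nodes, drop nodes of other types,
    # and return the rest ranked by walk probability.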

  15. Learning to search better. Task T (query class): training queries q, a, b, …, each paired with its relevant answers; the GRAPH WALK returns a ranked list of nodes (rank 1 through rank 50) for each query. Standard set of features used for a candidate node x on each problem:
  • Edge n-grams in all paths from Vq to x
  • Number of reachable source nodes
  • Features of top-ranking paths (e.g., edge bigrams), …

  16. Learning. Node re-ordering, train task: graph walk → feature generation → learn re-ranker → re-ranking function.

  17. Learning approach. Node re-ordering: for the train task, graph walk → feature generation → learn re-ranker → re-ranking function; for the test task, graph walk → feature generation → score by re-ranking function. Learners: boosting; voted perceptron; RankSVM; perceptron committees; … [Joachims, KDD 2002; Elsas et al., WSDM 2008; Collins & Koo, CL 2005; Collins, ACL 2002]
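As a sketch of the reranking step, here is an averaged-perceptron reranker in the spirit of Collins's method; the data format and feature extraction are assumptions, not the authors' code.

    import numpy as np

    def train_reranker(training_queries, n_features, epochs=10):
        """training_queries: list of (X, gold) pairs, where X is a
        (num_candidates x n_features) matrix of candidate-node features
        (e.g., edge n-grams on paths from Vq) and gold is the index of a
        known-relevant candidate."""
        w = np.zeros(n_features)
        w_sum = np.zeros(n_features)
        n_updates = 0
        for _ in range(epochs):
            for X, gold in training_queries:
                pred = int(np.argmax(X @ w))    # current best candidate
                if pred != gold:
                    w += X[gold] - X[pred]      # standard perceptron update
                w_sum += w
                n_updates += 1
        return w_sum / n_updates                # averaging approximates voting

    # At test time, re-rank a query's candidates by X_test @ w, descending.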

  18. Tasks that are like similarity queries
  • Person name disambiguation: [term "andy", file msgId] → "person"
  • Threading: [file msgId] → "file". What are the adjacent messages in this thread? A proxy for finding "more messages like this one".
  • Alias finding: [term "Jason"] → "email-address". What are the email addresses of Jason?
  • Meeting attendees finder: [meeting mtgId] → "email-address". Which email addresses (persons) should I notify about this meeting?

  19. Corpora and datasets: person name disambiguation
  • Nicknames: Dave for David, Kai for Keiko, Jenny for Qing
  • Common names are ambiguous

  20. CSpace Email
  • Collected at CMU
  • 15,000+ emails from a semester-long management course
  • Students formed groups that acted as "companies" and worked together
  • Dozens of groups, with some known social connections (e.g., "president")

  21.–24. Results: person name disambiguation on the Mgmt. game corpus. [series of results charts]

  25. Results on all three problems: person name disambiguation on Mgmt. Game, Enron: Sager-E, and Enron: Shapiro-R. [results charts]

  26. Tasks
  • Person name disambiguation: [term "andy", file msgId] → "person"
  • Threading: [file msgId] → "file". What are the adjacent messages in this thread? A proxy for finding "more messages like this one".
  • Alias finding: [term "Jason"] → "email-address". What are the email addresses of Jason?
  • Meeting attendees finder: [meeting mtgId] → "email-address". Which email addresses (persons) should I notify about this meeting?

  27. Threading: results (MAP). [Table: MAP on Mgmt. Game (from 36.2 up to 73.8) and Enron: Farmer (from 36.1 up to 79.8), varying which message sections are represented in the graph: header & body only; header & body + subject; header & body + subject + reply lines]

  28. Learning approaches. Edge weight tuning: iterate graph walk → weight update to obtain Θ*.

  29. Learning approaches. Edge weight tuning [Diligenti et al., IJCAI 2005; Toutanova & Ng, ICML 2005; …]: iterate graph walk → weight update to obtain Θ*. Node re-ordering: train-task graph walk → feature generation → learn re-ranker; test-task graph walk → feature generation → score by re-ranking function (boosting; voted perceptron). Question: which is better?
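As a rough illustration of the edge-weight-tuning alternative, here is a gradient-free coordinate-ascent sketch; the cited methods use proper gradient-based updates, and every name below is hypothetical.

    def tune_edge_weights(edge_types, build_walker, evaluate_map,
                          step=0.1, iters=20):
        """Perturb one edge-type weight at a time; keep a change if held-out
        MAP improves. build_walker(theta) -> ranking function;
        evaluate_map(ranker) -> mean average precision on training queries."""
        theta = {t: 1.0 for t in edge_types}        # uniform starting weights
        best = evaluate_map(build_walker(theta))
        for _ in range(iters):
            for t in edge_types:
                for delta in (step, -step):
                    trial = dict(theta)
                    trial[t] = max(1e-3, trial[t] + delta)
                    score = evaluate_map(build_walker(trial))
                    if score > best:
                        theta, best = trial, score
        return theta   # Theta*: weights determining edge-following probabilities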

  30. Results (MAP) on name disambiguation, threading, and alias finding. [charts]
  • Reranking and edge-weight tuning are complementary.
  • The best result is usually to tune weights, and then rerank.
  • Reranking overfits on small datasets (meetings).

  31. Machine Learning in Email. Why study learning for email? For which tasks can learning help?
  • Foldering
  • Spam filtering
  • Search: beyond keyword search
  • Recognizing errors ("Oops, did I just hit reply-to-all?")
  • Help for tracking tasks ("Dropping the ball")

  32. http://www.sophos.com/

  33. Preventing errors in Email [SDM 2007]
  • Goal: detect emails accidentally sent to the wrong person (an "email leak").
  • Generate artificial leaks: email leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
  • Method: look for outliers.

  34. Preventing Email Leaks: method
  • Create simulated/artificial email recipients.
  • Build a model for P(rec_t): train a classifier on real data to detect synthetically created outliers (added to the true recipient list). P(rec_t) is the probability that recipient t is an outlier, given the message text and the other recipients in the message.
  • Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
  • Rank potential outliers, detect the outlier, and warn the user based on confidence. [Diagram: recipients Rec6, Rec2, …, RecK, Rec5 ranked from most to least likely outlier]

  35. Enron Data Preprocessing
  • Realistic scenario: for each user, the 10% most recent sent messages are held out as the test set.
  • Construct address books for all users: the list of all recipients in that user's sent messages.

  36. Simulating Leaks
  • Several options: frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
  • In this paper, the 3g-address criterion was adopted. On each trial, one of the message's recipients (e.g., Marina.wang@enron.com) is randomly chosen, and an outlier is generated as follows: with probability α, a random non-address-book address; else (probability 1 − α), a randomly selected address-book entry (see the sketch below).
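Here is a sketch of one plausible reading of that recipe; in particular, using a shared character 3-gram to pick the substitute address is an assumption about how "3g-address" similarity works, and all names are illustrative.

    import random

    def char_3grams(address):
        local = address.split("@")[0].lower()
        return {local[i:i + 3] for i in range(max(0, len(local) - 2))}

    def simulate_leak(recipients, address_book, outside_pool, alpha=0.1):
        """Pick a true recipient; with probability alpha return a random
        address from outside the address book, else an address-book entry
        (here: one sharing a character 3-gram with the chosen recipient,
        mimicking typos and aggressive auto-completion)."""
        victim = random.choice(recipients)
        if random.random() < alpha:
            return random.choice(outside_pool)
        candidates = [a for a in address_book if a not in recipients
                      and char_3grams(a) & char_3grams(victim)]
        if not candidates:                    # fall back to any non-recipient
            candidates = [a for a in address_book if a not in recipients]
        return random.choice(candidates)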

  37. Experiments using textual features only. Three baseline methods:
  • Random: rank recipient addresses randomly.
  • Rocchio/TfIdf centroid [Rocchio, 1971]: create a TfIdf centroid for each user in the address book. The centroid for user1 is the sum of all training messages (as TfIdf vectors) that were addressed to user1; at test time, rank by cosine similarity between the test message and each centroid.
  • Knn-30 [Yang & Chute, SIGIR 1994]: given a test message, retrieve the 30 most similar training messages, and rank each user by the sum of similarities over those of the 30 messages addressed to that user.
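A compact sketch of the Rocchio/TfIdf-centroid baseline using scikit-learn; the preprocessing details are assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def build_centroids(train_msgs, train_recipients):
        """train_msgs: message texts; train_recipients: aligned recipient
        lists. Returns the fitted vectorizer plus one summed TfIdf centroid
        per user."""
        vec = TfidfVectorizer()
        X = vec.fit_transform(train_msgs).toarray()
        centroids = {}
        for row, recips in zip(X, train_recipients):
            for r in recips:
                centroids[r] = centroids.get(r, 0) + row
        return vec, centroids

    def rank_recipients(msg_text, vec, centroids):
        """Rank users by cosine similarity to the test message; for leak
        detection, the lowest-ranked actual recipient is the suspect."""
        v = vec.transform([msg_text]).toarray()[0]
        def cos(c):
            denom = np.linalg.norm(v) * np.linalg.norm(c)
            return float(v @ c / denom) if denom else 0.0
        return sorted(centroids, key=lambda r: cos(centroids[r]), reverse=True)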

  38. Experiments using textual features only. [Chart: email leak prediction results, Prec@1 over 10 trials; on each trial, a different set of outliers is generated]

  39. Network Features
  • How frequently a recipient was addressed
  • How recipients co-occurred in the training set

  40. Using Network Features
  • Frequency features: number of messages received from this user; number of messages sent to this user; number of messages sent + received.
  • Co-occurrence features: number of times a user co-occurred with all other recipients.
  • Max3g features: for each recipient R, find Rm, the address with the maximum score in R's 3g-address list, and use score(R) − score(Rm) as a feature.
  • Combine with the text-only scores using voted-perceptron reranking, trained on simulated leaks. (A sketch of the counting behind the first two feature groups follows.)
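A sketch of how the frequency and co-occurrence counts might be computed from training mail; the message record format here is an assumption.

    from collections import Counter
    from itertools import combinations

    def network_feature_tables(train_msgs, user):
        """train_msgs: dicts like {"from": addr, "to": [addr, ...]}.
        Returns counters backing the frequency and co-occurrence features."""
        sent_to, received_from, cooc = Counter(), Counter(), Counter()
        for m in train_msgs:
            if m["from"] == user:
                sent_to.update(m["to"])          # messages the user sent to each address
            if user in m["to"]:
                received_from[m["from"]] += 1    # messages the user received from each address
            for a, b in combinations(sorted(set(m["to"])), 2):
                cooc[(a, b)] += 1                # recipient co-occurrence counts
        return sent_to, received_from, cooc

    def frequency_features(addr, sent_to, received_from):
        s, r = sent_to[addr], received_from[addr]
        return [r, s, s + r]                     # received, sent, sent + received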

  41. [Chart: precision at rank 1, with α = 0]

  42. Finding Real Leaks in Enron
  • How can we find them? Grep for "mistake", "sorry", or "accident"; note that the message must be from one of the Enron users.
  • Found 2 good cases:
  • Message germanyc/sent/930: the message has 20 recipients; the leak is alex.perkins@
  • Message kitchen-l/sent items/497: it has 44 recipients; the leak is rita.wynne@

  43. Results on real leaks
  • Kitchen-l has 4 unseen addresses out of the 44 recipients.
  • Germany-c has only one.

  44. The other kind of recipient error [ECIR 2008]. How accurately can you fill in missing recipients, using the message text as evidence? [Chart: mean average precision over 36 users, after using thread information]

  45. Current prototype (Thunderbird plug-in)
  • Leak warnings: hit x to remove a recipient
  • Suggestions: hit + to add one
  • Pause or cancel sending of the message
  • Timer: the message is sent after 10 seconds by default
  • Classifiers/rankers are written in JavaScript

  46. Machine Learning in Email. Why study learning for email? For which tasks can learning help?
  • Foldering
  • Spam filtering
  • Search: beyond keyword search
  • Recognizing errors
  • Help for tracking tasks ("Dropping the ball")

  47. Dropping the Ball

  48. Speech Acts for Email [EMNLP 2004, SIGIR 2005, ACL Acts WS 2006]
