An Experimental Framework for Email Categorization and Management Kenrick Mock kenrick@uaa.alaska.edu
Project Overview • Motivation: Email Overload • Potential solution: Automatic categorization and management techniques • Problem: The potential solution is very experimental. Email use and user interaction are difficult to model, requiring a prototype that users can try on actual email • The purpose of this work is to present a Microsoft Outlook 2000™ add-in that: • Can be used as a first step toward more experimental research into automatic email management techniques • Helps manage the inbox via classification and relevancy-based search
What’s the Problem with Email? • Too much • 6/26/2001 USA Today • “Workers polled this year by market researcher Gartner spent an average of 49 minutes a day on e-mail, 30% to 35% more time than they did a year ago. Ferris Research estimates management-level workers will spend four hours a day on e-mail by 2002.”
Solutions? • Educate users • Don’t send so much mail, don’t subscribe to lists • Use technology in some way • Current efforts are toward some type of classification system that learns • [Figure: Training: given a folder “Conferences” with emails regarding conferences, the system learns what email belongs to “Conferences”. A new SIGIR email is classified into “Conferences”; a new Miss Cleo email is classified into “Trash”.]
This Project • An architecture for exploring automatic email management techniques • Built on Outlook 2000 • Primary code in Visual Basic • Produces DLL add-in for Outlook • Visual C++ DLL component • Hashes strings to longs (logical operators not available in VB) • Referenced from VB • Not tested with Outlook 2002!
Architectural Overview • [Diagram components] • Outlook: Outlook Object Model, Events • VB Add-In DLL • Outlook / Class Interface Glue • Message Class: AddTerms(), Display(), Get Vals, CompareMsg() • Folder Class: AddMsg(), GetMessages via Dictionary, CompareMsg() • C++ Helper DLL (Hash Strings)
Add-In Interface : Messages • Message Class • Mail folders scanned on startup, class instance created for each mail item (except Trash, Sent Items). • Message text is tokenized and stoplisted using • Sender • Recipients • Subject • Text Body (possible to use more fields if desired) • Text tokens are hashed to 32-bit longs to save space and greatly speed up token comparison • Hash function by Bob Jenkins • 2 collisions on 87111 dictionary words • 10x faster to compare longs vs. strings via strcmp on Pentium II • CompareMsg function computes similarity between two email messages
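A minimal sketch of the tokenize, stoplist, and hash steps on this slide, written in Python rather than the add-in's VB and C++. The slide credits Bob Jenkins but does not say which of his hash functions the add-in uses; the one-at-a-time hash below is an assumption, as are the stoplist contents, the tokenizer pattern, and the names `STOPLIST`, `one_at_a_time`, and `hashed_terms`.

```python
import re
from collections import Counter

# Hypothetical stoplist; the add-in's actual list is not given in the slides.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def one_at_a_time(data: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, returned as an unsigned 32-bit value."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def hashed_terms(sender: str, recipients: str, subject: str, body: str) -> Counter:
    """Tokenize the four fields, drop stoplisted words, count hashed tokens."""
    text = " ".join([sender, recipients, subject, body]).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return Counter(
        one_at_a_time(t.encode("utf-8")) for t in tokens if t not in STOPLIST
    )


# Illustrative message; all field values are made up.
terms = hashed_terms(
    "a@x.org", "b@y.org", "The SIGIR deadline", "deadline for SIGIR is near"
)
```

The `& MASK32` steps stand in for the 32-bit wraparound that C++ unsigned arithmetic provides automatically. Once terms are integers, comparing two tokens is a single integer comparison instead of a character-by-character string comparison.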
Add-In Interface : Folders • Folder Class • User-created mail folders are scanned on startup and a folder instance is created for each mail folder (except Trash, Sent Items). • Messages that the user has placed in each folder are added to the folder’s classifier for training • CompareMsg function computes similarity between a new message and the classifier for the folder • i.e., it can be used to classify a new message into folders
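The folder behavior described on this slide can be sketched roughly as follows. This is an illustration in Python, not the add-in's VB code: the method names mirror AddMsg and CompareMsg from the slides, but the cosine measure, the `classify` helper, and the sample messages are assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Folder:
    """Holds the term vectors of the messages a user filed in one folder."""

    def __init__(self, name: str):
        self.name = name
        self.messages = []  # one Counter per training message

    def add_msg(self, terms: Counter) -> None:
        """Training: remember a message the user placed in this folder."""
        self.messages.append(terms)

    def compare_msg(self, terms: Counter) -> float:
        """Nearest neighbor: the best match against any stored message."""
        return max((cosine(terms, m) for m in self.messages), default=0.0)


def classify(terms: Counter, folders: list) -> str:
    """Return the name of the folder whose classifier scores highest."""
    return max(folders, key=lambda f: f.compare_msg(terms)).name


def terms_of(text: str) -> Counter:
    """Toy tokenizer for the example: lowercase and split on whitespace."""
    return Counter(text.lower().split())


# Made-up training data for two folders.
conferences = Folder("Conferences")
conferences.add_msg(terms_of("sigir call for papers deadline"))
trash = Folder("Trash")
trash.add_msg(terms_of("free psychic reading call now"))

label = classify(terms_of("sigir deadline extended"), [conferences, trash])
```

Here `label` comes out as "Conferences" because the new message shares "sigir" and "deadline" with that folder's training message and shares nothing with the Trash example.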
Classifier Implementation • CompareMsg • It is the goal of this project to experiment with different classifiers and algorithms as the implementation of CompareMsg to find out what works and what doesn’t • A simple classification scheme is implemented for now • Nearest Neighbor, common terms & frequencies • Other schemes that have been examined in the past: • TF-IDF, Neural Networks, Bayesian, Rule Induction, SVM • What should the classifier do when new email arrives? • Some options • Move new email directly to classified folder • Annotate email with a category tag
Classifier Usage Challenges • In previous work, we built a proprietary rule induction and tf-idf classifier into Outlook and GroupWise that classified messages into categories. It was tested with managers and developers. • Problems we encountered were usage-driven: • The need for constant re-training to keep up with dynamically changing categories. • Classification errors are puzzling and instill distrust among users. • Insufficient data may be available as training examples. • It is difficult for a user to examine or manually edit a classifier.
Challenge 1: Categories Change • Common for Categories to change over time; “Topic Drift” as in Newsgroups • Project ends or changes direction • Conversation slowly changes topics • General discussion might turn more technical • Problems for learning algorithms • Classifiers need to be re-trained; how well can they handle it? How fast is it? • Our users were willing to wait seconds, not minutes • Most classifiers are not incremental; require re-training using all positive/negative examples, not just new ones • Often too slow for many algorithms (e.g. rule induction) • Vector-based classifiers • Fast to re-train but may have problems with threshold calculations or new vocabulary not in the vector
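To make the point about vector-based classifiers concrete, here is a hedged sketch of a centroid that can be updated one message at a time. It is an illustration only: the `Centroid` class, the cosine measure, and the sample text are assumptions, not the classifier from the earlier Outlook and GroupWise work.

```python
import math
from collections import Counter


class Centroid:
    """Running sum of term-frequency vectors, updatable one message at a time."""

    def __init__(self):
        self.sums = Counter()  # per-term totals over all examples seen
        self.count = 0         # number of examples seen

    def add(self, terms: Counter) -> None:
        """Incremental re-training: cost depends only on the new message."""
        self.sums.update(terms)
        self.count += 1

    def similarity(self, terms: Counter) -> float:
        """Cosine similarity to the centroid.

        Cosine ignores vector length, so the running sum gives the same
        score as the mean (sum divided by count) would.
        """
        dot = sum(c * self.sums[t] for t, c in terms.items() if t in self.sums)
        norm_c = math.sqrt(sum(v * v for v in self.sums.values()))
        norm_m = math.sqrt(sum(c * c for c in terms.values()))
        return dot / (norm_c * norm_m) if norm_c and norm_m else 0.0


def terms_of(text: str) -> Counter:
    """Toy tokenizer for the example: lowercase and split on whitespace."""
    return Counter(text.lower().split())


# A made-up folder whose topic drifts from budgeting to database work.
project = Centroid()
for _ in range(5):
    project.add(terms_of("project budget planning meeting"))

drifted = terms_of("project database schema migration")
scores = []
for _ in range(4):
    scores.append(project.similarity(drifted))
    project.add(drifted)  # the user files the message; the centroid shifts
```

Each `add` is cheap, which matches the "fast to re-train" point. The scores for the drifted message start at 0.25 and rise only gradually as examples with the new vocabulary accumulate, so a fixed acceptance threshold chosen for the old topic could keep rejecting the new one for several messages.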
Challenge 2: Classifiers Make Errors, Destroy User Trust • Users tolerate few errors • Want immediate corrections so the same error won’t happen again • Vector classifier may require several examples before centroid shifts enough to include similar message • Rule classifiers need explicit re-training • Classification errors are inevitable • Classifier may over-generalize or be too specific • Errors could “break” users’ hard work setting up a folder • In some cases it’s more work to fix errors than the savings the tool is intended to provide! • Trust is easy to lose, users abandon the system
Challenge 3: Insufficient Data Available • Many classifiers require a large amount of training data, e.g. statistical-based classifiers • May not have enough email available • Users expect system to work well given only 6-12 training examples • Effort to find more examples typically too high • One solution: Bootstrap using data in existing folders • What about negative examples? Can be problematic for some classification algorithms
Challenge 4: Model Editing and Understanding • Some users want to manually fix or edit the classifier • These are naïve users, not programmers! • Easy to understand, modify • Rule-based classifiers • More difficult • Vector classifiers, may have many keywords • Very difficult • Neural Network • SVM
Current Implementation • Publicly available source, binaries for open development purposes • Simple nearest-neighbor classifier for Folders • Speed, easy to train and classify • May help classify user-created folders that really encompass multiple sub-folders (e.g. “work” where there are many work projects) better than classification techniques that rely on global data • Individual term frequencies of sub-folder topics will be low • But message-to-message comparison may be high • Don’t need negative examples • Tag messages with category rather than move into a folder • Hopefully not too critical when misclassifications occur
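The claim about folders that span several sub-topics can be illustrated with a small comparison. This is a sketch under assumptions: the "work" folder contents are made up, cosine similarity stands in for the add-in's actual comparison, and a single summed vector stands in for "techniques that rely on global data".

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(c * b[t] for t, c in a.items() if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def terms_of(text: str) -> Counter:
    """Toy tokenizer for the example: lowercase and split on whitespace."""
    return Counter(text.lower().split())


# Hypothetical "work" folder mixing three unrelated projects.
work = [terms_of(t) for t in (
    "compiler parser grammar tokens",
    "compiler optimizer register allocation",
    "payroll invoice tax forms",
    "payroll overtime tax rates",
    "conference hotel flight booking",
    "conference registration hotel fees",
)]

new_msg = terms_of("payroll tax forms due")

# Global view: one vector summing every message in the folder.
folder_total = sum(work, Counter())
global_score = cosine(new_msg, folder_total)

# Nearest neighbor: best message-to-message match within the folder.
nearest_score = max(cosine(new_msg, m) for m in work)
```

In this toy data the nearest-neighbor score is 0.75, while the score against the folder-wide vector is about 0.43, because the payroll terms are diluted by the compiler and conference terms. No negative examples are involved in either score.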
Current Implementation : User Interface • Upon startup of Outlook: scan Outlook folders, create classifiers and messages • View inbox grouped by category
Current Interface : New Email • New email is automatically classified into the best-matching folder (but not moved, only grouped)
Current Interface : Related Email • Interface also supports finding other email similar to the current one • Iterate through all email message class objects invoking the comparison function • Simple term-frequency comparison of both emails for now • Linear time, but not too bad • 300 of the author’s messages scanned per second on a 400 MHz Pentium II
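The related-email search described here amounts to a linear scan with a ranking step. A minimal sketch, assuming cosine similarity over term frequencies and hypothetical names (`related`, the message ids, and the sample texts are not from the add-in):

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(c * b[t] for t, c in a.items() if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def terms_of(text: str) -> Counter:
    """Toy tokenizer for the example: lowercase and split on whitespace."""
    return Counter(text.lower().split())


def related(query_id: str, messages: dict, k: int = 3) -> list:
    """Linear scan: score every other message, return the k best (score, id)."""
    query = messages[query_id]
    scored = (
        (cosine(query, terms), msg_id)
        for msg_id, terms in messages.items()
        if msg_id != query_id
    )
    return heapq.nlargest(k, scored)


# Made-up mailbox keyed by message id.
mailbox = {
    "m1": terms_of("sigir submission deadline"),
    "m2": terms_of("re sigir submission deadline extended"),
    "m3": terms_of("lunch on friday"),
    "m4": terms_of("sigir hotel booking"),
}

top = related("m1", mailbox, k=2)
top_ids = [msg_id for _, msg_id in top]
```

Every query touches every message once, which is the linear cost the slide mentions. At the quoted rate of 300 messages per second, a mailbox of 3,000 messages (a made-up size) would take roughly 10 seconds to scan.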
Current Interface : Related Email • Select a message, click on button • List of similar messages displayed, click to open
Comments on Personal Use • No formal user studies performed yet • But, I’ve been using it…some anecdotes: • Nearest Neighbor classifier OK, could be better • Would be useful to index trash or sent-items • If not indexed, there is no folder to classify into when junk mail arrives so it gets put somewhere else • Temporary solution: Make a “Trash” folder with examples • But indexing trash could be a lot of messages… • Grouping of incoming email useful? • Not really needed for frequent email reading • Useful when returning from a trip and need to triage the mail • Relevant email • Useful for finding uncoupled email threads • Sent-Items would be useful to index here
Lots of Work To Do • Experiment with other classifiers • Need to see how users respond to training issues, speed, etc., not just classification accuracy • Latch onto more events • Better mail detection, drag & drop events • Clean up code implementation • Support persistence, speed issues on startup scan • Implementation issues • Compatibility with Outlook 2002, VB .NET • Other forms of visualization / categorization • E.g., color, thread information, graphical techniques • Extend to other forms of Outlook data • Calendaring, Notes, Files
Try It Out • Source Code & Binaries available online • http://www.math.uaa.alaska.edu/~afkjm/emailaddin/ • Only tested with Windows 2000 & Outlook 2000 • Feel free to use or modify code as you see fit • Warning: Developer docs and code cleanup still need to be done! • But I’ll be glad to answer any questions!