100 likes | 113 Views
CEES: Intelligent Access to Public Email Conversations William Lee, Hui Fang, Yifan Li, ChengXiang Zhai University of Illinois at Urbana-Champaign. Clustering and Similarity Function. Goal: Find commonly-discussed topics from a set of conversations (threads)
E N D
CEES: Intelligent Access to Public Email Conversations William Lee, Hui Fang, Yifan Li, ChengXiang Zhai University of Illinois at Urbana-Champaign Clustering and Similarity Function • Goal: Find commonly-discussed topics from a set of conversations (threads) • Use agglomerative clustering with complete link • Learn similarity functions from different “perspectives” of threads: • authors, date, subject, contents, contents without quote, first message, reply, reply without quote. • Use Linear and Logistic Regression to learn the combined similarity function Public Email Conversations Existing technologies Architecture Browse Search • Information within newsgroups or mailing lists has largely been underutilized. • For now, access to those data restricted to traditional searching and browsing. • Mail traffic also grows exponentially. • Can we access those information more effectively? • CEES = Conversation Extraction and Evaluation Service • Provide a general framework for email-related research • Integrate with popular open source projects such as Lucene, Hibernate, Tapestry, Weka, and more… • Features object-to-relational mapping of mail metadata, mail threading, flexible indexing, conversation clustering, and a web-base GUI Conclusion Clustering Results Conversation Map • Clustering • Learning the similarity function can be effective in combing different “perspectives” • Conversation Map • Can give overview of a conversation group • Effective use of 2-D space • Future Work • Derive better algorithms to learn the similarity function • Faster clustering algorithms that work on mining patterns in conversations • Summarization of conversation clusters • Use 3 Computer Science class newsgroups from Univ. of Illinois at Urbana-Champaign for corpus • Three different human taggers to group messages into subtopics • 3-way cross validation, using one group’s judgment file as training set and test on the other two. • Use class and cluster entropy as comparison metric • Conversation visualization derived from Treemap • Clusters sorted by two extra time dimensions: Intra-Cluster Time and Inter-Cluster Time • Allows user to adjust the similarity threshold -- “zoom” to the more similar threads
Build Autonomic Data Integration Systems! Autonomic Data Integration Systems Autonomic Computing • Shift workload from administrators & developers onto system • Self-tuning, self-maintaining, self-recovering from failures, self-improving • Examples: • Self-tuning databases and query optimizers • Self-profiling software for bug and bottleneck detection • Self-recovering distributed systems Data Integration Systems • Many complex components: • Global schema • Sources, wrappers, and source schemas • Semantic mappings between source and global schemas • Currently built (semi-)manually in error-prone and laborious process • Extremely difficult to maintain over changing sources
The AIDA Project Automatic Integration of DAta • Fast system deployment • Minimal human effort • Automatic adjustment to changes • Continuous improvement • Improving Automatic Methods • Schema & ontology matching [SIGMOD-01, WWW-02, SIGMOD-04] • Entity matching & integration [IJCAI-03, IEEE Intelligent-03] • Global interface construction [SIGMOD-04] • Reducing Costs of System Construction • Mass collaboration to build systems [WebDB-03, IJCAI-03] • Monitoring Data Sources and Maintaining DI Systems • Recognition of changes in source data • Detection and repair of failures in DI system components The Focus of This Talk
The MOBS Approach Mass COllaboration to Build Systems • Shift workload from developers onto user population • Build system accurately with low individual effort Developers Automatic Techniques System Initialization Source Discovery Form Recognition Query Translation Attribute Matching MOBS MOBS User Population
MOBS Applied to the Deep Web • MOBS for Query Interface Matching • Decompose task into binary statements • Initializesmall functioning system • Solicit and merge user answers to expand the system • Statements for Matching “Writer” • “Writer = Author” ? • “Writer = Title” ? • “Writer = Price” ? Author: Title: Price: Year: Title: Author: Title: Writer: Title: Authors: Cost: Price: Price: Pub:
How to Solicit User Answers • Incentive Models • Leverage a monopoly or better-service system • Piggy-back on a helper application • Deploy in a volunteer or community environment 1 0 HOOP 3 Is this form a Book Sales source? NO YES Barnes & Noble Author Title 2 Pub 0 Price
How to Merge User Answers • Bayesian Learning • Use a dynamic Bayesian network as a generative feedback model • Estimate user behavioral parameters from evaluation answers • Converge statements from teaching answers Author: Title: Price: Year: Title: Author: Title: Writer: Title: Cost: Authors: Price: Price: Pub:
Applicability of the MOBS Approach We Have Applied MOBS in Various Settings… • Scale: from a small community intranet to a highly trafficked website • Users: from cooperative expert volunteers to unpredictable novice users … and to Several DI Tasks • Deep Web: Form Recognition, Interface Matching • Surface Web: Hub Discovery, Data Extraction, Mini-CiteSeer
Conclusion & Future Work MOBS is an Effective Data Integration Tool • Requires small start-up and administrative costs • Solicits minimal effort per user • Constructs system accurately • Complements existing DI techniques • Applies to various scenarios and DI domains Future Work • Leverage implicit feedback • Intelligently maintain system • More tightly integrate with existing DI techniques • Deploy compelling real-world applications http://anhai.cs.uiuc.edu/home/projects/aida.html