Evolving and Self-Managing Data Integration Systems • AnHai Doan • Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign • Spring 2004
Data Integration at UIUC • Two main players • Kevin Chang and AnHai Doan • 10 students, 30001 cups of coffee, 3 SIGMOD-04 papers • Four supporting players • Chengxiang Zhai: IR, bioinformatics, text/data integration • Dan Roth: AI, question answering, text/data integration • Jiawei Han: data mining • Marianne Winslett: security/privacy issues in data sharing • Many supporting departments and local organizations • NCSA, Information Science, Genome Institute, Fire Service Institute
Data Integration Challenge Find houses with 3 bedrooms priced under 300K New faculty member realestate.com homeseekers.com homes.com
Architecture of Data Integration System • User query: find houses with 3 bedrooms priced under 300K • Mediated schema: price, num-beds, location, agent-name • Source schemas 1, 2, 3 (e.g., list-price, bdrms, address), each accessed through a wrapper • Sources: realestate.com, homeseekers.com, homes.com • Think “comparison shopping systems on steroids”
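The architecture above can be illustrated with a minimal sketch. The mediated attributes and the source attributes list-price, bdrms, address come from the slide; which source uses them, the second source's attribute names, the sample rows, and the function names `wrap` and `query` are all assumptions made for illustration.

```python
# Hypothetical sketch of a mediator over wrapped sources.
MEDIATED_SCHEMA = ("price", "num-beds", "location", "agent-name")

# Semantic mappings: source attribute -> mediated attribute.
# Assigning list-price/bdrms/address to realestate.com is an assumption;
# the homes.com attribute names here are invented for the example.
SOURCE_MAPPINGS = {
    "realestate.com": {"list-price": "price", "bdrms": "num-beds", "address": "location"},
    "homes.com": {"listed-price": "price", "beds": "num-beds", "city": "location"},
}

# Made-up sample data standing in for what each source would return.
SOURCE_DATA = {
    "realestate.com": [
        {"list-price": 250_000, "bdrms": 3, "address": "Urbana, IL"},
        {"list-price": 420_000, "bdrms": 4, "address": "Champaign, IL"},
    ],
    "homes.com": [
        {"listed-price": 280_000, "beds": 3, "city": "Seattle"},
        {"listed-price": 199_000, "beds": 2, "city": "Miami"},
    ],
}


def wrap(source, row):
    """Wrapper: rename one source tuple into mediated-schema attributes."""
    mapping = SOURCE_MAPPINGS[source]
    out = {attr: None for attr in MEDIATED_SCHEMA}
    for src_attr, value in row.items():
        if src_attr in mapping:
            out[mapping[src_attr]] = value
    out["source"] = source
    return out


def query(predicate):
    """Mediator: pose one query over the mediated schema across all sources."""
    results = []
    for source, rows in SOURCE_DATA.items():
        for row in rows:
            mediated = wrap(source, row)
            if predicate(mediated):
                results.append(mediated)
    return results


# "Find houses with 3 bedrooms priced under 300K"
houses = query(lambda t: t["num-beds"] == 3 and t["price"] < 300_000)
```

A real system would reformulate the query per source and push it down rather than fetch every tuple; the sketch only shows how the mappings let one query span heterogeneous schemas.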
The Need for Data Integration is Ubiquitous! • In virtually all domains • data are distributed & stored in heterogeneous formats • WWW • hundreds/thousands of sources in bioinformatics, real estate, books, etc. • Enterprises • avg. organization has 49 databases [Ives-01] • organizations frequently merge, exchange data • Government: e.g., digital government initiatives • Military, cultural & international exchange, Semantic Web, information agents, etc. • Long-standing challenge in the database community • recent explosion of distributed data adds urgency
Current State of Affairs • Vibrant research & industrial landscape • Research • dates back to the 1970s-80s, accelerated in the 1990s • Stanford, UPenn, AT&T Labs, Maryland, UWashington, Wisconsin, IBM Almaden, ISI, Arizona State U, Ireland, CMU, etc. • many workshops in AI and DB communities: e.g., SIGMOD/VLDB-04 • focused on • conceptual & algorithmic aspects • building systems in specific domains (bio, geo-spatial, rapid emergency response, virtual organization, etc.) • Industry • more than 50 startups in 2001, new startups in 2004 • Despite much R&D activity, however …
Current State of Affairs (cont.) • … Most DI systems are still built & maintained manually • Manual deployment is extremely labor-intensive ... • construct mediated- & source schemas, • find semantic mappings between schemas, • constantly monitor & adjust to changes at hundreds or thousands of data sources, ... • ... and has become a key bottleneck • Emerging technologies • XML, Web services, Semantic Web, ... will further fuel DI applications & exacerbate the problem Slashing the astronomical cost of ownership is now crucial!
The AIDA Project • Recently started at Univ. of Illinois • AIDA = Automatic Integration of Data • Goal: evolving and self-managing data integration systems • Easy to start • takes hours instead of weeks or months • perhaps with just a few sources • Learns to continuously improve • expands to cover new sources • adds novel query capabilities, better query performance • Adjusts automatically to changes • detects and fixes broken wrappers, semantic matches, etc. • Requires minimal effort from system admin • some effort at the start • far less as the system learns more and more
The AIDA Project (cont.) • In line with trends in broader computing landscape • autonomic systems (IBM initiative) • recovery-oriented computing (Berkeley) • cognitive computer systems (DARPA) • from cycles to RASS (Stanford) • self-tuning databases (MSR, IBM Almaden, Oracle) • Key differences • applied to distributed data management systems • must attack difficult semantics/meta-data issues • heavy involvement of humans • must handle large scale • Need techniques from multiple fields • databases, machine learning, AI, IR, data mining
Project Overview • Thrust 1: automate current labor-intensive tasks • schema matching • mediated schema construction • entity matching • Thrust 2: develop new capabilities • entity integration • Thrust 3: monitor & adjust to changes • Thrust 4: reduce cost of system admin • by leveraging the mass of users • Thrust 5: design sources for interoperability
Schema Matching • Mediated schema: price, agent-name, address • Source homes.com: listed-price, contact-name, city, state • Sample tuples: (320K, Jane Brown, Seattle, WA), (240K, Mike Smith, Miami, FL) • Two kinds of matches between the schemas: 1-1 match and complex match
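A small sketch of how the two kinds of matches might be applied once found. The attribute names and sample tuples come from the slide; the specific assignment (price and agent-name as 1-1 matches, address built from city and state as the complex match) is one plausible reading of the figure and should be treated as an assumption, as should the helper name `to_mediated`.

```python
# 1-1 matches: each mediated attribute maps to a single source attribute.
# (Assumed assignment; the slide only shows the attribute names.)
ONE_TO_ONE = {"price": "listed-price", "agent-name": "contact-name"}

# Complex match: a mediated attribute computed from several source attributes.
# Concatenating city and state into address is an assumption.
COMPLEX = {"address": lambda row: f'{row["city"]}, {row["state"]}'}

# Sample homes.com tuples from the slide.
homes_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]


def to_mediated(row):
    """Translate one source tuple into the mediated schema using the matches."""
    out = {mediated: row[source] for mediated, source in ONE_TO_ONE.items()}
    out.update({mediated: build(row) for mediated, build in COMPLEX.items()})
    return out


mediated_rows = [to_mediated(row) for row in homes_rows]
```

Finding these matches automatically is the hard part; the sketch only shows what a match buys you once it is known.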
Why Schema Matching is Difficult • Schema & data never fully capture semantics! • not adequately documented • Must rely on clues in schema & data • using names, structures, types, data values, etc. • Such clues can be unreliable • same names => different entities: area => location or square-feet • different names => same entity: area & address => location • Intended semantics can be subjective • house-style = house-description? • military apps require committees! • Cannot be fully automated, needs work from system admin!
Current State of Affairs • Largely done by hand • labor intensive & error prone • data integration at GTE [Li&Clifton, 2000] • 40 databases, 27000 elements, estimated time: 12 years • Need semi-automatic approaches to scale up! • Numerous prior & current research projects • Databases: SemInt (Northwestern), DELTA (MITRE), IBM Almaden, Microsoft Research, Wisconsin, Toronto, UC-Irvine, BYU, George Mason, U of Leipzig, ... • AI: Stanford, Karlsruhe University, NEC Japan, ISI, ... • Many startups
Our Prior & Ongoing Work [2000-date] • Joint work with • Robin Dhamankar, Yoonkyong Lee, Wensheng Wu, Rob McCann, Warren Shen, Alex Kramnik, Olu Sobulo, Vanitha Varadarajan (Illinois), Pedro Domingos, Alon Halevy (U Washington) • Learning 1-1 matches for relational & XML schemas • LSD (Learning Source Description) system [WebDB-00, SIGMOD-01, Machine Learning Journal-03] • Learning 1-1 & complex matches for ontologies • GLUE [WWW-02, VLDB Journal-03, Ontology Handbook-03] • Learning 1-1 matches by mass collaboration • MOBS [WebDB-03, IJCAI-03 Workshop] • Learning complex matches for relational schemas: iMAP [SIGMOD-04] • Large-scale matching via clustering: IceQ [SIGMOD-04] • Corpus-based schema matching [submitted] • Further resources • brief survey talk at http://anhai.cs.uiuc.edu/home/talks/isi-matching.ppt • "Learning to Match Structured Representations of Data" [book by Springer-Verlag, to appear]
Mediated Schema Construction • Joint work with • Wensheng Wu (UIUC), Clement Yu (UIC), Weiyi Meng (SUNY Binghamton) • ICeQ project • given a set of source query interfaces • construct a mediated schema • Step 1: find matches among source query interfaces • use clustering [SIGMOD-04] • Step 2: use the found matches to construct a mediated schema (ongoing work) • Future work • given a lot of text in the domain, construct a mediated schema
Project Overview • Thrust 1: automate current labor-intensive tasks • schema matching • mediated schema construction • entity matching • Thrust 2: develop new capabilities • entity integration • Thrust 3: monitor & adjust to changes • Thrust 4: reduce cost of system admin • by leveraging the mass of users • Thrust 5: design sources for interoperability
Entity Matching (400K, Queen Ann – Seattle, 206-616-1842, Mike Brown) ... PRICE LOCATION PHONE NAME (400K, Queen Ann – Seattle, 206-616-1842, M. Brown) (320K, S. W. Champaign, 217-727-1999, Jane Smith) ... PRICE LOCATION PHONE NAME (250K, Decatur, 317-727-2459, P. Robertson) (400K, Seattle, 616-1842, Mike Brown) ...
Prior Work • Very active area of research • databases: [Hernandez&Stolfo, SIGMOD-95], [Cohen, SIGMOD-98], [Elfeky&Verykios&Elmagarmid, ICDE-02], ... • AI: [Cohen&Richman, KDD-02], [Bilenko&Mooney, 02], Dan Roth group, [Tejada et al., 01], [Tejada et al., KDD-02], [Michalowski et al., 03], ... • Much progress • very effective techniques for many applications • covered a broad range of scenarios • Key commonality • assume entities from disparate sources have the same set of attributes • e.g., (price, location, phone, name) vs. (price, location, phone, name) • match entities based on similarity of corresponding attributes
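The "key commonality" above, matching on the similarity of corresponding attributes, can be sketched as follows. This is a generic illustration, not any of the cited systems: the string measure (Python's `difflib.SequenceMatcher`) and the unweighted average are assumptions chosen for brevity. The tuples are taken from the Entity Matching slide.

```python
from difflib import SequenceMatcher

ATTRS = ("price", "location", "phone", "name")


def sim(a, b):
    """Generic string similarity in [0, 1]; a stand-in for a tuned measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_similarity(t1, t2):
    """Average similarity over corresponding attributes (equal weights assumed)."""
    return sum(sim(t1[a], t2[a]) for a in ATTRS) / len(ATTRS)


# Tuples from the Entity Matching slide.
t1 = dict(zip(ATTRS, ("400K", "Queen Ann – Seattle", "206-616-1842", "M. Brown")))
t2 = dict(zip(ATTRS, ("400K", "Seattle", "616-1842", "Mike Brown")))
t3 = dict(zip(ATTRS, ("250K", "Decatur", "317-727-2459", "P. Robertson")))

same_house_score = tuple_similarity(t1, t2)
different_house_score = tuple_similarity(t1, t3)
```

Note that this scheme only works when both tuples carry the same attributes, which is exactly the limitation the next slides address.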
Our PROM Approach • Key observation 1: Entities often have disjoint attributes • source V1: (age, name) • source V2: (name, salary) • source S1: (location, description, phone, name) • source S2: (description, phone, name, price, sq-feet) • movie example: <movie, pyear, actor, rating> vs. <movie, genre, review, ryear, rrating, reviewer> • Key observation 2: Correlations among disjoint attributes can be exploited to maximize matching accuracy! • e.g., (9, “Mike Brown”) vs. (“M. Brown”, $200K): a 9-year-old is unlikely to make $200K!
A Profile-Based Solution • Consider again matching persons • source V1: (age, name) • source V2: (name, salary) • (9, “Mike Brown”) vs. (“M. Brown”, $200K) • Step 1: build a person profile • what does a “typical” person “look” like? • build from data & user input • Step 2: match person names • “Mike Brown” vs. “M. Brown” => 0.7 • discard if confidence score is low, otherwise ... • Step 3: feed both tuples into profile • (9, “Mike Brown”, “M. Brown”, $200K) => 0.3
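The three steps above might look like the following toy sketch. Everything beyond the slide's outline is an assumption: the hand-written `person_profile` rule and its scores, the name threshold, and combining the two scores by multiplication. The scores 0.7 and 0.3 on the slide are not reproduced here.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Step 2: generic name similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def person_profile(age, salary):
    """Step 1 (toy): how plausible is one person with this age and salary?

    A real profile would be built from data and user input; this single
    hand-written rule and its scores are purely illustrative.
    """
    if age < 16 and salary > 0:
        return 0.05
    return 0.9


def match(t1, t2, name_threshold=0.6):
    """Score whether (age, name) and (name, salary) describe the same person."""
    age, name1 = t1
    name2, salary = t2
    name_score = name_similarity(name1, name2)
    if name_score < name_threshold:
        return 0.0  # Step 2: discard low-confidence name matches
    # Step 3: feed the merged tuple into the profile.
    return name_score * person_profile(age, salary)


child_score = match((9, "Mike Brown"), ("M. Brown", 200_000))
adult_score = match((35, "Mike Brown"), ("M. Brown", 200_000))
```

With identical names in both calls, only the profile separates the two cases, which is the point of exploiting the disjoint attributes age and salary.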
Advantages of Profile-Based Solution • Can exploit disjoint attributes to improve accuracy • Profiles capture task-independent knowledge • created from task data, domain experts, external data • created once, used anywhere • an example of “knowledge construction and reuse” • Yields an extensible, modular architecture • plug and play with new profiles • [Figure: architecture with tuple t1 from Table T1 and tuple t2 from Table T2; Similarity Estimator; hard profilers and soft profilers built from training data, expert knowledge, domain data, and previous matching tasks; Hard Profile Combiner with user-specified constraints; Soft Profile Combiner; output: matching pairs]
Profilers • Profilers encode information about domain concepts and can be constructed in many ways • Manual Profiler: manually encoded rules; specified by a domain expert • Completeness Profiler: categorical rules based on complete data; learned from external data that is complete in some aspect • Association Rule Profiler: encodes interesting association rules having high confidence; employs association rule mining techniques • Example rules shown for these profilers: debut-year b-year; color US movies are produced only after 1917; (birth-year < 1900) implies (#ODI-matches = 0) • Instance Profiler: characteristics of a few frequent entities, e.g., profilers for the 10 most productive directors • Histogram Profiler: all possible value combinations for a set of attributes, e.g., (studio, movie-genre) • Classifier: learned from training data or from external data that is complete in some aspect; encodes high-confidence rules relating disjoint attributes, e.g., a decision tree
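Two of the profiler kinds can be sketched in a few lines. The rule (birth-year < 1900) implies (#ODI-matches = 0) and the attribute pair (studio, movie-genre) come from the slide; the function and class names, the dictionary keys, and the studio/genre values are hypothetical.

```python
from collections import Counter


def hard_rule_profiler(entity):
    """Rule-style profiler: reject a merged entity that violates
    (birth-year < 1900) implies (#ODI-matches = 0)."""
    if entity["birth-year"] < 1900 and entity["odi-matches"] != 0:
        return False
    return True


class HistogramProfiler:
    """Histogram-style profiler: relative frequency of value combinations
    for a set of attributes, learned from external data."""

    def __init__(self, combinations):
        self.counts = Counter(combinations)
        self.total = sum(self.counts.values())

    def score(self, combination):
        """Fraction of the external data showing this combination."""
        if self.total == 0:
            return 0.0
        return self.counts[combination] / self.total


# Made-up external (studio, movie-genre) data.
external = [("StudioA", "animation")] * 3 + [("StudioB", "drama")]
studio_genre = HistogramProfiler(external)

common = studio_genre.score(("StudioA", "animation"))
unseen = studio_genre.score(("StudioA", "drama"))
plausible = hard_rule_profiler({"birth-year": 1890, "odi-matches": 12})
```

A candidate match whose merged tuple yields an unseen or rare combination, or breaks a rule, would count as evidence against the match.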
Entity Matching: Empirical Evaluation Improve accuracy significantly across six real-world domains More profilers result in better performance
Entity Integration • Problem: find all tuples related to a real-world entity • given a seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998 • find all papers by Chris C. Zhai in the DBLP-Lite data source • Desired result: papers (1)-(2)
Baseline Solutions: Pairwise Matching • Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998 • If we match papers based only on author names => retrieve (1)-(6) • If we also consider co-authors and conferences => retrieve (1)-(2), (4)-(6)
Better Solution: Apply Profilers to Pairwise Matching • Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998 • If we match papers based only on author names => retrieve (1)-(6) • If we also consider co-authors and conferences => retrieve (1)-(2), (4)-(6) • Aggregate property of that result: very active in both DB and IR, with 3 SIGMOD/VLDB papers and 3 SIGIR papers in 3 years • Doesn't fit the profile of a typical researcher!
Even Better Solution: Global Matching (5), (6) Cheng Zhai in IR seed paper Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998 C. Zhai, “Search Optimization”, SIGIR, 1999 (4)
Empirical Evaluation Clustering improves performance over pair-wise matching Adding profilers improves performance over both clustering and pair-wise matching.
More Information on Entity Matching and Integration • Context-based entity matching and integration • Tech. Report UIUC-03-2004 • Profile-based object matching for information integration • A. Doan, Y. Lu, Y. Lee, and J. Han • IEEE Intelligent Systems, special issue on information integration, 2003 • Object matching for data integration: a profile-based approach • A. Doan, Y. Lu, Y. Lee, and J. Han • Proc. of the IJCAI-03 workshop on information integration on the Web, 2003
Project Overview • Thrust 1: automate current labor-intensive tasks • schema matching • mediated schema construction • entity matching • Thrust 2: develop new capabilities • entity integration • Thrust 3: monitor & adjust to changes • Thrust 4: reduce cost of system admin • by leveraging the mass of users • Thrust 5: design sources for interoperability
The Problem • Numerous automatic tools have been developed for • schema matching, wrapper construction, source discovery, etc. • No matter how good these tools are, the system admin still needs to • verify predictions of tools • correct wrong ones • These tasks are still extremely labor intensive • even worse when considering system maintenance • System complexity overwhelms the capacity of a human admin • Reducing the labor cost of system admin is critical! • perhaps the most important issue for enabling practical systems!
Solution: Shift Some Labor to Users • Take some task or part of some task • e.g., schema matching, wrapper construction, source discovery • Convert it into a series of very simple questions • such that knowing the answers = solving the task • Ask users to answer questions • such that each user has to do very little work Spread the task labor thinly over a mass of users !
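As a rough illustration of "convert it into a series of very simple questions", the sketch below turns a 1-1 schema matching task into yes/no questions, one per candidate attribute pair. The attribute names are taken from the book example in this talk; the question wording, the function name, and the exhaustive pairing are assumptions, and a real system would presumably prune unlikely pairs first.

```python
from itertools import product


def matching_questions(source_name, source_attrs, mediated_attrs):
    """Emit one yes/no question per (source attribute, mediated attribute) pair."""
    questions = []
    for src, med in product(source_attrs, mediated_attrs):
        questions.append({
            "source_attr": src,
            "mediated_attr": med,
            "text": f'In {source_name}, does "{src}" mean the same as "{med}"?',
        })
    return questions


# Hypothetical source with columns Writer and Amount, matched against
# mediated attributes Author and Price.
questions = matching_questions("source S1", ["Writer", "Amount"], ["Author", "Price"])
```

Knowing the yes/no answers to all such questions is equivalent to knowing the 1-1 matches, and each user only needs to answer a handful of them.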
Example: Mass Collaboration for Schema Matching • [Figure: column headers Author / Price, Writer / Amount, and Author / Cost with question marks between them • sample rows: John Steinbeck $14.99; Upton Sinclair $6.99; Joseph Heller $21.99; George Orwell $7.99; Aldous Huxley $12.99]
Mass Collaboration is not New • Successfully applied to • open source software construction • knowledge base construction • collaborative software bug detection • collaborative filtering • annotating online pictures [CMU] • Leverage both implicit and explicit feedback from users • But has not been applied to data integration settings • Can use both implicit and explicit feedback • focus here on explicit one
MOBS Project: Mass Collaboration to Build DI Systems • Joint work with • Rob McCann, Alex Kramnik, Warren Shen, Vanitha Varadarajan, Olu Sobulo • If it succeeds • can dramatically reduce cost & time • launch numerous DI systems on the Web & in enterprises • Key challenges • how to break a task into a series of questions • how to entice users to answer questions • how to combine user answers (e.g., what to do with malicious users?) • Illustrate baseline solutions via schema matching
Example: Book Domain • Mediated schema: title | author | year | price | category • Source schemas S1, S2, S3, S4 with attributes a1 | a2 | a3 | a4 | a5, b1 | b2 | b3 | b4 | b5, c1 | c2 | c3 | c4 | c5, d1 | d2 | d3 | d4 | d5
Build Partially Correct System • [Figure: same book-domain mediated schema and source schemas S1, S2, S3, S4]
Solicit User Answers
Combine User Answers • [Figure: same book-domain mediated schema and source schemas S1, S2, S3, S4]
MOBS Challenges Revisited • How to decompose a task into a series of questions? • task dependent, currently works for source discovery, 1-1 matching • if the whole task can't be solved this way, solving part of it is still useful (e.g., wrapper) • How to entice users to answer questions? • incentive models: monopoly or better-service applications; use helper applications; use volunteers • How to evaluate users and combine their answers? • use machine learning • build a dynamic Bayesian network model • solicit user answers to questions with known answers • use these as training data to learn network parameters • More detail in [McCann et al. Tech Report 04, WebDB-03]
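The evaluate-then-combine idea can be sketched with a much simpler stand-in than the dynamic Bayesian network named above: estimate each user's reliability from questions with known answers, then take a reliability-weighted vote. The weighted vote, the 0.5 cutoff, and all names and data below are assumptions for illustration, not the MOBS model.

```python
from collections import defaultdict


def user_reliability(gold_answers, gold_truth):
    """Fraction of known-answer questions each user answered correctly."""
    reliability = {}
    for user, answers in gold_answers.items():
        correct = sum(answers[q] == gold_truth[q] for q in answers)
        reliability[user] = correct / len(answers) if answers else 0.0
    return reliability


def combine(answers, reliability, min_reliability=0.5):
    """Weighted yes/no vote per question.

    Users at or below min_reliability (no better than guessing, or possibly
    malicious) are ignored; the cutoff value is an assumption.
    """
    votes = defaultdict(float)
    for user, user_answers in answers.items():
        weight = reliability.get(user, 0.0)
        if weight <= min_reliability:
            continue
        for question, answer in user_answers.items():
            votes[question] += weight if answer else -weight
    return {question: total > 0 for question, total in votes.items()}


# Made-up data: two questions with known answers, one real question q1.
gold_truth = {"g1": True, "g2": False}
gold_answers = {
    "alice": {"g1": True, "g2": False},    # both right
    "bob": {"g1": True, "g2": True},       # one right
    "mallory": {"g1": False, "g2": True},  # both wrong
}
answers = {"alice": {"q1": True}, "bob": {"q1": False}, "mallory": {"q1": False}}

reliability = user_reliability(gold_answers, gold_truth)
decided = combine(answers, reliability)
```

Here a plain majority vote would answer "no" to q1, while the weighted vote follows the one user who proved reliable on the known-answer questions.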
MOBS Applicability • Applied MOBS in many settings • scale: small-community intranet to high-traffic website • users: unpredictable novice users to cooperative experts • ... and to several DI tasks • Deep Web: form recognition, query interface matching • Surface Web: hub discovery, data extraction, mini-Citeseer
Project Overview • Thrust 1: automate current labor-intensive tasks • schema matching • mediated schema construction • entity matching • Thrust 2: develop new capabilities • entity integration • Thrust 3: monitor & adjust to changes • Thrust 4: reduce cost of system admin • by leveraging the mass of users • Thrust 5: design sources for interoperability
Summary • The need for data integration is pervasive • Manual data integration is a key bottleneck • Our solution: AIDA project on autonomic DI systems • Discussed problems • schema matching [SIGMOD-04] • mediated schema construction [SIGMOD-04] • entity matching & integration [Tech report 04] • mass collaboration [Tech report 04] • Machine learning is the underlying technique • Many implications beyond data integration context • More information: “anhai” on Google