Corpus-based Schema Matching Jayant Madhavan
Schema Matching
• Schema matching: discovering correspondences between similar elements
• Eventually: SQL expressions that can populate one database from the other
• Example schemas (Inventory Database A, Inventory Database B):
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, ListPrice, Categories, Keywords)
  Books(Title, ISBN, Price, OurPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  Discounts(ItemID, DiscountPrice)
  BookGenres(ISBN, Genre)
Heterogeneity and Data Sharing
• Data integration
  • Mappings provide the glue between independent data sources
• Schema matching is important to any application with multiple data sources
[Figure: a query is posed to a central mediator (Books+Music; schema: Book, Music, Store, …), which is linked by mappings to the data sources All Books, CD World, and Amazon; source schema labels: Books, Pubs, Authors, …; Products, Discounts, …]
Typical Approaches
• Multiple sources of evidence in the schemas (with an example and the typical difficulty for each):
  • Schema element names: BooksAndCDs/Categories ~ BookCategories/Category (abbreviations, synonyms, …)
  • Descriptions and documentation: "ItemID: unique identifier for a book or a CD" (incomplete, absent, …)
  • Data types: DateTime, Integer (inconsistent, absent, …)
  • Schema structure: all books have similar attributes (overlapping schemas, …)
  • Data instances: all addresses have similar formats (different values, scales, …)
• Combine multiple techniques to exploit all available evidence [Do, Rahm; VLDB 2002], [Doan, et al.; WWW 2002]…
Matching Techniques
1. Build models: for elements s and t of schemas S and T, build element models Ms and Mt (Name, Instances, Type, …)
2. Compare models
3. Combine results into a similarity matrix (s1 … sm against t1 … tn)
4. Generate matches: the mapping between s and t
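The four steps on this slide (build models, compare models, combine results, generate matches) can be sketched in a few lines of Python. This is only an illustrative sketch, not the matcher used in the talk: the dictionary layout, the three toy similarity functions, the weights, and the threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def build_model(name, instances, dtype):
    """Step 1: collect the evidence available for one schema element."""
    return {"name": name, "instances": set(instances), "type": dtype}

def name_sim(a, b):
    # Toy name matcher (assumption): character-level similarity of the names.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def instance_sim(a, b):
    # Toy instance matcher (assumption): Jaccard overlap of data values.
    union = a["instances"] | b["instances"]
    return len(a["instances"] & b["instances"]) / len(union) if union else 0.0

def type_sim(a, b):
    return 1.0 if a["type"] == b["type"] else 0.0

def similarity_matrix(source, target, weights=(0.5, 0.3, 0.2)):
    """Steps 2 and 3: compare models, then combine scores with fixed weights."""
    matchers = (name_sim, instance_sim, type_sim)
    return [[sum(w * m(s, t) for w, m in zip(weights, matchers)) for t in target]
            for s in source]

def generate_matches(source, target, matrix, threshold=0.5):
    """Step 4: keep the best-scoring target per source element, if above threshold."""
    mapping = {}
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            mapping[source[i]["name"]] = target[j]["name"]
    return mapping
```

With little evidence (for example, an element with no data instances), the combined score depends almost entirely on the name, which is the weakness the following slides address.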
Insufficient evidence
[Figure: elements Product, Music (no tuples), MusicCD, CD]
Obtaining more evidence
[Figure: corpus-based augment; labels: Product, CD; Music, MusicCD; Corpus; MusicCD; CD]
Corpus-based Schema Matching
• Can we use known schemas and mappings to match as yet unseen schemas?
• Augment information about elements in schemas being matched
• Learn schema design patterns and constraints from known schemas to improve matches
Multiple representations for concepts
• Learn alternate names, data instances, names of related elements, data types, …
[Figure: alternate element names found in the corpus: CDs, CD, Music, Album, AlbumName, Name, TrackName, DiscountPrice, DiscountedPrice, SalePrice, OurPrice, Discounted, DiscPrice, Artist, AuthorArtist, Name, LastName, Author, ID, CDID, ProdCode, ISBN, RecordLabel, Label, Company, RecordingCompany, Artists, CD2Artist, AuthorArtists, ArtistID]
Schema Design Patterns
• Relations between elements
  • Tables and likely columns
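One simple way to read "tables and likely columns" is as a relative-frequency estimate over the corpus. The sketch below assumes each corpus schema has already been normalized to concept names and is given as a dict from table concept to column concepts; that representation and the function name `column_likelihoods` are hypothetical, and the talk does not say how the statistic is actually computed.

```python
from collections import Counter, defaultdict

def column_likelihoods(corpus):
    """Estimate P(column concept | table concept) by relative frequency.

    corpus: list of schemas, each a dict {table_concept: [column_concept, ...]}.
    """
    table_counts = Counter()
    pair_counts = defaultdict(Counter)
    for schema in corpus:
        for table, columns in schema.items():
            table_counts[table] += 1
            for column in set(columns):  # count each column once per table
                pair_counts[table][column] += 1
    return {
        table: {col: n / table_counts[table] for col, n in cols.items()}
        for table, cols in pair_counts.items()
    }
```

A candidate match that places an unlikely column under a table concept could then be penalized when matches are selected.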
[Figure: corpus-based matching architecture]
• Inputs: schemas S with elements s; a corpus of known schemas and mappings
• Build initial models: element models Ms (Name, Instances, Type, …)
• Search similar elements (e, f) in the corpus
• Build augmented models M's (Name, Instances, Type, …)
• Typical schema matcher
• Learn schema design patterns: concepts/clusters, domain constraints
• Generate matches: mapping
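The augmentation step in this pipeline can be illustrated with a much-simplified sketch: find corpus elements that look similar to an element, then merge their names and data instances into its model. Here similarity is a plain token overlap on names with a fixed threshold; the helper names, the dict layout, and the threshold are assumptions for illustration and stand in for the learned models the talk describes.

```python
import re

def tokens(name):
    """Split a camel-case name such as 'MusicCD' into lowercase tokens."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def augment(model, corpus_elements, threshold=0.5):
    """Merge evidence from similar corpus elements into the element's model."""
    augmented = {"names": {model["name"]}, "instances": set(model["instances"])}
    for other in corpus_elements:
        if jaccard(tokens(model["name"]), tokens(other["name"])) >= threshold:
            augmented["names"].add(other["name"])
            augmented["instances"] |= set(other["instances"])
    return augmented
```

An element with no tuples, such as MusicCD in the earlier example, can pick up data instances this way before a typical schema matcher is run on the augmented models.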
Contents of the Corpus
• In order to augment
  • Learn model ensemble for each element
    • names, data instances, types, structure, …
  • Train using the schemas and mappings
    • Element and elements it maps to are positive examples
• In order to learn domain constraints
  • Cluster elements in the corpus into concepts
  • Estimate schema statistics
    • Likely tables-columns and element co-occurrence
  • Learn importance of individual constraints
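The rule that an element and the elements it maps to are positive examples can be sketched as follows. The slide only specifies the positives; treating every other corpus element as a negative example is an assumption made here, as are the function name and the use of plain string identifiers for elements.

```python
def training_examples(elements, mappings):
    """Build per-element training sets from known mappings.

    elements: list of element identifiers, e.g. "A.Title"
    mappings: iterable of (element, element) pairs known to correspond
    """
    mapped = {e: {e} for e in elements}  # an element is a positive for itself
    for a, b in mappings:
        mapped[a].add(b)
        mapped[b].add(a)
    return {
        e: {
            "positive": sorted(mapped[e]),
            # Assumption: all remaining corpus elements serve as negatives.
            "negative": sorted(set(elements) - mapped[e]),
        }
        for e in elements
    }
```

Each element's model ensemble would then be trained on its own positive and negative sets.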
Experimental Results
• Four domains
  • Automatically extracted web forms
  • Manually created relational schemas
• Techniques
  • Direct: Glue [WWW’2004]
  • Corpus-based Augment
  • Corpus-based Pivot [IIW’2004]
Improved Matching Performance
• 16-19 schemas and 6 mappings in the corpus
• 22-54 schema pairs being tested
Difficult Match Tasks
• More significant improvements for difficult tasks
• Improvements are smaller for easy tasks
Related Work
• Using past matching experience
  • [Doan, et al., SIGMOD’2001; Do & Rahm, VLDB’2002]
  • We are trying to match unseen schemas.
• Using web forms to construct mediated schema
  • [He & Chang, SIGMOD’2003]
  • Clustering of elements is an intermediate step in our corpus.
• Using a Domain Ontology
  • [Xu & Embley, DASFAA’2003]
  • Our corpus structures are automatically generated.
Conclusions
• Schema matching is hard with insufficient evidence
• Corpus-based Schema Matching
  • Augment the evidence about elements in unseen schemas
  • Learn schema design patterns to select matches
  • Improves matching, especially for difficult tasks
• Future Work
  • Large schemas and complex mappings
  • User feedback to curate the corpus
  • Corpus as a tool for other data management tasks [Halevy & Madhavan, IJCAI’2003]
http://www.cs.washington.edu/homes/jayant