HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ of Michigan) and Phil Bernstein (Microsoft Research)

Scenario
• Search over structured data
  • Commerce
  • Entertainment
• Data onboarding: merge an XML data feed from a 3rd party into a Microsoft data warehouse.

Scenario
[Figure: several 3rd party feeds (e.g. “Amazon.com”), a search engine + data warehouse, and users; arrows labeled “query” and “results”]
• High Precision
• High Recall
• Minimal Human Involvement

Example Feed

3rd Party Movie Site (Foreign):
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Schema Matching
3rd Party Movie Site (Foreign) and Warehouse: Movies (Host): the same two records as on the Example Feed slide, except that the host record's rating element appears as <RATING>NR</RATING>.

Taxonomy Matching
3rd Party Movie Site (Foreign) and Warehouse: Movies (Host): the same two records as on the Example Feed slide, except that the host record's rating element appears as <RATING>NR</RATING>.

Various Problems
• Badly normalized…
• Unit conversion…
• In-band signaling…
• Arbitrary labels
• Zero documentation
• Not enough instances
• Formatting choices…
• Non-standard vocabulary / language

Unlike conventional matching…
[Figure: 3rd party feed, search engine + data warehouse, and users; arrows labeled “query” and “results”]
• We have web search click data
  • For both Warehouse & 3rd party website
• The databases we are integrating (usually) have a presence on the web
• Why not use click data as a feature for schema & taxonomy matching?

Outline
• Scenario
• Using Clicklogs
• Core idea
• Using Query Distributions
• Example
• System Architecture
• Results

Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Figure: a web search for “Small laptop”]

Core idea
[Figure: Warehouse taxonomy (hardware: Small Laptops, Pro. Laptops) and Asus.com category “eee”, joined through the clicklog: the query “Small laptop” leads to products X, Y, and Z; eee ::: small laptops]

Query Distributions
[Figure: a query distribution; axis label: click count]

Mapping to Taxonomy
• 3rd party DB (provided to us)
• Map URL to product, which belongs to taxonomy
  • http://www.amazon.com/dp/B001JTA59C
  • Shopping | Electronics | Netbooks
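The mapping step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the click records, the (query, URL) record layout, and the `url_to_category` lookup are assumptions made up for the example; only the URL and the taxonomy path come from the slide.

```python
from collections import Counter, defaultdict

# Hypothetical click records as (query, clicked URL) pairs; real clicklogs carry more fields.
clicks = [
    ("netbook", "http://www.amazon.com/dp/B001JTA59C"),
    ("netbook", "http://www.amazon.com/dp/B001JTA59C"),
    ("small laptop", "http://www.amazon.com/dp/B001JTA59C"),
]

# Hypothetical stand-in for the 3rd party DB: URL -> taxonomy path of the product.
url_to_category = {
    "http://www.amazon.com/dp/B001JTA59C": "Shopping | Electronics | Netbooks",
}

def query_distributions(click_records):
    """Count clicks per (normalized) query for every clicked URL."""
    per_url = defaultdict(Counter)
    for query, url in click_records:
        per_url[url][query.strip().lower()] += 1
    return per_url

per_url = query_distributions(clicks)
for url, dist in per_url.items():
    # Pages without a taxonomy entry are reported as unmapped rather than guessed.
    print(url_to_category.get(url, "(unmapped)"), dict(dist))
```

Each URL ends up with a bag of queries and click counts, plus the taxonomy node it belongs to, which is the input the later aggregation step needs.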

Aggregating Query Distributions
[Figure: Warehouse taxonomy (hardware: Small Laptops, Pro. Laptops) and Asus.com category “eee”, each with an aggregated query distribution; eee ::: small laptops]

Generating Correspondences
• Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
• Process
  • For each page (URL)
    • Identify query distribution
    • Identify category / schema element of that page
  • For each category / schema element C
    • Aggregate over pages in C to get query distribution
  • For each foreign category / schema element
    • Find host category / schema element with most similar query distribution
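The process above can be sketched as follows. This is a hedged illustration under stated assumptions: the page names (`host/p1`, `asus/e1`, ...) and per-page counts are invented so that the aggregates equal the laptop example used later in the talk, and `shared_click_mass` is only a placeholder similarity, not the metric the talk defines.

```python
from collections import Counter

def aggregate(page_dists, page_to_category):
    """Sum per-page query counts into one query distribution per category."""
    per_category = {}
    for page, dist in page_dists.items():
        category = page_to_category.get(page)
        if category is None:
            continue  # page is not mapped to any category / schema element
        per_category.setdefault(category, Counter()).update(dist)
    return per_category

def shared_click_mass(a, b):
    """Placeholder similarity (an assumption): overlap of relative click frequencies."""
    total_a, total_b = sum(a.values()), sum(b.values())
    return sum(min(a[q] / total_a, b[q] / total_b) for q in a.keys() & b.keys())

def best_matches(foreign, host, similarity):
    """For each foreign category, pick the host category with the most similar distribution."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }

# Hypothetical pages and counts, chosen to reproduce the aggregates of the laptop example.
host_pages = {
    "host/p1": Counter(laptop=40, netbook=2),
    "host/p2": Counter(laptop=30, netbook=3),
    "host/p3": Counter(laptop=25, netbook=20),
}
host_map = {"host/p1": "Professional Laptops", "host/p2": "Professional Laptops",
            "host/p3": "Small Laptops"}
foreign_pages = {
    "asus/e1": Counter({"laptop": 5, "netbook": 10}),
    "asus/e2": Counter({"netbook": 5, "cheap laptop": 5}),
}
foreign_map = {"asus/e1": "eee", "asus/e2": "eee"}

host_cats = aggregate(host_pages, host_map)
foreign_cats = aggregate(foreign_pages, foreign_map)
matches = best_matches(foreign_cats, host_cats, shared_click_mass)
print(matches)
```

Passing the similarity in as a function keeps the matching loop independent of the metric, so a different distribution similarity can be swapped in without touching the aggregation.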

Example: Taxonomy Matching
• Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
• Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
• eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

Distribution Similarity Metric
Σ (all qhost, qforeign combinations) Jaccard(qhost, qforeign) ✕ MinFreq(qhost, qforeign)

Example: Taxonomy Matching
• Warehouse: Professional Laptops (“laptop”: 70/75, “netbook”: 5/75): similarity to eee = 0.31
• Warehouse: Small Laptops (“laptop”: 25/45, “netbook”: 20/45): similarity to eee = 0.74
• eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
“small laptops” vs “eee” (each term is Jaccard times the smaller of the two frequencies):
• laptop vs laptop: 1 x (5/25)
• netbook vs netbook: 1 x (20/45)
• laptop vs cheap laptop: 0.5 x (5/25)
• Total: 1 x (5/25) + 1 x (20/45) + 0.5 x (5/25) = 0.74
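The metric can be checked against the example with a short Python sketch. Two details are assumptions rather than something the slides spell out: Jaccard is computed over the words of each query, and MinFreq is read as the smaller of the two relative click frequencies.

```python
from itertools import product

def jaccard(q1, q2):
    """Word-level Jaccard similarity between two query strings (assumed tokenization)."""
    w1, w2 = set(q1.split()), set(q2.split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def distribution_similarity(host, foreign):
    """Sum Jaccard x MinFreq over all (host query, foreign query) combinations.

    MinFreq is taken to be the smaller of the two relative click frequencies,
    which is an interpretation of the slide, not a confirmed definition.
    """
    h_total, f_total = sum(host.values()), sum(foreign.values())
    return sum(
        jaccard(qh, qf) * min(host[qh] / h_total, foreign[qf] / f_total)
        for qh, qf in product(host, foreign)
    )

# Click counts taken from the example.
professional_laptops = {"laptop": 70, "netbook": 5}
small_laptops = {"laptop": 25, "netbook": 20}
eee = {"laptop": 5, "netbook": 15, "cheap laptop": 5}

sim_small = distribution_similarity(small_laptops, eee)
sim_prof = distribution_similarity(professional_laptops, eee)
print(round(sim_small, 2))  # 0.74
print(sim_small > sim_prof)  # True
```

Under this reading the Small Laptops score reproduces the 0.74 on the slide. The Professional Laptops score also comes out lower, though not necessarily at exactly the slide's 0.31, so the original computation may differ in some detail.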

Advantages of Clicklogs
• Resilient to language
• Resilient to new domains, data, and features
  • As long as people query & click, we have data to learn from
• Generates mappings previous methods can’t
  • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

System Design

Experimenting with Click Logs
• Commercial warehouse mapping, 258 products
  • from a 70,000 term Amazon.com taxonomy (613 in gold)
  • to a 6,000 term warehouse taxonomy (40 in gold)
• Live.com (now Bing.com) search querylog
• Amazon to warehouse mapping task, consecutively halving the clicklog size used
• 1.8 million clicks to Amazon.com product pages
• Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

Summary of Results
• 90% precision / recall possible
• Query distribution is a good similarity metric
• Bigger clicklogs imply better recall
• Technique isn't very sensitive to similarity metric

Precision / Recall
• Commercial warehouse mapping, 258 products
  • from a 70K term Amazon.com taxonomy
  • to a 6,000 term warehouse taxonomy (613 categories used)

Match Quality
• QDs (query distributions) of entities are closest to the distributions of their aggregate classes
• QDs of similar aggregates are similar
• QDs are unique to entities
• QDs are unique to aggregate classes

Varying Clicklog Size
• Successively decreased clicklog size by half
• Recall decreases as clicklog size is decreased

Comparing Query Distributions
Σ (all qhost, qforeign combinations) Jaccard(qhost, qforeign) ✕ MinFreq(qhost, qforeign)
• Replace Jaccard with various phrase similarity metrics
• Minimal difference due to size of most queries
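One way to picture the substitution is to make the phrase similarity a parameter. The sketch below is illustrative only: the word-level Dice coefficient is used as a stand-in alternative (the slides do not say which metrics were tried), the distributions are the laptop example from earlier in the talk, and MinFreq is again assumed to be the smaller relative click frequency.

```python
from itertools import product

def jaccard(q1, q2):
    """Word-level Jaccard similarity between two queries."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(q1, q2):
    """Word-level Dice coefficient, used here only as an example alternative."""
    a, b = set(q1.split()), set(q2.split())
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def distribution_similarity(host, foreign, phrase_sim):
    """Sum phrase_sim x MinFreq over all (host query, foreign query) combinations."""
    h_total, f_total = sum(host.values()), sum(foreign.values())
    return sum(
        phrase_sim(qh, qf) * min(host[qh] / h_total, foreign[qf] / f_total)
        for qh, qf in product(host, foreign)
    )

hosts = {
    "Professional Laptops": {"laptop": 70, "netbook": 5},
    "Small Laptops": {"laptop": 25, "netbook": 20},
}
eee = {"laptop": 5, "netbook": 15, "cheap laptop": 5}

winners = {}
for name, sim in (("jaccard", jaccard), ("dice", dice)):
    winners[name] = max(hosts, key=lambda h: distribution_similarity(hosts[h], eee, sim))
print(winners)
```

With queries of one or two words, the two phrase metrics differ only on partial overlaps such as "laptop" vs "cheap laptop", so the chosen host category stays the same in this toy case.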

Related + Future Work
• Usage Based / Crowdsourcing
  • Usage-Based Schema Matching (ICDE 2008). Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
  • Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan
• Web Scale Integration
  • Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy

Related + Future Work
• “Mixed” methods
  • Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  • Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  • Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm

Conclusion
• Unsupervised mapping is possible
  • Very high recall / precision when enough queries are present
• Click logs are promising
  • Finds results that other methods cannot find
  • As clicklog size increases, it will produce more mappings
  • Combinable with existing methods

Questions?
http://arnab.org/contact
http://research.microsoft.com/~philbe/
