380 likes | 478 Views
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching . Arnab Nandi Phil Bernstein Univ of Michigan Microsoft Research. Scenario. Scenario. Search over structured data Commerce entertainment
E N D
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching Arnab Nandi Phil BernsteinUniv of Michigan Microsoft Research
Scenario Arnab Nandi & Phil Bernstein
Scenario Arnab Nandi & Phil Bernstein • Search over structured data • Commerce • entertainment • Data onboarding– merge an XML data feed from a 3rd partyto Microsoft data warehouse.
Scenario “Amazon.com” 3rd Party Feed 3rd Party Feed 3rd Party Feed 3rd Party Feed query Users Search engine + data warehouse • High Precision • High Recall • Minimal Human Involvement results Arnab Nandi & Phil Bernstein
Example Feed 3rd Party Movie Site (Foreign) Warehouse: Movies (Host) • -<Movie> • <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> • <Release Key="Yes">2008</Release> • <Description>Ever…</Description> • <RunTime>127</RunTime> • <Categories> • <Category>Action</Category> • <Category>Comedy</Category> • </Categories> • <MPAA>PG-13</MPAA> • <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl> • -<Persons> • <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person> • -</Persons> • </Movie> <MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <MPAA>NR</MPAA> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2> </MOVIE> Arnab Nandi & Phil Bernstein
Schema Matching 3rd Party Movie Site (Foreign) Warehouse: Movies (Host) • -<Movie> • <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> • <Release Key="Yes">2008</Release> • <Description>Ever…</Description> • <RunTime>127</RunTime> • <Categories> • <Category>Action</Category> • <Category>Comedy</Category> • </Categories> • <MPAA>PG-13</MPAA> • <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl> • -<Persons> • <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person> • -</Persons> • </Movie> <MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <RATING>NR</RATING> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2> </MOVIE> Arnab Nandi & Phil Bernstein
Taxonomy Matching 3rd Party Movie Site (Foreign) Warehouse: Movies (Host) • -<Movie> • <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> • <Release Key="Yes">2008</Release> • <Description>Ever…</Description> • <RunTime>127</RunTime> • <Categories> • <Category>Action</Category> • <Category>Comedy</Category> • </Categories> • <MPAA>PG-13</MPAA> • <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl> • -<Persons> • <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person> • -</Persons> • </Movie> <MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <RATING>NR</RATING> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2> </MOVIE> Arnab Nandi & Phil Bernstein
Various Problems Badly normalized…. Unit conversion… In-band signaling… Arbitrary labels Zero documentation Not enough instances Formatting choices… Non standard vocabulary / language Arnab Nandi & Phil Bernstein
Unlike conventional matching… 3rd Party Feed query Users Search engine + data warehouse results Arnab Nandi & Phil Bernstein • We have web search click data • For both Warehouse & 3rd party website • The databases we are integrating (usually) have a presence on the web • Why not use click data as a feature for schema & taxonomy matching?
Outline Arnab Nandi & Phil Bernstein • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results
Core idea Web Search Small laptop “If two (sets of) products are searched for by similar queries, then they are similar” Arnab Nandi & Phil Bernstein
Core idea Warehouse Asus.com Clicklog hardware Small Laptops Pro. Laptops eee X Y eee ::: small laptops Small laptop Small laptop Small laptop Z Arnab Nandi & Phil Bernstein
Query Distributions click count Arnab Nandi & Phil Bernstein
Mapping to Taxonomy 3rd party DB (provided to us) • Map URL to product, which belongs to taxonomy • http://www.amazon.com/dp/B001JTA59C • Shopping | Electronics |Netbooks Arnab Nandi & Phil Bernstein
Aggregating Query Distributions Warehouse Asus.com hardware Small Laptops Pro. Laptops eee eee ::: small laptops Arnab Nandi & Phil Bernstein
Generating Correspondences • Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them. • Process • For each page (URL) • Identify query distribution • Identify category / schema element of that page • For each category / schema element C • Aggregate over pages in C to get query distribution • For each foreign category / schema element • Find host category / schema element with most similar query distribution Arnab Nandi & Phil Bernstein
Outline Arnab Nandi & Phil Bernstein • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results
Example: Taxonomy Matching Warehouse: Professional Laptops Warehouse: Small Laptops eee Arnab Nandi & Phil Bernstein
Example: Taxonomy Matching “laptop” : 70 / 75“netbook” : 5/75 Warehouse: Professional Laptops Warehouse: Small Laptops “laptop”: 25/45“netbook”: 20/45 eee “laptop”: 5/25“netbook”: 15/25“cheap laptop”: 5/25 Arnab Nandi & Phil Bernstein
Distribution Similarity Metric Σ (all qhost, qforeign combinations) Arnab Nandi & Phil Bernstein Jaccard(qhost, qforeign) ✕MinFreq(qhost, qforeign)
Example: Taxonomy Matching Warehouse: Professional Laptops “laptop” : 70 / 75“netbook” : 5/75 0.31 Warehouse: Small Laptops “laptop”: 25/45“netbook”: 20/45 0.74 eee “laptop”: 5/25“netbook”: 15/25“cheap laptop”: 5/25 “small laptops” vs “eee”laptop vs laptop netbookvsnetbooklaptop vs cheap laptop 1 x (25/45) + 1 x (20/45)+ 0.5 x (5/25) = 0.74 Arnab Nandi & Phil Bernstein
Advantages of Clicklogs Arnab Nandi & Phil Bernstein • Resilient to language • Resilient to new domains, data, and features • As long as people query & click, we have data to learn from • Generates mappings previous methods can’t • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments≈ Office Products ▷ Office Machines ▷ Calculators • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷Visual Basic ≈ Software ▷ Developer Tools
System Design Arnab Nandi & Phil Bernstein
Outline Arnab Nandi & Phil Bernstein • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results
Experimenting with Click Logs Arnab Nandi & Phil Bernstein • Commercial warehouse mapping, 258 products • from a 70,000 term Amazon.com taxonomy (613 in gold) • to a 6,000 term warehouse taxonomy (40 in gold) • Live.com (now Bing.com) search querylog • Amazon to warehouse mapping task, consecutively halving the clicklog size used • 1.8 million clicks to Amazon.com product pages • Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).
Summary of Results Arnab Nandi & Phil Bernstein 90% precision / recall possible Query distribution is a good similarity metric Bigger clicklogs imply better recall Technique isn't very sensitive to similarity metric
Precision / Recall Arnab Nandi & Phil Bernstein • Commercial warehouse mapping, 258 products • from a 70K term Amazon.com taxonomy • to a 6,000 term warehouse taxonomy (613 categories used)
Summary of Results Arnab Nandi & Phil Bernstein 90% precision / recall possible Query distribution is a good similarity metric Bigger clicklogs imply better recall Technique isn't very sensitive to similarity metric
Match Quality • QDs of entities are closest to the distributions of their aggregate classes • QDs of similar aggregates are similar Arnab Nandi & Phil Bernstein QDs are unique to entities QDs are unique to aggregate classes
Summary of Results Arnab Nandi & Phil Bernstein 90% precision / recall possible Query distribution is a good similarity metric Bigger clicklogs imply better recall Technique isn't very sensitive to similarity metric
Varying Clicklog Size Arnab Nandi & Phil Bernstein Successively decreased clicklog size by half Recall decreases as clicklog size is decreased
Summary of Results Arnab Nandi & Phil Bernstein 90% precision / recall possible Query distribution is a good similarity metric Bigger clicklogs imply better recall Technique isn't very sensitive to similarity metric
Comparing Query Distributions Jaccard(qhost, qforeign) ✕ MinFreq(qhost, qforeign) • ReplaceJaccardwith various phrase similarity metrics • Minimal difference due to size of most queries Σ (all qhost, qforeign combinations) Arnab Nandi & Phil Bernstein
Summary of Results Arnab Nandi & Phil Bernstein 90% precision / recall possible • Query distribution is a good similarity metric • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric
Related + Future Work Arnab Nandi & Phil Bernstein • Usage Based / Crowdsourcing • Usage-Based Schema Matching (ICDE 2008)Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A. • Matching schemas in online communities: A web 2.0 approach(ICDE 2008) R McCann, W Shen, AH Doan • Web Scale Integration • Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007)JayantMadhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy
Related + Future Work Arnab Nandi & Phil Bernstein • “Mixed” methods • Ontology matching: A machine learning approach (Handbook on Ontologies 2004)A Doan, J Madhavan, P Domingos, A Halevy • Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003)A Doan, P Domingos, A Halevy • Schema and ontology matching with COMA++ (SIGMOD 2005)D Aumueller, HH Do, S Massmann, E Rahm
Conclusion • Unsupervised mapping is possible • very high recall / precision when enough queries are present • Click logs are promising • Finds results that other methods cannot find • As clicklog size increases, it will produce more mappings • Combinable with existing methods Arnab Nandi & Phil Bernstein
Questions? http://arnab.org/contact http://research.microsoft.com/~philbe/ Arnab Nandi & Phil Bernstein