340 likes | 436 Views
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching. Arnab Nandi Phil Bernstein UNIV OF MICHIGAN MICROSOFT RESEARCH. PRESENTED BY VAIBHAV MEHTA. Scenario. Scenario. Search over structured data Commerce entertainment
E N D
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching Arnab Nandi Phil BernsteinUNIV OF MICHIGAN MICROSOFT RESEARCH PRESENTED BY VAIBHAV MEHTA
Scenario Arnab Nandi & Phil Bernstein
Scenario • Search over structured data • Commerce • entertainment • Data onboarding – merge an XML data feed from a 3rd partyto Microsoft data warehouse. Arnab Nandi & Phil Bernstein
Scenario “Amazon.com” 3rd Party Feed 3rd Party Feed 3rd Party Feed 3rd Party Feed query Users Search engine + data warehouse • High Precision • (Irrespective of Recall) • Minimal Human Involvement results Arnab Nandi & Phil Bernstein
Example Feed 3rd Party Movie Site (Foreign) Warehouse: Movies (Host) • -<Movie> • <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> • <Release Key="Yes">2008</Release> • <Description>Ever…</Description> • <RunTime>127</RunTime> • <Categories> • <Category>Action</Category> • <Category>Comedy</Category> • </Categories> • <MPAA>PG-13</MPAA> • <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl> • -<Persons> • <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person> • -</Persons> • </Movie> <MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <MPAA>NR</MPAA> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2> </MOVIE> Arnab Nandi & Phil Bernstein
Schema Matching 3rd Party Movie Site (Foreign) Warehouse: Movies (Host) • -<Movie> • <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> • <Release Key="Yes">2008</Release> • <Description>Ever…</Description> • <RunTime>127</RunTime> • <Categories> • <Category>Action</Category> • <Category>Comedy</Category> • </Categories> • <MPAA>PG-13</MPAA> • <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl> • -<Persons> • <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person> • -</Persons> • </Movie> <MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <RATING>NR</RATING> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2> </MOVIE> Arnab Nandi & Phil Bernstein
Taxonomy Matching 3rd Party Movie Site (Foreign) Warehouse: Movies (Host) • -<Movie> • <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> • <Release Key="Yes">2008</Release> • <Description>Ever…</Description> • <RunTime>127</RunTime> • <Categories> • <Category>Action</Category> • <Category>Comedy</Category> • </Categories> • <MPAA>PG-13</MPAA> • <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl> • -<Persons> • <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person> • -</Persons> • </Movie> <MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <RATING>NR</RATING> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2> </MOVIE> Arnab Nandi & Phil Bernstein
Various Problems Badly normalized…. Unit conversion… In-band signaling… Arbitrary labels Zero documentation Not enough instances Formatting choices… Non standard vocabulary / language Arnab Nandi & Phil Bernstein
Unlike conventional matching… • We have web search click data • For both Warehouse & 3rd party website • The databases we are integrating (usually) have a presence on the web • Why not use click data as a feature for schema & taxonomy matching? 3rd Party Feed query Users Search engine + data warehouse results Arnab Nandi & Phil Bernstein
Outline • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results Arnab Nandi & Phil Bernstein
Core idea • “If two (sets of) products are searched for by similar queries, then they are similar” Web Search Small laptop Arnab Nandi & Phil Bernstein
Core idea Asus.com Warehouse Clicklog hardware Small Laptops eee Pro. Laptops X Y eee ::: small laptops Small laptop Small laptop Small laptop Z Arnab Nandi & Phil Bernstein
Query Distributions click count Arnab Nandi & Phil Bernstein
Mapping to Taxonomy • Map URL to product, which belongs to taxonomy • http://www.amazon.com/dp/B001JTA59C • Shopping | Electronics |Netbooks 3rd party DB (provided to us) Arnab Nandi & Phil Bernstein
Aggregating Query Distributions Asus.com Warehouse hardware Small Laptops eee Pro. Laptops eee ::: small laptops Arnab Nandi & Phil Bernstein
Generating Correspondences • Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them. • Process • For each page (URL) • Identify query distribution • Identify category / schema element of that page • For each category / schema element C • Aggregate over pages in C to get query distribution • For each foreign category / schema element • Find host category / schema element with most similar query distribution Arnab Nandi & Phil Bernstein
Outline • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results Arnab Nandi & Phil Bernstein
Example: Taxonomy Matching Warehouse: Professional Laptops Warehouse: Small Laptops eee Arnab Nandi & Phil Bernstein
Example: Taxonomy Matching Warehouse: Professional Laptops “laptop” : 70 / 75“netbook” : 5/75 Warehouse: Small Laptops “laptop”: 25/45“netbook”: 20/45 eee “laptop”: 5/25“netbook”: 15/25“cheap laptop”: 5/25 Arnab Nandi & Phil Bernstein
Distribution Similarity Metric Jaccard(qhost, qforeign) ✕ MinFreq(qhost, qforeign) Σ (all qhost, qforeign combinations) Arnab Nandi & Phil Bernstein
Example: Taxonomy Matching “small laptops” vs “eee”laptop vs laptop netbook vs netbook laptop vs cheap laptop 1 x (5/25) + 1 x (20/45)+ 0.5 x (5/25) = 0.74 Warehouse: Professional Laptops “laptop” : 70 / 75“netbook” : 5/75 0.31 Warehouse: Small Laptops “laptop”: 25/45“netbook”: 20/45 0.74 eee “laptop”: 5/25“netbook”: 15/25“cheap laptop”: 5/25 Arnab Nandi & Phil Bernstein
Advantages of Clicklogs • Resilient to language • Resilient to new domains, data, and features • As long as people query & click, we have data to learn from • Generates mappings previous methods can’t • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments≈ Office Products ▷ Office Machines ▷ Calculators • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷Visual Basic ≈ Software ▷ Developer Tools Arnab Nandi & Phil Bernstein
System Design Arnab Nandi & Phil Bernstein
Outline • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results Arnab Nandi & Phil Bernstein
Experimenting with Click Logs • Commercial warehouse mapping, 258 products • from a 70,000 term Amazon.com taxonomy (613 in gold) • to a 6,000 term warehouse taxonomy (40 in gold) • Live.com (now Bing.com) search querylog • Amazon to warehouse mapping task, consecutively halving the clicklog size used • 1.8 million clicks to Amazon.com product pages • Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22). Arnab Nandi & Phil Bernstein
Summary of Results • 90% precision / recall possible • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric Arnab Nandi & Phil Bernstein
Precision / Recall • Commercial warehouse mapping, 258 products • from a 70K term Amazon.com taxonomy • to a 6,000 term warehouse taxonomy (613 categories used) Arnab Nandi & Phil Bernstein
Summary of Results 90% precision / recall possible • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric Arnab Nandi & Phil Bernstein
Varying Clicklog Size • Successively decreased clicklog size by half • Recall decreases as clicklog size is decreased Arnab Nandi & Phil Bernstein
Summary of Results 90% precision / recall possible Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric Arnab Nandi & Phil Bernstein
Comparing Query Distributions Jaccard(qhost, qforeign) ✕ MinFreq(qhost, qforeign) • Replace Jaccard with various phrase similarity metrics • Minimal difference due to size of most queries Σ (all qhost, qforeign combinations) Arnab Nandi & Phil Bernstein
Summary of Results 90% precision / recall possible • Query distribution is a good similarity metric • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric Arnab Nandi & Phil Bernstein
Conclusion • Unsupervised mapping is possible • very high recall / precision when enough queries are present • Click logs are promising • Finds results that other methods cannot find • As clicklog size increases, it will produce more mappings • Combinable with existing methods Arnab Nandi & Phil Bernstein
Questions? Arnab Nandi & Phil Bernstein