Joint Optimization of Wrapper Generation and Template Detection
Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen
Microsoft Research Asia
SIGKDD-2007, San Jose, California, USA
Outline • Introduction • Our approach • Experiments • Demo • Conclusion
Motivations
• Encoding: a page generation script (e.g., ASP, PHP, JSP) renders database records into pages
• Decoding: a wrapper recovers the records from the pages
Related Work
• Several automatic or semi-automatic wrapper learning methods have been proposed
  • e.g., WIEN[12], SoftMealy[11], Stalker[17], RoadRunner[6], EXALG[2], TTAG[4], the work in [18], ViNTs[21], etc.
• Page clustering for wrapper induction is considered a trivial task
  • Manual: most previous work
  • Automatic but isolated from wrapper generation: RoadRunner[6,7] and [18]
Problems (cont.)
• Dynamic URLs
  • With the popularity of dynamic URLs, detecting templates by URL is less effective than it used to be
(a): www.amazon.com/gp/product/B000BNLGJA/
(b): www.amazon.com/gp/product/B00007J8SC/
(c): www.amazon.com/gp/product/B0000DD95R/
(d): www.amazon.com/gp/product/B0000A1AT9/
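The limitation of URL-based template detection can be illustrated with a small sketch. It is not taken from the talk: the helper names `url_pattern` and `group_by_url` and the rule "a 10-character uppercase alphanumeric segment is a product ID" are assumptions made for illustration only.

```python
import re
from collections import defaultdict

def url_pattern(url: str) -> str:
    """Generalize a URL by replacing ID-like path segments with '*'.

    Assumed heuristic: a segment of exactly 10 uppercase letters/digits
    is treated as a product ID.
    """
    segments = url.strip("/").split("/")
    generalized = [
        "*" if re.fullmatch(r"[A-Z0-9]{10}", seg) else seg
        for seg in segments
    ]
    return "/".join(generalized)

def group_by_url(urls):
    """Group URLs that share the same generalized pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)

urls = [
    "www.amazon.com/gp/product/B000BNLGJA/",
    "www.amazon.com/gp/product/B00007J8SC/",
    "www.amazon.com/gp/product/B0000DD95R/",
    "www.amazon.com/gp/product/B0000A1AT9/",
]
groups = group_by_url(urls)
for pattern, members in groups.items():
    print(pattern, len(members))
```

All four URLs fall into a single group, so a URL-based grouping carries no information about whether the underlying pages are rendered with the same template or with different ones.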
Problems
• Dynamic URLs
  • With the popularity of dynamic URLs, detecting templates by URL is less effective than it used to be
• Complex Templates
  • Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal
(c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/
Our Proposed Approach
• Main ideas
  • Similarity-based templates, instead of ground-truth templates
• Advantages
  • More stable
  • Optimizes the number of wrappers
Outline • Introduction • Our approach • Experiments • Demo • Conclusion
Problem Definition
System Overview
Wrapper Generation [6, 4, 18]
Wrapper-DOM Distance
• Distance between a wrapper and a DOM tree
  • Tree alignment
  • Cost calculation
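The slide names the two ingredients (tree alignment and cost calculation) without giving formulas. The following is a minimal sketch of one common way to combine them, not the distance defined by the authors. Assumptions: the wrapper is treated as a plain tree, a node is a `(tag, children)` tuple, inserting or deleting a subtree costs its node count, and the function names `size`, `align_cost`, and `distance` are hypothetical.

```python
def size(node):
    """Number of nodes in a (tag, children) tree."""
    _tag, children = node
    return 1 + sum(size(child) for child in children)

def align_cost(a, b):
    """Top-down alignment cost between two trees.

    Roots with different tags cannot be aligned, so both subtrees are
    counted as unmatched. Otherwise the child sequences are aligned with
    an edit-distance style dynamic program.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return size(a) + size(b)
    m, n = len(kids_a), len(kids_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),   # child of a unmatched
                dp[i][j - 1] + size(kids_b[j - 1]),   # child of b unmatched
                dp[i - 1][j - 1] + align_cost(kids_a[i - 1], kids_b[j - 1]),
            )
    return dp[m][n]

def distance(a, b):
    """Alignment cost normalized to [0, 1] by the total node count."""
    return align_cost(a, b) / (size(a) + size(b))

# Hypothetical wrapper and page trees.
wrapper = ("div", [("h1", []), ("img", []), ("span", [])])
page = ("div", [("h1", []), ("span", [])])
print(align_cost(wrapper, page), distance(wrapper, page))
```

In this example the only unmatched node is the `img` element, so the alignment cost is 1 and the normalized distance is 1/7. A real wrapper would also carry wildcard or optional nodes, which this sketch ignores.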
Wrapper-Oriented Page Clustering (WPC)
Figure panels: (a) Level-1 Wrapper, (b) Level-2 Wrapper, (c) Level-3 Wrapper, (d) Level-4 Wrapper
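The slide shows WPC only as a figure, so the procedure itself is not recoverable from this transcript. As a rough illustration of threshold-driven clustering in which each cluster owns one wrapper, here is a greedy sketch. Everything in it is an assumption: pages are reduced to sets of tag paths, the "wrapper" is just the union of the paths seen in its cluster, Jaccard distance stands in for the wrapper-DOM distance, and `cluster_pages` is a hypothetical name.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; stand-in for a wrapper-DOM distance."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster_pages(pages, threshold):
    """Greedily assign each page to the closest wrapper.

    A page joins the nearest wrapper if the distance is within the
    threshold (the wrapper is then refined with the page); otherwise
    the page starts a new wrapper.
    """
    wrappers = []
    for name, paths in pages:
        best, best_d = None, None
        for w in wrappers:
            d = jaccard_distance(w["signature"], paths)
            if best_d is None or d < best_d:
                best, best_d = w, d
        if best is not None and best_d <= threshold:
            best["pages"].append(name)
            best["signature"] |= paths  # refine the wrapper
        else:
            wrappers.append({"signature": set(paths), "pages": [name]})
    return wrappers

# Hypothetical training pages.
pages = [
    ("p1", {"html/body/h1", "html/body/img", "html/body/span"}),
    ("p2", {"html/body/h1", "html/body/img", "html/body/span", "html/body/ul"}),
    ("p3", {"html/body/table/tr/td", "html/body/table/tr/th"}),
]
loose = cluster_pages(pages, threshold=0.5)
strict = cluster_pages(pages, threshold=0.1)
print([w["pages"] for w in loose])
print([w["pages"] for w in strict])
```

With the looser threshold, `p1` and `p2` share a wrapper and `p3` gets its own; with the stricter threshold every page gets its own wrapper. A greedy scheme like this depends on the order in which pages arrive, which is the kind of sensitivity a stability test on the initial training page would measure.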
Outline • Introduction • Our approach • Experiments • Demo • Conclusion
Experiments
• Data
  • 1700 product pages from Amazon.com (Amazon)
  • 1000 mixed pages from 10 shopping sites (M10)
  • Target product records: (name, image, price)
• Settings
  • 2-fold cross-validation
  • Evaluation measures: Precision, Recall and F1
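For reference, the three evaluation measures can be computed as below. The slide does not say how a correct extraction is counted, so this sketch assumes an extracted field is correct only if its (page, field, value) triple exactly matches the labeled one. The function name `precision_recall_f1` and the sample data are illustrative.

```python
def precision_recall_f1(extracted, expected):
    """Precision, recall and F1 over sets of (page, field, value) triples."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels and extractor output.
expected = {
    ("p1", "name", "Camera"),
    ("p1", "price", "$99"),
    ("p1", "image", "cam.jpg"),
    ("p2", "name", "Tripod"),
}
extracted = {
    ("p1", "name", "Camera"),
    ("p1", "price", "$99"),
    ("p2", "name", "Tripod stand"),  # wrong value, counts as an error
}
p, r, f1 = precision_recall_f1(extracted, expected)
print(p, r, f1)
```

Here two of the three extracted fields are correct and two of the four labeled fields are found, giving precision 2/3, recall 1/2, and F1 4/7.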
Effectiveness Test
• Amazon: 44 wrappers, F1: 94.88% vs. 78%
• M10: (results figure not preserved)
WPC with Different Thresholds
Stability Test
• Objective
  • Evaluate how the choice of the initial training page affects the performance of WPC
Outline • Introduction • Our approach • Experiments • Demo • Conclusion
Demo! The Microsoft Office Excel 2007 Web Data Add-In is coming soon! Please try it out in two weeks! http://blogs.msdn.com/xaw
Outline • Introduction • Our approach • Experiments • Demo • Conclusion
Conclusion
• Our system
  • Takes a miscellaneous training set as input
  • Conducts template detection and wrapper generation in a single step
  • Can achieve a joint optimization under the criterion of extraction accuracy
• In the near future
  • We will extend the approach to handle templates containing content strings
Thanks!
Contacts:
• Ruihua Song (rsong@microsoft.com)
• Shuyi Zheng (shzheng@cse.psu.edu)
Poster No. 11 • Looking forward to talking with you at Poster Reception II this evening!
Backup Slides
Labeling Cost
• Shows how many training pages are required to learn wrappers that achieve an F1 higher than 95%