Explore automated wrapper maintenance for dynamic web data extraction. Learn about wrapper verification, maintenance, and generation. User-Defined Schema allows custom data mapping. In-depth analysis reveals the process to discover, recover, and re-induct data elements. Experience the SG-WRAM system in action.
Schema-Guided Wrapper Maintenance for Web-Data Extraction Xiaofeng Meng, Dongdong Hu Renmin University of China, Beijing, China Chen Li University of California, Irvine, CA, USA
Wrappers for Web Sources • Extract information from Web pages • Used in many Web-based applications [Figure: wrappers sit between the sources (HTML documents, XML, RDBMS) and application programs (e.g., data integration)]
Problem • The Web is very dynamic: both contents and page structures change • Original wrappers can stop working because they rely on Web page structures • Re-generating wrappers is not easy: it puts a heavy workload on system developers [Figure: on changed documents, the original wrapper extracts nothing, incomplete results, or incorrect results]
Example The original wrapper fails due to the structure change.
Problems • Wrapper verification: is a wrapper operating correctly? • Several studies have addressed the verification problem, e.g., computing the similarity between a wrapper’s expected and observed output (“regression test”) • Wrapper maintenance: how to automatically modify a wrapper when the pages have changed? (Focus of this work)
Outline • Motivation • System overview • Schema-Guided Wrapper Maintenance • Experiments • Related Work and Conclusion
The SG-WRAM System [Figure: system architecture. Labels: Documents, Changed Documents, Wrapper Generator, Wrapper Executor, Wrapper, XML, Repository, Schema, Rule, Wrapper Maintainer (Data Feature Discovery, Data Item Recovery, Block Configuration, Rule Re-induction)]
User-Defined Schema • The user provides a schema (DTD) for the target data:
<!ELEMENT VideoList (Video+)>
<!ELEMENT Video (Name, Director, Actors, Price)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Actors (#PCDATA)>
<!ELEMENT Price (VHSPrice, DVDPrice)>
<!ELEMENT VHSPrice (#PCDATA)>
<!ELEMENT DVDPrice (#PCDATA)>
Schema-Guided Wrapper Generation • Using a GUI toolkit, users can map data items in HTML pages to elements in the DTD [Figure: DTD tree next to the HTML page]
Schema-Guided Wrapper Generation • Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree • It then generates the extraction rule [Figure: HTML tree mapped to the DTD tree]
Expressing Extraction Rule in XQuery • Each rule is an FLWR XQuery expression • Example (the callouts on the slide mark the paths to the data items and the value of the data item):
FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
RETURN <vedio> { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> } </vedio>
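To make the path part of such a rule concrete, here is a minimal Python sketch of following a path of `tag[index]` steps through a parsed page. It is not the SG-WRAM executor: the function names `follow_path` and `extract_name` and the tiny sample page are assumptions for illustration, indices are read as 0-based as on the slide, a step without an index is simplified to "first child with that tag", and the final `text()[0]` step is approximated by the element's `.text`.

```python
import re
import xml.etree.ElementTree as ET

# One path step: a tag name with an optional 0-based index, e.g. "span[0]".
STEP = re.compile(r"^(\w+)(?:\[(\d+)\])?$")

def follow_path(node, path):
    """Follow a slash-separated path of tag[index] steps; return None if a step fails."""
    for step in path.strip("/").split("/"):
        m = STEP.match(step)
        if m is None:
            raise ValueError(f"unsupported step: {step}")
        tag, index = m.group(1), int(m.group(2) or 0)
        children = [c for c in node if c.tag == tag]
        if index >= len(children):
            return None  # the structure changed, so the rule no longer applies
        node = children[index]
    return node

# Hypothetical fragment standing in for the <td> bound to $vedio.
PAGE = ET.fromstring("<td><span><b><a href='/v/1'>May</a></b></span></td>")

def extract_name(video_node):
    """Mimics: LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]."""
    link = follow_path(video_node, "span[0]/b[0]/a[0]")
    return link.text if link is not None else None

print(extract_name(PAGE))
```

A `None` result corresponds to the "extract nothing" failure described earlier: the rule is purely positional, so any structural change along the path breaks it.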
Annotations for data items • Describe the semantic meaning of a data item • Indicate the location of the data item • Specified by the user using the GUI • Recorded with the XPath function contains(pathToAnnotation, annotationValue), e.g.: /body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]
Outline • Motivation • System Overview • Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments • Related Work and Conclusion
Intuition of the approach • The page structure could change • Observation: many “features” of data items are more static, e.g.: • Hyperlink • Annotation • Pattern • These features can help us find the new places of the old data items
Step 1: Data-feature discovery • Compute features of the data items in the original page
Data-Pattern Feature • A syntactic feature • Represented as a regular expression • E.g., the value "$ 15.38" has the pattern [$][0-9]{0,}[0-9](.)[0-9]{2} • Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
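As a rough illustration of what a data pattern buys you, the sketch below induces a coarse regular expression from one sample value and then tests other strings against it. This is a toy character-class generalizer written for this note, not the technique of [Brin98], [GHQR98], or [LM00]; `char_class` and `induce_pattern` are hypothetical names. Note also that the slide's pattern uses `(.)`, which matches any character; the sketch escapes the dot so that only a literal "." is accepted.

```python
import re

def char_class(ch):
    """Map a character to a coarse class; punctuation stays literal (escaped)."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "a" <= ch <= "z":
        return "[a-z]"
    return re.escape(ch)

def induce_pattern(sample):
    """Collapse runs of the same class into class+ and keep literals as they are."""
    parts = []
    for ch in sample:
        cls = char_class(ch)
        if parts and parts[-1] == cls and cls.startswith("["):
            continue  # same class as the previous character: extend the run
        parts.append(cls)
    return "".join(p + "+" if p.startswith("[") else p for p in parts)

price_pattern = induce_pattern("$15.38")
print(price_pattern)
print(bool(re.fullmatch(price_pattern, "$7.99")))
print(bool(re.fullmatch(price_pattern, "May")))
```

A single sample over-fits easily (a name like "McKee" would not fit a pattern induced from "May"), which is one reason the approach combines patterns with hyperlinks and annotations.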
Annotations and Hyperlinks • Get annotation and hyperlink information from the original page by checking the XQuery-based extraction rule • Hyperlink: a step of “…/a/…” in the path • Annotation: the function “contains()” • Example with a hyperlink indication: { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> } • Example with an annotation value and the path from the data item to the annotation: { LET $actors = $vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")] RETURN <actors> $actors </actors> }
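The two checks on the rule text can be sketched in a few lines of Python. This assumes the rule is available as a plain string and that annotations always appear as `contains(path, "value")`; the helper name `discover_features` and the returned dictionary layout are illustrative, not part of SG-WRAM.

```python
import re

# contains( <path to annotation> , "<annotation value>" )
CONTAINS = re.compile(r'contains\(\s*([^,]*?)\s*,\s*"([^"]*)"\s*\)')

def discover_features(path_expr):
    """Read the hyperlink and annotation features off one path expression."""
    m = CONTAINS.search(path_expr)
    # Remove the predicate first, so a tag inside the annotation path
    # is not mistaken for a step of the data-item path.
    data_path = CONTAINS.sub("", path_expr)
    steps = [s.split("[")[0] for s in data_path.split("/")]
    return {
        "hyperlink": "a" in steps,
        "annotation": m.group(2) if m else None,
        "annotation_path": m.group(1) if m else None,
    }

name_rule = "$vedio/span[0]/b[0]/a[0]/text()[0]"
actors_rule = '$vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")]'
print(discover_features(name_rule))
print(discover_features(actors_rule))
```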
Step 2: Data-Item Recovery • Traverse the new HTML tree following the depth-first traversal order • Use the old features to identify potential data items using 3 matching conditions: • Hyperlink • Annotation • Data pattern
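A minimal sketch of this traversal, under several stated simplifications: the new page is well-formed XML (real HTML would need a tolerant parser), the annotation check only looks at the closest preceding text node instead of following the recorded path from data item to annotation, and the feature table, the Actors pattern, and all function names are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

def text_nodes(node, path="", in_link=False):
    """Yield (text, path, in_link) for each non-empty text node, depth-first."""
    in_link = in_link or node.tag == "a"
    t = 0  # index of the next non-empty text node under this element
    if node.text and node.text.strip():
        yield node.text.strip(), f"{path}/text()[{t}]", in_link
        t += 1
    counts = {}
    for child in node:
        i = counts.get(child.tag, 0)
        counts[child.tag] = i + 1
        yield from text_nodes(child, f"{path}/{child.tag}[{i}]", in_link)
        if child.tail and child.tail.strip():
            yield child.tail.strip(), f"{path}/text()[{t}]", in_link
            t += 1

def recover_items(root, features):
    """Return (value, HTML path, schema path) for text nodes matching all three conditions."""
    mappings, previous_text = [], ""
    for text, path, in_link in text_nodes(root):
        for schema_path, f in features.items():
            if f["hyperlink"] != in_link:
                continue  # hyperlink condition
            if f["annotation"] and f["annotation"] not in previous_text:
                continue  # annotation condition (simplified)
            if not re.fullmatch(f["pattern"], text):
                continue  # data-pattern condition
            mappings.append((text, path, schema_path))
            break
        previous_text = text
    return mappings

# Features assumed to come from Step 1; the Actors pattern is invented here.
FEATURES = {
    "VideoList/Video/Name": {"hyperlink": True, "annotation": None,
                             "pattern": r"[A-Z][a-z]{0,}"},
    "VideoList/Video/Actors": {"hyperlink": False, "annotation": "Featuring",
                               "pattern": r"[A-Za-z ,.]+"},
}
NEW_PAGE = ET.fromstring(
    "<td><span><b><a href='#'>May</a></b></span>"
    "<b>Featuring</b> Angela Bettis</td>"
)
for mapping in recover_items(NEW_PAGE, FEATURES):
    print(mapping)
```

Each returned tuple has the same three parts as a recovered mapping: the value, its path in the new HTML tree, and the path of the corresponding DTD element.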
Example [Figure: recovery flow on a sample page. Hyperlinked item: check hyperlink (ok), check data pattern [A-Z][a-z]{0,} (ok), recognize a data item. Annotated item: find annotation (yes), find value starting from annotation, check data pattern [$][0-9]{0,}[0-9](.)[0-9]{2}, recognize a data item]
Results of Data Item Recovery • A mapping list including all the recognized data items • Each mapping contains • Value of the data item • Path to it in the HTML tree • Path of the corresponding DTD element A sample mapping: M1’ (D: “May”, HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name )
Step 3: Block Configuration • Observation: data items are located in semantic blocks • Data items are grouped in semantic blocks • A semantic block conforms to the user-defined schema [Figure: candidate blocks labeled Partial-Match, Full-Match, and Over-Match]
Computing “Full Match” Blocks • Identify the level in a top-down manner • Check the level by recursively considering the matches between candidate blocks and the schema “Full match” blocks
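The slides do not define the three match kinds formally, so the sketch below fixes one plausible reading as an assumption: a block is a full match if it contains every leaf element of one Video exactly once, an over-match if some leaf occurs more than once (the block spans several records), and a partial match otherwise. Candidate blocks are modeled as nested Python lists of schema element names, and the search descends only into over-matching blocks, which is one way to read "top-down". All names are hypothetical.

```python
from collections import Counter

# Leaf elements of one Video in the user-defined DTD.
VIDEO_LEAVES = ("Name", "Director", "Actors", "VHSPrice", "DVDPrice")

def flatten(block):
    """Yield the schema element names of all data items inside a nested block."""
    for item in block:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def classify(block, required=VIDEO_LEAVES):
    """Return 'over', 'full', or 'partial' under the assumed definitions."""
    counts = Counter(flatten(block))
    if any(counts[name] > 1 for name in required):
        return "over"
    if all(counts[name] == 1 for name in required):
        return "full"
    return "partial"

def full_match_blocks(block):
    """Top-down search: stop at full matches, descend into over-matches."""
    kind = classify(block)
    if kind == "full":
        return [block]
    if kind == "over":
        found = []
        for child in block:
            if isinstance(child, list):
                found.extend(full_match_blocks(child))
        return found
    return []  # a partial match cannot contain a full match

video = ["Name", "Director", "Actors", ["VHSPrice", "DVDPrice"]]
page = [list(video), list(video), ["Name"]]
print(classify(page), len(full_match_blocks(page)))
```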
Results of Block Configuration • A set of blocks that can fully match with the DTD • Each of them is represented as a list of mappings Examples
Step 4: Rule Re-Induction • Semantic blocks contain mappings from data items in HTML to DTD elements • Induce a new extraction rule by calling the induction algorithm in the wrapper generator • Refine the rule so that it also covers all other semantic blocks • Generalization is necessary
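One simple form of generalization, shown here only as a sketch and not as the induction algorithm of the wrapper generator: take the HTML paths of the same DTD element in several semantic blocks and drop the index on every step where the blocks disagree, so that the step selects all siblings with that tag. The function name `generalize` and the sample paths are assumptions.

```python
def generalize(paths):
    """Merge same-depth paths, dropping the index wherever they differ."""
    split = [p.strip("/").split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths of different depth need another strategy")
    merged = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            merged.append(steps[0])  # identical in every block: keep as is
            continue
        tags = {s.split("[")[0] for s in steps}
        if len(tags) != 1:
            raise ValueError(f"steps use different tags: {sorted(tags)}")
        merged.append(tags.pop())  # same tag, different index: drop the index
    return "/" + "/".join(merged)

# Hypothetical paths of the Name item in two recovered blocks.
block_paths = [
    "/body/table[0]/tr[0]/td[1]/a[0]",
    "/body/table[0]/tr[1]/td[1]/a[0]",
]
print(generalize(block_paths))
```

Here the two blocks differ only in the row index, so the merged path iterates over all `tr` children, in the same way the original rule used `table` without an index.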
Outline • Motivation • System Overview • Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments • Related Work and Conclusion
Web Sources • From October 2002 to May 2003 • Collected Web page changes • From 16 data-intensive sites • Using site search engine or from the same URL • All the pages have complex table structures • Observed changes • Data items (add, delete, modify) • Table structure to non-table structure • Complex table structure re-arrangement
Experiment Procedures [Figure: experiment workflow in three steps. Labels: Original Web Docs, New Web Docs, Changed pages, Wrapper Generator, Original Wrappers, Wrapper Repository, Check Extraction Results, Wrapper Maintainer, Repaired Wrappers]
Experiment Metrics • Recall (R) • Proportion of correctly extracted data items among all data items that should be extracted • Precision (P) • Proportion of correctly extracted data items among all data items that have been extracted
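The two metrics can be computed as below. This is a small sketch with made-up data items, not the authors' evaluation script; items are modeled as (element, value) pairs in sets, so duplicate extractions would need a multiset instead.

```python
def precision_recall(extracted, expected):
    """Return (P, R) for a set of extracted items against the expected items."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall

# Invented example: 4 items extracted, 3 of them correct, 5 items expected.
expected = {("Name", "May"), ("Director", "Lucky McKee"),
            ("Actors", "Angela Bettis"), ("VHSPrice", "$9.99"),
            ("DVDPrice", "$15.38")}
extracted = {("Name", "May"), ("Director", "Lucky McKee"),
             ("DVDPrice", "$15.38"), ("Actors", "Featuring")}
p, r = precision_recall(extracted, expected)
print(p, r)  # 0.75 0.6
```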
Related Work on Wrapper Maintenance • [Kushmerick 99] • Using simple numeric features of the extracted strings • [Lerman K., Minton S. 00] • Using the starting and ending strings as the description of the data fields • [Chidlovskii B. 01] • Syntactic features of data items to be extracted, and semantic features: URL, time strings, entities…
Comparison • These approaches rely heavily on the syntactic features of the data items and may not recognize data items precisely.
Conclusion • SG-WRAM: a wrapper-maintenance system • Intuition: use features that are more stable • Pattern • Hyperlink • Annotation • Four steps of the approach: • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments showed that it is effective
Thank you!