This study presents a method for extracting dynamic sections and search result records from varying search engine result pages automatically. The technique utilizes ViNTs and refined mining processes to overcome challenges such as non-uniform section formats and boundary issues. The extraction algorithm involves identifying candidate section boundary markers, dynamic section boundaries, and refining extracted sections and records. By focusing on section cohesion and granularity, the approach aims to accurately extract relevant data for various web applications.
VLDB 2006 Seoul Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages Hongkun Zhao, Weiyi Meng, Clement Yu* Department of Computer Science State University of New York at Binghamton * Department of Computer Science University of Illinois at Chicago September 15, 2006
Presentation Outline • Background • Dynamic section extraction • Problem Statement • The solution • Experiments • Related work
Background: SRR Extraction - Motivations • SRRs are frequently needed as input to other Web applications: • Metasearch engines need to collect the SRRs from different search engines and merge them. • Comparison shopping services need to compare SRRs from different search engines to find the best deal.
Background: Main Research Issues • Three levels of search result extraction • Section identification • Record extraction • Data unit identification and annotation • Automatic wrapper generation
Background: SRR Extraction – ViNTs • Most current works on automatic search result extraction are on record extraction, including • ViNTs (WWW 2005) • ViNTs can extract records from sections containing at least three records, including non-result (static) records
Problem Definition: Dynamic Sections • A typical search engine result page contains static, semi-dynamic and dynamic contents. • Static: query independent • Semi-dynamic: basic structure is query independent • Dynamic: query dependent • A dynamic section is a set of all SRRs that appear consecutively and have certain common features such as a common header and a common display format.
Problem Definition: Dynamic Section Extraction Problem statement: automatically extract all dynamic sections, as well as the SRRs within each dynamic section, from the search result pages of any search engine. • Why dynamic section extraction: • They correspond to search results and many applications need them. • Different applications may need SRRs from different sections.
Problem Definition: Challenges in Dynamic Section Extraction • Non-uniform section format problem • Section-record granularity problem • Records versus sections • Hidden section extraction problem • Some sections may not appear in sample result pages used for training
MRE: Multi-Record section Extraction • MRE is revised from ViNTs (WWW 2005) • Using MRE to extract MRs has four potential problems: • boundary problem, i.e., some records near the two boundaries of an MR may be incorrectly extracted • sections with fewer than three records may not be extracted • some extracted sections may contain static contents with repeating patterns • some extracted MRs may mistakenly take consecutive sections with the same format as records, and some large records may be incorrectly extracted as sections.
DSE: Dynamic Section Extraction Step 1: Identify candidate section boundary markers (CSBM) • Use a pair of result pages at a time • CSBMs are usually static or semi-dynamic content lines that appear in both result pages and have compatible tag paths Step 2: Identify dynamic sections (DS) based on the CSBMs • Each (candidate) DS has a left boundary marker (LBM) and a right boundary marker (RBM), which are CSBMs and are not part of the DS • Note: some DSs may be incorrect due to incorrect CSBMs
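The two DSE steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Line` type, the string encoding of tag paths, and the reading of "compatible tag paths" as exact equality are all assumptions made for the example, and semi-dynamic markers whose text changes between queries are not handled.

```python
from typing import NamedTuple


class Line(NamedTuple):
    tag_path: str  # hypothetical encoding, e.g. "body/ul/li"
    text: str


def find_csbms(page_a: list[Line], page_b: list[Line]) -> list[int]:
    """Step 1: indices in page_a of lines that also occur in page_b.

    Assumption: "compatible tag paths" is simplified to equal tag paths.
    """
    seen_in_b = set(page_b)
    return [i for i, line in enumerate(page_a) if line in seen_in_b]


def candidate_sections(page: list[Line], csbm_idx: list[int]):
    """Step 2: content strictly between consecutive CSBMs.

    Returns (lbm_index, rbm_index, lines); the markers are not part of the DS.
    """
    sections = []
    for left, right in zip(csbm_idx, csbm_idx[1:]):
        if right - left > 1:  # skip adjacent markers with nothing between them
            sections.append((left, right, page[left + 1:right]))
    return sections


page_a = [
    Line("body/h2", "Web results"),
    Line("body/ul/li", "apple pie"),
    Line("body/ul/li", "apple tart"),
    Line("body/h2", "News results"),
    Line("body/ol/li", "apple news"),
    Line("body/div", "Copyright"),
]
page_b = [
    Line("body/h2", "Web results"),
    Line("body/ul/li", "banana bread"),
    Line("body/h2", "News results"),
    Line("body/ol/li", "banana news"),
    Line("body/div", "Copyright"),
]

csbms = find_csbms(page_a, page_b)          # [0, 3, 5]
sections = candidate_sections(page_a, csbms)
```

Using a pair of pages generated by different queries is what makes the query-independent lines stand out; an incorrect CSBM here would yield an incorrect candidate DS, which is why a later refinement step is needed.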
MRs and DSs Refining • Idea: Use MRs and DSs to refine each other to • identify and discard static sections • correct the boundaries of some MRs and DSs • Note: To deal with the non-uniform section format problem, neither of the two algorithms, MRE and DSE, assumes there is a common format/pattern among different sections when performing section extraction
Mining Records from DSs Goal: Identify records from dynamic sections that do not match any MRs such as those with fewer than three records. Method: Consider dynamic section DS • Identify repeating tags within the tag forest for DS as candidate separators • Use each candidate separator to partition DS into records and select the partition with the highest section cohesion.
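A rough sketch of the separator-based mining step is shown below. It flattens the tag forest into a list of `(tag, text)` nodes, which is an assumption made for brevity, and it takes the cohesion measure as a caller-supplied `score` function; the toy score in the usage example is only a stand-in and is not the cohesion measure from the talk.

```python
from collections import Counter
from typing import Callable

Node = tuple[str, str]        # (tag, text): hypothetical flat stand-in for a tag forest
Partition = list[list[Node]]


def candidate_separators(section: list[Node]) -> list[str]:
    """Tags that repeat within the section are candidate separators."""
    counts = Counter(tag for tag, _ in section)
    return [tag for tag, n in counts.items() if n >= 2]


def partition_by(section: list[Node], sep: str) -> Partition:
    """Start a new record at every occurrence of the separator tag."""
    records: Partition = []
    for node in section:
        if node[0] == sep or not records:
            records.append([])
        records[-1].append(node)
    return records


def best_partition(section: list[Node], score: Callable[[Partition], float]) -> Partition:
    """Try each candidate separator and keep the highest-scoring partition."""
    candidates = [partition_by(section, sep) for sep in candidate_separators(section)]
    return max(candidates, key=score, default=[section])


section = [
    ("a", "Title 1"), ("br", ""), ("font", "snippet 1"),
    ("a", "Title 2"), ("br", ""), ("font", "snippet 2"),
]


def toy_score(partition: Partition) -> float:
    # Stand-in only: fewer distinct record shapes means a more regular partition.
    shapes = {tuple(tag for tag, _ in record) for record in partition}
    return -len(shapes)


records = best_partition(section, toy_score)  # two records, each starting at "a"
```

Because the method needs only a repeated tag, it can still find records in a section with two records, which the three-record requirement of MRE would miss.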
Mining Records from DSs Observations about section cohesion: records in a section tend to be similar to each other, while the lines within a record tend to be dissimilar to each other. The cohesion of a section S with records r1, r2, …, rk: cohesion(S) = (average distance of the lines within each record) / (average distance among the records)
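To make the cohesion ratio concrete (average within-record line distance divided by average between-record distance), here is a small sketch. The slides do not define the distance functions, so the ones below are placeholders: each line is reduced to a hypothetical format signature, line distance is 0/1, and record distance is a Jaccard distance. Pooling all line pairs across records and the small `eps` guard against division by zero are also assumptions.

```python
from itertools import combinations
from statistics import mean

Record = list[str]  # each line reduced to a hypothetical format signature


def line_distance(a: str, b: str) -> float:
    """Placeholder distance: identical signatures are at distance 0, others 1."""
    return 0.0 if a == b else 1.0


def record_distance(r1: Record, r2: Record) -> float:
    """Placeholder distance: Jaccard distance between sets of line signatures."""
    s1, s2 = set(r1), set(r2)
    return 1.0 - len(s1 & s2) / len(s1 | s2)


def cohesion(records: list[Record], eps: float = 1e-9) -> float:
    """Within-record line distance over between-record distance."""
    intra = [line_distance(a, b) for r in records for a, b in combinations(r, 2)]
    inter = [record_distance(a, b) for a, b in combinations(records, 2)]
    avg_intra = mean(intra) if intra else 0.0
    avg_inter = mean(inter) if inter else 0.0
    return avg_intra / (avg_inter + eps)


# Records aligned with the true record boundaries: similar records, dissimilar lines.
good = [["title", "snippet", "url"], ["title", "snippet", "url"]]
# The same lines cut at the wrong places.
bad = [["title"], ["snippet", "url", "title"], ["snippet", "url"]]

assert cohesion(good) > cohesion(bad)
```

The ratio grows when records look alike (small denominator) and when the lines inside a record differ from one another (large numerator), which matches the two observations above.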
Background: Search Result Record (SRR) [Figure: a partition with high cohesion versus a partition with low cohesion]
Solving Section-Record Granularity Problem Two subproblems: • Oversized record problem: Some consecutive sections are recognized as records or multiple small records are recognized as a single large record • Splitting record problem: Large records are recognized as sections or large records are split into smaller records
Solving Oversized Record Problem • Use record mining technique to try to find smaller records from a candidate oversized record R. • If no smaller records can be found, R is not an oversized record • If smaller records can be found, R is recognized as an oversized record • If small records can be found and they are similar to the records mined from another (adjacent) candidate oversized record R1, then R and R1 are recognized as consecutive sections.
Solving Splitting Record Problem • Let R be an MR with records (r1, …, rk), which is a partition of R. • We generate new partitions by merging these records in different ways and calculate the cohesion of each partition. • The partition with the highest cohesion will be selected and larger records may be yielded as a result. • If there exists a set of consecutive MRs that are siblings under the same sub-tree of the DOM tree, and all MRs in the set consist of only one record, then we form a new section with each original section in the set as a record and remove the original sections.
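The merge-and-rescore idea can be sketched as below. The slide does not say which merges are tried, so this example assumes one simple family (merging every g consecutive records, for g = 1..k); the `toy_score` function is a hypothetical stand-in for section cohesion, used only to keep the example self-contained.

```python
from typing import Callable, Iterator

Record = list[str]  # each line reduced to a hypothetical format signature


def merged_partitions(records: list[Record]) -> Iterator[list[Record]]:
    """Yield partitions formed by merging every g consecutive records, g = 1..k.

    Assumption: only uniform group sizes are tried, not every possible grouping.
    """
    k = len(records)
    for g in range(1, k + 1):
        yield [sum(records[i:i + g], []) for i in range(0, k, g)]


def best_merge(records: list[Record], score: Callable[[list[Record]], float]) -> list[Record]:
    """Keep the candidate partition with the highest score (cohesion in the talk)."""
    return max(merged_partitions(records), key=score)


def toy_score(partition: list[Record]) -> float:
    # Stand-in only: reward partitions of two or more identical-looking records,
    # weighted by how many distinct line kinds each record contains.
    if len(partition) < 2 or any(r != partition[0] for r in partition):
        return 0.0
    return float(len(set(partition[0])))


# A section whose records were split too finely: every line became a "record".
split = [["title"], ["snippet"], ["title"], ["snippet"]]
repaired = best_merge(split, toy_score)
# repaired == [["title", "snippet"], ["title", "snippet"]]
```

With g = 1 the original partition is always among the candidates, so a correctly extracted MR is left unchanged as long as it already scores highest.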
Certifying DSs Based on Multiple Result Pages • Multiple result pages are used • If an MR on one result page matches with an MR on at least another result page, both MRs are certified as the section instances of the same section schema. • More than two result pages can be used to generate section instance groups for different section schemas. • A matching score is computed between two MRs from two pages based on their tag path similarity, SBM similarity and tag forest similarity.
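One way to picture the matching score is as a weighted combination of the three similarities. The slides do not give the formula, so everything specific below is an assumption: the `MR` fields, the use of `difflib.SequenceMatcher` for each similarity, the equal weights, and the certification threshold.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MR:
    tag_path: list[str]     # path from the page root to the section
    sbm_texts: list[str]    # texts of the section boundary markers
    forest_tags: list[str]  # flattened tags of the section's tag forest


def _sim(a: list[str], b: list[str]) -> float:
    """Sequence similarity in [0, 1] (an assumed stand-in for each measure)."""
    return SequenceMatcher(None, a, b).ratio()


def matching_score(x: MR, y: MR, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of tag path, SBM, and tag forest similarity (weights assumed)."""
    parts = (
        _sim(x.tag_path, y.tag_path),
        _sim(x.sbm_texts, y.sbm_texts),
        _sim(x.forest_tags, y.forest_tags),
    )
    return sum(w * p for w, p in zip(weights, parts))


def certified(x: MR, y: MR, threshold: float = 0.8) -> bool:
    """Treat two MRs as instances of the same schema above an assumed threshold."""
    return matching_score(x, y) >= threshold


web_p1 = MR(["html", "body", "table"], ["Web results"], ["tr", "td", "a"] * 2)
web_p2 = MR(["html", "body", "table"], ["Web results"], ["tr", "td", "a"])
ads_p2 = MR(["html", "body", "div"], ["Sponsored links"], ["p", "p"])

assert matching_score(web_p1, web_p2) > matching_score(web_p1, ads_p2)
```

A static section would also match itself across pages under this score, so in the pipeline described in the talk certification comes after static sections have been discarded by the MR/DS refinement.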
Wrapper Generation • Section wrapper format: <pref, seps, LBMs, RBMs> • pref is the tag path that leads to the minimum sub-tree t that contains all records in this section • seps is the separator set used to partition the sub-forest of t into records • LBMs and RBMs are the sets of left and right boundary markers of the section • Page wrapper: a sequence of section wrappers
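The wrapper tuple maps naturally onto a small data structure. The sketch below is illustrative only: it works on a flat list of `(tag_path, tag, text)` lines rather than a DOM tree, matches boundary markers by exact text, and treats `pref` as a string prefix, all of which are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionWrapper:
    pref: str             # tag path to the minimum sub-tree holding all records
    seps: frozenset[str]  # separator tags that start a new record
    lbms: frozenset[str]  # left boundary marker texts
    rbms: frozenset[str]  # right boundary marker texts


PageWrapper = list[SectionWrapper]  # a page wrapper is a sequence of section wrappers


def apply_wrapper(w: SectionWrapper, lines: list[tuple[str, str, str]]) -> list[list[str]]:
    """Extract records between an LBM and the next RBM, split at separator tags."""
    records: list[list[str]] = []
    inside = False
    for path, tag, text in lines:
        if not inside:
            inside = text in w.lbms   # the marker itself is not part of the section
            continue
        if text in w.rbms:
            break
        if not path.startswith(w.pref):
            continue                  # outside the sub-tree named by pref
        if tag in w.seps or not records:
            records.append([])
        records[-1].append(text)
    return records


page = [
    ("body/h2", "h2", "Web results"),
    ("body/ul/li", "a", "Apple pie"),
    ("body/ul/li", "span", "A classic recipe"),
    ("body/ul/li", "a", "Apple tart"),
    ("body/ul/li", "span", "French style"),
    ("body/h2", "h2", "News results"),
    ("body/ol/li", "a", "Apple news"),
]
web = SectionWrapper(
    pref="body/ul",
    seps=frozenset({"a"}),
    lbms=frozenset({"Web results"}),
    rbms=frozenset({"News results"}),
)
print(apply_wrapper(web, page))
# [['Apple pie', 'A classic recipe'], ['Apple tart', 'French style']]
```

Keeping LBMs and RBMs as sets reflects the slide's wording that a section may have several possible markers on each side.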
Solving Hidden Section Extraction Problem • For sections with zero or only one instance on the sample result pages, no wrapper will be generated. • Use a section family to solve this problem: a section family represents a class of section schemas that share some common features. • Basic idea: rely on the expectation that the schema of the hidden section is similar to that of an existing section.
Solving Hidden Section Extraction Problem An example of a section family: All member section schemas have the same pref and seps, and their LBMs (RBMs) share the same line text attribute.
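The example family above suggests grouping section schemas by their shared features. The sketch below does exactly that and nothing more; the `Marker` and `Schema` types, the string form of the line text attribute, and the rule that a family needs at least two members are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    text: str
    attr: str  # hypothetical line text attribute, e.g. "bold/14pt"


@dataclass(frozen=True)
class Schema:
    name: str
    pref: str
    seps: frozenset
    lbm: Marker
    rbm: Marker


def build_families(schemas: list[Schema]) -> dict[tuple, list[str]]:
    """Group schemas that share pref, seps, and the LBM/RBM text attributes."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for s in schemas:
        groups[(s.pref, s.seps, s.lbm.attr, s.rbm.attr)].append(s.name)
    # Assumption: a family is only meaningful with two or more member schemas.
    return {key: names for key, names in groups.items() if len(names) >= 2}


def fits_family(key: tuple, pref: str, lbm_attr: str) -> bool:
    """A hidden section is a candidate member if its pref and LBM attribute agree."""
    return key[0] == pref and key[2] == lbm_attr


seps = frozenset({"a"})
schemas = [
    Schema("web", "body/ul", seps, Marker("Web results", "bold/14pt"), Marker("News results", "bold/14pt")),
    Schema("news", "body/ul", seps, Marker("News results", "bold/14pt"), Marker("Image results", "bold/14pt")),
    Schema("ads", "body/div", frozenset({"p"}), Marker("Sponsored", "italic/10pt"), Marker("Footer", "plain/9pt")),
]
families = build_families(schemas)
```

In this toy data the "web" and "news" schemas form one family, so an unseen section introduced by a bold 14pt header under the same `pref` could be extracted with the family's separators even though its marker text never appeared in the sample pages.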
Experimental Results • Dataset • 100 search engines from the ViNTs dataset, 19 with multiple DSs • 19 additional search engines that produce multiple DSs • Total 38 search engines produce multiple DSs • Collect 10 result pages for each search engine, 5 are used for wrapper generation and 5 are used to test the wrappers • Performance measures: Recall and Precision • Perfect • Partially correct (> 60% records are extracted)
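The performance measures can be sketched as follows. The recall and precision formulas are the standard ones; the grading rule reads "perfect" as an exact match with the true record set and "partially correct" as more than 60% of the true records extracted, which is an interpretation of the slide rather than a definition taken from it. The numbers in the example are made up for illustration and are not results from the experiments.

```python
def precision_recall(num_correct: int, num_extracted: int, num_actual: int) -> tuple[float, float]:
    """Standard precision and recall from counts."""
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_actual if num_actual else 0.0
    return precision, recall


def grade_section(extracted: set, truth: set) -> str:
    """Grade one section (interpretation: perfect = exact match, partial = > 60%)."""
    if extracted == truth:
        return "perfect"
    if truth and len(extracted & truth) / len(truth) > 0.6:
        return "partially correct"
    return "incorrect"


# Illustrative values only.
print(precision_recall(num_correct=8, num_extracted=10, num_actual=16))  # (0.8, 0.5)
print(grade_section({1, 2, 3, 4}, {1, 2, 3, 4, 5}))                      # partially correct
```

Splitting the 10 collected pages into 5 for wrapper generation and 5 for testing means these measures are computed on pages the wrappers have not seen.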
Experimental Results Section extraction results on all 119 search engines: [table not recovered; column headings: Perfect, Total]
Experimental Results Section extraction results on the 38 search engines whose result pages have multiple dynamic sections: [table not recovered; column headings: Perfect, Total]
Experimental Results Record extraction results on all extracted sections:
Related Work • Many existing works address record extraction from web pages: RoadRunner, EXALG, IEPAD, DeLa, Omni, MDR, ViPER … • Only MDR (Liu, Grossman, Zhai, SIGKDD 2003) can output multiple sections, but • it does not differentiate dynamic sections from static contents • it does not address the non-uniform format problem or the section-record granularity problem • the hidden section extraction problem does not occur for MDR because it does not generate a wrapper, which can lead to other problems such as lower efficiency
Conclusions and Future Work Conclusions: • Studied the automatic section extraction problem • Identified several interesting issues: non-uniform format problem, section-record granularity problem and hidden section extraction problem • Provided solutions to the new problems Future work: • Still room to improve: increase the accuracy of identifying boundary markers of dynamic sections • Section classification • ……