Wrapper Construction Charis Ermopoulos, Qian Yang, Yong Yang, Hengzhi Zhong
Background • HUGE amount of information on the web • It cannot be easily accessed and manipulated • Information is intended to be browsed by humans, not computers • Information Extraction is difficult • A wrapper is a procedure for extracting a particular resource’s content • Hand-coding wrappers is tedious, and the result is difficult to maintain
Virtual Integration Architecture • User queries are posed against a mediated schema • Mediator: reformulation engine, optimizer, data source catalog, execution engine • A wrapper sits between the mediator and each data source • Sources can be relational, hierarchical (IMS), structured files, or web sites
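As an illustration of the wrapper's role in this architecture, here is a minimal Python sketch with hypothetical class names not taken from the papers: every source is hidden behind the same interface, and each (here hand-coded) wrapper translates that source's native format into tuples for the execution engine.

```python
from abc import ABC, abstractmethod
import re

# Hypothetical interface: what the mediator's execution engine sees.
class Wrapper(ABC):
    @abstractmethod
    def extract(self, page: str) -> list[tuple]:
        """Map one source page to tuples of the mediated schema."""

# A hand-coded wrapper for one specific HTML layout; tedious to write
# and brittle if the site changes, which motivates wrapper construction.
class HandCodedCountryWrapper(Wrapper):
    def extract(self, page: str) -> list[tuple]:
        return re.findall(r"<B>(.*?)</B>\s*<I>(.*?)</I>", page, re.S)

page = "<B>Congo</B> <I>242</I><BR><B>Spain</B> <I>34</I>"
print(HandCodedCountryWrapper().extract(page))   # [('Congo', '242'), ('Spain', '34')]
```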
Wrapper Construction • Two major approaches • machine learning: typically requires some hand-labeled data • data-intensive, completely automatic
Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment
RoadRunner • NO user interaction during the generation process • NO a priori knowledge about page organization • Compares 2 HTML pages at a time and uses mismatches to identify structure
RoadRunner • Basic idea: • Input: a set of data-intensive, regularly structured HTML pages • Output: a wrapper, generated efficiently and automatically by inferring a Union-Free Regular Expression (UFRE) grammar for the HTML code
Input HTML Page • Data are stored in a DBMS • HTML pages are produced by server-side scripts (PHP, Perl, etc.) • Pages generated by the same script have a similar structure
Input HTML Page www.csbooks.com/author?Paul+Jones
Input HTML Page www.csbooks.com/author?John+Smith
Input HTML Page • 2 input html pages: • One is used as the wrapper, w • The other is used as the sample, s. • Generalize w by matching it with s
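To make this setup concrete, here is a small hedged sketch (illustrative page contents, not taken from the paper) that turns both pages into the token sequences of HTML tags and text strings that the matching algorithm actually compares.

```python
import re

# Sketch, not RoadRunner's code: split an HTML page into its token
# sequence, distinguishing tags from text strings.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(page: str) -> list[str]:
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]

wrapper_page = ("<html><b>Books of: John Smith</b>"
                "<ul><li>DB Primer</li></ul></html>")
sample_page = ("<html><b>Books of: Paul Jones</b>"
               "<ul><li>XML at Work</li><li>HTML Scripts</li></ul></html>")
print(tokenize(wrapper_page))
print(tokenize(sample_page))
```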
Union-Free Regular Expressions • Built over an alphabet of symbols (the tokens of the pages) • Operators: • ? matches 0 or 1 occurrence • * zero or more occurrences • + one or more occurrences • Disjunction (a|b) is not supported: the expressions are union-free
Union-Free Regular Expressions • Close correspondence between nested types and UFREs • Straightforward mapping from UFREs to nested types: • #PCDATA → string fields • + → lists • ? → nullable (optional) fields
Union-Free Regular Expressions • (A, B, C, …)+ → non-empty list of tuples (A: string, B: string, C: string, …) • (A, B, C, …)* → possibly empty list
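One way to make the notation concrete is to write an inferred UFRE for the book pages above as an ordinary regular expression over the HTML text, a sketch under the assumption that (.+?) stands in for #PCDATA and (?:...)+ for an iterator; this is illustrative, not the paper's internal representation.

```python
import re

# Sketch: a UFRE for the author pages, written as a plain regex.
# (.+?) plays the role of #PCDATA, (?: ... )+ the role of a + iterator.
ufre = re.compile(r"<html><b>Books of: (.+?)</b><ul>(?:<li>.+?</li>)+</ul></html>")

page = ("<html><b>Books of: Paul Jones</b>"
        "<ul><li>XML at Work</li><li>HTML Scripts</li></ul></html>")
assert ufre.fullmatch(page)                 # the page complies with the grammar
print(ufre.match(page).group(1))            # string field -> 'Paul Jones'
print(re.findall(r"<li>(.+?)</li>", page))  # tuples of the + iterator
```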
Drawbacks • Limited expressive power • Non-regular languages: e.g. a nested list of folders, (folders)n, is non-regular and cannot be described by a UFRE
Drawbacks • Limited expressive power • Regular languages that require unions, e.g. the reviews on amazon.com • RoadRunner is not able to factorize such a list, i.e. it cannot discover the repeated pattern
Extraction Process • Find the minimal UFRE • Iteratively compute least upper bounds on the RE lattice to generate a wrapper for the input HTML pages • The least upper bound of 2 UFREs is computed by the matching algorithm
The Matching Technique • The matching algorithm consists of parsing the sample using the wrapper • A mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper
Two types of Mismatches • String mismatches: mismatches that happen when different strings occur in corresponding positions of the wrapper and sample • Tag mismatches: mismatches between different tags on the wrapper and the sample, or between one tag and one string.
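The following simplified sketch (my illustration, not the paper's algorithm, and without the generalize-and-resume step or backtracking) walks the two token lists in parallel and classifies each mismatch into the two categories above.

```python
# Simplified sketch: the real algorithm resolves each mismatch by
# generalizing the wrapper and resuming the parse; here we only classify.
def classify_mismatches(wrapper_tokens, sample_tokens):
    found = []
    for w, s in zip(wrapper_tokens, sample_tokens):
        if w == s:
            continue
        if not w.startswith("<") and not s.startswith("<"):
            found.append(("string mismatch", w, s))   # -> discover a #PCDATA field
        else:
            found.append(("tag mismatch", w, s))      # -> try iterator, then optional
    return found

w = ["<html>", "<b>", "Books of: John Smith", "</b>", "<ul>",
     "<li>", "DB Primer", "</li>", "</ul>", "</html>"]
s = ["<html>", "<b>", "Books of: Paul Jones", "</b>", "<ul>",
     "<li>", "XML at Work", "</li>", "<li>", "HTML Scripts", "</li>", "</ul>", "</html>"]
print(classify_mismatches(w, s))
```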
String Mismatches • If the two pages belong to the same class, string mismatches can only be due to different values of a database field • These mismatches are used to discover fields: the mismatching token is generalized to #PCDATA in the wrapper
Tag mismatches • Tag mismatches are used to discover iterators (list of items) and optionals (items appearing conditionally). • First look for repeated patterns (i.e., patterns under an iterator), and then, if this attempt fails, try to identify an optional pattern.
Tag Mismatches: Optionals • Wrapper generalization: once the optional pattern has been identified, the wrapper is generalized accordingly and parsing resumes • In this case, the wrapper is generalized by introducing a pattern of the form ( <IMG src=.../> )?
Tag Mismatches: Iterators • Assume the mismatch is caused by repeated elements in a list • Match the candidate square against earlier squares • Generalize the wrapper by finding all contiguous repeated occurrences, e.g. (<li><i>Title:</i>#PCDATA</li>)+ (see the sketch below)
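Here is a hedged sketch of the "find all contiguous repeated occurrences" step, assuming the candidate square has already been delimited; the token lists and the #PCDATA wildcard handling are my illustration, not the paper's code.

```python
# Does a token match a square element? (#PCDATA matches any text token.)
def matches(token, pattern):
    return (pattern == "#PCDATA" and not token.startswith("<")) or token == pattern

# Count contiguous copies of `square` in `tokens` starting at `start`;
# the matched region is then generalized to (square)+ in the wrapper.
def contiguous_repeats(tokens, start, square):
    count, i = 0, start
    while (i + len(square) <= len(tokens)
           and all(matches(t, p) for t, p in zip(tokens[i:i + len(square)], square))):
        count += 1
        i += len(square)
    return count

tokens = ["<ul>", "<li>", "<i>", "Title:", "</i>", "XML at Work", "</li>",
          "<li>", "<i>", "Title:", "</i>", "HTML Scripts", "</li>", "</ul>"]
square = ["<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>"]
print(contiguous_repeats(tokens, 1, square))   # 2 -> (<li><i>Title:</i>#PCDATA</li>)+
```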
Complexity • Exponential time complexity with respect to the input length (number of tokens in the pages) • Lowering the complexity: • bounds on the fan-out of OR nodes • limited backtracking • delimiters
Limitations • Assumptions: pages are well structured, and their structure can be modeled by a union-free regular expression • The search space for explaining mismatches is huge • Pruning the search space also prunes possible wrappers
Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment
Wrapper Induction for Information Extraction Nicholas Kushmerick’s Dissertation
Wrapper Induction • Wrapper induction: automatic construction of wrappers • Automatic programming in general is very, very difficult, but particular classes of programs are feasible • Technique: inductive learning • Attributes are identified by delimiters • Input: a set of examples {…, <Pn, Ln>, …} • Output: a wrapper W from the wrapper class 𝒲 such that W(Pn) = Ln for every <Pn, Ln>
A Simple Wrapper: LR Wrapper • Uses a left- and a right-hand delimiter per attribute, e.g. l1 = <B>, r1 = </B>, l2 = <I>, r2 = </I> • Delimiter candidates: prefixes and suffixes of the text surrounding the attribute instances • Candidate validity: 4 constraints (e.g. a delimiter may not be a substring of any attribute value) • Search policy: for each delimiter, start with the shortest candidate and stop at the first valid one • Candidates for l1 = {'</I><BR><B>', '/I><BR><B>', 'I><BR><B>', …, 'B>', '>'} • Example page: <HTML><BODY> <B>Congo</B> <I>242</I><BR> <B>Spain</B> <I>34</I> </BODY></HTML> • Label: {(Congo, 242), (Spain, 34)} (execution sketched below)
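A hedged sketch of what executing this LR wrapper looks like on the example page; the function below is my paraphrase of the execLR idea (scan for lk, take the text up to rk, repeat), not Kushmerick's code.

```python
# Sketch of LR execution: for each attribute, find its left delimiter,
# extract up to its right delimiter, and repeat until no tuple remains.
def exec_lr(page, delimiters):
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in delimiters:
            start = page.find(l, pos)
            if start == -1:
                return tuples
            start += len(l)
            end = page.find(r, start)
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))

page = ("<HTML><BODY><B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I></BODY></HTML>")
print(exec_lr(page, [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Spain', '34')]
```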
Beyond LR • HLRT: a head and a tail delimiter locate the interesting area (2K+2 delimiters) • OCLR: an open and a close delimiter mark each tuple (2K+2 delimiters) • HOCLRT: combines HLRT and OCLR (2K+4 delimiters) • Nested documents: N-LR, N-HLRT • Example of a nested document: Name: John, address: 12 Main St, Phone: 123-4567, phone: 444-5555; Name: Fred, address: 9 Maple Lane, Phone: 666-7777; Name: Jane
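For contrast with the plain LR sketch above, here is a hedged sketch of the HLRT idea: the head delimiter confines extraction to the interesting region and the tail delimiter stops it, so distracting text such as a bold page header is skipped. Again this is my simplified paraphrase (no error handling), not the dissertation's code.

```python
# Sketch of HLRT execution: skip past the head h, then extract LR-style
# tuples as long as the next l1 occurs before the tail t.
def exec_hlrt(page, h, t, delimiters):
    tuples = []
    pos = page.find(h) + len(h)
    l1 = delimiters[0][0]
    while page.find(l1, pos) != -1 and page.find(l1, pos) < page.find(t, pos):
        row = []
        for l, r in delimiters:
            start = page.find(l, pos) + len(l)
            end = page.find(r, start)
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))
    return tuples

page = ("<B>Welcome</B><P><B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I><HR><B>End</B>")
print(exec_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Spain', '34')] -- '<B>Welcome</B>' and '<B>End</B>' are ignored
```

On this page the LR sketch above would instead return [('Welcome', '242'), ('Spain', '34')], which is why the extra head and tail delimiters help.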
Expressiveness (1) • 30 resources in total, 10 samples each
Generating Examples • One simple way: ask a person • Automatic labeling: • using domain-specific heuristics • primitives: regular expressions (e.g. 1?[0-9]:[0-9][0-9]) • NLP • asking an already-wrapped resource • Why is wrapper induction still needed? • performance • induction can tolerate a high rate of labeling noise
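As a small illustration of the heuristic-labeling idea, the sketch below uses the slide's time pattern 1?[0-9]:[0-9][0-9] as a domain-specific recognizer to label attribute instances automatically; the page content is invented for the example.

```python
import re

# Oracle-free labeling sketch: a domain-specific recognizer marks where the
# 'time' attribute occurs, producing a labeled example without a human.
time_re = re.compile(r"1?[0-9]:[0-9][0-9]")

page = "<B>CS511</B> meets <I>9:30</I> and <I>11:00</I> on Tuesdays"
label = [(m.start(), m.end(), m.group()) for m in time_re.finditer(page)]
print(label)   # offsets and text of each recognized time instance
```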
How many examples are enough? • For each wrapper class: how many examples (N) are needed to ensure, with high probability (p1), that the learned wrapper makes a mistake only rarely (p2)? • K = 4 attributes per tuple • Shortest example page has length R = 10,000 • p1 > 0.95 and p2 < 0.05 if N > 1534
Summary • Advantages • Fast to learn and extract • Drawbacks • Cannot handle permutations • Cannot handle missing items
Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment
Extracting data records from the web • Wrapper induction: a set of manually labeled positive and negative examples; supervised learning of data extraction rules • Automatic extraction: pattern discovery based on heuristic rules (repeating tags, ontology matching) • Partial tree alignment: automatic extraction, no assumption about contiguous data records; two steps: (1) identify data records in a page, (2) align and extract data items from the data records
Identifying data records • Build an HTML tag tree using (1) the nested structure of HTML tags and (2) visual information
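A minimal sketch of step (1), building a tag tree from the nesting of HTML tags alone; the visual-information cues mentioned on the slide are ignored here, and the node shape (tag, children) is my own choice for the example.

```python
from html.parser import HTMLParser

# Build a simple nested tag tree; each node is (tag, [children]).
class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # attach under the current open tag
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

builder = TagTreeBuilder()
builder.feed("<table><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>")
print(builder.root)
```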
Identifying data records • Mining data regions: compare the tag strings of sibling nodes; similar adjacent nodes are grouped into a generalized node • Identify data records from the generalized nodes
Data extraction • Two steps: (1) build a rooted tag tree for each data record, (2) partial tree alignment
Partial tree alignment • Tree operations: node removal, insertion and replacement • Tree edit distance: the cost associated with the minimum set of operations needed to transform tree A into tree B • Minimum-cost mapping between two trees, computed by dynamic programming
Partial tree alignment • Simple tree matching (STM): no node replacement and no level crossing are allowed (sketched below)
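Here is a short sketch of Simple Tree Matching, computing the maximum number of matching node pairs by dynamic programming over the ordered children; the trees reuse the (tag, children) shape from the tag-tree sketch above, and the code is my rendering of the standard STM recurrence rather than the paper's implementation.

```python
# Simple Tree Matching: maximum number of matching nodes between two
# ordered trees when no replacement and no level crossing are allowed.
def stm(a, b):
    if a[0] != b[0]:                     # different root labels never match
        return 0
    ca, cb = a[1], b[1]
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(m[i][j - 1], m[i - 1][j],
                          m[i - 1][j - 1] + stm(ca[i - 1], cb[j - 1]))
    return m[len(ca)][len(cb)] + 1       # +1 for the matched roots

t1 = ("tr", [("td", []), ("td", [])])
t2 = ("tr", [("td", [])])
print(stm(t1, t2))   # 2: the <tr> root plus one matched <td>
```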
Partial tree alignment • Progressively grow a seed tree Ts • Ts is initialized to the tree with the maximum number of data fields • A node is inserted into Ts only if its insertion location can be uniquely determined • Mismatched nodes that are not inserted into Ts are reprocessed at later stages
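The sketch below is a heavily simplified, one-level illustration of growing the seed: data records are flattened to ordered lists of field labels, and an unmatched label is inserted only when a matched left neighbour pins down its position, a stand-in for the paper's stricter uniqueness test; the labels and field names are invented.

```python
# Simplified seed growing: insert a new field into the seed only when its
# position is unambiguous; otherwise defer it to a later pass.
def grow_seed(seed, record):
    grown, pending = list(seed), []
    for k, label in enumerate(record):
        if label in grown:
            continue
        left = record[k - 1] if k > 0 else None
        if left in grown:                       # position determined by neighbour
            grown.insert(grown.index(left) + 1, label)
        else:                                   # ambiguous: reprocess later
            pending.append(label)
    return grown, pending

seed = ["title", "author", "price"]
print(grow_seed(seed, ["title", "author", "discount", "price"]))
# (['title', 'author', 'discount', 'price'], [])
```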
Experimental Results • Number of sites used: 49 • Total number of pages used: 72 (randomly collected) • Data records are extracted with high accuracy