Wrapper Construction Charis Ermopoulos, Qian Yang, Yong Yang, Hengzhi Zhong
Background • HUGE amount of information on the web • It cannot be easily accessed and manipulated • Information is intended to be browsed by humans, not computers • Information Extraction is difficult • A wrapper is a procedure for extracting a particular resource’s content • Hand-coding wrappers is tedious, and the result is difficult to maintain
Virtual Integration Architecture • User queries are posed against a mediated schema • Mediator: reformulation engine, optimizer, data source catalog, execution engine • A wrapper sits between the mediator and each data source • Sources can be relational, hierarchical (IMS), structured files, or web sites
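As an illustration of the wrapper's role in this architecture, here is a minimal Python sketch with hypothetical class names not taken from the papers: every source is hidden behind the same interface, and each (here hand-coded) wrapper translates that source's native format into tuples for the execution engine.

```python
from abc import ABC, abstractmethod
import re

# Hypothetical interface: what the mediator's execution engine sees.
class Wrapper(ABC):
    @abstractmethod
    def extract(self, page: str) -> list[tuple]:
        """Map one source page to tuples of the mediated schema."""

# A hand-coded wrapper for one specific HTML layout; tedious to write
# and brittle if the site changes, which motivates wrapper construction.
class HandCodedCountryWrapper(Wrapper):
    def extract(self, page: str) -> list[tuple]:
        return re.findall(r"<B>(.*?)</B>\s*<I>(.*?)</I>", page, re.S)

page = "<B>Congo</B> <I>242</I><BR><B>Spain</B> <I>34</I>"
print(HandCodedCountryWrapper().extract(page))   # [('Congo', '242'), ('Spain', '34')]
```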
Wrapper Construction • Two major approaches • machine learning: typically requires some hand-labeled data • data-intensive, completely automatic
Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment
RoadRunner • NO user interaction during the generation process • NO a priori knowledge about page organization • Compares 2 HTML pages at a time and uses mismatches to identify structure
RoadRunner • Basic idea: • Input: a set of data-intensive, regularly structured HTML pages • Output: a wrapper, generated efficiently and automatically by inferring a Union-Free Regular Expression (UFRE) grammar for the HTML code
Input HTML Page • Data are stored in a DBMS • HTML pages are produced by server-side scripts (PHP, Perl, etc.) • Pages generated by the same script have a similar structure
Input HTML Page www.csbooks.com/author?Paul+Jones
Input HTML Page www.csbooks.com/author?John+Smith
Input HTML Page • 2 input html pages: • One is used as the wrapper, w • The other is used as the sample, s. • Generalize w by matching it with s
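To make this setup concrete, here is a small hedged sketch (illustrative page contents, not taken from the paper) that turns both pages into the token sequences of HTML tags and text strings that the matching algorithm actually compares.

```python
import re

# Sketch, not RoadRunner's code: split an HTML page into its token
# sequence, distinguishing tags from text strings.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(page: str) -> list[str]:
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]

wrapper_page = ("<html><b>Books of: John Smith</b>"
                "<ul><li>DB Primer</li></ul></html>")
sample_page = ("<html><b>Books of: Paul Jones</b>"
               "<ul><li>XML at Work</li><li>HTML Scripts</li></ul></html>")
print(tokenize(wrapper_page))
print(tokenize(sample_page))
```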
Union-Free Regular Expressions • Built over an alphabet of symbols (the tokens of the pages) • Operators: • ? matches 0 or 1 occurrence • * zero or more occurrences • + one or more occurrences • Disjunction (a|b) is not supported: the expressions are union-free
Union-Free Regular Expressions • Close correspondence between nested types and UFREs • Straightforward mapping from UFREs to nested types: • #PCDATA → string fields • + → lists • ? → nullable (optional) fields
Union-Free Regular Expressions • (A, B, C, …)+ → non-empty list of tuples (A: string, B: string, C: string, …) • (A, B, C, …)* → possibly empty list
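One way to make the notation concrete is to write an inferred UFRE for the book pages above as an ordinary regular expression over the HTML text, a sketch under the assumption that (.+?) stands in for #PCDATA and (?:...)+ for an iterator; this is illustrative, not the paper's internal representation.

```python
import re

# Sketch: a UFRE for the author pages, written as a plain regex.
# (.+?) plays the role of #PCDATA, (?: ... )+ the role of a + iterator.
ufre = re.compile(r"<html><b>Books of: (.+?)</b><ul>(?:<li>.+?</li>)+</ul></html>")

page = ("<html><b>Books of: Paul Jones</b>"
        "<ul><li>XML at Work</li><li>HTML Scripts</li></ul></html>")
assert ufre.fullmatch(page)                 # the page complies with the grammar
print(ufre.match(page).group(1))            # string field -> 'Paul Jones'
print(re.findall(r"<li>(.+?)</li>", page))  # tuples of the + iterator
```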
Drawbacks • Limited expressive power • Non-regular languages: e.g. a nested list of folders, (folders)n, is non-regular and cannot be described by a UFRE
Drawbacks • Limited expressive power • Regular languages that require unions, e.g. the reviews on amazon.com • RoadRunner is not able to factorize such a list, i.e. it cannot discover the repeated pattern
Extraction Process • Find the minimal UFRE • Iteratively compute least upper bounds on the RE lattice to generate a wrapper for the input HTML pages • The least upper bound of 2 UFREs is computed by the matching algorithm
The Matching Technique • The matching algorithm consists of parsing the sample using the wrapper • A mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper
Two types of Mismatches • String mismatches: mismatches that happen when different strings occur in corresponding positions of the wrapper and sample • Tag mismatches: mismatches between different tags on the wrapper and the sample, or between one tag and one string.
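The following simplified sketch (my illustration, not the paper's algorithm, and without the generalize-and-resume step or backtracking) walks the two token lists in parallel and classifies each mismatch into the two categories above.

```python
# Simplified sketch: the real algorithm resolves each mismatch by
# generalizing the wrapper and resuming the parse; here we only classify.
def classify_mismatches(wrapper_tokens, sample_tokens):
    found = []
    for w, s in zip(wrapper_tokens, sample_tokens):
        if w == s:
            continue
        if not w.startswith("<") and not s.startswith("<"):
            found.append(("string mismatch", w, s))   # -> discover a #PCDATA field
        else:
            found.append(("tag mismatch", w, s))      # -> try iterator, then optional
    return found

w = ["<html>", "<b>", "Books of: John Smith", "</b>", "<ul>",
     "<li>", "DB Primer", "</li>", "</ul>", "</html>"]
s = ["<html>", "<b>", "Books of: Paul Jones", "</b>", "<ul>",
     "<li>", "XML at Work", "</li>", "<li>", "HTML Scripts", "</li>", "</ul>", "</html>"]
print(classify_mismatches(w, s))
```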
String Mismatches • If the two pages belong to the same class, string mismatches can only be due to different values of a database field • These mismatches are used to discover fields: the mismatching token is generalized to #PCDATA in the wrapper
Tag mismatches • Tag mismatches are used to discover iterators (list of items) and optionals (items appearing conditionally). • First look for repeated patterns (i.e., patterns under an iterator), and then, if this attempt fails, try to identify an optional pattern.
Tag Mismatches: Optionals • Wrapper generalization: once the optional pattern has been identified, the wrapper is generalized accordingly and parsing resumes • In this case, the wrapper is generalized by introducing a pattern of the form ( <IMG src=.../> )?
Tag Mismatches: Iterators • Assume the mismatch is caused by repeated elements in a list • Match the candidate square against earlier squares • Generalize the wrapper by finding all contiguous repeated occurrences, e.g. (<li><i>Title:</i>#PCDATA</li>)+ (see the sketch below)
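Here is a hedged sketch of the "find all contiguous repeated occurrences" step, assuming the candidate square has already been delimited; the token lists and the #PCDATA wildcard handling are my illustration, not the paper's code.

```python
# Does a token match a square element? (#PCDATA matches any text token.)
def matches(token, pattern):
    return (pattern == "#PCDATA" and not token.startswith("<")) or token == pattern

# Count contiguous copies of `square` in `tokens` starting at `start`;
# the matched region is then generalized to (square)+ in the wrapper.
def contiguous_repeats(tokens, start, square):
    count, i = 0, start
    while (i + len(square) <= len(tokens)
           and all(matches(t, p) for t, p in zip(tokens[i:i + len(square)], square))):
        count += 1
        i += len(square)
    return count

tokens = ["<ul>", "<li>", "<i>", "Title:", "</i>", "XML at Work", "</li>",
          "<li>", "<i>", "Title:", "</i>", "HTML Scripts", "</li>", "</ul>"]
square = ["<li>", "<i>", "Title:", "</i>", "#PCDATA", "</li>"]
print(contiguous_repeats(tokens, 1, square))   # 2 -> (<li><i>Title:</i>#PCDATA</li>)+
```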
Complexity • Exponential time complexity with respect to the input length (number of tokens in the pages) • Lowering the complexity: • bounds on the fan-out of OR nodes • limited backtracking • delimiters
Limitations • Assumptions: pages are well structured, and their structure can be modeled by a union-free regular expression • The search space for explaining mismatches is huge • Pruning the search space also prunes possible wrappers
Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment
Wrapper Induction for Information Extraction Nicholas Kushmerick’s Dissertation
Wrapper Induction • Wrapper induction: automatic construction of wrappers • Automatic programming in general is very, very difficult, but particular classes of programs are feasible • Technique: inductive learning • Attributes are identified by delimiters • Input: a set of examples {…, <Pn, Ln>, …} • Output: a wrapper W from the wrapper class 𝒲 such that W(Pn) = Ln for every <Pn, Ln>
A Simple Wrapper: LR Wrapper • Uses a left- and a right-hand delimiter per attribute, e.g. l1 = <B>, r1 = </B>, l2 = <I>, r2 = </I> • Delimiter candidates: prefixes and suffixes of the text surrounding the attribute instances • Candidate validity: 4 constraints (e.g. a delimiter may not be a substring of any attribute value) • Search policy: for each delimiter, start with the shortest candidate and stop at the first valid one • Candidates for l1 = {'</I><BR><B>', '/I><BR><B>', 'I><BR><B>', …, 'B>', '>'} • Example page: <HTML><BODY> <B>Congo</B> <I>242</I><BR> <B>Spain</B> <I>34</I> </BODY></HTML> • Label: {(Congo, 242), (Spain, 34)} (execution sketched below)
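A hedged sketch of what executing this LR wrapper looks like on the example page; the function below is my paraphrase of the execLR idea (scan for lk, take the text up to rk, repeat), not Kushmerick's code.

```python
# Sketch of LR execution: for each attribute, find its left delimiter,
# extract up to its right delimiter, and repeat until no tuple remains.
def exec_lr(page, delimiters):
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in delimiters:
            start = page.find(l, pos)
            if start == -1:
                return tuples
            start += len(l)
            end = page.find(r, start)
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))

page = ("<HTML><BODY><B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I></BODY></HTML>")
print(exec_lr(page, [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Spain', '34')]
```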
Beyond LR • HLRT: a head and a tail delimiter locate the interesting area (2K+2 delimiters) • OCLR: an open and a close delimiter mark each tuple (2K+2 delimiters) • HOCLRT: combines HLRT and OCLR (2K+4 delimiters) • Nested documents: N-LR, N-HLRT • Example of a nested document: Name: John, address: 12 Main St, Phone: 123-4567, phone: 444-5555; Name: Fred, address: 9 Maple Lane, Phone: 666-7777; Name: Jane
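For contrast with the plain LR sketch above, here is a hedged sketch of the HLRT idea: the head delimiter confines extraction to the interesting region and the tail delimiter stops it, so distracting text such as a bold page header is skipped. Again this is my simplified paraphrase (no error handling), not the dissertation's code.

```python
# Sketch of HLRT execution: skip past the head h, then extract LR-style
# tuples as long as the next l1 occurs before the tail t.
def exec_hlrt(page, h, t, delimiters):
    tuples = []
    pos = page.find(h) + len(h)
    l1 = delimiters[0][0]
    while page.find(l1, pos) != -1 and page.find(l1, pos) < page.find(t, pos):
        row = []
        for l, r in delimiters:
            start = page.find(l, pos) + len(l)
            end = page.find(r, start)
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))
    return tuples

page = ("<B>Welcome</B><P><B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I><HR><B>End</B>")
print(exec_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Spain', '34')] -- '<B>Welcome</B>' and '<B>End</B>' are ignored
```

On this page the LR sketch above would instead return [('Welcome', '242'), ('Spain', '34')], which is why the extra head and tail delimiters help.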
Expressiveness (1) • 30 resources in total, 10 samples each
Generating Examples • One simple way: ask a person • Automatic labeling: • using domain-specific heuristics • primitives: regular expressions (e.g. 1?[0-9]:[0-9][0-9]) • NLP • asking an already-wrapped resource • Why is wrapper induction still needed? • performance • induction can tolerate a high rate of labeling noise
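As a small illustration of the heuristic-labeling idea, the sketch below uses the slide's time pattern 1?[0-9]:[0-9][0-9] as a domain-specific recognizer to label attribute instances automatically; the page content is invented for the example.

```python
import re

# Oracle-free labeling sketch: a domain-specific recognizer marks where the
# 'time' attribute occurs, producing a labeled example without a human.
time_re = re.compile(r"1?[0-9]:[0-9][0-9]")

page = "<B>CS511</B> meets <I>9:30</I> and <I>11:00</I> on Tuesdays"
label = [(m.start(), m.end(), m.group()) for m in time_re.finditer(page)]
print(label)   # offsets and text of each recognized time instance
```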
How many examples are enough? • For each wrapper class: how many examples (N) are needed to ensure, with high probability (p1), that the learned wrapper makes a mistake only rarely (p2)? • K = 4 attributes per tuple • Shortest example page has length R = 10,000 • p1 > 0.95 and p2 < 0.05 if N > 1534
Summary • Advantages • Fast to learn and extract • Drawbacks • Cannot handle permutations • Cannot handle missing items
Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment
Extracting data records from the web • Wrapper induction: a set of manually labeled positive and negative examples; supervised learning of data extraction rules • Automatic extraction: pattern discovery based on heuristic rules (repeating tags, ontology matching) • Partial tree alignment: automatic extraction, no assumption about contiguous data records; two steps: (1) identify data records in a page, (2) align and extract data items from the data records
Identifying data records • Build an HTML tag tree using (1) the nested structure of HTML tags and (2) visual information
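A minimal sketch of step (1), building a tag tree from the nesting of HTML tags alone; the visual-information cues mentioned on the slide are ignored here, and the node shape (tag, children) is my own choice for the example.

```python
from html.parser import HTMLParser

# Build a simple nested tag tree; each node is (tag, [children]).
class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # attach under the current open tag
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

builder = TagTreeBuilder()
builder.feed("<table><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>")
print(builder.root)
```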
Identifying data records • Mining data regions: compare the tag strings of sibling nodes; similar adjacent nodes are grouped into a generalized node • Identify data records from the generalized nodes
Data extraction • Two steps: (1) build a rooted tag tree for each data record, (2) partial tree alignment
Partial tree alignment • Tree operations: node removal, insertion and replacement • Tree edit distance: the cost associated with the minimum set of operations needed to transform tree A into tree B • Minimum-cost mapping between two trees, computed by dynamic programming
Partial tree alignment • Simple tree matching (STM): no node replacement and no level crossing are allowed (sketched below)
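Here is a short sketch of Simple Tree Matching, computing the maximum number of matching node pairs by dynamic programming over the ordered children; the trees reuse the (tag, children) shape from the tag-tree sketch above, and the code is my rendering of the standard STM recurrence rather than the paper's implementation.

```python
# Simple Tree Matching: maximum number of matching nodes between two
# ordered trees when no replacement and no level crossing are allowed.
def stm(a, b):
    if a[0] != b[0]:                     # different root labels never match
        return 0
    ca, cb = a[1], b[1]
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(m[i][j - 1], m[i - 1][j],
                          m[i - 1][j - 1] + stm(ca[i - 1], cb[j - 1]))
    return m[len(ca)][len(cb)] + 1       # +1 for the matched roots

t1 = ("tr", [("td", []), ("td", [])])
t2 = ("tr", [("td", [])])
print(stm(t1, t2))   # 2: the <tr> root plus one matched <td>
```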
Partial tree alignment • Progressively grow a seed tree Ts • Ts is initialized to the tree with the maximum number of data fields • A node is inserted into Ts only if its insertion location can be uniquely determined • Mismatched nodes that are not inserted into Ts are reprocessed at later stages
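The sketch below is a heavily simplified, one-level illustration of growing the seed: data records are flattened to ordered lists of field labels, and an unmatched label is inserted only when a matched left neighbour pins down its position, a stand-in for the paper's stricter uniqueness test; the labels and field names are invented.

```python
# Simplified seed growing: insert a new field into the seed only when its
# position is unambiguous; otherwise defer it to a later pass.
def grow_seed(seed, record):
    grown, pending = list(seed), []
    for k, label in enumerate(record):
        if label in grown:
            continue
        left = record[k - 1] if k > 0 else None
        if left in grown:                       # position determined by neighbour
            grown.insert(grown.index(left) + 1, label)
        else:                                   # ambiguous: reprocess later
            pending.append(label)
    return grown, pending

seed = ["title", "author", "price"]
print(grow_seed(seed, ["title", "author", "discount", "price"]))
# (['title', 'author', 'discount', 'price'], [])
```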
Experimental Results • Number of sites used: 49 • Total number of pages used: 72 (randomly collected) • Data records are extracted with high accuracy