Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules
Chun-Nan Hsu, Arizona State University
Introduction
• Based on the problem of wrapper generation (which extracts from structured text only)
• Attempts to generate wrappers for semistructured data as well, handling:
  • Missing attributes
  • Multiple attribute values
  • Variant attribute permutations
  • Exceptions and typos (in the extracted items themselves, or in contextual items?)
• Fairly high-level overview
Extraction Method
• HTML tokenizer, with token classes such as:
  • All-caps strings
  • Strings beginning with a capital letter
  • Lowercase strings
  • HTML tags
  • etc.
• Finite-state transducer (an FSA with output instructions)
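The token classes above can be sketched with a small regex-based tokenizer. The class names and exact patterns here are illustrative assumptions, not the paper's definitions:

```python
import re

# Illustrative token classes for the tokenizer described above; the names
# and patterns are assumptions, not the paper's exact definitions.
TOKEN_SPEC = [
    ("HTML_TAG", r"<[^>]+>"),        # HTML tags, e.g. <LI>, </A>
    ("ALL_CAPS", r"[A-Z]{2,}"),      # all-caps strings
    ("CAPITAL",  r"[A-Z][a-z]*"),    # strings beginning with a capital letter
    ("LOWER",    r"[a-z]+"),         # lowercase strings
    ("NUMBER",   r"\d+"),
    ("PUNCT",    r"[^\sA-Za-z0-9]"), # punctuation and other symbols
]
TOKENIZER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split an HTML fragment into (token_class, lexeme) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKENIZER.finditer(text)]

tokenize("<LI> Mani Chandy")
# -> [("HTML_TAG", "<LI>"), ("CAPITAL", "Mani"), ("CAPITAL", "Chandy")]
```

Because the alternation is tried left to right, whole tags win over their all-caps contents, and all-caps strings win over single capitalized letters.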
Contextual Rules
• Describe "separators" between fields
• May or may not be physical characters (e.g. "CA90210" splits into "CA" and "90210")
• Appear before and after each extraction field (the head h and tail t, respectively)
• Each is composed of a left context (hL or tL) and a right context (hR or tR)
• Used to describe transitions between states in the FST
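A separator rule can be pictured as a pair of token lists, a left context and a right context, that must both match around a field boundary. The encoding and the example tokens below are hypothetical:

```python
# Hypothetical encoding of a contextual (separator) rule: the separator
# between two fields is described by the tokens to its left and right.

def matches(rule_left, rule_right, tokens, pos):
    """True if the field boundary before tokens[pos] satisfies the rule."""
    if pos < len(rule_left) or pos + len(rule_right) > len(tokens):
        return False
    return (tokens[pos - len(rule_left):pos] == rule_left
            and tokens[pos:pos + len(rule_right)] == rule_right)

# Boundary between the URL field (inside the <A> tag) and the Name field:
tokens = ['<A HREF="…">', "Mani", "Chandy", "</A>"]
left, right = ['<A HREF="…">'], ["Mani"]
matches(left, right, tokens, 1)   # fires at position 1, nowhere else
```

In the paper's terms, `left` plays the role of a right-hand head context (hR-style) and `right` the left-hand one; here both are folded into one rule for brevity.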
Finite-State Transducer
Example application: faculty web pages

  <LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I> …
  <LI> Fred Thompson, <I>Professor Emeritus of Applied Philosophy and Computer Science</I>

[FST diagram not reproduced] Key:
• ? : wildcard
• b : dummy begin state
• U : state to extract URL
• _U : state to skip over tokens until we reach N
• N : state to extract Name
• _N : state to skip over tokens until we reach A
• s<X,Y> : separator rule for the separator of states X and Y
• etc.
Transition labels in the diagram include: s<b,U> / "U=" + next_token, s<U,N> / "N=" + next_token, s<b,N> / "N=" + next_token, s<U,U> / ε, s<N,N> / ε, ? / next_token, ? / ε
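The transition/output pairs in the diagram can be sketched as a hand-coded transducer for the Name field alone. State names follow the key; the separator tests here are simplified stand-ins, not the paper's learned contextual rules:

```python
# Hand-coded sketch of the transducer idea for the Name field only.
# b = begin state, N = extract Name, _N = skip remaining tokens; the
# capitalization checks stand in for the learned separator rules.

def run_fst(tokens):
    state, output = "b", []
    for tok in tokens:
        if state == "b" and tok[:1].isupper():     # s<b,N> / "N=" + next_token
            state = "N"
            output.append("N=" + tok)
        elif state == "N" and tok[:1].isupper():   # ? / next_token (still in Name)
            output.append(tok)
        elif state == "N":                         # separator reached
            state = "_N"                           # ? / epsilon: skip the rest
    return output

run_fst(["<LI>", "Fred", "Thompson", ",", "<I>Professor…</I>"])
# -> ["N=Fred", "Thompson"]
```

The real FST has one extract/skip state pair per attribute and chooses transitions by matching separator rules rather than a hard-coded test.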
Induction Algorithm
• Calculate the permutations of extraction fields (e.g. U, _U, N, _N, M) and add transitions to the graph if necessary
• Not every permutation has to appear in the training data! (Somewhat incorrect info in [eikvil99])
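The first step can be sketched as follows: collect each field ordering observed in a training tuple and add the corresponding edges, with dummy begin/end states, to the transition graph. The state names b and e are illustrative:

```python
# Sketch: build the FST transition graph from the field orderings
# observed in training tuples (fields may appear in different orders
# on different pages). "b" and "e" are assumed dummy begin/end states.

def build_transitions(observed_permutations):
    edges = set()
    for perm in observed_permutations:
        prev = "b"
        for field in perm:
            edges.add((prev, field))   # transition entering this field
            prev = field
        edges.add((prev, "e"))         # transition leaving the last field
    return edges

build_transitions([["U", "N"], ["N", "U"]])
# -> {("b","U"), ("U","N"), ("N","e"), ("b","N"), ("N","U"), ("U","e")}
```

Note the graph only contains edges actually seen in training, which is why unseen permutations may still need transitions added later.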
Induction Algorithm (continued)
For each extraction field in the training set:
• Generate the left- and right-separators, and add them to the corresponding contextual rule lists
• Align tokens into columns
  • Heuristic: align word tokens together, and non-word tokens starting closest to the separator boundary
• Attempt generalization with other rules
  • Replace related tokens with their least common ancestor in the taxonomy tree
  • Generalize whitespace tokens
• Remove any duplicate rules
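The least-common-ancestor generalization step can be sketched over a small token-class taxonomy. The class names and tree shape below are assumptions for illustration, not the paper's exact taxonomy:

```python
# Illustrative token-class taxonomy (child -> parent); the names are
# assumed, not the paper's. Two aligned tokens generalize to their
# least common ancestor in this tree.
TAXONOMY_PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "0Alph": "Alph",
    "Alph": "Token", "Html": "Token", "Punc": "Token",
}

def ancestors(cls):
    """The class itself followed by its chain of ancestors up to the root."""
    chain = [cls]
    while cls in TAXONOMY_PARENT:
        cls = TAXONOMY_PARENT[cls]
        chain.append(cls)
    return chain

def generalize(a, b):
    """Least common ancestor of two token classes."""
    up_a = ancestors(a)
    for c in ancestors(b):
        if c in up_a:
            return c
    return "Token"

generalize("CAlph", "0Alph")   # -> "Alph"  (both are alphabetic strings)
generalize("CAlph", "Html")    # -> "Token" (only the root is shared)
```

Repeated generalization makes a rule match more token sequences, which is what lets a wrapper trained on a few tuples cover unseen ones.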
Comparison with DEG Method
• Similarities:
  • Both use FSAs, which are equivalent in power to regular expressions
• Differences:
  • Focuses on the separators between extraction fields, whereas DEG focuses on patterns of the field itself
  • Designed to generate wrappers (for a specific website) rather than general-purpose extraction rules
Results
Notes:
• Training tuples used = the number of tuples labeled by the user needed to cover the total tuples in the training pages
• Recall after 10 pages: 60/69 = 87%
• Precision = …100%?
• What does "Total unseen tuples covered" mean…?
• No comparison with other algorithms
Conclusions
• SoftMealy uses an FST to construct wrappers for structured and semistructured text.
• The FST structure is based on contextual rules that describe what separates each extraction field.
• The rules are learned from training documents, marked up interactively by the user.
• This could be an interesting approach, but a more complete analysis is needed.