Implementing Automatic Value Extraction from Structured Web Pages. Varun Ganapathi, Jonathan Pines, Josh Wiseman.
Problem
• Context:
  • Many web pages are generated by applying a template to structured data.
• Goal:
  • Given a set of pages generated from a template, infer the template.
  • Extract values from previously unseen pages generated from the template.
• Why?
  • The template encodes structure that usually has semantic meaning.
  • The structured values that back a page are all the important information in the page.
What is a Template?
• A template is a special case of a context-free grammar with two constructs:
  • Tuples (fixed-length ordered lists)
  • Sets (arbitrary-length lists delimited by separators)
• Example of an instantiated template:
    <elem>Ethan Hunt comes face to face with a dangerous and … </elem>
    <elem>6.8</elem>
    <set>
      <tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>
      <tuple><elem>Ving Rhames</elem><elem>Luther Stickell</elem></tuple>
    </set>
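To make the tuple and set constructs concrete, here is a minimal Python sketch that models an instantiated template as a small tree and serializes it in the <elem>/<tuple>/<set> notation used above. The class names `Tup` and `SetOf` and the function `render` are hypothetical, introduced only for illustration; they are not taken from the authors' system.

```python
from dataclasses import dataclass
from typing import List, Union
from xml.sax.saxutils import escape


# Hypothetical node types for an instantiated template (names are assumptions).
@dataclass
class Tup:
    items: List["Node"]  # fixed-length, ordered list of values


@dataclass
class SetOf:
    items: List["Node"]  # arbitrary-length list of values


Node = Union[str, Tup, SetOf]


def render(node: Node) -> str:
    """Serialize a value tree into the <elem>/<tuple>/<set> notation."""
    if isinstance(node, str):
        # Leaf value: escape markup characters so the output stays well-formed.
        return f"<elem>{escape(node)}</elem>"
    tag = "tuple" if isinstance(node, Tup) else "set"
    inner = "".join(render(child) for child in node.items)
    return f"<{tag}>{inner}</{tag}>"


# A rating followed by a one-entry cast list of (actor, character) tuples.
rating = "6.8"
cast = SetOf([Tup(["Tom Cruise", "Ethan Hunt"])])
print(render(rating))
print(render(cast))
```

A set may hold any number of tuples, while each tuple keeps the same fixed arity (here, actor and character), which is the distinction the two constructs are meant to capture.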
Learning Templates
• Use the following observations:
  • When tokens frequently occur together, it may be because they are derived from the same template.
  • The strings derived from templates have certain properties:
    • Ordered
    • Nested
    • Loop
• Find equivalence classes of differentiated tokens.
• Extend the partial template.
• Differentiate tokens based on the partial template.
• Construct the template using patterns.
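The equivalence-class step listed above can be sketched in a few lines. One way to make "tokens that occur frequently together" concrete is to group tokens whose per-page occurrence counts are identical across all sample pages. That grouping rule, the whitespace tokenizer, and the function name `equivalence_classes` are assumptions of this sketch, not details given in the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def equivalence_classes(pages: List[str]) -> Dict[Tuple[int, ...], List[str]]:
    """Group tokens by occurrence vector: (count in page 1, count in page 2, ...)."""
    # Assumption: tokens are whitespace-separated; a real system would
    # tokenize HTML tags and text properly.
    counts = [Counter(page.split()) for page in pages]
    vocab = set().union(*counts)
    classes: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for token in sorted(vocab):
        vector = tuple(c[token] for c in counts)
        classes[vector].append(token)
    return dict(classes)


# Two toy pages that share markup but differ in the rating value.
pages = [
    "<b> Rating: </b> 6.8",
    "<b> Rating: </b> 7.4",
]
for vector, tokens in equivalence_classes(pages).items():
    print(vector, tokens)
```

In this toy input the markup and label tokens share the vector (1, 1), which makes them candidates for template text, while 6.8 and 7.4 each fall into their own class and look like data values. Differentiating tokens by context, as the later steps describe, would refine these classes further; that refinement is not shown here.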
Evaluation
• We manually extracted “interesting” data from several IMDB movie pages.
    <elem>Ethan Hunt comes face to face with a dangerous and … </elem>
    <elem>6.8</elem>
    <set>
      <tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>
      <tuple><elem>Ving Rhames</elem><elem>Luther Stickell</elem></tuple>
    </set>
• Some attributes: title, writers, directors, plot summary, rating, actors, languages, trivia, …
• Each attribute was scored as one of:
  • Correct: our system extracted exactly the right data.
  • Partially correct: our system extracted the right data plus a bit too much.
  • Incorrect: our system missed some data.
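The three categories above can be applied mechanically if the hand-extracted and system-extracted values for an attribute are treated as sets. The sketch below is one possible reading of the categories, not the authors' scoring code: `score_attribute` is a hypothetical name, and the rule that any missing value counts as incorrect, even when extra values are also present, is an assumption.

```python
from typing import Iterable


def score_attribute(gold: Iterable[str], extracted: Iterable[str]) -> str:
    """Compare system output against hand-extracted (gold) values for one attribute."""
    gold_set, got = set(gold), set(extracted)
    if got == gold_set:
        return "correct"            # exactly the right data
    if gold_set <= got:
        return "partially correct"  # all the right data, plus extra
    return "incorrect"              # at least one gold value was missed


print(score_attribute(["Tom Cruise"], ["Tom Cruise"]))
print(score_attribute(["Tom Cruise"], ["Tom Cruise", "more"]))
print(score_attribute(["Tom Cruise", "Ving Rhames"], ["Tom Cruise"]))
```

Treating values as sets ignores ordering and duplicates, which is a simplification; an evaluation over free-text attributes such as a plot summary would likely need a fuzzier comparison.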
Results
• Attributes:
  • 5 correct
  • 5 partially correct
  • 6 incorrect
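For a sense of proportion, the counts above can be expressed as shares of all attributes evaluated. The snippet only restates the reported numbers; the variable names are illustrative.

```python
# Reported counts per category, taken directly from the results above.
results = {"correct": 5, "partially correct": 5, "incorrect": 6}

total = sum(results.values())
shares = {label: n / total for label, n in results.items()}

for label, share in shares.items():
    print(f"{label}: {results[label]}/{total} = {share:.2%}")
# correct: 5/16 = 31.25%
# partially correct: 5/16 = 31.25%
# incorrect: 6/16 = 37.50%
```

Counting partially correct attributes as found, 10 of the 16 attributes (62.5%) had all of their data extracted, with or without extra material.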