Extracting Structured Data from Web Pages
By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ
05.11.2003
Instructor: Prof. Taflan Gündem
Presentation Outline
• Motivation
• Example Pages
• Model & Problem Formulation
• Approach in Detail
  • General
  • Underlying Terminology
  • Modules and their operations
• Experimental Results
• Conclusion
Motivation
• There are many web sites that contain large collections of “structured” pages.
• Extracting the structured data from these pages is useful, since it enables us to pose complex queries over the data.
• This paper focuses on the problem of automatically extracting structured data from a collection of pages.
Example Pages
• In the real world there are many examples of structured web pages: the Amazon and eBay web sites, etc.
• Two example pages from www.amazon.com:
  • My System
  • An Eternal Golden Braid
Underlying Problems
• Complex Schema: The “schema” of the information encoded in the web pages can be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on.
• Template vs. Data: Syntactically, there is nothing that distinguishes the text that is part of the template from the text that is part of the data.
How is a page created with a template?
[Figure: a value x extracted from the database is encoded with the template to produce the page.]
Basic Type, Tuples and Sets
• Basic Type: b, the basic unit of text
• Tuple: an ordered list of types, <T1, T2, …, Tn>
• Set: {T1}
• Example: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
Schema and Instance
• Instance: < C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
• Corresponding schema: S = <b, {<b, b>}, b>
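To make the type constructors concrete, the following minimal Python sketch (illustrative only; the representation choices and the describe helper are not from the paper) encodes the book instance and recovers its schema string:

```python
# A minimal sketch of the type constructors in Python.
# Basic types are plain strings, tuples are Python tuples (ordered),
# and sets are represented as frozensets of instances.

# Instance of the book example, with schema <b, {<b, b>}, b>
# (title, set of (first-name, last-name) author tuples, price).
book_instance = (
    "C Programming Language",                       # basic type b: title
    frozenset({("Brian", "Kernighan"),              # set of tuples <b, b>: authors
               ("Dennis", "Ritchie")}),
    "$30.00",                                       # basic type b: price
)

def describe(value):
    """Recursively print the type structure of an instance."""
    if isinstance(value, str):
        return "b"                                   # basic type
    if isinstance(value, frozenset):
        # all members of a set share one type; describe a representative
        return "{" + describe(next(iter(value))) + "}"
    if isinstance(value, tuple):
        return "<" + ", ".join(describe(v) for v in value) + ">"
    raise TypeError(value)

print(describe(book_instance))   # -> <b, {<b, b>}, b>
```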
Template Definition
• Own example:
  • Schema: S = <b, {b}, b>
  • Template: TS = <A * B {*}E C * D>
    A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’
  • Encoding of an instance with TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
Encoding λ(T1, x1)
[Figure: the page is obtained as the encoding λ(T1, x1) of the value x1 with the template T1.]
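As a rough illustration of the encoding λ(T, x), the sketch below (illustrative code, with the set-separator handling heavily simplified) interleaves the template strings of the earlier example with a data instance to produce the page text:

```python
# Illustrative sketch of the encoding lambda(T, x) for the example
# template TS = <A * B {*}E C * D>.  None marks a data slot ('*'); the set
# slot joins its elements with the separator E ('and').

TEMPLATE = ["Title:", None, "Presented by:", None, "Cost:", None, ""]
SEPARATOR = "and"

def encode(template, instance, separator):
    """Interleave template strings with the values of the instance."""
    values = iter(instance)
    parts = []
    for piece in template:
        if piece is None:                      # a data slot ('*')
            value = next(values)
            if isinstance(value, (set, frozenset, list)):
                # a set slot: encode each element, joined by the separator
                piece = f" {separator} ".join(value)
            else:
                piece = value
        parts.append(piece)
    return " ".join(p for p in parts if p)

page = encode(TEMPLATE,
              ("Extracting Structured Data", ["Arsun", "Özgün"], "1hr"),
              SEPARATOR)
print(page)
# Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
```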
Multiple Pages
[Figure: several input pages generated by the same template; the data includes a set of reviewers.]
Some Terminology (1)
• The occurrence-vector of a token t is defined as the vector <f1, f2, …, fn>, where fi is the number of occurrences of t in the ith page.
• An equivalence class is a maximal set of tokens having the same occurrence-vector.
• A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
Some Terminology (2)
[Figure: example tokens with occurrence-vectors <1,1,1,1> and <1,2,1,0>; one token has no unique role.]
Some Terminology (3)
• For real pages, an equivalence class of large size and support is usually valid, where the support of a token is defined as the number of pages in which the token occurs.
• Example of an invalid equivalence class: {Data, Mining, Jeff, 2, Jane, 6} has occurrence-vector <0, 1, 0, 0>.
Some Terminology (4)
• Equivalence classes with large size and support are called LFEQs (for Large and Frequent EQuivalence classes). LFEQs are rarely formed by “chance”.
• The thresholds for size and support are set by the user (SizeThres, SupThres).
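The terminology from the last few slides can be illustrated with a short sketch (an illustrative simplification: tokenization is reduced to whitespace splitting and the threshold values are made up):

```python
from collections import Counter, defaultdict

def occurrence_vectors(pages):
    """Map each token to its occurrence-vector <f1, ..., fn>."""
    tokens = set(tok for page in pages for tok in page.split())
    counts = [Counter(page.split()) for page in pages]
    return {t: tuple(c[t] for c in counts) for t in tokens}

def equivalence_classes(pages):
    """Group tokens with identical occurrence-vectors."""
    classes = defaultdict(set)
    for token, vector in occurrence_vectors(pages).items():
        classes[vector].add(token)
    return classes

def lfeqs(pages, size_thres=3, sup_thres=2):
    """Keep only the Large and Frequent EQuivalence classes."""
    return {
        vector: tokens
        for vector, tokens in equivalence_classes(pages).items()
        if len(tokens) >= size_thres                      # size: number of tokens
        and sum(1 for f in vector if f > 0) >= sup_thres  # support: pages containing them
    }

pages = [
    "Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr",
    "Title: Another Talk Presented by: Jane Cost: 2hr",
]
for vector, tokens in lfeqs(pages).items():
    print(vector, sorted(tokens))
# (1, 1) ['Cost:', 'Presented', 'Title:', 'by:']  -> the template tokens
```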
Some Terminology (5)
• Valid equivalence class properties: ordering and nesting.
• Back to our own example:
  • Template: TS = <A * B {*}E C * D>
    A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’
  • Ordered: A > B > C > D
  • Nesting: B > E > C
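A rough check of the ordering property (illustrative only; the paper's definitions of ordered and nested are more precise, and this check assumes each token occurs at most once per page) could look like this:

```python
def is_ordered(pages, eq_class_tokens):
    """True if the equivalence-class tokens appear in the same relative
    order in every page that contains them (the 'ordered' property)."""
    orders = set()
    for page in pages:
        seen = [tok for tok in page.split() if tok in eq_class_tokens]
        if seen:
            orders.add(tuple(seen))
    return len(orders) <= 1

pages = [
    "Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr",
    "Title: Another Talk Presented by: Jane Cost: 2hr",
]
print(is_ordered(pages, {"Title:", "Presented", "by:", "Cost:"}))  # True
```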
Important Observations
• In practice, two page-tokens with different occurrence-paths have different roles (occurrence-paths are obtained with an HTML parser).
• Two page-tokens having the same occurrence-path but different neighbours also have different roles.
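The occurrence-path of a page-token (the path of HTML tags enclosing it) can be computed with a standard HTML parser; below is a minimal sketch using Python's built-in html.parser, as an illustration rather than EXALG's actual implementation:

```python
from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    """Record, for every text token, its occurrence-path of enclosing tags."""
    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.tokens = []         # (token, occurrence-path) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching tag (tolerates sloppy HTML)
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for token in data.split():
            self.tokens.append((token, path))

parser = PathTokenizer()
parser.feed("<html><body><b>Title:</b> My System <i>Reviews</i></body></html>")
for token, path in parser.tokens:
    print(token, "->", path)
# Title: -> html/body/b
# My -> html/body
# System -> html/body
# Reviews -> html/body/i
```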
Constructing Template (1)
• The extraction algorithm determines which positions between consecutive tokens of an equivalence class are non-empty.
• A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty otherwise.
Constructing Template (2)
• The tokens connected by empty positions belong to the template.
• In the non-empty positions there are either basic types (strings extracted from the database) or a more complex type.
• This unknown type can be determined by inspecting the input pages.
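A simplified sketch of this step (illustrative code; the tokenization and the ordered equivalence class are hard-coded) marks each position between consecutive equivalence-class tokens as empty or non-empty:

```python
def classify_positions(pages, ordered_eq_tokens):
    """For each position between consecutive equivalence-class tokens,
    decide whether it is empty (the tokens always occur contiguously,
    i.e. template text) or non-empty (data to be extracted)."""
    non_empty = [False] * (len(ordered_eq_tokens) - 1)
    for page in pages:
        tokens = page.split()
        positions = [tokens.index(t) for t in ordered_eq_tokens]
        for i, (left, right) in enumerate(zip(positions, positions[1:])):
            if right - left > 1:          # something sits between the two tokens
                non_empty[i] = True
    return non_empty

pages = [
    "Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr",
    "Title: Another Talk Presented by: Jane Cost: 2hr",
]
print(classify_positions(pages, ["Title:", "Presented", "by:", "Cost:"]))
# [True, False, True] -> data follows 'Title:' and 'by:', while
# 'Presented' and 'by:' always occur contiguously (template text).
```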
Experimental Results (1)
• This project is basically compared with RoadRunner; however, RoadRunner makes simplifying assumptions.
• The first six web pages are obtained from the RoadRunner site.
• The last three web pages have a more complex structure.
Concluding Remarks
• EXALG first discovers the unknown template that generated the pages, and then uses the discovered template to extract the data from the input pages.
• Besides producing very good results, EXALG does not completely fail to extract data even when some of its assumptions are not met by the input collection.
• No human intervention: the template and the data are obtained automatically.
Future Work
• Automatically locate collections of pages that are structured.
• Check whether it is feasible to generate a large database from these pages.