Query Rewriting for Extracting Data Behind HTML Forms

Xueqi Chen, Department of Computer Science, Brigham Young University. August 2002. Funded by the National Science Foundation.

Motivation • Web information is stored in databases • Databases are accessed through forms • Forms are designed in various ways • Automated agents are of great value

System Flowchart • [Figure: components are User Query, Application Ontology, Site Form, Input Analyzer, Retrieved Page(s), Output Analyzer, and Extracted Information]

User Query Acquisition • Our system provides a form created based on application-specific ontology

Site Form Analysis • Understand type, name, and/or values for each field
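The field analysis above can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the parser used in the presented system; the class name `FormFieldParser`, the helper `analyze_form`, and the sample form are all hypothetical.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = {}      # field name -> {"type": ..., "values": [...]}
        self._select = None   # name of the <select> currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name"):
            field = self.fields.setdefault(
                a["name"], {"type": a.get("type", "text"), "values": []})
            if a.get("value") is not None:
                field["values"].append(a["value"])
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields[self._select] = {"type": "select", "values": []}
        elif tag == "option" and self._select is not None:
            self.fields[self._select]["values"].append(a.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Returns a dict describing every named field found in the HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical site form, for illustration only.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="zip">
  <select name="make">
    <option value="">All</option>
    <option value="ford">Ford</option>
  </select>
  <input type="radio" name="cond" value="new">
  <input type="radio" name="cond" value="used">
</form>
"""

fields = analyze_form(SAMPLE_FORM)
```

Radio buttons that share a name are merged into one field whose values list holds every option, so a later form-filling step can check a user value against that list.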

Form Filling • Name matched? • Case 0 – 5 • Field matched? • Case 1, 2 • Value matched? • Case 3, 4, 5

Form Filling: Case 0 • Fields specified in a user query are the same as those in a site form.

Form Filling: Case 1 • Fields specified in a user query are not contained in a site form, but are in the returned information.

Form Filling: Case 2 • Fields specified in a user query are not contained in a site form, and are not in the returned information (e.g., color).

Form Filling: Case 3 • Fields required by a site form are not provided in a user query, but a general default value, such as “All” or “Any”, is provided by the site form.

Form Filling: Case 4 • Fields that appear in a site form are not provided in a user query, and the default value provided by the site form is specific, not “All” or “Any”.

Form Filling: Case 5 • Values specified in a user query do not match the values provided in a site form.
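The six cases can be told apart with a simple per-field check. The sketch below is one possible reading of the case definitions, not the system's actual matching logic. It assumes field names have already been matched between query and form, and the function `classify_fields`, the dictionary layout, and the sample data are all hypothetical.

```python
GENERAL_DEFAULTS = {"all", "any"}  # assumed spellings of a "general" default


def classify_fields(query, form, returned_fields):
    """Assigns a case number (0 to 5) to every query field and form field.

    query:           dict of field name -> value requested by the user
    form:            dict of field name -> {"values": [...], "default": ...}
    returned_fields: set of field names present in the returned records
    """
    cases = {}
    for name, value in query.items():
        if name not in form:
            # Cases 1 and 2: the site form has no such field.
            cases[name] = 1 if name in returned_fields else 2
        else:
            options = form[name].get("values", [])
            # Case 5: the form lists values and the user's value is not one.
            cases[name] = 5 if options and value not in options else 0
    for name, spec in form.items():
        if name in query:
            continue
        # Cases 3 and 4: the form has a field the user did not fill in.
        default = str(spec.get("default", "")).strip().lower()
        cases[name] = 3 if default in GENERAL_DEFAULTS else 4
    return cases


# Hypothetical car-search example.
query = {"make": "ford", "zip": "84601", "color": "red",
         "price": "5000", "year": "1985"}
form = {
    "make": {"values": ["ford", "honda"], "default": "All"},
    "zip": {"values": []},
    "year": {"values": ["1999", "2000"]},
    "body": {"default": "Any"},
    "radius": {"default": "25"},
}
cases = classify_fields(query, form, returned_fields={"price"})
```

Here `make` and `zip` fall under Case 0, `price` under Case 1, `color` under Case 2, `body` under Case 3, `radius` under Case 4, and `year` under Case 5.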

Post-processing • Valid pages? Error pages? Pages with error messages [Yau01] • Concatenates the results [Yau01] • Recognizes the boundary of each record [EJN99] • Identifies the formats of the retrieved pages

Post-processing (cont’) • Removes duplicates [Yau01] • Extracts key information [Deg02, ETL02] • Places the results in a database [Deg02] • Executes the original user query and displays the results.
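The concatenation, duplicate-removal, and final query steps can be sketched as below. This is a simplified illustration, not the cited techniques of [Yau01] or [Deg02]: it assumes records have already been extracted into dictionaries, treats two records as duplicates when their normalized field values are identical, and uses an in-memory list in place of a database. The names `normalize` and `post_process` and the sample data are hypothetical.

```python
def normalize(record):
    """Builds a hashable key that ignores case and extra whitespace."""
    return tuple(sorted(
        (field, " ".join(str(value).lower().split()))
        for field, value in record.items()))


def post_process(pages, query):
    """Concatenates records from all pages, drops duplicates, and keeps
    only the records that satisfy the original user query."""
    seen = set()
    results = []
    for page in pages:                 # concatenate the results page by page
        for record in page:
            key = normalize(record)
            if key in seen:            # skip a record already seen
                continue
            seen.add(key)
            if all(str(record.get(f, "")).lower() == str(v).lower()
                   for f, v in query.items()):
                results.append(record)
    return results


# Hypothetical records extracted from two retrieved pages.
pages = [
    [{"make": "Ford", "price": "5000"}, {"make": "Honda", "price": "4000"}],
    [{"make": "ford ", "price": "5000"}, {"make": "Ford", "price": "6000"}],
]
results = post_process(pages, {"make": "ford"})
```

Re-running the original query at the end matters for fields the site form could not constrain (Cases 1 and 5 above), because the submitted form may return more records than the user asked for.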

Measurements • Field-matching Efficiency • Submission Efficiency • Post-processing Efficiency

Contributions • It enhances the effectiveness of the data-extraction process • It presents another technique, in addition to [RGa01], to access data behind HTML forms.
