This study delves into query rewriting for data extraction behind HTML forms. It explores automated agents and field matching for efficient data retrieval and processing. The system aids in understanding form elements, filling data, and post-processing for enhanced effectiveness.
Query Rewriting for Extracting Data Behind HTML Forms • Xueqi Chen • Department of Computer Science, Brigham Young University • August 2002 • Funded by the National Science Foundation
Motivation • Web information is stored in databases • Databases are accessed through forms • Forms are designed in various ways • Automated agents are of great value
System Flowchart • [Figure: flowchart with the components User Query, Application Ontology, Site Form, Input Analyzer, Retrieved Page(s), Output Analyzer, and Extracted Information]
User Query Acquisition • Our system provides a form created from an application-specific ontology
Site Form Analysis • Understands the type, name, and/or values of each field
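The field analysis above can be illustrated with a minimal sketch, assuming the site form is available as an HTML string. It uses Python's standard `html.parser`; the class `FormFieldParser`, the function `analyze_form`, and the sample form are hypothetical names and data introduced for illustration, not the system described in these slides.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = {}     # field name -> {"type": ..., "values": [...]}
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "input" and name:
            # Radio buttons and checkboxes share one name across several tags.
            field = self.fields.setdefault(
                name, {"type": attrs.get("type", "text"), "values": []}
            )
            if attrs.get("value") is not None:
                field["values"].append(attrs["value"])
        elif tag == "select" and name:
            self.fields[name] = {"type": "select", "values": []}
            self._select = name
        elif tag == "option" and self._select:
            # Only options with an explicit value attribute are recorded.
            if attrs.get("value") is not None:
                self.fields[self._select]["values"].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a dict describing every named field found in the HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical site form used only for this example.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="zip">
  <select name="make">
    <option value="">All</option>
    <option value="ford">Ford</option>
    <option value="honda">Honda</option>
  </select>
  <input type="radio" name="condition" value="new">
  <input type="radio" name="condition" value="used">
</form>
"""

fields = analyze_form(SAMPLE_FORM)
for name, info in fields.items():
    print(name, info["type"], info["values"])
```

The sketch ignores `textarea` elements, option labels, and hidden scripting, all of which a real form analyzer would likely need to handle.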
Form Filling • Name matched? • Case 0 – 5 • Field matched? • Case 1, 2 • Value matched? • Case 3, 4, 5
Form Filling: Case 0 • Fields specified in a user query are the same as in a site form.
Form Filling: Case 1 • Fields specified in a user query are not contained in a site form, but are in the returned information.
Form Filling: Case 2 • Fields specified in a user query are not contained in a site form, and are not in the returned information. (Figure label: Color?)
Form Filling: Case 3 • Fields required by a site form are not provided in a user query, but a general default value, such as “All” or “Any”, is provided by the site form.
Form Filling: Case 4 • Fields that appear in a site form are not provided in a user query, and the default value provided by the site form is specific, not “All” or “Any”.
Form Filling: Case 5 • Values specified in a user query do not match the values provided in a site form.
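The six cases above can be read as a per-field classification. The following is a minimal sketch under stated assumptions: field names are compared by exact string equality (the slides do not say how names are matched), every site-form field is described by a hypothetical dict with `values` and `default` keys, and a form field with no default at all falls outside the listed cases and is reported as `None`. All function names and sample data are illustrative.

```python
GENERAL_DEFAULTS = {"all", "any"}  # general defaults named on the Case 3 slide


def classify_query_field(name, value, form_fields, returned_fields):
    """Classify a field that the user query specifies (Case 0, 1, 2, or 5)."""
    if name not in form_fields:
        # Not on the site form: Case 1 if the results contain it, else Case 2.
        return 1 if name in returned_fields else 2
    allowed = form_fields[name]["values"]
    if allowed and value not in allowed:
        return 5  # the form lists values, and the query value is not one of them
    return 0      # field matches (free-text fields accept any value)


def classify_unfilled_form_field(field):
    """Classify a site-form field the user query leaves out (Case 3 or 4)."""
    default = field.get("default")
    if default is None:
        return None  # assumption: no default means none of the listed cases
    return 3 if default.strip().lower() in GENERAL_DEFAULTS else 4


def classify(user_query, form_fields, returned_fields):
    """Return a dict mapping each field name to its case number."""
    cases = {
        name: classify_query_field(name, value, form_fields, returned_fields)
        for name, value in user_query.items()
    }
    for name, field in form_fields.items():
        if name not in user_query:
            cases[name] = classify_unfilled_form_field(field)
    return cases


# Hypothetical car-ad example; none of these values come from the slides.
user_query = {"zip": "84601", "make": "Toyota", "color": "red", "mileage": "50000"}
form_fields = {
    "zip": {"values": [], "default": None},
    "make": {"values": ["All", "Ford", "Honda"], "default": "All"},
    "condition": {"values": ["Any", "New", "Used"], "default": "Any"},
    "radius": {"values": ["10", "25", "50"], "default": "25"},
}
returned_fields = {"make", "price", "mileage"}

cases = classify(user_query, form_fields, returned_fields)
print(cases)
```

In this example `zip` is Case 0, `mileage` Case 1, `color` Case 2, `condition` Case 3, `radius` Case 4, and `make` Case 5. How the system rewrites the query in each case is not specified on the slides, so the sketch stops at classification.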
Post-processing • Valid pages? Error pages? Pages with error messages [Yau01] • Concatenates the results [Yau01] • Recognizes the boundary of each record [EJN99] • Identifies the formats of the retrieved pages
Post-processing (cont’d) • Removes duplicates [Yau01] • Extracts key information [Deg02, ETL02] • Places the results in a database [Deg02] • Executes the original user query and displays the results.
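The last steps of post-processing (concatenating results, removing duplicates, storing them, and running the original user query) can be sketched as follows. This is an illustration under assumptions, not the cited techniques: records are assumed to be already extracted as `(make, color, price)` tuples, only exact duplicates are removed, and an in-memory SQLite table stands in for the database. The table name `car`, the column names, and the sample data are hypothetical.

```python
import sqlite3

COLUMNS = ("make", "color", "price")  # hypothetical schema for the example


def load_records(pages):
    """Concatenate per-page records, drop exact duplicates, and store them."""
    records = [record for page in pages for record in page]
    unique = list(dict.fromkeys(records))  # keeps first occurrence, in order
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE car (make TEXT, color TEXT, price INTEGER)")
    conn.executemany("INSERT INTO car VALUES (?, ?, ?)", unique)
    return conn


def run_original_query(conn, constraints):
    """Apply the user's equality constraints to the stored records."""
    for column in constraints:
        if column not in COLUMNS:
            raise ValueError(f"unknown column: {column}")
    # Column names are checked against COLUMNS; values are bound as parameters.
    where = " AND ".join(f"{column} = ?" for column in constraints) or "1 = 1"
    sql = f"SELECT make, color, price FROM car WHERE {where} ORDER BY price"
    return conn.execute(sql, tuple(constraints.values())).fetchall()


# Two hypothetical result pages with one overlapping record.
pages = [
    [("Ford", "red", 4500), ("Honda", "blue", 6200)],
    [("Honda", "blue", 6200), ("Ford", "blue", 3900)],
]

conn = load_records(pages)
rows = run_original_query(conn, {"color": "blue"})
print(rows)
conn.close()
```

Running the original query locally is what lets a constraint such as color be applied even when the site form has no such field, as long as the returned information contains it (Case 1).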
Measurements • Field-matching Efficiency • Submission Efficiency • Post-processing Efficiency
Contributions • It enhances the effectiveness of the data-extraction process • It presents another technique, in addition to [RGa01], to access data behind HTML forms.