
Query Rewriting for Extracting Data Behind HTML Forms

This presentation describes query rewriting for extracting data behind HTML forms. An automated agent analyzes the fields of a site form, matches them against the fields of a user query, fills in and submits the form, and post-processes the retrieved pages to answer the original query.


Presentation Transcript


  1. Query Rewriting for Extracting Data Behind HTML Forms • Xueqi Chen • Department of Computer Science, Brigham Young University • August 2002 • Funded by the National Science Foundation

  2. Motivation • Web information is stored in databases • Databases are accessed through forms • Forms are designed in various ways

  3. Motivation • Web information is stored in databases • Databases are accessed through forms • Forms are designed in various ways • Automated agents are of great value

  4. System Flowchart • [Figure: components are User Query, Application Ontology, Site Form, Input Analyzer, Retrieved Page(s), Output Analyzer, and Extracted Information]

  5. User Query Acquisition • Our system provides a form created based on application-specific ontology

  6. Site Form Analysis • Understand type, name, and/or values for each field
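
The site form analysis step on slide 6 can be illustrated with a minimal sketch that collects the type, name, and candidate values of each field. It uses Python's standard `html.parser` module; the class name `FormFieldParser`, the helper `analyze_form`, and the sample form are assumptions made for illustration, not the system described in the slides.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the type, name, and candidate values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            self.fields.append({
                "type": attrs.get("type", "text"),
                "name": attrs.get("name"),
                "values": [attrs["value"]] if "value" in attrs else [],
            })
        elif tag == "select":
            self._select = {"type": "select", "name": attrs.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            # Each <option> contributes one allowed value to the open <select>.
            self._select["values"].append(attrs.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return the list of field descriptions found in an HTML fragment."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical site form used only to exercise the sketch.
sample = """
<form action="/search">
  <input type="text" name="zip">
  <select name="make">
    <option value="any">Any</option>
    <option value="ford">Ford</option>
  </select>
</form>
"""
fields = analyze_form(sample)
```

Here `fields` describes a free-text `zip` field with no enumerated values and a `make` selection list whose values are `any` and `ford`. A fuller analyzer would also need to handle `textarea`, radio groups, and option labels.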

  7. Form Filling • Name matched? • Case 0 – 5 • Field matched? • Case 1, 2 • Value matched? • Case 3, 4, 5

  8. Form Filling: Case 0 • Fields specified in user query are the same as in a site form.

  9. Form Filling: Case 1 • Fields specified in a user query are not contained in a site form, but are in the returned information.

  10. Form Filling: Case 2 • Fields specified in a user query are not contained in a site form, and are not in the returned information.

  11. Form Filling: Case 3 • Fields required by a site form are not provided in user query, but a general default value, such as “All” or “Any”, is provided by the site form.

  12. Form Filling: Case 4 • Fields that appear in a site form are not provided in a user query, and the default value provided by the site form is specific, not “All” or “Any”.

  13. Form Filling: Case 5 • Values specified in a user query do not match the values provided in a site form.
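
One way to read the six form-filling cases is as a classification of each field against the site form and the returned information. The sketch below follows that reading. The function names, the dictionary layout for form fields, and the sample data are all hypothetical. Two choices are assumptions that the slides do not spell out: a field with no enumerated values counts as Case 0, and a field with no default counts as Case 4.

```python
GENERAL_DEFAULTS = {"all", "any"}


def classify_query_field(name, value, form_fields, returned_fields):
    """Return the case (0, 1, 2, or 5) for a field the user specified."""
    if name in form_fields:
        allowed = form_fields[name]["values"]
        # Assumption: a free-text field (no enumerated values) accepts any value.
        if not allowed or value in allowed:
            return 0
        return 5  # the field exists, but the user's value is not offered
    # The field is absent from the form: can it be checked after retrieval?
    return 1 if name in returned_fields else 2


def classify_unfilled_form_field(default):
    """Return the case (3 or 4) for a form field the user query omits."""
    # Assumption: a missing default is treated like a specific one.
    if default is not None and default.strip().lower() in GENERAL_DEFAULTS:
        return 3
    return 4


def plan_form_filling(query, form_fields, returned_fields):
    """Map every query field and every unfilled form field to its case."""
    plan = {
        name: classify_query_field(name, value, form_fields, returned_fields)
        for name, value in query.items()
    }
    for name, spec in form_fields.items():
        if name not in query:
            plan[name] = classify_unfilled_form_field(spec.get("default"))
    return plan


# Hypothetical car-search form, returned columns, and user query.
form_fields = {
    "zip": {"values": [], "default": None},
    "make": {"values": ["Any", "Ford", "Honda"], "default": "Any"},
    "year": {"values": ["2001", "2002"], "default": "2002"},
}
returned_fields = {"color", "price"}
query = {"zip": "12345", "make": "Toyota", "color": "red", "doors": "4"}

plan = plan_form_filling(query, form_fields, returned_fields)
```

In this example `zip` falls under Case 0, `make` under Case 5, `color` under Case 1, `doors` under Case 2, and the unfilled `year` field under Case 4. If the query omitted `make`, its default “Any” would put it under Case 3.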

  14. Post-processing • Valid pages? Error pages? Pages with error messages [Yau01]

  15. Post-processing • Valid pages? Error pages? Pages with error messages [Yau01] • Concatenates the results [Yau01] • Recognizes the boundary of each record [EJN99] • Identifies the formats of the retrieved pages

  16. Post-processing (cont’) • Removes duplicates [Yau01] • Extracts key information [Deg02, ETL02] • Places the results in a database [Deg02] • Executes the original user query and displays the results.
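
The last post-processing steps (concatenating results, removing duplicates, placing the results in a database, and executing the original user query) can be sketched as follows. The sketch assumes record boundaries have already been recognized and key information already extracted, so each page is a list of dictionaries. The `car` table, its columns, the `post_process` function, and the sample records are hypothetical. It uses Python's standard `sqlite3` module, not the tools cited on the slides.

```python
import sqlite3


def post_process(pages, query):
    """Merge pages of records, drop duplicates, and apply the user query."""
    seen, records = set(), []
    for page in pages:                      # concatenate the results
        for record in page:
            key = (record["make"], record["color"], record["price"])
            if key not in seen:             # remove exact duplicates
                seen.add(key)
                records.append(key)

    db = sqlite3.connect(":memory:")        # place the results in a database
    db.execute("CREATE TABLE car (make TEXT, color TEXT, price INTEGER)")
    db.executemany("INSERT INTO car VALUES (?, ?, ?)", records)

    # Execute the original user query. The color constraint stands in for a
    # field the site form could not express but the returned data contains.
    rows = db.execute(
        "SELECT make, color, price FROM car WHERE color = ? ORDER BY price",
        (query["color"],),
    ).fetchall()
    db.close()
    return rows


# Hypothetical records from two retrieved pages, with one duplicate.
pages = [
    [{"make": "Ford", "color": "red", "price": 5000},
     {"make": "Honda", "color": "blue", "price": 7000}],
    [{"make": "Ford", "color": "red", "price": 5000},
     {"make": "Toyota", "color": "red", "price": 4000}],
]
rows = post_process(pages, {"color": "red"})
```

Running the original query against the stored records, instead of filtering page by page, is what allows a Case 1 constraint to be honored after retrieval.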

  17. Measurements • Field-matching Efficiency • Submission Efficiency • Post-processing Efficiency

  18. Measurements (cont’) • Field-matching Efficiency

  19. Measurements (cont’) • Field-matching Efficiency • Submission Efficiency

  20. Measurements (cont’) • Field-matching Efficiency • Submission Efficiency • Post-processing Efficiency

  21. Contributions • It enhances the effectiveness of the data-extraction process • It presents another technique, in addition to [RGa01], to access data behind HTML forms.
