This study explores query rewriting for extracting data behind HTML forms. The system presents the user with a query form generated from an application-specific ontology, analyzes the target site's form, and fills it in so that the data behind it can be retrieved and processed.
Query Rewriting for Extracting Data Behind HTML Forms • Xueqi Chen • Department of Computer Science, Brigham Young University • March 2003 • Funded by the National Science Foundation
Motivation • Web information is stored in databases • Databases are accessed through forms • Automated agents are of great value • Automating the process is difficult because of the nature of forms
System Flowchart • [Figure: flowchart whose labeled components are User Query, Application Ontology, Input Analyzer, Site Form, Retrieved Page(s), Output Analyzer, and Extracted Information]
User Query Acquisition • Our system provides a query form generated from an application-specific ontology
Site Form Analysis • Understand type, name, and/or values for each field
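As an illustration of what collecting the type, name, and values of each field might look like, here is a minimal sketch that uses only Python's standard `html.parser`. The class `FormFieldParser` and the function `analyze_form` are hypothetical names chosen for this example; the slides do not describe how the actual system parses forms, and this sketch handles only `<input>` and `<select>` fields.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Hypothetical helper: records type, name, and values of form fields."""

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> field currently open, if any
        self._option = None   # the <option> currently open, if any

    def _close_option(self):
        # Store the option's value attribute, or its label text if absent.
        if self._option is not None and self._select is not None:
            value = self._option["value"]
            if value is None:
                value = self._option["label"].strip()
            self._select["values"].append(value)
        self._option = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            value = a.get("value")
            self.fields.append({
                "type": a.get("type", "text"),
                "name": a.get("name"),
                "values": [] if value is None else [value],
            })
        elif tag == "select":
            self._select = {"type": "select", "name": a.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            self._close_option()  # </option> may legally be omitted
            self._option = {"value": a.get("value"), "label": ""}

    def handle_data(self, data):
        if self._option is not None:
            self._option["label"] += data

    def handle_endtag(self, tag):
        if tag == "option":
            self._close_option()
        elif tag == "select":
            self._close_option()
            self._select = None


def analyze_form(html):
    """Return a list of {"type", "name", "values"} dicts for the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Calling `analyze_form` on a page's HTML returns one dictionary per field; a `<select>` field lists the values a user could choose from, which is the information the later value-matching step would need.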
Form Filling • Name matching • Regular Expressions – for fields with values provided • Stemming • Levenshtein Edit Distance • Longest Common Subsequence • Soundex • WordNet • Value matching
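Two of the techniques listed above, Levenshtein edit distance and longest common subsequence, can be sketched as follows. This is only an illustration under stated assumptions: the functions `normalize`, `name_similarity`, and `best_match`, the way the two scores are combined (taking the larger), and the 0.6 threshold are all hypothetical choices for this example, not the system's actual matching rules. Stemming, Soundex, and WordNet lookups are left out.

```python
import re


def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def name_similarity(a, b):
    """Score in [0, 1]; the combination rule here is an assumption."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    longest = max(len(a), len(b))
    edit_score = 1 - levenshtein(a, b) / longest
    lcs_score = lcs_length(a, b) / longest
    return max(edit_score, lcs_score)


def best_match(ontology_name, field_names, threshold=0.6):
    """Return the form field name most similar to ontology_name, or None."""
    best, best_score = None, threshold
    for field in field_names:
        score = name_similarity(ontology_name, field)
        if score >= best_score:
            best, best_score = field, score
    return best
```

For instance, `best_match("Model", ["car_model", "price"])` compares the ontology name against each field name on the site form and returns the closest one that clears the threshold. Normalizing by the longer name penalizes short names embedded in long field names, so a real system would likely tune the threshold or combine these scores with the other listed techniques.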
Value Matching: Case 2 • [Figure: example not recovered]
Value Matching: Case 3 • Color? • [Figure: example not recovered]
Measurements • Matching Efficiency • Submission Efficiency • Post-processing Efficiency
Contributions • The system enhances the effectiveness of the data-extraction process • It offers another technique, in addition to [RGa01], for accessing data behind HTML forms