OntoByE, a semi-automatic tool developed at BYU, generates data-extraction ontologies for web data by example, helping users create the data frames and patterns needed for querying and extraction.
Generating Data-Extraction Ontologies By Example Joe Zhou Data Extraction Group Brigham Young University
Background • The World Wide Web contains a huge amount of useful information. • Web data extraction is necessary for querying the data of interest. • Most wrappers generate extraction patterns based on delimiters or HTML tags, so they are source-dependent. • The BYU ontology-based technique is resilient because it does not rely on source-specific delimiters or tags.
Problem and Solution • The BYU ontology-based approach requires that ontology experts generate data-extraction ontologies for the domains of interest to ordinary users. • A principal effort of our research is to automate the ontology-generation process as much as possible. • We developed a system, OntoByE (Ontology By Example), to generate data-extraction ontologies semi-automatically.
Extraction Ontology • Object sets, Relationship sets and Constraints • Data frames for Lexical Object Sets
Extraction Ontology • Object sets, Relationship sets and Constraints • Data frame for Digital Zoom
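The slide's Digital Zoom data frame is not reproduced in the text, so the following is only a minimal sketch of what a data frame for a lexical object set might look like. The `DataFrame` class, its field names, the value pattern, and the keyword pattern are all assumptions made for illustration; they are not OntoByE's actual representation.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFrame:
    """Hypothetical, simplified data frame for a lexical object set."""
    name: str
    value_pattern: str                                  # regex describing legal values
    keywords: List[str] = field(default_factory=list)   # regexes for nearby cue phrases

    def extract(self, text: str, window: int = 15) -> List[str]:
        """Return values that have a keyword within `window` characters."""
        found = []
        for m in re.finditer(self.value_pattern, text):
            start = max(0, m.start() - window)
            context = text[start:m.end() + window]
            if any(re.search(k, context, re.IGNORECASE) for k in self.keywords):
                found.append(m.group(0))
        return found


# Assumed patterns: a zoom factor such as "4x", cued by the phrase "digital zoom".
digital_zoom = DataFrame(
    name="DigitalZoom",
    value_pattern=r"\b\d+(?:\.\d+)?x\b",
    keywords=[r"digital\s+zoom"],
)

print(digital_zoom.extract("Zoom: 3x optical, 4x digital zoom."))  # ['4x']
```

The keyword check is what separates the digital zoom value from the optical zoom value, which has the same surface form.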
OntoByE System Overview and Architecture [Figure: architecture diagram. Components shown: Sample Pages, Data Frame Library, User Interface, Forms, Marked Pages, Ontology Generator, Extraction Ontology, Target Pages, Extraction Engine, Populated Database]
Form Editor – Creating Forms for Digital Camera Application
Ontology Generator – Workflow [Figure: workflow diagram. Elements shown: Users, Data Frame Library, Marked HTML Pages, User-defined Forms, Data Frame Matcher, Data Frame Editor, Context Phrase Locator, Keyword and Context Expression Recognizer, Form Analyzer, Ontology Generator. Outputs shown: Data Frames; Object Sets, Relationship Sets and Constraints; Extraction Ontology]
Ontology Generator – Form Analyzer [Figure: a sample form and the object and relationship sets and constraints derived from it] • BaseForm [0:1] A [1:*] • BaseForm [0:3] B [1:*] • BaseForm [0:*] C [1:*] • BaseForm [0:3] D1 [1:*] D2 [1:*] D3 [1:*] • BaseForm [0:*] E1 [1:*] E2 [1:*] E3 [1:*]
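The constraint lines on the Form Analyzer slide suggest a simple mapping from form fields to participation constraints. The sketch below reproduces those five lines under an assumed reading: each form field has one or more columns and an optional limit on the number of entries. The `FormField` type and `analyze_form` function are hypothetical names, and the sample form itself is not available in the text, so the field descriptions are inferred from the constraints alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FormField:
    """Hypothetical form field: one column is a binary relationship,
    several columns form an n-ary relationship with the base form."""
    columns: List[str]
    max_entries: Optional[int]  # None means the user may add entries without limit


def analyze_form(fields: List[FormField], base: str = "BaseForm") -> List[str]:
    """Emit one relationship set with participation constraints per field."""
    lines = []
    for f in fields:
        upper = "*" if f.max_entries is None else str(f.max_entries)
        others = " ".join(f"{c} [1:*]" for c in f.columns)
        lines.append(f"{base} [0:{upper}] {others}")
    return lines


sample_form = [
    FormField(["A"], 1),                  # single-value field
    FormField(["B"], 3),                  # up to three values
    FormField(["C"], None),               # unbounded list of values
    FormField(["D1", "D2", "D3"], 3),     # three-column table, up to three rows
    FormField(["E1", "E2", "E3"], None),  # three-column table, unbounded rows
]

for line in analyze_form(sample_form):
    print(line)
```

Under these assumptions the output matches the five constraint lines listed on the slide.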
Ontology Generator – Form Analyzer [Figure: Digital Camera application forms and the resulting object and relationship sets and constraints]
Ontology Generator – Data Frame Matcher Data Frame Matching Heuristics: • Number of matched data values Data Frame Ranking Heuristics: • Number of matched data values • Keyword and/or context matching • Order of specialization/generalization
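Two of the ranking heuristics above can be sketched in a few lines: count how many user-marked values each data frame matches, then break ties in favor of the more specialized data frame. The library entries, their regular expressions, and the numeric specialization ranks are assumptions chosen for illustration, and the keyword/context heuristic is left out to keep the sketch short.

```python
import re

# Hypothetical mini-library: name -> (value pattern, specialization rank).
# A lower rank means a more specialized data frame. These entries are
# illustrative assumptions, not OntoByE's actual library.
LIBRARY = {
    "SingleDigit": (r"[0-9]", 0),
    "SmallPositiveInteger": (r"[1-9][0-9]?", 1),
    "Integer": (r"[+-]?[0-9]+", 2),
    "RealNumber": (r"[+-]?[0-9]+(?:\.[0-9]+)?", 3),
}


def rank_data_frames(marked_values, library=LIBRARY):
    """Rank data frames by number of matched values, then by specialization."""
    scored = []
    for name, (pattern, rank) in library.items():
        matched = sum(1 for v in marked_values if re.fullmatch(pattern, v))
        if matched:  # matching heuristic: keep frames that match at least one value
            scored.append((name, matched, rank))
    # Most matches first; among equals, the more specialized frame first.
    scored.sort(key=lambda t: (-t[1], t[2]))
    return [name for name, _, _ in scored]


print(rank_data_frames(["3", "10", "12"]))
# ['SmallPositiveInteger', 'Integer', 'RealNumber', 'SingleDigit']
```

Preferring the specialized frame among equally good matches keeps the suggestion as tight as the samples allow; a user could still pick a more general frame from further down the list.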
Experimental Preparation • Selected two domains of interest: the Digital Camera application and the Apartment Rental application • Constructed an initial data frame library: Integer (any integer value), SmallPositiveInteger (from 1 to 99), SingleDigit (from 0 to 9), RealNumber (any real value), SmallPositiveReal (from 0.01 to 99.99), Date, Email, PhoneNumber, and Price • Created application-dependent forms for each application • Collected 5 sample pages from different web sites for each domain • Marked desired data on sample pages
Experimental Observations – Strengths of OntoByE • OntoByE provides a friendly and intuitive interface that helps ordinary users describe data of interest without exposing them to abstract ontology concepts. • With a small initial data frame library and a small set of sample pages, OntoByE works well at finding and suggesting appropriate existing data frames for object sets with application-independent values. • OntoByE successfully recognizes possible keywords and contexts for user-marked data on sample pages and helps users create new data frames with those keywords and contexts.
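The slides do not describe how keywords and contexts are recognized, so the following is only one plausible sketch: collect the word or words that immediately precede each user-marked value and keep the phrases that recur across sample pages. The function name, its parameters, and the sample texts are hypothetical.

```python
import re
from collections import Counter


def candidate_keywords(samples, n_words=1, min_support=2):
    """Suggest keyword phrases that precede marked values.

    samples: list of (page_text, marked_value) pairs.
    Only the first occurrence of each marked value is considered,
    which is a simplification.
    """
    counts = Counter()
    for text, value in samples:
        idx = text.find(value)
        if idx < 0:
            continue  # marked value not found in this page text
        words = re.findall(r"[A-Za-z]+", text[:idx].lower())
        phrase = " ".join(words[-n_words:])
        if phrase:
            counts[phrase] += 1
    return [p for p, c in counts.most_common() if c >= min_support]


# Made-up apartment rental snippets with marked prices.
samples = [
    ("Rent: $750 per month", "$750"),
    ("Monthly rent: $900", "$900"),
    ("Price $1,200", "$1,200"),
]
print(candidate_keywords(samples))  # ['rent']
```

A phrase seen on only one page is dropped here, which shows why the number and representativeness of samples matter for this step.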
Experimental Observations – Limitations of OntoByE • The performance of searching for or constructing data frames by OntoByE is limited by the scope and the quality of prior knowledge • The accuracy and completeness of keyword and context expression construction are limited by the number and representativeness of user samples • Constructing value expressions for application-dependent data frames requires that users know how to write regular expressions.
Conclusion • We implemented a user-friendly interface for ordinary users to take advantage of our ontology-based web data-extraction approach. • We developed a framework for interacting with ordinary users to generate ontologies by example. • Our experiments demonstrate that OntoByE works well at generating ontologies with the assistance of limited prior knowledge. As the prior knowledge expands over time, we expect OntoByE to perform better.
Future Work • Have OntoByE learn to build application-dependent lexicons for users’ applications • Improve the sub-components of the back-end ontology generator, e.g. Context Phrase Locator