Automatic Creation and Simplified Querying of Semantic Web Content: An Approach Based on Information-Extraction Ontologies
Yihong Ding, David W. Embley, and Stephen W. Liddle
Brigham Young University
Fundamental Problems • Lack of semantic web content • Difficulty of content creation • Inability to use semantic web content easily
Proposed Solutions • Automatically annotate data-rich web pages (turning them into semantic web pages) • Provide for free-form, textual queries of semantic web content
A Show-Case Vision Find me the price and mileage of red Nissans – I want a 1990 or newer.
Explanation: How it Works • Extraction Ontologies • Semantic Annotation • Free-Form Query Interpretation
Extraction Ontologies • Object sets (lexical, non-lexical, primary object set) • Relationship sets • Participation constraints • Aggregation • Generalization/Specialization
Formalism & Extraction Ontologies (a quick side note) • Fully formalized in predicate calculus • Object set ~ 1-place predicate • n-ary relationship set ~ n-place predicate • Constraint ~ closed predicate-calculus formula • As a description logic ~ ALCN (Attributive Language with Complement and Number Restrictions)
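As a small worked illustration of the mapping above, consider a relationship set between cars and prices. The predicate names Car, Price, and CarHasPrice and the role name hasPrice are assumptions chosen for this sketch and do not come from the slides. The first formula is a referential-integrity constraint, the second a participation constraint saying a car has at most one price, and the third a possible ALCN rendering of that participation constraint.

```latex
\begin{align*}
&\forall x\,\forall y\,\bigl(\mathit{CarHasPrice}(x,y) \Rightarrow \mathit{Car}(x) \land \mathit{Price}(y)\bigr)\\
&\forall x\,\bigl(\mathit{Car}(x) \Rightarrow \exists^{\le 1} y\;\mathit{CarHasPrice}(x,y)\bigr)\\
&\mathit{Car} \sqsubseteq\; \le 1\,\mathit{hasPrice}
\end{align*}
```

Both predicate-calculus formulas are closed (no free variables), which is what lets them serve as constraints.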
Extraction Ontologies: Data Frame
• Internal Representation: float
• Values
  • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  • Left Context: $
• Key Word Phrase
  • Key Words: ([Pp]rice)|([Cc]ost)| …
• Operators
  • Operator: >
  • Key Words: (more\s*than)|(more\s*costly)|…
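A data frame of this kind can be sketched as a small Python class. This is only an illustration under stated assumptions: the class name DataFrame, its field names, and the instance PRICE are hypothetical, and the value pattern is a simplified variant (with thousands separators) rather than the exact expression on the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class DataFrame:
    """Hypothetical, simplified data frame: recognizers for one object set."""
    name: str
    value_pattern: str    # external representation; group 1 holds the value
    keyword_pattern: str  # context keywords that signal this object set
    operators: dict       # operator symbol -> regex of operator keywords

    def extract(self, text):
        """Return all values found in text, converted to the internal float form."""
        return [float(m.group(1).replace(",", ""))
                for m in re.finditer(self.value_pattern, text)]

    def has_keyword(self, text):
        """True if a context keyword for this object set occurs in text."""
        return re.search(self.keyword_pattern, text) is not None

    def find_operator(self, text):
        """Return the first operator symbol whose keywords occur in text, else None."""
        for symbol, pattern in self.operators.items():
            if re.search(pattern, text):
                return symbol
        return None


# Assumed example instance, loosely modeled on the slide's data frame.
PRICE = DataFrame(
    name="Price",
    value_pattern=r"\$\s*(\d+(?:,\d{3})*(?:\.\d{2})?)",
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

print(PRICE.extract("Asking $4,500 or best offer"))
print(PRICE.find_operator("anything that costs more than $3,000"))
```

Keeping the recognizers declarative (patterns in data, not logic in code) is what allows the same extraction machinery to be reused across domains.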
Data-Extraction Results: Car Ads (Salt Lake Tribune)
Attribute   Recall %   Precision %
Year        100        100
Make         97        100
Model        82        100
Mileage      90        100
Price       100        100
PhoneNr      94        100
Feature      91         99
Training set for tuning ontology: 100. Test set: 116.
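For readers unfamiliar with the two measures, recall is the fraction of values that should have been extracted and were, and precision is the fraction of extracted values that are correct. The sketch below assumes values are compared as sets; the function name and the sample values are made up for illustration and are not data from the experiment.

```python
def recall_precision(expected, extracted):
    """Return (recall, precision) for extracted values against the expected ones."""
    expected, extracted = set(expected), set(extracted)
    correct = expected & extracted
    recall = len(correct) / len(expected) if expected else 1.0
    precision = len(correct) / len(extracted) if extracted else 1.0
    return recall, precision


# Illustrative only: one expected make is missed, nothing wrong is extracted.
expected_makes = {"Nissan", "Ford", "MERC", "Dodge"}
extracted_makes = {"Nissan", "Ford", "Dodge"}
print(recall_precision(expected_makes, extracted_makes))
```

A missed value lowers recall but leaves precision untouched, which matches the pattern in the table: misses are more common than wrong extractions.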
Car Ads: Comments
• Dynamic sets
  • Missed: MERC, Town Car, 98 Royale
  • Could use lexicon of makes and models
• Unspecified variation in lexical patterns
  • Missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  • Could adjust lexical patterns
• Misidentification of attributes
  • Classified AUTO in AUTO SALES as automatic transmission
  • Could adjust exceptions in lexical patterns
• Typographical errors
  • “Chrystler”, “DODG ENeon”, “I-15566-2441”
  • Could look for spelling variations and common typos
General Extraction Results • ~ 20 Domains (cars, obituaries, cameras, jobs, games, prescription drugs, …) • Simple, unified domains: nearly 100% recall and precision • Complex, loosely defined domains (e.g. obituaries: 82% recall and 74% precision) • Typical: 80%+ recall and precision
Generality & Resiliency of Extraction Ontologies (another quick side note)
• Assumptions about web pages (generality)
  • Data rich
  • Narrow domain
  • Document types
    • Simple multiple-record documents (easiest)
    • Single-record documents (harder)
    • Records with scattered components (even harder)
• Declarative (resiliency)
  • Still works when web pages change
  • Works for new, unseen pages in the same domain
  • Scalable, but takes work to declare the extraction ontology
Free-Form Query Interpretation • Parse Free-Form Query (with data extraction ontology) • Select Ontology • Formulate Query Expression • Run Query Over Semantically Annotated Data
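Parse Free-Form Query “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Recognized by the data extraction ontology: price, mileage, red, Nissan, 1996 • “or newer” is recognized as the >= operator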
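The parsing step can be sketched by running each object set's recognizers over the query text. Everything in this sketch is an assumption made for illustration: the dictionary CAR_ONTOLOGY, its patterns, and the function parse_query are hypothetical and far simpler than a real extraction ontology.

```python
import re

# Hypothetical, heavily simplified car-ads ontology: each object set has
# keyword recognizers, value recognizers, and optional operator keywords.
CAR_ONTOLOGY = {
    "Price":   {"keywords": r"\b(?:price|cost)\b",
                "values": r"\$\s*\d+(?:,\d{3})*"},
    "Mileage": {"keywords": r"\b(?:mileage|miles)\b",
                "values": r"\b\d+(?:,\d{3})*\s*miles\b"},
    "Color":   {"keywords": r"\bcolou?r\b",
                "values": r"\b(?:red|blue|white|black)\b"},
    "Make":    {"keywords": r"\bmake\b",
                "values": r"\b(?:Nissan|Ford|Dodge)"},
    "Year":    {"keywords": r"\byear\b",
                "values": r"\b(?:19|20)\d{2}\b",
                "operators": {">=": r"or\s+newer", "<=": r"or\s+older"}},
}


def parse_query(query, ontology):
    """Return (object sets mentioned by keyword, list of (object set, op, value))."""
    mentioned, conditions = [], []
    for name, frame in ontology.items():
        if re.search(frame["keywords"], query, re.IGNORECASE):
            mentioned.append(name)
        for m in re.finditer(frame["values"], query, re.IGNORECASE):
            op = "="  # default when no operator keyword follows the value
            tail = query[m.end():]
            for symbol, pattern in frame.get("operators", {}).items():
                if re.match(r"\s*" + pattern, tail, re.IGNORECASE):
                    op = symbol
            conditions.append((name, op, m.group(0)))
    return mentioned, conditions


QUERY = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
mentioned, conditions = parse_query(QUERY, CAR_ONTOLOGY)
print(mentioned)
print(conditions)
```

In this sketch, keywords mark object sets the user wants returned, while recognized values and operator keywords become conditions.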
Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Each candidate ontology receives a similarity value for the query (in the slide's example: 2 and 5)
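One simple way to picture ontology selection is to count how many of an ontology's recognizers fire on the query and pick the ontology with the highest count. This is an assumption made for the sketch; the slides do not say how the similarity value is computed. The two ontologies, their patterns, and the function names below are likewise hypothetical.

```python
import re

# Hypothetical recognizer lists, one regular expression per object set.
ONTOLOGIES = {
    "CarAds": [
        r"\bprice\b",
        r"\bmileage\b",
        r"\b(?:red|blue|white|black)\b",
        r"\b(?:Nissan|Ford|Dodge)",
        r"\b(?:19|20)\d{2}\b",
    ],
    "RealEstate": [
        r"\bprice\b",
        r"\bbedrooms?\b",
        r"\bacres?\b",
        r"\b(?:19|20)\d{2}\b",
    ],
}


def similarity(query, recognizers):
    """Assumed measure: number of recognizers that match somewhere in the query."""
    return sum(1 for p in recognizers if re.search(p, query, re.IGNORECASE))


def select_ontology(query, ontologies):
    """Return the name of the ontology with the highest similarity value."""
    return max(ontologies, key=lambda name: similarity(query, ontologies[name]))


QUERY = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
scores = {name: similarity(QUERY, recs) for name, recs in ONTOLOGIES.items()}
print(scores, select_ontology(QUERY, ONTOLOGIES))
```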
Formulate Query Expression • Conjunctive queries and aggregate queries • Mentioned object sets are all of interest in the result. • Values and operator keywords determine conditions. • Color = “red” • Make = “Nissan” • Year >= 1996 (the >= operator comes from “or newer”)
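The resulting conjunctive query can be pictured as a filter plus a projection: keep the records that satisfy every condition, then return only the object sets of interest. The sketch below runs over plain Python dictionaries; the sample records, the OPS table, and run_query are assumptions for illustration and stand in for the semantically annotated data and the actual query language.

```python
import operator

# Map operator symbols from the parsed query to Python comparison functions.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def run_query(records, wanted, conditions):
    """Return the wanted attributes of every record meeting all conditions."""
    results = []
    for rec in records:
        if all(attr in rec and OPS[op](rec[attr], value)
               for attr, op, value in conditions):
            results.append({attr: rec.get(attr) for attr in wanted})
    return results


# Made-up annotated records, not data from the experiments.
ADS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 92000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2100, "Mileage": 140000},
    {"Make": "Ford", "Color": "red", "Year": 2001, "Price": 6900, "Mileage": 60000},
]

CONDITIONS = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
answer = run_query(ADS, ["Price", "Mileage"], CONDITIONS)
print(answer)
```

Because every condition must hold, the sketch covers only conjunction, with no disjunction or negation.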
Formulate Query Expression (slide figure: the generated query expression, with For, Let, Where, and Return clauses)
Query Interpretation Results: Pilot Experiment with Car Ads
• 15 free-form car-ads queries from 3 volunteer CS students
• Results
  • Recognizing object sets of interest
    • Recall: 85%
    • Precision: 90%
  • Recognizing constraints
    • Recall: 61%
    • Precision: 79%
• Problems
  • Regular expressions not tuned and lexicons incomplete
  • Ambiguities: “Are there any Ford mustangs, 2002, that are red?” (Is 2002 a year, mileage, or price?)
• Caveats
  • No disjunction
  • No negation
General Query Interpretation Results: AskOntos (pilot experiment on 5 domains: cars, real estate, countries, movies, diamonds)
• Object sets of interest recognized
  • Recall: 90%
  • Precision: 90%
• Conditions recognized
  • Recall: 71%
  • Precision: 88%
Pragmatics: All is not rosy …
• Technical problems
  • Extraction and query-interpretation accuracy
  • Execution speed
  • Harvesting
    • Crawling?!
    • Information behind forms on the hidden web
• Social problems
  • Cooperation from web site developers
  • End-user concerns
    • Motivation
    • Trust
Conclusions
• Automatically create semantic-web content
  • Do data extraction over an ordinary web page
  • Create semantic-web page
    • Cache page
    • Store external semantic annotation wrt an ontology
• Query semantic web pages
  • Free-form queries
  • Return results
    • Table
    • Link to original web page (scrolled and highlighted)
• Pragmatic considerations
www.deg.byu.edu