Ontology Based Extraction of RDF Data from the World Wide Web
Tim Chartrand
Master's Thesis
Research Supported By NSF
Introduction
• World Wide Web
  • Has a huge amount of existing information
  • Designed primarily for human consumption
• Semantic Web
  • Is an extension of WWW
  • Gives information a well-defined meaning
  • Allows automation of tasks
• DEG contribution – Extract data from the WWW
• Solution
  • Extract Semantic Web data from the WWW
  • Superimpose extracted data
Research Overview
[Diagram: User, HTML Pages, DAML Ontology, Extraction Ontologies, Extraction Engines, Relational Data, RDF Data, RDF Browser]
RDF – What is it?
• Resource Description Framework
• Language of the Semantic Web
• Set of <subject> <predicate> <object> triples
  <mailto:tim@cs.byu.edu> <genealogy#age> "25"
  <mailto:tim@cs.byu.edu> <genealogy#fatherOf> <mailto:tyler@thechartrands.com>
[Graph: mailto:tim@cs.byu.edu has a genealogy:age edge to 25 and a genealogy:fatherOf edge to mailto:tyler@thechartrands.com]
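The triple model on this slide can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation: the `IRI` and `Literal` classes and the `to_ntriples` helper are hypothetical names, and the `http://example.org/genealogy#` namespace is an assumed stand-in for the abbreviated `genealogy#` prefix shown on the slide.

```python
from dataclasses import dataclass

# Assumed namespace; the slide only shows the abbreviated "genealogy#" prefix.
GENEALOGY = "http://example.org/genealogy#"


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


def term(node):
    """Render one node in N-Triples style: <iri> or "literal"."""
    if isinstance(node, IRI):
        return f"<{node.value}>"
    escaped = node.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


tim = IRI("mailto:tim@cs.byu.edu")
TRIPLES = [
    (tim, IRI(GENEALOGY + "age"), Literal("25")),
    (tim, IRI(GENEALOGY + "fatherOf"), IRI("mailto:tyler@thechartrands.com")),
]

if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
```

Keeping resources and literals as separate types mirrors the distinction the slide draws between the object `"25"` (a value) and the object `mailto:tyler@thechartrands.com` (another resource).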
DAML Core Concepts
• daml:Class – defines a class
• daml:Property – defines a binary relation, has a value
• rdfs:domain – specifies the class to which a property applies
• rdfs:range – specifies the possible values of a property
• daml:UniqueProperty, daml:UnambiguousProperty – specify cardinality constraints for a property
Example Ontology
. . .
<daml:Class rdf:ID="Program">
  <rdfs:label>Program</rdfs:label>
</daml:Class>
<daml:Class rdf:ID="OperatingSystem">
  <rdfs:label>OperatingSystem</rdfs:label>
</daml:Class>
. . .
<daml:DatatypeProperty rdf:ID="Name">
  <rdf:type rdf:resource="&daml;UniqueProperty"/>
  <rdf:type rdf:resource="&daml;UnambiguousProperty"/>
  <rdfs:domain rdf:resource="#Program"/>
  <rdfs:range rdf:resource="&rdfs;Literal"/>
</daml:DatatypeProperty>
<daml:Property rdf:ID="supportsOperatingSystem">
  <rdfs:domain rdf:resource="#Program"/>
  <rdfs:range rdf:resource="#OperatingSystem"/>
</daml:Property>
. . .
DAML → OSM
• Class → Non-lexical object set
• Property → Binary relationship set between object sets
• Literal property → Lexical object set and binary relationship set between non-lexical and lexical object sets
• Cardinality restriction → Participation constraint
DAML → OSM
<daml:Class rdf:ID="Program">
  <rdfs:label>Program</rdfs:label>
</daml:Class>
<daml:Class rdf:ID="OperatingSystem">
  <rdfs:label>OperatingSystem</rdfs:label>
</daml:Class>
. . .
<daml:DatatypeProperty rdf:ID="Name">
  <rdf:type rdf:resource="&daml;UniqueProperty"/>
  <rdf:type rdf:resource="&daml;UnambiguousProperty"/>
  <rdfs:domain rdf:resource="#Program"/>
  <rdfs:range rdf:resource="&rdfs;Literal"/>
</daml:DatatypeProperty>
<daml:Property rdf:ID="supportsOperatingSystem">
  <rdfs:domain rdf:resource="#Program"/>
  <rdfs:range rdf:resource="#OperatingSystem"/>
</daml:Property>
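The mapping rules from DAML constructs to OSM constructs might be sketched as follows. This is a minimal illustration under stated assumptions, not the conversion algorithm from the thesis: the `daml_to_osm` function and the dictionary layout of its result are hypothetical, the entity references `&daml;` and `&rdfs;` are written out as full namespace URIs so the snippet parses without a DTD, and cardinality types are only recorded by name rather than translated into specific participation constraints.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

# Trimmed version of the example ontology, with entities expanded.
ONTOLOGY = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="Program"/>
  <daml:Class rdf:ID="OperatingSystem"/>
  <daml:DatatypeProperty rdf:ID="Name">
    <rdf:type rdf:resource="{DAML}UniqueProperty"/>
    <rdfs:domain rdf:resource="#Program"/>
    <rdfs:range rdf:resource="{RDFS}Literal"/>
  </daml:DatatypeProperty>
  <daml:Property rdf:ID="supportsOperatingSystem">
    <rdfs:domain rdf:resource="#Program"/>
    <rdfs:range rdf:resource="#OperatingSystem"/>
  </daml:Property>
</rdf:RDF>"""


def daml_to_osm(xml_text):
    """Map DAML classes and properties onto OSM object and relationship sets."""
    root = ET.fromstring(xml_text)
    osm = {"nonlexical": [], "lexical": [], "relationships": []}
    property_tags = (f"{{{DAML}}}Property", f"{{{DAML}}}DatatypeProperty")
    for el in root:
        name = el.get(f"{{{RDF}}}ID")
        if el.tag == f"{{{DAML}}}Class":
            # Class -> non-lexical object set
            osm["nonlexical"].append(name)
        elif el.tag in property_tags:
            domain = el.find(f"{{{RDFS}}}domain").get(f"{{{RDF}}}resource").lstrip("#")
            range_ = el.find(f"{{{RDFS}}}range").get(f"{{{RDF}}}resource")
            types = [t.get(f"{{{RDF}}}resource") for t in el.findall(f"{{{RDF}}}type")]
            if range_ == RDFS + "Literal":
                # Literal property -> lexical object set plus a relationship set
                osm["lexical"].append(name)
                target = name
            else:
                # Property -> relationship set between two object sets
                target = range_.lstrip("#")
            # Cardinality types are kept by name for later constraint handling.
            constraints = [t[len(DAML):] for t in types if t.startswith(DAML)]
            osm["relationships"].append((domain, target, constraints))
    return osm


if __name__ == "__main__":
    print(daml_to_osm(ONTOLOGY))
```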
Data Frames
• Lexical object sets need data frames
• Use data-frame library
• Match lexical object sets with data frames
  • Compare stemmed names and aliases
    • Levenshtein edit distance
    • Soundex
    • Longest common subsequence
    • Weighted average
  • Specialization heuristic
  • Choose most similar data frame (above a threshold)
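The name-matching step can be illustrated with a small sketch that combines two of the listed measures into a weighted average and applies a threshold. Everything concrete here is an assumption made for illustration: the equal weights, the 0.6 threshold, the `best_data_frame` helper, and the toy library are not from the slides, and stemming, Soundex, and the specialization heuristic are left out for brevity.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b, w_edit=0.5, w_lcs=0.5):
    """Weighted average of two normalized scores in [0, 1]; weights are assumed."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    edit_sim = 1 - levenshtein(a, b) / longest
    lcs_sim = lcs_length(a, b) / longest
    return w_edit * edit_sim + w_lcs * lcs_sim


def best_data_frame(object_set, library, threshold=0.6):
    """Return the most similar data frame name, or None if below the threshold."""
    scored = [
        (max(similarity(object_set, alias) for alias in aliases), frame)
        for frame, aliases in library.items()
    ]
    score, frame = max(scored)
    return frame if score >= threshold else None


# Hypothetical data-frame library: frame name -> known names and aliases.
LIBRARY = {
    "Phone": ["phone", "phone number", "telephone"],
    "Price": ["price", "cost", "rate"],
}

if __name__ == "__main__":
    print(best_data_frame("PhoneNumber", LIBRARY))
```

Returning `None` below the threshold leaves the object set unmatched, which is where a user could step in and choose or edit the data-frame mapping by hand.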
User Modification
• Provide graphical ontology editor
• Automate graph layout
• Allow the user to edit participation constraints
• Allow the user to edit data-frame mapping
• Provide data-frame editor
Extracting the Data
Pointing to the Data
<html>
. . .
<body>
<table>
  <tr>
    <td>
      <a href="..."><b>Stick Death 1.0</b></a><br />
      Advance in levels, grab weapons, and unlock new levels and characters.<br />
      <b>OS:</b> Windows 3.x/95/98/Me/NT/2000/XP<br />
      <b>File Size:</b>2.66MB<br />
      <b>License:</b>Free<br />
    </td>
    <td>05/14/2002<br />
      <i><b>new</b></i>
    </td>
    <td></td>
    <td>2,235</td>
    <td><a href="...">Download now</a><br /><br /></td>
  </tr>
. . .
xpointer(string-range(/html[1]/body[1]/table[1]/tr[1], '', 10, 3))
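The idea behind the pointer, addressing a value by its character offset and length inside an element's text, can be sketched as below. This is a simplified reading and not a full XPointer implementation: the `string_range` helper is hypothetical, it treats the numbers as a 1-based offset and a length over the concatenated text of one element, and the row is a shortened, well-formed stand-in for the page on the slide.

```python
import xml.etree.ElementTree as ET


def string_range(element, start, length):
    """Return `length` characters of the element's text, from 1-based `start`."""
    text = "".join(element.itertext())
    return text[start - 1 : start - 1 + length]


# Shortened, well-formed stand-in for the table row shown on the slide.
ROW = ET.fromstring(
    "<tr><td><a href='#'><b>Stick Death 1.0</b></a><br/>Free</td></tr>"
)

if __name__ == "__main__":
    print(string_range(ROW, 1, 11))   # the program name
    print(string_range(ROW, 13, 3))   # the version number
```

Storing only a path plus offset and length keeps the extracted value tied to its position in the source page, which is what allows the data to be superimposed on the original document later.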
Convert to RDF
[Figure: RDF graph rooted at http://www.deg.byu.edu/software.html#Program1001. Edge labels: rdf:type, software:name, software:version, software:ProgSize, software:OperatingSystem, software:SizeVal, software:SizeUnit, software:OSName, software:OSVersion. Node labels: software:Program, software:Size, software:OperatingSystem, Stick Death 1.0, 2.66, MB, Windows, 3.x/95/98/Me/NT/2000/X]
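One way to picture the conversion step is a function that turns an extracted record into triples about a generated URI. This is a hedged sketch only: `record_to_triples` is a hypothetical helper, `http://example.org/software#` is an assumed expansion of the `software:` prefix, the split of the title into a name and a version is an assumption about the figure, and nested objects such as the size and operating system nodes are omitted.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumed namespace; the figure only shows the "software:" prefix.
SOFTWARE = "http://example.org/software#"


def record_to_triples(page_url, class_name, object_id, record):
    """Build (subject, predicate, object) triples for one extracted object."""
    # The subject URI follows the pattern seen in the figure: page#ClassId.
    subject = f"{page_url}#{class_name}{object_id}"
    triples = [(subject, RDF_TYPE, SOFTWARE + class_name)]
    for prop, value in record.items():
        triples.append((subject, SOFTWARE + prop, value))
    return triples


if __name__ == "__main__":
    rows = record_to_triples(
        "http://www.deg.byu.edu/software.html",
        "Program",
        1001,
        {"name": "Stick Death", "version": "1.0"},
    )
    for triple in rows:
        print(triple)
```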
Superimposed Data
Results
• RDF Data Extraction and Viewing
  • Built 4 data-extraction ontologies
    • 3 from DAML ontologies for data extraction
    • 1 from an existing DAML ontology
  • Most existing DAML ontologies not good for data extraction
• Data Frame Matcher
  • 8 'training ontologies', 16 test ontologies
  • 128 lexical object sets, 40 correct matches, 12 incorrect matches
  • Precision: 77%
  • Recall: 89%
• Experiment (apartment rentals): 6 students, 3 data frames
  • Phone: 2.8 min
  • RentalRate: 16.5 min
  • Bedrooms: 17.5 min
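The reported precision follows directly from the match counts, as this short check shows. The recall denominator is not given on the slide, so the second figure is only an inference: a recall of 89% with 40 correct matches would imply that roughly 45 object sets had a suitable data frame available.

```python
def precision(correct, incorrect):
    """Fraction of proposed matches that were correct."""
    return correct / (correct + incorrect)


# Counts from the slide: 40 correct and 12 incorrect matches.
p = precision(40, 12)

# Inferred, not stated on the slide: number of expected matches implied
# by the reported 89% recall.
implied_expected = 40 / 0.89

if __name__ == "__main__":
    print(f"precision = {p:.3f}")
    print(f"implied expected matches = {implied_expected:.1f}")
```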
Contributions
• Advancement of Semantic Web
• Application of Information Extraction to building Semantic Web content
• Semantic Web data as superimposed information
• Algorithm for ontology conversion
Future Work
• Data extraction
  • Enhance name matcher with data values
  • Support n-ary relationship sets
• RDF data generation
  • Generate only one URI for an object
  • Associate concepts from DAML ontologies to well-known DAML ontologies