
Annotating Documents for the Semantic Web Using Data-Extraction Ontologies

This dissertation proposal explores the use of data-extraction ontologies to annotate web documents so that the current web can be transformed into the semantic web. It covers the motivation, the current state of research and its problems, and proposes a new ontology-driven paradigm for semantic annotation. The proposal also covers ontology construction, knowledge reuse, and a solution for performing semantic annotation on HTML web pages.



Presentation Transcript


  1. Annotating Documents for the Semantic Web Using Data-Extraction Ontologies • Dissertation Proposal • Yihong Ding

  2. Motivation • The representation of web content limits its usability • A machine understandable web • Shared, explicit, formal conceptualizations (ontologies) • The semantic web

  3. A Problem • How can the current web be transformed into the semantic web?

  4. A Solution: Semantic Annotation • Add explicit, formal, and unambiguous metadata to web documents • Explicit: publicly accessible • Formal: publicly agreeable • Unambiguous: publicly identifiable

  5. Annotation Representation • Implicit annotation • Explicit annotation

  6. Semantic Annotation: Current Research Status • Manual annotation through friendly interfaces [Annotea, etc.] • Automatic annotation with ontology generation [SCORE] • Automatic annotation using automated IE tools based on pre-defined ontologies [SemTag, MnM, etc.]

  7. Current Automatic Annotator: A Typical Paradigm • Diagram labels: Document; Non-ontology-based IE Wrapper; Rules and extracting categories; Domain Ontology • Steps: (1) Extraction (2) Alignment (3) Annotation

  8. Current Automatic Annotator: Problems • Diagram labels: Document; Non-ontology-based IE Wrapper; Rules and extracting categories; Domain Ontology • (1) Problem of data recognition • (2) Problem of concept disambiguation • (3) Problem of annotation formatting, storing, indexing, sharing • (4) Problem of assembling ontologies

  9. “Main Drawback of Using Automated IE” [Kiryakov04] • “none of these approaches expects an input or produces output with respect to ontologies” • “a set of heuristics for post-processing and mapping of the IE results to an ontology … not sufficient for large-scale, domain-independent semantic annotation.” • “IE and wrapper induction techniques need to use the ontology more directly during the process of extraction.”

  10. Ontology-driven Paradigm (Data-Extraction Ontology) for Semantic Annotation • Diagram labels: Ontology-based IE Wrapper, Document; Non-ontology-based IE Wrapper, Document

  11. Ontology-driven Paradigm for Semantic Annotation: Some Arguments • Resiliency w.r.t. web page layouts (helps scale to a large set of web pages) • Adaptiveness w.r.t. domain specifications (helps scale to large domains) • Creation of ontologies: still a problem but no longer a drawback • Speed of execution: still a drawback (but we are going to propose a solution next)

  12. Two-Layer Annotation Model • Sample annotation process: Document; Conceptual Annotator using an ontology-based IE tool • Massive annotation process: Similar Documents; Structural Annotator

  13. Structural Annotator • Major components • HTML hierarchical path that leads to concept locations • Local context around locations • Dependencies among multiple semantic categories • Significance • Identify both categories and their semantic meanings
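
The first component above, the HTML hierarchical path that leads to a concept location, can be illustrated with a minimal sketch. It assumes that a "path" is simply the chain of open tags from the document root down to a text node; the names `PathRecorder` and `text_paths` are hypothetical and not from the proposal. Local context and dependencies among categories are not modeled here.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathRecorder(HTMLParser):
    """Record the tag path from the root to every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.locations = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.locations.append(("/".join(self.stack), text))


def text_paths(html):
    """Return (hierarchical path, text) pairs for an HTML string."""
    parser = PathRecorder()
    parser.feed(html)
    parser.close()
    return parser.locations


if __name__ == "__main__":
    page = "<html><body><table><tr><td>Price</td><td>$9,500</td></tr></table></body></html>"
    for path, text in text_paths(page):
        print(path, "->", text)
```

On machine-generated pages that share a template, values of the same concept tend to sit at the same path, which is what would let a path learned from one sample page be reused on similar pages.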

  14. Ontology Factors in Semantic Annotation Tasks • Knowledge specification • Semantic web community • Web Ontology Language (OWL) • Knowledge instantiation • IE and database community • Object-oriented System Model in XML (OSMX)

  15. Ontology Conversion • Similarities (OWL vs. OSMX): Class vs. object set; ObjectProperty vs. relationship set; cardinality restriction vs. participation constraint; subClassOf vs. is-a relationship • Unique features of OWL: subPropertyOf; symmetric and transitive properties; namespace declaration; ontology importing • Unique features of OSMX: arbitrary n-ary relationship sets; data frames; general constraints
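
The correspondences on this slide can be written down as a lookup table. The sketch below is only an illustration of that table: the construct names are descriptive labels taken from the slide, not actual OWL or OSMX element names, and `split_by_convertibility` is a hypothetical helper.

```python
# Correspondences listed on the slide (OWL construct -> OSMX construct).
OWL_TO_OSMX = {
    "Class": "object set",
    "ObjectProperty": "relationship set",
    "cardinality restriction": "participation constraint",
    "subClassOf": "is-a relationship",
}

# Features the slide lists as unique to one side (no direct counterpart).
OWL_ONLY = {"subPropertyOf", "symmetric property", "transitive property",
            "namespace declaration", "ontology importing"}
OSMX_ONLY = {"n-ary relationship set", "data frame", "general constraint"}


def split_by_convertibility(owl_constructs):
    """Split OWL constructs into those with an OSMX counterpart and the rest."""
    mapped = {}
    unmapped = []
    for construct in owl_constructs:
        if construct in OWL_TO_OSMX:
            mapped[construct] = OWL_TO_OSMX[construct]
        else:
            unmapped.append(construct)
    return mapped, unmapped


if __name__ == "__main__":
    mapped, unmapped = split_by_convertibility(
        ["Class", "subClassOf", "subPropertyOf"])
    print(mapped)    # constructs that translate directly
    print(unmapped)  # constructs that need special handling
```

Anything in the unmapped list is where a converter has to make a design decision, since the slide gives no counterpart for it.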

  16. Ontology Construction: An Unavoidable Problem • Semantic annotation tasks require ontologies. • The ontology for a specific semantic annotation task is not guaranteed to be available.

  17. Ontology Construction: General and Special • Generally speaking • Until now, the mainstream approach has been manual construction • Automatic and semi-automatic ontology generation: many research papers, few or no practical systems, a very hard problem • Specific to semantic annotation • Very dynamic and varied domains • Heavily overlapping information • Limited scope for any one web page • Flat structure

  18. Ontology Construction: Knowledge Reuse • “What has been will be again, what has been done will be done again; there is nothing new under the sun.” (The Holy Bible, Ecclesiastes, 1:9, NIV translation) • A “new” ontology is a new assembly with unions and projections of several pre-existing ontologies.

  19. Architecture on Dynamically Assembling • Diagram labels: Collection of Knowledge; Selected Knowledge Components; Domain of Interest Web Page; Assembled Ontology • (1) Knowledge-component selection • (2) Ontology assembly
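
The idea of assembling a "new" ontology from unions and projections of existing ones, after selecting relevant knowledge components, can be sketched as follows. Everything here is an assumption made for illustration: an ontology is reduced to a set of concept names plus (source, name, target) relationship triples, and selection is reduced to "the component contains at least one concept of interest". The proposal does not specify either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    """Toy ontology: concept names and (source, name, target) triples."""
    concepts: frozenset
    relationships: frozenset = frozenset()


def project(onto, keep):
    """Keep only the given concepts and the relationships among them."""
    kept = onto.concepts & frozenset(keep)
    rels = frozenset(r for r in onto.relationships
                     if r[0] in kept and r[2] in kept)
    return Ontology(kept, rels)


def union(a, b):
    """Merge two ontologies concept-wise and relationship-wise."""
    return Ontology(a.concepts | b.concepts,
                    a.relationships | b.relationships)


def assemble(components, concepts_of_interest):
    """(1) Select relevant components, (2) assemble their projections."""
    result = Ontology(frozenset())
    for component in components:
        part = project(component, concepts_of_interest)
        if part.concepts:  # selection: skip components with nothing relevant
            result = union(result, part)
    return result


if __name__ == "__main__":
    car = Ontology(frozenset({"Car", "Price", "Engine"}),
                   frozenset({("Car", "has", "Price"),
                              ("Car", "has", "Engine")}))
    seller = Ontology(frozenset({"Seller", "Phone"}),
                      frozenset({("Seller", "has", "Phone")}))
    ad = assemble([car, seller], {"Car", "Price", "Seller", "Phone"})
    print(sorted(ad.concepts))
```

A real assembler would also have to reconcile concepts that appear under different names in different components, which this sketch ignores.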

  20. Thesis Statement • Propose a new solution to perform semantic annotation on normal HTML web pages, specifically: • apply ontology-based automatic IE techniques • augment OWL with a knowledge-recognition extension • combine the conceptual annotator and the layout-based annotator • dynamically assemble a new domain ontology for an annotation task

  21. Standard Evaluation • Annotation performance • Precision • Recall • Speed of execution • Test bed • 5 to 10 different domains, with over 10 lexical concepts in each domain ontology • 20 to 50 web pages in each domain
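
Precision and recall as named above can be computed as in the sketch below. It assumes an annotation is a (concept, value) pair and that a predicted annotation counts as correct only if it exactly matches a gold-standard one; the proposal does not state its matching criterion, so exact match is an assumption.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of annotations.

    precision = correct / number predicted
    recall    = correct / number in the gold standard
    An empty denominator yields 0.0 rather than raising an error.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = {("Price", "$9,500"), ("Year", "1999"), ("Make", "Ford")}
    gold = {("Price", "$9,500"), ("Year", "1999"),
            ("Make", "Honda"), ("Color", "red")}
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of three predictions are correct and two of four gold annotations are found, so precision is 2/3 and recall is 1/2.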

  22. Ontology Converter Test • A complete and sound check is costly and difficult to implement. • Our simple test • Start with an OSMX ontology A • Convert it to OWL and then transform it back into an OSMX ontology B • Use both A and B to annotate the same set of web pages (say 30 to 50 web pages) • Annotation results should be identical
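
The round-trip test described on this slide can be sketched as a small harness. The converter and the annotator are passed in as callables because the proposal does not define their interfaces; `to_owl`, `to_osmx`, and `annotate` are hypothetical names, and the stand-ins in the usage example are toys, not real converters.

```python
def round_trip_check(ontology_a, pages, to_owl, to_osmx, annotate):
    """Return the pages on which ontologies A and B annotate differently.

    ontology_a -- the starting OSMX ontology (any representation)
    to_owl     -- callable converting OSMX to OWL (hypothetical)
    to_osmx    -- callable converting OWL back to OSMX (hypothetical)
    annotate   -- callable (ontology, page) -> comparable annotation result
    """
    ontology_b = to_osmx(to_owl(ontology_a))
    mismatches = []
    for page in pages:
        if annotate(ontology_a, page) != annotate(ontology_b, page):
            mismatches.append(page)
    return mismatches


if __name__ == "__main__":
    # Toy stand-ins: an ontology is a set of words; annotating a page
    # returns the sorted words of the page that the ontology knows.
    def annotate(ontology, page):
        return sorted(w for w in page.split() if w in ontology)

    ontology = {"car", "price"}
    pages = ["red car low price", "boat for sale"]

    lossless = round_trip_check(ontology, pages,
                                to_owl=set, to_osmx=set, annotate=annotate)
    lossy = round_trip_check(ontology, pages,
                             to_owl=lambda o: o - {"price"}, to_osmx=set,
                             annotate=annotate)
    print(lossless)  # pages that differ after a lossless round trip
    print(lossy)     # pages that differ when conversion drops a concept
```

An empty result is evidence that the conversion preserved what the annotator needs on the tested pages; as the slide notes, it is not a proof of completeness or soundness.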

  23. Two-Layer Annotation Model Evaluation • Standard evaluation • In addition • About five large web sites with machine-generated web pages, each of which contains at least dozens of web pages

  24. Dynamic Ontology Assembler Evaluation • Regular precision and recall study according to selected knowledge components • A pilot study on when the ontology assembler works better than manual ontology construction • Record the time to use a tool to create an ontology from scratch • Record the time to assemble the same ontology • Compare their differences and the special conditions for each case • Make empirical suggestions about how to build a knowledge base that favors ontology assembly

  25. Delimitations • Automatic ontology creation from scratch • Annotation storing, indexing, and sharing mechanisms • Semantic annotation for multimedia content • Parallel or distributed computing to further scale the semantic annotation system to a large number of web pages

  26. Contributions • To convert current web pages into machine-understandable semantic web pages • Producing a purely ontology-driven semantic annotator using an ontology-based IE wrapper • Proposing a novel two-layer annotation model to do fast, accurate, and resilient annotation • Studying a dynamic ontology assembler that helps maximize the reuse of existing knowledge and minimize the load of manual ontology creation • Implementing an ontology converter so that this work is useful to the rest of the semantic web community.
