This dissertation proposal explores the use of data-extraction ontologies for annotating web documents so that they become part of the semantic web. It covers the motivation, the current state of research, and the open problems, and proposes a new ontology-driven paradigm for semantic annotation. The proposal also covers ontology construction, knowledge reuse, and a solution for performing semantic annotation on HTML web pages.
Annotating Documents for the Semantic Web Using Data-Extraction Ontologies Dissertation Proposal Yihong Ding
Motivation • The representation of web content limits its usability • A machine understandable web • Shared, explicit, formal conceptualizations (ontologies) • The semantic web
A Problem • How can the current web be transformed into the semantic web?
A Solution: Semantic Annotation • Add explicit, formal, and unambiguous metadata to web documents • Explicit: publicly accessible • Formal: publicly agreeable • Unambiguous: publicly identifiable
Annotation Representation [Figure: Implicit Annotation vs. Explicit Annotation]
Semantic Annotation: Current Research Status • Manual annotation through friendly interfaces [Annotea, etc.] • Automatic annotation with ontology generation [SCORE] • Automatic annotation using automated IE tools based on pre-defined ontologies [SemTag, MnM, etc.]
Current Automatic Annotator: A Typical Paradigm • (1) Extraction • (2) Alignment • (3) Annotation [Figure labels: Document; Non-ontology-based IE Wrapper; Rules and extracting categories; Domain Ontology]
Current Automatic Annotator: Problems • (1) Problem of data recognition • (2) Problem of concept disambiguation • (3) Problem of annotation formatting, storing, indexing, sharing • (4) Problem of assembling ontologies [Figure labels: Document; Non-ontology-based IE Wrapper; Rules and extracting categories; Domain Ontology]
“Main Drawback of Using Automated IE” [Kiryakov04] • “none of these approaches expects an input or produces output with respect to ontologies” • “a set of heuristics for post-processing and mapping of the IE results to an ontology … not sufficient for large-scale, domain-independent semantic annotation.” • “IE and wrapper induction techniques need to use the ontology more directly during the process of extraction.”
Ontology-driven Paradigm (Data-Extraction Ontology) for Semantic Annotation [Figure: Ontology-based IE Wrapper vs. Non-ontology-based IE Wrapper, each applied to a Document]
Ontology-driven Paradigm for Semantic Annotation: Some Arguments • Resiliency w.r.t. web page layouts (helps scale to large sets of web pages) • Adaptiveness w.r.t. domain specifications (helps scale to large domains) • Creation of ontologies: still a problem but no longer a drawback • Speed of execution: still a drawback (but we propose a solution next)
Two-Layer Annotation Model [Figure: Sample Annotation Process (Document; Conceptual Annotator using an ontology-based IE tool) and Massive Annotation Process (Similar Documents; Structural Annotator)]
Structural Annotator • Major components • HTML hierarchical path that leads to concept locations • Local context around locations • Dependencies among multiple semantic categories • Significance • Identify both categories and their semantic meanings
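The first component, the HTML hierarchical path, can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not the proposed annotator: the names `PathRecorder`, `learn_paths`, and `annotate_by_path`, the car-ad pages, and the concept labels are all hypothetical, and the local-context and dependency components are omitted. It learns tag paths from one sample page whose values were already labeled (as a conceptual annotator might do) and reuses those paths on a similarly laid-out page.

```python
from html.parser import HTMLParser


class PathRecorder(HTMLParser):
    """Records each non-empty text node with the tag path that leads to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.cells = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.cells.append(("/".join(self.stack), text))


def text_by_path(html):
    recorder = PathRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.cells


def learn_paths(sample_html, labeled):
    """labeled maps a text value on the sample page to its concept name."""
    return {path: labeled[text]
            for path, text in text_by_path(sample_html) if text in labeled}


def annotate_by_path(html, path_to_concept):
    """Label every text node whose path was seen on the sample page."""
    return [(path_to_concept[path], text)
            for path, text in text_by_path(html) if path in path_to_concept]


# Hypothetical sample page and the labels a conceptual annotator produced.
sample = "<html><body><div><h1>1999 Honda Civic</h1><span>$4,500</span></div></body></html>"
labels = {"1999 Honda Civic": "Car", "$4,500": "Price"}
paths = learn_paths(sample, labels)

# A second page generated from the same layout.
similar = "<html><body><div><h1>2004 Ford Focus</h1><span>$6,200</span></div></body></html>"
result = annotate_by_path(similar, paths)
print(result)
```

A path without sibling indices cannot tell apart two cells under the same parent tag, which is one reason the slide lists local context and dependencies as further components.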
Ontology Factors in Semantic Annotation Tasks • Knowledge specification • Semantic web community • Web Ontology Language (OWL) • Knowledge instantiation • IE and database community • Object-oriented System Model in XML (OSMX)
Ontology Conversion • Similarities (OWL vs. OSMX) • Class vs. object set • ObjectProperty vs. relationship set • Cardinality restriction vs. participation constraint • subClassOf vs. is-a relationship • Unique features • OWL • subPropertyOf • symmetric and transitive properties • namespace declaration • ontology importing • OSMX • arbitrary n-ary relationship sets • data frames • general constraints
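The correspondences above can be written down as a lookup table. The following Python sketch is only an illustration under assumptions: the string keys are informal labels taken from the slide, not identifiers from an OWL or OSMX library, and `split_by_convertibility` is a hypothetical helper. It separates constructs that have an OSMX counterpart from those that a converter would need to handle specially.

```python
# Correspondences listed on the slide (OWL construct -> OSMX construct).
OWL_TO_OSMX = {
    "Class": "object set",
    "ObjectProperty": "relationship set",
    "cardinality restriction": "participation constraint",
    "subClassOf": "is-a relationship",
}


def split_by_convertibility(owl_constructs):
    """Separate constructs with an OSMX counterpart from those without one."""
    converted = {c: OWL_TO_OSMX[c] for c in owl_constructs if c in OWL_TO_OSMX}
    unmatched = [c for c in owl_constructs if c not in OWL_TO_OSMX]
    return converted, unmatched


converted, unmatched = split_by_convertibility(
    ["Class", "ObjectProperty", "subPropertyOf"]
)
print(converted)
print(unmatched)
```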
Ontology Construction: An Unavoidable Problem • Semantic annotation tasks require ontologies. • The ontology for a specific semantic annotation task is not guaranteed to be available.
Ontology Construction: General and Special • Generally speaking • Until now, the mainstream approach is manual construction • Automatic and semi-automatic ontology generation: many research papers, few if any practical; a very hard problem • Specific to semantic annotation • Highly dynamic and varied domains • Heavily overlapping information • Limited scope for any one web page • Flat structure
Ontology Construction: Knowledge Reuse • “What has been will be again, what has been done will be done again; there is nothing new under the sun.” (The Holy Bible, Ecclesiastes, 1:9, NIV translation) • A “new” ontology is a new assembly of unions and projections of several pre-existing ontologies.
Architecture for Dynamic Ontology Assembly • (1) Knowledge-component selection • (2) Ontology assembly [Figure labels: Web Page; Domain of Interest; Collection of Knowledge; Selected Knowledge Components; Assembled Ontology]
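Assembly by unions and projections can be sketched in Python as follows. This is a toy under stated assumptions, not the proposed assembler: an ontology is reduced to a set of concept names plus (subject, relation, object) triples, the `project` and `union` helpers and both knowledge components are hypothetical, and the selection step is done by hand rather than from a web page.

```python
def project(ontology, keep):
    """Keep only the named concepts and the relations among them."""
    concepts = ontology["concepts"] & keep
    relations = {(a, r, b) for a, r, b in ontology["relations"]
                 if a in concepts and b in concepts}
    return {"concepts": concepts, "relations": relations}


def union(*ontologies):
    """Merge concepts and relations of several ontologies."""
    return {
        "concepts": set().union(*(o["concepts"] for o in ontologies)),
        "relations": set().union(*(o["relations"] for o in ontologies)),
    }


# Two hypothetical pre-existing knowledge components.
vehicle = {
    "concepts": {"Car", "Make", "Engine", "Year"},
    "relations": {("Car", "has", "Make"), ("Car", "has", "Engine"),
                  ("Car", "has", "Year")},
}
sale = {
    "concepts": {"Item", "Price", "Seller", "Warranty"},
    "relations": {("Item", "has", "Price"), ("Item", "soldBy", "Seller"),
                  ("Item", "has", "Warranty")},
}

# (1) Selection, hand-picked here; (2) assembly of the selected parts.
car_ad = union(
    project(vehicle, {"Car", "Make", "Year"}),
    project(sale, {"Item", "Price"}),
)
print(sorted(car_ad["concepts"]))
```

The sketch leaves the two projected parts unconnected; relating `Car` to `Item` would need a further alignment step that is outside this illustration.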
Thesis Statement Propose a new solution for performing semantic annotation on ordinary HTML web pages, specifically • apply ontology-based automatic IE techniques • augment OWL with a knowledge-recognition extension • combine a conceptual annotator and a layout-based annotator • dynamically assemble a new domain ontology for an annotation task
Standard Evaluation • Annotation performance • Precision • Recall • Speed of execution • Test bed • 5 to 10 different domains, with over 10 lexical concepts in each domain ontology • 20 to 50 web pages in each domain
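For reference, precision is the fraction of produced annotations that are correct, and recall is the fraction of expected annotations that were produced. A minimal Python sketch, assuming annotations are compared as exact (concept, value) pairs; the function name and the car-ad data are hypothetical:

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for two collections of annotations."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hand-made gold standard and annotator output for one page.
gold = {("Price", "$4,500"), ("Make", "Honda"),
        ("Year", "1999"), ("Model", "Civic")}
extracted = {("Price", "$4,500"), ("Make", "Honda"), ("Year", "2004")}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```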
Ontology Converter Test • A complete and sound check is costly and difficult to implement. • Our simple test • Start with an OSMX ontology A • Convert it to OWL and then transform it back into an OSMX ontology B • Use both A and B to annotate the same set of web pages (say 30 to 50 web pages) • Annotation results should be identical
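The round-trip test can be sketched in Python as below. Everything concrete here is a stand-in chosen for illustration: an "ontology" is a dictionary of concept names to regular expressions, `to_owl` and `to_osmx` are toy converters, and `annotate` is a toy annotator. Only the shape of the check (A, then OWL, then B, then compare annotations page by page) follows the slide.

```python
import re


def to_owl(osmx):
    """Toy stand-in for the OSMX -> OWL converter."""
    return {"classes": sorted(osmx.items())}


def to_osmx(owl):
    """Toy stand-in for the OWL -> OSMX converter."""
    return dict(owl["classes"])


def annotate(ontology, page):
    """Toy annotator: each concept is recognized by one regular expression."""
    return {concept: re.findall(pattern, page)
            for concept, pattern in ontology.items()}


def round_trip_mismatches(ontology_a, pages):
    """Return the pages on which A and its round-tripped copy B disagree."""
    ontology_b = to_osmx(to_owl(ontology_a))
    return [page for page in pages
            if annotate(ontology_a, page) != annotate(ontology_b, page)]


ontology_a = {"Price": r"\$\d[\d,]*", "Year": r"\b(?:19|20)\d{2}\b"}
pages = [
    "1999 Honda Civic, $4,500",
    "2004 Ford Focus for $6,200 or best offer",
]
mismatches = round_trip_mismatches(ontology_a, pages)
print(mismatches)  # an empty list means the results are identical
```

Identical annotations on the test pages do not prove the converter is sound; they only fail to expose a difference, which matches the slide's framing of this as a simple test.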
Two-Layer Annotation Model Evaluation • Standard evaluation • In addition • About five large web sites with machine-generated web pages, each site containing at least dozens of pages
Dynamic Ontology Assembler Evaluation • Regular precision and recall study according to selected knowledge components • A pilot study on when the ontology assembler works better than manual ontology construction • Record the time to create an ontology from scratch with a tool • Record the time to assemble the same ontology • Compare the differences and the special conditions for each case • Make empirical suggestions about how to build a knowledge base that favors ontology assembly
Delimitations • Automatic ontology creation from scratch • Annotation storing, indexing, and sharing mechanisms • Semantic annotation for multimedia content • Parallel or distributed computing to further scale the semantic annotation system to a large number of web pages
Contributions • Converting current web pages into machine-understandable semantic web pages • Producing a purely ontology-driven semantic annotator using an ontology-based IE wrapper • Proposing a novel two-layer annotation model for fast, accurate, and resilient annotation • Studying a dynamic ontology assembler that helps maximize the reuse of existing knowledge and minimize the load of manual ontology creation • Implementing an ontology converter so that this work is useful to the rest of the semantic web community