A New Web Semantic Annotator Enabling A Machine Understandable Web

BYU Spring Research Conference 2005
Yihong Ding
Sponsored by NSF

Machine Understandable Web
• Content is represented in ontologies, that is, in
  • commonly shared,
  • explicitly defined,
  • generic conceptualizations.
• Also known as the Semantic Web

Why Machine Understandable?
• Meaningful data
• Exchangeable information
• Interoperable programs/services
• “… allows data to be shared and reused across application, enterprise, and community boundaries …” (Tim Berners-Lee et al., 2001)

Semantic Annotation: A Way to Achieve Machine Understandability
• Add explicit, formal, and unambiguous notes to web documents
  • Explicit: publicly accessible
  • Formal: publicly agreeable
  • Unambiguous: publicly identifiable
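To make the three properties concrete, here is a minimal sketch of what one such note could look like as a data record. The `Annotation` class, the `annotate` helper, and the `example.org` URLs are hypothetical illustrations, not the format used by the annotator described in this talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One note attached to a span of text in a web document (hypothetical format)."""
    document_url: str  # explicit: the note points at a publicly accessible document
    start: int         # character offset where the annotated value begins
    end: int           # character offset just past the annotated value
    concept_uri: str   # formal and unambiguous: identifier of a shared ontology concept


def annotate(document_url: str, text: str, value: str, concept_uri: str) -> Optional[Annotation]:
    """Attach a concept to the first occurrence of `value` in `text`, if any."""
    start = text.find(value)
    if start < 0:
        return None
    return Annotation(document_url, start, start + len(value), concept_uri)


text = "2003 Honda Accord, $12,500"
note = annotate(
    "http://example.org/ad1.html",        # assumed document location
    text,
    "$12,500",
    "http://example.org/onto/car#Price",  # assumed ontology concept identifier
)
```

Because the note refers to a concept identifier rather than a free-text label, any program that shares the ontology can interpret the annotated span the same way.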

Semantic Annotation Using Automated IE Engines
[Figure labels: Ontology-based IE Wrapper, Document; Non-ontology-based IE Wrapper, Document]

Augmentations for the Annotator
Semantic annotator using data-extraction ontologies:
• a two-layer annotation model to achieve fast, highly accurate, and resilient semantic annotation
• a divide-and-conquer style architecture to scale the system to large domains
• a web ontology language augmentation to complement OWL for semantic annotation purposes

Two-Layer Annotation Model
[Figure labels: Same-Layout Documents; Massive Annotation Process; Structural Annotator; Document; Sample Annotation Process; Conceptual Annotator using ontology-based IE tool]

Two-Layer Annotation Model, Benefits
• Achieves both resiliency and fast execution
• Requires no training to generate structural annotators
• Requires no labeling of the results from structural annotators
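The two-layer idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the system presented here: a document is reduced to a map from layout path to text, and two regular expressions stand in for the recognizers of a data-extraction ontology. All names (`RECOGNIZERS`, `conceptual_annotate`, `induce_structural_annotator`, `structural_annotate`) are hypothetical.

```python
import re

# Hypothetical stand-in for a data-extraction ontology: concept -> value recognizer.
RECOGNIZERS = {
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Year": re.compile(r"^(19|20)\d{2}$"),
}


def conceptual_annotate(doc):
    """Conceptual layer: test every text node against every concept recognizer."""
    labels = {}
    for path, text in doc.items():
        for concept, pattern in RECOGNIZERS.items():
            if pattern.match(text):
                labels[path] = concept
                break
    return labels


def induce_structural_annotator(sample_doc):
    """Derive a layout-path -> concept map from one conceptually annotated sample."""
    return conceptual_annotate(sample_doc)


def structural_annotate(doc, layout_map):
    """Structural layer: label values by layout position only, no recognizers run."""
    return {layout_map[path]: text for path, text in doc.items() if path in layout_map}


# Two documents assumed to share the same layout.
sample = {"/table/tr/td[1]": "2003", "/table/tr/td[2]": "$12,500", "/table/tr/td[3]": "Red"}
other = {"/table/tr/td[1]": "1999", "/table/tr/td[2]": "$4,200", "/table/tr/td[3]": "Blue"}

layout_map = induce_structural_annotator(sample)
annotations = structural_annotate(other, layout_map)
```

In this sketch the layout map is produced automatically from the sample, so no training data is prepared by hand, and the concept labels carry over to every same-layout document without running the recognizers again.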

Scalability Issues
• Large domains containing many concepts
• Large annotation tasks dealing with many web pages

Observation
• A large domain is a combination of several small domains.
• Consistently clustered domains exist, where each such domain is
  • composed of the same cluster of concepts
  • consistent with any larger domain in which it participates
  • usually small, with few concepts

Divide-and-Conquer Style Architecture for Scalability Issue
[Figure: a document, a collection of small atomic domain ontologies (……), and two numbered steps: (1) text classification, yielding the selected domain ontologies; (2) scalable annotation of the document]

Divide-and-Conquer, Benefits
• Compared to large ontologies, small ontologies are
  • simpler to construct
  • faster to execute
  • easier to check and update
  • more convenient to reuse
• Identifies the range of an ontology dynamically at the web page level
• Avoids the problem of narrowing a large domain ontology down to the web page level
• Maximizes the reuse of existing ontologies
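A rough sketch of the selection step, assuming each small atomic ontology can be summarized by a few indicative keywords and that a plain keyword-overlap test stands in for the text classifier. The ontology names, keyword sets, and the `select_ontologies` function are hypothetical and are not taken from the system described in this talk.

```python
# Hypothetical collection of small atomic domain ontologies,
# each reduced here to a set of indicative keywords.
ATOMIC_ONTOLOGIES = {
    "Car": {"mileage", "sedan", "engine"},
    "Price": {"price", "usd", "cost"},
    "Address": {"street", "city", "zip"},
}


def select_ontologies(page_text, min_hits=1):
    """Pick the atomic ontologies whose keywords occur in the page.

    A keyword-overlap test is an assumed placeholder for a real text classifier.
    """
    words = set(page_text.lower().split())
    return sorted(
        name
        for name, keywords in ATOMIC_ONTOLOGIES.items()
        if len(words & keywords) >= min_hits
    )


page = "Used sedan with low mileage and a fair price"
selected = select_ontologies(page)
# Only the selected ontologies would then be handed to the annotator for this page.
```

The set of ontologies is chosen per page, so the annotator never loads concepts that the page does not mention, and each small ontology can be reused across any larger domain it belongs to.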

Ontology Representation
• Two ontology languages
  • Data-extraction ontology (OSMX)
  • Semantic web ontology (OWL)
• Language unification

Contributions
• Automatic semantic annotator using an ontology-based IE wrapper
• Two-layer annotation: layout-based annotator on top of conceptual annotator
• Divide-and-conquer style solution to scale the annotation process to a large number of concepts
• Web ontology language unification
