Ontology-based Annotation
Sergey Sosnovsky @ PAWS @ SIS @ PITT

Outline
• O-based Annotation
• Conclusion
• Questions
Why Do We Need Annotation
What is Added by O-based Annotation:
• Ontology-driven processing (effective formal reasoning)
• Connecting to other O-based services (O-mapping, O-visualization, …)
• Unified vocabulary
• Connecting to the rest of SW knowledge
• Annotation-based services:
  • Integration of dispersed information (knowledge-based linking)
  • Better indexing and retrieval (based on document semantics)
  • Content-based adaptation (modeling document content in terms of the domain model)
• Knowledge management:
  • Organizations' repositories as mini-Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
• Collaboration support:
  • Knowledge sharing and communication
Definition
O-based Annotation is the process of creating a mark-up of Web documents using a pre-existing ontology and/or populating knowledge bases with the marked-up documents.

Example: annotating the sentence "Michael Jordan plays basketball" (diagram reconstructed as triples; see the sketch below):
• our:Athlete —our:plays→ our:Sports (schema level)
• MichaelJordan rdf:type our:Athlete
• Basketball rdf:type our:Sports
• MichaelJordan our:plays Basketball
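To make the example concrete, here is a minimal sketch of these triples built with Python's rdflib. The our: namespace URI is a hypothetical placeholder, not a real ontology:

```python
# Building the "Michael Jordan plays basketball" annotation as RDF
# triples with rdflib. Namespace URI is illustrative only.
from rdflib import Graph, Namespace, RDF, RDFS

OUR = Namespace("http://example.org/our#")  # hypothetical namespace

g = Graph()
g.bind("our", OUR)

# Schema level: Athletes play Sports
g.add((OUR.plays, RDFS.domain, OUR.Athlete))
g.add((OUR.plays, RDFS.range, OUR.Sports))

# Instance level: the annotation of the sentence itself
g.add((OUR.MichaelJordan, RDF.type, OUR.Athlete))
g.add((OUR.Basketball, RDF.type, OUR.Sports))
g.add((OUR.MichaelJordan, OUR.plays, OUR.Basketball))

print(g.serialize(format="turtle"))
```

Serializing the graph (here to Turtle) is exactly the "populating knowledge bases" half of the definition: the marked-up facts become machine-processable statements.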
List of Tools
• AeroDAML / AeroSWARM
• Annotea / Annozilla
• Armadillo
• AktiveDoc
• COHSE
• GOA
• KIM Semantic Annotation Platform
• MagPie
• Melita
• MnM
• OntoAnnotate
• Ontobroker
• OntoGloss
• ONTO-H
• Ont-O-Mat / S-CREAM / CREAM
• OntoSeek
• PANKOW
• SHOE Knowledge Annotator
• Seeker
• SemantiK
• SemTag
• SMORE
• Yawas
• …

Information Extraction Tools:
• Alembic
• Amilcare / T-REX
• ANNIE
• FASTUS
• LaSIE
• Proteus
• SIFT
• …
Important Characteristics
• Automation of annotation (manual / semiautomatic / automatic / editable)
• Ontology-related issues:
  • pluggable ontology (yes / no)
  • ontology language (RDFS / DAML+OIL / OWL / …)
  • local / anywhere access
  • ontology elements available for annotation (concepts / instances / relations / triples)
  • where annotations are stored (in the annotated document / on a dedicated server / where specified)
  • annotation format (XML / RDF / OWL / …)
• Annotated documents:
  • document kinds (text / multimedia)
  • document formats (plain text / HTML / PDF / …)
  • document access (local / Web)
• Architecture / interface / interoperability (standalone tool / web interface / web component / API / …)
• Annotation scale (large: the WWW size / small: a hundred documents)
• Existing documentation / tutorial availability
SMORE
• Manual annotation
• OWL-based markup
• Simultaneous O modification (if necessary)
• ScreenScraper mines metadata from annotated pages and suggests it as candidates for the mark-up
• Post-annotation O-based inference
(Illustrated with the same "Michael Jordan plays basketball" annotation graph as in the Definition slide.)
Problems of Manual Annotation
• Expensive / time-consuming
• Difficult / error-prone
• Subjective (two people annotating the same documents produce different annotations in 15–30% of cases)
• Never-ending:
  • new documents
  • new versions of ontologies
• Annotation storage problem: where?
• Trust in the owner's annotation:
  • incompetence
  • spam (Google does not use <META> info)

Solution: dedicated automatic annotation services ("search-engine"-like)
Automatic O-based Annotation
• Supervised:
  • MnM
  • S-CREAM
  • Melita & AktiveDoc
• Unsupervised:
  • SemTag / Seeker
  • Armadillo
  • AeroSWARM
MnM
• Ontology-based annotation interface:
  • Ontology browser (rich navigation capabilities)
  • Document browser (usually a Web browser)
  • Annotation is mainly based on select-drag-and-drop association of text fragments with ontology elements
• A built-in or external ML component classifies the main corpus of documents
• Activity flow (sketched below):
  • Markup: a human user manually annotates a training set of documents with ontology elements
  • Learn: a learning algorithm is run over the marked-up corpus to learn extraction rules
  • Extract: an IE mechanism is selected and run over a set of documents
  • Review: a human user reviews the results and corrects them if necessary
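A minimal sketch of this Markup → Learn → Extract → Review loop in Python. All names here (Annotation, ExtractionRuleLearner, the toy rule learning) are hypothetical placeholders, not the actual MnM or Amilcare API:

```python
# Toy MnM-style loop: learn extraction "rules" from a manually
# annotated training set, apply them to unseen documents.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    text_fragment: str      # the annotated span of document text
    ontology_element: str   # ontology element it is associated with

@dataclass
class ExtractionRuleLearner:
    """Hypothetical stand-in for an IE component such as Amilcare."""
    rules: list = field(default_factory=list)

    def learn(self, annotated_docs):
        # Toy "learning": remember fragment -> element pairs as rules.
        for doc, annotations in annotated_docs:
            for a in annotations:
                self.rules.append((a.text_fragment, a.ontology_element))

    def extract(self, doc):
        # Apply learned rules to an unseen document.
        return [Annotation(frag, elem)
                for frag, elem in self.rules if frag in doc]

# Markup: a user associates text fragments with ontology elements.
training = [("Michael Jordan plays basketball",
             [Annotation("Michael Jordan", "our:Athlete"),
              Annotation("basketball", "our:Sports")])]

learner = ExtractionRuleLearner()
learner.learn(training)                                   # Learn
found = learner.extract("Michael Jordan retired twice")   # Extract
print(found)                            # Review: user corrects output
```

A real system would learn generalized linguistic rules rather than literal string matches, but the division of labor between human markup and machine extraction is the same.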
Amilcare and T-REX
• Amilcare:
  • Automatic IE component
  • Used in at least five O-based annotation tools (Melita, MnM, OntoAnnotate, OntoMat, SemantiK)
  • Released to about 50 industrial and academic sites
  • Java API
  • Recently succeeded by T-REX
PANKOW
• Input: a Web page
• Step 1: The Web page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger finds candidate proper nouns)
  • Result 1: a set of candidate proper nouns
• Step 2: The system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns
  • Result 2: a set of hypothesis phrases
• Step 3: Google is queried for the hypothesis phrases
  • Result 3: the number of hits for each hypothesis phrase
• Step 4: The system sums the query results into a total for each instance-concept pair, then categorizes each candidate proper noun into its highest-ranked concept (see the sketch below)
  • Result 4: an ontologically annotated Web page
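A minimal sketch of the PANKOW categorization idea: generate hypothesis phrases from linguistic patterns, count how often the Web "agrees" with each, and keep the highest-ranked concept. The patterns are simplified, and web_hit_count() is a hypothetical stand-in for a real search-engine hit-count query:

```python
# Toy PANKOW-style categorization by Web hit counts.
from collections import defaultdict

PATTERNS = [                      # a few Hearst-style patterns
    "{instance} is a {concept}",
    "{concept}s such as {instance}",
    "{instance} and other {concept}s",
]

def web_hit_count(phrase: str) -> int:
    """Hypothetical stand-in for a search-engine hit-count query.
    Here a tiny hand-made table plays the role of the Web."""
    toy_counts = {"Atlanta is a city": 980_000,
                  "Atlanta is a hotel": 12_000}
    return toy_counts.get(phrase, 0)

def categorize(candidates, concepts):
    scores = defaultdict(int)
    for instance in candidates:           # Step 2: hypothesis phrases
        for concept in concepts:
            for pattern in PATTERNS:
                phrase = pattern.format(instance=instance, concept=concept)
                scores[(instance, concept)] += web_hit_count(phrase)  # Step 3
    # Step 4: for each instance, keep the highest-scoring concept.
    return {inst: max(concepts, key=lambda c: scores[(inst, c)])
            for inst in candidates}

print(categorize({"Atlanta"}, {"city", "hotel"}))  # {'Atlanta': 'city'}
```

The "community wisdom" referenced in the conclusions is exactly this hit-count aggregation: the Web votes on which concept fits the instance.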
SemTag / Seeker
• Developed by IBM
• ~264 million Web pages
• ~72 thousand concepts (the TAP taxonomy)
• 434 million automatically disambiguated semantic tags
• Spotting pass (see the sketch below):
  • Documents are retrieved from the Seeker store and tokenized
  • Tokens are matched against the TAP concepts
  • Each resulting label is saved with ten words on either side as a "window" of context around the candidate object
• Learning pass:
  • A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy; the TBD (taxonomy-based disambiguation) algorithm is used
• Tagging pass:
  • The "windows" are scanned once more to disambiguate each reference and determine a TAP object
  • A record containing the URL, the reference, and any other associated metadata is entered into a database of final results
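A minimal sketch of the spotting pass. The label set and naive tokenizer are illustrative assumptions; only the window width of ten words comes from the description above:

```python
# Toy SemTag-style spotting pass: match tokens against taxonomy labels
# and keep a ten-word context window around each hit.

TAP_LABELS = {"jordan", "basketball"}   # toy stand-in for TAP concept labels
WINDOW = 10                             # words of context on either side

def spot(text: str):
    tokens = text.lower().split()       # naive tokenizer for illustration
    spots = []
    for i, tok in enumerate(tokens):
        if tok in TAP_LABELS:
            window = tokens[max(0, i - WINDOW): i + WINDOW + 1]
            spots.append({"label": tok, "position": i, "window": window})
    return spots

print(spot("Michael Jordan plays basketball for the Chicago Bulls"))
```

The later tagging pass would re-read these saved windows and use the learned term distributions to decide which TAP object each spot actually refers to.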
Conclusions
• Web-document annotation is a necessary thing
• O-based annotation brings benefits (O-based post-processing, unified vocabularies, etc.)
• Manual annotation is a bad thing
• Automatic annotation is a good thing:
  • Supervised O-based annotation:
    • Useful O-based interface for annotating the training set
    • Traditional IE tools for textual classification
  • Unsupervised O-based annotation:
    • COHSE matches concept names from the ontology and a thesaurus against tokens from the text
    • PANKOW uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
    • SemTag uses concept names to match tokens, and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment
Questions?