Semantic web Bootstrapping & Annotation

Hassan Sayyadi • sayyadi@ce.sharif.edu • Semantic Web Research Laboratory • Computer Department • Sharif University of Technology

Outline • What is annotation? • Why use annotation? • Crawler • Annotation model • Annotation methods • Our Implementation

What is annotation? • People make notes to themselves in order to preserve ideas that arise during a variety of activities • The purpose of these notes is often to summarize, criticize, or emphasize specific phrases or events • Semantic annotation tags instances of ontology classes in the data and maps them to those classes.

Why use annotation? • Having the world's knowledge at one's fingertips seems possible. • The Internet is the platform for information. • Unfortunately, most of the information is provided in an unstructured and non-standardized form.

Why use annotation? (continued)

Crawler • A crawler is a program that traverses the Internet by following links from one page to the next.
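The traversal described on this slide can be sketched as a breadth-first walk over links. To keep the sketch runnable offline, it assumes the link structure is already available as an in-memory map from a page to its outgoing links; the class name `LinkTraversal`, the `maxPages` limit, and the map are illustrative and not part of the original slides. A real crawler would fetch each page over HTTP and parse its links instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LinkTraversal {
    // Visits pages breadth-first, starting at the seed, and returns them in visit order.
    // 'links' is a hypothetical stand-in for "fetch the page and extract its links".
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String page = frontier.poll();
            visited.add(page);
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                // Set.add returns false for pages already queued, so no page is visited twice.
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

The `seen` set is what keeps the crawler from looping on pages that link back to each other.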

Focused crawler • Not all the Internet knowledge is required for every query. • This assumption seems reasonable because most people work on a restricted domain and do not need the knowledge of the whole Internet. • Searching the whole Internet in this case is very inefficient and expensive. • Free texts on the Internet contain varied information from diverse domains.

Focused crawler (continued) • The focus can be achieved by examining keywords • Problems: • "Understanding" the semantics of a document • Focusing too narrowly on one topic • Another way to focus is to use the Internet's connectivity (link) structure
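One simple reading of keyword-based focusing is to score each page by how many topic keywords it mentions and to follow its links only above a threshold. The sketch below is an illustration under that assumption; the class `KeywordFocus`, the scoring rule, and the threshold are hypothetical and not taken from the slides.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class KeywordFocus {
    // Returns the fraction of topic keywords that occur as whole words in the page text.
    static double relevance(String pageText, Set<String> keywords) {
        if (keywords.isEmpty()) {
            return 0.0;
        }
        // Split on anything that is not a letter or digit to get a bag of lowercase words.
        Set<String> words = new HashSet<>(Arrays.asList(
                pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")));
        int hits = 0;
        for (String keyword : keywords) {
            if (words.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (double) hits / keywords.size();
    }

    // Hypothetical crawl decision: expand a page only if it is relevant enough.
    static boolean shouldFollow(String pageText, Set<String> keywords, double threshold) {
        return relevance(pageText, keywords) >= threshold;
    }
}
```

A score like this only counts surface words, which is exactly the first problem listed on the slide: it does not capture what the document means.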

Annotation models • Mark in web page • Example: • SUT is one of the largest engineering schools in the Islamic Republic of Iran • <university>SUT</university> is one of the largest engineering schools in the <country>Islamic Republic of Iran</country>
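The in-page markup model can be illustrated with a small lookup-based tagger. This is a sketch only: it assumes the entities and their classes are already known (the map passed in is hypothetical), whereas finding them is the actual annotation problem discussed in the rest of the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineAnnotator {
    // Wraps every known entity mention in a tag named after its ontology class.
    static String annotate(String text, Map<String, String> entityToClass) {
        if (entityToClass.isEmpty()) {
            return text;
        }
        // Try longer names first so a short name cannot split a longer one.
        List<String> names = new ArrayList<>(entityToClass.keySet());
        names.sort((a, b) -> b.length() - a.length());
        StringBuilder alternatives = new StringBuilder();
        for (String name : names) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(name));
        }
        Matcher m = Pattern.compile("\\b(?:" + alternatives + ")\\b").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = entityToClass.get(m.group());
            String tagged = "<" + tag + ">" + m.group() + "</" + tag + ">";
            m.appendReplacement(out, Matcher.quoteReplacement(tagged));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling `annotate` on the example sentence with `SUT` mapped to `university` and `Islamic Republic of Iran` mapped to `country` yields the tagged sentence from the slide. The text is scanned in a single pass, so inserted tags are never re-annotated.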

Annotation models (continued) • Generate RDF • Example: • SUT is one of the largest engineering schools in the Islamic Republic of Iran • <rdf:Description rdf:about="http://sharif.edu/#SUT"> <rdf:type>university</rdf:type> <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/> </rdf:Description> <rdf:Description rdf:about="http://sharif.edu/#Islamic+Republic+of+Iran"> <rdf:type>country</rdf:type> </rdf:Description>
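The RDF fragment on this slide can be produced with plain string building. The sketch below mirrors the shape of the slide's example and nothing more: the class `RdfSketch` and its method names are hypothetical, the output is a fragment without the enclosing `rdf:RDF` element or namespace declarations, and no XML escaping is applied.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class RdfSketch {
    // Base URI taken from the slide's example.
    static final String BASE = "http://sharif.edu/#";

    // Builds a resource URI; URLEncoder turns spaces into '+', as in the slide.
    static String resourceUri(String name) {
        try {
            return BASE + URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Emits one rdf:Description with a type and a single property pointing at another resource.
    static String describe(String subject, String type, String property, String object) {
        return "<rdf:Description rdf:about=\"" + resourceUri(subject) + "\">\n"
                + "  <rdf:type>" + type + "</rdf:type>\n"
                + "  <SHARIF:" + property + " rdf:resource=\"" + resourceUri(object) + "\"/>\n"
                + "</rdf:Description>";
    }
}
```

For example, `RdfSketch.describe("SUT", "university", "be_in", "Islamic Republic of Iran")` produces the first description block of the slide. For anything beyond a demonstration, an RDF library would be a safer choice than string concatenation.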

Annotation methods • Manually • Semi-automatically • Automatically

Automatic Annotation • The fully automatic creation of semantic annotations is an unsolved problem. • Automatic semantic annotation for the natural language sentences in these pages is a daunting task and we are often forced to do it manually or semi-automatically using handwritten rules

Manual Annotation • Manual annotation is more easily accomplished today, using authoring tools, which provide an integrated environment for simultaneously authoring and annotating text. • However, the use of human annotators is often fraught with errors due to factors such as annotator familiarity with the domain, amount of training, personal motivation and complex schemas • Manual annotation is also an expensive process

Semi-automatic Annotation • To overcome the annotation acquisition bottleneck, semi-automatic annotation of documents has been proposed.

Semi-automatic annotation • assumptions: • vocabulary set is limited • word usage has patterns • semantic ambiguities are rare • terms and jargon of the domain appear frequently

Semantic Annotation Platform (SAP)

Multistrategy SAPs • Multistrategy SAPs are able to combine methods from both pattern-based and machine learning-based systems. • No SAP currently implements the multistrategy approach for semantic annotation, although it has been implemented in systems for ontology extraction (such as On-To-Knowledge)

Semi-automatic annotation (continued) • Example • I go to Shanghai • Link structure is more like an RDF graph

Accuracy of concepts and relations for different algorithms

Automatic annotation

Source preprocessing • Document Object Model (DOM) • Text Model • Layout Model • NLP Model

Information Identification • Operators • perform extraction actions on document access models • Retrieval, Check, Execute • Strategies • build operator sequences according to user time and quality requirements • Source Description • build operator sequences according to user time and quality requirements

Ontology population • The final stage of the overall process is to decide which hypothesis represents the extracted information to insert into the ontology • The module simulates insertions and calculates the cost according to the number of new instance creations, instance modifications or inconsistencies found
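The selection step can be pictured as a weighted cost over the three quantities named on the slide. The sketch below is an assumption about how such a cost might be combined: the weights, the class `PopulationCost`, and the rule "pick the cheapest hypothesis" are illustrative and not taken from the system being described.

```java
import java.util.List;

class PopulationCost {
    // One candidate interpretation of the extracted information.
    static final class Hypothesis {
        final String label;
        final int creations;        // new instances the insertion would create
        final int modifications;    // existing instances it would modify
        final int inconsistencies;  // inconsistencies found while simulating it

        Hypothesis(String label, int creations, int modifications, int inconsistencies) {
            this.label = label;
            this.creations = creations;
            this.modifications = modifications;
            this.inconsistencies = inconsistencies;
        }
    }

    // Hypothetical weights: inconsistencies are penalized far more than new instances.
    static final int CREATION_WEIGHT = 1;
    static final int MODIFICATION_WEIGHT = 2;
    static final int INCONSISTENCY_WEIGHT = 10;

    static int cost(Hypothesis h) {
        return CREATION_WEIGHT * h.creations
                + MODIFICATION_WEIGHT * h.modifications
                + INCONSISTENCY_WEIGHT * h.inconsistencies;
    }

    // Returns the hypothesis with the lowest simulated cost, or null for an empty list.
    static Hypothesis cheapest(List<Hypothesis> hypotheses) {
        Hypothesis best = null;
        for (Hypothesis h : hypotheses) {
            if (best == null || cost(h) < cost(best)) {
                best = h;
            }
        }
        return best;
    }
}
```

With these weights, a hypothesis that creates two new instances (cost 2) is preferred over one that modifies an instance and introduces an inconsistency (cost 12).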

Our implementation • Crawler: • Crawl all links that contain: • sharif.ir • sharif.edu • sharif.ac.ir
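A scope check for the crawler might look like the sketch below. The slide says a link is crawled when it contains one of the three domains; this sketch makes the stricter assumption that the host part of the URL must equal or end with one of them, so that unrelated sites mentioning a domain in their path are skipped. The class name `CrawlScope` is hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CrawlScope {
    // The three domains listed on the slide.
    static final List<String> DOMAINS = Arrays.asList("sharif.ir", "sharif.edu", "sharif.ac.ir");

    // Returns true when the URL's host is one of the domains or a subdomain of one.
    static boolean inScope(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        if (host == null) {
            return false; // e.g. a relative link without a host part
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : DOMAINS) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```

Relative links return `false` here, so they would need to be resolved against the page URL before the check.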

Our implementation • Source pre-processing • HTML to text • text = text.replaceAll("\n", "*_newline_*"); • text = text.replaceAll("\\<script.*?\\</script\\>", ""); • text = text.replaceAll("\\<style.*?\\</style\\>", ""); • text = text.replaceAll("<\\!--.*?--\\>", ""); • text = text.replaceAll("\\<.*?\\>", ""); • text = text.replaceAll("&nbsp;", " "); • text = text.replaceAll("&lt;", "<"); • … • text = text.replaceAll("\\*_newline_\\*", "\n"); • Additional • text = text.replaceAll("\n(\n| )*\n", "."); • text = text.replaceAll(",", " and ");
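The pre-processing steps on this slide can be collected into two small methods. This is a sketch rather than the original code: instead of swapping newlines for a placeholder, it uses the regex `(?s)` flag so that `.` also matches line breaks, and it assumes the two entity replacements on the slide decode `&nbsp;` and `&lt;`. The entity steps elided on the slide are not filled in, and the class and method names are hypothetical.

```java
class HtmlToText {
    // Strips scripts, styles, comments and tags, then decodes two HTML entities.
    static String htmlToText(String html) {
        String text = html;
        // (?s) lets '.' span line breaks; (?i) ignores tag case.
        text = text.replaceAll("(?is)<script.*?</script>", "");
        text = text.replaceAll("(?is)<style.*?</style>", "");
        text = text.replaceAll("(?s)<!--.*?-->", "");
        text = text.replaceAll("(?s)<.*?>", "");
        // Entities are decoded last so a decoded '<' is not mistaken for a tag.
        text = text.replace("&nbsp;", " ");
        text = text.replace("&lt;", "<");
        return text;
    }

    // The "Additional" step: blank lines become sentence breaks, commas become "and".
    static String toSentences(String text) {
        text = text.replaceAll("\n(\n| )*\n", ".");
        text = text.replace(",", " and ");
        return text;
    }
}
```

Regular expressions are a rough tool for HTML and can fail on malformed markup or on `>` inside attribute values; an HTML parser would be more robust if exact text extraction matters.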

Our implementation • Information extraction: • JMontyLingua • SUT is one of the largest engineering schools in the Islamic Republic of Iran • ("be" "SUT" "one" "of largest engineering school" "in Islamic Republic" "of Iran")

Our implementation • JMontyLingua problem: • SUT has computer, mechanic and electric engineering departments • ("have" "SUT" "computer mechanic and electric engineering departments") • ("have" "SUT" "computer and mechanic and electric engineering departments")

Our implementation • ("be" "SUT" "university" "in Islamic Republic" "of Iran") • => ("be" "SUT" "university" "in Islamic Republic of Iran") • => SUT,be,university & SUT,be_in,Islamic Republic of Iran • <rdf:Description rdf:about="http://sharif.edu/#SUT"> <rdf:type>university</rdf:type> <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/> </rdf:Description>
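The two rewriting steps on this slide, merging the trailing "of" fragment and then splitting the tuple into triples, can be sketched as follows. The rules are inferred from this single example and are assumptions: the preposition list, the class `TupleToTriples`, and the comma-separated triple format are illustrative, and the input is taken to be the verb, the subject, and the remaining arguments of a tuple like the one shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TupleToTriples {
    // Hypothetical list of prepositions that turn an argument into a "verb_prep" relation.
    static final List<String> PREPOSITIONS = Arrays.asList("in", "at", "on", "from", "to");

    // Step 1: attach any argument starting with "of " to the argument before it.
    static List<String> mergeOfPhrases(List<String> args) {
        List<String> merged = new ArrayList<>();
        for (String arg : args) {
            if (arg.startsWith("of ") && !merged.isEmpty()) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + " " + arg);
            } else {
                merged.add(arg);
            }
        }
        return merged;
    }

    // Step 2: emit one "subject,predicate,object" triple per merged argument.
    static List<String> toTriples(String verb, String subject, List<String> args) {
        List<String> triples = new ArrayList<>();
        for (String arg : mergeOfPhrases(args)) {
            int space = arg.indexOf(' ');
            String firstWord = space < 0 ? arg : arg.substring(0, space);
            if (space > 0 && PREPOSITIONS.contains(firstWord)) {
                // "in X" becomes the relation verb_in with object X.
                triples.add(subject + "," + verb + "_" + firstWord + "," + arg.substring(space + 1));
            } else {
                triples.add(subject + "," + verb + "," + arg);
            }
        }
        return triples;
    }
}
```

For the slide's tuple, `toTriples("be", "SUT", Arrays.asList("university", "in Islamic Republic", "of Iran"))` returns `SUT,be,university` and `SUT,be_in,Islamic Republic of Iran`.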

Any questions?
