Introduction to Web Science: Harvesting the SW
Six challenges of the Knowledge Life Cycle • Acquire • Model • Reuse • Retrieve • Publish • Maintain
A couple of approaches … • Active learning to reduce annotation burden • Supervised learning • Adaptive IE • The Melita methodology • Automatic annotation of large repositories • Largely unsupervised • Armadillo
The Seminar Announcements Task • Created by Carnegie Mellon School of Computer Science • The task: extract • Speaker • Location • Start Time • End Time • from seminar announcements received by email
Seminar Announcements Example Dr. Steals presents in Dean Hall at one am. becomes <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
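A minimal sketch of what the target output looks like, assuming a hand-written dictionary of the gold fillers; this is not the CMU system, only an illustration of the tagging format:

    # Toy illustration: wrap known fillers in XML-style tags.
    sentence = "Dr. Steals presents in Dean Hall at one am."
    entities = {            # hypothetical, hand-listed "gold" fillers
        "speaker": "Dr. Steals",
        "location": "Dean Hall",
        "stime": "one am",
    }

    annotated = sentence
    for tag, text in entities.items():
        annotated = annotated.replace(text, f"<{tag}>{text}</{tag}>")

    print(annotated)
    # <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.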
Information Extraction Measures • Precision: how many of the retrieved documents are relevant? • Recall: how many of all the relevant documents were retrieved? • F-measure: the weighted harmonic mean of precision and recall
IE Measures Examples • I ask a librarian to search for books on cars; there are 10 relevant books in the library, and only 4 of the 8 books he finds are relevant. What are his precision, recall and F-measure?
IE Measures Answers • I ask a librarian to search for books on cars; there are 10 relevant books in the library, and only 4 of the 8 books he finds are relevant. What are his precision, recall and F-measure? • Precision = 4/8 = 50% • Recall = 4/10 = 40% • F = (2*50*40)/(50+40) = 44.4%
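The same computation as a small Python sketch, with the numbers taken from the librarian example above:

    # Precision, recall and F-measure for the librarian example.
    relevant_in_library = 10   # all relevant books
    retrieved = 8              # books the librarian found
    relevant_retrieved = 4     # found books that are actually relevant

    precision = relevant_retrieved / retrieved                  # 0.50
    recall = relevant_retrieved / relevant_in_library           # 0.40
    f_measure = 2 * precision * recall / (precision + recall)   # ~0.444

    print(f"P={precision:.1%}  R={recall:.1%}  F={f_measure:.1%}")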
Adaptive IE • What is IE? • Automated ways of extracting structured information from unstructured or partially structured machine-readable files • What is AIE? • Performs the tasks of traditional IE • Exploits the power of Machine Learning in order to adapt to • complex domains with large amounts of domain-dependent data • different sub-language features • different text genres • Treats the usability and accessibility of the system as important
Amilcare • Tool for adaptive IE from Web-related texts • Specifically designed for document annotation • Based on the (LP)² algorithm (Linguistic Patterns by Learning Patterns) • Covering algorithm based on Lazy NLP • Trains with a limited amount of examples • Effective on different text types • free texts • semi-structured texts • structured texts • Uses GATE and ANNIE for preprocessing
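The slide above names a covering algorithm; the sketch below is a generic covering loop, not the actual (LP)² implementation, and induce_rule, good_enough and rule.matches are hypothetical placeholders:

    # Generic covering loop: induce a rule from the uncovered seed examples,
    # keep it if it is good enough, drop the examples it covers, repeat.
    def covering_learner(examples, induce_rule, good_enough):
        rules, remaining = [], list(examples)
        while remaining:
            rule = induce_rule(remaining)          # generalise from uncovered seeds
            covered = [e for e in remaining if rule.matches(e)]
            if not covered or not good_enough(rule):
                break
            rules.append(rule)
            remaining = [e for e in remaining if e not in covered]
        return rules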
CMU: detailed results • Best overall accuracy • Best result on speaker field • No results below 75%
GATE • General Architecture for Text Engineering • Provides a software infrastructure for researchers and developers working in NLP • Contains • Tokeniser • Gazetteers • Sentence Splitter • POS Tagger • Semantic Tagger (ANNIE) • Co-reference Resolution • Multilingual support • Protégé • WEKA • many more components exist and can be added • http://www.gate.ac.uk
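GATE itself is a Java framework, and the sketch below is not its API; it only illustrates, in Python, the idea of a fixed preprocessing pipeline whose stages each add annotations to a shared document object:

    # Illustrative pipeline: each stage reads the text and earlier annotations
    # and appends its own (type, start, end, features) annotations.
    class Document:
        def __init__(self, text):
            self.text = text
            self.annotations = []   # (type, start, end, features)

    def run_pipeline(doc, stages):
        for stage in stages:
            stage(doc)
        return doc

    def tokeniser(doc):
        pos = 0
        for tok in doc.text.split():
            start = doc.text.index(tok, pos)
            doc.annotations.append(("Token", start, start + len(tok), {}))
            pos = start + len(tok)

    def sentence_splitter(doc): ...   # stub
    def pos_tagger(doc): ...          # stub
    def ne_tagger(doc): ...           # stub

    doc = run_pipeline(Document("Dr. Steals presents in Dean Hall at one am."),
                       [tokeniser, sentence_splitter, pos_tagger, ne_tagger])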
Annotation • Current practice of annotation for knowledge identification and extraction • is complex • is time consuming • needs annotation by experts • Goal: reduce the burden of text annotation for Knowledge Management
Different Annotation Systems • SGML • TEX • Xanadu • CoNote • ComMentor • JotBot • Third Voice • Annotate.net • The Annotation Engine • Alembic • The Gate Annotation Tool • iMarkup, Yawas • MnM, S-CREAM
Melita • Tool for assisted automatic annotation • Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system) • User: annotates document samples • IE System: • Trains while the user annotates • Generalizes over seen cases • Provides preliminary annotation for new documents • Performs smart ordering of documents • Advantages • Annotates trivial or previously seen cases • Focuses slow/expensive user activity on unseen cases • User mainly validates extracted information • Simpler & less error prone / Speeds up corpus annotation • The system learns how to improve its capabilities
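A minimal sketch of the train-while-annotating loop, assuming a learner object with suggest()/train() methods and a user object with a review() step (hypothetical names, not Amilcare's real interface):

    # For each document: propose annotations, let the user validate/correct,
    # then retrain on everything annotated so far.
    def annotate_corpus(documents, learner, user):
        annotated = []
        for doc in documents:
            suggestions = learner.suggest(doc)        # preliminary annotation
            gold = user.review(doc, suggestions)      # user validates/corrects
            annotated.append((doc, gold))
            learner.train(annotated)                  # retrain in the background
        return annotated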
Methodology: Melita Bootstrap Phase • Bare text • User annotates • Amilcare learns in the background
Methodology: Melita Checking Phase • Bare text • Amilcare annotates • User annotates • Amilcare learns in the background from missing tags and mistakes
Methodology: Melita Support Phase • Bare text • Amilcare annotates • User corrects • Corrections used to retrain
Smart ordering of documents • Bare text • User annotates • Learns annotations • Tries to annotate all the documents and selects the document with partial annotations (a sketch of this ordering follows below)
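One possible reading of the smart-ordering heuristic as a sketch: prefer the document that the current model can only partly annotate. expected_slots() is a hypothetical estimate, not part of Melita:

    # Pick the document whose coverage by the current model is closest to 50%:
    # neither trivial (already fully annotated) nor completely unknown.
    def pick_next(documents, learner):
        def coverage(doc):
            suggested = learner.suggest(doc)
            expected = learner.expected_slots(doc)    # hypothetical estimate
            return len(suggested) / max(1, expected)
        return min(documents, key=lambda d: abs(coverage(d) - 0.5))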
Intrusivity • An evolving system is difficult to control • Goal: • Avoiding unwelcome/unreliable suggestions • Adapting proactivity to user’s needs • Method: • Allow users to tune proactivity • Monitor user reactions to suggestions
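A sketch of one way to tune proactivity, assuming each suggestion carries a confidence score: suggestions below a threshold are hidden, and the threshold moves with the user's accept/reject behaviour (illustrative only):

    # Show only confident suggestions; adjust the threshold from user reactions.
    class SuggestionFilter:
        def __init__(self, threshold=0.8):
            self.threshold = threshold

        def visible(self, suggestions):   # suggestions: (tag, span, confidence)
            return [s for s in suggestions if s[2] >= self.threshold]

        def feedback(self, accepted, rejected):
            if rejected > accepted:
                self.threshold = min(0.99, self.threshold + 0.05)   # be quieter
            elif accepted > rejected:
                self.threshold = max(0.50, self.threshold - 0.05)   # be bolder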
Methodology: Melita Control Panel • Ontology defining the concepts • Document panel
Results (chart)
Future Work • Research better ways of annotating concepts in documents • Optimise document ordering to maximise the discovery of new tags • Allow users to edit the rules • Learn to discover relationships !! • Not only suggest but also correct user annotations !!
Annotation for the Semantic Web • The Semantic Web requires document annotation • Current approaches • Manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita) • BUT: • Manual/semi-automatic annotation of • large diverse repositories • containing different and sparse information is unfeasible • e.g. a Web site of 1,600 pages
Redundancy • Information on the Web (or in large repositories) is redundant • Information repeated in different superficial formats • Databases/ontologies • Structured pages (e.g. produced by databases) • Largely structured pages (bibliography pages) • Unstructured pages (free texts)
The Idea • Largely unsupervised annotation of documents • Based on Adaptive Information Extraction • Bootstrapped using the redundancy of information • Method • Use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract)
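A sketch of the bootstrapping step, assuming a list of entities already known from a structured source (database, Citeseer, ...): their occurrences in unstructured pages become automatic training annotations, with no user effort. Function and parameter names are illustrative:

    # Turn known entities into seed training annotations on unstructured pages.
    def bootstrap_annotations(pages, known_entities, tag):
        training = []
        for page in pages:
            spans = []
            for entity in known_entities:
                start = page.find(entity)
                if start != -1:
                    spans.append((tag, start, start + len(entity)))
            if spans:
                training.append((page, spans))    # seed examples, no user effort
        return training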
Example: Extracting Bibliographies • Mines web sites to extract bibliographies from personal pages • Tasks: • Finding people's names • Finding home pages • Finding personal bibliography pages • Extracting bibliography references • Sources • NE Recognition (GATE's ANNIE) • Citeseer/Unitrier (largely incomplete bibliographies) • Google • Homepagesearch
Mining Web sites (1) • Mines the site looking for people's names • Uses • generic patterns (NER) • Citeseer for likely bigrams • Looks for structured lists of names • Annotates known names • Trains on annotations to discover the HTML structure of the page • Recovers all names and hyperlinks
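A very rough sketch of "learning the HTML structure around the seeds": detect the tag that encloses the already-known names and reuse it to pull out the remaining names on the page. Real wrapper induction is considerably more robust; this only illustrates the idea:

    import re

    # Find which HTML tag wraps the seed names, then harvest every other
    # string wrapped in the same tag as a candidate name.
    def harvest_names(html, seed_names):
        contexts = set()
        for name in seed_names:
            m = re.search(r"<(\w+)[^>]*>\s*" + re.escape(name) + r"\s*</\1>", html)
            if m:
                contexts.add(m.group(1))          # e.g. names listed in <li> or <td>
        found = set(seed_names)
        for tag in contexts:
            for m in re.finditer(rf"<{tag}[^>]*>\s*([^<]+?)\s*</{tag}>", html):
                found.add(m.group(1))
        return found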
Experimental Results II - Sheffield • People • discovering who works in the department • using Information Integration • Total present in site 139 • Using generic patterns + online repositories • 35 correct, 5 wrong • Precision 35 / 40 = 87.5 % • Recall 35 / 139 = 25.2 % • F-measure 39.1 % • Errors • A. Schriffin • Eugenio Moggi • Peter Gray
Experimental Results IE - Sheffield • People • using Information Extraction • Total present in site 139 • 116 correct, 8 wrong • Precision 116 / 124 = 93.5 % • Recall 116 / 139 = 83.5 % • F-measure 88.2 % • Errors • Speech and Hearing • European Network • Department Of • Enhancements – Lists, Postprocessor • Position Paper • The Network • To System
Experimental Results - Edinburgh • People • using Information Integration • Total present in site 216 • Using generic patterns + online repositories • 11 correct, 2 wrong • Precision 11 / 13 = 84.6 % • Recall 11 / 216 = 5.1 % • F-measure 9.6 % • using Information Extraction • 153 correct, 10 wrong • Precision 153 / 163 = 93.9 % • Recall 153 / 216 = 70.8 % • F-measure 80.7 %
Experimental Results - Aberdeen • People • using Information Integration • Total present in site 70 • Using generic patterns + online repositories • 21 correct, 1 wrong • Precision 21 / 22 = 95.5 % • Recall 21 / 70 = 30.0 % • F-measure 45.7 % • using Information Extraction • 63 correct, 2 wrong • Precision 63 / 65 = 96.9 % • Recall 63 / 70 = 90.0 % • F-measure 93.3 %
Mining Web sites (2) • Annotates known papers • Trains on annotations to discover the HTML structure • Recovers co-authoring information
Experimental Results (1) • Papers • discovering publications in the department • using Information Integration • Total present in site 320 • Using generic patterns + online repositories • 151 correct, 1 wrong • Precision 151 / 152 = 99 % • Recall 151 / 320 = 47 % • F-measure 64 % • Errors - Garbage in database!! @misc{ computer-mining, author = "Department Of Computer", title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks", url = "citeseer.nj.nec.com/582939.html" }
Experimental Results (2) • Papers • using Information Extraction • Total present in site 320 • 214 correct, 3 wrong • Precision 214 / 217 = 99 % • Recall 214 / 320 = 67 % • F-measure 80 % • Errors • Wrong boundaries in detection of paper names! • Names of workshops mistaken as paper names!
Artists domain • Task • Given the name of an artist, find all the paintings of that artist. • Created for the ArtEquAKT project
User Role • Providing … • A URL • List of services • Already wrapped (e.g. Google is in default library) • Train wrappers using examples • Examples of fillers (e.g. project names) • In case … • Correcting intermediate results • Reactivating Armadillo when paused
Armadillo • Library of known services (e.g. Google, Citeseer) • Tools for training learners for other structured sources • Tools for bootstrapping learning • From un/structured sources • No user annotation • Multi-strategy acquisition of information using redundancy • User-driven revision of results • With re-learning after user correction
Rationale • Armadillo learns how to extract information from large repositories by integrating information from diverse and distributed resources • Uses: • Ontology population • Information highlighting • Document enrichment • Enhancing user experience
IE for SW: The Vision • Automatic annotation services • For a specific ontology • Constantly re-indexing/re-annotating documents • Semantic search engine • Effects: • No annotation in the document • Just as today's indexes are not stored in the documents • No legacy with the past • Annotation with the latest version of the ontology • Multiple annotations for a single document • Simplifies maintenance • Page changed but not re-annotated
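A sketch of the "no annotation in the document" idea: annotations live in an external index keyed by page URL and ontology version, so a page can be re-annotated whenever the ontology changes. All field names and the URL are illustrative:

    # External annotation store: nothing is written into the page itself.
    annotation_index = {
        ("http://www.example.org/seminar42.html", "ontology-v2"): [
            {"concept": "Speaker",  "text": "Dr. Steals", "start": 0,  "end": 10},
            {"concept": "Location", "text": "Dean Hall",  "start": 23, "end": 32},
        ],
    }

    def annotations_for(url, ontology_version):
        return annotation_index.get((url, ontology_version), [])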