Human Language Technology for the Semantic Web http://gate.ac.uk/ http://nlp.shef.ac.uk/ Hamish Cunningham Kalina Bontcheva Diana Maynard Valentin Tablan ESWS, Crete, May 2004 [This work has been supported by AKT ( http://aktors.org/ ) and SEKT ( http://sekt.semanticweb.org/ )].
Are you wasting your time?
Structure of the Tutorial
• Motivation, background
• Information Extraction - definition
• Evaluation – corpora & metrics
• IE approaches – some examples
• Rule-based approaches
• Learning-based approaches
• Semantic Tagging
• Using “traditional” IE
• Ontology-based IE
• Platforms for large-scale processing
• Language Generation
[Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt]
The Knowledge Economy and Human Language
• Gartner, December 2002:
  • taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
  • through 2012 more than 95% of human-to-computer information input will involve textual language
• A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
• The challenge: to reconcile these two opposing tendencies
HLT & Knowledge: Closing the Language Loop
[Figure: diagram linking Human Language, Controlled Language, and Formal Knowledge (ontologies and instance bases) for the Semantic Web, Semantic Grid, and Semantic Web Services; arrows labelled (M)NLG, OBIE, (A)IE, and CLIE]
KEY
• MNLG: Multilingual Natural Language Generation
• OBIE: Ontology-Based Information Extraction
• AIE: Adaptive & Mixed-Initiative IE
• CLIE: Controlled Language IE
Background and Examples (1)
• Like other areas of computer science, HLT has typical data structures and infrastructure requirements
• Annotation: associating arbitrary data with areas of text or speech
• De facto standard: stand-off markup (e.g. TEI/XCES, NITE, ATLAS, GATE)
• Other issues: visualisation and editing; persistence and search; metrics; component model; baseline NLP tools; ...
• To cut a long story short: HLT has a lot of T underneath it, which comes in many shapes and sizes
Background and Examples (2)
• Infrastructure & (many) examples in this tutorial: GATE, a General Architecture for Text Engineering: architecture, framework & IDE
• Why?
  • I happen to know a little about it
  • Free software, relatively comprehensive, widely used, has extensive Semantic Web support
  • It means we can ignore the infrastructural issues
• Not a claim that it is the best or only in all cases!
Information Extraction (1)
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE (if you can’t score, why not move the goalposts?)
Information Extraction (2)
• “When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science.” (Kelvin)
• “Not everything that counts can be counted, and not everything that can be counted counts.” (Einstein)
• IE progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction
MUC-7 tasks
• Held in 1997, around 15 participants inc. 2 UK.
• Broke IE down into component tasks:
  • NE: Named Entity recognition and typing
  • CO: co-reference resolution
  • TE: Template Elements
  • TR: Template Relations
  • ST: Scenario Templates
An Example
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
• NE: "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
• CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE: the rocket is "shiny red" and Head's "brainchild".
• TR: Dr. Head works for We Build Rockets Inc.
• ST: rocket launch event with various participants
Performance levels
• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese, etc.)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only 80%)
What are Named Entities?
• NE involves identification of proper names in texts, and classification into a set of predefined categories of interest
  • Person names
  • Organizations (companies, government organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions
What are Named Entities (2)
• Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
• Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
• MUC-7 entity definition guidelines [Chinchor’97] http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
What are NOT NEs (MUC-7)
• Artefacts – Wall Street Journal
• Common nouns, referring to named entities – the company, the committee
• Names of groups of people and things named after people – the Tories, the Nobel prize
• Adjectives derived from names – Bulgarian, Chinese
• Numbers which are not times, dates, percentages, and money amounts
Basic Problems in NE
• Variation of NEs – e.g. John Smith, Mr Smith, John.
• Ambiguity of NE types:
  • John Smith (company vs. person)
  • May (person vs. month)
  • Washington (person vs. location)
  • 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"
More complex problems in NE
• Issues of style, structure, domain, genre etc.
• Punctuation, spelling, spacing, formatting, ... all have an impact:
  • Dept. of Computing and Maths
    Manchester Metropolitan University
    Manchester
    United Kingdom
  • Tell me more about Leonardo Da Vinci
Corpora and System Development
• “Gold standard” data created by manual annotation
• Corpora are typically divided into a training and a testing portion
• Rules and/or learning algorithms are developed or trained on the training part
• Tuned on the testing portion in order to optimise
  • Rule priorities, rule effectiveness, etc.
  • Parameters of the learning algorithm and the features used (typical routine: 10-fold cross validation)
• Evaluation set – the best system configuration is run on this data and the system performance is obtained
• No further tuning once evaluation set is used!
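The routine above (hold back an evaluation set, then tune with 10-fold cross validation on the rest) could be sketched roughly as follows. This is only an illustration: the document identifiers, the 80/20 proportion, and the helper name `k_fold_splits` are assumptions, not something the tutorial prescribes.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, held_out) pairs for k-fold cross validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, held_out


# Hypothetical corpus of 100 document identifiers.
corpus = [f"doc{n:03d}" for n in range(100)]

# Set the evaluation documents aside first; they are used once, at the very end.
evaluation_set = corpus[:20]
development_set = corpus[20:]

# Each split trains on 72 documents and tunes on the remaining 8.
splits = list(k_fold_splits(development_set, k=10))
```

Keeping `evaluation_set` out of `k_fold_splits` entirely is what enforces the "no further tuning" rule in this sketch.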
Some NE Annotated Corpora
• MUC-6 and MUC-7 corpora - English
• CONLL shared task corpora
  • http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
  • http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
• TIDES surprise language exercise (NEs in Cebuano and Hindi)
• ACE – English - http://www.ldc.upenn.edu/Projects/ACE/
The MUC-7 corpus
• 100 documents in SGML
• News domain
• Named Entities:
  • 1880 Organizations (46%)
  • 1324 Locations (32%)
  • 887 Persons (22%)
• Inter-annotator agreement very high (~97%)
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
The MUC-7 Corpus (2)
<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
<p> Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
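As a rough illustration of how this inline markup can be read back, the sketch below pulls the tagged entities out of a shortened copy of the sample with a regular expression. The function name `extract_entities` is hypothetical, and a regex is only a convenient shortcut here; it assumes tags are never nested.

```python
import re

# Matches <ENAMEX TYPE="...">text</ENAMEX> and the TIMEX equivalent.
TAG = re.compile(r'<(ENAMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>', re.DOTALL)


def extract_entities(sgml):
    """Return (tag, type, text) triples in document order."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG.finditer(sgml)]


sample = (
    '<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, '
    '<ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures '
    '<TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, '
    '<ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle'
)

entities = extract_entities(sample)
for tag, entity_type, text in entities:
    print(tag, entity_type, text)
```

Alternative types such as `ORGANIZATION|LOCATION` come back as a single string, so a caller would need to split on `|` to handle them.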
ACE – Towards Semantic Tagging of Entities
• MUC NE tags segments of text whenever that text represents the name of an entity
• In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
• Rolls together the NE and CO tasks
• Domain- and genre-independent approaches
• ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
ACE Entities
• Dealing with
  • Proper names – e.g., England, Mr. Smith, IBM
  • Pronouns – e.g., he, she, it
  • Nominal mentions – the company, the spokesman
• Identify which mentions in the text refer to which entities, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor
ACE Example
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>
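To show how one entity groups several mentions, the fragment can be loaded with Python's standard XML parser. This is a sketch over the example as shown, not over the full ACE file format, which is not described here.

```python
import xml.etree.ElementTree as ET

ace_xml = """
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>
"""

entity = ET.fromstring(ace_xml.strip())

# One entity, many mentions: collect (id, mention type, surface string).
mentions = [
    (m.get("ID"), m.get("TYPE"), m.get("string"))
    for m in entity.iter("entity_mention")
]

# Name mentions only; the pronoun "its" refers to the same entity but is typed PRO.
names = [text for _, kind, text in mentions if kind == "NAME"]
print(entity.get("entity_type"), names)
```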
Annotation Tools (1): GATE
Annotation Tools (2): Alembic
Performance Evaluation
• Evaluation metric – mathematically defines how to measure the system’s performance against a human-annotated gold standard
• Scoring program – implements the metric and provides performance measures
  • For each document and over the entire corpus
  • For each type of NE
Evaluation Metrics
• Most common are “Precision” and “Recall”
  • Precision = correct answers / answers produced
  • Recall = correct answers / total possible correct answers
  • Trade-off between precision and recall
• F-Measure: $F = \frac{(\beta^2 + 1) P R}{\beta^2 R + P}$ [van Rijsbergen 75]
  • β reflects the weighting between precision and recall, typically β = 1
• Some tasks sometimes use other metrics, e.g.:
  • false positives (not sensitive to doc richness)
  • cost-based (good for application-specific adjustment)
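A small worked sketch of these three measures follows. The counts are made up for illustration, and the function name is hypothetical. The F-measure follows the formula as written on the slide, where β² multiplies R in the denominator; some references place β² on P instead, and the two forms agree at β = 1.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Precision, recall, and F-measure from raw answer counts."""
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    # Denominator as on the slide: beta^2 * R + P.
    denominator = beta**2 * recall + precision
    f = (beta**2 + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f


# Hypothetical run: 100 answers produced, 80 of them correct,
# against a gold standard containing 160 answers.
p, r, f = precision_recall_f(correct=80, produced=100, possible=160)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

With β = 1 the F-measure is the harmonic mean of precision and recall, so it sits closer to the lower of the two values.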
The Evaluation Metric (2)
• We may also want to take account of partially correct answers:
  • $\text{Precision} = \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Incorrect} + \text{Partial}}$
  • $\text{Recall} = \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Missing} + \text{Partial}}$
• Why: NE boundaries are often misplaced, so some results are partially correct
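One way these counts might be obtained from annotation spans is sketched below. The matching policy is an assumption made for illustration: an exact span and type match counts as correct, a same-type overlap counts as partial, any other system annotation counts as incorrect, and unmatched gold annotations count as missing. Real scorers may define these categories differently.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(key, response):
    """Half-credit precision and recall for lists of (start, end, type)."""
    remaining_key = list(key)
    unmatched_response = []
    correct = 0
    for ann in response:  # pass 1: exact span and type matches
        if ann in remaining_key:
            remaining_key.remove(ann)
            correct += 1
        else:
            unmatched_response.append(ann)
    partial = 0
    for ann in unmatched_response:  # pass 2: same type, overlapping span
        match = next(
            (k for k in remaining_key if k[2] == ann[2] and overlaps(k, ann)),
            None,
        )
        if match is not None:
            remaining_key.remove(match)
            partial += 1
    incorrect = len(response) - correct - partial
    missing = len(remaining_key)
    credit = correct + 0.5 * partial
    p_den = correct + incorrect + partial
    r_den = correct + missing + partial
    return {
        "correct": correct, "partial": partial,
        "incorrect": incorrect, "missing": missing,
        "precision": credit / p_den if p_den else 0.0,
        "recall": credit / r_den if r_den else 0.0,
    }


# Hypothetical gold standard and system output (character offsets).
key = [(0, 10, "Person"), (20, 35, "Organization"), (50, 58, "Location")]
response = [(0, 10, "Person"), (24, 35, "Organization"), (70, 75, "Date")]
result = score(key, response)
```

Here the Organization is found with a misplaced left boundary, so it earns half credit rather than counting as both an error and a miss.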
The GATE Evaluation Tool
Corpus-level Regression Testing
• Need to track system’s performance over time
• When a change is made we want to know implications over whole corpus
• Why: because an improvement in one case can lead to problems in others
• GATE offers automated tool to help with the NE development task over time
Regression Testing (2)
• At corpus level – GATE’s corpus benchmark tool – tracking system’s performance over time
SW IE Evaluation tasks
• Detection of entities and events, given a target ontology of the domain.
• Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated “Cambridge” in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
• Decision when a new instance needs to be added to the ontology, because the text contains a new instance that does not already exist in the ontology.
Challenge: Evaluating Richer NE Tagging
• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person
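The intuition about distance can be made concrete with a toy type hierarchy. Both the hierarchy and the choice of edge counting as the distance are assumptions for illustration; the slide does not prescribe a particular metric.

```python
# Hypothetical type hierarchy: child -> parent.
PARENT = {
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(node):
    """Path from a node up to the root, including the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Number of edges between two types via their closest common ancestor."""
    path_b = ancestors(b)
    depth_in_b = {n: i for i, n in enumerate(path_b)}
    for i, n in enumerate(ancestors(a)):
        if n in depth_in_b:
            return i + depth_in_b[n]
    raise ValueError("types share no common ancestor")


print(tree_distance("Company", "Charity"))  # siblings under Organization
print(tree_distance("Company", "Person"))   # only related through Entity
```

A scorer could then scale the penalty for a wrong type by this distance, so that Company tagged as Charity costs less than Company tagged as Person.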
Two kinds of IE approaches
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• requires only small amount of training data
• development could be very time consuming
• some changes may be hard to accommodate
Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
• annotators are cheap (but you get what you pay for!)
A) NE Baseline: list lookup approach
• System that recognises only entities stored in its lists (gazetteers).
• Advantages – simple, fast, language independent, easy to retarget (just create lists)
• Disadvantages – impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
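A minimal sketch of the list lookup baseline is given below, assuming tiny hand-made lists and plain substring matching. The list contents and the function name `lookup` are illustrative only, and no word-boundary check is performed.

```python
# Hypothetical gazetteer lists, one per entity type.
GAZETTEER = {
    "Location": ["Sherwood Forest", "Crete", "Manchester"],
    "Organization": ["IBM", "NASA"],
}


def lookup(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, type, string) for every list entry found."""
    found = []
    for entity_type, names in gazetteer.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                found.append((start, start + len(name), entity_type, name))
                start = text.find(name, start + 1)
    return sorted(found)


hits = lookup("NASA staff met IBM engineers in Manchester.")
for start, end, entity_type, name in hits:
    print(start, end, entity_type, name)

# A variant that is not in the list is simply not found.
print(lookup("Nasa staff"))
```

The last call illustrates the disadvantage named above: the variant "Nasa" is missed because only the listed form "NASA" is known.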
B) Shallow parsing approach using internal structure
• Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
  • Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
  • Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
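The two patterns above translate fairly directly into a regular expression. In this sketch "Cap. Word" is approximated as one capital letter followed by lowercase letters, which is an assumption; the key words are the ones listed on the slide.

```python
import re

# Key words from the two patterns: Cap. Word + {key word}.
KEYS = ["City", "Forest", "Center", "River",
        "Street", "Boulevard", "Avenue", "Crescent", "Road"]

LOCATION = re.compile(r"\b[A-Z][a-z]+ (?:" + "|".join(KEYS) + r")\b")


def find_locations(text):
    """Return location candidates in the order they appear."""
    return [m.group(0) for m in LOCATION.finditer(text)]


results = find_locations("They walked from Portobello Street to Sherwood Forest.")
print(results)
```

Unlike list lookup, this finds names that were never stored, as long as they end in a known key word.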
Problems ...
• Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
• Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
• Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
C) Shallow parsing with context
• Use of context-based patterns is helpful in ambiguous cases
• "David Walton" and "Goldman Sachs" are indistinguishable
• But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly.
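The "[Person] of [Organization]" pattern can be sketched as below, assuming the Person entities have already been recognised by an earlier step. The function name and the definition of a capitalised sequence are illustrative assumptions.

```python
import re

# One or more capitalised words, e.g. "Goldman Sachs".
CAP_SEQ = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def organizations_from_context(text, known_persons):
    """Apply "[Person] of [Organization]" using already recognised persons."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQ + r")")
        found.extend(m.group(1) for m in pattern.finditer(text))
    return found


orgs = organizations_from_context(
    "David Walton of Goldman Sachs said rates may rise.",
    known_persons=["David Walton"],
)
print(orgs)
```

Without the recognised Person on the left, the same capitalised sequence would give no clue about its type.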
Examples of context patterns
• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]
Example Rule-based System - ANNIE
• ANNIE – A Nearly-New IE system
• A version distributed as part of GATE
• GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
• GATE has a finite-state pattern-action rule language, used by ANNIE
• A reusable and easily extendable set of components
NE Components
Gazetteer lists for rule-based NE
• Needed to store the indicator strings for the internal structure and context rules:
  • Internal location indicators – e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
  • Internal organisation indicators – e.g., company designators {GmbH, Ltd, Inc, …}
• Produces Lookup results of the given kind
The Named Entity Transducers
• Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
• Hand-coded rules applied to annotations to identify NEs
• Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
• Use contextual information
• Finds person names, locations, organisations, dates, addresses.
NE Rule in JAPE
• JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+   //from tokeniser
  {Lookup.kind == companyDesignator}         //from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
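For readers unfamiliar with JAPE, the Company1 rule can be approximated in plain Python as below. This is not how GATE executes the rule; it is a sketch of the same pattern (one or more upper-initial tokens followed by a company designator) over a list of token strings. The designator set and helper names are assumptions, and designators are excluded from the leading run as a simplification.

```python
# Hypothetical gazetteer of company designators.
COMPANY_DESIGNATORS = {"Ltd", "Inc", "GmbH"}


def is_upper_initial(token):
    """First character upper case, the rest unchanged by lowercasing."""
    return token[:1].isupper() and token[1:] == token[1:].lower()


def company1(tokens):
    """Find runs of upper-initial tokens followed by a company designator."""
    matches = []
    i = 0
    while i < len(tokens):
        j = i
        # Consume ( {Token.orthography == upperInitial} )+
        while (j < len(tokens) and is_upper_initial(tokens[j])
               and tokens[j] not in COMPANY_DESIGNATORS):
            j += 1
        # Require {Lookup.kind == companyDesignator} right after the run.
        if j > i and j < len(tokens) and tokens[j] in COMPANY_DESIGNATORS:
            matches.append({
                "text": " ".join(tokens[i:j + 1]),
                "kind": "company",
                "rule": "Company1",
            })
            i = j + 1
        else:
            i += 1
    return matches


tokens = "Dr. Head is a staff scientist at We Build Rockets Inc .".split()
found = company1(tokens)
print(found)
```

The returned dictionaries mirror the right-hand side of the rule, which creates a NamedEntity annotation with `kind` and `rule` features over the matched span.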
Named Entities in GATE
Using co-reference to classify ambiguous NEs
• Orthographic co-reference module matches proper names in a document
• Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
• May not reclassify already classified entities
• Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
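The two matching heuristics mentioned above (surname against full name, acronym against full name) might look roughly like this. It is a simplified sketch, not the actual GATE module: the title and designator lists and all function names are assumptions.

```python
# Hypothetical word lists ignored during matching.
DESIGNATORS = {"Ltd.", "Ltd", "Inc.", "Inc", "GmbH"}
TITLES = {"Sir", "Mr", "Mr.", "Dr", "Dr."}


def acronym(name):
    """Initials of a name, ignoring company designators."""
    words = [w for w in name.split() if w not in DESIGNATORS]
    return "".join(w[0] for w in words).upper()


def matches(unknown, known):
    """True if `unknown` is the last word or the acronym of `known`."""
    known_words = [w for w in known.split() if w not in TITLES]
    if known_words and unknown == known_words[-1]:
        return True
    return unknown == acronym(known)


def classify(unknown_names, classified):
    """Give unknown names the type of the first classified name they match."""
    result = {}
    for name in unknown_names:
        for known, entity_type in classified.items():
            if matches(name, known):
                result[name] = entity_type
                break
    return result


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
inferred = classify(["Bonfield", "IBM", "Crete"], classified)
print(inferred)
```

Only previously unclassified names are passed in, which reflects the point that already classified entities are not reclassified.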