Applications of Natural Language Processing Course 6 – 29 March 2012 Diana Trandabăț, dtrandabat@info.uaic.ro
Content • What is Named Entity Recognition • Corpora, annotation • Evaluation and testing • Preprocessing • Approaches to NE • Baseline • Rule-based approaches • Learning-based approaches • Multilinguality • Applications
Remember • Information Extraction (IE) provides techniques to extract relevant information from unstructured or semi-structured texts • The extracted information is transformed so that it can be represented in a fixed (computer-readable) format
Named Entity Recognition (NER) • Named Entity Recognition (NER) is an IE task that seeks to locate and classify text segments into predefined classes (for example Person, Location, Time expression) • We are proud to announce that Friday, February 17, we will have two sessions in the Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15h.
What are Named Entities? • NER involves two sub-tasks: • Identification of proper names in texts (Named Entity Identification – NEI) • Classification into a set of predefined categories of interest (Named Entity Classification – NEC)
What are Named Entities? • Usual categories: • Person names, Organizations (companies, government organisations, committees, etc), Locations (cities, countries, rivers, etc), Date and time expressions • Other common types: • measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc. • Some domain-specific entities: • names of drugs, medical conditions, names of ships, bibliographic references etc.
Basic Problems in NE • Variation of NEs – e.g. John Smith, Mr Smith, John. • Ambiguity of NE types: • John Smith (company vs. person) • May (person vs. month) • Washington (person vs. location) • 1945 (date vs. time) • Ambiguity with common words, e.g. "may"
More complex problems in NE • Issues of style, structure, domain, genre etc. • Punctuation, spelling, spacing, formatting, ... all have an impact: Dept. of Computing and Maths / Manchester Metropolitan University / Manchester / United Kingdom – read across the line breaks, "Manchester United" wrongly looks like an organisation name • Tell me more about Leonardo / Da Vinci
Some NE Annotated Corpora • MUC (Message Understanding Conference)-6 and MUC-7 corpora - English • CONLL shared task corpora: http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German; http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch • TIDES surprise language exercise (NEs in Cebuano and Hindi) • ACE (Automatic Content Extraction) – English: http://www.ldc.upenn.edu/Projects/ACE/
The MUC-7 corpus • 100 documents in SGML • News domain • 1880 Organizations (46%) • 1324 Locations (32%) • 887 Persons (22%) • Inter-annotator agreement very high (~97%) • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
The MUC-7 Corpus (2) <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission. <p> Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
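Not from the slides – a minimal Python sketch that pulls the ENAMEX/TIMEX annotations out of MUC-style SGML like the fragment above with a regular expression (flat markup only; the sample string is an assumption for illustration):

import re

# Matches <ENAMEX TYPE="...">...</ENAMEX> and the TIMEX/NUMEX equivalents
MUC_TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>', re.DOTALL)

def extract_muc_entities(sgml_text):
    """Return (tag, type, surface string) triples from MUC-style SGML."""
    return [(m.group(1), m.group(2), m.group(3)) for m in MUC_TAG.finditer(sgml_text)]

sample = '<ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the shuttle.'
print(extract_muc_entities(sample))   # [('ENAMEX', 'ORGANIZATION', 'NASA')]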
NE Annotation Tools - GATE
Pre-processing for NER • Format detection • Word segmentation (for languages like Chinese) • Tokenisation • Sentence splitting • POS tagging
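Not from the slides – a minimal sketch of such a pre-processing pipeline in Python using NLTK (assumes the 'punkt' and POS-tagger models are downloaded; format detection and word segmentation are omitted):

import nltk

def preprocess(text):
    """Sentence splitting, tokenisation and POS tagging – typical NER pre-processing."""
    sentences = nltk.sent_tokenize(text)
    return [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]

print(preprocess("Joe Mertz will present at 12:30pm. His session ends at 1pm."))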
NER Systems • NER systems have been created that use linguistic grammar-based techniques as well as statistical methods. • Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. • Statistical NER systems typically require a large amount of manually annotated training data.
From Corpora to System Development • Corpora are typically divided into a training and a testing portion • Rules/learning algorithms are trained on the training part • Tuned on the testing portion in order to optimise • Rule priorities, rule effectiveness, etc. • Parameters of the learning algorithm and the features used • Evaluation set – the best system configuration is run on this data and the system performance is obtained • No further tuning once the evaluation set is used!
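Not from the slides – a minimal Python sketch of such a split (the 70/20/10 proportions are illustrative assumptions):

import random

def split_corpus(documents, train=0.7, test=0.2, seed=0):
    """Shuffle annotated documents and split them into training, testing (tuning)
    and held-out evaluation portions; the evaluation portion is used only once."""
    docs = documents[:]
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train)
    n_test = int(len(docs) * test)
    return docs[:n_train], docs[n_train:n_train + n_test], docs[n_train + n_test:]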
Two kinds of NE approaches • Knowledge Engineering (rule-based): developed by experienced language engineers; makes use of human intuition; requires only a small amount of training data; development can be very time consuming; some changes may be hard to accommodate • Learning Systems: use statistics or other machine learning; developers do not need advanced language engineering expertise; require large amounts of annotated training data; some changes may require re-annotation of the entire training corpus
Baseline: list lookup approach • System that recognises only entities stored in its lists (gazetteers). • Advantages - Simple, fast, language independent, easy to retarget (just create lists) • Disadvantages – impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
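Not from the slides – a minimal sketch of the list-lookup baseline in Python; the gazetteer entries are made-up examples:

# Toy gazetteers – real lists contain thousands of entries
GAZETTEERS = {
    "PERSON":   {"Joe Mertz", "Brian McKenzie"},
    "LOCATION": {"Manchester", "Cape Canaveral"},
}

def gazetteer_lookup(text):
    """Baseline NER: a span is recognised iff it appears verbatim in a gazetteer."""
    found = []
    for label, names in GAZETTEERS.items():
        for name in names:
            position = text.find(name)
            if position != -1:
                found.append((name, label, position))
    return found

print(gazetteer_lookup("At 12:30pm Joe Mertz will present in Manchester."))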
Creating Gazetteer Lists • Online phone directories and yellow pages for person and organisation names • Location lists • http://ro.wikipedia.org/wiki/Format:Listele_localit%C4%83%C8%9Bilor_din_Rom%C3%A2nia_pe_jude%C8%9Be • Name lists • http://ro.wikipedia.org/wiki/List%C4%83_de_nume_rom%C3%A2ne%C8%99ti • Automatic collection from annotated training data
Shallow Parsing Approach (internal structure) • Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location: • Cap. Word + {City, Forest, Center, River} • e.g. Sherwood Forest • Cap. Word + {Street, Boulevard, Avenue, Crescent, Road} • e.g. Portobello Street
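Not from the slides – the two internal-structure rules above written as Python regular expressions (illustrative only; real rule sets are much larger):

import re

NATURAL_LOC = re.compile(r'\b([A-Z][a-z]+)\s+(City|Forest|Center|River)\b')
ADDRESS_LOC = re.compile(r'\b([A-Z][a-z]+)\s+(Street|Boulevard|Avenue|Crescent|Road)\b')

print(NATURAL_LOC.findall("They met in Sherwood Forest."))     # [('Sherwood', 'Forest')]
print(ADDRESS_LOC.findall("She lives on Portobello Street."))  # [('Portobello', 'Street')]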
Problems with the shallow parsing approach • Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police] • Semantic ambiguity "John F. Kennedy" = airport (location) "Philip Morris" = organisation • Structural ambiguity [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
Shallow Parsing Approach with Context • Use of context-based patterns is helpful in ambiguous cases • "David Walton" and "Goldman Sachs" are indistinguishable • But in "David Walton of Goldman Sachs", if we have "David Walton" recognised as Person, we can use the pattern "[Person] of [Organization]" and identify "Goldman Sachs" correctly.
Examples of context patterns • [PERSON] earns [MONEY] • [PERSON] joined [ORGANIZATION] • [PERSON] left [ORGANIZATION] • [PERSON] joined [ORGANIZATION] as [JOBTITLE] • [ORGANIZATION]'s [JOBTITLE] [PERSON] • [ORGANIZATION] [JOBTITLE] [PERSON] • the [ORGANIZATION] [JOBTITLE] • part of the [ORGANIZATION] • [ORGANIZATION] headquarters in [LOCATION] • price of [ORGANIZATION] • sale of [ORGANIZATION] • investors in [ORGANIZATION] • [ORGANIZATION] is worth [MONEY] • [JOBTITLE] [PERSON] • [PERSON], [JOBTITLE]
Context patterns • Patterns are only indicators based on likelihood • Can set priorities based on frequency thresholds • Need training data for each domain • More semantic information would be useful (e.g. to cluster groups of verbs)
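Not from the slides – a minimal sketch of applying the "[Person] of [Organization]" context pattern in Python (the helper name and the capitalised-sequence heuristic are assumptions for illustration):

import re

def organisations_from_context(text, known_persons):
    """If a known PERSON is followed by 'of', guess that the capitalised sequence
    after it is an ORGANIZATION ("[Person] of [Organization]" pattern)."""
    organisations = set()
    for person in known_persons:
        pattern = re.escape(person) + r'\s+of\s+((?:[A-Z][\w&.]*\s?)+)'
        for match in re.finditer(pattern, text):
            organisations.add(match.group(1).strip())
    return organisations

print(organisations_from_context("David Walton of Goldman Sachs resigned.", {"David Walton"}))
# {'Goldman Sachs'}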
Example Rule-based System - ANNIE • Created as part of GATE • GATE – Sheffield’s open-source infrastructure for language processing • GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging • GATE has a finite-state pattern-action rule language, used by ANNIE • ANNIE modified for MUC guidelines – 89.5% f-measure on MUC-7 corpus
NE Components The ANNIE system – a reusable and easily extendable set of components
Gazetteer lists for rule-based NE • Needed to store the indicator strings for the internal structure and context rules • Internal location indicators – e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations • Internal organisation indicators – e.g., company designators {GmbH, Ltd, Inc, …} • Produces Lookup results of the given kind
Using co-reference to classify ambiguous NEs • Orthographic co-reference module that matches proper names in a document • Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs • May not reclassify already classified entities • Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Napoleon] will match [Napoleon Bonaparte]; [International Business Machines Ltd.] will match [IBM]
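Not from the slides – a rough sketch of such orthographic matching in Python (the two matching rules, word containment and initials, are simplified assumptions):

def orthographic_match(unknown, classified):
    """Assign the type of a classified name that contains the unknown name
    as a word, or whose initials the unknown name matches."""
    for name, entity_type in classified.items():
        if unknown in name.split():                      # 'Napoleon' in 'Napoleon Bonaparte'
            return entity_type
        initials = "".join(w[0] for w in name.split() if w[0].isupper())
        if unknown == initials[:len(unknown)]:           # 'IBM' matches 'IBML...'
            return entity_type
    return None

classified = {"Napoleon Bonaparte": "PERSON",
              "International Business Machines Ltd.": "ORGANIZATION"}
print(orthographic_match("Napoleon", classified))   # PERSON
print(orthographic_match("IBM", classified))        # ORGANIZATION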
Machine Learning Approaches • ML approaches frequently break the NER task down into two parts: • Recognising the entity boundaries • Classifying the entities into the NE categories • Work is usually done on only one task or the other • Tokens in text are often coded with the IOB scheme • O – outside, B-NE – first word in NE, I-NE – all other words in NE • Argentina B-LOC / played O / with O / Del B-PER / Bosque I-PER
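Not from the slides – a minimal Python sketch that turns the IOB-coded example above back into entity spans:

# The slide's example as (token, tag) pairs
iob = [("Argentina", "B-LOC"), ("played", "O"), ("with", "O"),
       ("Del", "B-PER"), ("Bosque", "I-PER")]

def iob_to_spans(tagged):
    """Collapse IOB tags into (entity string, type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in tagged + [("", "O")]:          # sentinel flushes the last entity
        if tag.startswith("B-") or tag == "O":
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = ([token], tag[2:]) if tag.startswith("B-") else ([], None)
        else:                                        # I-XXX continues the open entity
            current.append(token)
    return spans

print(iob_to_spans(iob))   # [('Argentina', 'LOC'), ('Del Bosque', 'PER')]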
IdentiFinder [Bikel et al 99] • Based on Hidden Markov Models • Features • Capitalisation • Numeric symbols • Punctuation marks • Position in the sentence • 14 features in total, combining above info, e.g., containsDigitAndDash (09-96), containsDigitAndComma (23,000.00)
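Not from the slides – a rough re-implementation of a few IdentiFinder-style word-feature classes in Python (the feature names follow Bikel et al.; the regular expressions are illustrative approximations):

import re

def word_feature(token):
    """Map a token onto a coarse word-feature class (IdentiFinder style)."""
    if re.fullmatch(r'\d+-\d+', token):
        return "containsDigitAndDash"     # 09-96
    if "," in token and re.fullmatch(r'[\d,]+(\.\d+)?', token):
        return "containsDigitAndComma"    # 23,000.00
    if token.isdigit() and len(token) == 4:
        return "fourDigitNum"             # 1996
    if token.isupper():
        return "allCaps"                  # NASA
    if token[:1].isupper():
        return "initCap"                  # Sally
    return "other"

for t in ["09-96", "23,000.00", "1996", "NASA", "Sally", "can"]:
    print(t, "->", word_feature(t))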
IdentiFinder (2) • MUC-6 (English) and MET-1 (Spanish) corpora used for evaluation • Mixed case English • IdentiFinder – 94.9% f-measure • Spanish mixed case • IdentiFinder – 90% • Lower case names, noisy training data, less training data • Training data: 650,000 words, but similar performance with half of the data. Less than 100,000 words reduces the performance to below 90% on English
Fine-grained Classification of NEs [Fleischman 02] • Finer-grained categorisation needed for applications like question answering • Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police. • Approach using local context and global semantic information such as WordNet • Used a decision list classifier and IdentiFinder to automatically construct a training set from untagged data • Held-out set of 1300 instances hand annotated
Fine-grained Classification of NEs (2) • Word frequency features – how often the words surrounding the target instance occur with a specific category in training • For each of the 8 categories, 10 distinct word positions = 80 features per instance • 3 words before & after the instance • The two-word bigrams immediately before and after the instance • The three-word trigrams before/after the instance
Fine-grained Classification of NEs (3) • Topic signatures and WordNet information • Compute lists of terms that signal relevance to a topic/category [Lin & Hovy 00] & expand with WordNet synonyms to counter unseen examples • Politician – campaign, republican, budget • The topic signature features convey information about the overall context in which each instance exists • Due to differing contexts, instances of the same name in a single text were classified differently
Performance Evaluation • Evaluation metric – mathematically defines how to measure the system’s performance against a human-annotated, gold standard • Scoring program – implements the metric and provides performance measures • For each document and over the entire corpus • For each type of NE
The Evaluation Metric • Precision = correct answers / answers produced • Recall = correct answers / total possible correct answers • Trade-off between precision and recall • F-Measure = (β² + 1)·P·R / (β²·R + P) [van Rijsbergen 75] • β reflects the weighting between precision and recall, typically β = 1
The Evaluation Metric (2) • We may also want to take account of partially correct answers: • Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial) • Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial) • Why: NE boundaries are often misplaced, so some answers are only partially correct
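Not from the slides – these definitions in a few lines of Python, using the slide's F-measure formula; the example counts are made up:

def precision_recall_f(correct, incorrect, partial, missing, beta=1.0):
    """Precision/recall/F with half credit for partially correct answers."""
    precision = (correct + 0.5 * partial) / (correct + incorrect + partial)
    recall = (correct + 0.5 * partial) / (correct + missing + partial)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * recall + precision)
    return precision, recall, f

print(precision_recall_f(correct=80, incorrect=10, partial=10, missing=15))
# (0.85, 0.809..., 0.829...)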
Multilingual Named Entity Recognition • Recent experiments are aimed at NE recognition in multiple languages • TIDES surprise language evaluation exercise measures how quickly researchers can develop NLP components in a new language • CONLL’02, CONLL’03 focus on language-independent NE recognition
Analysis of the NE Task in Multiple Languages [Palmer & Day 97]
Analysis of Multilingual NE (2) • Numerical and time expressions are very easy to capture using rules • Together they constitute about 20-30% of all NEs • All numerical expressions in the 6 languages required only 5 patterns • Time expressions similarly require only a few rules (less than 30 per language) • Many of these rules are reusable across the languages
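Not from the slides – a couple of patterns of the simple kind described above, written as Python regular expressions (the exact expressions are illustrative assumptions, not the patterns used in the study):

import re

NUM_TIME_PATTERNS = [
    re.compile(r'\b\d{1,2}:\d{2}\s*(?:a\.?m\.?|p\.?m\.?)?', re.IGNORECASE),  # 12:30pm, 4:18 a.m.
    re.compile(r'\b\d{1,2}h\b'),                                             # 15h
    re.compile(r'\b\d{1,3}(?:[.,]\d{3})*(?:[.,]\d+)?\s*%'),                  # 3.5 %, 20%
]

text = "At 12:30pm we meet; the session ends around 15h. Growth was 3.5 %."
for pattern in NUM_TIME_PATTERNS:
    print(pattern.findall(text))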
What is needed for multilingual NE • Extensive support for non-Latin scripts and text encodings, including conversion utilities • Automatic recognition of encoding [Ignat et al. 03] • Occupied up to 2/3 of the TIDES Hindi effort • Bi-lingual dictionaries • Annotated corpus for evaluation • Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)
Multilingual Data - GATE • All processing, visualisation and editing tools use GUK (the GATE Unicode Kit)
Gazetteer-based Approach to Multilingual NE [Ignat et al 03] • Deals with locations only • Even more ambiguity than in one language: • Multiple places that share the same name, such as the fourteen cities and villages in the world called ‘Paris’ • Place names that are also words in one or more languages, such as ‘And’ (Iran), ‘Split’ (Croatia) • Places have varying names in different languages (Italian ‘Venezia’ vs. English ‘Venice’, German ‘Venedig’, French ‘Venise’)
Gazetteer-based multilingual NE (2) • Disambiguation module applies heuristics based on location size and country mentions (prefer the locations from the country mentioned most) • Performance evaluation: • 853 locations from 80 English texts • 96.8% precision • 96.5% recall
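Not from the slides – a minimal sketch of the "prefer the most-mentioned country" heuristic in Python (the data structures are assumptions for illustration):

from collections import Counter

def disambiguate(candidate_readings, countries_in_text):
    """Pick the gazetteer reading whose country is mentioned most often in the text."""
    counts = Counter(countries_in_text)
    return max(candidate_readings, key=lambda reading: counts.get(reading["country"], 0))

paris_readings = [{"name": "Paris", "country": "France"},
                  {"name": "Paris", "country": "USA"}]   # e.g. Paris, Texas
print(disambiguate(paris_readings, ["France", "France", "Germany"]))
# {'name': 'Paris', 'country': 'France'}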
Machine Learning for Multilingual NE • The CONLL 2002 and 2003 shared tasks addressed NE recognition in Spanish, Dutch, English, and German • The most popular ML techniques used: • Maximum Entropy (5 systems) • Hidden Markov Models (4 systems) • Connectionist methods (4 systems) • Combining ML methods has been shown to boost results
ML for NE at CONLL (2) • The choice of features is at least as important as the choice of ML algorithm • Lexical features (words) • Part-of-speech • Orthographic information • Affixes • Gazetteers • External, unmarked data is useful to derive gazetteers and for extracting training instances
Applications of NER • Named Entity Recognition in Web Search • Medical NER (Medline abstracts)