Demonstration of Text Mining Bill Black National Centre for Text Mining W.Black@manchester.ac.uk University of Manchester
Text mining tasks and resources • Information retrieval • Gather, select, filter documents that may prove useful • Find what is known • Information extraction • Partial, shallow language analysis • Find relevant entities, facts about entities • Find only what you are looking for • Mining • Combine, link facts • Discover new knowledge, find new facts • Resources: ontologies, lexicons, terminologies, grammars, annotated corpora (machine learning, evaluation)
CAFETIERE • Conceptual annotations for facts, events, terms, individual entities and relations • Integration of terminological processing and ontological resources • Rule based temporal processing • Scalability issues • Distributed data and processing • Incremental processing • On the fly rapid access to ontologies • Annotation and rule editors
Common Annotation Scheme • XML-based representation • Stores textual, linguistic and metadata information. • Document structure: • <front>: source, title, document timestamp… • <body>: text, sections, paragraphs, sentences, tokens. • <back>: conceptual info: name instances, entities, relations, events.
Annotation Overview • <front>: Title, Abstract, Source, Date • <body>: Section, Head, Paragraph, Sentence, Token • <back>: Lexical, Conceptual
Denmark: Danisco defends fat replacer after concerns in local press • Torben Svejgaard, President of Emulsifiers at Danish ingredients giant Danisco, has responded to the recent debate in the Danish press about the company's fat replacer - Salatrium. • Lexical Annotations: Person, organization, location, money, percent, quantity, definition, product, role, term, title, consumer, pcycle, bcycle
Conceptual Annotations relating lexical annotations • Entities: Danisco (company), Torben Svejgaard, Salatrium • Descriptive phrases: Danish ingredients giant, President of emulsifiers, Fat replacer • Relations: DEFINED AS, PRODUCES, WORKS FOR, ROLE PLAYED
Common Annotation Scheme Illustration
<ParDoc>
  <stage stage="1"/>
  <front>
    <dateline>22/11/1999</dateline>
    <title>22Nov1999 PHILIPPINES: IN BRIEF - Rebels nab son</title>
  </front>
  <body><sec><p><s>
    <tok id="t16" pos="NIL" lem="zamboanga" lookup="NIL" orth="uppercase" zone="" sepAfter=" ">ZAMBOANGA</tok>
    <tok id="t17" pos="NIL" lem="city" lookup="NIL" orth="uppercase" zone="" sepAfter=":">CITY</tok>. . .</s></p></sec>. . .
  </body>
  <back>
    <ParLex>
      <PNAMEX id="pn3" type="individual" tokref="t46 t47" features ="id=Individual,path =Thing:Individual,name = IndividualPosition=,Nation=,Importance="/>
    </ParLex>
    <ParCon>
      <PEntity id="pe1" type="irish republican army" mnem="Irish Republican Army" refid="pn1"/>
      <PEvent id="ev1" type="abduction" text="was abducted" refId="t26 t27" class="OCCURRENCE" tense="PAST" polarity="POSITIVE" aspect="NONE" slot1="" slot2="" slot3="" slot4="" slot5="" slot6="" slot7="" slot8="" />
    </ParCon>
  </back>
</ParDoc>
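A minimal sketch of how a document in this scheme could be read back, using only Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is an abbreviated fragment in the style of the illustration, not a complete ParDoc, and the helper names `read_tokens` and `read_events` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment modelled on the illustration (not a full ParDoc).
SAMPLE = """<ParDoc>
  <front><dateline>22/11/1999</dateline></front>
  <body><sec><p><s>
    <tok id="t16" pos="NIL" lem="zamboanga" orth="uppercase" sepAfter=" ">ZAMBOANGA</tok>
    <tok id="t17" pos="NIL" lem="city" orth="uppercase" sepAfter=":">CITY</tok>
  </s></p></sec></body>
  <back><ParCon>
    <PEvent id="ev1" type="abduction" text="was abducted" refId="t26 t27" tense="PAST"/>
  </ParCon></back>
</ParDoc>"""


def read_tokens(root):
    """Return (id, surface text, lemma) for every <tok> element."""
    return [(t.get("id"), t.text, t.get("lem")) for t in root.iter("tok")]


def read_events(root):
    """Return one dict per <PEvent>, with the token ids it refers to."""
    return [
        {"id": e.get("id"), "type": e.get("type"), "tokens": e.get("refId", "").split()}
        for e in root.iter("PEvent")
    ]


root = ET.fromstring(SAMPLE)
tokens = read_tokens(root)
events = read_events(root)
```

Because conceptual annotations in `<back>` point at tokens by id, a consumer can resolve an event's text span without re-running any analysis.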
Parmenides Resource Manager • Defines pipelines, queues documents between processes
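The idea of a pipeline with queues between processes can be sketched as below. This is an illustration only, not the Parmenides Resource Manager: the stage functions `convert` and `zone` and the `history` field are hypothetical, and the sketch is single-threaded, whereas a real manager would run stages as separate processes.

```python
from collections import deque


def run_pipeline(stages, documents):
    """Pass documents through the stages in order, with one queue per stage boundary."""
    queues = [deque(documents)] + [deque() for _ in stages]
    for i, stage in enumerate(stages):
        # Drain the input queue of this stage into the input queue of the next.
        while queues[i]:
            queues[i + 1].append(stage(queues[i].popleft()))
    return list(queues[-1])


# Hypothetical stages: each records its name so the processing order is visible.
def convert(doc):
    return {**doc, "history": doc["history"] + ["convert"]}


def zone(doc):
    return {**doc, "history": doc["history"] + ["zone"]}


docs = [{"name": "a.html", "history": []}, {"name": "b.pdf", "history": []}]
processed = run_pipeline([convert, zone], docs)
```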
Document capture and conversion • Web and directory crawling • Batch and interactive use • Format conversion • Word, HTML, PDF, etc. to Common Annotation Scheme • Text zoning • Separate front matter from body text • Attempt to annotate headings etc.
Cafetiere Information Extraction Framework • Texts are analysed at several levels leading to a template representation of events • Tokenization and tagging • Sentence splitting and optional term discovery • Ontology or gazetteer lookup • Phrasal analysis to classify name expressions • Phrasal analysis to fill slots of template representations of entities and events.
From Tokenization to Semantic Parsing • Tokenization • words, numbers, punct., tel.nos., chemical formulae, etc. • Tagging • Part of speech labelling; disambiguation in local context • Semantic dictionary/ontology lookup • Known names, terms, heads of terms and names • Partial parsing • Identify phrasal chunks - names, domain terms and other NPs; temporal elements - tensed verbs, adverbials,… • Semantic information extraction • Build template or graph rep. of events/facts
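The tokenization step above can be sketched with a small regular-expression tokenizer that also assigns an orthographic class to each token. The value `uppercase` appears in the `orth` attribute of the scheme illustration; the other class names (`capitalized`, `lowercase`, `number`, `punct`, `mixed`) and the regular expression itself are assumptions for this example, and telephone numbers or chemical formulae are not handled.

```python
import re

# Numbers (with internal . or ,), runs of letters, or any other single symbol.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+|\S")


def orth(tok):
    """Classify a token by its orthography (class names are illustrative)."""
    if tok[0].isdigit():
        return "number"
    if not tok.isalpha():
        return "punct"
    if tok.isupper():
        return "uppercase"
    if tok[0].isupper() and tok[1:].islower():
        return "capitalized"
    if tok.islower():
        return "lowercase"
    return "mixed"


def tokenize(text):
    """Return a list of (token, orthographic class) pairs."""
    return [(t, orth(t)) for t in TOKEN_RE.findall(text)]


result = tokenize("ZAMBOANGA CITY: Rebels nab 2 sons.")
```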
The NLP components • Part of Speech Tagger • Based on Brill algorithm, locally trained from publicly available data • Ontology Lookup • Accesses semantic category and properties of application-interesting words and phrases. • Rule-based phrasal analyser • Finds and labels phrases of application-interest, using tag, lookup, orthography and output of other rules. Returns feature values as well as span labels.
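A rule that combines lookup and orthography features might work along the following lines. The slides do not show Cafetiere's rule syntax, so the pattern format here (a list of required feature values, one dict per token) and the feature values `person_title` and `capitalized` are assumptions chosen for illustration.

```python
def match_rule(tokens, pattern, label):
    """Return (start, end, label) spans where consecutive tokens satisfy the pattern.

    tokens:  list of dicts of token features
    pattern: list of dicts; each dict gives feature values one token must have
    """
    spans = []
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        if all(
            all(tokens[i + j].get(feat) == val for feat, val in cond.items())
            for j, cond in enumerate(pattern)
        ):
            spans.append((i, i + n, label))  # end index is exclusive
    return spans


tokens = [
    {"text": "Dr.", "lookup": "person_title", "orth": "capitalized"},
    {"text": "Smith", "lookup": None, "orth": "capitalized"},
    {"text": "spoke", "lookup": None, "orth": "lowercase"},
]
# Hypothetical rule: a title word followed by a capitalized word is a person name.
person_rule = [{"lookup": "person_title"}, {"orth": "capitalized"}]
spans = match_rule(tokens, person_rule, "person")
```

Writing the span label back onto the tokens as a new feature would let later rules build on earlier ones, in the spirit of "output of other rules".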
Tokenization and POS tagging • Tokenizer separates words and other tokens, analyzes each orthographically. • Transformation-based Learning is used to train a part of speech tagger. • New Java implementation of Brill algorithm • Fast in operation, circa 100K words/sec. • Tagger available separately, distributed with VisualText. • Sentence splitter differentiates sentence punctuation from other usages of .?! • Next slide shows tabular view of token attributes
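How a transformation-based tagger applies its rules can be sketched as follows. This covers only the application of already learned rules, not the training procedure, and it is not the Java implementation mentioned above. The tiny lexicon, the single rule template (change tag A to B when the previous tag is C) and the Penn Treebank style tag names are assumptions for the example.

```python
def initial_tags(words, lexicon, default="NN"):
    """Start from each word's most likely tag, falling back to a default."""
    return [lexicon.get(w.lower(), default) for w in words]


def apply_rules(tags, rules):
    """Apply ordered rules (from_tag, to_tag, prev_tag) over the tag sequence."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


# Illustrative lexicon: "race" is most often a noun out of context.
lexicon = {"they": "PRP", "want": "VBP", "to": "TO", "race": "NN"}
# Illustrative rule: a noun directly after "to" is retagged as a base-form verb.
rules = [("NN", "VB", "TO")]

words = ["They", "want", "to", "race"]
tagged = apply_rules(initial_tags(words, lexicon), rules)
```

The local context in each rule is what provides the disambiguation: the initial guess for "race" is corrected because of the tag to its left.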
Ontology Lookup • IE systems typically consult lists of known names of places, people, organizations, artifacts, etc., and tokens that heuristically indicate class of name, e.g. Dr. , Plc. • Cafetiere consults a knowledge base, which associates ontology class and/or entity identifier, as well as slot names and type constraints. • Previous slide showed ontology class in the ‘lookup’ column of the token attributes.
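A simple form of lookup is greedy longest-match against a table of known names and terms. The sketch below stands in for the knowledge base with a plain dictionary; the entries, class names and the `max_len` limit are assumptions for illustration, and a real knowledge base would also return entity identifiers and slot constraints.

```python
def lookup(tokens, kb, max_len=4):
    """Greedy longest-match lookup.

    Returns (start, end, class) spans with an exclusive end index.
    """
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in kb:
                spans.append((i, i + n, kb[key]))
                i += n
                break
        else:
            i += 1  # nothing in the knowledge base starts at this token
    return spans


# Hypothetical knowledge base entries.
kb = {"danisco": "organization", "fat replacer": "product", "fat": "nutrient"}
tokens = ["Torben", "Svejgaard", "of", "Danisco", "defends", "the", "fat", "replacer"]
found = lookup(tokens, kb)
```

Preferring the longest match means "fat replacer" is found as a product rather than "fat" alone as a nutrient.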
Phrasal Analysis • Cafetiere finds instances of • Proper names of people, places, organizations, or other application-motivated named entities, e.g. genes, proteins. • Temporal expressions, including adverbials, dates, verb groups. • Descriptive phrases, e.g. NPs in apposition to names. • and classifies them by conceptual category. • Next slides show how these can be accessed after analysis in a document browser.
Entity extraction • Typically, named entities are mentioned several times in a text. • Cafetiere groups together the instances of each named entity when creating a conceptual annotation. • Resolves some co-references, especially variant forms of proper names. • An initial entity is created when a name expression is found. Later occurrences are added to form equivalence classes. • Next slide shows the phrasal instances of a single entity mentioned in the text.
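Grouping mentions into equivalence classes can be sketched with a simple variant-matching heuristic: a later mention joins an existing entity if all of its words occur in that entity's first mention. This heuristic is an assumption for illustration, not necessarily the one Cafetiere uses; it would miss cases where the short form comes first, and it ignores pronouns.

```python
def group_mentions(mentions):
    """Group name mentions (in text order) into entities.

    Each entity keeps the first form seen as its name plus all its mentions.
    """
    entities = []
    for mention in mentions:
        words = set(mention.lower().split())
        for entity in entities:
            # A mention is a variant if its words are a subset of the first form.
            if words <= set(entity["name"].lower().split()):
                entity["mentions"].append(mention)
                break
        else:
            entities.append({"name": mention, "mentions": [mention]})
    return entities


entities = group_mentions(["Torben Svejgaard", "Danisco", "Svejgaard", "Danisco"])
```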
Event extraction • At the phrasal analysis stage, verb groups and noun phrases denoting events have TimeML features assigned. • Event extraction is either rule-based or ontology-driven. • Each event type has a number of conceptual slots, and Cafetiere matches conceptually annotated text fragments within the sentence, where they match the slot types. • Next slide shows the event browser after analysis, with one event’s slots in a detail view.
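Matching annotated fragments to slot types can be sketched as below. The event definition (slot names `perpetrator` and `victim` with their type constraints) and the example sentence annotations are invented for illustration; the rule that the first unused fragment of the right type wins is likewise an assumption.

```python
# Hypothetical event ontology: event type -> {slot name: required entity type}.
EVENT_TYPES = {"abduction": {"perpetrator": "organization", "victim": "person"}}


def fill_slots(event_type, fragments):
    """Fill each slot with the first unused fragment whose type fits.

    fragments: list of (text, type) pairs in sentence order.
    Unfilled slots are left as None.
    """
    template, used = {}, set()
    for slot, wanted in EVENT_TYPES[event_type].items():
        template[slot] = None
        for i, (text, frag_type) in enumerate(fragments):
            if frag_type == wanted and i not in used:
                template[slot] = text
                used.add(i)
                break
    return template


# Invented annotations for a single sentence.
fragments = [
    ("the rebels", "organization"),
    ("the mayor's son", "person"),
    ("Zamboanga City", "location"),
]
event = fill_slots("abduction", fragments)
```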
Semi-automatic annotation • The Cafetiere browser also has editing controls, which enable a corrected annotation to be saved. • Useful for applications where validated extracted data are presented to the end users. • Useful for annotating documents to create training or evaluation gold standard data. • For events, browser shows all compatible entities as alternative slot fillers to the one selected by the automatic analysis.
Mining Scientific Literature • In the Parmenides project, a case study has been conducted with Unilever on mining scientific papers on weight management. • In NaCTeM, work on Term Management focuses on domain terminology for biological sciences and medicine. • Joint work with Lancaster University seeks to advance conceptual summarization by extracting causal relations expressed in text.
Unilever Case Study • Weight management experimental papers • Template-level representation of key features of the study • Clinical study subjects • Study population • Clinical study design • Nutritional metabolic phenomenon • Work function (effect) • Health benefit
Phrasal Analysis • Same techniques as for news NE analysis • Targeting descriptive phrases rather than complex proper names. • Domain terms are important • Sentence-based fact extraction won’t work, because information is distributed throughout the text.
Template slot filling • Each slot has many candidate fillers • Often benign paraphrases of each other, but can emphasize different attributes • Selection is heuristic, based on proximity to trigger word and similarity of terms. • For Unilever’s purposes, acceptable to get filler wrong sometimes if the candidates can be easily substituted in the template editor. • Not yet handled properly: multiple slot fillers
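A heuristic of this kind can be sketched as a ranking function that adds a proximity score to a term-similarity score. The slides do not say how the two are combined, so the additive score, the `window` size, the use of Jaccard word overlap as the similarity measure, and the example candidates are all assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity between two phrases, from 0.0 to 1.0."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_fillers(candidates, trigger_pos, cue, window=50):
    """Rank (text, token position) candidates for one slot, best first.

    Score = closeness to the trigger word (within `window` tokens)
            + similarity of the candidate to a cue term for the slot.
    """
    def score(candidate):
        text, pos = candidate
        proximity = max(0.0, 1 - abs(pos - trigger_pos) / window)
        return proximity + jaccard(text, cue)

    return sorted(candidates, key=score, reverse=True)


# Invented candidates for a "study population" slot; trigger word at token 10.
candidates = [("healthy volunteers", 90), ("overweight women", 40), ("obese adults", 12)]
ranked = rank_fillers(candidates, trigger_pos=10, cue="adults")
```

Returning the full ranking rather than only the top candidate fits the workflow described above, where a wrong filler is acceptable as long as the alternatives can be substituted in the template editor.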
Conceptual Abstracting • Joint work with Chris Paice at Lancaster University. • Paice and Jones (1993) pioneered template-based abstracting of ‘crop science’ papers. • Similar analysis to Unilever case study, leading to a template, and then a generated short informative abstract. • Major drawback is the domain-specific resource development needed.
Ameliorating the resource bottleneck • One approach (Paice and Oakes) to develop rules by supervised machine learning • Transfers effort from rule-based analysis to corpus annotation. • Current approach (Paice and Black) to develop domain-independent extraction of causal and other key relationships expressed in scientific papers • Incorporates term discovery, ontology lookup, stemming, tagging. Implemented in Cafetiere framework.
NLP Components as Services • Stratified processing modules well-defined • Common Annotation Scheme facilitates interoperability • Each module potentially a Web service • Queuing between modules should minimize network traffic • User defines pipeline; Processing takes place on servers; Users share cached common module analyses; Users access curated data.
Conclusion • Cafetiere is a framework for the information extraction phases of text mining. • Incorporates context-sensitive partial parsing of names, terms, chunks. • Linkage to event ontology enables template slot filling and hence fact/event extraction. • Domain-specific resources (ontology, rules) can be developed for diverse domains: business, science.
Future Developments • Use corpus-trained components for phrasal analysis of bio-medical literature. • Improve rule-application engine performance. • Cluster processing to balance IE processing against more superficial processes.