Text Mining for MIMS Lynette Hirschman The MITRE Corporation Bedford, MA, USA June 6-8, 2007 Cambridge, UK
Outline • Why am I here? • We have NSF funding for "Critical Assessment of Information Extraction in Biology" (BioCreative) • NSF Program Manager (Sylvia Spengler) urged me to look at text mining for ecology/environment • Outline • State of the Art in Text Mining: BioCreative • Applications to MIGS/MIMS
Text Mining and Evaluation • Builds on history of assessment/evaluation in both biology (e.g., CASP¹) & natural language (MUC, TREC)² • Approach • Define a relevant task • In terms of input and output • Have experts prepare "gold standard" as basis of comparison • Some data used as "training" so developers can build their systems to do the right thing • Evaluate by comparing system output against the "gold standard" • Also useful to compare two experts against each other, to check consistency ¹Critical Assessment of Techniques for Protein Structure Prediction ²Message Understanding Conference; Text Retrieval Conference
State of the Art: Extraction • For news, automated systems exist now that can: • Identify entities (85-95% F-measure*); entities include persons, places, times, money, … • Extract relations among entities (70-80% F), e.g., person lives_in place • Answer simple factual questions using large document collections at 75-85% accuracy (question answering) • Goals of BioCreative: • Find out how well text mining performs when applied to biology • Use evaluation to drive progress in text mining and natural language processing for biology *F-measure is the harmonic mean of precision and recall: 2·P·R/(P+R). Precision = TP/(TP+FP); Recall = TP/(TP+FN)
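To make the scoring concrete, here is a minimal sketch (plain Python; the gene names are hypothetical example data) of how precision, recall, and F-measure are computed against a gold standard. The same computation applies when comparing one annotator against a consensus gold standard, as in the inter-annotator experiment described later.

```python
def precision_recall_f(gold, system):
    """Score a set of system annotations against a gold standard.

    True positives are annotations in both sets; false positives are
    system annotations absent from the gold standard; false negatives
    are gold annotations the system missed.
    """
    gold, system = set(gold), set(system)
    tp = len(gold & system)
    fp = len(system - gold)
    fn = len(gold - system)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical gene mentions from one abstract:
gold = {"Est-6", "TTF-1", "SP-A"}
system = {"Est-6", "SP-A", "CAT"}          # one miss, one false alarm
print(precision_recall_f(gold, system))    # approx (0.67, 0.67, 0.67)
```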
Critical Assessment of Information Extraction for Biology • 1st evaluation March 2004: 27 teams/10 countries • 2nd evaluation April 2007: 44 teams/13 countries • Organizers: • CNIO (Spanish National Center for Cancer Research): Krallinger and Valencia • MITRE: Hirschman, Colosimo, Morgan (Stanford) • NCBI: Wilbur, Smith, Tanabe
BioCreative Tasks • Gene mention: identify all gene or protein mentions in running text • Results: ~90% balanced precision/recall • Gene normalization: list unique identifiers (EntrezGene) of genes/proteins in a text • Results: ~80% balanced precision/recall • Biological tasks • 1st BioCreative: functional annotation for proteins (identifying evidence passage supporting assignment of a GO term to a protein) • 2nd BioCreative: supporting the curation pipeline for protein-protein interaction, based on IntAct and MINT • Results: these are very hard tasks!
BioCreative Gene Name Finding • Identify all mentions of genes, proteins in sentences from PubMed abstracts • Serves as building block for more complex tasks • Results based on 15,000 expert-annotated sentences • Many approaches based on statistical word co-occurrence (HMM, Conditional Random Fields) • High score: F-measure (balanced precision/recall) • 1st BioCreative: 0.82 • 2nd BioCreative: 0.87; 0.90 combined systems Example: "Mutation of TTF-1-binding sites (TBE) 1, 3, and 4 in combination markedly decreased transcriptional activity of SP-A promoter-chloramphenicol acetyltransferase constructs containing SP-A gene sequences from -256 to +45."
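As an illustration only (not any specific BioCreative system), here is a sketch of the kind of surface features a statistical HMM/CRF gene-mention tagger computes per token; the feature set is a plausible guess at typical choices, not a documented one.

```python
import re

def token_features(tokens, i):
    """Illustrative surface features for token i, of the kind fed to a
    CRF/HMM sequence tagger for gene-mention finding."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "mixed_case": tok != tok.lower() and tok != tok.upper(),
        # Word "shape": digits -> 0, lowercase -> a, uppercase -> A
        "shape": re.sub(r"[A-Z]", "A",
                 re.sub(r"[a-z]", "a",
                 re.sub(r"\d", "0", tok))),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

sentence = "Mutation of TTF-1 binding sites decreased SP-A promoter activity".split()
print(token_features(sentence, 2))  # features for "TTF-1", shape "AAA-0"
```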
Finding Other “Entity” Mentions • Examples of other kinds of entities: • Place names and geographic locations • Lat/long expressions • Other kinds of quantitative expressions • Technology • For place names: gazetteer (list of geographic names + meta-data) is very useful • For quantitative expressions, capture via regular expressions • Performance up to 80-90% F-measure • Provided that you have access to “clean” textual data (not pdf, not tables, not images…)
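A minimal sketch of the regular-expression approach for coordinates and quantitative expressions (Python; the patterns are illustrative and far from exhaustive, since real coordinate and unit formats vary widely):

```python
import re

# Illustrative patterns; real coordinates and measurements appear in
# many more formats than these two regexes cover.
LATLONG = re.compile(
    r"(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*(\d{1,2}(?:\.\d+)?)?\s*[\"″]?\s*([NSEW])")
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(m|cm|km|°C|psu)\b")

text = ('Samples were collected from 5.6 m water depth '
        '(42°48\'26"N, 010°08\'28"E) at 7.0 °C.')
print(LATLONG.findall(text))   # [('42', '48', '26', 'N'), ('010', '08', '28', 'E')]
print(QUANTITY.findall(text))  # [('5.6', 'm'), ('7.0', '°C')]
```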
BioCreative Gene Normalization • Task: list the unique gene IDs mentioned in an abstract, for fly, human, … • Example annotations: abstract fly_00035_training → gene IDs FBgn0000592, FBgn0026412 • Sample abstract: "A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated." • Sample Gene ID and synonyms: FBgn0000592: Est-6, Esterase 6, CG6917, Est-D, EST6, est-6, Est6, Est, EST-6, Esterase-6, est6, Est-5, Carboxyl ester hydrolase
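At its simplest, normalization is dictionary lookup over synonym lists like the one above. A minimal sketch (plain Python, using a subset of the FBgn0000592 synonyms from this slide as toy data; a real system must handle the ambiguity and overlapping names that make the task hard):

```python
# Toy synonym table, from the FBgn0000592 entry above; a real system
# would load the full FlyBase lexicon and disambiguate shared names.
SYNONYMS = {
    "FBgn0000592": ["Est-6", "Esterase 6", "CG6917", "EST6", "Est6",
                    "EST-6", "Esterase-6", "est6", "est-6"],
}

def normalize(text):
    """Return the set of gene IDs whose synonyms occur in the text.
    Case-insensitive substring lookup; no disambiguation."""
    found = set()
    for gene_id, names in SYNONYMS.items():
        for name in names:
            if name.lower() in text.lower():
                found.add(gene_id)
                break
    return found

abstract = ("A locus causes a modification of some allozymes of the "
            "enzyme esterase 6 in Drosophila melanogaster.")
print(normalize(abstract))  # {'FBgn0000592'}
```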
Gene Normalization: Results • Serves as building block for complex tasks, for example, protein-protein interaction • Gold standard derived from annotations in model organism databases, EBI GOA • Results depend heavily on consistency of naming conventions for an organism • Also on availability of synonym lists for gene and protein names • High score measured as balanced precision/recall: • 1st BioCreative high F-measure: Fly: 0.82 Mouse: 0.79 Yeast: 0.93 • 2nd BioCreative high F-measure: Human: 0.81; Combined: 0.83
Protein-Protein Interaction • Several sub-tasks, including • Selection of articles for curation: 78% correct • SwissProt IDs for interacting protein pairs: 0.52 F • Experimental method extraction: 0.65 F • Results are lower on this task because: • This reflects an actual biological application • The tasks were more complex, e.g., it was necessary to identify organism and protein to map to correct SwissProt ID • There was limited training data • This was the first time we ran it
What Could Text Mining Do for MIGS? • Renzo Kottmann reports ~70% F-measure extracting geographic location information from text (part of the MetaFunctions project) • Text mining could extract other information from the literature related to environmental information (height/depth, pH, salinity) • Requires a stable ontology/CV • Requires examples of human-extracted information • Requires access to full text + supplement + … • Many tools may be useful to annotators in an interactive environment • Keep the user in the loop! Thanks to Dawn Field, Tanya Gray, Renzo Kottmann, Norman Morrison, Trish Whetzel for discussions about what MIGS needs
A Small Thought Experiment for MIGS • Looked up a few metagenomics articles and tried to find information about the environment from which the samples were collected • Where is the information? • How is it expressed? • Is this a good match for current text mining tools? • Where I found useful information: • Tables: table columns could be useful to define the ontological categories (but are hard to parse) • Free text comment fields in existing databases • Full articles: information is scattered in a number of places, including the supplementary material (in pdf – not great for text mining)
Information in Full Text • Full-text article, Methods section¹: "Further details for all methods used in this study are provided in Supplementary Information. O. algarvensis specimens were collected off Capo di Sant' Andrea, Elba, Italy." • Supplementary Material (pdf!)²: "Juvenile and adult Olavius algarvensis specimens were collected in May and September 2004 from 5.6 m water depth in silicate sediments around sea grass beds of Posidonia oceanica in a bay off Capo di Sant' Andrea, Elba, Italy (42°48'26"N, 010°08'28"E)." ¹Symbiosis insights through metagenomic analysis of a microbial consortium, Woyke T et al., Nature 443, 950-955 (26 October 2006) doi:10.1038/nature05192 ²http://www.nature.com/nature/journal/v443/n7114/extref/nature05192-s1.pdf
What Are the Other Possibilities? • Can text mining help in building ontologies? • There are existing term extraction systems, e.g., TerMine¹ • There are techniques to define local patterns to extract specific types of information • But these systems aren't very mature yet • Could GSC provide a "challenge problem" to the next BioCreative? ¹http://www.nactem.ac.uk/software/termine/
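For a rough sense of what term extraction does, here is a crude frequency-based sketch (Python; this is not TerMine's actual C-value algorithm, just a stand-in that counts candidate multi-word phrases):

```python
import re
from collections import Counter

def candidate_terms(text, max_len=3):
    """Count word n-grams as candidate terms; a crude stand-in for
    real term extraction (TerMine ranks candidates with the C-value
    measure over part-of-speech-filtered phrases)."""
    words = re.findall(r"[a-z][a-z-]+", text.lower())
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

corpus = ("water depth ... silicate sediments ... sea grass beds ... "
          "water depth ... sea grass beds ...")
print(candidate_terms(corpus).most_common(3))
```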
The Real Problem: Capturing the Meta-Data • Who provides the data? • Too much work for authors… • Who enforces standards? • Too much work for databases (INSDC) and publishers! • Possible solution: • Provide a free open-source meta-data checker (like a spell checker) to help authors capture meta-data WHILE WRITING • I had conversations with editors at PLoS a few years ago; there was potential interest • Is MIGS/MIMS a good test case?
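A sketch of what such a meta-data checker might look like (Python; the field names are hypothetical and only loosely modeled on MIGS/MIMS; a real checker would validate against the actual checklist specification):

```python
# Hypothetical MIGS/MIMS-style required fields; the real checklist
# defines the authoritative names and value ranges.
REQUIRED = ["lat_lon", "depth", "collection_date", "habitat"]

def check_metadata(record):
    """Report missing or implausible values, spell-checker style."""
    problems = []
    for field in REQUIRED:
        if field not in record or record[field] in ("", None):
            problems.append(f"missing required field: {field}")
    if "depth" in record:
        try:
            if float(record["depth"]) < 0:
                problems.append("depth should be non-negative (meters)")
        except (TypeError, ValueError):
            problems.append("depth is not numeric")
    return problems

sample = {"lat_lon": "42.807 N 10.141 E", "depth": "5.6"}
print(check_metadata(sample))
# ['missing required field: collection_date', 'missing required field: habitat']
```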
Possible Next Steps • Determine if there are annotated data sets that define a well-structured problem for BioCreative • With good human inter-curator agreement • With sufficient expert annotated data for training an extraction system • Such an application could be contributed to next BioCreative • Discuss meta-data checker – is this practical?
Checking Interannotator Agreement: An Experiment from BioCreative 1 • Camon et al. did the first inter-curator agreement experiment* • 3 EBI GOA annotators annotated 12 overlapping documents for GO terms (4 docs/pair of curators) • Results after developing consensus gold standard: • Avg precision (% annotations correct): ~95% • Avg recall (% correct annotations found): ~72% • Lessons learned • Very few wrong annotations, but some were missed • Annotators differed on specificity of annotation, depending on their biological knowledge • Annotation by paper meant evidence standard was less clear (normal annotation is by protein) • Annotation is a complex task for people! *Camon et al., BMC Bioinformatics 2005, 6(Suppl 1):S17 (2005)
State of the Art: Terminology Extraction • Term extraction work by Ananiadou's group at the National Centre for Text Mining (NaCTeM), Manchester, e.g., TerMine • Also the issue of free text capture (and pdf) • Cambridge text processing pipeline supporting FlyBase curation (Briscoe)
Conclusions • Text mining can provide a methodology to assess consistency of annotation • Text mining can provide tools • To manage the curation queue • To assist curators, particularly in normalization & mapping into ontologies • Next steps • Define intended uses of RegCreative data • Establish curator training materials • Identify key bottlenecks in curation • Provide data, user input to develop tools • Major stumbling block for text mining • Handling of pdf documents!
Acknowledgements • US National Science Foundation for funding of BioCreAtIvE I and BioCreAtIvE II* • MITRE colleagues who worked on BioCreAtIvE • Alex Morgan (now at Stanford) • Marc Colosimo • Jeff Colombe • Alex Yeh (also KDD Challenge Cup) • Collaborators at CNB and CNIO • Alfonso Valencia • Christian Blaschke (now at bioalma) • Martin Krallinger *Contract numbers EIA-0326404 and IIS-0640153
Life-Cycle of a Typical Biological Database • (1-6 months) Purpose: background for specific project; Users: one researcher; Format: text file or spreadsheet • (1-2 years) Purpose: data sharing within project, linkage to resources outside project; Users: internal team of researchers; Format: spreadsheet • (3-5 years) Purpose: sharing with external collaborators; Users: cross-institution research "team"; Format: database • (5-10 years) Purpose: resource for community; Users: larger biology/bioinformatics community; Format: database, standard nomenclature, ontologies
Emergence of a Biological Database:Transition from Private to Shared Resource • Transition from private to shared resource • Requires shared standards • Nomenclature (organism, species, cell lines, genes, …) • Ontologies (anatomical location, protein function, …) • Requires maintainable infrastructure • Database with well-defined fields • Interfaces for data entry • Documented annotation standards • Tools to facilitate curation
The MOD Curation Pipeline and Text Mining [Pipeline diagram, starting from MEDLINE] • 1. Select papers: KDD 2002 Task 1; TREC Genomics 2004 Task 2; BioCreAtIvE II PPI article selection • 2. List genes for curation: BioCreAtIvE Gene Normalization (extract gene names & normalize; 20 participants) • 3. Curate genes from paper: BioCreAtIvE II Protein annotation (find relations & supporting evidence in text; 28 participants)
ORegAnno Curation Pipeline & Text Mining [Pipeline diagram, starting from MEDLINE] • 1. Select papers: curation queue management • 2. List TFBS for curation: Gene & TF Normalization (extract gene, protein names & normalize to standard ID) • 3. Curate genes from paper: extract evidence passages and map to evidence types/sub-types
Assessments: Document Classification • TREC Genomics track focused on retrieval • Part of the Text Retrieval Conference, run by the National Institute of Standards and Technology • Tasks have included retrieval of • Documents to identify gene function • Documents for MGI curation pipeline • Documents, passages to answer queries, e.g., "what effect does the insulin receptor gene have on tumorigenesis?" • 40+ groups participating starting 2004 • KDD Challenge Cup task 2002 • Yeh et al, MITRE; Gelbart, Mathew et al, FlyBase task
KDD Challenge Cup • Task: automate part of FlyBase curation: • Determine which papers need to be curated for Drosophila gene expression information • Curate only those papers containing experimental results on gene products (RNA transcripts and proteins) • Teamed with FlyBase, who provided • Data annotation plus biological expertise • Input on the task formulation • Venue: ACM conference on Knowledge Discovery and Data Mining (KDD) • Alex Yeh (MITRE) ran Challenge Cup task
Results • 18 teams submitted results (32 entries) • Winner: a team from ClearForest and Celera • Used manually generated rules and patterns to perform information extraction • Subtask results (Best / Median): • Ranked list for curation: 84% / 69% • Yes/No curate paper: 78% / 58% • Yes/No gene products: 67% / 35% • Conclusion: ranking papers for curation is promising; open question: would this help curators?
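As a present-day sketch of the ranked-list idea (Python with scikit-learn, toy data; the actual KDD Cup systems, including the winner's rule-based extractor, worked quite differently): a bag-of-words classifier scores each abstract, and curators work down the ranked list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: 1 = paper should be curated for expression data.
train_docs = [
    "in situ hybridization shows transcript expression in embryos",
    "protein localization detected by antibody staining",
    "phylogenetic analysis of gene sequences across species",
    "genome assembly and annotation pipeline description",
]
train_labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_docs), train_labels)

# Rank new abstracts by predicted probability of needing curation.
new_docs = ["antibody staining reveals protein expression in wing discs",
            "comparative analysis of genome annotation methods"]
scores = clf.predict_proba(vec.transform(new_docs))[:, 1]
for score, doc in sorted(zip(scores, new_docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```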
How Text Mining Can Help • Terminology • Tools to identify key concepts for CV or ontology • Quality & Consistency • Methods to assess consistency of annotation • Determine consistency of human performance on classification or annotation tasks • Use agreement studies to improve annotation guidelines and resources • Coverage • Speed up curation for better coverage of literature • Tools to improve ranking of articles in curation queue • Currency • Faster curation improves currency of annotations
Tasks for Text Mining: Step 1 is Finding the Text • Get the relevant sets of articles • Identifying the articles is easy now (not that much published yet) • Getting full text may be harder (permissions) • Getting text in ready-to-process form may also be harder: need it in xml or html; pdf is still hard • Locate the relevant portions of the article • Need to "parse" the document structure (headings, sections) • Tricky because this is non-standard • Need to get information from supplementary material, with the same issues of xml vs pdf
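A sketch of locating relevant sections once text is available as XML (Python; the element names here are hypothetical, and real publisher schemas such as JATS differ in detail):

```python
# Hypothetical article XML; real publisher schemas use different
# element names, and section headings are not standardized.
import xml.etree.ElementTree as ET

article = """<article>
  <sec><title>Introduction</title><p>Background ...</p></sec>
  <sec><title>Materials and Methods</title>
    <p>Specimens were collected from 5.6 m water depth ...</p></sec>
</article>"""

def section_text(xml_string, wanted=("method",)):
    """Return paragraph text from sections whose heading mentions
    any of the wanted keywords (case-insensitive)."""
    root = ET.fromstring(xml_string)
    out = []
    for sec in root.iter("sec"):
        title = (sec.findtext("title") or "").lower()
        if any(w in title for w in wanted):
            out.extend(p.text for p in sec.iter("p") if p.text)
    return out

print(section_text(article))  # ['Specimens were collected from 5.6 m ...']
```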
Tasks for Text Mining: Step 2 is Defining What to Extract • An ontology or controlled vocabulary provides the "target" for information extraction • If this doesn't exist yet, term mining may help identify candidate concepts and terms • Text mining systems require a set of examples of human-extracted information to learn from
Some More Examples • "The coastal water sample was collected from Boothbay Harbor, Maine, from 1-m depth at the Bigelow Laboratory dock (43°50'40'', 69°38'27''W) on March 28 at 9:45 a.m. during high tide (water temperature 7.0°C)." (Materials and Methods)¹ • "Our sampling site, Hawaii Ocean Time-series (HOT) station ALOHA (22°45' N, 158°W), represents one of the most comprehensively characterized sites in the global ocean and has been a focal point for time series–oriented oceanographic studies since 1988 … Specifically, seawater samples from the upper euphotic zone (10 m and 70 m), the base of the chlorophyll maximum (130 m), below the base of the euphotic zone (200 m), well below the upper mesopelagic (500 m), in the core of the dissolved oxygen minimum layer (770 m), and in the deep abyss, 750 m above the seafloor (4000 m), were collected for preparing microbial community DNA libraries …" (Study Site & Sampling Strategy)² ¹Stepanauskas et al, PNAS, May 22, 2007, vol. 104, no. 21, 9052-9057 ²DeLong et al, Science 27 January 2006: Vol. 311, no. 5760, pp. 496-503
BioCreAtIvE I Results: Gene Normalization • Yeast results good: high 0.93 F; smallest vocabulary, short names, little ambiguity • Fly: 0.82 F; high ambiguity • Mouse: 0.79 F; large vocabulary, long names • Human: ~80% (BioCreAtIvE II)
Impact of BioCreAtIvE I • BioCreAtIvE showed state of the art: • Gene name mentions: F = 0.83 • Normalized gene IDs: F = 0.8 - 0.9 • Functional annotation: F ~ 0.3 • BioCreAtIvE II • Participation 2-3x higher! • Results and workshop April 23-25, Madrid • What next? • New model of curator/text mining cooperation • Have biological curators contribute data (training and test sets) • Text mining developers work on real biological problems • RegCreative is an instance of this model
Basic Premise: Text Mining Can (at best) Reproduce What Experts Do • Therefore: if people cannot do a task consistently, it will be difficult to apply text mining • No good way to judge consistency of results, because there is no consistent "gold standard" • Natural language processing (and text mining) uses a rigorous "gold standard" evaluation procedure • Creating a human baseline: • Two humans perform the same classification task on a "blind" data set, using classification guidelines • Results are compared via a scoring metric • Used to determine whether guidelines are sufficient to ensure consistent classification • The same method can then be used to compare automated methods against the human "gold standard"
But If There Is No Ontology… [Diagram: Experimental Data → Literature Collections (MEDLINE) → Databases (SwissProt, GenBank); metadata captured via Ontologies; Term Mining feeds Ontologies] • Without an ontology, text mining tools have no target for mapping & no examples to learn from • However, term mining might help to identify concepts and terms to build the ontologies
Basic Premise: Text Mining Can (at best) Reproduce What Experts Do • Constructing an information extraction system requires examples • Either so a developer can build a set of rules to capture information • Or so machine learning can be used to learn the statistics of a particular task • If experts cannot do a task consistently, it will be difficult to apply text mining • No consistent examples to “learn” from • No good way to judge consistency of results, because there is no consistency among experts (“gold standard”)
The Biological Data Cycle [Diagram: Experimental Data → Literature Collections (MEDLINE) → Databases (SwissProt, GenBank); metadata → Ontologies] • Data goes from experiments into the literature and into biological databases • For metagenomics, it is critical to capture metadata as well • This requires ontologies/controlled vocabularies
Where Text Mining Can Help [Diagram: Experimental Data → Literature Collections (MEDLINE) → Databases (SwissProt, GenBank); Text Mining supports the metadata/ontology step] • Expert biologists transfer information from structured experimental data and free text, capturing it via ontologies or CV • Text mining tools can use expert annotations as the basis for what to extract ("training data")