Text Annotation
Soliciting knowledge from biologists
Coordinating communication among biologists
Junichi TSUJII
Microsoft Research Asia
UK National Centre for Text Mining, UK
Plan of Talk
• Text annotation for information extraction
• Current State of the Art: Event Recognition
• Shared Task: BioNLP 09
• Shared Task: BioNLP 11
• Information Integration: Text and knowledge
• Fine-Grained IR and PathText
• Text annotation as knowledge solicitation
• User feedback
• Communication through comments on text
• Concluding remarks
Information Extraction: From text to structured representation
Text annotation links two domains through non-trivial mappings (terminology, parsing, paraphrasing):
• Knowledge domain: concepts and relationships among them, motivated independently of language
• Language domain: linguistic expressions
Syntactic variability of a single event
STAT protein nuclear translocation (GO:0007262)
In the training set (800 abstracts), the exact string “STAT protein nuclear translocation” never occurs. The concept itself, however, is expressed 10 times, for example:
• nuclear translocation of STAT6
• nuclear translocation of the latent transcription factor, STAT6
• translocation into nucleus of signal transducers and activators of transcription (STAT)
• STAT5A and STAT5B containing complexes . . . these complexes rapidly translocated (within 1 min) into the nucleus
• STAT5B containing complexes . . . these complexes rapidly translocated (within 1 min) into the nucleus
• STAT1 nuclear import
• nuclear import of NF-kappa B, AP-1, NFAT, and STAT1
• STAT1 in Jurkat T lymphocytes is significantly inhibited by a cell-permeable peptide carrying the NLS of the NF-kappa B p50 subunit. NLS peptide-mediated disruption of the nuclear import ...
Sophia Ananiadou
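The gap between the ontology label and its surface forms can be illustrated with a small sketch. The phrases are taken from the slide; the two cue patterns are illustrative assumptions and not the method used in the talk.

```python
import re

GO_LABEL = "STAT protein nuclear translocation"

# Surface forms quoted on the slide.
PHRASES = [
    "nuclear translocation of STAT6",
    "translocation into nucleus of signal transducers and activators of transcription (STAT)",
    "these complexes rapidly translocated (within 1 min) into the nucleus",
    "STAT1 nuclear import",
]

# Hypothetical cue patterns; a real system would use parsing and term normalization.
MOVEMENT = re.compile(r"\b(translocat\w*|import)\b", re.IGNORECASE)
LOCATION = re.compile(r"\b(nuclear|nucleus)\b", re.IGNORECASE)


def exact_match(phrase):
    """True only if the ontology label appears verbatim in the phrase."""
    return GO_LABEL.lower() in phrase.lower()


def cue_match(phrase):
    """True if the phrase contains both a movement cue and a nuclear-location cue."""
    return bool(MOVEMENT.search(phrase)) and bool(LOCATION.search(phrase))


exact_hits = [p for p in PHRASES if exact_match(p)]
cue_hits = [p for p in PHRASES if cue_match(p)]
```

Exact matching finds none of the phrases, while the cue patterns find all four. The sketch does not check which protein moves: the third phrase only says “these complexes”, so resolving it to STAT5 would require co-reference handling.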
Text-Bound Annotation
ONTOLOGY
• Entity types: protein, domain
• Relation: contained_in (domain: Protein_domain; region: Protein_domain | Protein)
• Events: Deletion (theme: Protein_domain); Inhibition (agent: Process; theme: Process); Binding (theme1: Protein; theme2: Protein)
ANNOTATION
• Entities: RM40, RHD, RelA, NFKBIA
• Relations: contained_in (three instances)
• Events: Inhibition (agent; theme); Deletion (theme: RM40); Binding (theme1: RelA; theme2: NFKBIA)
• Event triggers: inhibit, bind, delete
TEXT (PMID:1493333)
… 3) selective deletion of the functional nuclear localization signal present in the Rel homology domain of NF-kappa B p65 disrupts its ability to engage I kappa B/MAD-3, and 4) …
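The three layers on this slide (ontology types, annotation instances, text spans) can be sketched as a small data structure. This is a minimal illustration, not the GENIA annotation format: the class names, the choice of trigger words, the mapping of RM40 to the phrase “nuclear localization signal”, and the filling of the Inhibition arguments with the Deletion and Binding events are all assumptions read off the example sentence.

```python
from dataclasses import dataclass, field

TEXT = ("selective deletion of the functional nuclear localization signal "
        "present in the Rel homology domain of NF-kappa B p65 disrupts its "
        "ability to engage I kappa B/MAD-3")


@dataclass
class Span:
    start: int
    end: int


@dataclass
class Entity:
    name: str
    type: str
    span: Span


@dataclass
class Event:
    type: str
    trigger: Span
    args: dict = field(default_factory=dict)


def bind(expression):
    """Locate an expression in TEXT and return its character span."""
    start = TEXT.index(expression)
    return Span(start, start + len(expression))


def covered(span):
    """Return the text a span is bound to."""
    return TEXT[span.start:span.end]


# Entities bound to text (span choices are assumptions).
rm40 = Entity("RM40", "Protein_domain", bind("nuclear localization signal"))
rela = Entity("RelA", "Protein", bind("NF-kappa B p65"))
nfkbia = Entity("NFKBIA", "Protein", bind("I kappa B/MAD-3"))

# Events bound to trigger expressions (trigger choices are assumptions).
deletion = Event("Deletion", bind("deletion"), {"theme": rm40})
binding = Event("Binding", bind("engage"), {"theme1": rela, "theme2": nfkbia})
# Assumption: the Inhibition event takes the other two events as arguments.
inhibition = Event("Inhibition", bind("disrupts"),
                   {"agent": deletion, "theme": binding})
```

Because events can take other events as arguments, the annotation forms a nested structure anchored in the text rather than a flat list of relations.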
Information Extraction: From text to structured representation
Text annotation: non-trivial mappings (terminology, parsing, paraphrasing) between the knowledge domain and the language domain
Linguistic annotation
• POS (part-of-speech)
• Phrase structure
• Deep argument structure of HPSG
• Co-reference
• Discourse
Semantic annotation
• Term/NE
• Event
• Static relation
• Meta-knowledge
Annotation ontology
• MeSH
• GENIA NE ontology
• GO
• GENIA Event ontology
• GENIA Relation ontology
• Meta-knowledge ontology
Annotation of GENIA corpus: Term & POS
• Part-of-speech annotation: 2,000 abstracts
• Term (entity) annotation: 2,000 + 400 abstracts
GENIA event ontology
[Figure: GENIA event ontology tree with a number in parentheses on each node; legend: Events of the Shared Tasks (BioNLP 09)]
• GENIA event ontology
  • 30 GO terms under Biological Process
  • Regulation
    • Regulatory events
    • Causal relationship
  • Artificial process (experimental)
    • Artificially performed processes, e.g. transfection, treatment
  • Correlation (experimental)
    • Meaning ‘any’ relation between events
Graph Kernel using all shortest paths
[Figure: one sentence represented as two graphs. Parse tree: tokens with POS tags (ENTITY1/NN, interact/VBZ, with/IN, multiple/JJ, subunit/NNS, of/IN, PROT/NN, and/CC, ENTITY2/NN, protein/NN) connected by dependency labels (SBJ, NMOD, PMOD, COOD/COORD, CC, P), with some nodes and edges marked IP. Sequence of words: the same tokens in linear order, each marked M or A.]
Ex. (NMOD:IP PMOD:IP, 0.4), …
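The shortest-path ingredient of such a kernel can be sketched with a breadth-first search over a dependency graph. The graph below is a toy reconstruction, since the flattened slide does not preserve which tokens are connected; the edges and the node name `with_2` are hypothetical. It is also an assumption that the IP marks in the figure flag tokens lying on the shortest path between the two candidate entities.

```python
from collections import deque

# Toy, undirected dependency edges (hypothetical reconstruction).
EDGES = [
    ("ENTITY1", "interact"),
    ("interact", "with_1"),
    ("with_1", "subunit"),
    ("subunit", "multiple"),
    ("subunit", "of"),
    ("of", "PROT"),
    ("interact", "with_2"),
    ("with_2", "ENTITY2"),
]


def shortest_path(edges, source, target):
    """Breadth-first search; returns the node list from source to target, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(EDGES, "ENTITY1", "ENTITY2")
on_path = set(path)
# Tokens on the path would receive the IP mark under the assumption above.
labels = {tok: ("IP" if tok in on_path else "") for edge in EDGES for tok in edge}
```

A kernel would then compare two sentences through weighted label pairs collected along such paths; the weight 0.4 in the slide's example is not derived here.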
Evaluation: BioNLP 2009 Shared Task Data
• BioNLP ST 2009 evaluation server
• 24 teams joined the campaign.
• The performances of the other systems were below 45.00.
BioNLP 2011 Shared Task
• Theme: Generalization
  • Text styles
    • Abstract
    • Full text
  • Domains
    • GENIA (transcription factors in human blood cells)
    • Infectious disease (NaCTeM, Virginia Bioinformatics Institute)
    • Model bacteria (DBCLS, INRA)
    • Epigenetic change (Univ. Tokyo)
  • Event types
    • e.g. post-translational modification: acetylation, methylation, …
• Robustness, transfer learning, adaptation
• Fine-grained information access, meta-knowledge
• Base components for application
Main Task: Epigenetics and post-translational modifications (U-Tokyo/NaCTeM)
• Basic task setting and data following the BioNLP'09 shared task format
  • DNA modification and PTM events similar to '09 Phosphorylation events
  • Existing retrainable systems can be applied with little modification
• New event types: DNA methylation, six PTM types, reverse reactions (e.g. deacetylation) and catalysis; 15 event types in total
• New PTM-specific participant roles (optional subtask)
  • Side chain attached to proteins in Glycosylation
  • Context gene affected by histone modifications
• Annotation for PubMed abstracts relevant to these events
  • No further subdomain restrictions, data selected to avoid bias
  • Representative of the general distribution of epigenetics and PTM-related publications in the whole literature
Main Task: Epigenetics and post-translational modifications
• Epigenetic control of gene expression without changes in DNA sequence is a major focus of recent study
  • Key events: DNA methylation and histone post-translational modifications (acetylation and methylation)
  • Important roles in many biological processes, implicated in cancer
• Phosphorylation, a protein post-translational modification (PTM), was the most reliably extracted event at the BioNLP'09 shared task
  • 76% F-score for extraction of phosphorylated protein and site
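The F-score quoted for Phosphorylation is the harmonic mean of precision and recall over extracted events. A minimal sketch follows; the counts in the example are invented for illustration and are not results from the shared task.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 3 correct events, 1 spurious, 1 missed.
example = f_score(true_positives=3, false_positives=1, false_negatives=1)
```

With these counts precision and recall are both 0.75, so the F-score is 0.75 as well.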
Main Task: Infectious Diseases (NaCTeM, U-Tokyo, Virginia Tech)
• Task setup and core events following the BioNLP'09 Shared Task
  • Expression, Catabolism, Localization, Binding, etc.
• New event type: Process
  • High-level biological processes such as “virulence” are frequently discussed without stating specific participants (e.g. Theme)
• New entity types (given, NER not required)
  • Chemical, Organism, Two-component system
• New subtask (optional)
  • Identification of environmental variables (Acidity and Temperature) specifying the conditions in which events are stated to occur
Semantics-based, Fine-Grained Information Access
Document Retrieval, Information Retrieval
• Unit of retrieval: article, document
• Expression of user intention: controlled or non-controlled keywords
• Indexes: character sequences, keywords
Semantics-based, Fine-Grained Information Access system
• Unit of retrieval: paragraphs, sentences, phrases
• Expression of user intention: simple but semantically enriched
• Indexes: semantics-based structured meta-data
Question Answering
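The contrast between keyword indexes and semantics-based structured meta-data can be sketched on two sentences. The sentence IDs follow the PMID-S<n> style and the two sentences are quoted from the cited abstracts; the meta-data field names (`entity`, `location`, `partner`, `state`) and the tiny index are hypothetical.

```python
from collections import defaultdict

SENTENCES = {
    "8319912-S2": {
        "text": ("Transcription factor NF-kappa B (p50/p65) is generally localized "
                 "to the cytoplasm by its inhibitor I kappa B alpha."),
        "meta": {"entity": "NF-kB", "location": "cytoplasm", "partner": "IkBa"},
    },
    "1493333-S2": {
        "text": ("The active nuclear form of the NF-kappa B transcription factor "
                 "complex is composed of two DNA binding subunits, …"),
        "meta": {"entity": "NF-kB", "location": "nucleus", "state": "active"},
    },
}


def build_index(sentences):
    """Map each (field, value) pair to the set of sentence IDs carrying it."""
    index = defaultdict(set)
    for sentence_id, record in sentences.items():
        for key, value in record["meta"].items():
            index[(key, value)].add(sentence_id)
    return index


def search(index, **constraints):
    """Return sentence IDs whose meta-data satisfies every constraint."""
    hits = None
    for item in constraints.items():
        ids = index.get(item, set())
        hits = ids if hits is None else hits & ids
    return sorted(hits or [])


INDEX = build_index(SENTENCES)
# Keyword search cannot separate the two sentences.
keyword_hits = sorted(sid for sid, rec in SENTENCES.items()
                      if "NF-kappa B" in rec["text"])
# A structured query can.
semantic_hits = search(INDEX, entity="NF-kB", location="cytoplasm")
```

The keyword query returns both sentences, while the structured query returns only the one that places NF-kB in the cytoplasm.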
Example: PathText (NaCTeM (U-Manchester), U-Tokyo, SBI)
B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, J. Tsujii: PathText: a text mining integrator for biological pathway visualizations, Bioinformatics, Vol. 26 (12), Oxford University Press, 2010
Toll-Like Receptor (TLR) pathway
• Nodes: 652
• Links: 444
• 600 papers were read to construct the pathway
Oda K, Matsuoka Y, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Mol Syst Biol 2005, 1:2005.0010.
Knowledge Integration: Pathways and Literature
Pathways
• Pathways integrate biological knowledge pieces into coherent interpretations.
• Pathways have been recognized as an important means of representing biological knowledge.
Literature
• Medline contains over 18 million articles.
• More than 0.5 million articles are added every year, which means more than 1.3 thousand articles per day.
Pathways and Literature
• Pathway construction and literature
  • Pathway construction mostly relies on literature.
  • Most important discoveries are reported in published papers.
  • The full context of each discovery is described by the paper reporting it.
• Pathway maintenance and literature
  • New discoveries should lead to revisions of the relevant portions of pathways.
  • However, the rapidly growing amount of literature makes it extremely difficult to identify relevant new discoveries.
PathText: NaCTeM, U-Tokyo, SBI
[Figure: three levels of knowledge and the links between them]
• Pathway: qualitative model (CellDesigner)
• Network for simulation: quantitative model (SBML)
• Literature: piecewise knowledge
• Links between levels: interpretation, abstraction; enrichment, grounding
Partners: University of Tokyo; NaCTeM/University of Manchester; Systems Biology Institute/OIST
[Figure: an SBML network and a network drawn with CellDesigner (example nodes TAK1, IKK, IKK_p; kinetic parameters), connected through textual semantics and user semantics to the text mining resources FACTA, KLEIO and MEDIE, with a GUI for visualization]
Use biological pathway visualizations as a user interface for knowledge discovery.
Curators’ workbench: Argo
Event Representation
BioNLP ST event representation
• An event type is assigned from a fixed ontology
• Bound to a specific expression in text
• Associated with an arbitrary number of participants
Ohta, T., Pyysalo, S., Ananiadou, S., Tsujii, J. (2011). Pathway Curation Support as an Information Extraction Task. LBM 2011.
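The three properties listed above map directly onto a standoff encoding. The sketch below assumes the tab-separated layout commonly used for the BioNLP shared task data, with `T` lines for text-bound expressions and `E` lines for events; the example text and annotation lines are invented, and the parser ignores other line types.

```python
TEXT = "phosphorylation of STAT6"

# Invented annotation lines in the assumed standoff layout.
LINES = [
    "T1\tProtein 19 24\tSTAT6",
    "T2\tPhosphorylation 0 15\tphosphorylation",
    "E1\tPhosphorylation:T2 Theme:T1",
]


def parse_standoff(lines):
    """Parse text-bound (T) and event (E) lines into two dictionaries."""
    triggers, events = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ident = fields[0]
        if ident.startswith("T"):
            type_, start, end = fields[1].split(" ")
            triggers[ident] = {"type": type_, "start": int(start),
                               "end": int(end), "text": fields[2]}
        elif ident.startswith("E"):
            # First item is "EventType:triggerId"; the rest are "Role:id" pairs.
            head, *rest = fields[1].split(" ")
            type_, trigger_id = head.split(":")
            participants = [tuple(arg.split(":")) for arg in rest]
            events[ident] = {"type": type_, "trigger": trigger_id,
                             "participants": participants}
    return triggers, events


triggers, events = parse_standoff(LINES)
event = events["E1"]
trigger = triggers[event["trigger"]]
```

The event type comes from the ontology, the trigger ID binds the event to a text expression, and the participant list can hold any number of role and ID pairs, including IDs of other events.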
Mapping reactions to text: PathText Link to text mining results
Overview
• Users: query, relevance feedback, curation
• Text mining tool kit: GENIA tagger, EventMine
• Pathways
• Improved querying, improved ranking
• Accurate search, high coverage
Improved ranking for PathText
• Document-to-reaction relevance judgments
• Use for ranking method evaluation and ML method training
• Judgment labels: Relevant, Partly relevant
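Graded judgments of this kind can be used to score a ranking. The slides do not say which metric was used; the sketch below uses nDCG, one common choice for graded relevance. The gain values and the extra “not relevant” label are assumptions.

```python
import math

# Assumed gains for the judgment labels; "not relevant" is added for illustration.
GAIN = {"relevant": 2, "partly relevant": 1, "not relevant": 0}


def dcg(labels):
    """Discounted cumulative gain of a ranked list of judgment labels."""
    return sum(GAIN[label] / math.log2(rank + 1)
               for rank, label in enumerate(labels, start=1))


def ndcg(labels):
    """DCG normalized by the DCG of the ideal ordering of the same labels."""
    ideal = sorted(labels, key=GAIN.get, reverse=True)
    best = dcg(ideal)
    return dcg(labels) / best if best else 0.0


good_ranking = ["relevant", "partly relevant", "not relevant"]
poor_ranking = ["not relevant", "partly relevant", "relevant"]
```

A ranking that puts relevant documents first scores 1.0, and rankings that bury them score lower, which gives a single number for comparing ranking methods on the judged document-to-reaction pairs.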
• “Transcription factor NF-kappa B (p50/p65) is generally localized to the cytoplasm by its inhibitor I kappa B alpha.” (8319912-S2)
  NF-kB, cytoplasm, interaction with IkBa
• “The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, …” (1493333-S2)
  NF-kB, active, nuclear
• Processing Components (Developers): search engines, NLP, NER, export to XML, editors, etc.
• Workflows (Web Service, Workflow Designer, third-party applications): GUI for creating single-flow and multi-branch workflows
• Remote Processing: workflows are processed on remote machines; no attendance needed
• User Interaction (Annotator/Curator): the Annotation Editor allows for making changes while processing
Annotation tools • Argo • brat