
Text Annotation



Presentation Transcript


  1. Text Annotation: Soliciting knowledge from biologists; coordinating communication among biologists. Junichi TSUJII, Microsoft Research Asia; UK National Centre for Text Mining, UK

  2. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • Text annotation as knowledge solicitation • User feedback • Communication through comments on text • Concluding remarks

  3. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • Text annotation as knowledge solicitation • User feedback • Communication through comments on text • Concluding remarks

  4. Information Extraction: from text to structured representation • Text annotation links two domains through non-trivial mappings: terminology, parsing, paraphrasing • Knowledge domain: concepts and relationships among them, motivated independently of language • Language domain: linguistic expressions

  5. Syntactic variability of a single event: STAT protein nuclear translocation (GO:0007262) • In the training set (800 abstracts), the string “STAT protein nuclear translocation” never occurs. However, the concept itself occurs 10 times, in forms such as: • nuclear translocation of STAT6 • nuclear translocation of the latent transcription factor, STAT6 • translocation into nucleus of signal transducers and activators of transcription (STAT) • STAT5A and STAT5B containing complexes . . . these complexes rapidly translocated (within 1 min) into the nucleus • STAT1 nuclear import • nuclear import of NF-kappa B, AP-1, NFAT, and STAT1 in Jurkat T lymphocytes is significantly inhibited by a cell-permeable peptide carrying the NLS of the NF-kappa B p50 subunit. NLS peptide-mediated disruption of the nuclear import ... (Sophia Ananiadou)

  6. Text-Bound Annotation • ONTOLOGY: contained_in (domain: Protein_domain; region: Protein_domain|Protein); Deletion (theme: Protein_domain); Inhibition (agent: Process; theme: Process); Binding (theme1: Protein; theme2: Protein) • ANNOTATION: entities RM40, RHD, RelA, NFKBIA; relations contained_in; events Inhibition (agent and theme values lost in the transcript), Deletion (theme: RM40), Binding (theme1: RelA; theme2: NFKBIA) • TEXT (PMID:1493333): “… 3) selective deletion of the functional nuclear localization signal present in the Rel homology domain of NF-kappa B p65 disrupts its ability to engage I kappa B/MAD-3, and 4) …”
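The three layers on slide 6 (ontology, annotation, text) can be sketched in Python as follows. This is only an illustration, not the GENIA tooling: the class names, the layout of the `ONTOLOGY` dictionary and the helper functions are assumptions made for the example. Only the type names, role names and the quoted sentence come from the slide, and character offsets are computed from the text rather than taken from the corpus.

```python
from dataclasses import dataclass, field

# Ontology layer: for each event type, the roles it allows and the entity
# types each role accepts. The names come from the slide; the layout is assumed.
ONTOLOGY = {
    "Binding": {"theme1": {"Protein"}, "theme2": {"Protein"}},
    "Deletion": {"theme": {"Protein_domain"}},
}

@dataclass(frozen=True)
class Entity:
    ident: str   # normalized name, e.g. "RelA"
    etype: str   # ontology type, e.g. "Protein"
    start: int   # character offsets into the text
    end: int

@dataclass
class Event:
    etype: str
    trigger: tuple                             # (start, end) of the trigger expression
    args: dict = field(default_factory=dict)   # role name -> Entity

def span(text, phrase):
    """Locate a phrase in the text and return its (start, end) offsets."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"phrase not found: {phrase!r}")
    return start, start + len(phrase)

def validate(event):
    """Return a list of violations of the ontology's role constraints."""
    roles = ONTOLOGY.get(event.etype)
    if roles is None:
        return [f"unknown event type {event.etype}"]
    problems = []
    for role, entity in event.args.items():
        if role not in roles:
            problems.append(f"{event.etype} has no role {role}")
        elif entity.etype not in roles[role]:
            problems.append(f"{role} of {event.etype} cannot be {entity.etype}")
    return problems

TEXT = ("selective deletion of the functional nuclear localization signal "
        "present in the Rel homology domain of NF-kappa B p65 disrupts its "
        "ability to engage I kappa B/MAD-3")

# Annotation layer: entities and events bound to spans of TEXT.
rela = Entity("RelA", "Protein", *span(TEXT, "NF-kappa B p65"))
nfkbia = Entity("NFKBIA", "Protein", *span(TEXT, "I kappa B/MAD-3"))
binding = Event("Binding", span(TEXT, "engage"), {"theme1": rela, "theme2": nfkbia})

# A deliberately ill-typed event: Deletion expects a Protein_domain theme.
bad = Event("Deletion", span(TEXT, "deletion"), {"theme": rela})

print(validate(binding))
print(validate(bad))
```

Keeping the ontology separate from the annotations means the same role constraints can be checked for every annotated sentence, which is the point of binding annotations to both the text and a fixed set of types.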

  7. Information Extraction: from text to structured representation • Text annotation links the knowledge domain and the language domain through non-trivial mappings: terminology, parsing, paraphrasing • Linguistic annotation: POS (part-of-speech), phrase structure, deep argument structure of HPSG, co-reference, discourse • Annotation ontologies: MeSH, GENIA NE ontology, GO, GENIA Event ontology, GENIA Relation ontology, Meta-Knowledge ontology • Semantic annotation: Term/NE, Event, Static Relation, Meta-Knowledge

  8. Annotation of GENIA corpus – Term & POS • Part-of-speech annotation: 2,000 abstracts • Term (entity) annotation: 2,000 + 400 abstracts

  9. Annotation of GENIA corpus – Process&Tree

  10. Event Annotation - Example

  11. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • Text annotation as knowledge solicitation • User feedback • Communication through comments on text • Concluding remarks

  12. GENIA event ontology • 30 GO terms under Biological Process • Regulation: regulatory events, causal relationship • Artificial process (experimental): artificially performed processes, e.g. transfection, treatment • Correlation (experimental): meaning ‘any’ relation between events • [Figure: GENIA event ontology tree with numeric counts on its nodes; label: Events of the Shared Tasks (BioNLP 09)]

  13. Event Annotation - Example

  14. Graph Kernel using all shortest paths • Ex. (NMOD:IP, PMOD:IP, 0.4), … • [Figure: two graph representations of one example sentence containing ENTITY1, ENTITY2 and PROT. Parse tree: tokens with POS tags (NN, VBZ, IN, JJ, NNS, CC) joined by dependency labels (SBJ, NMOD, PMOD, COOD, CC, P), with some tokens and labels tagged IP. Sequence of words: the same tokens with POS tags and position tags such as M and A]
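As a rough illustration of the "shortest path" idea on slide 14, the sketch below finds the shortest dependency path between two candidate entities and tags the tokens on it with `IP`. It does not implement the graph kernel itself. The dependency edges are hypothetical (the figure is not recoverable from the transcript), the function name is invented, and reading `IP` as "in the shortest path" is an assumption.

```python
from collections import deque

def shortest_path(edges, source, target):
    """Breadth-first search over an undirected view of the dependency graph."""
    graph = {}
    for head, dependent, _label in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the back-pointers from target to source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical dependency edges as (head, dependent, label) triples.
EDGES = [
    ("interact", "ENTITY1", "SBJ"),
    ("interact", "with_1", "NMOD"),
    ("with_1", "subunit", "PMOD"),
    ("subunit", "multiple", "NMOD"),
    ("subunit", "of", "NMOD"),
    ("of", "PROT", "PMOD"),
    ("interact", "with_2", "NMOD"),
    ("with_2", "ENTITY2", "PMOD"),
]

path = shortest_path(EDGES, "ENTITY1", "ENTITY2")
tokens = ["ENTITY1", "interact", "with_1", "multiple", "subunit",
          "of", "PROT", "with_2", "ENTITY2"]
# Tag tokens on the shortest path with IP; leave the others untagged.
tags = {token: ("IP" if token in path else "") for token in tokens}
print(path)
```

Tokens off the path (here the "multiple subunit of PROT" branch) stay untagged, which is what lets a kernel weight path tokens differently from the rest of the sentence.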

  15. GENIA event ontology • 30 GO terms under Biological Process • Regulation: regulatory events, causal relationship • Artificial process (experimental): artificially performed processes, e.g. transfection, treatment • Correlation (experimental): meaning ‘any’ relation between events • [Figure: GENIA event ontology tree with numeric counts on its nodes; label: Events of the Shared Tasks]

  16. Evaluation: BioNLP 2009 Shared Task Data • BioNLP ST 2009 evaluation server • 24 teams joined the campaign. • The performance of the other systems was below 45.00.

  17. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • Text annotation as knowledge solicitation • User feedback • Communication through comments on text • Concluding remarks

  18. BioNLP 2011 Shared Task • Theme: Generalization • Text styles: abstract, full text • Domains: GENIA (transcription factors in human blood cells); Infectious disease (NaCTeM, Virginia Bioinformatics Institute); Model bacteria (DBCLS, INRA); Epigenetic change (Univ. Tokyo) • Event types: e.g. post-translational modification (acetylation, methylation, …) • Robustness, transfer learning, adaptation • Fine-grained information access, meta-knowledge • Base components for applications

  19. GENIA event ontology • 30 GO terms under Biological Process • Regulation: regulatory events, causal relationship • Artificial process (experimental): artificially performed processes, e.g. transfection, treatment • Correlation (experimental): meaning ‘any’ relation between events • [Figure: GENIA event ontology tree with numeric counts on its nodes; label: Events of the Shared Tasks (BioNLP 09)]

  20. Main Task: Epigenetics and post-translational modifications (U-Tokyo/NaCTeM) • Basic task setting and data following BioNLP'09 shared task format • DNA modification and PTM events similar to '09 Phosphorylation events • Existing retrainable systems can be applied with little modification • New event types: DNA methylation, six PTM types, reverse reactions (e.g. deacetylation) and catalysis: 15 event types in total • New PTM-specific participant roles (optional subtask) • Side chain attached to proteins in Glycosylation • Context gene affected by histone modifications • Annotation for PubMed abstracts relevant to these events • No further subdomain restrictions, data selected to avoid bias • Representative of general distribution of epigenetics and PTM-related publications in the whole literature

  21. Main Task: Epigenetics and post-translational modifications • Epigenetic control of gene expression without changes in DNA sequence is a major focus of recent study • Key events: DNA methylation and histone post-translational modifications (acetylation and methylation) • Important roles in many biological processes, implicated in cancer • Phosphorylation, a protein post-translational modification (PTM), was the most reliably extracted event at the BioNLP'09 shared task • 76% F-score for extraction of phosphorylated protein and site
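The 76% figure on slide 21 is an F-score, the harmonic mean of precision and recall. A minimal sketch of the computation follows. The counts are invented for illustration and the function name is hypothetical. The shared task's own criteria for deciding when an extracted event matches a gold event are not reproduced here.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented counts: 76 correct events, 24 spurious ones, 24 missed ones.
score = f_score(true_positives=76, false_positives=24, false_negatives=24)
print(round(score, 2))
```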

  22. Main Task: Infectious Diseases (NaCTeM, U-Tokyo, Virginia Tech) • Task setup and core events following BioNLP'09 Shared Task • Expression, Catabolism, Localization, Binding, etc. • New event type: Process • High-level biological processes such as “virulence” frequently discussed without stating specific participants (e.g. Theme) • New entity types (given, NER not required) • Chemical, Organism, Two-component system • New subtask (optional) • Identification of environmental variables (Acidity and Temperature) specifying the conditions in which events are stated to occur

  23. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • Text annotation as knowledge solicitation • User feedback • Communication through comments on text • Concluding remarks

  24. Semantics-based, Fine-Grained Information Access • Document retrieval, information retrieval: unit of retrieval: article, document; expression of user intention: controlled or non-controlled keywords; indexes: character sequences, keywords • Semantics-based, fine-grained information access system: unit of retrieval: paragraphs, sentences, phrases; expression of user intention: simple but semantically enriched; indexes: semantics-based structured meta-data • Question answering
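The contrast on slide 24 can be made concrete with a toy index whose unit of retrieval is a sentence and whose keys are structured metadata rather than keywords. This is a sketch under stated assumptions, not how MEDIE or any NaCTeM system is built: the `EventIndex` class, its methods and the event labels attached to the sentences are all hypothetical. The two text snippets and their identifiers are quoted from other slides of the talk.

```python
from collections import defaultdict

class EventIndex:
    """Toy sentence-level index keyed by (event type, participant)."""

    def __init__(self):
        self._postings = defaultdict(set)   # (event type, participant) -> ids
        self._sentences = {}                # id -> sentence text

    def add(self, sentence_id, text, events):
        """Index a sentence under every (event type, participant) pair."""
        self._sentences[sentence_id] = text
        for event_type, participants in events:
            for participant in participants:
                self._postings[(event_type, participant)].add(sentence_id)

    def search(self, event_type, participant):
        """Return matching (id, sentence) pairs, not whole documents."""
        ids = self._postings.get((event_type, participant), set())
        return [(sid, self._sentences[sid]) for sid in sorted(ids)]

index = EventIndex()
# The event labels below are illustrative guesses, not corpus annotations.
index.add(
    "8319912-S2",
    "Transcription factor NF-kappa B (p50/p65) is generally localized to "
    "the cytoplasm by its inhibitor I kappa B alpha.",
    [("Localization", ["NF-kB"])],
)
index.add(
    "1493333",
    "selective deletion of the functional nuclear localization signal "
    "present in the Rel homology domain of NF-kappa B p65 disrupts its "
    "ability to engage I kappa B/MAD-3",
    [("Binding", ["RelA", "NFKBIA"])],
)

hits = index.search("Localization", "NF-kB")
print(hits[0][0])
```

A keyword index would return both snippets for a query such as "NF-kappa B", whereas the structured key separates the localization statement from the binding statement.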

  25. Coarse-grained text retrieval

  26. Fine-grained information access

  27. Example: PathText (NaCTeM (U-Manchester), U-Tokyo, SBI) • B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, J. Tsujii: PathText: a text mining integrator for biological pathway visualizations, Bioinformatics, Vol. 26 (12), Oxford University Press, 2010

  28. Toll-Like Receptor (TLR) pathway • Nodes: 652 • Links: 444 • 600 papers were read to construct the pathway • Oda K, Matsuoka Y, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Mol Syst Biol 2005, 1:2005.0010.

  29. Knowledge Integration: Pathways and Literature • Pathways: pathways integrate pieces of biological knowledge into coherent interpretations; they are recognized as an important means of representing biological knowledge • Literature: Medline contains over 18 million articles; more than 0.5 million articles are added every year, which is roughly 1.4 thousand articles per day (0.5 million / 365)

  30. Pathways and Literature • Pathway construction and literature • Pathway construction mostly relies on literature • Most important discoveries are reported in published papers. • The full context of each discovery is described by the paper reporting it. • Pathway maintenance and literature • New discoveries should lead to revisions of the relevant portions of pathways. • However, the rapidly growing literature makes it extremely difficult to identify relevant new discoveries. • PathText: NaCTeM, U-Tokyo, SBI

  31. • Literature: piecewise knowledge • Pathway: qualitative model (CellDesigner) • Network for simulation: quantitative model (SBML) • Links between the layers: interpretation, abstraction; enrichment, grounding • University of Tokyo; NaCTeM/University of Manchester; Systems Biology Institute/OIST

  32. • SBML network • Network by CellDesigner: TAK1, IKK, IKK_p; kinetic parameters • Textual semantics • User semantics • Text mining resources: FACTA, KLEIO, MEDIE • GUI visualization

  33. Use biological pathway visualizations as a user interface for knowledge discovery.

  34. Use biological pathway visualizations as a user interface for knowledge discovery. Curators’ workbench: Argo

  35. Event Representation: BioNLP ST event representation • An event type is assigned from a fixed ontology • Bound to a specific expression in text • Associated with an arbitrary number of participants • Ohta, T., Pyysalo, S., Ananiadou, S., Tsujii, J. (2011): Pathway Curation Support as an Information Extraction Task, LBM 2011
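The three properties on slide 35 (a typed event, a trigger bound to text, any number of participants) map naturally onto a line-based standoff representation. The sketch below parses a small sample modelled on the tab-separated standoff files distributed with the BioNLP shared tasks. The sample sentence, identifiers and offsets are invented, `parse_standoff` is a hypothetical helper, and details of the real files such as modification or equivalence lines are ignored.

```python
TEXT = "RelA binds to NFKBIA"

# Invented sample: T lines are text-bound spans, E lines are events.
SAMPLE = [
    "T1\tProtein 0 4\tRelA",
    "T2\tProtein 14 20\tNFKBIA",
    "T3\tBinding 5 10\tbinds",
    "E1\tBinding:T3 Theme:T1 Theme2:T2",
]

def parse_standoff(lines):
    """Split standoff lines into text-bound spans and events."""
    spans, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, rest = line.split("\t", 1)
        if ident.startswith("T"):
            info, surface = rest.split("\t", 1)
            label, start, end = info.split(" ")
            spans[ident] = {"type": label, "start": int(start),
                            "end": int(end), "text": surface}
        elif ident.startswith("E"):
            # First item is "EventType:trigger"; the rest are "Role:argument".
            head, *arguments = rest.split(" ")
            event_type, trigger = head.split(":")
            events[ident] = {"type": event_type, "trigger": trigger,
                             "args": [tuple(a.split(":")) for a in arguments]}
    return spans, events

spans, events = parse_standoff(SAMPLE)
event = events["E1"]
trigger = spans[event["trigger"]]
print(event["type"], TEXT[trigger["start"]:trigger["end"]], event["args"])
```

Because arguments are a list of role and identifier pairs, an event can carry one participant or many without changing the format, and an argument may point at another event as easily as at an entity.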

  36. Mapping reactions to text: PathText Link to text mining results

  37. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • User feedback • Text annotation as knowledge solicitation • Communication through comments on text • Concluding remarks

  38. Overview • Query • Relevance feedback • Curation • Text mining tool kit: GENIA tagger, EventMine • Pathways • Users • Improved querying • Improved ranking • Accurate search • High coverage

  39. Improved ranking for PathText • Document-to-reaction relevance judgments • Used for evaluating ranking methods and for training ML methods • Judgment labels: Relevant, Partly relevant
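Graded judgments such as "Relevant" and "Partly relevant" can be used to score a ranking. The slide does not say which measure was used, so the sketch below swaps in normalized discounted cumulative gain (nDCG), a common choice for graded relevance. The numeric grades, the "not relevant" label and the function names are assumptions.

```python
import math

# Assumed mapping from judgment labels to numeric gains.
GRADES = {"relevant": 2, "partly relevant": 1, "not relevant": 0}

def dcg(gains):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(gains, start=1))

def ndcg(ranked_labels):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    gains = [GRADES[label] for label in ranked_labels]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Documents retrieved for one reaction, in ranked order (invented judgments).
good_ranking = ["relevant", "partly relevant", "not relevant"]
poor_ranking = ["not relevant", "partly relevant", "relevant"]
print(ndcg(good_ranking), ndcg(poor_ranking))
```

The same judgments could also serve as training labels for a learned ranker, which is the second use the slide mentions.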

  40. Limitation of Text-bound annotation

  41. NF-kB pathway, the GENIA version

  42. • “Transcription factor NF-kappa B (p50/p65) is generally localized to the cytoplasm by its inhibitor I kappa B alpha.” (8319912-S2): NF-kB, cytoplasm, interaction with IkBa • “The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, …” (1493333-S2): NF-kB, active, nuclear

  43. Plan of Talk • Text annotation for information extraction • Current State of the Art: Event Recognition • Shared Task: BioNLP 09 • Shared Task: BioNLP 11 • Information Integration: Text and knowledge • Fine-Grained IR and PathText • Text annotation as knowledge solicitation • User feedback • Communication through comments on text • Concluding remarks

  44. • Processing Components: search engines, NLP, NER, export to XML, editors, etc. • Workflows: GUI for creating single-flow and multi-branch workflows • Web Service • Remote Processing: workflows are processed on remote machines; no attendance needed • User Interaction: the Annotation Editor allows for making changes while processing • Roles shown: Developers, Workflow Designer, Third-party applications, Annotator/Curator

  45. Annotation tools • Argo • brat
