
The Research Assistant for Biological Text Mining



Presentation Transcript


  1. Software for Biotech and Pharma Research. The Research Assistant for Biological Text Mining. Luc Dehaspe and other members of the BioMinT Consortium

  2. Text Mining in the biological domain • Emerging field of research and development • 40+ articles in “Bioinformatics 2004” • Dedicated workshops, competitions and interest groups • Information retrieval and extraction to deal with information overflow • 12 million citations in Medline from 4600 journals • Many more resources on the web • Essential link in the semantic integration of the numerous biological resources.

  3. Use of text mining for database annotation • curated protein sequence database • high level of annotation of proteins • high level of integration with other databases (Figure: Swiss-Prot Entry Creation Flowchart)

  4. Use of database annotations for text mining • Tools for information retrieval, filtering, classification and extraction rely on: • Corpora of examples used by machine learning methods; • Linguistic analysis and controlled vocabularies (ontologies, thesauri, biological dictionaries). • Databases provide semi-structured information that could be used • for corpus elaboration • as specific vocabulary resources

  5. • 3-year FP5 European Project, started in January 2003 • Official web site: www.biomint.org • Interdisciplinary consortium: University of Antwerp (BE), Austrian Research Institute for AI, University of Manchester (UK), PharmaDM (BE), Swiss Institute of Bioinformatics, University of Geneva (CH) (diagram labels: Artificial Intelligence, Biological Sciences, Coordinator)

  6. The goals of BioMinT • To develop a generic text mining tool that: • interprets different types of queries • retrieves relevant documents from the biological literature • extracts the required information • outputs the result as a database slot filler or as a structured report • The tool thus provides two essential research support services: • Curator's Assistant: accelerate, by partially automating, the annotation and update of databases; • Researcher's Assistant: generate readable reports in response to queries from biological researchers.

  7. Curator’s Assistant for Swiss-Prot Annotation (diagram labels: Comments, Definition, Gene name, Reference content, Reference comments, Keywords, Sequence features)

  8. Curator’s Assistant for PRINTS annotation • PRINTS deals with groups of proteins • Annotation of 3 types of protein fingerprints: Family, Super-family, Domain-family • Extracted Information (diagram groups): High level function, High level structure, Disease associations, Subcellular location, Tissue distribution, etc…; Low level function, Super-family structure, Disease associations, Number of subtypes, etc…; Domain structure, Domain function

  9. The Biological Research Assistant • Overlap with Curator’s Assistant • All biologists occasionally in the curator’s seat • Keep ahead of Swiss-Prot in research area of interest • Include private (confidential) document collections (Figures: Swiss-Prot Entry Creation Flowchart; Biological Researcher’s Literature Screening Flowchart)

  10. Information retrieval and extraction modules • GUI • IR: Query expansion, PubMed search, Document filtering/ranking, Document organisation • IE: Sentence extractor, NLP tools, Case frame generator

  11. Information retrieval and extraction modules • GUI • IR: Query expansion, PubMed search, Document filtering/ranking, Document organisation • IE: Sentence extractor, NLP tools, Case frame generator

  12. Information Retrieval • A meta-query engine built around PubMed • Expansion of the initial query with synonyms using a gene/protein synonym database (GPSDB) • the goal being to retrieve an exhaustive set of documents containing information on a protein. • Filtering and ranking of the retrieved documents • Pre-classification according to information topics.
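
The synonym-based query expansion described on this slide can be sketched roughly as follows. This is a minimal illustration, not the BioMinT implementation: the `SYNONYM_TABLE` dictionary is a hypothetical stand-in for GPSDB (its three entries reuse the lap2 example shown later in the presentation), and the function only builds an OR-joined query string without contacting PubMed.

```python
# Hypothetical stand-in for GPSDB: protein entity -> set of known names.
SYNONYM_TABLE = {
    "Erbin": {"Erbin", "lap2"},
    "HSP 86": {"HSP 86", "lap2"},
    "Thymopoietin": {"Thymopoietin", "lap2"},
}


def expand_query(term, synonym_table):
    """Return an OR-joined query holding the term and all its synonyms."""
    wanted = term.lower()
    expanded = set()
    # Collect every name of every entity that lists the term as a name.
    for names in synonym_table.values():
        if any(name.lower() == wanted for name in names):
            expanded.update(names)
    # Keep the user's term first, then the other names in a stable order.
    others = sorted(name for name in expanded if name.lower() != wanted)
    return " OR ".join(f'"{name}"' for name in [term] + others)


print(expand_query("lap2", SYNONYM_TABLE))
```

Because lap2 is shared by three entities in this toy table, the expanded query covers all of them; a real system would presumably let the user pick the intended entity before searching.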

  13. GPSDB • Database for synonym expansion of gene and protein names • Populated by the main resources on model organisms • Contains 559’294 synonyms referring to 292’472 proteins

  14. GPSDB • Cross-reference links are used to connect database entries that refer to the same gene/protein entity, thus pointing out the problem of homonymy when it occurs (Diagram: entries from LocusLink, HUGO, Swiss-Prot and OMIM carrying names such as TWIST1, TWIST, twist, H-twist, BPES2, BPES3, SCS, ACS3, ACSL3, FACL3, LACS3, PRO2194)
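
One way to picture the cross-reference merging on this slide is a small union-find over database entries. The sketch below is an assumption-laden toy: the entry identifiers, the name sets and the cross-reference links in `ENTRIES` are invented for illustration (only the names themselves appear on the slide), and GPSDB's actual merging procedure is not documented here.

```python
from collections import defaultdict

# Toy data: entry id -> (names in that entry, cross-referenced entry ids).
# Identifiers and links are hypothetical; only the names come from the slide.
ENTRIES = {
    "HUGO:TWIST1": ({"TWIST1", "TWIST", "ACS3"}, {"OMIM:TWIST1"}),
    "OMIM:TWIST1": ({"TWIST1", "BPES2", "SCS"}, set()),
    "HUGO:ACSL3": ({"ACSL3", "FACL3", "ACS3"}, {"SP:ACSL3"}),
    "SP:ACSL3": ({"ACSL3", "LACS3", "PRO2194"}, set()),
}


def find(parent, x):
    """Find the representative of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_entries(entries):
    """Group entries linked by cross-references; return one name set per entity."""
    parent = {entry: entry for entry in entries}
    for entry, (_names, xrefs) in entries.items():
        for other in xrefs:
            parent[find(parent, entry)] = find(parent, other)
    groups = defaultdict(set)
    for entry, (names, _xrefs) in entries.items():
        groups[find(parent, entry)] |= names
    return list(groups.values())


def homonyms(entities):
    """Names that occur in more than one merged entity."""
    seen = defaultdict(int)
    for names in entities:
        for name in names:
            seen[name] += 1
    return {name for name, count in seen.items() if count > 1}


entities = merge_entries(ENTRIES)
print(len(entities), homonyms(entities))
```

In this toy data the four entries collapse into two entities, and ACS3 is flagged because it remains attached to both.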

  15. GPSDB screenshot: lap2 is a synonym of three separate protein entities: Erbin, HSP 86 and Thymopoietin

  16. GPSDB screenshot

  17. GPSDB used for query expansion • Original user query: lap2 • Query expansion based on GPSDB (screenshot)

  18. Document filtering and ranking • Interactive modules which permit a flexible selection of relevant documents for the IE process. • Algorithmic approaches • Query dependent: • Lucene Ranker: Java-based indexing engine giving a ranked output of queried documents • Query independent: • Naive Bayes Ranker: using pre-trained classification of relevant documents on specific topics
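
As a rough illustration of query-independent ranking, the sketch below trains a multinomial Naive Bayes model with add-one smoothing on a handful of labelled texts and ranks new documents by their log-odds of being relevant to a topic. The training texts, the whitespace tokenizer and all function names are assumptions made for this example; the slide does not describe the features or training data of the actual Naive Bayes Ranker.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(docs):
    """docs: list of (text, is_relevant); both classes must be present."""
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter()
    for text, label in docs:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab


def log_odds(text, model):
    """Log-odds that the text is relevant, with add-one smoothing."""
    counts, n_docs, vocab = model
    totals = {label: sum(counts[label].values()) for label in (True, False)}
    score = math.log(n_docs[True] / n_docs[False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # ignore words never seen during training
        p_rel = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_irr = (counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_rel / p_irr)
    return score


def rank(texts, model):
    """Most relevant documents first."""
    return sorted(texts, key=lambda t: log_odds(t, model), reverse=True)


# Invented training examples for the topic "Disease".
TRAINING = [
    ("mutation causes syndrome in patients", True),
    ("disease associated with gene mutation", True),
    ("crystal structure of the protein domain", False),
    ("protein binds dna in vitro", False),
]
model = train(TRAINING)
print(rank(["structure of the domain", "mutation linked to disease"], model))
```

The ranking does not depend on the user's query, only on the pre-trained topic model, which is the distinction the slide draws from the query-dependent Lucene ranking.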

  19. Document filtering and ranking Output of query dependent ranking

  20. Document filtering and ranking Output of query independent ranking with respect to topic “Disease”

  21. Information retrieval and extraction modules • GUI • IR: Query expansion, PubMed search, Document filtering/ranking, Document organisation • IE: Sentence extractor, NLP tools, Case frame generator

  22. Sentence extractor • Goal: extract sentences with information relevant for protein annotation • Method: machine learning from corpora with manually labeled sentences • Data representation: bag-of-words approach • Best results with Support Vector Machines (linear/Radial Basis Function)
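
To make the bag-of-words plus linear SVM idea concrete, here is a small dependency-free sketch. It represents each sentence as word counts and trains a linear classifier by sub-gradient descent on the regularised hinge loss (Pegasos-style updates, applied in a fixed order instead of random sampling). The training sentences, the parameters `lam` and `epochs`, and the function names are all assumptions for illustration; the slide does not say which SVM implementation or features BioMinT used, and the RBF kernel variant is not covered.

```python
from collections import Counter


def bag_of_words(sentence):
    """Word-count representation; word order is discarded."""
    return Counter(sentence.lower().split())


def train_linear_svm(examples, lam=0.1, epochs=20):
    """examples: list of (sentence, label) with label +1 or -1."""
    w = {}
    t = 0
    for _ in range(epochs):
        for sentence, y in examples:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            x = bag_of_words(sentence)
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            # Shrink all weights (effect of the L2 regulariser).
            for f in w:
                w[f] *= 1.0 - eta * lam
            # Hinge loss is active: move towards classifying x correctly.
            if margin < 1.0:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w


def score(w, sentence):
    """Positive score: predicted relevant for annotation."""
    return sum(w.get(f, 0.0) * v for f, v in bag_of_words(sentence).items())


# Invented labelled sentences: +1 relevant for annotation, -1 not relevant.
EXAMPLES = [
    ("this protein binds the receptor and regulates transcription", 1),
    ("samples were centrifuged for ten minutes", -1),
    ("the enzyme catalyzes hydrolysis of atp", 1),
    ("we thank our colleagues for helpful discussions", -1),
]
weights = train_linear_svm(EXAMPLES)
print(score(weights, "protein binds dna") > 0)
print(score(weights, "cells were centrifuged") < 0)
```

A real corpus would need proper tokenisation and many more manually labelled sentences per topic.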

  23. Sentence extractor: Sample output • set of sentences extracted from the top 5 ranked papers • query-terms are highlighted • sentences classified according to topics (function, structure, disease) • sentences linked to the PubMed abstract they originate from

  24. Case frame generator • Sentence: “A protein containing the N-terminal domain with the first transmembrane segment of MAN1 is retained in the inner nuclear membrane.” • Case frame: TARGETED_TO {X: MAN1} {Y: inner nuclear membrane}
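
The case frame on this slide can be pictured as a small data structure filled by an extraction rule. The sketch below uses a single hand-written regular expression as a stand-in for such a rule; this is an assumption for illustration only, since the presentation states that the actual rules are learned with Inductive Logic Programming over shallow-parser output. The `CaseFrame` class and `extract_targeted_to` function are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class CaseFrame:
    predicate: str
    slots: dict

    def __str__(self):
        filled = " ".join(f"{{{k}: {v}}}" for k, v in self.slots.items())
        return f"{self.predicate} {filled}"


# Hand-written stand-in for a learned rule: "<X> is retained in the <Y>".
TARGETED_TO_RULE = re.compile(r"(?P<X>\S+) is retained in the (?P<Y>[^.]+)")


def extract_targeted_to(sentence):
    """Return a TARGETED_TO case frame, or None if the rule does not fire."""
    match = TARGETED_TO_RULE.search(sentence)
    if match is None:
        return None
    return CaseFrame("TARGETED_TO", {"X": match["X"], "Y": match["Y"].strip()})


SENTENCE = (
    "A protein containing the N-terminal domain with the first transmembrane "
    "segment of MAN1 is retained in the inner nuclear membrane."
)
print(extract_targeted_to(SENTENCE))
```

A surface pattern like this is brittle; using syntactic and semantic information from a parser, as the following slides describe, is what lets rules generalise beyond one phrasing.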

  25. Case frame generator • Goal: Automatic identification of selected types of entities, relations, or events in free text • Methods: • Given a set of pre-labeled sentences, learn IE templates with Inductive Logic Programming (ILP) • Background knowledge: • Syntactic & semantic information from a shallow parser • Ontologies providing entities in a given domain • Text analysis tools • Shallow Parser (MBSP) based on Machine Learning (TiMBL) • Shallow parser adapted to the biomedical field using the Genia corpus

  26. Case frame generator: Sample output of the shallow parser • Sentence: “The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.” • Diagram labels: subject, object, object, Cell-line, DNA part • Chunks: The mouse lymphoma assay, MLA, utilizing, the TK gene, to identify, chemical mutagens

  27. Case frame generator: Sample output • Information extracted by the Case Frame Generator, which applied machine-learned IE rules to the output of the Shallow Parser

  28. Summary • The BioMinT prototype is a working, unified system for Biological Text Mining • Information Retrieval: • query expansion • doc filtering/ranking • Information extraction • Extraction of sentences on user-specified topics • Extraction of relationships between entities (Case frames) • Based on a variety of resources, technologies and expertise • Biological sciences: corpus annotation, database annotation, fingerprints, ontologies, … • Artificial intelligence: IR, machine learning (SVM, ILP, …), Natural Language Processing (Shallow Parser), Case Frames, … • Software development: databases, web-server, GUI, …

  29. Future BioMinT developments • Integration of BioMinT prototype in the future annotation environment of Swiss-Prot & PRINTS • Release Q4-2005 • Free web-based version, with restrictions on • Simultaneous users • Resources per user (computing & storage) • Customization services provided by PharmaDM • Integration into researcher’s IT environment (E-mail alerts …) • Mining in-house document collections • Combination with DMax data analysis software • Incorporation of highly specialized background knowledge (ontologies, thesauri, biological dictionaries, etc…) • Custom reports and GUI, etc…

  30. WWW • BioMinT home page: http://www.biomint.org • GPSDB synonyms database: http://biomint.oefai.at • BioMinT prototype Quick Tour: http://biomint-server.pharmadm.com:8080/xwiki/bin/view/BioMinT/ProtopQuickTour

  31. Acknowledgements (diagram labels: Artificial Intelligence, Biological Sciences) • Melanie Hilario, Jee-Hyub Kim, Walter Daelemans, Jo Meyhi, Frederik Durant, Terri Attwood, Alex Mitchell, Paul Bradley, Kurt De Grave, Fred Lefever, Walter Luyten, Kristof Van Belleghem, Andre Vandecandelaere, Johann Petrak, Alexander Seewald, Anne-Lise Veuthey, Marc Zehnder, Violaine Pillet, Swiss-Prot Curators • Interested? Demo? Leave your card at POSTER 49
