Integrating BioMedical Text Mining Services into a Distributed Workflow Environment. UK E-Science All Hands Meeting, Nottingham, September 1-3, 2004. Rob Gaizauskas, Neil Davis, George Demetriou, Yikun Guo, Ian Roberts.
Outline
• Introduction: Workflows, Web Services and Text Mining for Bioinformatics
• Two Case Studies: Graves’ Disease and Williams Syndrome
• Text Services
  • Text Collection Server
  • Text Services Workflow Server
  • Interface/Browsing Client
• Conclusions/Future Work
All Hands Meeting, Nottingham
Workflows, Web Services and Text Mining for Bioinformatics
• Workflows
  • useful computational models for processes that require repeated execution of a series of complex analytical tasks
  • E.g. a biologist researching the genetic basis of a disease repeatedly
    • maps a reactive spot in microarray data to a gene sequence
    • uses a sequence alignment tool to find proteins/DNA of similar structure
    • mines info about these homologues from remote DBs
    • annotates the unknown gene sequence with this discovered info
Workflows, Web Services and Text Mining for Bioinformatics
• Web services
  • Processing resources that
    • are available via the Internet
    • use standardised messaging formats, such as XML
    • enable communication between applications without being tied to a particular operating system/programming language
  • Useful for bioinformatics, where the data used in research is
    • heterogeneous in nature: DB records, numerical results, NL texts
    • distributed across the Internet in research institutions around the world
    • available on a variety of platforms and via non-uniform interfaces
Workflows, Web Services and Text Mining for Bioinformatics
• Text mining
  • any process of revealing information (regularities, patterns or trends) in textual data
  • includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
  • relevant to bioinformatics because of
    • explosive growth of biomedical literature
    • availability of some information in textual form only, e.g. clinical records
Workflows, Web Services and Text Mining for Bioinformatics
[Figure: diagram with the labels Workflows, Web services, Text mining, Bioinformatics]
Context
• Objective: deliver text services for the myGrid and CLEF projects
• myGrid has adopted the workflow model for delivering an e-biologist’s workbench
  • Scufl workflow specification language
  • Taverna workflow design tool
  • Freefluo workflow enactment engine
• Problem: how to integrate text mining into a biological workflow?
  • Most text mining runs off-line and supports interactive browsing of results
  • Most workflows run end to end with no user intervention
  • What are the inputs to text mining to be?
• Solution: tap off the result of a workflow step and treat it as an implicit query
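The "tap off" idea can be illustrated with a minimal Python sketch. It is not the Scufl/Freefluo mechanism: `run_workflow`, `align` and `annotate` are hypothetical names, and the steps are toy stand-ins for real services. The workflow still runs end to end, while the output of one named step is copied to a side channel that text mining could treat as an implicit query.

```python
def run_workflow(steps, data, tap_after=None, tap=None):
    """Run (name, function) steps in order.

    After the step named `tap_after`, pass its output to `tap` so that a
    side process (here, text mining) can use it without stopping the run.
    """
    for name, step in steps:
        data = step(data)
        if tap is not None and name == tap_after:
            tap(data)
    return data


# Toy stand-ins for real workflow services (all hypothetical).
def align(sequence):
    # Pretend alignment: return ids of "similar" database entries.
    return ["HIT_A", "HIT_B"]


def annotate(hits):
    # Pretend annotation: wrap the hits in a result record.
    return {"annotated_with": hits}


implicit_queries = []
result = run_workflow(
    [("align", align), ("annotate", annotate)],
    "ACGT",
    tap_after="align",
    tap=implicit_queries.append,
)
print(result)
print(implicit_queries)
```

The main workflow result is unchanged by the tap; the tapped alignment hits are simply collected for later use as a query.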
Two Case Studies in the Genetic Basis of Disease
• Graves’ Disease
  • an autoimmune condition affecting tissues in the thyroid and orbit
  • being investigated using micro-array methods
    • micro-array shows which genes are differentially expressed in normal patients vs patients with the disease = candidate genes
    • sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
    • function of these “homologues” may suggest function of candidate gene
  • key step for text mining follows BLAST search
    • for homologous proteins, the BLAST report contains references to proteins in the SWISSPROT protein database
    • Swissprot records contain ids of abstracts describing the protein in the Medline abstract database
    • abstracts can be mined directly or used as “seed” documents to assemble a set of related abstracts
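The step from a Swissprot record to Medline abstract ids can be sketched as follows. This is an illustration, not the project's code. It assumes the record is in the Swiss-Prot flat-file style, where reference cross-reference (`RX`) lines carry entries such as `PubMed=<id>;`. The sample record, accession and ids are made up, and `extract_pubmed_ids` is a hypothetical helper.

```python
import re

# Hypothetical excerpt in the style of a Swiss-Prot flat-file entry;
# the accession and ids are invented for illustration.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
AC   P00000;
RN   [1]
RX   MEDLINE=00000001; PubMed=1234567;
RN   [2]
RX   PubMed=7654321;
RN   [3]
RX   PubMed=1234567;
"""

PUBMED_RE = re.compile(r"PubMed=(\d+)")


def extract_pubmed_ids(record):
    """Return PubMed ids found on RX lines, in order, without duplicates."""
    seen = set()
    ids = []
    for line in record.splitlines():
        if not line.startswith("RX"):
            continue  # only reference cross-reference lines are of interest
        for pmid in PUBMED_RE.findall(line):
            if pmid not in seen:
                seen.add(pmid)
                ids.append(pmid)
    return ids


seed_ids = extract_pubmed_ids(SAMPLE_RECORD)
print(seed_ids)
```

The resulting ids play the role of the "seed" documents: they can be fetched directly or used to assemble a larger set of related abstracts.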
Two Case Studies in the Genetic Basis of Disease
• Williams Syndrome
  • congenital disorder resulting in mental retardation, caused by deletion of genetic material on the 7th chromosome
  • area in which deletions occur is not well characterised; better sequence info is becoming available
  • as new sequence information becomes available
    • gene finding software is run against it
    • BLAST is run against new putative genes to identify homologues whose function may be known
    • BLAST reports provide links to abstracts in the literature
Text Services Architecture
[Figure: architecture diagram. Components: User Client; Workflow Server (Initial Workflow; Workflow Enactment with the steps Extract PubMed Id, Get Related Abstracts, Cluster Abstracts); Medline Server (Get Medline Abstract; Medline pre-processed offline to extract biomedical terms + indexed). Labelled data flows: Workflow definition + parameters; Clustered PubMed Ids + titles; Swissprot/Blast record; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts]
Text Services Architecture
• A 3-way division of labour is a sensible way to deliver distributed text mining services
  • Providers of e-archives, such as Medline, will make archives available via a web-services interface
    • Cannot offer tailored services for every application
    • Will provide core, common services
  • Specialist workflow designers will add value to basic services from the archive to meet their organization’s needs
  • Users will prefer to execute predefined workflows via standard light clients such as a browser
• Architecture appropriate for many research areas, not just bioinformatics
Text Collection Server
• Text collection is Medline (www.ncbi.nlm.nih.gov/)
  • > 10 million abstracts since the 1950s
  • largest repository of biomedical abstracts
  • copies made available for research, updated annually
• Records contain semi-structured information annotated in XML
  • Unique id: PubMed id
  • Citation information: author(s), journal, year, etc.
  • Manually assigned controlled vocabulary keywords (MeSH terms)
  • Text of abstract
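To make the record structure concrete, here is a small Python sketch that reads the four kinds of field listed above from one XML record. The sample is a simplified, made-up record. Its element names are modelled on the Medline citation format, but the exact schema of the copy used in this work is an assumption, and `parse_citation` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, invented record; element names are an assumption modelled on
# the Medline citation format, not taken from the slides.
SAMPLE_XML = """\
<MedlineCitation>
  <PMID>1234567</PMID>
  <Article>
    <ArticleTitle>A hypothetical title</ArticleTitle>
    <Abstract><AbstractText>Hypothetical abstract text.</AbstractText></Abstract>
  </Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Graves Disease</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Thyroid Gland</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
"""


def parse_citation(xml_text):
    """Extract id, title, abstract text and MeSH terms from one record."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        "abstract": root.findtext("Article/Abstract/AbstractText"),
        "mesh": [d.text for d in root.iter("DescriptorName")],
    }


record = parse_citation(SAMPLE_XML)
print(record["pmid"], record["mesh"])
```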
Text Collection Server (cont)
• Local copy
  • Loaded in MySQL, indexed on various fields, e.g. MeSH terms
  • Text portion indexed for search engines (Lucene, Madcow)
  • Text pre-processed with text mining tools
    • Tokenisation
    • Terminology look-up and indexes built for term classes (proteins, genes, diseases, etc.)
    • Part-of-speech tagging
    • Term parsing
• Server accepts web service calls to, e.g.
  • Return text of abstract given a PubMed id
  • Return MeSH terms of abstracts given PubMed ids
  • Return PubMed ids of abstracts with given MeSH terms
  • Return PubMed ids of abstracts matching a free text query
  • Return PubMed ids of abstracts containing a specific term
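The kinds of call listed above can be sketched with a toy in-memory collection. This is an illustration only: the real server sits on MySQL and Lucene/Madcow indexes behind a web-service interface, whereas `TextCollection` and its method names are hypothetical, the data is invented, and free-text search is reduced to a plain word-level AND query. The call for specific term classes is left out because it needs the term annotations.

```python
import re
from collections import defaultdict


class TextCollection:
    """Toy in-memory stand-in for the abstract archive (hypothetical)."""

    def __init__(self):
        self.abstracts = {}               # pmid -> abstract text
        self.mesh_of = {}                 # pmid -> set of MeSH terms
        self.by_mesh = defaultdict(set)   # MeSH term -> pmids
        self.by_word = defaultdict(set)   # lower-cased word -> pmids

    def add(self, pmid, text, mesh_terms):
        self.abstracts[pmid] = text
        self.mesh_of[pmid] = set(mesh_terms)
        for term in mesh_terms:
            self.by_mesh[term].add(pmid)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.by_word[word].add(pmid)

    def get_abstract(self, pmid):
        return self.abstracts[pmid]

    def get_mesh_terms(self, pmids):
        return {p: sorted(self.mesh_of[p]) for p in pmids}

    def ids_with_mesh(self, terms):
        # Abstracts that carry every one of the given MeSH terms.
        sets = [self.by_mesh.get(t, set()) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []

    def search(self, query):
        # Abstracts that contain every word of the query.
        words = re.findall(r"[a-z0-9]+", query.lower())
        sets = [self.by_word.get(w, set()) for w in words]
        return sorted(set.intersection(*sets)) if sets else []


collection = TextCollection()
collection.add("1", "Thyroid receptor antibodies in patients.", ["Graves Disease", "Thyroid Gland"])
collection.add("2", "Orbit tissue changes in patients.", ["Graves Disease", "Orbit"])
print(collection.ids_with_mesh(["Graves Disease"]))
print(collection.search("thyroid patients"))
```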
Workflow Server
• Workflow server runs the Freefluo enactment engine to execute a Scufl workflow (designed using Taverna)
• Graves’ disease workflow: [figure]
Interface/Browsing Client
• Two components
  • Submit workflow for enactment
  • Explore results and launch follow-on queries
• Three types of follow-on search
  • Find other texts containing terms in the current text
  • Find texts containing a specific search string (free text search)
  • Find other texts “like” the current one (with the same MeSH terms)
• Implemented as a Java-Swing applet for easy inclusion in portals
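The third follow-on search, finding texts "like" the current one through shared MeSH terms, can be sketched as below. The slides do not say how matches are ranked. Ordering by Jaccard overlap of the MeSH term sets is an assumption made here for illustration, and `related_by_mesh` and the sample data are hypothetical.

```python
def related_by_mesh(current, mesh_of, min_shared=1):
    """Rank other abstracts by overlap of their MeSH terms with `current`.

    `mesh_of` maps a PubMed id to a set of MeSH terms. Ranking by Jaccard
    overlap (shared terms / all terms) is an assumption of this sketch.
    """
    target = mesh_of[current]
    scored = []
    for pmid, terms in mesh_of.items():
        if pmid == current:
            continue
        shared = target & terms
        if len(shared) >= min_shared:
            scored.append((len(shared) / len(target | terms), pmid))
    # Highest overlap first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pmid for _, pmid in scored]


# Invented MeSH assignments for four abstracts.
mesh_of = {
    "1": {"Graves Disease", "Thyroid Gland", "Autoimmunity"},
    "2": {"Graves Disease", "Thyroid Gland"},
    "3": {"Graves Disease", "Orbit"},
    "4": {"Williams Syndrome"},
}
print(related_by_mesh("1", mesh_of))
```

Raising `min_shared` narrows the result set to abstracts that share more terms with the current one.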
Interface/Browsing Client
[Screenshot of the client, with labelled areas: Abstract Titles, MeSH Tree, Abstract body, Search scope restrictors, Linked terms, Get Related Abstracts, Free text search]
Conclusion
• Have implemented a set of text mining web services that run in a workflow to support biologists in exploring the genetic basis of disease
• Implementation based on a generic 3-component architecture (archive server, workflow server, browser client) with wider applicability
• Basic idea is to glean an implicit query from a workflow operation (e.g. sequence alignment)
  • Find abstracts of papers related to abstracts describing homologous proteins/genes of the gene of interest
  • Cluster results and present to user
  • User can explore results and issue follow-on queries via a richly-featured graphical interface
Future Work
• Integrate in practice with the rest of the Graves’/Williams workflows in myGrid and get feedback from biologists
• Explore other interpretations of “relatedness” for abstracts in addition to MeSH terms
  • in assembling a corpus of related abstracts (e.g. vector space/language model notions of similarity)
  • in clustering results (e.g. k-means/agglomerative clustering)
• Explore other ways of deriving implicit queries from workflows, e.g. mining provenance data
• Explore further interface search filtering operations and interface design issues
• Scale up to process all of Medline for term/entity identification
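As a pointer to what a vector space notion of similarity could look like, here is a minimal sketch using raw term-frequency vectors and cosine similarity. It is an illustration of the general technique, not something implemented in the work described. The tokeniser, the absence of term weighting, the function names and the sample texts are all assumptions.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (simple lower-cased tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented abstract snippets.
seed = tf_vector("thyroid autoimmune antibodies in Graves disease")
close = tf_vector("autoimmune thyroid disease and receptor antibodies")
far = tf_vector("chromosome deletion in a congenital disorder")

print(cosine(seed, close) > cosine(seed, far))
```

Pairwise similarities of this kind could also feed a clustering step such as k-means or agglomerative clustering.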