Service-oriented architecture for integration of bioinformatic data and applications. Xiaorong Xiang Department of Computer Science and Engineering University of Notre Dame. Contributions. Survey of research issues and challenges in service-oriented computing (Chapter 2)
Service-oriented architecture for integration of bioinformatic data and applications
Xiaorong Xiang
Department of Computer Science and Engineering, University of Notre Dame
Ph.D defense
Contributions
• Surveyed research issues and challenges in service-oriented computing (Chapter 2)
• Built an SOA-based system to support bioinformatics research (Chapter 3)
• Explored the deep phylogeny of the plastid with the system (Chapter 4)
• Enhanced the system with semantic web technology and a novel approach to reusing workflows (Chapters 5 & 6)
Outline
• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse
SOA: an architectural style of distributed computing
• Why SOA
  • Reusability
  • Interoperability
  • Security
  • Maintenance
  • Cost savings when integrating applications
• Adoption of SOA
  • e-Business
  • e-Science
  • e-Government
[Figure: service requester, service broker, and service provider, with numbered steps 1-5; labels: publish interface, discovery, invoke]
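The publish, discover, and invoke roles on this slide can be illustrated with a minimal in-memory sketch. It is not a web service stack: `ServiceBroker`, the interface name `sequence.reverse_complement`, and the toy service are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List


class ServiceBroker:
    """Toy in-memory broker: providers publish, requesters discover."""

    def __init__(self) -> None:
        self._registry: Dict[str, List[Callable[[str], str]]] = {}

    def publish(self, interface: str, endpoint: Callable[[str], str]) -> None:
        # A provider registers an endpoint under an interface name.
        self._registry.setdefault(interface, []).append(endpoint)

    def discover(self, interface: str) -> List[Callable[[str], str]]:
        # A requester asks the broker which endpoints offer the interface.
        return list(self._registry.get(interface, []))


def reverse_complement(seq: str) -> str:
    """Hypothetical provider-side service."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


broker = ServiceBroker()
broker.publish("sequence.reverse_complement", reverse_complement)  # publish
service = broker.discover("sequence.reverse_complement")[0]        # discover
result = service("ATGC")                                           # invoke
```

The requester depends only on the interface name held by the broker, which is what lets a provider be replaced without changing the requester.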
Web services: one realization of SOA
Standards stack, from top to bottom:
• Additional WS-* standards: transactions, management, security, …
• Business process execution: BPEL4WS, WFML, WSFL, BizTalk, …
• Service publishing & discovery: UDDI (Universal Description, Discovery and Integration)
• Service description: WSDL (Web Services Description Language)
• Service communication: SOAP (Simple Object Access Protocol)
• Meta language: XML
• Network transport protocols: TCP/IP, HTTP, SMTP, FTP, etc.
SOA research orientations
[Figure: three numbered orientations; labels: Semantic Web Service, Semantic Grid, Semantic Grid Service, Open Grid Services Architecture (OGSA)]
• P2P technology plays an important role in increasing the scalability and reliability of service discovery and workflow execution
Bioinformatics today
• Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
• Non-standard, independently developed, heterogeneous data sources
• Data sharing, data integration, and security
Source: Folker Meyer, “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade”, CTWatch Quarterly, vol. 2, no. 3, August 2006
SOA in Bioinformatics
[Figure: large public databases and others: recent exposure of data & analysis tools as services; middleware projects and others: provide infrastructure to compose, manage, execute, and connect the distributed services]
More is needed:
• Community efforts to provide more shared and reliable services
• More demonstration projects => best practices, measured utility, feedback to middleware projects, etc.
Outline
• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse
Mother of Green (MoG) project
• Biological science
  • In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
  • Study the deep phylogeny of the plastid
• Computer science
  • Provide an environment to support scientists’ investigations
  • A case study of using SOA for data and application integration
  • A prototype for future research in the service-oriented architecture domain
MoG project: one motivation
• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum (P. falciparum) causes human malaria
• Targeted drug design through phylogenomics
  • P. falciparum has three genomes: nuclear, mitochondrial, plastid (apicoplast)
  • Find the ancestors of the apicoplast to better understand the evolution of the plastid
  • Identify genes in the ancestors
  • Determine gene function
[Figures: P. falciparum; apicoplast in P. falciparum]
A typical in-silico investigation: data-driven research workflow
A: Query complete genome sequences given a taxon
B: Query protein-coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis
Challenges (time-consuming manual web-based operations)
• Data collection and information gathering
  • Rapid accumulation of raw sequence information
  • Rate of accumulation is increasing
  • Information accumulates faster than analyses finish
  • Information in forms not readily accessible
• Analysis tool usage
• Experimental data recording
• Repetitive experiments for scientific discovery
MoGServ System Architecture
[Figure: layered architecture diagram]
• Access: web interface, client applications, services access
• MoGServ middle layer (application server): data access services, data analysis services, job manager, job launcher, service/workflow registry, metadata search, local data storage, workflow/SOAP engines
• Data/services providers: NCBI, DDBJ, EMBL, others
Data storage and access services
• Local database
  • Integrates data from multiple data sources according to scientists’ interests
  • Supports repetitive investigations against several subsets of sequences
  • Avoids network traffic and service failures when retrieving data on the fly from public data sources
• Data in the local database is accessed through services
Service and workflow registry
• A table-based description with the necessary properties
  • Text description
  • Service location
  • Input/output
  • Provider
  • Version
  • Algorithm
  • Invocation method
• Not intended to support service discovery or composition at the current stage
• A repository of services and workflows for local application developers
Indexing and querying metadata
• Metadata
  • Service and workflow descriptions
  • Descriptions of sequence data, used to track the origin of the data
  • Experimental output, input, and intermediate data
• Indexing and querying by keyword
  • Lucene
  • Implemented as services
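Keyword indexing of metadata can be pictured with a small inverted index. This is a plain-Python sketch of the general idea, not the Lucene API and not the MoGServ implementation; the document identifiers and texts are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class KeywordIndex:
    """Minimal inverted index: token -> set of document ids."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> List[str]:
        # Lowercase and split on non-alphanumeric characters (no stemming).
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id: str, text: str) -> None:
        for token in self._tokens(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> List[str]:
        # Return ids of documents that contain every query token.
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = KeywordIndex()
index.add("svc:clustalw", "ClustalW multiple sequence alignment service")
index.add("wf:plastid", "Workflow aligning plastid gene sequences")
alignment_hits = index.search("alignment")
plastid_hits = index.search("plastid gene")
```

Because matching is purely syntactic, the query "alignment" does not find the workflow described with "aligning"; that gap is the kind of limitation a semantic description is meant to address.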
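Service and workflow enactment
[Figure: enactment flow between the job manager, job launcher, service/workflow registry, and workflow/service engines]
• Input: task name, parameters, timer
• Find the service/workflow definition in the service/workflow registry using the task name
• Form a job description
• Job launcher starts instances of the workflow/service engines
• Job manager keeps the job information
• Output: job ID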
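The enactment flow (look up the definition by task name, form a job description, launch it, return a job ID) could look roughly like the sketch below. The class and field names are hypothetical, the registry is a plain dictionary rather than a database table, and the "engine" is just a Python callable.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RegistryEntry:
    """A few of the registry properties listed in the slides."""
    description: str
    location: str
    provider: str
    version: str
    invoke: Callable[[Dict[str, str]], str]  # stand-in for the invocation method


@dataclass
class Job:
    job_id: int
    task_name: str
    parameters: Dict[str, str]
    status: str = "PENDING"
    output: str = ""


class JobManager:
    def __init__(self, registry: Dict[str, RegistryEntry]) -> None:
        self.registry = registry
        self.jobs: Dict[int, Job] = {}
        self._ids = itertools.count(1)

    def submit(self, task_name: str, parameters: Dict[str, str]) -> int:
        # Find the definition by task name, then form a job description.
        if task_name not in self.registry:
            raise KeyError(f"unknown task: {task_name}")
        job = Job(next(self._ids), task_name, dict(parameters))
        self.jobs[job.job_id] = job
        return job.job_id

    def launch(self, job_id: int) -> Job:
        # Run the job and record its status for later inspection.
        job = self.jobs[job_id]
        entry = self.registry[job.task_name]
        try:
            job.output = entry.invoke(job.parameters)
            job.status = "DONE"
        except Exception as exc:
            job.status = "FAILED"
            job.output = str(exc)
        return job


registry = {
    "align": RegistryEntry(
        "Toy alignment task", "local://align", "example-provider", "0.1",
        lambda params: f"aligned set {params['setid']}",
    )
}
manager = JobManager(registry)
job_id = manager.submit("align", {"setid": "42"})
job = manager.launch(job_id)
```

Returning the job ID at submission time, before any result exists, is what allows a client to poll for job information later.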
Implementation
• Development and deployment
  • J2EE, JSP, XSLT
  • Tomcat 5.0.18 / Axis 1_2RC2
• Database
  • PostgreSQL 8.1
• Index and search of metadata
  • Apache Lucene library
• Service implementation
  • Java2WSDL
  • Wrap command-line applications with the JLaunch library
• Workflow
  • Taverna workbench, part of the myGrid project
  • Freefluo workflow engine
Taverna workbench
A more complex workflow
Issues with the first prototype
• Metadata description
  • Solution
    • Index-based (syntactic keyword search)
    • Captures most properties needed to support end-user requirements
    • Supports data provenance
  • Limitation
    • Similar to most services in the bioinformatics community
    • Lacks semantic description (goal => semantic search)
• Failure tolerance and recovery
  • Solution
    • Statically encode alternative services in the workflow to guard against service failure
    • Record the status of service and workflow execution in the database for a possible recovery strategy
    • Deploy multiple workflow engines to guard against hardware or network failure
  • Limitation
    • No dynamic service selection at execution time (needs more semantic description support)
    • No fine-grained resource management and monitoring
• Security
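Statically encoded alternatives with status recording can be sketched as an ordered list of callables tried in turn. This is an illustrative sketch only: the service names and the in-memory status log are assumptions, standing in for real endpoints and the database mentioned above.

```python
from typing import Callable, List, Sequence, Tuple

StatusLog = List[Tuple[str, str, str]]  # (service name, status, error message)


def invoke_with_alternatives(
    services: Sequence[Tuple[str, Callable[[str], str]]],
    payload: str,
    status_log: StatusLog,
) -> str:
    """Try each statically listed service in order and record every outcome."""
    for name, call in services:
        try:
            result = call(payload)
        except Exception as exc:
            status_log.append((name, "FAILED", str(exc)))
            continue
        status_log.append((name, "OK", ""))
        return result
    raise RuntimeError("all alternative services failed")


def primary_blast(seq: str) -> str:
    # Hypothetical primary endpoint that is currently down.
    raise ConnectionError("primary endpoint unreachable")


def mirror_blast(seq: str) -> str:
    # Hypothetical alternative endpoint.
    return f"hits for {seq}"


status_log: StatusLog = []
result = invoke_with_alternatives(
    [("blast-primary", primary_blast), ("blast-mirror", mirror_blast)],
    "ATGC",
    status_log,
)
```

The list of alternatives is fixed when the workflow is written, which is exactly the limitation noted above: nothing here selects a service dynamically at execution time.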
Outline
• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse
Semantic web
• Semantic web vision
  • Giving meaning (semantics) to web-based information
  • Machine-understandable, so that software agents can process it autonomously
• Two standards: OWL & RDF
  • The Web Ontology Language (OWL)
    • Defines common vocabularies for specifying concepts and the relationships among concepts
  • Resource Description Framework (RDF)
    • Formal format for encoding web content using defined vocabularies
• Semantic web for bioinformatics
  • UniProt RDF project
• Semantic web for SOA
  • Automated service discovery and composition
Resource Description Framework (RDF)
• A graph model of statements: a set of triples, Predicate (Subject, Object)
• Representations: RDF/XML, N-Triples, Turtle
• A standard format to connect web information
• URIs provide the definitions of the vocabularies
[Figure: example RDF graph with resource and literal nodes. Node labels: http://www.nd.edu/~mog, “MoG is a … project”, #bioinformatics, #foundation, #gmadey, “Gregory Madey”, #professor, http://www.nd.edu/~gmadey. Edge labels: #hasTextDescription, #hasResearchTopic, #hasFundedBy, #hasCreator, #hasFullName, #hasTitle, #hasPersonalSite]
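The triple model can be made concrete with a few Python tuples. The node and edge labels come from the slide's example graph, but which edge connects which pair of nodes is a reading of the flattened figure and should be treated as an assumption; the `match` helper is hypothetical and is not an RDF library call.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

MOG = "http://www.nd.edu/~mog"

# Assumed reading of the example graph on the slide.
triples: List[Triple] = [
    (MOG, "#hasTextDescription", "MoG is a … project"),
    (MOG, "#hasResearchTopic", "#bioinformatics"),
    (MOG, "#hasFundedBy", "#foundation"),
    (MOG, "#hasCreator", "#gmadey"),
    ("#gmadey", "#hasFullName", "Gregory Madey"),
    ("#gmadey", "#hasTitle", "#professor"),
    ("#gmadey", "#hasPersonalSite", "http://www.nd.edu/~gmadey"),
]


def match(
    store: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Follow two edges: project -> creator -> full name.
creator = match(triples, s=MOG, p="#hasCreator")[0][2]
full_name = match(triples, s=creator, p="#hasFullName")[0][2]
```

Chaining pattern matches like this is the basic operation that RDF query languages express declaratively.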
Ontological modules used for semantic description of data, services & workflows
• MoGServ application domain ontology (MoGServ)
• Generic service description ontology (myGrid/Feta model)
• Service domain ontology (myGrid)
[Figure: software components for annotation use the ontologies to annotate data, services, and workflows; annotations are kept in an RDF store]
MoGServ Application Domain Ontology
[Figure: example concepts and properties defined in MoGServ]
• To better track the origin of data
• To support the automation of workflow creation
• To better share the data on the web in the future
Sample data annotation: metadata from the MoG local database
[Figure: annotation graph displayed by RDF-Gravity]
Sample service/workflow annotation
[Figure: annotation graph displayed by RDF-Gravity]
• Question: Which service has an operation that accepts nucleotide_sequence as a parameter?
• Answer: Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi, OperationName: Run
Implementation of annotation and query components for data, services & workflows
• Annotation components: annotation templates (service), annotation templates (data)
• Sesame RDF store
  • Sesame 1.2.6 library
  • Supports files, RDBMS, SeRQL
• Query components: query templates
Example SeRQL query:
  select Y, W, X
  from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
  using namespace
    rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
    mg = <http://www.mygrid.org.uk/ontology#>,
    mog = <http://almond.cse.nd.edu:10000/mog#>
Result:
  Service: http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl
  Operation: runClustalWdf
  inputParameter: setid
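The path expression in the SeRQL query (service, then operation, then input parameter, then its type) amounts to a three-way join over triples. The sketch below performs that join in plain Python over a tiny in-memory store; it does not use Sesame, and the short identifiers such as `svc:ClustalW` are illustrative stand-ins for the real URIs.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Illustrative annotations; identifiers are shortened stand-ins for URIs.
store: Set[Triple] = {
    ("svc:ClustalW", "mg:hasOperation", "op:runClustalWdf"),
    ("op:runClustalWdf", "mg:inputParameter", "param:setid"),
    ("param:setid", "rdf:type", "mog:set"),
    ("svc:QueryLocal", "mg:hasOperation", "op:createSet"),
    ("op:createSet", "mg:inputParameter", "param:settype"),
    ("param:settype", "rdf:type", "mog:gene"),
}


def operations_accepting(triples: Set[Triple], semantic_type: str) -> List[Triple]:
    """Join {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {type}."""
    rows = []
    for y, p1, w in triples:
        if p1 != "mg:hasOperation":
            continue
        for w2, p2, x in triples:
            if w2 != w or p2 != "mg:inputParameter":
                continue
            if (x, "rdf:type", semantic_type) in triples:
                rows.append((y, w, x))
    return sorted(rows)


rows = operations_accepting(store, "mog:set")
```

An RDF store evaluates the same join with indexes rather than nested loops, which matters once the number of annotated services grows.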
Limitations
• The MoGServ ontology is not complete
  • Contains a small portion of the concepts needed for tracking data provenance
• The service domain ontology is not complete
  • Needs more concepts as more services are published
• Challenges of using the semantic web in general
  • Ontology creation is never complete
  • Accuracy and efficiency of data and service annotation
  • Ontology integration
Outline
• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse
Three user-defined workflows from different views
Question: “Are gene genealogies for ATP subunits α, β, and γ different?”
• Workflow A: defined by a less experienced user using the functional definition of services
• Workflow B: defined by an intermediate user with executable services
• Workflow C: defined by an expert user with two extra executable services to ensure the accurate output of the biological process
[Figure: the three workflows; node labels: Retrieving, Aligning, queryGene, setIds, setFilter, clustalW]
Limitations of current workflow management systems
• Existing workflow management systems and bioinformatics middleware
  • Taverna, Kepler, Triana, Pegasus
  • Design, execute, monitor, re-run
  • Support ad hoc, semi-automated, and automated service discovery and composition from scratch
• Our approach: reuse verified knowledge and workflows
  • Increases correctness over time
  • Provides more accurate guidelines
Enhanced workflow system
[Figure: architecture of the enhanced workflow system]
• Actors and inputs: user, ontology, workflow composer (software agent/experienced users), service annotator
• Components: knowledge base management, knowledge discovery, data provenance management, semantics-enabled service registry, semantics-enabled service discovery, service matchmaking, DL reasoner, workflow execution engine
• Artifacts: abstract workflow, concrete workflow
• Captions: create abstract workflow using ontology; annotate services using ontology; find appropriate service; collect and manage information about data origination
Our hierarchical workflow structure
• Abstract workflow (Task A, Task B)
  • Encode and convert the high-level definition to a low-level executable
• Concrete workflow (Service A, Service B, Service C, Service D)
  • Replace individual services with their optimal alternatives
• Optimal workflow (Service A, Service B, Service C’, Service D)
  • Invoke a workflow with specific input data and record the data provenance and the performance of services and workflows
• Workflow instance (input, Service A, Service B, Service C’, Service D, output)
[Figure: the Pegasus workflow structure shown alongside our hierarchical workflow structure]
Reusable knowledge
• Connectivity
  • Helps to convert an abstract workflow into a concrete workflow
• Alternatives and quality-of-service profiles
  • Helps to convert a concrete workflow into an optimal workflow
• Mapping between abstract and concrete workflows
  • Helps to choose reusable workflows
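The two conversions named here, abstract to concrete and concrete to optimal, can be sketched as two small lookups. Everything in the example data is hypothetical: the task-to-service mapping, the alternative `ClustalW-mirror`, and the success rates standing in for quality-of-service profiles.

```python
from typing import Dict, List


def to_concrete(abstract_tasks: List[str], task_to_service: Dict[str, str]) -> List[str]:
    """Map each abstract task to an executable service."""
    return [task_to_service[task] for task in abstract_tasks]


def to_optimal(
    concrete: List[str],
    alternatives: Dict[str, List[str]],
    success_rate: Dict[str, float],
) -> List[str]:
    """Replace each service with its best-scoring alternative, if any."""
    optimal = []
    for service in concrete:
        candidates = [service] + alternatives.get(service, [])
        optimal.append(max(candidates, key=lambda name: success_rate.get(name, 0.0)))
    return optimal


# Hypothetical knowledge base contents.
abstract = ["mygrid:retrieving", "mygrid:aligning", "mygrid:translating"]
task_to_service = {
    "mygrid:retrieving": "QueryLocal",
    "mygrid:aligning": "ClustalW",
    "mygrid:translating": "FormatConversion",
}
alternatives = {"ClustalW": ["ClustalW-mirror"]}
success_rate = {
    "QueryLocal": 0.99,
    "ClustalW": 0.80,
    "ClustalW-mirror": 0.95,
    "FormatConversion": 0.97,
}

concrete = to_concrete(abstract, task_to_service)
optimal = to_optimal(concrete, alternatives, success_rate)
```

Keeping the three levels as separate lists means a verified concrete workflow can be reused even when the preferred alternative for one of its services changes.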
Connectivity identification (match detection)
Parameters are written as name(data type, semantic type).
• Service: QueryLocal
  • Operation: createSet
  • performTask: mygrid:retrieving
  • inputPara: Settype(String, mog:gene), Queryterm(String, null)
  • outputPara: Setid(string, mog:geneset)
  • useResource: MoG
• Service: ClustalW
  • Operation: runClustalWdf
  • performTask: mygrid:aligning
  • inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
  • outputPara: filen(string, mygrid:sequence_alignment_report)
  • useResource: EBI
• Service: FormatConversion
  • Operation: convert
  • performTask: mygrid:translating
  • inputPara: filen(String, mygrid:sequence_alignment_report)
  • outputPara: Out(String, mygrid:nexus_paup_format)
  • useResource: MoG
Matching rule: $\text{operation}_{ij} \rightarrow \text{operation}_{mn}$ if there exist an output parameter $\text{parameter}_k$ of $\text{operation}_{ij}$ and an input parameter $\text{parameter}_o$ of $\text{operation}_{mn}$ such that
$$\text{data type}(\text{parameter}_o) = \text{data type}(\text{parameter}_k) \quad \text{and} \quad \text{semantic type}(\text{parameter}_o) = \text{semantic type}(\text{parameter}_k)$$
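The matching rule can be checked mechanically against the three annotated services. The sketch below applies the rule as strict equality, with two labeled assumptions that the slide does not state: data type names are compared case-insensitively (the annotations mix "String" and "string"), and a null semantic type never matches.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def params_match(out_p: Parameter, in_p: Parameter) -> bool:
    # Assumption: a missing (null) semantic type cannot produce a match.
    if out_p.semantic_type is None or in_p.semantic_type is None:
        return False
    # Assumption: data type names are compared case-insensitively.
    return (
        out_p.data_type.lower() == in_p.data_type.lower()
        and out_p.semantic_type == in_p.semantic_type
    )


def connects(a: Operation, b: Operation) -> bool:
    """True if some output of a matches some input of b."""
    return any(params_match(o, i) for o in a.outputs for i in b.inputs)


create_set = Operation(
    "QueryLocal", "createSet",
    (Parameter("Settype", "String", "mog:gene"), Parameter("Queryterm", "String", None)),
    (Parameter("Setid", "string", "mog:geneset"),),
)
run_clustalw = Operation(
    "ClustalW", "runClustalWdf",
    (Parameter("Setid", "String", "mog:set"), Parameter("Sequencetype", "String", "mog:sequence")),
    (Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    (Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    (Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

clustalw_to_convert = connects(run_clustalw, convert)
createset_to_clustalw = connects(create_set, run_clustalw)
```

Under strict equality, `runClustalWdf` connects to `convert`, but `createSet` does not connect to `runClustalWdf`, because `mog:geneset` and `mog:set` are different concepts. Accepting that pair would require treating one concept as a subclass of the other, which the rule as written does not cover.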
Need for verified service connectivity: the mismatching problem
[Figure: match detection output (yes/no) compared with the real match (yes/no); cells labelled TN and FP are marked]
• Causes on both sides: accurate annotation, inaccurate annotation, lack of semantic annotation, inaccurate reasoning
• Some mismatches can be detected automatically; others may be detected only by experts at design time or after a run
• Examples shown in the figure
  • Blastp (in: protein sequence), GenBankService (out: GenBank record)
  • NCBI blast (in: sequence data record), DDBJ-XML (out: sequence data record); self-defined format, FASTA format
• Mediator, adaptor, shim
Connectivity Graph Implementation
• Registration process
  • Automatically identify the connectivity: identifyConnect(single service, RDF repository)
  • Store the connectivity in the knowledge base: connect(service_a, operation_ai, parameter_c, service_b, operation_bi, parameter_d)
• Workflow translation / service composition process
  • Search at the syntactic level: search for a path between two nodes, search for the next available service, automatic composition based on input and output
  • Refine, update, decompose the workflow
• Implementation: Dijkstra’s shortest-path algorithm
[Figure: registry, knowledge base, and connectivity graph]
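A path search over stored `connect(...)` facts can be sketched with Dijkstra's algorithm. The facts below are illustrative: the `SetFilter` operation name `filter` is hypothetical, the QueryLocal-to-ClustalW link is assumed to have been verified, and every edge gets unit weight, which is an assumption since the slides do not say how edges are weighted.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# connect(service_a, operation_a, parameter_c, service_b, operation_b, parameter_d)
Fact = Tuple[str, str, str, str, str, str]

facts: List[Fact] = [
    ("QueryLocal", "createSet", "Setid", "ClustalW", "runClustalWdf", "Setid"),
    ("QueryLocal", "createSet", "Setid", "SetFilter", "filter", "Setid"),
    ("SetFilter", "filter", "Setid", "ClustalW", "runClustalWdf", "Setid"),
    ("ClustalW", "runClustalWdf", "filen", "FormatConversion", "convert", "filen"),
]


def build_graph(connect_facts: List[Fact]) -> Dict[str, List[Tuple[str, int]]]:
    """Nodes are 'service.operation'; every edge has weight 1 (assumption)."""
    graph: Dict[str, List[Tuple[str, int]]] = {}
    for sa, oa, _pc, sb, ob, _pd in connect_facts:
        graph.setdefault(f"{sa}.{oa}", []).append((f"{sb}.{ob}", 1))
    return graph


def shortest_path(
    graph: Dict[str, List[Tuple[str, int]]], source: str, target: str
) -> Optional[List[str]]:
    """Dijkstra's algorithm with a binary heap; returns None if unreachable."""
    dist = {source: 0}
    prev: Dict[str, str] = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


graph = build_graph(facts)
path = shortest_path(graph, "QueryLocal.createSet", "FormatConversion.convert")
```

With unit weights the result is the composition with the fewest services; weighting edges by a quality-of-service measure would instead favor more reliable chains.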
Experiment
• Used 418 concepts from the domain ontology for semantic types; defined 10 concepts for data types
• Randomly generated service annotations, each with 1 input and 1 output
• Connectivity graph of 1000 services [figure]
• Intel Pentium Mobile, 1.5 GHz
• Results: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2
• Conclusion: a feasible solution
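A synthetic registry with the same parameters (1000 services, 418 semantic types, 10 data types, one input and one output per service) could be generated as sketched below. This is not the author's experimental code: the generator, the seed, the exclusion of self-loops, and the hop count used here are assumptions, and the sketch makes no attempt to reproduce the reported length counts.

```python
import random
from collections import deque
from typing import Dict, List, Tuple

TypePair = Tuple[int, int]            # (data type id, semantic type id)
Service = Tuple[TypePair, TypePair]   # (input, output)


def random_services(n: int, n_semantic: int, n_data: int, seed: int) -> List[Service]:
    """Generate n services, each with one random input and one random output."""
    rng = random.Random(seed)

    def rand_type() -> TypePair:
        return (rng.randrange(n_data), rng.randrange(n_semantic))

    return [(rand_type(), rand_type()) for _ in range(n)]


def build_connectivity(services: List[Service]) -> Dict[int, List[int]]:
    """Edge i -> j when the output of i equals the input of j (self-loops excluded)."""
    by_input: Dict[TypePair, List[int]] = {}
    for idx, (inp, _out) in enumerate(services):
        by_input.setdefault(inp, []).append(idx)
    return {
        idx: [j for j in by_input.get(out, []) if j != idx]
        for idx, (_inp, out) in enumerate(services)
    }


def hops_from(graph: Dict[int, List[int]], start: int) -> Dict[int, int]:
    """Breadth-first search: number of hops from start to each reachable service."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


services = random_services(n=1000, n_semantic=418, n_data=10, seed=0)
graph = build_connectivity(services)
edge_count = sum(len(targets) for targets in graph.values())
hops = hops_from(graph, 0)
```

Indexing services by their input type keeps graph construction close to linear in the number of services, rather than comparing every pair.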
Reuse of workflows
• Reuse of abstract workflows
• Reuse of concrete workflows
• Compare the structural similarity of two workflows
• Implementation: SUBDUE algorithm
SUBDUE input format:
  v 1 input
  v 2 output
  v 3 task
  v 4 task
  v 5 query_term
  v 6 retrieving
  v 7 aligning
  v 8 multiple_aligning_report
  e 3 4 hasNext
  e 3 1 hasInput
  e 4 2 hasOutput
  e 3 6 performTask
  e 4 7 performTask
  e 1 5 hasParameter
  e 2 8 hasParameter
[Figure: graph view of the same workflow; node labels: input, output, task, query_term, retrieving, aligning, multiple_alignment_report; edge labels: hasInput, hasOutput, hasNext, performTask, hasParameter]
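The vertex and edge listing can be parsed and compared with a few lines of Python. The comparison below is deliberately not SUBDUE: it swaps in a much simpler measure, the Jaccard similarity of labeled edges, only to illustrate what "structural similarity of two workflows" might mean. The second workflow is invented for the example.

```python
from typing import Dict, List, Set, Tuple

LabeledEdge = Tuple[str, str, str]  # (source label, edge label, target label)

WORKFLOW_A = """
v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_aligning_report
e 3 4 hasNext
e 3 1 hasInput
e 4 2 hasOutput
e 3 6 performTask
e 4 7 performTask
e 1 5 hasParameter
e 2 8 hasParameter
"""

# Hypothetical second workflow: translating instead of aligning.
WORKFLOW_B = WORKFLOW_A.replace("aligning", "translating").replace(
    "multiple_translating_report", "nexus_paup_format"
)


def labeled_edges(text: str) -> Set[LabeledEdge]:
    """Parse 'v id label' and 'e src dst label' lines into labeled edges."""
    labels: Dict[int, str] = {}
    edges: List[Tuple[int, int, str]] = []
    for line in text.strip().splitlines():
        parts = line.split()
        if parts[0] == "v":
            labels[int(parts[1])] = parts[2]
        elif parts[0] == "e":
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return {(labels[src], label, labels[dst]) for src, dst, label in edges}


def jaccard(a: Set[LabeledEdge], b: Set[LabeledEdge]) -> float:
    """Shared labeled edges divided by all labeled edges."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


edges_a = labeled_edges(WORKFLOW_A)
edges_b = labeled_edges(WORKFLOW_B)
similarity = jaccard(edges_a, edges_b)
```

Comparing by labels ignores vertex identity, so two distinct `task` vertices collapse into one label; a substructure-discovery algorithm such as SUBDUE works on the graph itself and does not have that limitation.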
Pro and Con
• Pro
  • Increases the correctness of the formed workflow over time
    • Avoids incorrect or inaccurate semantic annotations
    • Takes advantage of verified knowledge
    • Avoids the ontological reasoning process
  • Better support for semi-automated and automated service composition over time
  • Provides more accurate guidelines to users over time
• Con
  • The connectivity graph can be big
    • Number of parameters
    • Number of services
  • Searching for the connectivity of a service when it is registered in the system may take a relatively long time
  • More complex matching rule
    • Number of parameters
  • May not have high accuracy at the beginning
Summary
• Described the design and implementation of MoGServ
• Explored the ontological representation of data and services
• Described a new approach for the reuse of workflows and the connectivity of services
Future work
• Integrate GridSAM into MoGServ for execution and monitoring
• Integrate Grid computing technology for resource allocation
• Refine the MoGServ application domain ontology
• Create an interface for end-user workflow creation
• Create an interface for individual workspaces
• Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services
Acknowledgements
• Dr. Madey
• Dr. Romero-Severson
• Dr. Flynn
• Dr. Striegel
• Dr. Chaudhary
• Dr. Collins
• Mr. Eric Morgan
• Dr. Jean-Christophe Ducom
Partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century Fund
Publications
• X. Xiang, G. Madey, and J. Romero-Severson, “A Service-oriented Data Integration and Analysis Environment for In-Silico Experiments and Bioinformatics Research”, Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 3-6, 2007, Computer Society Press.
• X. Xiang and G. Madey, “A Semantic Web Services Enabled Web Portal Architecture”, IEEE International Conference on Web Services (ICWS 2004), San Diego, July 2004.
• X. Xiang and G. Madey, “Improving the reuse of scientific workflows and their by-products”, International Conference on Web Services (ICWS 2007). Under review.
• X. Xiang and E. L. Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services”, D-Lib Magazine, vol. 11, no. 10, October 2005.
Publications planned
• One journal paper for BMC Bioinformatics
  • Chapters 3, 4, and 5
• Future IEEE ICWS proceedings
  • Chapter 6
• Biology journal (TBD)
  • Results from using MoGServ
Thank you