
Data integration, web services and workflow management

Paolo Romano, National Cancer Research Institute, Genova (paolo.romano@istge.it). Summary: information and data integration; Web Services; CABRI and TP53 databases; implementation of Web Services (Soaplab); workflow management.


Presentation Transcript


  1. Data integration, web services and workflow management. Paolo Romano, National Cancer Research Institute, Genova (paolo.romano@istge.it). P. Romano, Tutorial BITS2005

  2. Summary • Information and data integration • Web Services • CABRI and TP53 databases • Implementation of Web Services (Soaplab) • Workflow management • Demo: execution of workflows with Taverna

  3. Information in biology • Biomedical research produces an increasing quantity of new information • Some domains, like genomics and proteomics, contribute to huge databases • Emerging domains (mutation and variation analysis, polymorphisms, metabolism) and technologies (e.g., microarrays) will contribute even larger amounts of data

  4. Information in biology • EMBL Data Library 74 (Mar 2003): sequences 23,234,788; bases 30,356,786,718 • EMBL Data Library 81 (Dec 2004): sequences 40,696,839; bases 44,285,259,441; WGS sequences 5,408,558; WGS bases 34,986,041,399 • EMBL Data Library 82 (Mar 2005): sequences 43,246,005; bases 46,927,070,905; WGS sequences 6,228,397; WGS bases 38,207,643,477 • Growth in number of sequences (WGS included): 7.3% vs release 81 (3 months), 112.9% vs release 74 (24 months)
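The growth percentages on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the percentages refer to the number of sequence entries with WGS entries included, and that release 74 had no separate WGS figure (it is treated as zero because the slide lists none). Under those assumptions the computation reproduces the slide's 7.3% and 112.9%.

```python
# Sequence counts copied from the release figures on the slide.
# Assumption: release 74 lists no WGS entries, so its WGS count is taken as 0.
releases = {
    74: {"sequences": 23_234_788, "wgs_sequences": 0},          # Mar 2003
    81: {"sequences": 40_696_839, "wgs_sequences": 5_408_558},  # Dec 2004
    82: {"sequences": 43_246_005, "wgs_sequences": 6_228_397},  # Mar 2005
}


def total_sequences(release: int) -> int:
    """Standard plus WGS sequence entries for one release."""
    r = releases[release]
    return r["sequences"] + r["wgs_sequences"]


def growth_percent(old: int, new: int) -> float:
    """Relative growth in total sequence entries between two releases."""
    before, after = total_sequences(old), total_sequences(new)
    return (after - before) / before * 100


growth_vs_81 = round(growth_percent(81, 82), 1)
growth_vs_74 = round(growth_percent(74, 82), 1)
print(f"vs release 81: {growth_vs_81}%  vs release 74: {growth_vs_74}%")
```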

  5. Heterogeneity of databanks • Only a few databanks are managed in an almost homogeneous way by EBI, NCBI, DDBJ (sequence) • Many databanks are created by small groups or single researchers • Secondary databases are of high quality (good and extended annotation, quality control) • Many databases are highly specialized, e.g. by gene, organism, disease, mutation, etc. • Databanks are distributed: different DBMS, data structures, information, semantics, distribution methods

  6. Software • Specialist software is essential for almost all analyses in molecular biology: • Sequence analysis, secondary and tertiary protein structure prediction, gene prediction, molecular evolution, etc. • Software must interoperate with databases • Databases as input for software • Results as new data to record and analyze

  7. Goals of the integration • Integration is needed in order to: • Achieve a better and wider view of all available information • Automatically carry out analyses and/or searches involving multiple databases and software tools • Perform analyses involving large data sets • Carry out real data mining

  8. Integration longevity • Integration needs stability • Standardization • Good domain knowledge • Well defined data • Well defined goals • Integration fears: • Heterogeneity of data and systems • Uncertain domain knowledge • Fast evolution of data • Highly specialized data • Lack of predefined, clear goals • Originality, experimentalism (“let me see if this works”)

  9. Integration of biological information In biology: • Goals and needs of researchers evolve very quickly according to new theories and discoveries • A pre-analysis and reorganization of the data is very difficult, because data and related knowledge vary continuously • Complexity of information makes it difficult to design data models which can be valid for different domains and over time

  10. Integration methods • Explicit (reciprocal) links (xrefs) • Implicit links (e.g., names) • Common contents (vocabularies) • Shared data models and schemas • Ontologies
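The difference between the first two methods in this list, explicit cross-references and implicit links through names, can be illustrated with a small sketch. All records, identifiers and field names below are invented for illustration and do not come from any real databank.

```python
# Hypothetical records: identifiers, names and fields are invented for illustration.
catalogue = [
    {"id": "CAT:0001", "name": "Escherichia coli K-12", "xrefs": ["SEQ:A1"]},
    {"id": "CAT:0002", "name": "Saccharomyces cerevisiae", "xrefs": []},
]
sequences = [
    {"id": "SEQ:A1", "organism": "Escherichia coli K-12"},
    {"id": "SEQ:B7", "organism": "saccharomyces cerevisiae "},
]


def explicit_links(records, targets):
    """Follow explicit cross-references (xrefs) stored in each record."""
    known = {t["id"] for t in targets}
    return {(r["id"], x) for r in records for x in r["xrefs"] if x in known}


def normalise(name):
    """Crude name normalisation: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def implicit_links(records, targets):
    """Link records whose names match after normalisation."""
    by_name = {}
    for t in targets:
        by_name.setdefault(normalise(t["organism"]), []).append(t["id"])
    return {
        (r["id"], target_id)
        for r in records
        for target_id in by_name.get(normalise(r["name"]), [])
    }


print(sorted(explicit_links(catalogue, sequences)))
print(sorted(implicit_links(catalogue, sequences)))
```

Explicit links only find what a curator has already recorded, while implicit links find more but depend on how well names are normalised. That dependency is one reason shared vocabularies and ontologies appear further down the list.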

  11. Web Services • XML-based network services • Implement standard transport protocols (SOAP, HTTP) • Standards available for their retrieval and identification (UDDI), description (WSDL) and composition (WSFL) • Allow software applications to access data “intelligently”: identification of contents, interpretation of semantic information • Metadata needed • Web Services implemented by many institutes and service nodes (EBI, NCBI, ...)

  12. WSDL: the description Web Services Description Language (WSDL) • Standard for the description of Web Services • Defines the location, access methods and detailed description of a service • Covers both abstract functionality and concrete details • WSDL binding: implementation for SOAP, HTTP, MIME
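To make the SOAP binding more concrete, here is a minimal sketch of how a client could assemble a SOAP 1.1 request envelope using only the Python standard library. The envelope namespace is the standard SOAP 1.1 one. The service namespace, the operation name `searchByName` and its `name` parameter are hypothetical; in practice they would be read from the service's WSDL file.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:catalogue"  # hypothetical service namespace


def build_request(operation: str, params: dict) -> bytes:
    """Serialise one operation call as a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for key, value in params.items():
        # Parameters are written as unqualified child elements of the call.
        ET.SubElement(call, key).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and parameter, for illustration only.
request = build_request("searchByName", {"name": "HeLa"})
print(request.decode("utf-8"))
```

The resulting bytes would normally be sent in an HTTP POST to the endpoint address given in the WSDL. Real clients usually generate this plumbing from the WSDL rather than building envelopes by hand.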

  13. CABRI: Objectives Common Access to Biological Resources and Information • Setting Quality Management Guidelines • Distributing biological resources of the highest quality • Integrating searches and access to catalogues • Ad hoc search (CABRI Simple Search) • Shopping cart (pre-ordering facility)

  14. CABRI: Partners and resources Partners: • BCCM, CABI, CBS, CIP, DSMZ, ECACC, ICLC, NCCB, NCIMB (culture collections) • IST, CERDIC (ICT) Resources: • Microorganisms (bacteria, yeasts, fungi strains) • Animal cells (animal and human cell lines, hybridomas, HLA typed B lines) • Plasmids, phages, viruses, DNA probes • Overall, more than 110,000 biological resources

  15. CABRI: SRS • Reasons for adopting SRS: • Manages heterogeneous databases • Flat file format • Simple and effective interface • Internal and external links • Link operator • Easily expandable (new databases) • Flexibility in creation of indexes

  16. CABRI: data structure For each material, three data sets identified: • Minimum Data Set (MDS): essential data, needed to identify individual resources • Recommended Data Set (RDS): all data that are useful to describe individual resources • Full Data Set (FDS): all data available on the resources
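The tiered data sets lend themselves to a simple completeness check. The sketch below is only an illustration of the idea: the field names are hypothetical, since the slide does not list which fields belong to the MDS or RDS of any material. The FDS is not modelled because it is defined as whatever data is available.

```python
# Hypothetical field names; the real data sets are defined per type of material.
DATA_SETS = {
    "MDS": {"id", "name"},
    "RDS": {"id", "name", "history", "growth_conditions"},
}


def missing_fields(record: dict, level: str) -> set:
    """Return the fields of the given data set that are absent or empty."""
    return {field for field in DATA_SETS[level] if not record.get(field)}


def completeness(record: dict) -> str:
    """Report the highest data set level that the record fully satisfies."""
    if missing_fields(record, "MDS"):
        return "incomplete"
    if missing_fields(record, "RDS"):
        return "MDS"
    return "RDS"


record = {"id": "X1", "name": "example strain", "history": ""}
print(completeness(record), sorted(missing_fields(record, "RDS")))
```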

  17. CABRI: data structure For each piece of information, guidelines for data input and authentication, including: • Detailed textual description of the information • In-house reference lists of terms and controlled vocabularies • Predefined syntaxes (e.g., literature, scientific names)

  18. CABRI: Name field

  19. CABRI: Reference paper field

  20. CABRI: integration For each material: • Common data structure and syntax • Integrated searches/results through SRS For each catalogue: • SRS and HTML links to reference databases (media, synonyms, hazard, etc.) For many catalogues: • Explicit links to Medline, EMBL, plasmid maps

  21. IARC TP53 database IARC TP53 Mutation Database http://www.iarc.fr/p53/ • Release 9: 19,809 somatic mutations, 1,769 papers • Information: mutation, source, patient's lifestyle • Vocabularies and standardized annotations • On-line queries require human interaction SRS implementation of the TP53 Database http://srs.o2i.it/srs71/ • SRS-based service • Definition of an ad hoc DTD • XML-based data interchange • Improved automated accessibility

  22. CABRI and TP53 Web Services Implementing web services that allow: • The retrieval of information from CABRI and TP53 databases by using remote calls to SRS • The possibility of including such services in complex workflows Reproducing current behaviour: • Search by name, identifier and free text (CABRI) • Search by properties of interest (TP53) • Combine results • Integrate data with other sources by using IDs/common terms Two types of services: • Search for a specific feature and return IDs • Search for an ID and return the full record (or predefined sections)
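The two service types, search services that return IDs and retrieval services that return records for given IDs, compose naturally. The sketch below mimics that pattern with an in-memory dictionary standing in for the remote SRS-backed services. The record IDs, the field names and the Python function signatures are assumptions made for illustration; only the split into "search, then fetch" comes from the slide.

```python
# Hypothetical in-memory catalogue standing in for a remote SRS-backed service.
RECORDS = {
    "X1": {"name": "HeLa", "description": "human cervix carcinoma cell line"},
    "X2": {"name": "HeLa S3", "description": "clonal derivative of HeLa"},
    "X3": {"name": "CHO", "description": "hamster ovary cell line"},
}


def get_ids_by_name(name):
    """Service type 1: search on the name field and return matching IDs."""
    needle = name.lower()
    return sorted(i for i, r in RECORDS.items() if needle in r["name"].lower())


def get_ids_by_property(text):
    """Service type 1: free-text search over all fields, returning IDs."""
    needle = text.lower()
    return sorted(
        i for i, r in RECORDS.items()
        if any(needle in str(value).lower() for value in r.values())
    )


def get_records_by_id(ids, sections=None):
    """Service type 2: return full records, or only the requested sections."""
    result = {}
    for i in ids:
        record = RECORDS[i]
        result[i] = dict(record) if sections is None else {s: record[s] for s in sections}
    return result


# Combine the results of two searches, then fetch only the section of interest.
ids = sorted(set(get_ids_by_name("hela")) & set(get_ids_by_property("clonal")))
records = get_records_by_id(ids, sections=["name"])
print(ids, records)
```

Returning plain IDs from the search step keeps intermediate results small and lets a workflow intersect or merge several searches before any full record is transferred.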

  23. Soaplab: SOAP-based Analysis Web Service “Soaplab is a set of Web Services providing a programmatic access to some applications on remote computers. It is often referred to as an Analysis (Web) Service” (Martin Senger, EBI). It allows for the implementation of Web Services offering access to: • local command-line applications • EMBOSS • contents of ordinary web pages (GowLab) Requirements: • Apache Tomcat servlet engine and Axis SOAP toolkit, Java • Perl, MySQL

  24. Soaplab

  25. Soaplab
      appl: getCellLineIdsByName [
        documentation: "Get cell lines by name from CABRI human and animal cell lines catalogues (see www.cabri.org)"
        groups: "CABRI"
        nonemboss: "Y"
        comment: "launcher get"
        supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
        comment: "method [{$libs}-nam:'$name'] -ascii"
      ]
      string: libs [ parameter: "Y" ]
      string: name [ parameter: "Y" ]
      outfile: result [ ]

  26. Soaplab
      appl: getCellLineIdsByProperty [
        documentation: "Get cell lines by properties (all text) from CABRI human and animal cell lines catalogues (see www.cabri.org)"
        groups: "CABRI"
        nonemboss: "Y"
        comment: "launcher get"
        supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
        comment: "method [{$libs}-all:'$text'] -ascii"
      ]
      string: libs [ parameter: "Y" ]
      string: text [ parameter: "Y" ]
      outfile: ids [ ]

  27. Soaplab
      appl: getCellLinesById [
        documentation: "Get cell lines by Id from CABRI human and animal cell lines catalogues (see www.cabri.org)"
        groups: "CABRI"
        nonemboss: "Y"
        comment: "launcher get"
        supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
        comment: "method -e [{$libs}:'$id'] -ascii"
      ]
      string: libs [ parameter: "Y" ]
      string: id [ parameter: "Y" ]
      outfile: result [ ]
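The three definitions above differ only in the `method` template that is filled in with the `libs`, `name`, `text` or `id` parameters and handed to the `wgetz` program at the `supplier` URL. The sketch below shows one way such a template could be expanded into a URL. It is an assumption, not a description of how Soaplab actually composes the request: the arguments are taken to travel in the URL query string, percent-encoded and joined by `+`, and the library names `LIB_A` and `LIB_B` are placeholders rather than real CABRI library names.

```python
from urllib.parse import quote

# Supplier URL copied from the service definitions.
SUPPLIER = "http://www.cabri.org/CABRI/srs-bin/wgetz"


def srs_args(libs, value, field=None, entry=False):
    """Expand a 'method' template into a list of wgetz arguments.

    libs  -- SRS library names (placeholders in this sketch)
    value -- the search term or the record ID
    field -- index name such as 'nam' or 'all'; None selects by ID
    entry -- prepend the -e flag that appears in getCellLinesById
    """
    lib_part = "{" + " ".join(libs) + "}"
    if field:
        selector = f"[{lib_part}-{field}:'{value}']"
    else:
        selector = f"[{lib_part}:'{value}']"
    return (["-e"] if entry else []) + [selector, "-ascii"]


def wgetz_url(args):
    """Assumed encoding: percent-encode each argument and join with '+'."""
    return SUPPLIER + "?" + "+".join(quote(arg, safe="") for arg in args)


by_name = srs_args(["LIB_A", "LIB_B"], "HeLa", field="nam")
by_id = srs_args(["LIB_A"], "X2", entry=True)
print(wgetz_url(by_name))
print(wgetz_url(by_id))
```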

  28. Workflow management “A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition). Main goal: • the implementation of data analysis processes in standardized environments Main advantages relate to: • effectiveness: being an automatic procedure, it frees bio-scientists from repetitive interactions with the web and it supports good practice, • reproducibility: analyses can be replicated over time, • reusability: intermediate results can be reused, • traceability: the workflow is carried out in a transparent analysis environment where data provenance can be checked and/or controlled.
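Reproducibility and traceability can be illustrated with a very small workflow runner that executes steps in order and records a provenance entry for each one. This is a sketch of the general idea only, not of how any particular workflow system works. The step functions and the input list are invented for the example.

```python
import hashlib
import json


def digest(value):
    """Short, stable fingerprint of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(steps, data):
    """Run the steps in order and record provenance for each of them."""
    provenance = []
    for step in steps:
        result = step(data)
        provenance.append({
            "step": step.__name__,
            "input_digest": digest(data),
            "output_digest": digest(result),
        })
        data = result
    return data, provenance


# Two illustrative steps; real steps would call remote services.
def strip_names(names):
    return [name.strip() for name in names]


def deduplicate(names):
    return sorted(set(names))


result, log = run_workflow([strip_names, deduplicate], [" HeLa", "CHO ", "HeLa"])
print(result)
for entry in log:
    print(entry)
```

Because each entry stores digests of its input and output, a rerun on the same input can be compared step by step with an earlier run, and the output digest of one step can be matched to the input digest of the next.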

  29. Workflow management Workflow management software: • Biopipe, an add-on to bioperl • GPipe, an extension of the Pise interface • Taverna (EBI), a component of the myGrid platform • Wildfire (Bioinformatics Institute, Singapore) • Pipeline Pilot (SciTegic)

  30. Workflow management Taverna Workbench • constructs complex analysis workflows • accesses both remote and local processors • defines alternative processors • runs workflows • visualizes the results • includes a bioinformatics data ontology Requirements: Java, Windows or Linux

  31. Workflow management • WSDL services: a Web Services Description Language (WSDL) file adds WSDL-based service nodes • Soaplab servers: a Soaplab server adds a list of Soaplab-provided services • Biomoby registries: the Moby Central repository determines hosts and their services • Workflows: an XScufl definition file adds the workflow as a node and its processors as child nodes • Biomart databases: a Biomart data warehouse adds all available data sets • Local processors: simple list/string processors, constant values, beanshell scripts

  32. Demo: workflows for CABRI databases

  33. Demo: workflows for TP53 databases

  34. Some acknowledgements. This work has partially been supported by the Italian Ministry for Education, University and Research (MIUR), project “Oncology over Internet” (2002-2005). I wish to thank my colleagues: Domenico Marra (TP53 databases and Soaplab), Federico Malusa (CABRI databases), Francesca Piersigilli (CABRI databases)

  35. …and an announcement! Workshop NETTAB 2005 http://www.nettab.org/2005/ Workflows management: new abilities for the biological information overflow. October 5-7, 2005, University of Naples, Naples, Italy. Take a brochure!
