BioMashups: The New World of Exploratory Bioinformatics?

Jiro Sumitomo, James M. Hogan, Felicity Newell, Paul Roe. Microsoft QUT eResearch Centre. j.hogan@qut.edu.au

An Agenda • Bioinformatics • Tools, Data, and linking them together • Exploration vs. Routine Workflow • Mashups and BioMashups • Some basics and some canonical examples • Biomashups and their limitations • Predictin’ the future

Bioinformatics • Abundance of tools and data sources • Traditional standalone applications • Interactive web sites • (More recently) web service hooks • Usually purpose-specific tools • Link together to solve complex problems

Linking tools together • The workflow trade-off: sophistication vs. development effort • Keep it simple, and keep the scientist involved • Make it complex, and make the scientist a client • Bench scientists usually aren’t software engineers • But they can chain operations together if they have the right primitives and the right glue

Extremes of Scientific Workflow • The manual data management system • Also known as cut-and-paste from Excel • Cannot scale, but it presents no barriers… • Robust workflow systems: Taverna, Kepler et al. • Essential for high-end instrumentation; well-engineered, with support for provenance • But significant set-up and familiarisation…

The Middle Ground… • Scripting in Perl, Python et al. • Significant programming skills needed • Useful for well-defined processes, but exploratory work is time-consuming • Accessing remote data and linking web services is beyond most scientists • [A niche for biomashups?]

Mashups • Mashups are web-based applications for the combination of data sources and services • Earliest mashups used JavaScript to link exposed service and data APIs, and to wrap existing tools • Same issues as Perl scripting, with the additional need to organise hosting • Little incentive to standardise or share

Mashup Frameworks • Development environments, hosting and publication • Common interface structure • Building a community? • Scripting for scientists? • Overcoming the programming barrier • Depends on the libraries, primitive ops • And there is (usually) JavaScript under the hood

Some of the players…

Mashups & Data • Mashups are limited by data exchange • Good at passing an index to the data • Think latitude & longitude • Bad at passing massive data sets around • [Figure: Client mashup architecture. Labels: Third Party Services, Mashup Server, Mashup, Client web browser; examples shown: Facebook, Virtual Earth]

BioMashups • Middle ground between cut-and-paste and full workflow management systems • Corresponds best to Perl scripting • Ideal when user intervention is needed • May be seen as a prototype for workflow • Helps to mask complex data access and search tools which frustrate experts and drive students to exasperation…

SDLM1 • Perform a blastx on the sequence. • Obtain the best hit/hits by inspection of the BLAST output page. • Retrieve the GenBank record of the best hit by clicking on the link in the output page. • Determine the known regions by inspection, in this case an ANF_receptor. • Perform an Entrez search on this region.

The New UG Biology: SDLM1 • Perform a blastx on the sequence. (NCBI Blast block) • Obtain the best hit/hits by inspection of the BLAST output page. (NCBI Blast result parser block) • Retrieve the GenBank record of the best hit by clicking on the link in the output page. (RDF block, pointing to Bio2Rdf) • Determine the known regions by inspection, in this case an ANF_receptor. (The mashup parses the RDF document instead: Bio2Rdf block) • Perform an Entrez search on this region. (NCBI Entrez block)
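The chain of blocks on this slide can be sketched as a sequence of small functions, each taking the output of the previous one. This is only an illustration of the chaining idea: every function name, the accessions HIT_A and HIT_B, the scores, and the canned records are hypothetical, and no NCBI or Bio2Rdf service is called. Only the ANF_receptor region comes from the slide.

```python
from typing import Callable, Dict, List

# A block takes the accumulated state and returns an extended copy of it.
# All blocks below are hypothetical stand-ins with canned data; none of
# them contacts NCBI or Bio2Rdf.
Block = Callable[[Dict], Dict]


def blast_block(state: Dict) -> Dict:
    # Stand-in for the blastx step: returns invented hits for the query.
    hits = [
        {"accession": "HIT_A", "score": 512.0},
        {"accession": "HIT_B", "score": 87.5},
    ]
    return {**state, "hits": hits}


def best_hit_block(state: Dict) -> Dict:
    # Replaces "inspection of the output page": pick the top-scoring hit.
    best = max(state["hits"], key=lambda hit: hit["score"])
    return {**state, "best_hit": best["accession"]}


def record_block(state: Dict) -> Dict:
    # Stand-in for retrieving the record of the best hit (canned lookup).
    records = {
        "HIT_A": {"regions": ["ANF_receptor"]},
        "HIT_B": {"regions": []},
    }
    return {**state, "record": records[state["best_hit"]]}


def region_block(state: Dict) -> Dict:
    # Replaces "determine the known regions by inspection".
    return {**state, "regions": list(state["record"]["regions"])}


def entrez_query_block(state: Dict) -> Dict:
    # Only prepares the search terms; the search itself is not performed.
    return {**state, "queries": list(state["regions"])}


def run_pipeline(blocks: List[Block], state: Dict) -> Dict:
    # The "glue": feed each block's output into the next block.
    for block in blocks:
        state = block(state)
    return state


SDLM1 = [blast_block, best_hit_block, record_block, region_block, entrez_query_block]
result = run_pipeline(SDLM1, {"sequence": "ATGGCC"})
print(result["queries"])  # ['ANF_receptor']
```

Because each block only reads and extends a shared state, a student could swap one block (say, a different hit-selection rule) without touching the rest of the chain.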

Case Study: Analysing Proteins • Protein Characteristics • Name, sequence • Journal articles, cross-references • Protein Prediction • Molecular weight, isoelectric point • Secondary structure, post-translational modifications

Data & Services

Mashups Architecture • 13 custom blocks • 1) Input and output • 2) Processing: protein characteristics • 3) Processing: protein prediction • [Figure: block diagram with labels Input, Protein Characteristics, Protein Prediction, Combine, Output]
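One plausible reading of this block diagram is a fan-out and merge: the input is handed to two independent processing branches, and their results are combined before output. The sketch below assumes that reading. The function names, the identifier X00000, and the canned record are all hypothetical, and the "prediction" branch computes only a sequence length as a placeholder for a real prediction service.

```python
from typing import Dict

# Hypothetical canned data standing in for a remote protein database.
CANNED_RECORDS = {"X00000": {"name": "example protein", "articles": 3}}


def input_block(uniprot_id: str, sequence: str) -> Dict:
    # Input: a small record holding an identifier and a sequence.
    return {"uniprot_id": uniprot_id, "sequence": sequence}


def characteristics_block(item: Dict) -> Dict:
    # Branch 1: look up descriptive data by identifier (canned lookup).
    record = CANNED_RECORDS[item["uniprot_id"]]
    return {"name": record["name"], "article_count": record["articles"]}


def prediction_block(item: Dict) -> Dict:
    # Branch 2: placeholder for prediction services; only the length here.
    return {"sequence_length": len(item["sequence"])}


def combine_block(*parts: Dict) -> Dict:
    # Combine: merge the partial results into a single report.
    merged: Dict = {}
    for part in parts:
        merged.update(part)
    return merged


def output_block(report: Dict) -> str:
    # Output: render the report as sorted "key: value" lines.
    return "\n".join(f"{key}: {report[key]}" for key in sorted(report))


def run_mashup(uniprot_id: str, sequence: str) -> str:
    item = input_block(uniprot_id, sequence)
    report = combine_block(
        item, characteristics_block(item), prediction_block(item)
    )
    return output_block(report)


print(run_mashup("X00000", "MKTAY"))
```

Keeping the two branches independent of each other means either can be replaced or extended with further blocks without changing the combine and output steps.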

BioMashups for Proteins • Given its UniProt ID, how much can we find out about a particular protein?

BioMashups for Proteins • Given its sequence, what properties can we readily obtain from web-based prediction services?
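As a concrete instance of one such property, the molecular weight of a protein can be estimated from its sequence alone. The sketch below computes it locally instead of calling a web-based prediction service, by summing standard average residue masses and adding one water molecule for the free termini. The function name is hypothetical, and the result is an estimate that ignores post-translational modifications.

```python
# Average residue masses in daltons (amino acid minus one water),
# as commonly tabulated for the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Estimate the average molecular weight (Da) of an unmodified protein."""
    total = WATER_MASS
    for residue in sequence.upper():
        if residue not in AVERAGE_RESIDUE_MASS:
            raise ValueError(f"Unknown residue: {residue!r}")
        total += AVERAGE_RESIDUE_MASS[residue]
    return total


# A single glycine comes out at roughly 75.07 Da.
print(round(molecular_weight("G"), 2))
```

Properties such as secondary structure or post-translational modifications have no comparably simple formula, which is where the web-based prediction services in the mashup come in.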

Predictin’ is difficult… but • Frameworks can and will support: • Ad hoc exploratory bioinformatics • Index-based routine computation • Building (enclave) communities • Varying levels of success in allowing: • Scientist (& student) driven mashups • Sharing and re-use of components

Predictin’ is difficult… but • It will be a long time before mashup frameworks: • Are used to process data from high-throughput sequencing machines • Process large-scale collections • Beat Taverna & Kepler at provenance

Predictin’ is difficult… BUT

Overcoming the barriers… • Building a general BioMashups community • Cross-over between frameworks • Seeding the community with ‘re-usable’ components and reaching critical mass • The myExperiment BioMashups group • Bringing BioMashups to the curriculum • The new undergraduate biology

Links • MQUTeR Bio & BioMashups: http://www.mquter.qut.edu.au/bio/ http://www.mquter.qut.edu.au/bio/biomashups.aspx • myExperiment BioMashups Group: http://www.myexperiment.org/groups/99 • Protein Mashups: http://www.mquter.qut.edu.au/bio/ProteinMashupsb[1].wmv http://www.popfly.com/users/fsn/Protein%20Biomashups%20Summary%20page

Acknowledgements

Questions?
