Doing it again: Workflows and Ontologies Supporting Science

Phillip Lord, Frank Gibson. Newcastle University

Outline • Describe the background problem • Introduce distributed services, workflows, eScience and (a bit of) ontologies. • CARMEN • Provenance • Can we repeat an experiment?

Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

Around the world in 80 days • Biology is still largely a cottage industry • On a global stage

Websites everywhere
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

WBS Workflows [Workflow diagram: a query nucleotide sequence, identified by a GenBank accession number, is passed through a network of analysis services. Services shown include RepeatMasker (repetitive elements), GenScan (coding sequence), ncbiBlastWrapper (blastn vs nr and est databases; blastp vs nr; tblastn vs nr, est, est_mouse and est_human databases), seqret, sixpack (6 ORFs), transeq, prettyseq (translation/sequence file, good for records and publications), pepstats (MW, length, charge, pI, etc.), pepcoil (coiled-coil regions), pepwindow and octanol (hydrophobic regions), epestfind (PEST sequences), pscan (FingerPRINTS), restrict (restriction enzyme map), cpgreport (CpG island locations and %), SignalP, TargetP and PSORTII (cellular location), and InterPro, PFAM, Prosite and Smart (functional and structural domains/motifs). Colour key: pink = outputs/inputs of a service; purple = tailor-made services; green = Emboss soaplab services; yellow = Manchester soaplab services; grey = unknowns.]

myGrid is an EPSRC-funded UK eScience Programme Pilot Project. Particular thanks to the other members of the Taverna project, http://taverna.sf.net

Web Services Web services support machine-to-machine interaction over a network. Note: NOT the same as services on the web. A web service is: • a technology and standard for exposing code or databases through an API that a third party can consume remotely • accompanied by a description of how to interact with it. Web services are: • Self-contained • Self-describing • Modular • Platform independent

Workflows Workflow language specifies how bioinformatics processes fit together. High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows. Workflow is a kind of script or protocol that you configure when you run it. Easier to explain, share, relocate, reuse and repurpose. The METHODS section of a scientific publication
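As a rough illustration of keeping the workflow description separate from the code behind each step, the sketch below chains three toy steps from a plain list of step names. Everything here is hypothetical: `fetch_sequence`, `transcribe`, `length_stats`, `SERVICES` and `run_workflow` are stand-ins invented for the example, not Taverna's workflow language or any myGrid service, and a real workflow would call remote services rather than local functions.

```python
from typing import Any, Callable, Dict, List


# Hypothetical local stand-ins for remote services.
def fetch_sequence(accession: str) -> str:
    """Pretend lookup: return a fixed nucleotide string for any accession."""
    return "acatttctaccaacagtgga"


def transcribe(dna: str) -> str:
    """Turn a lowercase DNA string into RNA by replacing t with u."""
    return dna.replace("t", "u")


def length_stats(seq: str) -> Dict[str, int]:
    """Report the sequence length and the number of 'a' residues."""
    return {"length": len(seq), "a_count": seq.count("a")}


# Registry mapping step names to the code that implements them.
SERVICES: Dict[str, Callable[[Any], Any]] = {
    "fetch_sequence": fetch_sequence,
    "transcribe": transcribe,
    "length_stats": length_stats,
}

# The workflow itself is only data: an ordered list of step names.
WORKFLOW: List[str] = ["fetch_sequence", "transcribe", "length_stats"]


def run_workflow(steps: List[str], initial_input: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    value = initial_input
    for name in steps:
        value = SERVICES[name](value)
    return value


result = run_workflow(WORKFLOW, "EXAMPLE_ACCESSION")
```

Because `WORKFLOW` is just a list, it can be shared, reordered or repurposed without touching the step implementations, which is the property the slide attributes to workflows.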

The Taverna Workbench http://taverna.sourceforge.net http://www.mygrid.org.uk

Workflows • Automating away cutting and pasting. • Helps to deal with distribution of data. • myGrid and Taverna built on the open nature of bioinformatics. • Can we adapt the same approach to another discipline?

Engineering and Physical Sciences Research Council. CARMEN: Code, Analysis, Repository and Modelling for e-Neuroscience. www.carmen.org.uk

Consortium & Profile • $10M over 4 years • 20 Investigators • Stirling, St. Andrews, Newcastle, York, Manchester, Sheffield, Leicester, Cambridge, Warwick, Imperial, Plymouth • Commenced 1st October 2006

Industry & Associates

Virtual Laboratory for Neurophysiology • Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated

Potential Barriers • Technical • Multiple proprietary formats • No standardised metadata • Volume of data to be analysed • Cultural • Multiple communities acting independently • Concerns about implications of sharing

Comparing to bioinformatics • Cottage industry • Global distribution • Need to share • But….

Age and Impact.

No sequences! • DNA and Protein sequence form a core datatype for bioinformatics • It’s simple to structure and to store, and it is of high-value • Initially, there wasn’t much of it, and textual metadata was fine. • Many people built tools over it, for transforming and manipulating.

The need for clear metadata • Most neuroscience data is relatively simple in structure • But often contextually complex • Sometimes associated with behavioural features

Neuroscience spike data • The raw data is just a waveform • But what is the experiment for? • What stimulus is the organism/tissue receiving? • Even: which channel is which? • The data sets being produced are (reasonably) large (tens of GB, or 1 TB in three months)
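To make the questions above concrete, here is a minimal sketch of a metadata record that would have to travel with a raw waveform. The class `SpikeRecordingMetadata`, its field names and the example values are all hypothetical; this is not the CARMEN metadata schema, only an illustration of the kind of context a waveform lacks on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpikeRecordingMetadata:
    """Hypothetical minimal description of one recording session."""

    purpose: str       # what the experiment is for
    preparation: str   # organism or tissue being recorded
    stimulus: str      # what the organism/tissue is receiving
    channel_map: Dict[int, str] = field(default_factory=dict)  # channel -> label

    def missing_fields(self) -> List[str]:
        """List the fields a curator would still need to fill in."""
        missing = [
            name
            for name in ("purpose", "preparation", "stimulus")
            if not getattr(self, name).strip()
        ]
        if not self.channel_map:
            missing.append("channel_map")
        return missing


# Placeholder values for illustration only.
record = SpikeRecordingMetadata(
    purpose="illustrative example only",
    preparation="",
    stimulus="",
    channel_map={0: "electrode A"},
)
incomplete = record.missing_fields()
```

A check like `missing_fields` could be run before a data set is accepted, so that recordings without enough context are flagged early.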

Data Sharing in bioinformatics • Data sharing was an early tradition in biology. • Gene patenting, NDAs and the like came as quite a surprise • Many political battles were fought, culminating in the Clinton/Blair statement

Data Sharing in Neurosciences • The data is easy to structure, but the metadata is not • There is, therefore, less point to sharing data • Many neuroscientists come from a medical background • tends to be more of a hierarchical, secretive profession – all worried about getting sued. • A lot of neuroscientists use invasive, live animal experiments • security is more than a passing concern.

The difference in neuroscience • Less data sharing tradition • No rich ecosystem of tools • Higher barrier to entry for metadata • Larger datasets

Virtual Laboratory Node • Deployment of Data & Analysis Code in Processes • Raw & Derived Data File Store • Structured Metadata Store Enabling Search & Annotation • Raw Signal Data Search & Visualisation • Analysis & Model Code Store • Security Policies Controlling Access to Data & Code • Search for Data & Analysis Code

Development Timeline • Security (April 2008): security allowing access to data and analysis code to be controlled • Data and Scripting Support (April 2008): support for an extended range of data formats and scripting languages • Metadata (April 2008): structured metadata allowing data and analysis code to be described and searched • Provenance (July 2008): provenance of analysis and modelling processes leading to scientific results • CARMEN v1.0 (October 2008): release of CARMEN v1.0, virtual laboratory nodes open to the CARMEN consortium • CARMEN v2.0 (October 2009): release of CARMEN v2.0, virtual laboratory nodes “networked”

Virtual Laboratory Infrastructure Networked Nodes at Newcastle and York. More planned …

Vision – Global Laboratory

Some Unexpected Advantages • Big problem with bioinformatics services • Over time they tend to disappear • CARMEN keeps services and data together • This means we should be able to rerun analyses later. • We should be able to store provenance

What is Provenance

What does it mean to rerun an experiment? • Replicability: one scientist should be able to repeat another’s experiment, under equivalent conditions, at a different time. • Rerunability: a scientist should be able to apply an equivalent technique under new circumstances. • Adding services to this mix complicates the issue. [Diagram labels: Replicability, Rerunability, Old Data, New Data]
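One way to picture how services complicate the distinction is to record, for each run, a digest of the input data and the version of the service used, and then compare two records. The sketch below does this under stated assumptions: `RunRecord`, `digest`, `classify` and the four labels it returns are hypothetical names and a simplification of the definitions above, not a CARMEN or myGrid provenance format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical provenance record for one analysis run."""

    data_digest: str      # fingerprint of the input data
    service_version: str  # version of the analysis service that was called


def digest(data: bytes) -> str:
    """Fingerprint input data with SHA-256 so runs can be compared."""
    return hashlib.sha256(data).hexdigest()


def classify(original: RunRecord, later: RunRecord) -> str:
    """Label a later run relative to the original (labels are illustrative)."""
    same_data = original.data_digest == later.data_digest
    same_service = original.service_version == later.service_version
    if same_data and same_service:
        return "replication"
    if same_service:
        return "rerun on new data"
    if same_data:
        return "old data, changed service"
    return "new data, changed service"


first = RunRecord(digest(b"recording-1"), "1.0")
repeat = RunRecord(digest(b"recording-1"), "1.0")
new_data = RunRecord(digest(b"recording-2"), "1.0")
new_service = RunRecord(digest(b"recording-1"), "2.0")
```

The two middle cases are the easy ones to tell apart only if the service version was stored with the run; without it, a changed result cannot be attributed to the data or to the service.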

• Has the state of the world advanced since previously? • Has the world changed, in a comparable way? • Has the service changed in a comparable way? • Is the specification of what happened actually right? [Diagram labels: Eager Neuroscientist, Neuroscientist comparing to existing work, Tool Builder, Error-Prone Neuroscientist; Replicability, Rerunability; Old Data, New Data; Old Services, New Services]

There is a difficulty • There is less tradition of data sharing • The tendency to want to control data is much stronger • If we want to data mine, we have to cope with “data is mine” • If we have many different repositories, this needs to be supported computationally

An Example: Licensing • Computationally amenable licenses are available • Take, for example, Creative Commons
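As a small sketch of what "computationally amenable" can mean, Creative Commons licence URIs encode their conditions in the path (for example `by-nc-sa` for Attribution, NonCommercial, ShareAlike), so a repository could read them mechanically. The helper names `cc_elements` and `permits_derivatives` are hypothetical, the parsing assumes the common `/licenses/<code>/<version>/` URI layout, and the rule applied is a simplification for illustration rather than a statement of how any repository decides access.

```python
from typing import FrozenSet
from urllib.parse import urlparse


def cc_elements(license_uri: str) -> FrozenSet[str]:
    """Extract the condition codes (by, nc, nd, sa) from a CC licence URI.

    Assumes a path of the form /licenses/<code>/<version>/.
    """
    parts = [p for p in urlparse(license_uri).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "licenses":
        raise ValueError(f"not a recognised CC licence URI: {license_uri}")
    return frozenset(parts[1].split("-"))


def permits_derivatives(license_uri: str) -> bool:
    """Simplified rule: derived works are allowed unless 'nd' is present."""
    return "nd" not in cc_elements(license_uri)


shared = "https://creativecommons.org/licenses/by-nc-sa/4.0/"
locked = "https://creativecommons.org/licenses/by-nd/4.0/"
shared_codes = cc_elements(shared)
```

A repository holding many data sets could apply such a check across all of them before pooling data for mining, instead of relying on someone reading each licence by hand.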

Conclusions • Automated workflows have been applied very successfully in bioinformatics. • But applying these directly to neuroinformatics is a different issue. • Technology has to fit the domain. • We are investigating metadata for describing neuroinformatics

myGrid acknowledgements Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer • OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble. • Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan. • Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people. • User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell. • Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe. • Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica. • Funding: EPSRC, Wellcome Trust.

Acknowledgements Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr. Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigo Quian Quiroga, Dr. Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders, Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher, Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez, Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan. The University of Sheffield. University of St Andrews.
