The Seven Deadly Sins of Bioinformatics. Professor Carole Goble carole.goble@manchester.ac.uk The University of Manchester, UK The myGrid project http://www.mygrid.org.uk myExperiment http://myexperiment.org OMII-UK http://www.omii.ac.uk.
We’ve been developing software and ontologies and data and stuff with and for Bioinformatics and Bioinformaticians for a long time. “though it took two years before we understood each other!” Andy Brass, Professor of Bioinformatics
myGrid Taverna Workflow Workbench http://www.mygrid.org.uk http://taverna.sourceforge.net
The Seven Deadly Sins of Bioinformatics BOSC 2007 Slideshare.net 8091 views (02-June-2008) http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics/ From an original idea by Stevens and Lord
Methodology • Email a handful of bioinformaticians. • Stand well back. • Collect. • Edit. • Therapy on the cheap. • We all felt better.
I am grateful to… • Phil Lord (University of Newcastle) • Anil Wipat (University of Newcastle) • Matthew Pocock (University of Newcastle) • Robert Stevens (University of Manchester) • Paul Fisher (University of Manchester) • Duncan Hull (Manchester Centre for Systems Biology) • Norman Paton (University of Manchester) • Marco Roos (University of Amsterdam) • Rodrigo Lopez (EBI) • Tom Oinn (EBI) • Andy Law (Roslin Institute) • Graham Cameron (EBI) 8
The Traditional Sins…. • Lust • Gluttony • Greed • Sloth • Wrath • Envy • Pride http://en.wikipedia.org/wiki/Seven_deadly_sins 10
The Sins of Bioinformatics • Parochialism and Insularity • Exceptionalism • Autonomy or death! • Vanity: Pride and Narcissism • Monolith Megalomania • Scientific Method Sloth • Instant Gratification
Sin 1: Parochialism and Insularity Parochialism • “being provincial, being narrow in scope, or considering only small sections of an issue.” http://en.wikipedia.org/wiki/Parochialism Insularity • “a person, group of people, or a community that is only concerned with their limited way of life and not at all interested in new ideas or other cultures.” http://en.wikipedia.org/wiki/Insularity 12
Reinvention of the Wheel • Rediscovering the same old problems, techniques, methods. • Creating…Yet another … • identity scheme. • representation mechanism for data. • ontology. • data warehouse. • integration framework. • query or ontology or workflow language. • Result? Misery. But more work for the boys…. 13
WSL-1 protein • Apoptosis-mediating receptor DR3 • Apoptosis-mediating receptor TRAMP • Death domain receptor 3 • WSL protein • Apoptosis-inducing receptor AIR • Apo-3 • Lymphocyte-associated receptor of death • LARD • GENE: Name=TNFRSF25 Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor Annotation history: Q92983 O00275 O00276 O00277 O00278 O00279 O00280 O14865 O14866 P78507 P78515 Q93036 Q93037 Q99722 Q99830 Q99831 Q9BY86 Q9UME0 Q9UME1 Q9UME5 http://www.expasy.org/uniprot/Q93038 14
Andy Law's Third Law • “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”... and is frequently many, many more. http://bioinformatics.roslin.ac.uk/lawslaws.html 15
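The identifier sprawl above can be made concrete with a small lookup that collapses every secondary accession and synonym onto the one primary accession. This is a minimal sketch, not UniProt's own mechanism: the accessions and names are copied from the TNFRSF25 slide, while `build_index` and `resolve` are hypothetical helper names introduced only for illustration.

```python
# Primary accession, secondary accessions and synonyms as listed on the slide.
PRIMARY = "Q93038"
SECONDARY = [
    "Q92983", "O00275", "O00276", "O00277", "O00278", "O00279", "O00280",
    "O14865", "O14866", "P78507", "P78515", "Q93036", "Q93037", "Q99722",
    "Q99830", "Q99831", "Q9BY86", "Q9UME0", "Q9UME1", "Q9UME5",
]
SYNONYMS = [
    "WSL-1 protein", "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP", "Death domain receptor 3",
    "WSL protein", "Apoptosis-inducing receptor AIR", "Apo-3",
    "Lymphocyte-associated receptor of death", "LARD", "TNFRSF25",
]


def build_index(primary, secondary, names):
    """Map every known identifier (case-insensitively) to the primary accession."""
    index = {primary.upper(): primary}
    for key in list(secondary) + list(names):
        index[key.strip().upper()] = primary
    return index


def resolve(index, identifier):
    """Return the primary accession for an identifier, or None if unknown."""
    return index.get(identifier.strip().upper())


INDEX = build_index(PRIMARY, SECONDARY, SYNONYMS)
```

With this index, `resolve(INDEX, "O00275")` and `resolve(INDEX, "lard")` both return `"Q93038"`, while an identifier that is not in the table returns `None` rather than a guess.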
Bioinformatics is about mapping one schema to another, one format to another, one id scheme to another. Comparative Genomics? Comparative Bioinformatics! What a waste of time. But a handy distraction from doing some Real Science™.
Andy Law’s First (Format) Law “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.” female male crimap 0 1 1 0 Keightly Knott and Haley 1 2 When they use ‘3’ and ‘4’ we will know they are doing it deliberately. http://bioinformatics.roslin.ac.uk/lawslaws.html 17
Yet another database … FlyBase, WormBase, SGD, BeeBase and many other large and small community databases • Organism databases • 250+ metabolic pathway databases. • Generic Model Organism Database Toolkit. Yet another Integration Platform… • Warehouses, Views, Mashups, Workflows … • Goble and Roberts, The state of the nation in data integration in the Life Sciences. JBI. 18
BioBabel • bioperl • biojava • biopython • bioruby • biophp • biosql • biouml • bioeclipse • biofoo • biobar • bio* 19
Computer Science? No thanks! Integration platforms Tools Workflow systems Ontology and Knowledge Representation Languages Database Schemas Programming Languages 20
Why don’t biologists modularise OWL* ontologies properly? *The W3C Web Ontology Language Er, well, like how should we do it “properly” and where are the tools to help us? We don’t know and we haven’t got any. But here are some vague guidelines. W3C Semantic Web for Life Sciences mailing list, 2005 21
A few months in the laboratory can save a few hours in the library. Westheimer's Law A few years at the computer can save a few hours on Google.
Reuse Rocks. Collaboration through workflow and web services VL-e Project • “instant collaboration” with Martijn Schuemie (Rotterdam) through a web service that discloses their protein synonym data. • Exchanging services and (sub)workflows with food scientists. • Web services make that easier. This isn’t the workflow – it’s just a picture of one!
Of Mice and Cows • A Trypanosomiasis in Cattle workflow (by Paul*) reused without change for Trichuris muris Infection (by Jo). • Identified the biological pathways believed to be involved in the ability of mice to expel the parasite. • Workflows are memes. Scientific commodities. To be exchanged and traded and vetted and mashed. • Users add value. *Fisher P et al A systematic strategy for large-scale analysis of genotype–phenotype correlations: identification of candidate genes involved in African trypanosomiasis, Nucleic Acids Research, 2007, 1–9 24
Sin 2: Exceptionalism • Biologist exceptionalism • Biological exceptionalism • Biology exceptionalism A root cause of Reinvention Syndrome “Bioinformatics is special” “Our domain-specific outcomes require specific approaches and technologies”
Biologist exceptionalism I’m different. We are all individuals. • I know there is already a gene name for that gene, but, I don't like it and it doesn't fit in with my schema. • It would be better if I wrote the script I need so I know what it does, how it does it and how to modify it later because I haven’t specified what it was supposed to do in the first place. 26
I am considerably more complex than you… • “There are proteins, and there are records about proteins. Records come in different formats. If I make a statement using this url, is it about the record? or the protein?” Alan Ruttenberg • “[Usually] we have one entry per gene. We have several entries for a single gene when description of variations are too complicated to describe in FT lines (of course, this criteria depends on the annotator). For viruses, it is much more messy, due to ribosomal frame-shifts. Formalise that!” Eric Jain UniProtDB er…decomposition and untangling? 27
Biology Exceptionalism • Drawing graphs of data sets over time. • Stop it. • The real problem is complexity not scale. • The number of data sets, their diversity and how they overlap. • How they change. • Their reliability. 28
Biological exceptionalism “Biology is all exception.” • Don’t complicate everyone’s life for the sake of a few esoteric cases. • Cameron’s 5th Commandment of Curation • Exceptionalism paralysis. • Gather requirements expansively, prune ruthlessly. • The EMBL/GenBank/DDBJ Feature Table
Sin 3: Autonomy or death! With added churn, indifference to users and monopoly mentality. • Compounded by the Early Adopter tendency of the community • “Hell is other people’s systems” as Jean Paul Sartre would have said if he had been a bioinformatician. 30
Professional Autonomy IS death! • I’ll change my interface / format whenever I feel like it, despite the fact I wanted lots of users and I have lots of users who depend on this. And I won’t bother to debug either or provide backwards compatibility. So there. • BioMART changed 4 times in 2007. • NCBI changes as it fancies. • Ensembl relational schema. • Early BioJava. • Churning for change’s sake
No tool is an island… • Assume • only we will use it, whatever it may be. • that it will be freestanding and unlinked to anything else. • that it will always work and will keep on working. • that everyone will understand it. • “Well I know what I mean. And so does my friend. So I don’t need to specify it. Or document it properly. Or keep the metadata up to date.” • Never mind the interface, just look at my implementation! • Metadata, Models, Interfaces, Services matter.
Workflow commodities • Workflow published with its paper and its data set. • So what happens when I want to run this workflow again? • Is the service dead? • Is the dataset still there? • Was it designed to be reproduced or reused in the first place? 33
The myGrid Service DeCrypting Sweatshop notice how tired they look Franck Tanoh Katy Wolstencroft 34
Lincoln Stein said a while ago… “An interface is a contract between data provider and data consumer” • Document interface; warn if it is unstable • Do not make changes lightly • Even little fiddly changes break things • Provide plenty of advance warning • When possible, maintain legacy interfaces until clients can port their scripts Support as many interfaces as you can • HTML, Text only (better), HTTP, REST, SOAP • Easy Interfaces + Power User Interfaces …and he could say it again today. 35
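One way to read the "maintain legacy interfaces" advice is to keep the old entry point alive as a thin wrapper that warns and then delegates to the new one. The sketch below is an illustration under stated assumptions: `fetch_sequence`, `get_seq`, the `DEMO1` accession and the placeholder sequence are all hypothetical and do not come from any real service.

```python
import warnings

# Placeholder data store; the accession and sequence are invented for the demo.
_STORE = {"DEMO1": "ACGT"}


def fetch_sequence(accession, *, fmt="fasta"):
    """Current interface (hypothetical): returns a record as FASTA or plain text."""
    if fmt not in ("fasta", "text"):
        raise ValueError(f"unsupported format: {fmt!r}")
    sequence = _STORE[accession]
    if fmt == "fasta":
        return f">{accession}\n{sequence}\n"
    return sequence


def get_seq(acc):
    """Legacy interface (hypothetical), kept until clients have ported their scripts."""
    warnings.warn(
        "get_seq() is deprecated; use fetch_sequence(accession, fmt='fasta')",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return fetch_sequence(acc, fmt="fasta")
```

Old scripts that call `get_seq("DEMO1")` keep working and get advance notice through the warning, while new clients can ask for the text-only form with `fetch_sequence("DEMO1", fmt="text")`.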
Stability is more important than Standards, Smartness, or the latest Thing that everybody is talking about.
Sin 4: Vanity: Pride and Narcissism • Conceit, egotism or simple selfishness. • Applied to a social group, denotes elitism or an indifference to the plight of others.
And what would you suggest, Mr. Smartie Pants? I know it all. • Claiming to know everything about biology and everything about computers. • Really irritating to biologists AND computer scientists. • Even they don’t claim to know everything about biology or computer science. • Computer scientists do know a lot of stuff. And they publish too. • “Biologists are the experts on everything because we produce the data” 38
Think like me! Be like me! Misunderstanding and disrespecting users • Designing good experiments is hard. • Workflows are computational experimental protocols. Ergo…. • Writing workflows is hard. • Writing good workflows is really hard. • Writing good reusable workflows is really really hard. • Building interfaces that only you can use. • Not actually using your tools in the field. I understand workflows Workflows are for biologists. My granny can do workflows... 39
A good User Experience outweighs smart features and cool computing. Bummer. Can I use it? Is the user interface familiar? Does it fit with my needs?
Sin 5: Monolith Megalomania • delusions of grandeur. • obsession with grandiosity and extravagance. • Data mining • “my data is mine, and your data is mine”
Integration – the more the merrier. No. • Every link is a potential dead link. • Every dependency finds its way on to your critical path. • “Uber-tools” and “Uber-databases” • Ensembl, BioMART etc etc….cost • Apps/libraries in bioinformatics workbenches • with loads of crap bundled in, none of it kept up to date, none of it properly integrated. • Put it all in a warehouse. • Lots of warehouses and lots of toolkits…GMOD, BioWarehouse, BioMART blah blah…. • 50% warehouses fail. (Standish Group) 42
The trouble with warehouses • “Warehouses work? Piffle. They never manage to maintain synchrony with the source data. Mostly they fall down of their own weight!” Graham Cameron, EMBL-EBI • “Our ability to capture and store data far outpaces our ability to process and exploit it. This growing challenge has produced a phenomenon we call the data tombs, or data stores that are effectively write-only; data is deposited to merely rest in peace, since in all likelihood it will never be accessed again. Data tombs also represent missed opportunities.” Usama Fayyad Yahoo! Research! Laboratories! • “We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable” Microsoft 2020 Science report.
Distributed Annotation System http://www.biodas.org [Figure: one Reference Server and three Annotation Servers; labels shown include AC003027, M10154, AC005122, WI1029, AFM820, AFM1126 and WI443.] Now we call this a mash up
Sin 6: Scientific Method Sloth • It’s easier to think of a new name than use someone else’s. • I want my own view over data and views are difficult, so I’ll create my own database. • Leads to Reinvention, Exceptionalism • Often the result of Instant Gratification
Ennui • Garbage in, garbage out • Running analysis over the wrong datasets • E.g. identifying chicken proteins in mouse cells. • Configuration traditionalism • Not changing the parameters of BLAST. Ever. • Top list ennui • If there is a list, only looking at the first item. • Looking no further than the first BLAST hit / first Google hit. • Arbitrary cut-offs on rank-ordered result lists • Absolute truth above, absolute falsehood below • E.g. differentially expressed genes in microarray analyses.
Quality Delusions • The bioinformatics does not have to be sound, because we only trust wet-lab results anyway. • Worrying about errors in experimental data but believing that derived data is always true. • Believing TrEMBL is always right. • Believing computational gene predictions are always correct.
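The "absolute truth above, absolute falsehood below" habit can be softened by reporting a borderline band around the cut-off instead of a single hard split. The following is only a sketch of that idea; the function name `triage`, the `margin` parameter and the gene names and scores in the usage note are invented for illustration and are not taken from any real analysis.

```python
def triage(scores, cutoff, margin):
    """Split ranked scores into above / borderline / below the cut-off.

    Anything within `margin` of `cutoff` is reported as borderline, so a
    reader can see which calls would flip if the threshold moved slightly.
    """
    result = {"above": [], "borderline": [], "below": []}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for name, score in ranked:
        if abs(score - cutoff) <= margin:
            result["borderline"].append(name)
        elif score > cutoff:
            result["above"].append(name)
        else:
            result["below"].append(name)
    return result
```

For example, with the made-up scores `{"geneA": 2.5, "geneB": 2.05, "geneC": 1.95, "geneD": 0.4}`, a cut-off of `2.0` and a margin of `0.1`, `geneB` and `geneC` land in the borderline band instead of being declared true and false respectively.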
Irreproducible Black Box Science • Can you reproduce bioinformatics analyses? • Not collecting the provenance of the analysis. • Not testing during software development. • UniGene • What is happening during UniGene clustering? • ‘Human’ descriptions (via NCBI), are not exact. • The Human Transcriptome Map project and other microarray analysts ended up reclustering UniGene [Marco Roos]. Stuff In Stuff Out 48
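Collecting provenance does not need heavy machinery to get started: writing down what ran, with which parameters, on exactly which input, already answers much of "can you reproduce this analysis?". The sketch below is one possible minimal record, assuming a JSON-serialisable dictionary is enough; the field names, `provenance_record` and the tool name and version in the usage note are hypothetical.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def provenance_record(tool, tool_version, parameters, input_bytes):
    """Build a small, JSON-serialisable description of one analysis run."""
    return {
        "tool": tool,
        "tool_version": tool_version,
        # Sort parameters so two runs with the same settings serialise identically.
        "parameters": dict(sorted(parameters.items())),
        # A content hash pins down exactly which input data was analysed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "run_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def to_json(record):
    """Serialise a record so it can be stored next to the results."""
    return json.dumps(record, indent=2, sort_keys=True)
```

A call such as `to_json(provenance_record("demo-aligner", "0.0-demo", {"evalue": 1e-5}, b">demo\nACGT\n"))` yields a text record that can be saved alongside the output, so the run is no longer pure "Stuff In, Stuff Out".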
“No experiment is reproducible.” “An experiment is reproducible until another laboratory tries to repeat it.” Wyszowski's Law Alexander Kohn
Sin 7: Instant Gratification • Greed? Gluttony? • Always the immediate return. • Never investing for the future. • The quick and dirty fix. • Refusing to model or abstract. • Refusing to plan for recording and exchanging. • Just getting the next quick fix. • The pressure to deliver now and pay later www.CartoonStock.com . 50
Hackery • Deliver now, pay later • Producing crap, non-reusable software because only the biological results matter for publication X. • Collect! Analyse! Er…now what? • NO up to date or useful DOCUMENTATION! • Spaghetti-ism • Over-indulgence in Perl • Over-indulgence in ASCII Art flat files. • Modelling a system by hacking up XSD fragments on a whiteboard. • Writing Perl scripts that resemble my high-school BASIC of the 80s.