
Providing an environment where every data-driven researcher will thrive

Professor Carole Goble, carole.goble@manchester.ac.uk, University of Manchester, UK.



Presentation Transcript


  1. Providing an environment where every data-driven researcher will thrive Professor Carole Goble carole.goble@manchester.ac.uk University of Manchester, UK

  2. Pipelines • Scientific workflows over (web) services • Data pipelines, model population and validation, simulation sweeps • Distributed, federated datasets and analyses combined with local datasets and analysis • Opening up resources. • e-Laboratories • Crowd-sourcing, group curating and sharing/reusing scientific assets. • Web 2.0 and Semantic Web. • Social networking, community content, collaborative filtering • Sharing and exchanging “Research Objects” • Opening up capabilities and capacity.

  3. Pan-European collaboration. Systems Biology of Microorganisms: 13 projects, 91 institutes. Different research outcomes. A cross-section of microorganisms, incl. bacteria, archaea and yeast. Record and describe the dynamic molecular processes occurring in microorganisms by computerized mathematical models. Modellers meet experimentalists. Pool research capacities, data, models and know-how. Retrospectively. http://www.sysmo.net • BaCell-SysMO • COSMIC • SUMO • KOSMOBAC • SysMO-LAB • PSYSMO • Valla • MOSES • TRANSLUCENT • STREAM • SulfoSYS • + two more

  4. Data-driven. Multiple ‘omics: genomics, transcriptomics, proteomics, metabolomics. Images, reaction kinetics. Models. Data sets + experiments + models. SBML, agent-based, mechanics-based. Analysis of data.

  5. Systems biology workflows in MCISB

  6. High throughput experimental methods Public data sets (e.g. EBI) Web Services ~ 1400 NAR January Issue Little databases Lab books Spreadsheets Private and Shared. Proliferation Derived data Long tail. Little Data

  7. Big Data Group Science Data services Access Publish “Little” Data “Local” Science My Datasets My Analytics

  8. Massive decentralisation – wikis, sticks, spreadsheets. Massive centralisation – commons, clouds, curated core facilities. Tremendous fragility. Digital Dust in Data Tombs.

  9. Picking Pain Points. Keeping it Real. • Project Directors • Data remains with us under our control. • We control who sees what. • Just enough exchange. • SysMO PALs • Spreadsheets. • Yellow Pages. • Standard Operating Procedures.

  10. An education Modellers vs Experimentalists Computational thinking Systems thinking

  11. Gray’s Laws (modified) • Working Now, Working to working • Gateways and ramps • Jam today, jam tomorrow • Just enough, just in time • Work with what you got already • 20 questions • Is there any group generating kinetic data? • Is this data available? • Who is working with which organism? • What methods are being used to determine enzyme activity? • Under which experimental conditions are my partners measuring glucose concentration?

  12. Help people search for and find stuff Data Services Processes Models Software Experts

  13. SysMO SEEK Assets Catalogue. Archive. Social Network. Sharing Space. Gateway. • Yellow Pages: People. Expertise. Projects. Institutions. Facilities. Studies. • Data: Experimental data sets and analysed results. Gateway to data stores – SABIO-RK, ‘omics. • Models: Store. Stimulate. Publish. Curate. Gateway to COPASI, JWS Online, BioModels. • Processes: Laboratory protocols – Standard Operating Procedures. Bioinformatics analyses – computational workflows – Taverna. Model population and validation – workflows – Taverna. Gateway to myExperiment, MolMeth, OpenWetWare… • Interlinking.

  14. Linking data to process Standard Operating Procedures Models Software Provenance The Lab Book Retrospective method reconstruction The myth of reproducible science

  15. Scientists willing to share methods and protocols. SOPs an early win. • Defined standard metadata model based on Nature Protocols. • Seeded.

  16. Linking data with stuff • Research Objects for packaging and exchanging Assets • Workflows linked to models linked to data linked to SOPs • Encapsulate community standards • Mixed resources: External and central. • Trust • “Preservation Packet” • Bechhofer et al 2010 forthcoming in The Future of The Web for Collaborative Science 2010. • SBRML • Systems Biology Results Markup Language • To tie to the SBML

  17. At the coal-face The Spreadsheet. The Content Management Systems. Legacy assets are assets. Metadata ramps.

  18. The Content Management System • Lightweight and flexible. Low take-on, hidden operations costs. Knowledgeable Civilians. Looks nice. • Anarchy amenable.

  19. Spreadsheets SysMOLab • Template distribution • Template mapping

  20. Everyone wants metadata. No one wants to collect it. Standards mayhem Metadata millstones Most data is thrown away. Metadata for my sake Metadata compliance by stealth Preparation for publishing

  21. CIMR: Core Information for Metabolomics Reporting • MIABE: Minimal Information About a Bioactive Entity • MIACA: Minimal Information About a Cellular Assay • MIAME: Minimum Information About a Microarray Experiment • MIAME/Env: MIAME / Environmental transcriptomic experiment • MIAME/Nutr: MIAME / Nutrigenomics • MIAME/Plant: MIAME / Plant transcriptomics • MIAME/Tox: MIAME / Toxicogenomics • MIAPA: Minimum Information About a Phylogenetic Analysis • MIAPAR: Minimum Information About a Protein Affinity Reagent • MIAPE: Minimum Information About a Proteomics Experiment • MIARE: Minimum Information About a RNAi Experiment • MIASE: Minimum Information About a Simulation Experiment • MIENS: Minimum Information about an ENvironmental Sequence • MIFlowCyt: Minimum Information for a FlowCytometry Experiment • MIGen: Minimum Information about a Genotyping Experiment • MIGS: Minimum Information about a Genome Sequence • MIMIx: Minimum Information about a Molecular Interaction Experiment • MIMPP: Minimal Information for Mouse Phenotyping Procedures • MINI: Minimum Information about a Neuroscience Investigation • MINIMESS: Minimal Metagenome Sequence Analysis Standard • MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment • MIPFE: Minimal Information for Protein Functional Evaluation • MIQAS: Minimal Information for QTLs and Association Studies • MIqPCR: Minimum Information about a quantitative Polymerase Chain Reaction experiment • MIRIAM: Minimal Information Required In the Annotation of biochemical Models • MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments • STRENDA: Standards for Reporting Enzymology Data • TBC: Tox Biology Checklist • BioPAX: Biological Pathways Exchange, http://www.biopax.org/ • FuGE: Functional Genomics Experiment • MGED: Microarray Experimental Conditions • Minimum Information Models 63% 47% • MIBBI: Minimum Information for Biological and Biomedical Investigations, http://www.mibbi.org/index.php/MIBBI_portal

  22. Just Enough Results Model “I only want to collect and share just enough results” • Harvest standards e.g. MIAME (MIBBI.org) • Analyse consortium schemas and spreadsheets • JERMs for each data type – microarray, metabolomics, proteomics .... • Map project data sources to JERMs. • Distribute JERM spreadsheet templates
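
The "map project data sources to JERMs" and "distribute JERM spreadsheet templates" steps might look roughly like the sketch below. This is only an illustration, not the SysMO implementation: the field names in `JERM_MICROARRAY`, the column headings in `PROJECT_MAPPING`, and the sample row are all hypothetical.

```python
import csv
import io

# Hypothetical JERM for one data type: the "just enough" fields to share.
JERM_MICROARRAY = ["organism", "condition", "gene_id", "expression"]

# Hypothetical mapping from one project's column headings to JERM fields.
PROJECT_MAPPING = {
    "Species": "organism",
    "Treatment": "condition",
    "Gene": "gene_id",
    "Signal": "expression",
}


def map_to_jerm(rows, mapping, jerm_fields):
    """Rename mapped columns to JERM fields and report which fields are missing."""
    mapped = []
    for row in rows:
        # Columns without a mapping (e.g. private lab notes) are simply dropped.
        record = {mapping[k]: v for k, v in row.items() if k in mapping}
        missing = [f for f in jerm_fields if f not in record]
        mapped.append({"record": record, "missing": missing})
    return mapped


def jerm_template(jerm_fields):
    """Return an empty CSV template whose header row is the JERM fields."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(jerm_fields)
    return buf.getvalue()


# Made-up project spreadsheet with one extra, unmapped column ("Operator").
project_csv = (
    "Species,Treatment,Gene,Signal,Operator\n"
    "B. subtilis,glucose 5 mM,sigB,12.4,AB\n"
)
rows = list(csv.DictReader(io.StringIO(project_csv)))
result = map_to_jerm(rows, PROJECT_MAPPING, JERM_MICROARRAY)
template = jerm_template(JERM_MICROARRAY)
```

The project keeps its own spreadsheet layout; only the mapping table has to be maintained, which fits the "work with what you got already" stance.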

  23. JERM Spreadsheets Templates • RDF for ripping, mashing and comparing spreadsheets. • A little semantics goes a long way Controlled vocabulary plug in
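
One way to picture "RDF for ripping, mashing and comparing spreadsheets" is to turn every cell into a subject-predicate-object triple, so that two spreadsheets can be compared as sets. The sketch below uses only the Python standard library and emits N-Triples-style strings; the `http://example.org/jerm/` namespace, the column names, and the sample data are assumptions made for illustration and do not come from the slides.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/jerm/"  # hypothetical namespace, for illustration only


def rows_to_triples(csv_text, id_column):
    """Rip a spreadsheet into a set of N-Triples-style strings, one per cell."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}sample/{quote(row[id_column])}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the id becomes the subject; empty cells carry no fact
            predicate = f"<{BASE}{quote(column)}>"
            # Minimal literal escaping: backslash, double quote, newline.
            escaped = (
                value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            triples.add(f'{subject} {predicate} "{escaped}" .')
    return triples


# Two made-up spreadsheets that disagree on one cell.
lab_a = "sample,organism,glucose_mM\ns1,yeast,5\ns2,yeast,10\n"
lab_b = "sample,organism,glucose_mM\ns1,yeast,5\ns2,yeast,20\n"

a = rows_to_triples(lab_a, "sample")
b = rows_to_triples(lab_b, "sample")
shared = a & b   # facts both spreadsheets agree on
only_a = a - b   # facts stated only in the first spreadsheet
```

Once the cells are triples, comparing and merging reduce to set operations, which is the sense in which a little semantics goes a long way.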

  24. Reward curation Local curation at the point of capture – ISA-TAB for ‘omics. Centralised curation – SBML, CellML, SBO Automated curation. Which data is worth curating?

  25. Blue-Collar Science. Curator Credit Curator Career Funding. Personal and institutional visibility Scholarly citation metrics Federate workloads Unpopular with the big data providers. www.biocurators.org

  26. Commons-based Quality Control.

  27. Progressive Curation: “lazy evaluation” metadata. Just enough, Just in time. Jam today and Jam tomorrow. Figure labels: Pain, Gain; Very BAD; Just right; Good, but Unlikely.

  28. Sensitive sharing. Collaborate to compete Good reasons not to. Just enough just in time sharing. Data kept at host. Registered centrally through harvesting. Pre-Publication sharing vs Publication

  29. Rewards: Competitive advantage. Academic vanity. Adoption. Reputation. Risks: Scrutiny. Being scooped. Misinterpretation. Reputation. Legal issues. Nature 461, 145 (10 September 2009) | doi:10.1038/461145a

  30. Just Enough Sharing Access Permissions Reusing myExperiment

  31. Reward sharing and reusing not reinventing. Technically. Culturally. Institutionally. Credit and Risk Mitigation.

  32. Reward and Provenance Attribution. Trust. Credit Reusing myExperiment

  33. Some pretty key things • Data citation • Stable and shared ids and names • A nightmare. • Sharednames.org • Biosharing.org • Versioning and Provenance • Models, software, data sets • Ensembl web service doesn’t report version number.

  34. Data commons, Data havens. For data after the project has ended. For the common good or me. Tidy and untidy data. Beth’s Provenance Objects. Bio2RDF.

  35. Access and availability of data and data analysis resources Web services underpin the ESFRI ELIXIR programme. Interfaces that are understandable and stable. Designed for people too. No access, no tools, no point (Keith Haines) Deposition to community databanks that minimise pain.

  36. What is it? Is it working?

  37. Data analysis, model population and data pipelining ramps. Crossing the adoption chasm. There is a world of complexity for data preparation, processing and analysis. Science Informatics Sweatshops. E-Laboratories. Workflows. Portals. Pre-cooked processes and process templates. Pre-cooked interfaces. Training.

  38. Lymphoma Prediction Workflow caArray Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample. MicroArray from tumor tissue Microarray preProcessing Lymphoma prediction GenePattern Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI) Jared Nedzel (MIT) Wei Tan Univ. Chicago

  39. myExperiment Communities • Supermarket shoppers • Tool builders • Trainers and Trainees

  40. Drop and Compute Ian Cottam Local folder synchronised and shared via cloud Condor job submitted by drag and drop Results appear in Dropbox
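
The drag-and-drop submission idea can be sketched as a small polling step over the synchronised folder. This is a guess at the general shape, not Ian Cottam's implementation: the `.sub` file extension, the function names, and the polling approach are assumptions, and the sketch only builds the `condor_submit` command without running it.

```python
import tempfile
from pathlib import Path


def pending_jobs(folder, seen):
    """Return submit files dropped into `folder` that have not been handled yet."""
    new_files = sorted(p for p in Path(folder).glob("*.sub") if p.name not in seen)
    seen.update(p.name for p in new_files)
    return new_files


def submit_command(submit_file):
    """Build (but do not run) the condor_submit command for one submit file."""
    return ["condor_submit", str(submit_file)]


# Simulate a user dropping a submit description file into the shared folder.
with tempfile.TemporaryDirectory() as shared:
    seen = set()
    (Path(shared) / "sweep.sub").write_text("executable = run.sh\nqueue\n")
    first = [submit_command(p) for p in pending_jobs(shared, seen)]
    second = pending_jobs(shared, seen)  # nothing new on the next poll
```

In a real deployment the command would be run on a machine with access to the Condor pool, and the job's output files would be written back into the same synchronised folder so that they reappear on the user's desktop.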

  41. Bashing against local IT NO – you can’t access that datastore / run your analysis. Joined up thinking.

  42. Data + Publications Data trapped in documents Supplemental information Text mining Text mining workflows Text mining to find method and controls

  43. Reflect. Elsevier Challenge Winner 2009

  44. [Oscar-3] Manual and Auto-mark up

  45. Do not underestimate the power of Interactive Visualisation and Browsing Pre-cooked complex queries. Navigation. With my data. At the click of a button.

  46. Distributed Annotation Service • Upload and overlay my data

  47. SysMO summary • Providing an environment where every data-driven researcher will thrive • Reality is messy. • Extreme Technology Determinism vs Voluntarist Sociocultural shaping • Extreme and continuous partnership with users. • Act Local Think Global • Agile development environment facilitated stream of features to tackle pain points. • Leverage other e-Laboratories, Maintaining scientists’ buy-in. • Socio-Political Axis dominates the Technical Axis. • Collaboration evolutions, Confidence in exchange.

  48. Six Action Plan Areas: Coordination, Sustainability, Data, Interoperability, Adoption, Capacity.
