Data Provenance Workshop

Natalia Maltsev, MCS, Argonne National Laboratory


Presentation Transcript


  1. Data Provenance Workshop
     Natalia Maltsev, MCS, Argonne National Laboratory

  2. Why a Biotechnological Revolution?
     98 published genomes, 652 ongoing genomes. So much data!! Hmmm…
     • High-throughput technologies provide huge amounts of biological data:
       • Sequence data
       • Data describing functional networks (metabolism, regulation, gene expression)
       • Dynamic data
     • Progress in computer science, computer technologies, and bioinformatics makes it possible to analyze these data

  3. Biology in a Nutshell (for people with little knowledge but infinite intelligence)
     [Diagram labels: Genomes; Gene Products; Structure & Function; Pathways & Physiology]
     • Genome (ROM): assembly code on how to build proteins
       • Instructions: A, C, T, G
       • 3 variables → amino acid
     • Genome consists of genes
     • Gene → Protein: object description → object instantiation
     • Protein functions
       • Enzymes: proteins that catalyze biochemical reactions
     • Pathway: sequence of reactions
     • Network (directed graph): set of pathways with metabolites as vertices and enzymes as edges
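The network definition on this slide (metabolites as vertices, enzymes as edges) can be sketched as a small directed graph. This is a minimal illustration, not code from the talk: the three reactions are the first steps of glycolysis, chosen here only as an example, and the names `REACTIONS`, `build_network`, and `find_pathway` are hypothetical.

```python
from collections import defaultdict

# Toy reactions as (substrate, enzyme, product) triples.
# Assumption: these glycolysis steps are an illustration, not data from the slides.
REACTIONS = [
    ("glucose", "hexokinase", "glucose-6-phosphate"),
    ("glucose-6-phosphate", "glucose-6-phosphate isomerase", "fructose-6-phosphate"),
    ("fructose-6-phosphate", "phosphofructokinase", "fructose-1,6-bisphosphate"),
]


def build_network(reactions):
    """Build a directed graph: metabolite -> list of (enzyme, product) edges."""
    network = defaultdict(list)
    for substrate, enzyme, product in reactions:
        network[substrate].append((enzyme, product))
    return network


def find_pathway(network, start, goal):
    """Depth-first search for a sequence of reactions from start to goal.

    Returns a list of (enzyme, product) steps, or None if no route exists.
    """
    stack = [(start, [])]
    visited = set()
    while stack:
        metabolite, path = stack.pop()
        if metabolite == goal:
            return path
        if metabolite in visited:
            continue
        visited.add(metabolite)
        for enzyme, product in network.get(metabolite, []):
            stack.append((product, path + [(enzyme, product)]))
    return None


if __name__ == "__main__":
    net = build_network(REACTIONS)
    for enzyme, product in find_pathway(net, "glucose", "fructose-1,6-bisphosphate"):
        print(f"{enzyme} -> {product}")
```

Because the edges are directed, searching from a downstream metabolite back to `glucose` returns `None`, which mirrors the slide's view of a pathway as an ordered sequence of reactions.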

  4. Data: Classes
     • Sequence data
       • DNA sequences, protein sequences: NCBI, GenBank, SwissProt, TIGR, sequencing projects
     • Data describing networks
       • Metabolic networks (EMP database, KEGG, etc.)
       • Regulatory networks (Sentra, TransFac, etc.)
       • Gene expression data (experimental)
     • Other experimental data
     • Dynamic data (experimental and literature)
     • Organisms data
       • Phenotypic data
       • Physiological data

  5. General Systems Biology Project Architecture
     • Stages of analysis:
       • Determine components of the system (assign functions to the genes)
       • Establish relationships between components: reconstruct biological networks (develop a static model)
       • Develop a dynamic model of the system

  6. Data Sources
     • Public and private databases (GenBank, SwissProt, EMP, KEGG, etc.)
     • Results of data analysis
     • Updates and versioning? (Data and annotation updates, developed models)

  7. Prediction of Gene Functions
     • Gene functions are predicted by comparing an unknown sequence with sequences of genes whose functions are established
     • Seq1: function alcohol dehydrogenase
     • Seq2: function? Alcohol dehydrogenase?

     Seq1_Mus.musculus  GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE
     Seq2_Homo_sapiens  GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE
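The comparison on this slide can be sketched as a position-by-position identity count over the two fragments. This is a minimal sketch that assumes the fragments are already aligned with no gaps; in practice the comparison would be done with an alignment tool such as BLAST. The helper names are illustrative, and `clean` strips any whitespace left over from copying the sequences.

```python
# The two fragments shown on the slide.
SEQ1 = "GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE"
SEQ2 = "GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE"


def clean(seq):
    """Remove whitespace and normalize case."""
    return "".join(seq.split()).upper()


def mismatches(a, b):
    """Return 1-based positions where two equal-length sequences differ."""
    a, b = clean(a), clean(b)
    if len(a) != len(b):
        # Assumption: inputs are pre-aligned; unequal lengths need a real aligner.
        raise ValueError("sequences must have equal length (no gap handling)")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def percent_identity(a, b):
    """Percentage of positions with identical residues."""
    length = len(clean(a))
    return 100.0 * (length - len(mismatches(a, b))) / length


if __name__ == "__main__":
    print("differing positions:", mismatches(SEQ1, SEQ2))
    print(f"identity: {percent_identity(SEQ1, SEQ2):.1f}%")
```

For these fragments the helper reports three differing positions out of 58, which is the kind of high similarity that motivates transferring the known function of Seq1 to Seq2 as a hypothesis.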

  8. Example 1: Gene Function Assignments
     [Diagram: a query sequence (function unknown!) is submitted to bioinformatics tools (Blast, InterPro, Blocks) backed by a knowledge base; each tool returns a result (F1, F2, F3); a voting algorithm combines them into F1 with probability P1 and F2 with probability P2]
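One way to read the voting step on this slide is as a weighted vote over the functions proposed by each tool. The sketch below is an assumption about how such a step could work, not the algorithm used in the talk: the per-tool weights are hypothetical, and the "probability" is simply each function's normalized share of the votes.

```python
from collections import defaultdict


def vote(predictions, weights=None):
    """Combine per-tool function predictions into ranked (function, share) pairs.

    predictions: dict mapping tool name -> predicted function label.
    weights: optional dict mapping tool name -> trust weight (default 1.0 each).
    """
    weights = weights or {}
    scores = defaultdict(float)
    for tool, function in predictions.items():
        scores[function] += weights.get(tool, 1.0)

    total = sum(scores.values())
    if total == 0:
        return []

    # Sort by descending share, then by label so ties are deterministic.
    return sorted(
        ((function, score / total) for function, score in scores.items()),
        key=lambda item: (-item[1], item[0]),
    )


if __name__ == "__main__":
    # Hypothetical tool outputs for one query sequence.
    results = {"Blast": "F1", "InterPro": "F1", "Blocks": "F2"}
    for function, share in vote(results):
        print(f"{function} with probability {share:.2f}")
```

With equal weights, two tools agreeing on F1 and one proposing F2 gives F1 a share of 2/3 and F2 a share of 1/3. Recording which tool cast which vote is exactly the kind of provenance that lets a weak assignment be traced later.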

  9. An Example of Pathway Reconstruction
     • How reliably can we predict this pathway?
     • What approach will increase our confidence the most?

  10. Another Problem: Control of Data Flow
     • Data acquisition: how reliable?
     • Data analysis: how reliable?
     • Data storage: how reliable?

  11. General Systems Biology Project Architecture
     • What can provenance do?
       • Help plan experiments by suggesting “weak” facts to be tested in a wet lab
       • Find “weak” spots in a model
       • Prioritize certain steps of model building
       • Evaluate data flows
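The idea of using provenance to surface "weak" facts can be sketched with a small record type that remembers what each fact was derived from. Everything here is a hypothetical illustration: the `Fact` class, the confidence scores, and the weakest-link rule (a fact is no more reliable than its least reliable input) are assumptions, not a method described in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    statement: str
    confidence: float  # hypothetical reliability score in [0, 1]
    derived_from: list = field(default_factory=list)  # upstream Fact objects


def effective_confidence(fact):
    """Weakest-link rule (an assumption): take the minimum along the derivation chain."""
    upstream = [effective_confidence(parent) for parent in fact.derived_from]
    return min([fact.confidence] + upstream)


def weak_facts(facts, threshold=0.7):
    """Return (confidence, statement) pairs below threshold, weakest first."""
    scored = [(effective_confidence(f), f.statement) for f in facts]
    return sorted(pair for pair in scored if pair[0] < threshold)


if __name__ == "__main__":
    # Hypothetical derivation chain: database record -> similarity hit -> annotation.
    record = Fact("sequence of gene X taken from a public database", 0.95)
    hit = Fact("gene X is similar to a known alcohol dehydrogenase", 0.6, [record])
    annotation = Fact("gene X encodes an alcohol dehydrogenase", 0.9, [hit])

    for confidence, statement in weak_facts([record, hit, annotation]):
        print(f"{confidence:.2f}  {statement}")
```

In this toy chain the annotation looks strong on its own (0.9) but inherits the 0.6 of the similarity hit it rests on, so both are flagged as candidates for wet-lab testing while the database record is not.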
