Data Provenance Workshop
Natalia Maltsev, MCS, Argonne National Laboratory
Why a Biotechnological Revolution?
98 published genomes, 652 ongoing genomes. So much data!! Hmmm…
• High-throughput technologies provide huge amounts of biological data:
  • Sequence data
  • Data describing functional networks (metabolism, regulation, gene expression)
  • Dynamic data
• Progress in computer science, computer technologies, and bioinformatics makes it possible to analyze this data
Biology in a Nutshell (for people with little knowledge but infinite intelligence)
Genomes → Gene Products → Structure & Function → Pathways & Physiology
• Genome (ROM): assembly code for how to build proteins
  • Instructions: A, C, T, G
  • 3 variables → amino acid
• Genome consists of genes
• Gene → Protein: object description → object instantiation
• Protein functions
  • Enzymes: proteins that catalyze biochemical reactions
• Pathway: sequence of reactions
• Network (directed graph): set of pathways with metabolites as vertices and enzymes as edges
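The directed-graph view of a network can be sketched in a few lines of Python. This is a minimal illustration, not the representation used in the project: the reaction list is a simplified toy example (cofactors omitted), and the names `build_network` and `find_pathway` are hypothetical.

```python
from collections import defaultdict

# Toy reactions as (substrate, enzyme, product). Illustrative only; the
# slides do not list specific reactions, and cofactors are omitted.
REACTIONS = [
    ("ethanol", "alcohol dehydrogenase", "acetaldehyde"),
    ("acetaldehyde", "aldehyde dehydrogenase", "acetate"),
    ("acetate", "acetyl-CoA synthetase", "acetyl-CoA"),
]


def build_network(reactions):
    """Map each metabolite (vertex) to its outgoing (enzyme, product) edges."""
    network = defaultdict(list)
    for substrate, enzyme, product in reactions:
        network[substrate].append((enzyme, product))
    return dict(network)


def find_pathway(network, start, goal):
    """Depth-first search: return one enzyme sequence from start to goal, or None."""
    stack = [(start, [])]
    visited = set()
    while stack:
        metabolite, enzymes = stack.pop()
        if metabolite == goal:
            return enzymes
        if metabolite in visited:
            continue
        visited.add(metabolite)
        for enzyme, product in network.get(metabolite, []):
            stack.append((product, enzymes + [enzyme]))
    return None


network = build_network(REACTIONS)
pathway = find_pathway(network, "ethanol", "acetyl-CoA")
print(pathway)
```

Here a pathway is simply the ordered list of enzymes (edges) along a path between two metabolites (vertices). Because the edges are directed, searching from "acetate" back to "ethanol" finds nothing.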
Data: Classes
• Sequence data
  • DNA sequences, protein sequences: NCBI, GenBank, SwissProt, TIGR, sequencing projects
• Data describing networks
  • Metabolic networks (EMP database, KEGG, etc.)
  • Regulatory networks (Sentra, TransFac, etc.)
  • Gene expression data (experimental)
  • Other experimental data
• Dynamic data (experimental and literature)
• Organism data
  • Phenotypic data
  • Physiological data
General Systems Biology Project Architecture
• Stages of analysis:
  • Determine the components of the system (assign functions to the genes)
  • Establish relationships between components: reconstruct biological networks (develop a static model)
  • Develop a dynamic model of the system
Data Sources
• Public and private databases (GenBank, SwissProt, EMP, KEGG, etc.)
• Results of data analysis
• Updates and versioning? (data and annotation updates, developed models)
Prediction of Gene Functions
• Predict the function of a gene by comparing its unknown sequence with sequences of genes whose functions are established
Seq1: function is alcohol dehydrogenase
Seq2: function? Alcohol dehydrogenase?
Seq1_Mus.musculus GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE
Seq2_Homo_sapiens GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE
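The comparison on this slide can be made concrete with a position-by-position identity count. The sketch below assumes the two fragments are already aligned with no gaps, and it strips any whitespace inside a sequence on the assumption that it is a formatting artifact. Real tools such as Blast compute the alignment themselves and score substitutions rather than counting exact matches, so `percent_identity` is only a hypothetical stand-in.

```python
# Fragments copied from the slide.
SEQ1_MOUSE = "GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE"
SEQ2_HUMAN = "GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE"


def percent_identity(seq_a, seq_b):
    """Return (percent identity, 1-based mismatch positions) for pre-aligned sequences."""
    a = "".join(seq_a.split())  # drop stray whitespace (assumed to be an artifact)
    b = "".join(seq_b.split())
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and pre-aligned to equal length")
    mismatches = [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]
    identity = 100.0 * (len(a) - len(mismatches)) / len(a)
    return identity, mismatches


identity, mismatches = percent_identity(SEQ1_MOUSE, SEQ2_HUMAN)
print(f"{identity:.1f}% identical; mismatches at positions {mismatches}")
```

A high identity supports transferring the known function of Seq1 to Seq2, but where to set the cutoff is a judgment call that the slide leaves open.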
Example 1: Gene Function Assignments
[Diagram: query sequence (function unknown!!!) and a knowledge base feed the bioinformatics tools Blast, InterPro, and Blocks; each tool returns a result (F1, F2, F3); a voting algorithm combines the results into F1 with probability P1 and F2 with probability P2]
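The slide does not say how the voting algorithm works, so the following is only one plausible sketch: a weighted vote in which each tool's prediction counts in proportion to an assumed reliability weight, and the totals are normalized to sum to 1. The tool outputs, the weights, and the function name `vote` are all hypothetical, and the normalized scores are not calibrated probabilities.

```python
from collections import defaultdict

# Hypothetical predictions and reliability weights; neither comes from the slides.
TOOL_RESULTS = {"Blast": "F1", "InterPro": "F1", "Blocks": "F2"}
TOOL_WEIGHTS = {"Blast": 0.5, "InterPro": 0.3, "Blocks": 0.2}


def vote(results, weights):
    """Combine per-tool predictions into (function, normalized score) pairs, best first."""
    totals = defaultdict(float)
    for tool, function in results.items():
        totals[function] += weights[tool]
    total_weight = sum(totals.values())
    scores = {function: weight / total_weight for function, weight in totals.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = vote(TOOL_RESULTS, TOOL_WEIGHTS)
for function, score in ranking:
    print(f"{function} with score {score:.2f}")
```

Keeping the per-tool results alongside the combined score is what makes the assignment traceable: a later reader can see which tools supported F1 and how much each one counted.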
An Example of Pathway Reconstruction
How reliably can we predict this pathway? What approach will increase our confidence the most?
Another Problem: Control of Data Flow
• Data acquisition: how reliable?
• Data analysis: how reliable?
• Data storage: how reliable?
General Systems Biology Project Architecture
• What can provenance do?
  • Help plan experiments by suggesting “weak” facts to be tested in a wet lab
  • Find “weak” spots in a model
  • Prioritize certain steps of model building
  • Evaluate data flows
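One way to picture "weak" facts is as model assertions that carry a record of where they came from plus a confidence score, so the least supported ones can be listed first for wet-lab testing. The sketch below is an assumption about how that might look, not the project's design: the `Assertion` fields, the example entries, the scores, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    statement: str     # the fact asserted in the model
    source: str        # provenance: how the fact was obtained
    confidence: float  # hypothetical score in [0, 1]


def weakest_first(assertions, threshold=0.7):
    """Return assertions below the threshold, least confident first."""
    weak = [a for a in assertions if a.confidence < threshold]
    return sorted(weak, key=lambda a: a.confidence)


# Made-up model content for illustration only.
MODEL = [
    Assertion("gene A encodes alcohol dehydrogenase", "high sequence similarity", 0.95),
    Assertion("gene B encodes the enzyme for step 2", "single tool prediction", 0.40),
    Assertion("gene C encodes the enzyme for step 3", "two tools agree", 0.65),
]

for a in weakest_first(MODEL):
    print(f"{a.confidence:.2f}  {a.statement}  (source: {a.source})")
```

The output order doubles as a priority list: the lowest-scoring assertion is the one whose experimental confirmation would tell the modeler the most.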