Organization of Biological Data and Databases
Pramod Wangikar
Dept. of Chemical Engineering, IIT Bombay
ORGANIZATION OF BIOLOGICAL DATA
• Gene i: Genomics
• m-RNA i: Transcriptomics
• Protein i: Protein Sequence / Proteomics
• Function (enzyme, hormone, etc.)
• 3-D Structural Database
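Primary Structure of Deoxyribonucleic Acid (DNA)
[Figure: a chain of nucleotides in which phosphate (P) groups link the 3’ position of one sugar to the 5’ position of the next, ending in a free 3’ OH]
Written in shorthand as pApCpGpTpTpG or ACGTTG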
The Basic Principle of Transcription
[Figure labels: RNA Polymerase, 5’, Double-stranded DNA, RNA Nucleotides]
The Code
• 64 possible codons (4 bases at each of 3 positions)
• 20 amino acids
[Figure: adjacent mRNA codons, 5’ ... aug gaa uuu ..., with tRNAs carrying M (anticodon uac) and F]
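The count of 64 follows from choosing one of four bases at each of three codon positions. A minimal sketch that enumerates them (the names `BASES` and `codons` are illustrative, not from the slides):

```python
from itertools import product

BASES = "ucag"  # the four RNA bases, lowercase as on the slide

# Every ordered triple of bases is one codon.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

print(len(codons))   # number of possible codons, 4 ** 3
print(codons[:4])    # the first few codons in enumeration order
```

Because there are more codons than amino acids, several codons must map to the same amino acid.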
The Flow of Genetic Information
DNA
5’ ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG 3’ (sequence same as RNA)
3’ TGACGTGGTACCCCGAGTCGCTGCCCCTTACCGTGAACCAC 5’ (sequence complementary to RNA)
mRNA
5’ ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG 3’
Initiation signal: AUG, followed by the codons read three bases at a time
Protein
Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
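The steps on this slide can be traced in a few lines of code. The sketch below assumes the standard genetic code (written here as the usual 64-letter string in T, C, A, G order) and starts translation at the first AUG; the function names `complement`, `transcribe`, and `translate` are hypothetical helpers, not part of any library.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def complement(dna):
    """Base-by-base complement of a DNA strand (no reversal)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the message."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

coding = "ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG"
mrna = transcribe(coding)
print(complement(coding))
print(mrna)
print(translate(mrna))
```

For the slide's sequence the reading frame set by AUG gives the codons AUG GGG CUC AGC GAC GGG GAA UGG CAC UUG GUG, which in one-letter code is MGLSDGEWHLV (GAA encodes glutamate, E).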
Memory Requirements for Storing Genomes
• Two bits per base: 00 = a, 01 = c, 10 = g, 11 = t
• Prokaryotic genomes: 0.5-7.0 Mbp
• Eukaryotic genomes: 10 Mbp - 1000 Gbp
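A minimal sketch of the two-bit encoding, assuming four bases are packed into each byte with the first base in the highest bits (the packing order and the names `pack`, `unpack`, and `packed_size_bytes` are illustrative choices, not something the slide specifies):

```python
CODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}  # mapping from the slide
DECODE = "acgt"  # index i holds the base whose code is i

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.lower()):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)

def unpack(packed, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        DECODE[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(n_bases)
    )

def packed_size_bytes(n_bases):
    """Bytes needed to store n_bases at two bits per base."""
    return (2 * n_bases + 7) // 8

print(pack("acgttg").hex())
print(unpack(pack("acgttg"), 6))
print(packed_size_bytes(7_000_000))
```

At two bits per base, the ranges on the slide correspond to roughly 0.125-1.75 MB for prokaryotic genomes and 2.5 MB to 250 GB for eukaryotic genomes. The sequence length has to be stored separately, since the last byte may be only partly filled.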
E. coli and Data size
Numbers are approximate. The data size increases by roughly three orders of magnitude for the human system.
Minimal Life: Self-assembly, Catalysis, Replication, Mutation, Selection
[Figure labels: Environment, Cell Boundary, Monomers, RNA, Growth rate]
Maximal Life: Self-assembly, Catalysis, Replication, Mutation, Selection
[Figure labels: Regulatory & Metabolic Networks, Environment, Metabolites, Interactions, RNA, DNA, Protein, Growth rate, Expression; examples: stem cells, cancer cells, microbes]
Regulation: More biological data
What is regulation? A catalogue of possible scenarios and the respective courses of action.
The information for regulation can be stored in the form of:
• Protein-protein interactions
• Protein-DNA interactions
• Protein-metabolite interactions
• Molecular switches, controls, set-points, etc.
Genome + Environment: input file
Biological machinery: executable program
Observations: output file
Can we crack the executable program?
Some useful regulatory signals on Genes
[Figure labels, on DNA: upstream activating sequences (UAS), TATA box, m-RNA expression start & end; on mRNA: ribosomal binding site, protein synthesis starts, protein synthesis stops; product: protein]
DESCRIPTION OF A LIVING CELL / VIRUS
• Genome / Genomics: general capability of the cell
• Transcriptomics: readiness of the cell
• Proteomics / Protein Map: physiological state of the cell
Paradigm Shift in the Bioinformatics Age
• Conventional Path: Structure, Gene, Function
• Bioinformatics Age (Functional Genomics): Gene sequence, Structure of Protein, Function
• Proteomics: Protein Map (2D-PAGE, pI, mol. wt.)
Possible Relationships Between Databases
[Figure labels: Genome Sequence, Protein Sequence, Proteomics, Transcriptomics, Expression Profile, Protein Structure, Protein Profile, Protein-DNA Interactions, Protein-Protein Interactions, Protein Function, Metabolome, Phenotype]
Combinatorial Problems in Biology
• Prediction of ORF; gene finding
• Prediction of DNA regulatory sites
• DNA regulatory proteins
• Protein-protein interactions
• Protein function
• Prediction of metabolic capability
• Prediction of genetic regulatory circuits
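To give a feel for the first problem in the list, here is a deliberately naive open reading frame (ORF) scan. It assumes the standard start codon ATG and stop codons TAA, TAG, and TGA, looks only at the three frames of the given strand, and ignores everything real gene finders rely on (reverse strand, codon statistics, regulatory signals). The name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ATG ... stop stretches on the given strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF at the first ATG in this frame
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # close the candidate and keep scanning
    return sorted(orfs)

example = "CCATGGCTTGGTAAGG"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

Even this toy version shows why the problem grows quickly: every frame of every strand of a multi-megabase genome yields many candidate stretches, most of which are not genes.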
Biological Databases
• Raw databases
• Processed databases
• Querying in databases
Raw Databases (Conventional Ones)
• DNA / Gene / Genome Sequence Databases: EMBL, GenBank, GSDB, etc. > 10^6 genes; doubles every 18 months. Genome projects: E. coli, plants, human, mouse, etc.
• Protein Sequence Databases: PIR, SwissProt, GenBank, etc. > 10^5 protein sequences; doubles every 21 months.
• Three-Dimensional Structure Database: Brookhaven Protein Data Bank (PDB). > 20,000 structures; doubles every 24 months.
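A doubling time T implies a size of N(t) = N0 * 2^(t / T) after t months. The sketch below simply evaluates that formula; it assumes the doubling times quoted on the slide stay constant, which is an extrapolation rather than a measurement, and the function name `projected_size` is hypothetical.

```python
def projected_size(current, months_ahead, doubling_months):
    """Size after `months_ahead` months, given a constant doubling time."""
    return current * 2 ** (months_ahead / doubling_months)

# Slide figures: starting size and doubling time in months.
databases = {
    "gene sequences": (1e6, 18),
    "protein sequences": (1e5, 21),
    "3-D structures": (2e4, 24),
}

for name, (size, doubling) in databases.items():
    print(name, round(projected_size(size, 60, doubling)))
```

The differing doubling times mean the gap between the number of known sequences and the number of known structures widens over time.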
Proteomics Database (SwissProt)
• Each protein identified by: pI, mol. wt., mass spectra, microsequencing, peptide mass fingerprint, etc.
• Entries for E. coli, yeast, human, etc.
Hoogland et al., Nucl. Acids Res. (2000) 28, 286
Clusters of Orthologous Groups (COG) of Proteins: A Processed Database
• Compares genes from different genomes.
• Forms clusters with similar sequences.
• Each COG contains genes connected through vertical evolutionary descent.
• 30 genomes (68,571 genes); 2,791 COGs with 45,350 genes.
• Assigns function to genes based on the known functions of some members of the cluster.
• Highly useful for functional assignments in newly sequenced genomes.
EcoCyc Database: Encyclopedia of E. coli Genes and Metabolism
• 4300 genes, 695 enzymes, 595 reactions, 123 pathways
[Figure legend: blue, E. coli only; green, both E. coli and H. influenzae]
Karp et al., Nucl. Acids Res. (1998) 26, 50
Querying in Databases
• Based on sequence similarity; gives similar sequences and the similarity score or expectation value.
• Normally a BLAST or FASTA search (local alignment). Can look for a sequence motif.
• Gene names, biological source, functional category, cellular location / role.
• Structural features (for known 3-D structures).
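BLAST and FASTA are fast heuristics for local alignment. To show what a local similarity score is, the sketch below uses the exact Smith-Waterman dynamic programme instead, which is a different and much slower method than either tool. The scoring values (match +2, mismatch -1, gap -2) and the names `local_alignment_score` and `search` are illustrative assumptions; real searches use substitution matrices and report expectation values, which this sketch does not compute.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def search(query, database):
    """Rank database entries (name -> sequence) by local score, best first."""
    hits = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)

toy_database = {"seq_x": "GGACGTTGCC", "seq_y": "TTTTTTTT"}  # made-up entries
for score, name in search("ACGTTG", toy_database):
    print(name, score)
```

The query is ranked against every entry, which is the essence of a similarity search; heuristic tools exist precisely because doing this exhaustively over millions of sequences is too slow.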
Bioinformatics: A multidisciplinary effort is required
• Generation of biological data
• Storage and retrieval of data
• Conversion of known biological hypotheses into mathematical/statistical models
• Building models from data
• Fitting new data to existing models
• Searching for patterns in data
• Deriving new biological knowledge from data