Introduction to Bioinformatics Databases
Central dogma of molecular biology: DNA → RNA → protein → phenotype. A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.
With the use of bioinformatics we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth. After Pace NR (1997) Science 276:734. Page 6
Growth of GenBank, December 1982 to June 2006 [figure: base pairs of DNA (billions) and sequences (millions) by year]
Growth of the International Nucleotide Sequence Database Collaboration [figure: base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ] http://www.ncbi.nlm.nih.gov/Genbank/
Central dogma of molecular biology: DNA → RNA → protein. Central dogma of bioinformatics and genomics: genome → transcriptome → proteome.
Databases at each level of DNA → RNA → protein → phenotype: genomic DNA databases (DNA); cDNA, ESTs, UniGene (RNA); protein sequence databases (protein). Fig. 2.2 Page 20
There are three major public DNA databases: GenBank, EMBL, and DDBJ. The underlying raw DNA sequences are identical. Page 16
There are three major public DNA databases: EMBL (housed at EBI, the European Bioinformatics Institute), GenBank (housed at NCBI, the National Center for Biotechnology Information), and DDBJ (housed in Japan). Page 16
Taxonomy nodes at NCBI 8/06 http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
The most sequenced organisms in GenBank (Table 2-2, Page 18; updated 8-12-04, GenBank release 142.0)
Homo sapiens             10.7 billion bases
Mus musculus             6.5b
Rattus norvegicus        5.6b
Danio rerio              1.7b
Zea mays                 1.4b
Oryza sativa             0.8b
Drosophila melanogaster  0.7b
Gallus gallus            0.5b
Arabidopsis thaliana     0.5b
The most sequenced organisms in GenBank (Table 2-2, Page 18; updated 8-29-05, GenBank release 149.0)
Homo sapiens             11.2 billion bases
Mus musculus             7.5b
Rattus norvegicus        5.7b
Danio rerio              2.1b
Bos taurus               1.9b
Zea mays                 1.4b
Oryza sativa (japonica)  1.2b
Xenopus tropicalis       0.9b
Canis familiaris         0.8b
Drosophila melanogaster  0.7b
The most sequenced organisms in GenBank (Table 2-2, Page 18; updated 7-19-06, GenBank release 154.0)
Homo sapiens                   12.3 billion bases
Mus musculus                   8.0b
Rattus norvegicus              5.7b
Bos taurus                     3.5b
Danio rerio                    2.5b
Zea mays                       1.8b
Oryza sativa (japonica)        1.5b
Strongylocentrotus purpuratus  1.2b
Sus scrofa                     1.0b
Xenopus tropicalis             1.0b
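The three Table 2-2 snapshots above can be compared directly. The following is a minimal sketch, not part of the original slides, that computes the percent growth in sequenced bases for the four species that lead all three releases. The figures are copied from the slides; the names `RELEASES` and `percent_growth` are illustrative assumptions.

```python
# Billions of bases per species, copied from the three Table 2-2 snapshots.
# Keys are GenBank release numbers as given on the slides.
RELEASES = {
    "142.0": {"Homo sapiens": 10.7, "Mus musculus": 6.5,
              "Rattus norvegicus": 5.6, "Danio rerio": 1.7},
    "149.0": {"Homo sapiens": 11.2, "Mus musculus": 7.5,
              "Rattus norvegicus": 5.7, "Danio rerio": 2.1},
    "154.0": {"Homo sapiens": 12.3, "Mus musculus": 8.0,
              "Rattus norvegicus": 5.7, "Danio rerio": 2.5},
}


def percent_growth(species, start="142.0", end="154.0"):
    """Percent increase in bases for one species between two releases."""
    before = RELEASES[start][species]
    after = RELEASES[end][species]
    return round(100.0 * (after - before) / before, 1)


if __name__ == "__main__":
    for name in RELEASES["142.0"]:
        print(f"{name}: {percent_growth(name)}% growth, release 142.0 to 154.0")
```

Because the slide values are rounded to one decimal place, the percentages are only approximate.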
National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24
Types of Data in GenBank • DNA level • RNA level (cDNA) • Protein sequences. • …
Fig. 2.5 Page 25 www.ncbi.nlm.nih.gov
PubMed is… • National Library of Medicine's search service • 16 million citations in MEDLINE • links to participating online journals • PubMed tutorial (via “Education” on side bar) Page 24
Entrez integrates… • the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes Page 24
Entrez is a search and retrieval system that integrates NCBI databases Page 24
BLAST is… • Basic Local Alignment Search Tool • NCBI's sequence similarity search tool • supports analysis of DNA and protein databases • 100,000 searches per day Page 25
OMIM is… • Online Mendelian Inheritance in Man • catalog of human genes and genetic disorders • edited by Dr. Victor McKusick, others at JHU Page 25
Books is… • searchable resource of on-line books Page 26
TaxBrowser is… • browser for the major divisions of living organisms • (archaea, bacteria, eukaryota, viruses) • taxonomy information such as genetic codes • molecular data on extinct organisms Page 26
Structure site includes… • Molecular Modeling Database (MMDB) • biopolymer structures obtained from the Protein Data Bank (PDB) • Cn3D (a 3D-structure viewer) • Vector Alignment Search Tool (VAST) Page 26
Accession numbers are labels for sequences. NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26
What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
rs7079946   dbSNP (single nucleotide polymorphism)
RNA
N91759.1    An expressed sequence tag (1 of 170)
NM_006744   RefSeq DNA sequence (from a transcript)
protein
NP_006735   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record
Page 27
Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 27
Four ways to access protein and DNA sequences: [1] Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases, and it covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735). Page 27
From the NCBI home page, type “rbp4” and hit “Go” revised Fig. 2.7 Page 29
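The same query can also be issued programmatically. The sketch below is not from the original slides: it only builds URLs for NCBI's Entrez E-utilities (`esearch` and `efetch`) and does not contact the network. The helper names and the default database choices (`gene`, `nuccore`) are assumptions made for illustration. Check the parameters against the current E-utilities documentation before relying on them.

```python
from urllib.parse import urlencode

# Base address of the Entrez E-utilities (assumed current at time of writing).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="gene"):
    """Build a search URL, the scripted analogue of typing a term and hitting Go."""
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_fasta_url(accession, db="nuccore"):
    """Build a URL that requests one record in FASTA format as plain text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    print(esearch_url("rbp4"))
    print(efetch_fasta_url("NM_006744"))
```

Only the URLs are constructed here, so the example runs offline. The result can be passed to any HTTP client to retrieve the record.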
Entrez Gene (top of page) Note that links to many other RBP4 database entries are available revised Fig. 2.8 Page 30
Fig. 2.9 Page 32
FASTA format Fig. 2.10 Page 32
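In FASTA format, each record is a header line beginning with ">" followed by one or more lines of sequence. A minimal parser might look like the sketch below, which is not from the original slides. The function name `parse_fasta` and the toy records are hypothetical and are not real database entries.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence may be wrapped over several lines; collect them all.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input: headers and residues are made up for illustration only.
example = """>example_1 hypothetical record
ACGTACGT
TTGA
>example_2 another hypothetical record
MKW
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```

The parser joins wrapped sequence lines into a single string per record, which is usually the form needed for downstream tools.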
NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome      NC_######
Complete chromosome  NC_######
Genomic contig       NT_######
mRNA (DNA format)    NM_######  e.g. NM_006744
Protein              NP_######  e.g. NP_006735
Page 29-30
NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
Accession        Molecule  Note
AP_123456        Protein   Protein products; alternate
NC_123456        Genomic   Complete genomic molecules
NG_123456        Genomic   Incomplete genomic regions
NM_123456        mRNA      Transcript products; mRNA
NM_123456789     mRNA      Transcript products; 9-digit
NP_123456        Protein   Protein products
NP_123456789     Protein   Protein products; 9-digit
NR_123456        RNA       Non-coding transcripts
NT_123456        Genomic   Genomic assemblies
NW_123456        Genomic   Genomic assemblies
NZ_ABCD12345678  Genomic   Whole genome shotgun data
XM_123456        mRNA      Transcript products
XP_123456        Protein   Protein products
XR_123456        RNA       Transcript products
YP_123456        Protein   Protein products
ZP_12345678      Protein   Protein products
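Because the two-letter prefix encodes the molecule type, a RefSeq accession can be classified by a simple lookup. The following is a minimal sketch, not from the original slides. The mapping is transcribed from the RefSeq table above; the function name `classify_refseq` and the regular expression, including the optional ".version" suffix, are assumptions for illustration.

```python
import re

# Prefix -> (molecule, note), transcribed from the RefSeq accession table.
REFSEQ_PREFIXES = {
    "AP": ("Protein", "Protein products; alternate"),
    "NC": ("Genomic", "Complete genomic molecules"),
    "NG": ("Genomic", "Incomplete genomic regions"),
    "NM": ("mRNA", "Transcript products; mRNA"),
    "NP": ("Protein", "Protein products"),
    "NR": ("RNA", "Non-coding transcripts"),
    "NT": ("Genomic", "Genomic assemblies"),
    "NW": ("Genomic", "Genomic assemblies"),
    "NZ": ("Genomic", "Whole genome shotgun data"),
    "XM": ("mRNA", "Transcript products"),
    "XP": ("Protein", "Protein products"),
    "XR": ("RNA", "Transcript products"),
    "YP": ("Protein", "Protein products"),
    "ZP": ("Protein", "Protein products"),
}

# Assumed shape: two capital letters, an underscore, letters/digits,
# and an optional ".version" suffix such as ".2".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z0-9]+(\.\d+)?$")


def classify_refseq(accession):
    """Return (molecule, note) for a RefSeq-style accession, else None."""
    match = _REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
        print(acc, classify_refseq(acc))
```

Accessions without the prefix-underscore shape, such as the GenBank entry X02775, return `None` rather than a guess.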
Ensembl to access protein and DNA sequences. Try Ensembl at www.ensembl.org for a premier human genome web browser. Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates. We will encounter Ensembl as we study the human genome, BLAST, and other topics.