1 / 38

COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases

COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases. Xingquan Zhu Dept. of Computer Science and Engineering. Protein structure databases. Gene expression database. transcription. translation. DNA. RNA. protein. phenotype. Protein sequence databases. cDNA

nodin
Download Presentation

COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. COT 6930HPC and BioinformaticsBioinformatics Resources and Databases Xingquan Zhu Dept. of Computer Science and Engineering

  2. Protein structure databases Gene expression database transcription translation DNA RNA protein phenotype Protein sequence databases cDNA ESTs UniGene Genomic DNA Databases

  3. Gene

  4. Different transcripts can be related to the same gene!

  5. EST Expressed Sequence Tags • Partial copies of mRNA found within a particular cell • Can be used to identify genitc regions; splicing patterns of genes; etc

  6. Outline • Bioinformatics Databases • Primary databases • Derived databases • Nucleotide databases • GenBank (P), EMBL-Bank (P) • Protein databases • Swiss-Prot (D), PIR-PSD (D) • GenPept (D), TrEMBL (D) • Protein Data Bank (P) • Other Examples • RefSeq • UniGene • PubMed • SNP • OMIM

  7. Bioinformatics Databases • Information • DNA sequences • Conserved DNA domains • Genomes • Gene expression (ESTs, microarrays) • Protein sequences • Protein 3D structure • Protein families • Mutations / polymorphisms / SNPs • Metabolic pathways • Chemical compounds (ligands) • Biomedical literature (journal papers, online books…)

  8. Primary public domain bioinformatics servers Public Domain Bioinformatics Facilities Genome Net (KEGG & DDBJ) Japan European Bioinformatics Institute (EBI) United Kingdom National Center For Biotechnology Information (NCBI) United States Databases Databases Analysis Tools Analysis Tools Databases Analysis Tools

  9. Major Databases • DNA sequences • GenBank, RefSeq, UniGene • Protein sequences • Swiss-Prot, PIR-PSD, GenPept, TrEMBL, RefSeq • Protein structure • Protein Data Bank (PDB) • Gene expression • Gene Expression Omnibus (GEO) • Biomedical publications • PubMed / MedLine

  10. Bioinformatics Data Sources • Primary databases • Original submissions by researchers • Staff organizes information only • Generally sequence oriented • Examples • GenBank, PDB (Protein Data Bank)

  11. Bioinformatics Data Sources • Derived databases • Compiled from data in primary databases • Manually curated (human selection & correction) • Advantages – high quality • Disadvantages – high expense, low volume • Examples • Swiss-Prot, PIR-PSD, RefSeq • Computational derivation (automatically generated) • Advantages – inexpensive, up-to-date • Disadvantages – lower quality • Examples • GenPept, TrEMBL, UniGene

  12. Outline • Bioinformatics Databases • Primary databases • Derived databases • Nucleotide databases • GenBank (P), EMBL-Bank (P) • Protein databases • Swiss-Prot (D), PIR-PSD (D) • GenPept (D), TrEMBL (D) • Protein Data Bank (P) • Other Examples • RefSeq • UniGene • PubMed • SNP • OMIM

  13. Bioinformatic Databases – GenBank • “GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences” • Database type • Nucleotide sequences • Primary database • Current Size (As of Aug. 2006): • 65,369,091,950 (bps) • 61,132,599 (sequences) • Access to GenBank • Available for searching at NCBI via several methods • Such as BLAST search http://www.ncbi.nlm.nih.gov/Genbank/

  14. Bioinformatic Databases – GenBank • Types of submissions to database • Genomic DNA • High quality complete DNA sequence • mRNA / cDNA • Partial or complete mRNA (or cDNA) • Expressed sequence tag (EST) • A short sub-sequence of a transcribed spliced nucleotide sequence (mRNA) (500-800bps) • May represent portions of expressed genes • Either protein-coding or not • About 43 million ESTs are now available • Sequence tagged sites (STS) • Short DNA sequences unique in genome • Genomic survey sequence (GSS) • Single-pass genomic DNA • Third-party annotations of GenBank sequences

  15. Bioinformatic Databases – EMBL-Bank • Europe's primary nucleotide sequence resource • Primary databases • Database type • Nucleotide sequences • Primary database • http://www.ebi.ac.uk/embl/

  16. Outline • Bioinformatics Databases • Primary databases • Derived databases • Nucleotide databases • GenBank (P), EMBL-Bank (P) • Protein databases • Swiss-Prot (D), PIR-PSD (D) • GenPept (D), TrEMBL (D) • Protein Data Bank (P) • Other Examples • RefSeq • UniGene • PubMed • SNP • OMIM

  17. Bioinformatic Databases – Proteins • Protein sequence databases • Once derived from laboratory experiments • Now mostly based on predicted ORFs from DNA • Manual curation • Swiss-Prot • PIR-PSD • Computational derivation • GenPept • TrEMBL

  18. Bioinformatic Databases – Swiss-Prot, PIR-PSD • Database type • Protein sequences • Derived database • Manually curated (non-redundant, annotated) • Many annotations • Functions of the protein • Domains and sites • Secondary & quaternary structure • Similarities to other proteins • Variants • Swiss-Prot http://ca.expasy.org/sprot/ • PIR-PSD http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml

  19. Bioinformatic Databases – GenPept, TrEMBL • Database type • Protein sequences • Computationally derived database • Predicted (translating) coding sequences (CDS) from GenBank, EMBL (i.e., gene product) • GenPept • Download: ftp://ftp.ncifcrf.gov/pub/genpept/ • Release 163 (as of 12/26/2007) • 4,970,178 loci containing 1,517,599,916 residues • TrEMBL • http://www.ebi.ac.uk/trembl/index.html

  20. Structure Databases • 3-dimensional structures of proteins, nucleic acids, molecular complexes etc • 3-d data is available due to techniques such as NMR and X-Ray crystallography • Protein Data Bank • Protein 3D structures • Primary database • (http://www.rcsb.org/pdb/home/home.do)

  21. Protein Data Bank: PDB

  22. Bioinformatic Databases – Connections

  23. Outline • Bioinformatics Databases • Primary databases • Derived databases • Nucleotide databases • GenBank (P), EMBL-Bank (P) • Protein databases • Swiss-Prot (D), PIR-PSD (D) • GenPept (D), TrEMBL (D) • Protein Data Bank (P) • Other Examples • RefSeq • UniGene • PubMed • SNP • OMIM

  24. Bioinformatic Databases – RefSeq • The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. • Information derived from GenBank records • Database type • Nucleotide & protein sequences • Derived database • Human curated (non-redundant, cross-linked) • Data in RefSeq • Genomic DNA • mRNAs & proteins for known genes, gene models • Entire chromosomes • Multiple organisms • http://www.ncbi.nlm.nih.gov/projects/RefSeq/ • Example • http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_015325

  25. Bioinformatic Databases – UniGene • UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters • Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location • Database type • Nucleotide sequences • Computationally derived database • Partitioned into non-redundant gene-oriented clusters • Gene-oriented view • Data in UniGene • Clusters of genomic DNA & ESTs • Multiple organisms • http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene

  26. Bioinformatic Databases – PubMed • Database type • Biomedical papers • Manually curated database • Service of the National Library of Medicine • MEDLINE publication database • Over 17,000 journals • 15 million citations since 1950 • http://www.ncbi.nlm.nih.gov/PubMed/

  27. Bioinformatic Databases – Others • Gene expression • ArrayExpress, Gene Expression Omnibus (GEO) • Multi-organism genomes • Entrez Genome, HomoloGene, COGs, TIGR • Genetic variation & genetic diseases • dbSNP, OMIM, CGAP • Metabolic pathways • WIT, KEGG • Many more… • Listed in journal “Nucleic Acids Research” each January

  28. Bioinformatic Databases: SNP Database Single Nucleotide Polymorphisms (SNPs) • Single base difference in a single position among two different individuals of the same species • Play an important role in differentiation and disease http://www.ncbi.nlm.nih.gov/projects/SNP/

  29. Sickle Cell Anemia • Due to 1 swapping an A for a T, causing inserted amino acid to be valine instead of glutamine in hemoglobin Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/

  30. Healthy Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH

  31. Diseased Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGTGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH

  32. Disease Databases • Genes are involved in disease • Many diseases are well studied • Description of diseases and what is known about them is stored • A good place to start when you want to know about a certain disease • Linked to PubMed, the OMIM Morbid Map • OMIM - Online Mendelian Inheritance in Man • “A catalog of human genes and genetic disorders maintained by Johns Hopkins University” • http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

  33. Putting it All Together • Each Database contains specific information • Like other biological systems also these databases are interrelated

  34. PROTEIN PIR SWISS-PROT DISEASE LocusLink OMIM OMIA ASSEMBLED GENOMES GoldenPath WormBase TIGR MOTIFS BLOCKS Pfam Prosite GENOMIC DATA GenBank DDBJ EMBL ESTs dbEST unigene GENES RefSeq AllGenes GDB SNPs dbSNP GENE EXPRESSION Stanford MGDB NetAffx ArrayExpress PATHWAY KEGG COG STRUCTURE PDB MMDB SCOP LITERATURE PubMed

  35. Where to get started? NCBI ENTREZ • A search engine that provides access and links between various databases ENTREZ SNP Genomes Protein databases OMIM Taxonomy GenBank PubMed http://www.ncbi.nlm.nih.gov/sites/gquery

  36. Outline • Bioinformatics Databases • Primary databases • Derived databases • Nucleotide databases • GenBank (P), EMBL-Bank (P) • Protein databases • Swiss-Prot (D), PIR-PSD (D) • GenPept (D), TrEMBL (D) • Protein Data Bank (P) • Other Examples • RefSeq • UniGene • PubMed • SNP • OMIM

More Related