1 / 35

BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

BTN323: INTRODUCTION TO BIOLOGICAL DATABASES. Lecturer: Junaid Gamieldien, PhD junaid@sanbi.ac.za. Day1: NCBI Databases and Entrez. http://www.sanbi.ac.za/training-2/undergraduate-training/. NOTE: Most slides derived from NCBI’s field guide. WHAT YOU NEED TO LEARN:.

jase
Download Presentation

BTN323: INTRODUCTION TO BIOLOGICAL DATABASES

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BTN323:INTRODUCTION TO BIOLOGICAL DATABASES Lecturer: Junaid Gamieldien, PhD junaid@sanbi.ac.za Day1: NCBI Databases and Entrez http://www.sanbi.ac.za/training-2/undergraduate-training/ NOTE: Most slides derived from NCBI’s field guide

  2. WHAT YOU NEED TO LEARN: • What is a database and what are the features of an ideal db? • What are the relationships/differences between primary and derived sequence databases? • What are the benefits of RefSeq? • Why is data integration useful?

  3. The National Center for Biotechnology Information Bethesda,MD • Created in 1988 as a part of the • National Library of Medicine at NIH • Establish public databases • Research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information

  4. Web Access:www.ncbi.nlm.nih.gov New pages! New Homepage Common footer

  5. WHAT ARE DATABASES? • Structured collection of information. • Consists of basic units called records or entries. • Each record consists of fields, which hold pre-defined data related to the record. • For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence)

  6. THE ‘PERFECT’ DATABASE • Comprehensive, but easy to search. • Annotated, but not “too annotated”. • A simple, easy to understand structure. • Cross-referenced. • Minimum redundancy. • Easy retrieval of data.

  7. The Central Dogma & Biological Data Original DNA Sequences (Genomes) Expressed DNA sequences ( = mRNA Sequences = cDNA sequences) Expressed Sequence Tags (ESTs) • Protein Sequences • Inferred • Direct sequencing • Protein structures • Experiments • Models (homologues) Literature information

  8. NCBI Databases and Services • GenBank primary sequence database • Free public access to biomedical literature • PubMed free Medline (3 million searches per day) • PubMed Central full text online access • Entrez integrated molecular and literature databases

  9. TYPES OF MOLECULAR DATABASES • Primary Databases • Original submissions by experimentalists • Content controlled by the submitter • Examples: GenBank, Trace, SRA, SNP, GEO • Derivative Databases • Derived from primary data • Content controlled by third party (NCBI) • Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain

  10. ACGTGC C C GA GA ATT GA GA C ATT TATAGCCG AGCTCCGATA CCGATGACAA RefSeq C TATAGCCG ACGTGC Curators CGTGA ATTGACTA TTGACA Genome Assembly TTGACA TTGACA ACGTGC ACGTGC TATAGCCG CGTGA CGTGA TATAGCCG ATTGACTA TATAGCCG ATTGACTA ATTGACTA CGTGA ATTGACTA ATTGACTA ATT TATAGCCG TATAGCCG TATAGCCG TATAGCCG TATAGCCG TTGACA C GenBank UniGene GA AT C C C C ATT GA GA GA GA ATT ATT ATT Algorithms GA GA GA GA C C ATT ATT C C PRIMARY VS. DERIVATIVE SEQUENCE DATABASES Labs Sequencing Centers Updated continually by NCBI Updated ONLY by submitters

  11. Sequence Databases at NCBI • Primary • GenBank: NCBI’s primary sequence database • Trace Archive: reads from capillary sequencers • Sequence Read Archive: next generation data • Derivative • GenPept (GenBank translations) • Outside Protein (UniProt—Swiss-Prot, PDB) • NCBI Reference Sequences (RefSeq)

  12. GENBANK -PRIMARY SEQUENCE DB • Nucleotide only sequence database • Archival in nature • Historical • Reflective of submitter point of view (subjective) • Redundant • Data • Direct submissions (traditional records) • Batch submissions • FTP accounts (genome data)

  13. GENBANK -PRIMARY SEQUENCE DB (2) • Three collaborating databases • GenBank • DNA Database of Japan (DDBJ) • European Molecular Biology Laboratory (EMBL) Database

  14. Traditional GenBank Record • Accession • Stable • Reportable • Universal ACCESSION U07418 VERSION U07418.1 GI:466461 Version Tracks changes in sequence GI number NCBI internal use well annotated the sequence is the data

  15. Derivative Sequence Databases

  16. GenPept: GenBank CDS translations FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS >gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV... EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

  17. REFSEQ: DERIVATIVE SEQUENCE DATABASE • Curated transcripts and proteins • Model transcripts and proteins • Assembled Genomic Regions • Chromosome records • Human genome • microbial • organelle ftp://ftp.ncbi.nih.gov/refseq/release/

  18. Selected RefSeq Accession Numbers mRNAs and Proteins NM_123456Curated mRNA NP_123456Curated Protein NR_123456Curated non-coding RNA XM_123456Predicted mRNA XP_123456Predicted Protein XR_123456Predicted non-coding RNA Gene Records NG_123456Reference Genomic Sequence Chromosome NC_123455Microbial replicons, organelle genomes, human chromosomes AC_123455 Alternate assemblies Assemblies NT_123456Contig NW_123456WGSSupercontig

  19. GenBank to RefSeq

  20. RefSeqs: Annotation Reagents Genomic DNA (NC, NT, NW) Scanning.... Model mRNA(XM) (XR) Model protein (XP) = ? Curated mRNA(NM) (NR) Curated Protein(NP) RefSeq GenBank Sequences

  21. RefSeq Benefits • Non-redundancy   • Updates to reflect current sequence data and biology • Data validation • Format consistency • Distinct accession series • Stewardship by NCBI staff and collaborators

  22. Other Derivative Databases • Expressed Sequences • dbSNP • Structure • Gene • and more…

  23. ENTREZFINDING RELEVANT INFORMATION IN NCBI DATABASES

  24. Word weight PubMed abstracts 3-D Structure 3 -D Structure Taxonomy VAST Phylogeny Gene Protein sequences Nucleotide sequences BLAST BLAST Neighbors Related Sequences Hard Link ENTREZ: A DISCOVERY SYSTEM • Pre-computed and pre-compiled data. • A potential “gold mine” of undiscovered relationships. • Used less than expected. Neighbors Related Structures Neighbors Related Sequences BLink Domains

  25. Global Query: All NCBI Databases The Entrez system: 38 (and counting) integrated databases

  26. Traditional Method: The links menu DNA Sequence Nucleotide – Protein Link Related Proteins Protein – Structure Link 3-D Structure

  27. The Problem • Rapidly growing databases with complex and changing relationships • Rapidly changing interfaces to match the above Result • Many people don’t know: • Where to begin • Where to click on a Web page • Why it might be useful to click there

  28. colon cancer Global NCBI (Entrez) Search

  29. Global Entrez Search Results

  30. Entrez Protein Gene Other Entrez DBs HomoloGene UniGene ENTREZ TIP: START SEARCHES IN GENE BLink Homologene: Gene Neighbors

  31. Precise Results MLH1[Gene Name] AND Human[Organism]

  32. MLH1 Gene Record

  33. MLH1:Links to Sequence

  34. ATPase domain GENEVIEW: HUMAN MLH1 VARIATIONS

  35. ‘TAKE HOME MESSAGE’ ADVANTAGES OF DATA INTEGRATION • More relevant inter-related information in one place • Makes it easier to find additional relevant information related to your initial query • Potentially find information indirectly linked, but relevant to your subject of interest • uncover non-obvious genetic features that explain phenotype or disease • Easier to build a ‘story’ based on multiple pieces of biological evidence

More Related