
A Field Guide to GenBank and NCBI Molecular Biology Resources

Slightly modified from Peter Cooper (ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/) and Eric Sayers (ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/).


Presentation Transcript


  1. A Field Guide to GenBank and NCBI Molecular Biology Resources. Slightly modified from Peter Cooper (ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/) and Eric Sayers (ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/).

  2. NCBI Resources
     • About NCBI
     • NCBI Sequence Databases
       • Primary Database: GenBank
       • Derivative Databases: RefSeq
     • Entrez Databases and Text Searching
     • BLAST Services
     • Genomic Resources

  3. The National Center for Biotechnology Information (NCBI)
     • Created as a part of NLM in 1988
     • Establish public databases
     • Perform research in computational biology
     • Develop software tools for sequence analysis
     • Disseminate biomedical information
     • Tools: BLAST (1990), Entrez (1992)
     • GenBank (1992)
     • Free MEDLINE (PubMed, 1997)
     • Human genome (2001)

  4. NCBI Home Page (http://www.ncbi.nlm.nih.gov). To learn more, visit the “Site Map” and “About NCBI” web pages.

  5. About NCBI

  6. Some NCBI Statistics….

  7. [Chart: users per day, 1997 to 2001, with Christmas Day labeled]

  8. Molecular Databases
     • Primary Databases
       • Original submissions by experimentalists
       • Database staff organize but don’t add additional information
       • Example: GenBank
     • Derivative Databases
       • Human curated
         • Compilation and correction of data
         • Example: SWISS-PROT, NCBI RefSeq mRNA
       • Computationally derived
         • Example: UniGene
       • Combinations
         • Example: NCBI Genome Assembly

  9. What is GenBank? NCBI’s Primary Sequence Database
     • Nucleotide-only sequence database
     • GenBank data
       • Direct submissions of individual records (BankIt, Sequin)
       • Batch submissions via email (EST, GSS, STS)
       • ftp accounts established for sequencing centers
     • Data shared among three collaborating databases:
       • GenBank
       • DNA Database of Japan (DDBJ)
       • European Molecular Biology Laboratory Database (EMBL)

  10. The International Nucleotide Sequence Database Collaboration [Diagram: the three collaborating databases exchange submissions and updates. GenBank (NCBI, NIH): Entrez, Sequin, BankIt, ftp. EMBL (EBI): SRS. DDBJ (CIB, NIG): getentry.]

  11. GenBank: NCBI’s Primary Sequence Database
      • Release 133, December 2002
      • 22,318,883 records
      • 28,507,990,166 nucleotides
      • 110,000+ species
      • >90 gigabytes of data
      • Full release every two months
      • Incremental and cumulative updates daily
      • Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/

  12. Entrez Nucleotide (23,464,770 records): GenBank 71%, DDBJ 19%, EMBL 9%, RefSeq 1%

  13. Primary vs. Derivative Databases [Diagram of example sequence fragments; labels: Labs, Sequencing Centers, GenBank, Curators, RefSeq, Algorithms, UniGene, Genome Assembly]

  14. Traditional GenBank Divisions
      • Direct submissions (Sequin and BankIt)
      • Accurate
      • Well characterized
      • Division codes:
        • BCT: Bacterial and Archaeal
        • INV: Invertebrate
        • MAM: Mammalian (excluding ROD and PRI)
        • PHG: Phage
        • PLN: Plant and Fungal
        • PRI: Primate
        • ROD: Rodent
        • SYN: Synthetic (cloning vectors)
        • VRL: Viral
        • VRT: Other Vertebrate

  15. A Traditional GenBank Record [Annotated figure; labeled parts: Locus Field, Molecule Type, Modification Date, GenBank Division, Definition Line, Accession Number, Version, GI (GenInfo), Keywords, Taxonomy]

  16. A Traditional GenBank Record

  17. Bulk Sequence Divisions of GenBank
      • Batch submissions (email and ftp)
      • Inaccurate
      • Poorly characterized
      • Division codes:
        • EST: Expressed Sequence Tag
        • STS: Sequence Tagged Site
        • GSS: Genome Survey Sequence
        • HTG: High Throughput Genomic
        • HTC: High Throughput cDNA

  18. Organization of GenBank (23,087,196 records)
      • 5 Bulk Divisions: EST 67%, GSS 19%, STS, HTG, HTC 2%
      • 11 Traditional Divisions: 8%
      • 1 Patent Division (PAT): 4%

  19. What is UniGene? A gene-oriented view of sequence entries
      • MegaBlast-based automated sequence clustering
      • Nonredundant set of gene-oriented clusters
      • Each cluster represents a unique gene
      • Provides information on tissue-specific expression and map locations
      • Includes well-characterized genes and novel ESTs
      • Useful for gene discovery and selection of mapping reagents

  20. Organisms Represented in UniGene

  21. Genome Sequencing [Flow diagram; labels: whole BAC insert (or genome), shredding, cloning, isolating, sequencing, GSS division or trace archive, assembly, Draft Sequence (HTG division)]

  22. Working Draft Sequence [Diagram; label: gaps]

  23. HTG Division: High Throughput Genome
      • Phase 1 (HTG): Acc = AC109609.1
      • Phase 2 (HTG): Acc = AC109609.6
      • Phase 3 (ROD): Acc = AC109609.10
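The three identifiers on this slide share one accession (AC109609) and differ only in the number after the dot. As a minimal sketch, assuming that suffix is a version counter that increases as the record is revised, the hypothetical helper `split_accession` below separates the two parts. The function name is illustrative and not part of any NCBI tool.

```python
from typing import Tuple


def split_accession(acc_ver: str) -> Tuple[str, int]:
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    accession, sep, version = acc_ver.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {acc_ver!r}")
    return accession, int(version)


# The three identifiers shown on the slide.
phases = ["AC109609.1", "AC109609.6", "AC109609.10"]
parsed = [split_accession(p) for p in phases]

# Same accession throughout; only the version changes.
accessions = {acc for acc, _ in parsed}
versions = [ver for _, ver in parsed]
```

Versions are compared as integers here because a plain string sort would place "10" before "6".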

  24. HTG Division: High Throughput Genome

  25. NCBI’s Third Party Annotation (TPA) Database (new)
      • NCBI now accepts the submission of new annotations of existing GenBank sequences
      • Facilitates the annotation of genomes by experts

  26. A Sample TPA record

  27. RefSeq: NCBI’s Derivative Sequence Database
      • Curated transcripts and proteins
        • Reviewed
        • Human, mouse, rat, fruit fly, zebrafish, Arabidopsis
      • Human model transcripts and proteins
      • Assembled genomic regions (contigs)
        • Draft human genome
        • Mouse genome
      • Chromosome records
        • Microbial
        • Viral
        • Organelle

  28. The RefSeq Accession Numbers (organisms pictured: human, mouse, rat, fruit fly, zebrafish, Arabidopsis)
      • mRNAs and Proteins
        • NM_123456: Curated mRNA
        • NP_123456: Curated Protein
        • NR_123456: Curated non-coding RNA
        • XM_123456: Predicted Transcript (human, mouse)
        • XP_123456: Predicted Protein (human, mouse)
        • XR_123456: Predicted non-coding RNA
      • Gene Records
        • NG_123456: Reference Genomic Sequence (human)
      • Assemblies
        • NT_123456: Contig (Mouse and Human)
        • NW_123456: Supercontig (Mouse)
        • NC_123456: Chromosome (Microbial, Viral, Arabidopsis)
        • NR_123456: Interim Identifier for Microbial Chromosomes
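The prefix table on this slide lends itself to a simple lookup. The sketch below is illustrative only: the function name `describe_refseq` and the validation pattern are assumptions, and because the slide lists NR_ twice, the table keeps only its first meaning (curated non-coding RNA).

```python
import re

# Prefix meanings as listed on the slide (NR_ appears twice there;
# only the first meaning is kept in this sketch).
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NT": "Contig",
    "NW": "Supercontig",
    "NC": "Chromosome",
}

# Assumed shape: two letters, underscore, digits, optional ".version".
_REFSEQ_RE = re.compile(r"^([A-Z]{2})_\d+(\.\d+)?$")


def describe_refseq(accession: str) -> str:
    """Return the slide's description for a RefSeq-style accession."""
    match = _REFSEQ_RE.match(accession.strip().upper())
    if match is None or match.group(1) not in REFSEQ_PREFIXES:
        raise ValueError(f"not a recognized RefSeq accession: {accession!r}")
    return REFSEQ_PREFIXES[match.group(1)]
```

For example, `describe_refseq("NM_123456")` yields the curated mRNA label, while a GenBank accession such as `U18238` is rejected because it has no two-letter prefix and underscore.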

  29. Curated RefSeq Records: NM_, NP_

  30. Entrez: Linking and Neighboring

  31. The Entrez Databases

  32. The (ever) Expanding Entrez System [Diagram; databases shown around Entrez: Journals, UniGene, Books, SNP, PubMed, UniSTS, PubMed Central, Nucleotide, PopSet, Protein, ProbeSet, Genome, Structure, Taxonomy, CDD, OMIM, 3D Domains]

  33. Entrez Nucleotides: glucose 6 phosphate dehydrogenase

  34. Document Summaries: glucose 6 phosphate dehydrogenase[All Fields] = 748 hits

  35. Entrez Nucleotides: Limits (query: glucose 6 phosphate dehydrogenase)
      Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

  36. Entrez Nucleotides: Preview/Index

  37. Adding Terms: Preview/Index
      Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, . . .
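Slide 34 shows a query written as `glucose 6 phosphate dehydrogenase[All Fields]`, that is, search text followed by a field name in square brackets. As a rough sketch of how terms might be added, the hypothetical helpers below build such strings and join them with the Boolean operator AND. The helper names and the choice of `Medicago sativa` as the organism are illustrative assumptions; the field names come from the list on this slide.

```python
def field_term(text: str, field: str = "All Fields") -> str:
    """Format one search term as text[Field]."""
    return f"{text.strip()}[{field}]"


def all_of(*terms: str) -> str:
    """Join terms so that every one of them must match (Boolean AND)."""
    if not terms:
        raise ValueError("at least one term is required")
    return " AND ".join(terms)


# Illustrative query: the enzyme name from the slides, restricted to one organism.
query = all_of(
    field_term("glucose 6 phosphate dehydrogenase"),
    field_term("Medicago sativa", "Organism"),
)
```

The resulting string is only text; pasting it into the Entrez search box, or sending it through whatever client is at hand, is left to the reader.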

  38. Plant G6PD mRNAs

  39. Display: Formats, Links, and Neighbors
      • Formats: Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list
      • Links and neighbors: LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links, UniSTS Links

  40. FASTA definition line
      >gi|603218|gb|U18238.1|MSU18238
      • gi number: 603218
      • Database identifier: gb
      • Accession number: U18238.1
      • Locus name: MSU18238
      Database identifiers: gb GenBank, emb EMBL, dbj DDBJ, sp SWISS-PROT, pdb Protein Databank, pir PIR, prf PRF, ref RefSeq
      Example record:
      >gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
      CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
      GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
      TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
      GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
      AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
      TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
      TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
      CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
      GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
      TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
      ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
      AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
      AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
      CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
      TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
      AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
      CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
      AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
      GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
      ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
      GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
      GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
      TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
      AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
      TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
      ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA
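The definition line on this slide packs several identifiers into one pipe-delimited string. A minimal sketch of pulling them apart follows, assuming the five-field `gi|number|db|accession.version|locus` layout shown on the slide; other database tags may use different layouts that this sketch does not handle. The names `Defline` and `parse_defline` are hypothetical.

```python
from typing import NamedTuple

# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


class Defline(NamedTuple):
    gi: str
    database: str
    accession: str
    version: int
    locus: str
    description: str


def parse_defline(line: str) -> Defline:
    """Parse a '>gi|...|db|ACC.VER|LOCUS description' line (assumed layout)."""
    if not line.startswith(">gi|"):
        raise ValueError("expected a line starting with '>gi|'")
    # Identifiers come first; free-text description follows the first space.
    ids, _, description = line[1:].strip().partition(" ")
    fields = ids.split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 pipe-separated fields, got {len(fields)}")
    _, gi, tag, acc_ver, locus = fields
    accession, sep, version = acc_ver.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {acc_ver!r}")
    # Unknown tags are passed through unchanged rather than rejected.
    return Defline(gi, DB_TAGS.get(tag, tag), accession, int(version),
                   locus, description)


record = parse_defline(
    ">gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd"
)
```

Here `record.database` resolves the `gb` tag to GenBank, and `record.accession` and `record.version` hold the two halves of `U18238.1`.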

  41. Entrez Genome

  42. Organism Pages

  43. The Map Viewer: a common platform for integrated display

  44. The Map Viewer

  45. Entrez PubMed

  46. Online Books

  47. Entrez Specialized Databases
      • Taxonomy: searchable taxonomic tree having nodes for all species with records in an Entrez database
      • OMIM (Online Mendelian Inheritance in Man): a database of genetically linked human diseases
      • ProbeSet: expression data (GEO) and microarray datasets

  48. Entrez Taxonomy

  49. Entrez OMIM

  50. Entrez ProbeSet
