1 / 63

NCBI Molecular Biology Resources

NCBI Molecular Biology Resources. A Field Guide. NCBI Resources. About NCBI NCBI Sequence Databases Other NCBI Databases Entrez Databases and Text Searching Genomic Resources BLAST Services. The National Center for Biotechnology Information. Bethesda,MD.

aimon
Download Presentation

NCBI Molecular Biology Resources

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NCBI Molecular Biology Resources A Field Guide

  2. NCBI Resources • About NCBI • NCBI Sequence Databases • Other NCBI Databases • Entrez Databases and Text Searching • Genomic Resources • BLAST Services

  3. The National Center for Biotechnology Information Bethesda,MD Created in 1988 as a part of the National Library of Medicine at NIH • Establish public databases • Research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information

  4. Types of Databases • Primary Databases • Original submissions by experimentalists • Content controlled by the submitter • Examples: GenBank, SNP, GEO • Derivative Databases • Built from primary data • Content controlled by third party (NCBI) • Examples: Refseq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

  5. The Entrez System

  6. Entrez Nucleotides Primary • GenBank / EMBL / DDBJ 35,116,960 Derivative • RefSeq 259,219 • Third Party Annotation 3,182 • PDB 4,703 Total 35,384,248

  7. Entrez Protein • GenPept (GB,EMBL, DDBJ)3,178,346 • RefSeq 933,905 • Third Party Annotation 4,338 • Swiss Prot 146,978 • PIR 282,821 • PRF 12,079 Total 4,314,705 BLAST nr 2,724,717

  8. What is GenBank?NCBI’s Primary Sequence Database • Nucleotide only sequence database • Archival in nature • GenBank Data • Direct submissions (traditional records ) • Batch submissions (EST, GSS, STS) • ftp accounts (genome data) • Three collaborating databases • GenBank • DNA Database of Japan (DDBJ) • European Molecular Biology Laboratory (EMBL) Database

  9. Entrez NIH NCBI GenBank • Submissions • Updates • Submissions • Updates EMBL DDBJ EBI CIB NIG • Submissions • Updates SRS EMBL getentry International Sequence Database Collaboration

  10. Sequence records • Total base pairs 35 40 35 30 30 25 25 20 20 15 15 10 10 5 5 0 0 The Growth of GenBank Release 140: 32.5 million records 37.9 billion nucleotides Average doubling time ≈ 12 months Sequence Records (millions) Total Base Pairs (billions) ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ’94 ’95 ’96 ’97 ’98 ’99 ’00 ’01 ’02 ’03 ’04

  11. Organization of GenBank:GenBank Divisions Records are divided into 17 Divisions. • 1 Patent (11 files) • 5 Bulk • 11 Traditional EST (288) Expressed Sequence Tag GSS (98) Genome Survey Sequence HTG (61) High Throughput Genomic STS (3) Sequence Tagged Site HTC (3) High Throughput cDNA PRI (27) Primate PLN (10) Plant and Fungal BCT (8) Bacterial and Archeal INV (6) Invertebrate ROD (11) Rodent VRL (3) Viral VRT (4) Other Vertebrate MAM (1) Mammalian (ex. ROD and PRI) PHG (1) Phage SYN (1) Synthetic (cloning vectors) UNA (1) Unannotated • Traditional Divisions: • Direct Submissions • (Sequin and BankIt) • Accurate • Well characterized • BULK Divisions: • Batch Submission • (Email and FTP) • Inaccurate • Poorly characterized Entrez query: gbdiv_xxx[Properties]

  12. Traditional GenBank Divisions • Direct Submissions (Sequin and BankIt) • Accurate • Well characterized BCT Bacterial and Archeal INV Invertebrate MAM Mammalian (ex. ROD and PRI) PHG Phage PLN Plant and Fungal PRI Primate ROD Rodent SYN Synthetic (vectors, synth. genes) VRL Viral VRT Other Vertebrate

  13. A Traditional GenBank Record LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002 DEFINITION Limulus polyphemus myosin III mRNA, complete cds. ACCESSION AF062069 VERSION AF062069.2 GI:7144484 KEYWORDS . SOURCE Limulus polyphemus (Atlantic horseshoe crab) ORGANISM Limulus polyphemus Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus. REFERENCE 1 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998) MEDLINE 98279067 PUBMED 9614231 REFERENCE 2 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REFERENCE 3 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REMARK Sequence update by submitter COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.

  14. LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002 Length Division Molecule type Locus name Modification Date GenBank Record: Locus LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002 DEFINITION Limulus polyphemus myosin III mRNA, complete cds. ACCESSION AF062069 VERSION AF062069.2 GI:7144484 KEYWORDS . SOURCE Limulus polyphemus (Atlantic horseshoe crab) ORGANISM Limulus polyphemus Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus. REFERENCE 1 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998) MEDLINE 98279067 PUBMED 9614231 REFERENCE 2 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REFERENCE 3 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REMARK Sequence update by submitter COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.

  15. ACCESSION AF062069 VERSION AF062069.2 GI:7144484 GenBank Record: Identifiers LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002 DEFINITION Limulus polyphemus myosin III mRNA, complete cds. ACCESSION AF062069 VERSION AF062069.2 GI:7144484 KEYWORDS . SOURCE Limulus polyphemus (Atlantic horseshoe crab) ORGANISM Limulus polyphemus Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus. REFERENCE 1 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998) MEDLINE 98279067 PUBMED 9614231 REFERENCE 2 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REFERENCE 3 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REMARK Sequence update by submitter COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.

  16. SOURCE Limulus polyphemus (Atlantic horseshoe crab) ORGANISM Limulus polyphemus Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus. GenBank Record: Organism LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002 DEFINITION Limulus polyphemus myosin III mRNA, complete cds. ACCESSION AF062069 VERSION AF062069.2 GI:7144484 KEYWORDS . SOURCE Limulus polyphemus (Atlantic horseshoe crab) ORGANISM Limulus polyphemus Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus. REFERENCE 1 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998) MEDLINE 98279067 PUBMED 9614231 REFERENCE 2 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REFERENCE 3 (bases 1 to 3808) AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C. TITLE Direct Submission JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA REMARK Sequence update by submitter COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700. NCBI’s Taxonomy

  17. GenBank Record: Feature Table FEATURES Location/Qualifiers source 1..3808 /organism="Limulus polyphemus" /db_xref="taxon:6850" /tissue_type="lateral eye" CDS 258..3302 /note="N-terminal protein kinase domain; C-terminal myosin heavy chain head; substrate for PKA" /codon_start=1 /product="myosin III" /protein_id="AAC16332.2" /db_xref="GI:7144485" /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ BASE COUNT 1201 a 689 c 782 g 1136 t ORIGIN 1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt 3781 aagatacagt aactagggaa aaaaaaaa // /protein_id="AAC16332.2" /db_xref="GI:7144485" GenPept IDs

  18. What is UniGene? A gene-oriented view of sequence entries • MegaBlast based automated sequence clustering • Now informed by genome hits New! • Nonredundant set of gene oriented clusters • Each cluster a unique gene • Information on tissue types and map locations • Includes well-characterized genes and novel ESTs • Useful for gene discovery and selection of mapping reagents

  19. UniGene

  20. Human UniGene UniGene Build 168 Feb. 24, 2004 132,990 mRNAs 6,327 models 7,235 HTC 1,408,949 EST, 3'reads 2,082,199 EST, 5'reads + 774,927 EST, other/unknown ---------- 4,412,627 total sequences in clusters Final Number of Clusters (sets) =============================== total 27,511 contain at least one mRNA 5,613 contain at least one HTC 104,397 contain at least one EST 26,291 contain both mRNAs and ESTs 105,651 3,000,000,000 bp 30 K expected genes 75% uncharacterized transcripts

  21. Genome Sequencing - HTG, GSS,(WGS) Whole BAC insert (or genome) shredding sequencing cloning isolating GSS division or trace archive whole genome shotgun assemblies (traditional division) assembly Draft Sequence (HTG division)

  22. Other Genome Sequencing Products Trace Archive Whole Genome Shotgun

  23. Trace Archive • Primary reads from WGS and EST projects • Many not available in GenBank • Earliest access to genome data

  24. Derivative Sequence Databases RefSeq TPA

  25. NCBI Derivative Sequence Data C C Curators GA ATT GA GA C ATT GA C RefSeq TATAGCCG ACGTGC TATAGCCG AGCTCCGATA CCGATGACAA ATTGACTA CGTGA TTGACA Labs TTGACA TTGACA ACGTGC Genome Assembly TATAGCCG ACGTGC TATAGCCG ATTGACTA CGTGA CGTGA ATTGACTA TATAGCCG CGTGA ATTGACTA ATTGACTA TATAGCCG TTGACA ATTGACTA TATAGCCG TATAGCCG TATAGCCG TATAGCCG ATT C GenBank UniGene GA AT C C Algorithms ATT C C GA ATT GA GA ATT GA GA ATT GA C GA C ATT GA

  26. RefSeq: NCBI’s Derivative Sequence Database • Curated transcripts and proteins • reviewed • human, mouse, rat, fruit fly, zebrafish, arabidopsis • Model transcripts and proteins • Assembled Genomic Regions (contigs) • human genome • mouse genome • Chromosome records • Human genome • microbial • organelle srcdb_refseq[Properties] ftp://ftp.ncbi.nih.gov/refseq/release/

  27. RefSeq Benefits • non-redundancy   • explicitly linked nucleotide and protein sequences • updates to reflect current sequence data and biology • data validation • format consistency • distinct accession series • stewardship by NCBI staff and collaborators

  28. RefSeq Accession Numbers mRNAs and Proteins NM_123456Curated mRNA NP_123456Curated Protein NR_123456Curated non-coding RNA XM_123456Predicted mRNA XP_123456Predicted Protein XR_123456Predicted non-coding RNA Gene Records NG_123456Reference Genomic Sequence Chromosome NC_123455Microbial replicons, organelle genomes, human chromosomes Assemblies NT_123456Contig NW_123456WGSSupercontig

  29. Third Party Annotation (TPA) Database • Annotations of existing GenBank sequences • Allows for community annotation of genomes • Direct submissions • BankIt • Sequin tpa[Properties]

  30. Other NCBI Databases dbSNP:nucleotide polymorphism Geo:Gene Expression Omnibus microarray and other expression data Gene:gene records Unifies LocusLink and Microbial Genomes Structure:imported structures (PDB) Cn3D viewer, NCBI curation CDD:conserved domain database Protein families (COGs and KOGs) Single domains (PFAM, SMART, CD)

  31. NCBI Structures and Domains

  32. MMDB: Molecular Modeling Data Base • Derived from experimentally determined PDB records • Value added to PDB records including: • Addition of explicit chemical graph information • Validation • Inclusion of Taxonomy, Citation, • Conversion to ASN.1 data description language • Structure neighbors determined by Vector Alignment Search Tool (VAST)

  33. Structure Summary Cn3D viewer Structure Neighbors 3D Domain Neighbors Conserved Domains

  34. NCBI’s Conserved Domain Database • Multiple sequence alignments • PSI-BLAST –based score matrices • Sources SMART, PFAM, COGs, KOGs • New NCBI curated domains • structure informed alignments • Stats: • COGS 4,873 • KOGS 4,852 • Pfam 5,193 • Smart 653 • NCBI CDD 316

  35. WWWAccess Entrez & BLAST

  36. 1997 1998 1999 2000 2001 1997 1998 1999 2000 2001 Christmas Day NCBI Web Traffic

  37. Using Entrez An integrated database search and retrieval system

  38. Entrez: Database Integration Word weight PubMed abstracts 3-D Structure 3 -D Structure Taxonomy VAST Genomes Phylogeny Protein sequences Nucleotide sequences BLAST BLAST

  39. Database Searching with Entrez Using limits and field restriction to find human MutL homolog Linking and neighboring with MutL Mapping SNPs onto structure and the genome

  40. Global Entrez Search

  41. Document Summaries:MutL[All Fields]

  42. Entrez Nucleotides: Limits & Preview/Index

  43. Accession All Fields Author Name EC/RN Number Feature key Filter Gene Name Issue Journal Name Keyword Modification Date Organism Page Number Primary Accession Properties Protein Name Publication Date SeqID String Sequence Length Substance Name Text Word Title Uid Volume Field Restriction MutL Entrez Nucleotides: Limits Exclude bulk sequences

  44. MutL Entrez Nucleotides: Limits Title == Definition Exclude Bulk Sequences

  45. Document Summaries: Limits

  46. Adding Terms: Preview/Index Accession All Fields Author Name EC/RN Number Feature key Filter Gene Name Issue Journal Name Keyword Modification Date Organism Page Number Primary Accession Properties Protein Name Publication Date SeqID String Sequence Length Substance Name Text Word Title Uid Volume

  47. Human MutL Search Results

  48. GenBank Records Human MutL RefSeq

  49. NM_000249: Links

  50. Literature Links PubMed OMIM

More Related