NCBI
• Created as a part of NLM in 1988
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Human genome (2001)
NCBI Home Page: www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages.
Entrez: The (Ever) Expanding Entrez System
[Figure: Entrez databases: PubMed, Nucleotide, UniGene, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books]
Literature Databases
• PubMed
• Books
• PubMed Central
• Journals
• Online Mendelian Inheritance in Man (OMIM)
Molecular Sequence Databases
• Sequence Databases
  • Nucleotide (GenBank)
  • Taxonomy
  • PopSet
  • Protein
• Marker Databases
  • Single Nucleotide Polymorphisms (SNPs, dbSNP)
  • Sequence Tagged Sites (STSs, dbSTS)
  • Expressed Sequence Tags (ESTs, dbEST)
  • UniGene
Molecular Databases
• Primary Databases
  • Original submissions by experimentalists
  • Database staff organize but don’t add additional information
  • Example: GenBank
• Derivative Databases
  • Human curated
    • Compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq mRNA
  • Computationally derived
    • Example: UniGene
  • Combinations
    • Example: NCBI Genome Assembly
[Figure: sequence flow from Labs into GenBank, and from GenBank to RefSeq (via Curators), UniGene (via Algorithms), and Genome Assembly]
The International Nucleotide Sequence Database Collaboration
[Figure: GenBank (NCBI, NIH; accessed via Entrez), DDBJ (CIB, NIG; accessed via Get Entry), EMBL (EBI; accessed via SRS)]
Entrez Nucleotide, by source database: GenBank 71%, DDBJ 19%, EMBL 9%, RefSeq 1%, PDB 0.01%
What is GenBank? NCBI’s Primary Sequence Database
• Nucleotide-only sequence database
• Archival in nature
• GenBank data
  • Direct submissions of individual records (BankIt, Sequin)
  • Batch submissions via email (EST, GSS, STS)
  • ftp accounts established for sequencing centers
• Data shared among three collaborating databases:
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory Database (EMBL)
The Old Way From Fran Lewitter, Whitehead Institute
GenBank: NCBI’s Primary Sequence Database
Release 136, June 2003
• 25,592,865 records (18,197,119 in June 2002)
• 32,528,249,295 nucleotides (22,616,937,182 in June 2002)
• 110,000+ species
• 121 gigabytes of data
• Full release every two months
• Incremental and cumulative updates daily
• Available only through the Internet:
  ftp://ftp.ncbi.nih.gov/genbank/
  ftp://genbank.sdsc.edu/pub
  ftp://bio-mirror.net/biomirror/genbank/
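As a rough check on the growth implied by the release figures above, the year-over-year change can be worked out directly. The sketch below uses only the record and nucleotide counts quoted for June 2002 and June 2003; the function name is illustrative.

```python
def percent_growth(old: int, new: int) -> float:
    """Return the percentage increase from old to new."""
    return (new - old) / old * 100.0

# Counts quoted on the slide (Release 136, June 2003, versus June 2002).
records_2002, records_2003 = 18_197_119, 25_592_865
bases_2002, bases_2003 = 22_616_937_182, 32_528_249_295

record_growth = percent_growth(records_2002, records_2003)
base_growth = percent_growth(bases_2002, bases_2003)

print(f"Records:     +{record_growth:.1f}%")
print(f"Nucleotides: +{base_growth:.1f}%")
```

Both counts grew by a little over 40% in twelve months, with nucleotides growing slightly faster than records.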
GenBank Divisions
Traditional Divisions
  BCT   Bacterial/Archaeal
  INV   Invertebrate
  MAM   Mammalian (excluding ROD/PRI)
  PHG   Phage
  PLN   Plant/Fungal
  PRI   Primate
  ROD   Rodent
  SYN   Synthetic (cloning vectors)
  VRL   Viral
  VRT   Other Vertebrate
Bulk Sequence Divisions
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  GSS   Genome Survey Sequence
  HTGS  High Throughput Genomic Sequence
  HTC   High Throughput cDNA
A Traditional GenBank Record
[Figure: annotated GenBank record. Labeled parts: Locus Field, Molecule Type, Modification Date, Definition Line, GenBank Division, GI (GenInfo), Keywords, Taxonomy, Submission Field]
[Figure, continued: Feature Table, GenPept Record, Genomic DNA Sequence]
Bulk Sequence Divisions
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  HTGS  High Throughput Genomic Sequence
• Batch submission, e-mail, or ftp
• Inaccurate
• Poorly characterized
EST Division: Expressed Sequence Tags
[Figure: nucleus (30,000 genes); RNA gene products; make cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones and sequence once from each end (5’ and 3’)]
Example read fragments from the figure: gatccantgccatacg, ctcgccaattcnntcg
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
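Because each EST clone is sequenced only once from each end, reads often contain uncalled bases, written as N. The following is a minimal sketch of how the fraction of such bases in FASTA-formatted reads could be measured. The record names `read_1` and `read_2` are hypothetical; the two short sequences are the read fragments shown on the slide.

```python
from typing import Iterator, Tuple

def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def ambiguous_fraction(seq: str) -> float:
    """Fraction of positions called as N (base could not be determined)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0

# Hypothetical headers; sequences are the short fragments from the slide.
example = """>read_1
gatccantgccatacg
>read_2
ctcgccaattcnntcg
"""

fractions = {name: ambiguous_fraction(seq) for name, seq in read_fasta(example)}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} ambiguous")
```

The same two functions can be applied to the full IMAGE:275615 reads by pasting them into the `example` string.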
What is UniGene? A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Provides information on tissue-specific expression and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents
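The clustering idea can be illustrated with a toy example. This is not UniGene's actual pipeline: it assumes pairwise similarity hits have already been computed (for instance by MegaBLAST) and simply groups every sequence connected by a chain of hits into one cluster, using a union-find structure. All sequence identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

def cluster(ids: Iterable[str], hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group sequence IDs so that any two linked by a hit share a cluster."""
    ids = list(ids)
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical input: one known mRNA and four ESTs, with precomputed hits.
sequence_ids = ["mRNA_A", "EST_1", "EST_2", "EST_3", "EST_4"]
pairwise_hits = [("mRNA_A", "EST_1"), ("EST_1", "EST_2"), ("EST_3", "EST_4")]

clusters = cluster(sequence_ids, pairwise_hits)
for members in clusters:
    print(members)
```

In this toy input the first cluster contains a known mRNA plus ESTs, while the second consists of ESTs only, which corresponds to the "novel ESTs" case on the slide.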
EST hits to Homo sapiens muscle creatine kinase mRNA
[Figure: query sequence (muscle creatine kinase mRNA) with 5’ EST hits and 3’ EST hits]
STS Division: Sequence Tagged Sites
• Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
• PCR with STS primers gives one product per genome
• Basis of Radiation Hybrid Mapping
• UniGene
• Genome Assembly
• Related resource: Electronic PCR
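The "one product per genome" property can be sketched as a toy in-silico PCR check. This is a simplified illustration, not NCBI's Electronic PCR tool: it requires exact primer matches, searches one strand only, and uses a made-up template and primers.

```python
from typing import List, Tuple

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_all(text: str, pattern: str) -> List[int]:
    """Return every start index at which pattern occurs in text."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

def predicted_products(template: str, forward: str, reverse: str,
                       max_size: int = 1000) -> List[Tuple[int, int]]:
    """List (start, size) of products with exact primer matches on one strand."""
    template = template.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)  # where the reverse primer anneals
    products = []
    for f in find_all(template, forward):
        for r in find_all(template, reverse_site):
            size = r + len(reverse_site) - f
            if r >= f + len(forward) and size <= max_size:
                products.append((f, size))
    return products

# Hypothetical primers and template, built so each primer site occurs once.
FORWARD = "GACCATGG"
REVERSE = "TGCCTAAG"
template = "AAT" + FORWARD + "TTTTCCCCAAAA" + reverse_complement(REVERSE) + "GGG"

products = predicted_products(template, FORWARD, REVERSE)
is_single_copy = len(products) == 1
print(products, "single product:", is_single_copy)
```

A primer pair that produced zero or several products on the template would not behave like an STS in this simplified model.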
HTG Division: High Throughput Genome
• Phase 1 (HTG), Acc = AC109609.1: unfinished, may be unordered, with gaps
• Phase 2 (HTG), Acc = AC109609.6: unfinished, oriented, ordered, may have gaps
• Phase 3 (ROD), Acc = AC109609.10: finished, no gaps
Same accession numbers, different versions
40,000 to > 50,000 bp
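The accession.version convention on this slide can be handled with a few lines of code. The sketch below splits each identifier into its accession and version, confirms that the three phases share one accession, and picks the latest version. The function name is illustrative.

```python
from typing import Tuple

def split_accession(accession_version: str) -> Tuple[str, int]:
    """Split 'ACCESSION.VERSION' into (accession, version as an integer)."""
    accession, _, version = accession_version.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {accession_version!r}")
    return accession, int(version)

# Identifiers quoted on the slide for phases 1, 2 and 3.
identifiers = ["AC109609.1", "AC109609.6", "AC109609.10"]

parsed = [split_accession(i) for i in identifiers]
same_accession = len({accession for accession, _ in parsed}) == 1
latest = max(parsed, key=lambda pair: pair[1])

print("same accession:", same_accession)
print("latest version:", f"{latest[0]}.{latest[1]}")
```

Versions are compared as integers because a plain string comparison would rank ".6" above ".10".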
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  • reviewed
  • human, mouse, rat, fruit fly, zebrafish, Arabidopsis
• Human model transcripts and proteins
• Assembled Genomic Regions (contigs)
  • draft human genome
  • mouse genome
• Chromosome records
  • Microbial
  • viral
  • organelle
Reference Sequences
  Chromosome:     NC_000000
  mRNA:           NM_000000
  Protein:        NP_000000
  Gene:           NG_000000
  Contig:         NT_000000, NW_000000
  RNA:            NR_000000
  Model mRNA:     XM_000000
  Model protein:  XP_000000
  Model RNA:      XR_000000
Figure legend: Curated, Automated
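The prefix scheme above lends itself to a simple lookup. The sketch below maps the two-letter RefSeq prefix to the record type named on the slide. It covers only the prefixes listed there, and the function name is illustrative.

```python
from typing import Optional

# Prefix-to-type table taken from the slide; other prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NM": "mRNA",
    "NP": "protein",
    "NG": "gene",
    "NT": "contig",
    "NW": "contig",
    "NR": "RNA",
    "XM": "model mRNA",
    "XP": "model protein",
    "XR": "model RNA",
}

def refseq_type(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq accession, or None if unrecognized."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not in RefSeq accession form
    return REFSEQ_PREFIXES.get(prefix.upper())

for example in ["NC_002695", "NM_000000", "XP_000000", "AC109609.1"]:
    print(example, "->", refseq_type(example))
```

GenBank-style accessions such as AC109609.1 have no underscore, so the function returns None for them.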
RefSeq Chromosomes: NC_
LOCUS       NC_002695    5498450 bp    DNA     circular BCT       02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying the
            verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7
            derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
   PUBMED   10734605
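The LOCUS line of a record like the one above packs several fields into one row: name, length, molecule type, topology, division and modification date. The sketch below splits it on whitespace rather than on fixed column positions. It assumes the topology token may be missing and treats that case as linear; this is a simplification for illustration, not a full flat-file parser.

```python
from datetime import date
from typing import NamedTuple

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

class Locus(NamedTuple):
    name: str
    length: int
    molecule: str
    topology: str
    division: str
    modified: date

def parse_locus(line: str) -> Locus:
    """Parse a whitespace-separated GenBank LOCUS line."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    name, length, molecule = tokens[1], int(tokens[2]), tokens[4]
    rest = tokens[5:]
    # Assumption: a missing topology token means a linear molecule.
    topology = rest.pop(0) if rest[0] in ("linear", "circular") else "linear"
    if len(rest) != 2:
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    division, date_text = rest
    day, month, year = date_text.split("-")
    modified = date(int(year), MONTHS.index(month.upper()) + 1, int(day))
    return Locus(name, length, molecule, topology, division, modified)

locus = parse_locus(
    "LOCUS NC_002695 5498450 bp DNA circular BCT 02-OCT-2001"
)
print(locus)
```

For this record the division field is BCT, the bacterial division from the GenBank Divisions slide.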
BLAST: a starting point for most bioinformatics-related problems…