A Minimal Guide to NCBI Nucleotide Resources
Types of Databases
• Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, SNP, GEO
• Derivative Databases
  • Built from primary data
  • Content controlled by third party (NCBI)
  • Examples: RefSeq, TPA, RefSNP, UniGene, GEO Datasets, NCBI Protein, Structure, Conserved Domain
Accessing the Data: Entrez all[filter]
International Sequence Database Collaboration
• GenBank (NCBI, NIH): submissions and updates; searched through Entrez
• EMBL (EBI): submissions and updates; searched through SRS
• DDBJ (CIB, NIG): submissions and updates; searched through getentry
GenBank: NCBI’s Primary Sequence Database
Release 142, June 2004
• 35,532,003 records
• 40,325,321,348 nucleotides
• >140,000 species
• 153 gigabytes
• 634 files
• Full release every two months
• Incremental and cumulative updates daily
• Available only through the internet
• Release notes: gbrel.txt
• ftp://ftp.ncbi.nih.gov/genbank/
• ftp://genbank.sdsc.edu/pub
• ftp://bio-mirror.net/biomirror/genbank
A GenBank Record
LOCUS       NM_000588   924 bp   mRNA   linear   PRI   07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
KEYWORDS    .
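The VERSION line of the record above packs three identifiers into one field: the accession, its version number, and the GI. A minimal sketch of pulling them apart follows. The function name `parse_version_line` and the regular expression are illustrative assumptions, not part of any NCBI tool, and the pattern only targets lines shaped like the one on the slide.

```python
import re

# Matches a flat-file VERSION line such as
# "VERSION     NM_000588.3  GI:28416914". The GI part is treated as optional
# (an assumption made for this sketch).
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z_0-9]+)\.(?P<version>\d+)"
    r"(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line):
    """Return (accession, version, gi) from a VERSION line; gi may be None."""
    match = VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = match.group("gi")
    return (
        match.group("accession"),
        int(match.group("version")),
        int(gi) if gi else None,
    )


print(parse_version_line("VERSION     NM_000588.3  GI:28416914"))
```

For the slide's record this yields the accession NM_000588, version 3, and GI 28416914 as separate values.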
GenBank Record: Feature Table
/protein_id="NP_000579.2"
/db_xref="GI:28416915"
The /protein_id and /db_xref qualifiers carry the GenPept identifiers.
Abstract Syntax Notation: ASN.1
Record formats shown: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein
Bulk Divisions
• Expressed Sequence Tag (EST)
  • 1st pass single read cDNA
• Genome Survey Sequence (GSS)
  • 1st pass single read gDNA
• High Throughput Genomic (HTG)
  • Incomplete sequences of genomic clones
• Sequence Tagged Site (STS)
  • PCR-based mapping reagents
• Batch submissions (email and ftp)
• Inaccurate
• Poorly characterized
Primary vs. Derivative Databases
• GenBank: updated ONLY by submitters (sequencing centers and labs)
  • Divisions: EST, STS, GSS, HTG, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT
• Derivative databases: updated continually by NCBI
  • UniGene and UniSTS: built by algorithms
  • RefSeq: annotation pipeline; LocusLink and genomes pipelines, with curators
Why Make Reference Sequences?
Entrez Protein query: topoisomerase II alpha[title] AND human[organism]
[Figure: query results showing the RefSeq protein alongside redundant records, with labels AAC77388, P11388, three splice variants, and a Δ = 5 aa difference.]
RefSeq Benefits
• Covers genomes, transcripts, and proteins
• Non-redundant, best representative
• Updates to reflect current sequence data and biology
• Distinct, stable accession series
Reference Sequence: RefSeq
Accession         Sequence Type
NM_123456789      mRNA
NP_123456789      protein, from NM_
NR_123456         non-coding RNA
XM_123456         predicted mRNA
XP_123456         predicted protein
XR_123456         predicted non-coding RNA
ZP_12345678       predicted from NZ_
NC_123456         genomic, e.g., chromosomes
NG_123455         genomic, incomplete region
NT_123456         genomic, BAC assembly
NW_123456         genomic, WGS assembly
NZ_ABCD12345678   genomic, WGS collection
Key: blue = curated RefSeq
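Because the two-letter prefix alone determines the sequence type, the accession table above can be turned into a simple lookup. The sketch below transcribes the table as given; the function name `refseq_sequence_type` is a hypothetical helper, and prefixes not listed on the slide are deliberately left unmapped.

```python
# Prefix-to-type mapping transcribed from the RefSeq accession table.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_sequence_type(accession):
    """Return the sequence type for a RefSeq accession, or None if unknown."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000588.3", "NP_000579.2", "AAC77388"):
    print(acc, "->", refseq_sequence_type(acc))
```

The underscore check is what separates RefSeq accessions from GenBank protein accessions such as AAC77388, which fall through to `None`.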
RefSeq Status Codes
• REVIEWED: by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information, including additional publications and features.
• VALIDATED: in an initial review to provide the preferred sequence standard; not yet subjected to final review, at which time additional functional information may be provided.
• PROVISIONAL: the record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
• PREDICTED: may represent an ab initio prediction or may be partially supported by other transcript data; the protein is predicted.
• INFERRED: by genome sequence analysis.
• MODEL: provided via automated processing and not subjected to individual review or revision between builds.
Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
  • BankIt
  • Sequin
Other Databases at the NCBI
• dbSNP: nucleotide polymorphisms
• GEO (Gene Expression Omnibus): microarray and other expression data
• GEO DataSets: curated reports of GEO data; collections of biologically and mathematically comparable GEO Samples
• Structure: imported structures (PDB); Cn3D viewer, NCBI curation
• CDD (conserved domain database): protein families (COGs and KOGs); single domains (PFAM, SMART, CD)
NCBI’s SNP Database
• Primary and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
• 24 species
• Over 11 million refSNPs (rsXXXXXXX)
RefSNP
• Non-redundant
• Computational analysis
• BLAST hits to genome, mRNA, protein
Using Entrez An integrated database search and retrieval system
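Entrez: Database Integration
[Diagram: linked Entrez databases (PubMed abstracts, 3-D Structure, Taxonomy, Genomes, Protein sequences, Nucleotide sequences) with link types: word weight (PubMed abstracts), VAST (3-D Structure), phylogeny (Taxonomy), BLAST (protein and nucleotide sequences).]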
Entrez Nucleotide: HFE
• 218 records, including records that are not HFE
• [Title] field highlighted
Smarter Query: hfe[title] AND human[orgn]
• 39 records
• Curated HFE splice variants (11 total)
hfe[title] AND human[orgn] (cont'd)
• Primary data
Finding Primary Sequences
• Entrez Nucleotide
  • 99+% GenBank (primary data): srcdb ddbj/embl/genbank[properties] = 39,849,856 records
  • <1% RefSeq (curated data): srcdb refseq[properties] = 304,945 records
• Useful search terms in [Properties]:
  • srcdb: source database (e.g., srcdb genbank[prop])
  • gbdiv: GenBank division (e.g., gbdiv est[prop])
  • biomol: biomolecule type (e.g., biomol mrna[prop])
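The [Properties] terms above follow a fixed "key value[prop]" shape, so they are easy to assemble programmatically. The sketch below builds such terms and wraps the combined query in a URL for NCBI's E-utilities ESearch endpoint. The slides only show the Entrez web interface, so the endpoint, the `db="nucleotide"` default, and the helper names `prop_term`, `and_query`, and `esearch_url` are assumptions of this example. It only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (an assumption of this sketch; not in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def prop_term(key, value):
    """Format a [Properties] term, e.g. ('srcdb', 'refseq') -> 'srcdb refseq[prop]'."""
    return f"{key} {value}[prop]"


def and_query(*terms):
    """Join individual Entrez terms with the AND operator."""
    return " AND ".join(terms)


def esearch_url(query, db="nucleotide"):
    """Build (but do not send) an ESearch URL for the given query string."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


query = and_query("hfe[title]", "human[orgn]", prop_term("srcdb", "refseq"))
print(query)
print(esearch_url(query))
```

Swapping `prop_term("srcdb", "refseq")` for `prop_term("gbdiv", "est")` or `prop_term("biomol", "mrna")` gives the other restrictions listed on the slide.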
Database Queries
#1   hfe                                      116
#2   hfe[title] AND human[orgn]               42
#3   #2 AND srcdb refseq[prop]                11
#4   #2 AND srcdb ddbj/embl/genbank[prop]     31
#5   #2 AND gbdiv pri[prop]                   29
#6   #2 AND gbdiv est[prop]                   2
• Primate division: gbdiv pri[prop]
• EST division: gbdiv est[prop]
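The numbered queries above reuse earlier results by reference: "#2 AND srcdb refseq[prop]" means "the result of query #2, further restricted". A small sketch of that idea expands each "#N" reference into the full text of the earlier query. The function `expand_history` is a hypothetical illustration of how the references compose, not a description of how Entrez implements its history.

```python
import re


def expand_history(steps):
    """Expand '#N' references (1-based) in a list of query steps into full queries."""
    expanded = []
    for text in steps:
        def substitute(match):
            index = int(match.group(1))
            if not 1 <= index <= len(expanded):
                raise ValueError(f"#{index} refers to a step that has not run yet")
            # Parenthesize so the earlier query stays a single unit.
            return f"({expanded[index - 1]})"

        expanded.append(re.sub(r"#(\d+)", substitute, text))
    return expanded


steps = [
    "hfe",
    "hfe[title] AND human[orgn]",
    "#2 AND srcdb refseq[prop]",
    "#2 AND gbdiv est[prop]",
]
for number, query in enumerate(expand_history(steps), start=1):
    print(f"#{number}: {query}")
```

Each step can only refer to steps that came before it, which mirrors how the numbered list is built up one search at a time.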
Molecule Queries
#1   hfe                              116
#2   hfe[title] AND human[orgn]       42
#3   #2 AND biomol mrna[prop]         29
#4   #2 AND biomol genomic[prop]      13
• Genomic DNA: biomol genomic[prop]
• cDNA: biomol mrna[prop]
More Queries…
• Gene symbol: human hemochromatosis (HFE)
  hfe[sym] AND human[organism]
• Protein name: topoisomerase genes from Archaea
  topoisomerase[gene/protein name] AND archaea[organism]
• Chromosome, Links: genes on human chromosome 2 with OMIM links
  2[chromosome] AND gene omim[filter] AND human[organism]
• RefSeq status, variants: reviewed RefSeqs with transcript variants
  srcdb refseq reviewed[prop] AND has transcript variants[prop]
• Disease and Gene Ontology: membrane proteins linked to cancer
  integral to plasma membrane[gene ontology] AND cancer[dis]
Other Entrez Databases
• UniGene: rat clusters that have at least one mRNA
  rat[organism] NOT 0[mrna count]
• SNP: uniquely mapped microsatellites on human chr2
  microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
• UniSTS: markers on the Genethon map of human chromosome 12
  Genethon[Map Name] AND human[organism] AND 12[chromosome]
• Structure: structures of bacterial kinases with resolutions below 2 Å
  bacteria[organism] AND kinase AND 000.00:002.00[resolution]
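The Structure example above writes its resolution limits as zero-padded, fixed-width numbers joined by a colon (000.00:002.00). A minimal sketch of producing that text from two numbers follows. The helper name `resolution_range` is hypothetical, and whether Entrez strictly requires the padding is not stated in the slides; the code simply reproduces the format shown.

```python
def resolution_range(low, high):
    """Format a range term like '000.00:002.00[resolution]' from two values in Å."""
    if low < 0 or high < low:
        raise ValueError("expected 0 <= low <= high")
    # Width 6 with two decimals and zero fill reproduces the slide's '002.00' style.
    return f"{low:06.2f}:{high:06.2f}[resolution]"


term = resolution_range(0, 2)
print("bacteria[organism] AND kinase AND " + term)
```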
Related Sequences
[Figure: related sequences ranked from most similar to least similar.]