NCBI Molecular Biology Resources. A Field Guide part 1. September 29, 2004 ICGEB. NCBI Resources. About NCBI The NCBI Entrez System NCBI Sequence Databases NCBI Genomic Resources ** Intermission **
NCBI Molecular Biology Resources A Field Guide part 1 September 29, 2004 ICGEB
NCBI Resources • About NCBI • The NCBI Entrez System • NCBI Sequence Databases • NCBI Genomic Resources ** Intermission ** • NCBI Precomputed Resources • Behind the scenes
Bethesda, MD The National Institutes of Health
The National Center for Biotechnology Information • Created as a part of NLM in 1988 • Establish public databases • Perform research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information
Number of Users and Hits Per Day [chart, 1997-2003; annotation on the chart: Christmas & New Year’s Days] Currently averaging 10,000,000 to 35,000,000 hits per day!
A part of the NCBI Bookshelf • Part 1. The Databases • Part 2. Data Flow and Processing • Part 3. Querying and Linking the Data • Part 4. User Support
OMIM - A catalogue of genes involved with human disease processes - Detailed clinical and reference information - Curated and maintained by Johns Hopkins - Links to PubMed and sequence databases
The Entrez System: Gene, UniGene, CancerChromosomes, UniSTS, HomoloGene, SNP, PopSet, Genome, Nucleotide, GEO, GEO Datasets, Books, Taxonomy, PubMed, MeSH, OMIM, Protein, PMC, Journals, Domains, 3D Domains, Structure
Types of Databases
• Primary Databases
  • Original submissions by experimentalists
  • Database staff review and may organize the data, but do not add to or modify the submitted information
  • Records are “owned” and updated by their authors
  • Examples: GenBank, SNP, GEO
• Derivative Databases
  • Human-curated (compilation and correction of data)
    • Examples: Gene (LocusLink), Structure & Literature databases
  • Computationally-Derived
    • Example: UniGene
  • Combination
    • Examples: RefSeq, Genome Assembly, Domain databases
Primary vs. Derivative Sequence Databases [figure: Labs and Sequencing Centers submit sequences to GenBank, which is updated ONLY by submitters; Curators and Algorithms build RefSeq, Genome Assembly, and UniGene from those sequences, and these are updated continually by NCBI]
How to Query a Particular Database
• Query form: term1[tag delimiter] op term2[tag delimiter] op …
• op = AND, OR, NOT
• Boolean operators MUST be in ALL CAPS!
• tag delimiter = Entrez indexing field
• Examples of tag delimiters: Organism, Journal, User compounds, Author
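The query form above can be sketched in a few lines of Python. This is only an illustration: the helper names `fielded` and `build_query` are hypothetical and are not part of any NCBI tool. The sketch simply attaches a tag to each term and forces the Boolean operators into capitals.

```python
# Hypothetical helpers (not an NCBI API): assemble an Entrez-style query string.
VALID_OPS = {"AND", "OR", "NOT"}


def fielded(term, field=None):
    """Return term[field], or the bare term when no field is given."""
    return f"{term}[{field}]" if field else term


def build_query(parts):
    """parts alternates terms and operators: [term, op, term, op, term, ...]."""
    if len(parts) % 2 == 0:
        raise ValueError("query must start and end with a term")
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            op = part.upper()  # Boolean operators MUST be in ALL CAPS
            if op not in VALID_OPS:
                raise ValueError(f"unknown operator: {part!r}")
            out.append(op)
        else:
            out.append(part)
    return " ".join(out)


query = build_query([fielded("thyroid peroxidase", "title"), "and",
                     fielded("human", "orgn")])
print(query)
```

Passing a lowercase operator is tolerated here and upper-cased on output; a stricter version could reject it instead.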
Sample Query Brauninger a c-src kinase Organism Journal User compounds Author
Using Fields to Find Records
Search fields: Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume
• Most useful search field, [Organism]:
  • human[orgn] …or… bacteria[orgn]
• Useful search terms in [Properties] field:
  • srcdb: “source database” ( srcdb genbank[prop] )
  • gbdiv: “genbank division” ( gbdiv est[prop] )
  • biomol: “biomolecular type” ( biomol mrna[prop] )
Using Field Limits (search history: query, then number of records found)
#1: thyroid peroxidase (335)
#2: thyroid peroxidase AND human[orgn] (291)
#3: thyroid peroxidase[title] AND human[orgn] (166)
#4: #3 AND srcdb refseq[prop] (5)
#5: #3 AND srcdb ddbj/embl/genbank[prop] (161)
#6: #5 AND gbdiv est[prop] (20)
#7: #5 AND gbdiv pri[prop] (141)
#8: #7 AND biomol genomic[prop] (25)
#9: #7 AND biomol mrna[prop] (116)
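The history entries refer to earlier searches by number, so each one stands for a longer query. A minimal sketch of that expansion follows; the function name `expand` is hypothetical, the query strings and counts are copied from the slide, and the parentheses added around each substituted query are an assumption about how the nesting should be grouped.

```python
import re

# Search history from the slide: number -> (query, records found).
HISTORY = {
    1: ("thyroid peroxidase", 335),
    2: ("thyroid peroxidase AND human[orgn]", 291),
    3: ("thyroid peroxidase[title] AND human[orgn]", 166),
    4: ("#3 AND srcdb refseq[prop]", 5),
    5: ("#3 AND srcdb ddbj/embl/genbank[prop]", 161),
    6: ("#5 AND gbdiv est[prop]", 20),
    7: ("#5 AND gbdiv pri[prop]", 141),
    8: ("#7 AND biomol genomic[prop]", 25),
    9: ("#7 AND biomol mrna[prop]", 116),
}


def expand(number, history=HISTORY):
    """Recursively replace every #N reference with the query it names."""
    query, _count = history[number]
    return re.sub(r"#(\d+)",
                  lambda m: "(" + expand(int(m.group(1)), history) + ")",
                  query)


def count(number, history=HISTORY):
    """Number of records the slide reports for a history entry."""
    return history[number][1]


print(expand(9))
```

In this example each pair of limits splits its parent set exactly: #4 + #5 = #3, #6 + #7 = #5, and #8 + #9 = #7.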
Complex searches you can do with Preview/Index • Terms used (and indexed) in Entrez fields can be searched to gain useful information! • Example: How many rat UniGene clusters contain at least one mRNA? • Select the UniGene database. • Find all the rat records. • Find those that have ≥ 1 mRNAs (“not 0”). [screenshot text: NOT, rat [organism]]
Complex Queries with Preview/Index NOT 0 [mRNA Count]
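The same search can also be written as a single query string and sent to Entrez programmatically. The sketch below only builds the request URL and does not contact NCBI. Everything about it is an assumption beyond the slides: they show the web interface only, the E-utilities `esearch` endpoint is not mentioned in them, and the database name `unigene` is a guess at the name the service would expect.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: NCBI E-utilities esearch endpoint (not mentioned on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an esearch URL; urlencode escapes the spaces and brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Rat records, excluding clusters whose mRNA count is 0.
term = "rat[organism] NOT 0[mRNA Count]"
url = esearch_url("unigene", term)  # "unigene" is an assumed database name
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)
print(decoded["term"][0])
```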
Primary Sequence Database: GenBank • Nucleotide-only sequence database • Archival in nature • Submission of GenBank Data to NCBI • Direct submissions of individual records via Web (BankIt, Sequin) • Batch submissions of bulk sequences via Email (EST, GSS, STS) • FTP accounts for Sequencing Centers
GenBank growth [chart, ’83-’04: Sequence Records (millions) and Total Base Pairs (billions)] • GenBank Release 143: 37.3 million records, 41.8 billion nucleotides • Average doubling time ≈ 14 months
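A doubling time translates into a growth factor with simple arithmetic. The sketch below assumes clean exponential growth, N(t) = N0 · 2^(t / 14 months), which the slide's "average" doubling time only approximates. The function name `projected` is hypothetical, and the projection is an illustration, not a forecast made in the talk.

```python
def projected(n0, months, doubling_months=14.0):
    """Size after `months`, assuming steady exponential growth."""
    return n0 * 2 ** (months / doubling_months)


RELEASE_143_GBASES = 41.8  # billions of nucleotides, from the slide

# Growth factor over one year implied by a 14-month doubling time.
annual_factor = projected(1.0, 12)
print(f"growth per year: x{annual_factor:.2f}")

# One doubling time after Release 143 the total would be twice as large.
print(projected(RELEASE_143_GBASES, 14))
```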
GenBank Release 143, August 2004
• 37,343,937 Records
• 41,808,045,653 Nucleotides
• >170,000 Species
• 160 Gigabytes, 657 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet: ftp://ftp.ncbi.nih.gov/genbank/
The International Sequence Database Collaboration [figure: three partner databases, each receiving Submissions and Updates]
• GenBank: NCBI (NIH); Entrez, Sequin, BankIt, ftp
• EMBL: EBI; SRS
• DDBJ: CIB, NIG; getentry
Organization of GenBank: GenBank Divisions (gbdiv)
Records are divided into 17 Divisions.
• 1 Patent (11 files)
• 5 High Throughput (BULK Divisions): Batch Submission (Email and FTP); Inaccurate; Poorly characterized
  EST (335) Expressed Sequence Tag
  GSS (116) Genome Survey Sequence
  HTG (61) High Throughput Genomic
  STS (5) Sequence Tagged Site
  HTC (6) High Throughput cDNA
• 11 Traditional Divisions: Direct Submissions (Sequin and BankIt); Accurate; Well characterized
  PRI (28) Primate
  PLN (12) Plant and Fungal
  BCT (10) Bacterial and Archaeal
  INV (6) Invertebrate
  ROD (13) Rodent
  VRL (3) Viral
  VRT (7) Other Vertebrate
  MAM (1) Mammalian (ex. ROD and PRI)
  PHG (1) Phage
  SYN (1) Synthetic (cloning vectors)
  UNA (1) Unannotated
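For scripting, the division codes can be kept in a small lookup table. This is a minimal sketch: the names `BULK`, `TRADITIONAL`, and `describe` are hypothetical, and the Patent division is left out because the slide does not show its code.

```python
# Division codes and names as listed on the slide (Patent omitted: no code shown).
BULK = {
    "EST": "Expressed Sequence Tag",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA",
}
TRADITIONAL = {
    "PRI": "Primate",
    "PLN": "Plant and Fungal",
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "ROD": "Rodent",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
    "MAM": "Mammalian (ex. ROD and PRI)",
    "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)",
    "UNA": "Unannotated",
}


def describe(code):
    """Return (group, full name) for a division code, case-insensitively."""
    code = code.upper()
    if code in BULK:
        return ("bulk", BULK[code])
    if code in TRADITIONAL:
        return ("traditional", TRADITIONAL[code])
    raise KeyError(f"unknown division code: {code}")


print(describe("est"))
print(describe("INV"))
```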
File Formats of the Sequence Databases • Each sequence is represented by a text record called a flat file. • GenBank/GenPept (useful for scientists) • FASTA (the simplest format) • ASN.1 & XML (useful for programmers)
A Traditional “GenBank” Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Callouts on the record:
• LOCUS line: Length (3808 bp); molecule type (mRNA = cDNA, DNA = genomic); Division (INV); Date of most recent modification (02-MAR-2000)
• DEFINITION = Title
• ACCESSION: Accession Number (AF062069)
• VERSION: Accession.Version (AF062069.2) and GI Number (GI:7144484)
• ORGANISM: NCBI’s Taxonomy
• REFERENCE: References
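The LOCUS and VERSION lines pack several of these fields into one line each, so they are easy to split apart. The sketch below assumes the seven-token LOCUS layout shown on this slide; LOCUS lines in other records may carry additional columns, which this code does not handle. The function names `parse_locus` and `parse_version` are hypothetical.

```python
import datetime

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def parse_locus(line):
    """Split 'LOCUS name length bp moltype division date' into fields."""
    parts = line.split()
    if len(parts) != 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    day, mon, year = parts[6].split("-")
    return {
        "name": parts[1],
        "length": int(parts[2]),
        "molecule": parts[4],
        "division": parts[5],
        "modified": datetime.date(int(year), MONTHS.index(mon.upper()) + 1, int(day)),
    }


def parse_version(line):
    """Split 'VERSION accession.version GI:number' into fields."""
    _, acc_ver, gi = line.split()
    accession, version = acc_ver.rsplit(".", 1)
    return {"accession": accession, "version": int(version),
            "gi": int(gi.split(":")[1])}


locus = parse_locus("LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000")
version = parse_version("VERSION AF062069.2 GI:7144484")
print(locus)
print(version)
```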
Lower down in the GenBank Record: the Feature Table
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
                     ...
BASE COUNT      201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
        ...
     3781 aagatacagt aactagggaa aaaaaaaa
//
Callout on the CDS feature: /protein_id="AAC16332.2" and /db_xref="GI:7144485" are the GenPept Protein ID.
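The CDS location gives a quick sanity check on the feature: a simple start..end span beginning at codon_start=1 should cover a whole number of codons. A minimal sketch follows, handling only plain `start..end` locations and not joins, complements, or partial markers. The function name `span_length` is hypothetical, and the final subtraction assumes the annotated span ends with a stop codon, as is usual for a complete cds.

```python
def span_length(location):
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    if end < start:
        raise ValueError("end precedes start")
    return end - start + 1


cds_bases = span_length("258..3302")   # CDS from the record above
codons, leftover = divmod(cds_bases, 3)
residues = codons - 1                  # assumption: last codon is the stop

print(cds_bases, codons, leftover, residues)
```

For this record the span is 3045 bases, which is 1015 codons with nothing left over, so the protein would be 1014 residues long under the stop-codon assumption.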
FASTA format
>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG
AAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTT
CTTCCCCTTCATCTCGAGAGGGAAAGAACTCCTTTGGGGAAAGCCTGAGGAGTCTCGTGTCTCTAGCGTC
TTGGAGGAAAGCAAGCGCCTGGTGGACACCGCCATGTACGCCACGATGCAGAGAAACCTCAAGAAAAGAG
GAATCCTTTCTGGAGCTCAGCTTCTGTCTTTTTCCAAACTTCCTGAGCCAACAAGCGGAGTGATTGCCCG
AGCAGCAGAGATAATGGAAACATCAATACAAGCGATGAAAAGAAAAGTCAACCTGAAAACTCAACAATCA
CAGCATCCAACGGATGCTTTATCAGAAGATCTGCTGAGCATCATTGCAAACATGTCTGGATGTCTCCCTT
ACATGCTGCCCCCAAAATGCCCAAACACTTGCCTGGCGAACAAATACAGGCCCATCACAGGAGCTTGCAA
CAACAGAGACCACCCCAGATGGGGCGCCTCCAACACGGCCCTGGCACGATGGCTCCCTCCAGTCTATGAG
GACGGCTTCAGTCAGCCCCGAGGCTGGAACCCCGGCTTCTTGTACAACGGGTTCCCACTGCCCCCGGTCC
GGGAGGTGACAAGACATGTCATTCAAGTTTCAAATGAGGTTGTCACAGATGATGACCGCTATTCTGACCT
CCTGATGGCATGGGGACAATACATCGACCACGACATCGCGTTCACACCACAGAGCACCAGCAAAGCTGCC
...
>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP
KCPNTCLANKYRPITGACNNRDHPRWGASNTALARWLPPVYEDGFSQPRGWNPGFLYNGFPLPPVREVTR
HVIQVSNEVVTDDDRYSDLLMAWGQYIDHDIAFTPQSTSKAAFGGGSDCQMTCENQNPCFPIQLPEEARP
AAGTACLPFYRSSAACGTGDQGALFGNLSTANPRQQMNGLTSFLDASTVYGSSPALERQLRNWTSAEGLL
RVHGRLRDSGRAYLPFVPPRAPAACAPEPGNPGETRGPCFLAGDGRASEVPSLTALHTLWLREHNRLAAA
LKALNAHWSADAVYQEARKVVGALHQIITLRDYIPRILGPEAFQQYVGPYEGYDSTANPTVSNVFSTAAF
RFGHATIHPLVRRLDASFQEHPDLPGLWLHQAFFSPWTLLRGGGLDPLIRGLLARPAKLQVQDQLMNEEL
TERLFVLSNSSTLDLASINLQRGRDHGLPGYNEWREFCGLPRLETPADLSTAIASRSVADKILDLYKHPD
NIDVWLGGLAENFLPRARTGPLFACLIGKQMKALRDGDWFWWENSHVFTDAQRRELEKHSLSRVICDNTG
LTRVPMDAFQVGKFPEDFESCDSITGMNLEAWRETFPQDDKCGFPESVENGDFVHCEESGRRVLVYSCRH
GYELQGREQLTCTQEGWDFQPPLCKDVNECADGAHPPCHASARCRNTKGGFQCLCADPYELGDDGRTCVD
...
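FASTA is simple enough to read with a few lines of code: a ">" line starts a record, and every following line is sequence until the next ">". The sketch below is one minimal way to do it, with hypothetical function names. The header parser assumes the pipe-delimited `gi|number|db|accession.version|locus` layout seen in the two headers above, and the sample holds only the first 60 bases of the nucleotide record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ncbi_header(header):
    """Split 'gi|N|db|acc.ver|locus description' (assumed layout) into fields."""
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return {"gi": fields[1], "db": fields[2], "accession_version": fields[3],
            "locus": fields[4] if len(fields) > 4 else "",
            "description": description}


SAMPLE = """>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAG
TTACAGCCGTGAAAATTACTCAGCAGTGCA
"""

records = parse_fasta(SAMPLE)
info = parse_ncbi_header(records[0][0])
print(info["accession_version"], len(records[0][1]))
```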
Abstract Syntax Notation: ASN.1 [figure labels: GenPept, GenBank, ASN.1, FASTA Protein, FASTA Nucleotide]
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Human thyroid peroxidase mRNA, partial cds., and translated products" ,
    source {
      org {
        taxname "Homo sapiens" ,
        common "human" ,
        db {
          {
            db "taxon" ,
            tag id 9606 } } ,
        orgname {
          name binomial {
            genus "Homo" ,
            species "sapiens" } ,
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,
Bulk Divisions
• Batch Submission and htg (email and ftp)
• Inaccurate
• Poorly Characterized
• Expressed Sequence Tag: 1st pass single read cDNA
• Genome Survey Sequence: 1st pass single read gDNA
• High Throughput Genomic: incomplete sequences of genomic clones
• Sequence Tagged Site: PCR-based mapping reagents
EST Division: Expressed Sequence Tags
gbdiv_est[Properties]
[figure: nucleus, 30,000 genes; RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; reads from the 5’ and 3’ ends shown as gatccantgccatacg and ctcgccaattcnntcg]
• isolate unique clones
• sequence once from each end
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
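Single-pass reads like these contain uncalled bases, written as N (or n). Counting them is a quick way to gauge read quality. The sketch below is only an illustration with hypothetical function names; it is not a quality measure used by GenBank. The two short reads are the example fragments shown in the EST figure.

```python
def count_ambiguous(read):
    """Number of uncalled bases (N or n) in a read."""
    return read.upper().count("N")


def ambiguous_fraction(read):
    """Fraction of the read made up of uncalled bases."""
    if not read:
        raise ValueError("empty read")
    return count_ambiguous(read) / len(read)


five_prime = "gatccantgccatacg"   # fragment from the figure
three_prime = "ctcgccaattcnntcg"  # fragment from the figure

for name, read in [("5'", five_prime), ("3'", three_prime)]:
    print(name, len(read), count_ambiguous(read), ambiguous_fraction(read))
```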
Genome Sequencing - HTG, GSS, (WGS) [figure: whole BAC insert (or genome); steps: isolating, shredding, cloning, sequencing; single reads go to the GSS division or trace archive; assembly gives Draft Sequence (HTG division) or whole genome shotgun assemblies (traditional division)]
HTG Division: Honeybee Draft Sequences • Unfinished sequences of BACs • Gaps and unordered pieces • Finished sequences move to traditional GenBank division
Other Primary Databases • GEO (Gene Expression Omnibus) • Searchable microarray data repository • SNP (Single Nucleotide Polymorphism) • Allelic variations (including minisatellites/simple sequence repeats and insertions/deletions)
Redesigned with new features • Submit and update data • Query the database: • gene identifiers • field information • sequence • Browse datasets • Download data
• GPL: Platform descriptions (Submitted by Manufacturer*)
• GSM: Raw/processed spot intensities from a single slide/chip (Submitted by Experimentalists)
• GSE: Grouping of slide/chip data, “a single experiment” (Submitted by Experimentalists)
• GDS: Grouping of experiments (Curated by NCBI)
• Searched through Entrez GEO and Entrez GEO Datasets
GDS177: CMV infection of HFF cells
Comparison of gene expression profiles of HFF cells infected with CMV strains
FHCRC non-commercial human 18K array
src1: CMV infected fibroblasts
src2: uninfected fibroblasts
GSM827 : FHCMV-T-1
GSM825 : FHCMV-T-2
GSM828 : FHCMV-T-3
GSM829 : FHCMV-H-1
GSM830 : FHCMV-H-2
GSM831 : FHCMV-H-3
GSM832 : CMV_AD169-2
GSM833 : CMV_AD169-3
Expression
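GEO accessions carry their record type in the three-letter prefix, so a script can sort them without looking anything up. A minimal sketch, using the four prefixes and descriptions from the slides; the function name `classify` is hypothetical, and the pattern assumes an accession is just a prefix followed by digits.

```python
import re

# Prefix -> description, as given on the GEO slide.
GEO_PREFIXES = {
    "GPL": "Platform descriptions",
    "GSM": "Raw/processed spot intensities from a single slide/chip",
    "GSE": "Grouping of slide/chip data (a single experiment)",
    "GDS": "Grouping of experiments",
}
_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify(accession):
    """Return (prefix, number, description) for a GEO accession."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    prefix, number = match.group(1), int(match.group(2))
    return prefix, number, GEO_PREFIXES[prefix]


for acc in ["GDS177", "GSM827", "GSM833"]:
    print(classify(acc))
```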