NCBI Molecular Biology Resources

NCBI Molecular Biology Resources A Field Guide August 2-3, 2005 University of Massachusetts

NCBI Resources • The NCBI Entrez System • NCBI Sequence Databases • Primary data: GenBank • Derivative data: RefSeq, Gene, Genome • Beyond RefSeq: UniGene, Trace Archive • NCBI Genomic Resources ** Intermission ** • BLAST • Protein Structure and Function • Sequence polymorphisms and phenotypes

Bethesda, MD The National Institutes of Health

The National Center for Biotechnology Information • Created as a part of NLM in 1988 • Establish public databases • Perform research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information

Web Access • Text: Entrez • Sequence: BLAST • Structure: VAST

NCBI Web Traffic: users per day [chart; annotations mark Christmas and New Year’s Day]

The NCBI ftp site 30,000 files per day 620 Gigabytes per day

What does NCBI do? • NCBI accepts submissions of primary data • NCBI develops tools to analyze these data • NCBI uses these tools to create derivative databases based on the primary data • NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system

Types of Databases • Primary Databases • Original submissions by experimentalists • Content controlled by the submitter • Examples: GenBank, SNP, GEO, PubChem Substance • Derivative Databases • Built from primary data • Content controlled by third party (NCBI) • Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound

Primary vs. Derivative Databases [Figure] • GenBank: sequences from sequencing centers and labs; divisions EST, STS, GSS, HTG, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT; updated ONLY by submitters • Derivative databases: UniGene and UniSTS (built by algorithms), RefSeq (Annotation Pipeline; LocusLink and Genomes Pipelines, with curators); updated continually by NCBI

What is Entrez? • A system of 29 linked databases • A text search engine • A tool for finding biologically linked data • A retrieval engine • A virtual workspace for manipulating large datasets

The Entrez System: Text Searches

Entrez Databases • Each record is assigned a UID • unique integer identifier for internal tracking • GI number for Nucleotide • Each record is given a Document Summary • a summary of the record’s content (DocSum) • Each record is assigned links to biologically related UIDs • Each record is indexed by data fields • [author], [title], [organism], and many others

Entrez Taxonomy The backbone of NCBI [organism]

An Entrez Database - Nucleotide • GenBank: Primary Data (97.9%) • original submissions by experimentalists • submitters retain editorial control of records • archival in nature • RefSeq: Derivative Data (2.1%) • curated by NCBI staff • NCBI retains editorial control of records • record content is updated continually

Entrez Nucleotide Primary Data • DDBJ / EMBL / GenBank 56,865,268 Derivative Data • RefSeq 1,226,084 • PDB 5,973 • Third Party Annotation 4,650 Total 58,101,975
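The percentages quoted for Entrez Nucleotide (97.9% GenBank, 2.1% RefSeq) can be checked against the record counts on this slide. The following is a minimal sketch; the names `NUCLEOTIDE_COUNTS` and `shares` are illustrative and not part of any NCBI tool.

```python
# Record counts copied from the Entrez Nucleotide slide.
NUCLEOTIDE_COUNTS = {
    "DDBJ/EMBL/GenBank": 56_865_268,
    "RefSeq": 1_226_084,
    "PDB": 5_973,
    "Third Party Annotation": 4_650,
}


def shares(counts):
    """Return each source's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


TOTAL = sum(NUCLEOTIDE_COUNTS.values())
SHARES = shares(NUCLEOTIDE_COUNTS)

if __name__ == "__main__":
    print(f"Total records: {TOTAL:,}")
    for name, pct in SHARES.items():
        print(f"{name}: {pct}%")
```

The sum of the four counts reproduces the slide's total, and the two largest shares round to the percentages used elsewhere in the deck.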

What is GenBank? NCBI’s Primary Sequence Database • Nucleotide-only sequence database • Archival in nature • Each record is assigned a stable accession number • GenBank Data • Direct submissions (traditional records) • Batch submissions (EST, GSS, STS) • ftp accounts (genome data) • Three collaborating databases • GenBank • DNA Database of Japan (DDBJ) • European Molecular Biology Laboratory (EMBL) Database

The International Sequence Database Collaboration [Figure] • GenBank (NCBI, NIH): submissions and updates via Sequin, BankIt, and ftp; retrieval through Entrez • EMBL (EBI): submissions and updates; retrieval through SRS • DDBJ (CIB, NIG): submissions and updates; retrieval through getentry

GenBank Releases Release 148 June 2005 45,236,251 Records 49,398,852,122 Nucleotides >140,000 Species 172 Gigabytes 785 files • full release every two months • incremental and cumulative updates daily • available only through internet ftp://ftp.ncbi.nih.gov/genbank/

The Growth of GenBank Release 148: 45.2 million records 49.4 billion nucleotides Average doubling time ≈ 14 months*
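To make the doubling-time figure concrete, here is a small sketch of the arithmetic, assuming growth is purely exponential with a constant 14-month doubling time (a simplification; the slide reports only an average). The function name `projected_size` is illustrative.

```python
# Release 148 figures copied from the slides.
RELEASE_148_NUCLEOTIDES = 49_398_852_122
RELEASE_148_RECORDS = 45_236_251


def projected_size(current, months_ahead, doubling_months=14.0):
    """Project a size forward assuming constant exponential doubling."""
    return current * 2 ** (months_ahead / doubling_months)


# Average record length in Release 148, in nucleotides per record.
mean_record_length = RELEASE_148_NUCLEOTIDES / RELEASE_148_RECORDS

if __name__ == "__main__":
    print(f"Mean record length: {mean_record_length:.0f} nt")
    two_years = projected_size(RELEASE_148_NUCLEOTIDES, 24)
    print(f"Projected nucleotides after 24 months: {two_years:.3e}")
```

Under this assumption the database doubles after 14 months and quadruples after 28; real growth need not follow the average.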

GenBank Divisions • Traditional • Direct submissions (Sequin/BankIt) • Accurate (~1 error per 10,000 bp) • Well characterized • Organized by taxonomy • PRI (28) Primate • ROD (14) Rodent • PLN (13) Plant and Fungal • BCT (10) Bacterial/Archaeal • INV (7) Invertebrate • VRT (7) Other Vertebrate • VRL (4) Viral • MAM (2) Mammalian • PHG (1) Phage • SYN (1) Synthetic • UNA (1) Unannotated • Bulk • From sequencing projects • Batch submissions (ftp/email) • Inaccurate • Poorly characterized • Organized by sequence type • EST (349) Expressed Sequence Tag • GSS (120) Genome Survey Sequence • HTG (62) High Throughput Genomic • HTC (6) High Throughput cDNA • STS (5) Sequence Tagged Site

A Traditional GenBank Record: The Flatfile Format (Header, Feature Table, Sequence)

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
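As a rough illustration of how the flatfile header and feature table can be read by a program, the sketch below parses the LOCUS line of this record and derives the protein length from the CDS location. It assumes a modern eight-field LOCUS line and a simple `start..end` location for a complete CDS that includes its stop codon; real records can be more complex (joins, complements, partial features). The function names are hypothetical.

```python
import re

# LOCUS line of the example record, fields separated by whitespace.
LOCUS_LINE = "LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004"


def parse_locus(line):
    """Split a LOCUS line into named fields (assumes the 8-field layout)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError("unexpected LOCUS line layout")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }


def cds_protein_length(location):
    """Protein length for a simple 'start..end' CDS that includes a stop codon."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError("only simple start..end locations are handled")
    start, end = int(match.group(1)), int(match.group(2))
    n_bases = end - start + 1
    if n_bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return n_bases // 3 - 1  # subtract the stop codon


if __name__ == "__main__":
    print(parse_locus(LOCUS_LINE))
    print(cds_protein_length("54..1784"))
```

For the CDS at 54..1784 this gives 576 residues, which matches the length of the /translation qualifier in the record.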

An Example Record: M17755
Indexing for Nucleotide UID 4680720
Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank
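The indexing shown for M17755 can be pictured as a mapping from field names to indexed terms. The toy sketch below is only a mental model, not how Entrez is implemented; it uses case-insensitive substring matching as a stand-in for real term indexing, and the names `RECORD_INDEX` and `search` are made up for the example.

```python
# Toy index: UID -> field -> indexed terms, copied from the slide.
# The title is truncated on the slide and is kept truncated here.
RECORD_INDEX = {
    4680720: {
        "primary accession": ["M17755"],
        "title": ["Homo sapiens thyroid peroxidase (TPO) mRNA…"],
        "organism": ["Homo sapiens"],
        "sequence length": ["3060"],
        "modification date": ["1999/04/26"],
        "properties": ["biomol mrna", "gbdiv pri", "srcdb genbank"],
    }
}


def search(index, term, field):
    """Return UIDs whose given field contains the term (simplified matching)."""
    needle = term.lower()
    hits = []
    for uid, fields in index.items():
        values = fields.get(field, [])
        if any(needle in value.lower() for value in values):
            hits.append(uid)
    return hits


if __name__ == "__main__":
    print(search(RECORD_INDEX, "thyroid peroxidase", "title"))
    print(search(RECORD_INDEX, "gbdiv pri", "properties"))
```

A field-limited query such as `gbdiv pri[properties]` corresponds to looking up one term in one field of this mapping.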

M17755: Feature Table TPO [gene name] CDS position in bp thyroiditis [text word] thyroid peroxidase [protein name] protein accession

Sequence: 99.99% Accurate The sequence itself is not indexed… Use BLAST for that!

Entrez Protein • GenPept (DDBJ, EMBL, GenBank) 4,444,405 • RefSeq 1,753,167 • PIR 222,395 • Swiss-Prot 189,005 • PDB 68,621 • PRF 12,079 • Third Party Annotation 4,219 Total 6,693,891

Protein Sources and Links • PIR: no mRNA! • RefSeq → NM_000547 • Swiss-Prot: no mRNA! • GenPept → M17755

Sequence Revisions First seen at NCBI, not first seen at GenBank! Version and GI change only if the sequence changes The accession number always retrieves the most recent version
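The versioning rule can be expressed in a few lines. This is a minimal sketch, assuming identifiers of the form `ACCESSION.VERSION` as in `AY182241.2`; the helper names are hypothetical.

```python
def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, sep, version = identifier.partition(".")
    return accession, (int(version) if sep else None)


def sequence_changed(old_id, new_id):
    """True if two identifiers share an accession but differ in version.

    Per the slide, the version changes only when the sequence changes.
    """
    old_acc, old_ver = split_version(old_id)
    new_acc, new_ver = split_version(new_id)
    if old_acc != new_acc:
        raise ValueError("identifiers refer to different accessions")
    return old_ver != new_ver


if __name__ == "__main__":
    print(split_version("AY182241.2"))
    print(sequence_changed("AY182241.1", "AY182241.2"))
```

A bare accession such as `M17755` carries no version and, per the slide, always retrieves the most recent one.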

Update without a Sequence Change June 15, 1989! GenBank came to NCBI in 1992!

Update with a Sequence Change

GenBank File Formats • ASN.1 (the raw data) • flat file • XML (4 flavors) • FASTA

Toolbox Sources
ftp> open ftp.ncbi.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

NCBI Toolbox (excerpt from asn2ff.c)
/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Text Searches in Entrez • Query form: term1[limit] OP term2[limit] OP … • limit = Entrez indexing field (organism, author, …) • OP = AND, OR, NOT • For "term1 term2" with no [limit] specified, each term is tested in order: • Organism? → [organism] • Journal? → [journal] • User compounds? → search as phrase • Author? → [author] • else → [All Fields]
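The query form term1[limit] OP term2[limit] can be assembled programmatically. The sketch below only builds query strings; it does not contact Entrez, and the helper names `limited` and `combine` are illustrative.

```python
ALLOWED_OPERATORS = {"AND", "OR", "NOT"}


def limited(term, field=None):
    """Attach an Entrez field limit to a term, e.g. human[orgn]."""
    return f"{term}[{field}]" if field else term


def combine(op, *clauses):
    """Join clauses with one Boolean operator (AND, OR, or NOT)."""
    if op not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {op}")
    return f" {op} ".join(clauses)


if __name__ == "__main__":
    query = combine(
        "AND",
        limited("thyroid peroxidase", "title"),
        limited("human", "orgn"),
    )
    print(query)  # thyroid peroxidase[title] AND human[orgn]
```

Leaving `field` unset yields a bare term, which Entrez then interprets through its own automatic term mapping.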

Entrez Tabs • Limits: Provides a simple form for applying commonly used Entrez limits • Preview/Index: Allows access to the full indexing of each Entrez database and aids in constructing complex queries • History: Provides access to previous searches in the current Entrez database • Clipboard: A temporary storage area for selected records • Details: Displays the detailed parsing of the current Entrez query, and lists errors and terms without matches

Programming Entrez: E-Utilities
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
Utility    Input                 Output
ESearch    Entrez query          UID list or History
ESummary   UID list or History   Document summaries
EFetch     UID list or History   Formatted data
ELink      UID list or History   UID list or History
EPost      UID list              History
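As a hedged illustration of the ESearch and EFetch steps, the sketch below builds request URLs without sending them. The base URL `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` and the parameter values are assumptions that do not appear on the slide; check the E-Utilities help page for the currently supported endpoints and options before relying on them.

```python
from urllib.parse import urlencode

# Assumption: this base URL is not given on the slide.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, use_history=False):
    """Build an ESearch URL: Entrez query in, UID list (or History) out."""
    params = {"db": db, "term": term}
    if use_history:
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, uids, rettype="gb", retmode="text"):
    """Build an EFetch URL: UID list in, formatted records out."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("nucleotide", "thyroid peroxidase[title] AND human[orgn]"))
    print(efetch_url("nucleotide", [4680720]))
```

Only URL construction is shown, so the example runs offline; fetching the URLs and parsing the responses is left out deliberately.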

Finding Primary Sequences • Search Entrez Nucleotide • 97.9% GenBank (primary data) • 2.1% RefSeq (curated data) Possible queries we’ve seen so far… M17755 [primary accession] TPO [gene name] thyroid peroxidase [title] thyroiditis [text word] Homo sapiens [organism] thyroid peroxidase [protein name] 3060 [sequence length] 1999/04/26 [modification date] biomol mrna [properties] gbdiv pri [properties] srcdb genbank [properties]

A Starting Query: find nucleotide records for human thyroid peroxidase • human thyroid peroxidase → 309 records; parsed as (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields]) • Field limit! human[organism] AND thyroid peroxidase → 298 records; parsed as ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields]) • 11 records aren’t human sequences!!

Limit by Title and Database • Entrez Nucleotide • GenBank: srcdb ddbj/embl/genbank[properties] • RefSeq: srcdb refseq[properties]
#1: thyroid peroxidase AND human[orgn] 298
#2: thyroid peroxidase[title] AND human[orgn] 169
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 164 (primary data)

Limit by GenBank Division • EST Division: gbdiv est[prop] • Primate Division: gbdiv pri[prop]
#1: thyroid peroxidase AND human[orgn] 298
#2: thyroid peroxidase[title] AND human[orgn] 169
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 164
#5: #4 AND gbdiv est[prop] 20
#6: #4 AND gbdiv pri[prop] 144 (traditional GenBank records)

Limit by Biomolecule Type • Genomic DNA: biomol genomic[prop] • cDNA: biomol mrna[prop]
#1: thyroid peroxidase AND human[orgn] 298
#2: thyroid peroxidase[title] AND human[orgn] 169
#3: #2 AND srcdb refseq[properties] 5
#4: #2 AND srcdb ddbj/embl/genbank[properties] 164
#5: #4 AND gbdiv est[prop] 20
#6: #4 AND gbdiv pri[prop] 144
#7: #6 AND biomol genomic[prop] 26 (genomic DNA)
#8: #6 AND biomol mrna[prop] 118 (mRNA / cDNA)
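The numbered history entries are shorthand for progressively longer queries. The sketch below expands `#n` references into a full query string and checks that each pair of limits partitions its parent set, using the counts from the slides. It follows the GenBank Division slide in building #5 and #6 on #4; the names `HISTORY`, `COUNTS`, and `expand` are illustrative.

```python
import re

# Search history as shown on the slides (#5 and #6 built on #4).
HISTORY = {
    1: "thyroid peroxidase AND human[orgn]",
    2: "thyroid peroxidase[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[properties]",
    4: "#2 AND srcdb ddbj/embl/genbank[properties]",
    5: "#4 AND gbdiv est[prop]",
    6: "#4 AND gbdiv pri[prop]",
    7: "#6 AND biomol genomic[prop]",
    8: "#6 AND biomol mrna[prop]",
}

# Record counts reported for each history entry.
COUNTS = {1: 298, 2: 169, 3: 5, 4: 164, 5: 20, 6: 144, 7: 26, 8: 118}


def expand(history, n):
    """Recursively replace #k references with the parenthesized query #k."""
    def substitute(match):
        return "(" + expand(history, int(match.group(1))) + ")"
    return re.sub(r"#(\d+)", substitute, history[n])


if __name__ == "__main__":
    print(expand(HISTORY, 8))
    # Each pair of limits splits its parent set into two parts.
    print(COUNTS[3] + COUNTS[4] == COUNTS[2])
    print(COUNTS[5] + COUNTS[6] == COUNTS[4])
    print(COUNTS[7] + COUNTS[8] == COUNTS[6])
```

Expanding #8 shows that the final 118 records satisfy the title, organism, source database, division, and biomolecule limits all at once.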

Limit by Protein Name • thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop] • 118 records with [title] → 4 records with [protein name]

Entrez Document Summaries Links menu Click the accession to view the record Links to other Entrez databases computed for M17755

Entrez Links for GI 4680720 Gene annotation based on M17755 Full text online articles about M17755 All polymorphisms in the TPO gene DNA/RNA sequences similar to M17755 Graphical view of TPO gene annotation Human phenotypes involving TPO Microarray datasets for M17755 Protein translation of M17755 Literature abstracts about M17755 Sequence polymorphisms in M17755 Source organism of M17755 STS markers in the TPO gene TPO links beyond NCBI

Viewing M17755

GenBank Sequences for Human TPO Which one is the best sequence???

RefSeq: NCBI’s Derivative Sequence Database RefSeq Benefits • Non-redundant • Explicitly linked nucleotide and protein sequences • Updated to reflect current sequence data and biology • Validated by hand • Format consistency • Distinct accession series • Stewardship by NCBI staff and collaborators ftp://ftp.ncbi.nih.gov/refseq/release

RefSeq: NCBI’s Derivative Sequence Database (Nucleotide → Protein) • Curated transcripts and proteins • NM_123456 → NP_123456 • NR_123456 (non-coding RNA) • Model transcripts and proteins • XM_123456 → XP_123456 • XR_123456 (non-coding RNA) • Assembled Genomic Regions (contigs) • NT_123456 (BAC clones) • NW_123456 (WGS) • Other Genomic Sequence • NG_123456 (complex regions, pseudogenes) • NZ_ABCD12345678 (WGS) → ZP_123456 • Chromosome records in Entrez Genome • NC_123456 (chromosome; microbial or organelle genome)
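Because RefSeq uses a distinct accession series, the record type can be read off the prefix. The lookup table below is a sketch covering only the prefixes listed on this slide; the category labels are paraphrased from the slide and the function name `classify` is hypothetical.

```python
# Prefix -> (category, molecule), limited to the prefixes on the slide.
REFSEQ_PREFIXES = {
    "NM": ("curated", "transcript"),
    "NP": ("curated", "protein"),
    "NR": ("curated", "non-coding RNA"),
    "XM": ("model", "transcript"),
    "XP": ("model", "protein"),
    "XR": ("model", "non-coding RNA"),
    "NT": ("assembled genomic region", "contig (BAC clones)"),
    "NW": ("assembled genomic region", "contig (WGS)"),
    "NG": ("other genomic sequence", "complex region or pseudogene"),
    "NZ": ("other genomic sequence", "WGS"),
    "ZP": ("other genomic sequence", "protein from WGS"),
    "NC": ("chromosome record", "chromosome, microbial or organelle genome"),
}


def classify(accession):
    """Return (category, molecule) for a RefSeq accession, else None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    for acc in ("NM_000547", "XP_123456", "M17755"):
        print(acc, classify(acc))
```

GenBank accessions such as M17755 have no underscore, so the function returns None for them.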

Creating NM Records Genome annotation Longest mRNA NMs must have cDNA support

NM/NP Records in Entrez (Nucleotide, Protein)
NM_000547: variant 1
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from M17755.2 and AW874082.1. On Feb 25, 2003 this sequence version replaced gi:21361188.
Callout: EST that completes 3’ end
NM_175719: variant 2
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.

Annotating the Gene [Figure] • Genomic DNA (NC, NT, NW) • Scanning.... • Model mRNA (XM), (XR) • Model protein (XP) • Comparison: = ? → = ! • Curated mRNA (NM), (NR) • Curated protein (NP) • RefSeq • GenBank sequences
