Molecular Biology Databases

Molecular Biology Databases: NCBI, DDBJ, EMBL and others

What is a Database? • A database can be defined as "a collection of data arranged for ease and speed of search and retrieval." • A DNA database contains individual records, or data entries, of DNA sequences as well as information about those sequences. • A DNA database is often stored as flat files. These are relatively simple database systems in which each database is contained in a single table. • In contrast, relational database systems can use multiple tables to store information, and each table can have a different record format.
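The flat-file versus relational distinction can be sketched in a few lines of Python. This is a minimal illustration, not how GenBank is actually stored: the table names, column names and helper functions are assumptions made up for the example, and the accession numbers and taxon IDs are borrowed from records shown later in these slides.

```python
import sqlite3

# Flat-file style: one table, and every row carries all of its own information.
flat_file = [
    ("AF062069", "Limulus polyphemus", 6850),
    ("U18238", "Medicago sativa subsp. sativa", 56147),
]

def find_flat(accession):
    """Linear scan of the single table, as with a simple flat file."""
    for row in flat_file:
        if row[0] == accession:
            return row
    return None

# Relational style: organisms and sequences live in separate tables with
# different record formats, linked by a shared key (taxon_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organism (taxon_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, "
    "taxon_id INTEGER REFERENCES organism(taxon_id))"
)
conn.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(taxon, name) for _, name, taxon in flat_file],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [(acc, taxon) for acc, _, taxon in flat_file],
)

def find_relational(accession):
    """Join the two tables to rebuild the same answer as the flat lookup."""
    return conn.execute(
        "SELECT s.accession, o.name, o.taxon_id "
        "FROM sequence s JOIN organism o ON o.taxon_id = s.taxon_id "
        "WHERE s.accession = ?",
        (accession,),
    ).fetchone()
```

Both lookups return the same tuple for a given accession. The relational version stores each organism name once, however many sequences refer to it.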

GenBank as a Database • GenBank is the National Institutes of Health (NIH) genetic sequence database, an annotated collection of all publicly available DNA sequences. • It is maintained by the National Center for Biotechnology Information (NCBI) within the NIH.

Anatomy of a Genome InfoSystem
Information structure
– Records of hierarchical, complex documents; tables of rows and columns of numbers, letters, words
– Table of contents, reports, indexing (as in a reference book)
– Browse through the available structure
– Search and retrieve according to biological questions
– Bulk data selection and retrieval for other uses
Information content
– Primary: literature (referenced, abstracted and curated), sequence and feature analyses, maps, controlled vocabulary/ontologies relevant to biology, people, research methods, contacts, etc.
– Metadata describing primary data, along with protocols, notes, sources
Informatics / software
– “Back-end” database, data collection, management, with some analyses
– “Front-end” information services (hypertext web, document search/retrieval methods); ease of understanding and usage (HCI)
– “Middleware” glue code, software, etc.
– Specialized applications for genome data: maps, BLAST searches, ontologies

History of Sequence Databases • The first bioinformatics databases were constructed a few years after the first protein sequences began to become available. • The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. • Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. • Just a year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database. • The Protein DataBank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISSPROT protein sequence database began in 1987.

GenBank History • DNA databases began in the early 1980s with a database called GenBank, which was originated by the U.S. Department of Energy to hold the short stretches of DNA sequence that scientists were just beginning to obtain from a range of organisms. • In the early days of GenBank, rooms of technicians sat at keyboards consisting of only the four letters A, C, T and G, tediously entering the DNA-sequence information published in academic journals.

The National Center for Biotechnology Information • Created as a part of NLM in 1988 • Establish public databases • U.S. National DNA Sequence Database • Perform research in computational biology • Develop software tools for sequence analysis • Disseminate biomedical information

GenBank History • Newer communication technologies enabled researchers to dial up GenBank and deposit their sequence data directly. • The administration of GenBank was transferred to the National Institutes of Health's National Center for Biotechnology Information (NCBI). • With the advent of the World Wide Web, researchers could access the data in GenBank for free from around the globe. • Once the Human Genome Project (HGP) began in 1990, DNA-sequence data in GenBank began to grow exponentially. • With the introduction of high-throughput sequencing in the 1990s, additions to GenBank skyrocketed.

An Interesting Metaphor • For Bioinformatics Information Flow and Databases • Cooks generate and enter the data. • Data Management makes it into a stew of blended information. • The waiters take the data from the servers to the public. • The diners are placing orders for the information they wish to consume.

Molecular Databases
• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize but don’t add additional information
  – Examples: GenBank, SNP, GEO
• Derivative Databases
  – Human curated: compilation and correction of data. Examples: SWISS-PROT, NCBI RefSeq mRNA
  – Computationally derived. Example: UniGene
  – Combinations. Example: NCBI Genome Assembly

What, the scientists submit their own DNA sequences? • Who checks for error? • Who makes people actually send their data to the database so all can share it? • Learn from success, failure of GenBank/EMBL extensive publicly shared bio-data • Carrot/stick approach. Granting agencies and journals began requiring scientists to publish sequence data. Patented sequences must be entered in the databases too. • However, there is significant public databank error due to data ownership by scientists; no inducements to update or go back and correct errors.

Primary vs. Derivative Databases [Figure: diagram of sequence fragments flowing between the labeled boxes Sequencing Centers, Labs, GenBank, Curators, RefSeq, Algorithms, UniGene and Genome Assembly]

GenBank is NCBI’s Primary Sequence Database • Nucleotide-only sequence database • Archival in nature • GenBank data • Direct submissions (traditional records) • Batch submissions (EST, GSS, STS) • FTP accounts (genome data) • Three collaborating databases • GenBank • DNA Database of Japan (DDBJ) • European Molecular Biology Laboratory (EMBL) Database

Why use Bioinformatics Databases? • Speed of information retrieval • Increasing size of data sets • Amount of information available • Save time and money by simulating experiments prior to actual experiment (a.k.a. in silico)

How do you access Databases? • Search engines • Programs that allow you to search the database • Links from other sites to the search engines • Programs that directly link to the search engines

Boolean Logic • Why do we use Boolean operators? • To narrow your search • To get fewer superfluous results • What are the Boolean operators? • AND: looks for entries with both terms • OR: looks for entries with one term or the other (or both) • NOT (or BUTNOT): looks for entries with one term but not the other • * (Wildcard): looks for ALL entries that contain a word beginning with the letters that precede the *

AND: Food AND Allergy • Citations that contain both descriptors, Food and Allergy.

OR: Food OR Allergy • Citations that contain either descriptor, Food or Allergy. This is a bigger set.

NOT: Allergy NOT Food • Citations that contain the descriptor Allergy but not the descriptor Food.

* (Wildcard): Allerg* • Citations that contain any descriptor matching Allerg* (Allergies, Allergy, Allergen)
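The four operators map directly onto set operations. The sketch below uses a made-up index of five citations (the IDs and descriptor words are hypothetical, as is the `having` helper) to show what each search returns.

```python
# Hypothetical mini-index: citation ID -> descriptor words (made-up data).
citations = {
    1: {"food", "allergy"},
    2: {"food", "safety"},
    3: {"allergy", "pollen"},
    4: {"allergen", "food"},
    5: {"allergies", "asthma"},
}

def having(term):
    """Return the IDs of citations whose descriptors include the term.

    A trailing * matches any descriptor that starts with the stem before it.
    """
    if term.endswith("*"):
        stem = term[:-1]
        return {
            cid for cid, words in citations.items()
            if any(word.startswith(stem) for word in words)
        }
    return {cid for cid, words in citations.items() if term in words}

food = having("food")
allergy = having("allergy")

both = food & allergy           # Food AND Allergy: intersection
either = food | allergy         # Food OR Allergy: union, the bigger set
allergy_only = allergy - food   # Allergy NOT Food: difference
stemmed = having("allerg*")     # Allerg*: allergy, allergen, allergies
```

With this data, AND narrows the search to one citation, OR widens it to four, and the wildcard picks up the spelling variants that a search for the exact word would miss.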

GenBank as a Database • GenBank identifiers are unique combinations of numbers and letters used to index GenBank sequence entries. • They can be used to retrieve information about a particular gene or DNA sequence from the GenBank database. • This information also includes links to similar sequence entries and to other public databases, so GenBank behaves like a relational database as well as a flat-file database.

What is GenBank? NCBI’s Primary Sequence Database • Nucleotide-only sequence database • Archival in nature • GenBank data • Direct submissions of individual records (BankIt, Sequin) • Batch submissions via email (EST, GSS, STS) • FTP accounts for sequencing centers • Data shared among three collaborating databases • GenBank • DNA Database of Japan (DDBJ) • European Molecular Biology Laboratory Database (EMBL) at EBI

The International Sequence Database Collaboration [Figure: three collaborating databases, each accepting submissions and updates: GenBank (NCBI, NIH), EMBL (EBI) and DDBJ (CIB, NIG). Retrieval systems named on the slide: Entrez, SRS, getentry.]

GenBank: NCBI’s Primary Sequence Database • Release 131, August 2002 • 18,197,119 records • 22,616,937,182 nucleotides • 110,000+ species • 83.65 gigabytes of data • Full release every two months • Incremental and cumulative updates daily • Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/

GenBank: NCBI’s Primary Sequence Database • Release 135, April 2003 • 24,027,936 records • 31,099,264,455 nucleotides • 120,000+ species • 114 gigabytes • Full release every two months • Incremental and cumulative updates daily • Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/

GenBank: NCBI’s Primary Sequence Database • Release 139, December 2003 • 30,968,418 records • 36,553,368,485 nucleotides • >140,000 species • 138 gigabytes in 570 files • Full release every two months • Incremental and cumulative updates daily • Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/

The Growth of GenBank [Figure: sequence records (millions) and total base pairs (billions) plotted by year, '82 to '03] • Release 139: 31.0 million records, 36.6 billion nucleotides • Average doubling time ≈ 12 months
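A doubling time can be estimated from any two releases by assuming exponential growth between them: doubling time = elapsed months × ln 2 / ln(growth ratio). The sketch below applies this to the nucleotide counts quoted for Release 131 and Release 139. The function name and the dictionary layout are illustrative choices, and the pure-exponential model is an assumption.

```python
import math

# Figures quoted on the release slides (release number -> date and size).
releases = {
    131: {"year": 2002, "month": 8, "nucleotides": 22_616_937_182},
    139: {"year": 2003, "month": 12, "nucleotides": 36_553_368_485},
}

def doubling_time_months(first, last):
    """Estimate the doubling time in months, assuming exponential growth."""
    elapsed = (last["year"] - first["year"]) * 12 + (last["month"] - first["month"])
    growth = last["nucleotides"] / first["nucleotides"]
    return elapsed * math.log(2) / math.log(growth)

estimate = doubling_time_months(releases[131], releases[139])
print(f"Estimated doubling time: {estimate:.1f} months")
```

Over this 16-month window the estimate comes out at roughly 23 months, slower than the 12-month average quoted for the chart. The chart's figure is an average over the whole period from 1982 to 2003, so the two numbers describe different spans.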

The Entrez System

Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ: 35,116,960
Derivative
• RefSeq: 259,219
• Third Party Annotation: 3,182
• PDB: 4,703
Total: 35,384,248

Entrez Protein
• GenPept (GB, EMBL, DDBJ): 3,442,298
• RefSeq: 856,191
• Third Party Annotation: 3,834
• Swiss Prot: 144,508
• PIR: 282,821
• PRF: 12,079
Total: 3,442,298
BLAST nr: 1,642,191

Organization of GenBank: GenBank Divisions
Records are divided into 17 divisions: 1 Patent (11 files), 5 High Throughput and 11 Traditional.
BULK divisions (batch submission via email and FTP; inaccurate; poorly characterized):
• EST (288) Expressed Sequence Tag
• GSS (98) Genome Survey Sequence
• HTG (61) High Throughput Genomic
• STS (3) Sequence Tagged Site
• HTC (3) High Throughput cDNA
Traditional divisions (direct submissions via Sequin and BankIt; accurate; well characterized):
• PRI (27) Primate
• PLN (10) Plant and Fungal
• BCT (8) Bacterial and Archaeal
• INV (6) Invertebrate
• ROD (11) Rodent
• VRL (3) Viral
• VRT (4) Other Vertebrate
• MAM (1) Mammalian (ex. ROD and PRI)
• PHG (1) Phage
• SYN (1) Synthetic (cloning vectors)
• UNA (1) Unannotated
Entrez query: gbdiv_xxx[Properties]
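The Entrez query pattern gbdiv_xxx[Properties] takes a division code in place of xxx. The sketch below only builds query strings from the codes listed on this slide; it does not contact Entrez. The function names are hypothetical, and lowercasing the code is an assumption about how the pattern is filled in.

```python
# Division codes as listed on the slide.
DIVISIONS = {
    "est": "Expressed Sequence Tag",
    "gss": "Genome Survey Sequence",
    "htg": "High Throughput Genomic",
    "sts": "Sequence Tagged Site",
    "htc": "High Throughput cDNA",
    "pri": "Primate",
    "pln": "Plant and Fungal",
    "bct": "Bacterial and Archaeal",
    "inv": "Invertebrate",
    "rod": "Rodent",
    "vrl": "Viral",
    "vrt": "Other Vertebrate",
    "mam": "Mammalian",
    "phg": "Phage",
    "syn": "Synthetic",
    "una": "Unannotated",
}

def division_filter(code):
    """Fill a division code into the gbdiv_xxx[Properties] pattern."""
    code = code.lower()
    if code not in DIVISIONS:
        raise ValueError(f"unknown GenBank division: {code!r}")
    return f"gbdiv_{code}[Properties]"

def restrict(term, code, operator="AND"):
    """Combine a search term with a division filter using a Boolean operator."""
    return f"{term} {operator} {division_filter(code)}"

print(restrict("myosin", "INV"))          # keep only invertebrate records
print(restrict("myosin", "EST", "NOT"))   # drop EST records from the results
```

The second call shows the common use of the filter: excluding the bulk divisions so that only well-characterized records remain.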

Traditional GenBank Divisions
• Direct submissions (Sequin and BankIt) • Accurate • Well characterized
• BCT Bacterial and Archaeal
• INV Invertebrate
• MAM Mammalian (ex. ROD and PRI)
• PHG Phage
• PLN Plant and Fungal
• PRI Primate
• ROD Rodent
• SYN Synthetic (cloning vectors)
• VRL Viral
• VRT Other Vertebrate

A Helpful Resource • This is a link to a sample annotated GenBank Record. Click on any of the underlined links to learn more about the file structure. • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

What is an Accession Number?
• An accession number is a label used to identify a sequence in the various databases. It is a string of letters and/or numbers that corresponds to a molecular sequence.
• Examples (all for retinol-binding protein, RBP4):
  – X02775: GenBank genomic DNA sequence
  – NT_030059: Genomic contig
  – rs7079946: dbSNP (single nucleotide polymorphism)
  – N91759.1: An expressed sequence tag (1 of 170)
  – NM_006744: RefSeq DNA sequence (from a transcript)
  – NP_007635: RefSeq protein
  – AAC02945: GenBank protein
  – Q28369: SwissProt protein
  – 1KT7: Protein Data Bank structure record
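The shape of an accession number often hints at the database it came from. The sketch below is a rough heuristic only: the regular expressions are simplified assumptions written to cover the examples on this slide, not official format definitions, and real accession formats are broader than these patterns. Order matters, because a SwissProt accession such as Q28369 also fits the one-letter-plus-five-digits shape of a GenBank nucleotide accession.

```python
import re

# Simplified, illustrative patterns (assumptions, not official definitions).
# They are tried in order, most specific first.
PATTERNS = [
    ("RefSeq mRNA", r"NM_\d+(\.\d+)?"),
    ("RefSeq protein", r"NP_\d+(\.\d+)?"),
    ("RefSeq genomic contig", r"NT_\d+(\.\d+)?"),
    ("dbSNP", r"[Rr][Ss]\d+"),
    ("PDB structure", r"\d[A-Za-z0-9]{3}"),
    ("SwissProt protein", r"[OPQ]\d[A-Z0-9]{3}\d"),
    ("GenBank protein", r"[A-Z]{3}\d{5}(\.\d+)?"),
    ("GenBank nucleotide", r"([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?"),
]

def guess_database(accession):
    """Return the first label whose pattern matches the whole accession."""
    for label, pattern in PATTERNS:
        if re.fullmatch(pattern, accession):
            return label
    return "unknown"

for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1", "NM_006744",
            "NP_007635", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", guess_database(acc))
```

A guess like this is only a starting point; the authoritative answer comes from looking the identifier up in the database itself.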

GenBank Flat File Format • When you click on an entry, you have opened a GenBank Flat File • Information includes: • The Name of the gene • The Accession number • Journal articles

GenBank Flat File Format • Information (cont.): • Structural information about the gene (e.g., intron/exon boundaries, promoters, etc.) • The protein sequence (the translation of the coding region) • The DNA sequence (for an mRNA entry, this is the cDNA sequenced from the mRNA)

A Traditional GenBank Record
Slide callouts: Accession Number (ACCESSION AF062069) • Version Number and GI Number (VERSION AF062069.2 GI:7144484) • Definition = Title • NCBI’s Taxonomy (the ORGANISM lineage)
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

GenBank Record: Feature Table
Slide callout: GenPept Protein IDs (/protein_id="AAC16332.2", /db_xref="GI:7144485")
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
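Because every line of a flat file begins with a keyword, the identifiers can be pulled out with simple text matching. The sketch below reads the VERSION line of the record shown above and counts the codons in its CDS location. The function names are hypothetical, and the code handles only the simple layouts seen in this record, not the full flat-file format.

```python
import re

# Header lines copied from the sample record above.
HEADER = """\
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
"""

def parse_version(text):
    """Split a VERSION line into accession, version number and GI number."""
    for line in text.splitlines():
        match = re.match(r"VERSION\s+(\w+)\.(\d+)\s+GI:(\d+)", line)
        if match:
            return {
                "accession": match.group(1),
                "version": int(match.group(2)),
                "gi": int(match.group(3)),
            }
    return None

def cds_codons(location):
    """Count codons in a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return length // 3

print(parse_version(HEADER))
print(cds_codons("258..3302"))
```

For the CDS at 258..3302 the span is 3045 bases, or 1015 codons. Since the record describes a complete cds, that count would include the stop codon.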

Multiple Formats are available for Sequence Data • Historically, the DNA and protein analysis software was written concurrently with the establishment of the databases, so the database formats and the software co-evolved. • Sequence analysis software needs simpler formats than databases do, for speed; otherwise the program must be allowed to ignore most of the excess information.

FASTA Definition Line
FASTA format is a very popular solution.
>gi|603218|gb|U18238.1|MSU18238
Fields, in order: gi number (603218) • database identifier (gb) • accession number with version (U18238.1) • locus name (MSU18238)
Database identifiers: gb GenBank • emb EMBL • dbj DDBJ • sp SWISS-PROT • pdb Protein Databank • pir PIR • prf PRF • ref RefSeq
>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA
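The definition line is easy to take apart because its fields are separated by the | character. The sketch below parses the gi-style definition line shown on this slide. The function name and the returned dictionary keys are illustrative choices, and the code assumes exactly this five-field layout.

```python
# Database identifier tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}

def parse_defline(line):
    """Parse a '>gi|number|db|accession.version|locus description' line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifier, _, description = line[1:].partition(" ")
    fields = identifier.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("not a gi-style definition line")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),
        "database": DB_TAGS.get(fields[2], fields[2]),
        "accession": accession,
        "version": int(version) if version else None,
        "locus": fields[4],
        "description": description,
    }

defline = ">gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd"
print(parse_defline(defline))
```

Everything after the first space is free text, which is why the parser splits there before it splits on the | separators.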

FASTA format

Graphics format

ASN.1 Format • ASN.1, or Abstract Syntax Notation One, is an International Organization for Standardization (ISO) data representation format used to achieve interoperability between platforms. • NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and MEDLINE records. • ASN.1 permits computers and software systems of all types to reliably exchange both the data structure and content.

NCBI Software Development Tool Kit • The "NCBI Toolbox" is a set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. • The software in the Toolbox is primarily designed to read ASN.1 format records. • It is available to the public in the toolbox/ncbi_tools directory of NCBI's ftp site, and can be used in its own right or as a foundation for building tools with similar properties. • The readme files in the toolbox and toolbox/ncbi_tools directories of the FTP site contain more information about the toolbox and ASN.1.

Abstract Syntax Notation: ASN.1
Formats shown on the slide: GenPept • GenBank • ASN.1 • FASTA Protein • FASTA Nucleotide
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Medicago sativa glucose-6-phosphate dehydrogenase mRNA, and translated products" ,
    source {
      org {
        taxname "Medicago sativa subsp. sativa" ,
        db {
          {
            db "taxon" ,
            tag id 56147 } } ,
        orgname {
          name binomial {
            genus "Medicago" ,
            species "sativa" ,
            subspecies "subsp. sativa" } ,
          mod {

Toolbox Sources: NCBI Toolbox
ftp> open ftp.ncbi.nih.gov
. .
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools
/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif
FILE *fpl;
Args myargs[] = {
  {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
  {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
  {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
  {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
  {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Database Tools aren’t keeping pace • Despite the huge progress in sequencing and expression analysis technologies, and the correspondingly larger volume of data held in public, private and commercial databases, the tools used for storage, retrieval, analysis and dissemination of data in bioinformatics are still very similar to the original systems gathered together by researchers 15-20 years ago. • Many are simple extensions of the original academic systems, which have served the needs of both academic and commercial users for many years. • These systems are now beginning to fall behind as they struggle to keep up with the pace of change in the pharma industry.

Database Tools aren’t keeping pace • Databases are still gathered, organized, disseminated and searched using flat files. • Relational databases are still few and far between, and object-relational or fully object-oriented systems are rarer still in mainstream applications. • Interfaces still rely on command lines, on fat client interfaces that must be installed on every desktop, or on HTML/CGI forms. • While the tools were in the hands of bioinformatics specialists, pharma companies were relatively undemanding of them. • Now that the problems have expanded to cover the mainstream discovery process, much more flexible and scalable solutions are needed to serve pharma R&D informatics requirements.

There is more than one type of DNA sequence in GenBank • Genomic sequences, made from genomic DNA: these do contain introns and LOTS of DNA that never becomes messenger RNA (mRNA), the RNA that codes for proteins. • cDNA sequences, made from mRNA: these don’t contain the introns. • ESTs: short stretches of cDNA sequence that are a sort of “rough draft”. • mtDNA: sequences from mitochondrial genomes. • SNPs: single nucleotide polymorphisms, positions where the DNA sequence varies.
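The difference between a genomic sequence and a cDNA sequence can be shown with a toy example: the cDNA is what remains when the exons are joined and the introns are dropped. The sequence, the exon coordinates and the function name below are all made up for illustration.

```python
def splice(genomic, exons):
    """Join 1-based, inclusive exon ranges, dropping the introns between them."""
    return "".join(genomic[start - 1:end] for start, end in exons)

# Toy genomic sequence: exons in upper case, one intron in lower case.
genomic = "ATGGCC" + "gtaagtttcag" + "TTTTAA"
exons = [(1, 6), (18, 23)]

cdna = splice(genomic, exons)
print(len(genomic), len(cdna))
print(cdna)
```

The genomic entry is 23 bases long while the cDNA built from it is only 12. This is the same relationship, at a tiny scale, as between a genomic record and the matching mRNA record in GenBank.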
