170 likes | 290 Views
Primary vs. Secondary Databases. -This is one of the most important functions of a database: to reliably store and make accessible the data. Most protein sequences are predicted (i.e. annotated) from nucleotide sequence and therefore not curated.
E N D
Primary vs. Secondary Databases -This is one of the most important functions of a database: to reliably store and make accessible the data. Most protein sequences are predicted (i.e. annotated) from nucleotide sequence and therefore not curated. Primary databases are repositories of “raw” data. These are also referred to as archival databases. Secondary databases are repositories of “curated” data. Curated databases require human review of some kind, in addition to some experimental verification of the biological meaning of sequence data.
International Nucleotide Sequence Database Collection • European Molecular Biology Laboratory EMBL(UK) • GenBank (NCBI, USA) • DNA Databank of Japan DDBJ All three organizations share 100% of their data. See Figure 1.1 in your text. One of the consequences of data sharing is that file formats must correspond
Flatfiles • In biological databases, a flat file is a textfile, that usually contains one (sequence) record. • Flat files are the indivisible unit of all sequence databases, but data in them can be display in a variety of formats. • One of the most common formats for sequence records is called FASTA
A closer look at Flatfiles The first line is called the header Name identifier: a unique identifier for each sequence. This is also known as the primary accession number Length of mRNA i.e. not a circular molecule Like a plasmid Date when last updated. In this case, the sequence was submitted as an mRNA sequence. The “N” means nucleotide and the “M” means mRNA. See Box 1.2 Taxonomic code, not very useful anymore
Flatfiles continued The second line is called the Definition Line, the goal of which is to summarize the essential biological information encoded by the entry. Basic description of structure and function Gene name 1. Genus species Note: Gene ontology can be confusing. In this case, the gene is named after a fruitfly mutant. Type of molecule from which the sequence was derived. In reality, this would have been derived from a cDNA corresponding to a mRNA harvested from an embryonic cell
The most important entry. GenBank specific “geneinfo” identifier Primary database to reference to the sequence. If using this sequence in a publication, this is cited to refer readers to the database entry you used or created The version is very similar to the accession number, but if the sequence is updated either because it was wrong or incomplete, the number after the decimal indicates the version
Source organism Pretty self explanatory, except the difference between SOURCE and ORGANISM is that the latter is hyperlinked so one can go and investigate more…
All GenBank entries must be associated with a citation In essence, this ensures that the means by which the sequences were acquired have been peer reviewed, if not the sequence itself. This is what lends scientific credibility to the quality of these databases. This is an EMBL accession number, which means that it was not originally submitted through the GenBank portal
The only feature common to all three primary databases is the source feature All sequences must come from somewhere, so the minimum data (organisms and type of molecule) is entered here, with a link to the Taxonomy Browser. The list of acceptable database cross references (i.e. db_xref) to external links is strictly controlled. In this case it is still within the “Entrez” webspace, but others are possible.
All annotated nucleotide entries contain a “virtual” translation into amino acid sequence In this case, the translation is derived directly from a mRNA sequence, so there is a good chance it is correct, but if the translation is due to computationally derived genomic sequence, it should validated against a curated database.
So, flatfiles are informative, but what if you want to work with the sequence? The sequence data in the flatfile can be displayed or downloaded in a variety of different ways. A FASTA file is a very common format.
The simplest possible FASTA file >sequence AGTCCGATCGATCGTAGCTACGTACGTACGTAGCTAGCTACGTACGTACGATCGATGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG This FASTA sequence file has all of the necessary elements for a database entry, but it is not very informative. For example, we don’t know what database it is from, what organism is has come from, what molecule it encodes, if any etc.
FASTA format The chevron symbol “>” is important because it denotes the beginning of a new sequence. This is particularly important if you are using a file that contains multiple sequences for a query search, for example. >A sequence CAGCTGACAGATCGTACGATCGATGCGCACGAAGCACTACTAGCTAGGT >Another sequence CGCTAGCTCGCGATCGTATCAACGCGCGCGCGCGCGCATACTCACGCGC
Protein sequence databases Read Chapter one in book from “Protein Sequence Databases” to end of chapter With the exception of Protein Data Bank, which is a primary database composed of experimentally determined protein structure, all other protein databases are considered to be either mixed primary and secondary databases because they rely upon conceptual, or virtual translation of nucleotide data. GenPept is a secondary database, searchable through the “Protein” portal in Entrez. Caveat: errors in nucleotide sequence can be propagated. UniParc is a mixed primary and secondary database, and therefore attempts to be a comprehensive repository of amino acid sequences. Curated, Protein Data Bankis a primary database of protein structure determinations, using either X-ray crystallography or Nuclear Magnetic Resonance Spectroscopy.
Entrez Webspace http://www.ncbi.nlm.nih.gov/books/NBK21101/ This book will be your best friend. It is a comprehensive online documentation volume that attempts to fill the gap between a straightforward search in PubMed or BLAST and more advanced tasks. This webspace uses the concept of neighboring, which describes logical (i.e. natural) relationships between entries in one database and those in another.