Nucleotide Sequence Databases

Nucleotide Sequence Databases: Your guide to genes & genomes

Nucleotide Sequence Databases • First generation • GenBank is a representative example • started as sort of a museum to preserve knowledge of a sequence from first discovery • great repositories, particularly for long-term study of bioinformatic data • flat files; not built for (and not great at) querying

Nucleotide Sequence Databases • Second generation: • Entrez Gene is an example • information is gene-centric (not just sequence-centric) • all sequence information for a given gene can be found in one place

Nucleotide Sequence Databases • Third generation: • Ensembl is a good example • Information is organized around whole genomes; not only a specific gene’s structure, but its context: • position of this gene relative to others • strand orientation • how gene relates to presence or absence of biochemical functions in organism

Prokaryotes (Bacteria & Archaea) • microscopic organisms • single cell • no nucleus • simple genome: • typically a single, circular DNA molecule • 600,000 – 8 million base pairs • roughly 70% or more of genome codes for proteins

Prokaryotes (Bacteria & Archaea) • genes rarely overlap • no introns; mRNA is collinear with gene sequence • protein sequences derived by translating the longest ORF (ATG to stop codon) spanning the gene-transcript sequence source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
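The "translate the longest ORF" rule can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it assumes the standard genetic code, scans only the three forward reading frames, and uses a made-up toy sequence. The function names `longest_orf` and `translate` are illustrative choices.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def longest_orf(dna):
    """Return the longest ATG...stop stretch in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def translate(orf):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_sequence = "CCATGAAATTTGGGTAACC"  # invented sequence for illustration
orf = longest_orf(toy_sequence)
print(orf)             # ATGAAATTTGGGTAA
print(translate(orf))  # MKFG
```

This works for prokaryotes precisely because the mRNA is collinear with the gene; the same shortcut applied to eukaryotic genomic DNA would read through introns.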

Thought for today … source: http://www.scicomics.com/uploads/prokaryote.jpg

Eukaryotes • far more complicated • genes found in cell nucleus • genome size: 10 million – 670 billion base pairs • much lower gene density than prokaryotes: in human chromosomes, about one gene for every 100,000 base pairs source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm

Eukaryotes • genome is used much less efficiently than in prokaryotes; less than 5% of human genome codes for protein • transcription of a gene starts downstream of a promoter region, but the process may be strongly influenced by sequence elements relatively far away source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/

Eukaryotes • Gene sequences and mRNA/protein sequences are not collinear; only exons are retained in the mature mRNA that encodes the protein • A single gene may (and often does) give rise to more than one mRNA and protein form
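To make the exon/intron point concrete, here is a small Python sketch that builds a mature transcript by joining exon intervals. The gene, its exon coordinates, and the function name `splice` are all invented for illustration; coordinates are 1-based and inclusive, as in GenBank records, and sequences use DNA letters, as GenBank does even for mRNA entries.

```python
def splice(gene, exons):
    """Join the exon intervals (1-based, inclusive) in genomic order."""
    return "".join(gene[start - 1:end] for start, end in sorted(exons))


# Toy gene: exons in upper case, introns in lower case (invented sequence).
gene = "ATGGCC" + "gtaagtag" + "GGAGGA" + "gtcccag" + "AAATGA"

all_exons = [(1, 6), (15, 20), (28, 33)]
skip_middle = [(1, 6), (28, 33)]  # a second form that omits exon 2

print(splice(gene, all_exons))    # ATGGCCGGAGGAAAATGA
print(splice(gene, skip_middle))  # ATGGCCAAATGA
```

The two calls show how one gene can yield more than one mRNA form, depending on which exons are retained.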

GenBank • First example: prokaryotic gene • point your browser to: http://www.ncbi.nlm.nih.gov/entrez • choose Nucleotide from the Search pull-down menu • in For box, type X01714 and click Go • Click the link labeled X01714 • Can “Send To Text” if you want to save the file
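The same record can also be requested without the browser. The sketch below only builds a request URL for NCBI's E-utilities EFetch service; that service is not mentioned in the slides, so treat it as an assumed alternative route. The helper name `genbank_url` is illustrative, and the actual download is left commented out so the example runs offline.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession, rettype="gb"):
    """Build an EFetch URL for a nucleotide record in flat-file text form."""
    params = {
        "db": "nuccore",     # the Nucleotide database
        "id": accession,
        "rettype": rettype,  # "gb" for GenBank flat file, "fasta" for FASTA
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


print(genbank_url("X01714"))

# To actually download the record (requires network access):
# from urllib.request import urlopen
# with urlopen(genbank_url("X01714")) as response:
#     record_text = response.read().decode("utf-8")
```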

GenBank fields • LOCUS • size of sequence (in base pairs) • nature of molecule (e.g. DNA or RNA) • topology (linear or circular) • DEFINITION: brief description of gene • ACCESSION: unique identifier for the record, shared with some other databases • VERSION: accession number plus a version suffix that increases when the sequence is revised; may also list other ID numbers for the record
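As a rough illustration of what the LOCUS line carries, the following Python sketch splits one into named fields. The example line is hypothetical (not copied from a real record), and the parser assumes all fields are present and separated by whitespace; real files follow a column layout and can deviate from this simple pattern.

```python
def parse_locus(line):
    """Split a LOCUS line into its parts (simplified, whitespace-based)."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("not a LOCUS line in the expected simple layout")
    return {
        "name": fields[1],
        "length_bp": int(fields[2]),  # size of sequence in base pairs
        "molecule": fields[4],        # e.g. DNA, RNA, mRNA
        "topology": fields[5],        # linear or circular
        "division": fields[6],        # GenBank division code
        "date": fields[7],            # date of last modification
    }


# Hypothetical LOCUS line, for illustration only.
example = "LOCUS       EXAMPLE01               1200 bp    DNA     linear   BCT 01-JAN-2000"
info = parse_locus(example)
print(info["length_bp"], info["molecule"], info["topology"])  # 1200 DNA linear
```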

GenBank fields • KEYWORDS: list of terms related to entry; can be used for keyword searching for related data • SOURCE: common name of relevant organism • ORGANISM: complete id, with taxonomic classification • note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE

GenBank fields • REFERENCE: credits author(s) who initially determined the sequence; includes subsections: • AUTHORS • TITLE • JOURNAL • PUBMED • COMMENT: free-formatted text that doesn’t fit in another category
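The indentation convention (a subordinate term indented under its parent) is what lets a program recover the header's structure. The Python sketch below assumes the usual flat-file layout in which the keyword occupies the first 12 columns and the value starts at column 13; the sample header is invented, and `row` and `parse_header` are illustrative names.

```python
def row(keyword, value, indent=0):
    """Format one header line: keyword padded to 12 columns, then the value."""
    return f"{' ' * indent + keyword:<12}{value}"


def parse_header(text):
    """Return (keyword, parent, value) tuples; parent is None at top level."""
    records = []
    parent = None
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12], line[12:].strip()
        if not key.strip() and records:
            # Blank keyword column: continuation of the previous field.
            keyword, owner, previous = records[-1]
            records[-1] = (keyword, owner, f"{previous} {value}")
        elif key[0] != " ":
            # Keyword at column 1: a top-level field such as SOURCE.
            parent = key.strip()
            records.append((parent, None, value))
        else:
            # Indented keyword: subordinate to the last top-level field.
            records.append((key.strip(), parent, value))
    return records


# Invented header for illustration; not taken from a real record.
sample = "\n".join([
    row("DEFINITION", "Hypothetical example gene, complete cds."),
    row("SOURCE", "Escherichia coli"),
    row("ORGANISM", "Escherichia coli", indent=2),
    row("", "Bacteria."),
    row("REFERENCE", "1  (bases 1 to 1200)"),
    row("AUTHORS", "Doe,J.", indent=2),
])

for keyword, owner, value in parse_header(sample):
    print(f"{owner or '-':<10} {keyword:<11} {value}")
```

Each printed row shows the parent field (or `-` for top-level fields), so ORGANISM appears under SOURCE and AUTHORS under REFERENCE.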

GenBank fields • FEATURES: table describing gene regions and associated biological properties • source: origin of specific regions of sequence; useful for distinguishing cloning vectors from host sequences • promoter: precise coordinates of promoter element in the sequence; may be more than one of these • misc_feature: in this example, indicates (putative) location of transcription start (mRNA synthesis) • RBS (ribosome binding site): location of last upstream element • CDS (CoDing Sequence): describes the ORF

GenBank fields: FEATURES: CDS • gives coordinates from initial nucleotide (ATG) to last nucleotide of stop codon (TAA) • several lines follow, listing protein products, reading frame to use, genetic code to apply and several IDs for the protein sequence • /translation section gives computer translation of sequence into amino acid sequence
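A CDS coordinate pair can be checked against the sequence with a few lines of Python. This sketch assumes a simple `start..end` location on the forward strand (1-based, inclusive) and the standard stop codons; the sequence and coordinates are invented, and the function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extract_cds(sequence, start, end):
    """Slice out a feature given 1-based, inclusive coordinates."""
    return sequence[start - 1:end].upper()


def looks_complete(cds):
    """True if the CDS starts with ATG, ends with a stop, and is in frame."""
    return (len(cds) % 3 == 0
            and cds.startswith("ATG")
            and cds[-3:] in STOP_CODONS)


# Invented sequence with a CDS annotated at positions 5..16.
sequence = "ggcc" + "atgaaacgttaa" + "ttg"
cds = extract_cds(sequence, 5, 16)
print(cds)                  # ATGAAACGTTAA
print(looks_complete(cds))  # True
```

Note the off-by-one adjustment: GenBank counts from 1 and includes the end position, while Python slices count from 0 and exclude the end.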

Last Section: sequence itself • This is the most important section in terms of analysis using other tools • Can isolate just this section and save the file, as follows: • Choose FASTA from the Display pull-down menu (top of page) • Choose Text in the Send To pull-down menu • Use File/Save As to save the file • use “Text” as file type • give the file a name that you’ll know to associate with this particular sequence
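For reference, a FASTA file is just a `>` header line followed by the sequence broken into lines. The Python sketch below writes and reads that layout; the identifier, description, and sequence are placeholders, and the 60-character line width is a common convention rather than a requirement.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one sequence as FASTA text with fixed-width sequence lines."""
    header = f">{identifier} {description}".strip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines]) + "\n"


def read_fasta(text):
    """Read a single-record FASTA text; return (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    return lines[0][1:], "".join(lines[1:])


# Placeholder record for illustration.
record = to_fasta("EXAMPLE01", "hypothetical sequence", "ATGC" * 40)
print(record)

header, sequence = read_fasta(record)
print(header, len(sequence))  # EXAMPLE01 hypothetical sequence 160

# To save it under a name tied to the sequence:
# from pathlib import Path
# Path("EXAMPLE01.fasta").write_text(record)
```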

Example 2: eukaryotic mRNA • Can obtain this example by searching Nucleotide database for U90223 • Similar to the prokaryote example, because the record is an mRNA rather than genomic DNA, so the coding sequence for the protein is continuous • Notes on example: • KEYWORDS field is empty: this is an example of an incomplete annotation • remember, you’re looking at a primary database! • FEATURES field contains some new terms: • sig_peptide: location of mitochondrial targeting sequence • mat_peptide: exact boundaries of mature peptide

Example 3: Eukaryotic gene • Can obtain this record by searching Nucleotide for AF018430 • General information: • LOCUS: same info as previous examples; note that the locus name is different from the accession number this time • DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes • SEGMENT: indicates this is the second of 4; you’d need all 4 to reconstruct the mRNA that codes for the protein

Eukaryotic gene: FEATURES section • source subsection includes a /map section: • indicates chromosome (15) • arm (q means long arm) • cytogenetic band (q21.1)

Eukaryotic gene: FEATURES section • gene subsection: describes how to reconstruct the mRNAs found in this and separate entries: • the strings that begin “AF” refer to the GenBank entries (remember, this one was AF018430), and the numbers represent the nucleotide positions from the entries • if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it’s from the current entry • The < and > signs indicate that the feature is partial: it extends beyond the given start (<) or end (>) position
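The reading rules for these location strings can be sketched as a small Python parser. The location string below is hypothetical: only `1..1177` and the accession numbers come from the slides, and the other coordinates are invented. The parser handles only simple `join(...)` lists of `start..end` ranges and records the `<` and `>` markers as flags; `parse_join` is an illustrative name.

```python
import re

# One segment: optional "ENTRY:" prefix, optional "<", start, "..",
# optional ">", end.
SEGMENT = re.compile(
    r"^(?:(?P<entry>[A-Za-z][A-Za-z0-9_]*)(?:\.\d+)?:)?"
    r"(?P<lt><)?(?P<start>\d+)\.\.(?P<gt>>)?(?P<end>\d+)$"
)


def parse_join(location, current_entry):
    """Split a join(...) location into segments with their source entry."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):
        match = SEGMENT.match(part.strip())
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append({
            # No entry prefix means the range is in the current entry.
            "entry": match["entry"] or current_entry,
            "start": int(match["start"]),
            "end": int(match["end"]),
            "start_flagged": match["lt"] is not None,  # "<" before start
            "end_flagged": match["gt"] is not None,    # ">" before end
        })
    return segments


# Hypothetical location string (coordinates other than 1..1177 are made up).
location = "join(AF018432.1:<1..120,1..1177,AF018432.1:200..>340)"
for segment in parse_join(location, current_entry="AF018430"):
    print(segment)
```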

Eukaryotic gene: FEATURES section • mRNA section: can be read in a similar manner to the gene section • note that there are two mRNA sections (each followed by a CDS section) • first section describes the mRNA for the mitochondrial form of the protein • second section describes the mRNA for the nuclear form • exon section: indicates position of exon(s) in sequence

Retrieving GenBank entries without accession numbers • Search Nucleotide for the specific product you’re interested in; for example: human[organism] AND dUTPase[Protein name] • this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears • retrieves several more entries, some DNA and some mRNA • terms used in the titles of these entries can give us additional search criteria: human[organism] AND “dUTP pyrophosphatase”[Title] • yields a somewhat different set of entries
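Field-tagged queries like these are plain strings, so they are easy to assemble in code. The Python sketch below builds the first query from the slide and wraps it in a URL for NCBI's E-utilities ESearch service; using ESearch instead of the web form is an assumption beyond the slides, the helper names are illustrative, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(*clauses):
    """Combine (value, field) pairs into 'value[field] AND value[field]'."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)


def esearch_url(term, db="nuccore"):
    """Build an ESearch URL; urlencode escapes spaces and brackets."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


term = build_term(("human", "organism"), ("dUTPase", "Protein name"))
print(term)               # human[organism] AND dUTPase[Protein name]
print(esearch_url(term))
```

Keeping the query builder separate from the URL builder means the same `term` string can also be pasted directly into the search box.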
