Nucleotide Sequence Databases Your guide to genes & genomes
Nucleotide Sequence Databases • First generation • GenBank is a representative example • started as sort of a museum to preserve knowledge of a sequence from first discovery • great repositories, particularly for long-term study of bioinformatic data • flat files; not built for (and not great at) querying
Nucleotide Sequence Databases • Second generation: • Entrez Gene is an example • information is gene-centric (not just sequence-centric) • all sequence information for a given gene can be found in one place
Nucleotide Sequence Databases • Third generation: • Ensembl is a good example • Information is organized around whole genomes; not only a specific gene’s structure, but its context: • position of this gene relative to others • strand orientation • how gene relates to presence or absence of biochemical functions in organism
Prokaryotes (Bacteria & Archaea) • microscopic organisms • single cell • no nucleus • simple genome: • typically a single, circular DNA molecule • 600,000 – 8 million base pairs • roughly 70% or more of the genome codes for proteins
Prokaryotes (Bacteria & Archaea) • genes generally don’t overlap • no introns; mRNA is collinear with gene sequence • protein sequences derived by translating the longest ORF (ATG to STOP) spanning the gene-transcript sequence source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
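The "translate the longest ORF" rule can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it assumes the standard codon assignments, scans only the three forward-strand reading frames, and the function names `longest_orf` and `translate` are made up for this example.

```python
from itertools import product

# Standard codon table, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return the longest ATG..STOP stretch on the forward strand (stop included)."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens an ORF
            elif start is not None and codon in STOPS:
                orf = dna[start:i + 3]         # in-frame stop closes it
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def translate(orf):
    """Translate an ORF codon by codon; unknown codons become 'X'."""
    codons = (orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons).rstrip("*")


example = "CCATGAAATTTTAAGGATGTAG"   # made-up sequence
orf = longest_orf(example)
protein = translate(orf)
```

A real analysis would also scan the reverse complement and allow for alternative start codons.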
Thought for today … source: http://www.scicomics.com/uploads/prokaryote.jpg
Eukaryotes • way more complicated • genes found in cell nucleus • genome size: from about 10 million to many billions of base pairs (the human genome is about 3.2 billion) • much lower gene density than prokaryotes: in human chromosomes, about one gene for every 100,000 base pairs source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
Eukaryotes • genome is much less compact than in prokaryotes; less than 5% of the human genome codes for protein • genes are transcribed starting downstream of a promoter region, but the process may be strongly influenced by sequence elements relatively far away source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/
Eukaryotes • Gene sequences and mRNA/protein sequences not collinear; only exons are retained in mature mRNA that encodes protein • A single gene may (and often does) exhibit more than one mRNA and protein form
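To make the "only exons are retained" point concrete, here is a small sketch that joins exon segments out of a gene sequence. The sequence, the exon coordinates, and the name `splice_exons` are all invented for illustration; coordinates are assumed to be 1-based and inclusive, as in GenBank records.

```python
def splice_exons(gene, exons):
    """Join exon segments; exons are (start, end) pairs, 1-based inclusive."""
    return "".join(gene[start - 1:end] for start, end in sorted(exons))


# Made-up gene: uppercase = exons, lowercase = a hypothetical intron.
gene = "ATGGCCgtaagtttcagGATTAA"
exons = [(1, 6), (18, 23)]
mature = splice_exons(gene, exons)
```

Passing a different list of exon coordinates to the same function models how one gene can yield more than one mRNA form.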
GenBank • First example: prokaryotic gene • point your browser to: http://www.ncbi.nlm.nih.gov/entrez • choose Nucleotide from the Search pull-down menu • in For box, type X01714 and click Go • Click the link labeled X01714 • Can “Send To Text” if you want to save the file
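The same record can also be requested without the browser. The sketch below is an alternative to the point-and-click steps above, not part of them: it assumes NCBI's E-utilities `efetch` endpoint with `db=nucleotide`, `rettype=gb`, and `retmode=text`. The helper names are made up, and `fetch_record` needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an efetch URL for one accession (rettype 'gb' or 'fasta')."""
    params = {"db": "nucleotide", "id": accession,
              "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="gb"):
    """Download the record as text (requires an internet connection)."""
    with urlopen(efetch_url(accession, rettype)) as resp:
        return resp.read().decode("utf-8")


url = efetch_url("X01714")
```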
GenBank fields • LOCUS • locus name • size of sequence (in base pairs) • nature of molecule (e.g. DNA or RNA) • topology (linear or circular) • DEFINITION: brief description of gene • ACCESSION: unique identifier for the record, shared with some other databases • VERSION: accession number with a version suffix (in the form ACCESSION.N) that increases when the sequence is revised; older records also list a GI number
GenBank fields • KEYWORDS: list of terms related to entry; can be used for keyword searching for related data • SOURCE: common name of relevant organism • ORGANISM: complete id, with taxonomic classification • note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE
GenBank fields • REFERENCE: credits author(s) who initially determined the sequence; includes subsections: • AUTHORS • TITLE • JOURNAL • PUBMED • COMMENT: free-formatted text that doesn’t fit in another category
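Because the header is a flat file with keywords at the left margin, indented sub-keywords, and deeper-indented continuation lines, it can be read with very little code. The sketch below is a rough illustration under those layout assumptions. The record text is entirely made up (it is not X01714), and `parse_header` is a hypothetical helper, not an NCBI tool.

```python
# Entirely hypothetical record, used only to show the layout.
SAMPLE = """\
LOCUS       EXAMPLE1                 120 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Hypothetical example gene, used only to
            illustrate the layout.
ACCESSION   EX000001
VERSION     EX000001.1
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria.
REFERENCE   1  (bases 1 to 120)
  AUTHORS   Doe,J.
  TITLE     A made-up title
FEATURES             Location/Qualifiers
"""


def parse_header(text):
    """Return (keyword, value, is_subfield) tuples for the header section."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(("FEATURES", "ORIGIN")):
            break                                  # header ends here
        indent = len(line) - len(line.lstrip())
        if indent < 10:                            # keyword or sub-keyword line
            parts = line.split(None, 1)
            value = parts[1].strip() if len(parts) > 1 else ""
            fields.append([parts[0], value, indent > 0])
        elif fields:                               # continuation of previous field
            fields[-1][1] += " " + line.strip()
    return [tuple(f) for f in fields]


fields = parse_header(SAMPLE)
by_name = {key: value for key, value, _ in fields}
```

The third element of each tuple records whether the keyword was indented, which is how ORGANISM shows up as a subsection of SOURCE.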
GenBank fields • FEATURES: table describing gene regions and associated biological properties • source: origin of specific regions of sequence; useful for distinguishing cloning vectors from host sequences • promoter: precise coordinates of promoter element in the sequence; may be more than one of these • misc_feature: in this example, indicates (putative) location of transcription start (mRNA synthesis) • RBS (ribosome binding site): location of the last upstream element • CDS (CoDing Sequence): describes the ORF
GenBank fields: FEATURES: CDS • gives coordinates from initial nucleotide (ATG) to last nucleotide of stop codon (TAA) • several lines follow, listing protein products, reading frame to use, genetic code to apply and several IDs for the protein sequence • /translation section gives computer translation of sequence into amino acid sequence
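The CDS coordinates can be used directly to cut the coding region out of the sequence. This is a minimal sketch that assumes a simple `start..end` location (1-based, inclusive, forward strand). The sequence and coordinates below are invented, and both function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extract_cds(sequence, location):
    """Return the subsequence for a simple 'start..end' location."""
    start, end = (int(x) for x in location.split(".."))
    return sequence[start - 1:end].upper()


def looks_like_orf(cds):
    """Check length is a multiple of 3, starts with ATG, ends with a stop codon."""
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


# Made-up sequence: the CDS is written in uppercase for readability.
sequence = "ggaggtATGAAACGTTAAcc"
cds = extract_cds(sequence, "7..18")
ok = looks_like_orf(cds)
```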
Last Section: sequence itself • This is the most important section in terms of analysis using other tools • Can isolate just this section and save the file, as follows: • Choose FASTA from the Display pull-down menu (top of page) • Choose Text in the Send To pull-down menu • Use File/Save As to save the file • use “Text” as file type • give the file a name that you’ll know to associate with this particular sequence
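Once the FASTA file is saved, reading it back takes only a few lines. The sketch below assumes the usual FASTA layout (a `>` header line followed by sequence lines); the record shown is made up, and `parse_fasta` is a name chosen for this example.

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]              # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)   # sequence may span many lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical file contents.
fasta_text = ">EX000001.1 hypothetical example\nATGAAA\nCGTTAA\n"
sequences = parse_fasta(fasta_text)
```

To use it on a saved file, read the file's contents into a string and pass that string to `parse_fasta`.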
Example 2: eukaryotic mRNA • Can obtain this example by searching Nucleotide database for U90223 • Similar to prokaryote example, because we’re looking at a direct coding sequence for a protein – not DNA, in other words • Notes on example: • KEYWORDS field is empty: this is an example of an incomplete annotation • remember, you’re looking at a primary database! • FEATURES field contains some new terms: • sig_peptide: location of the signal peptide (here, the mitochondrial targeting sequence) • mat_peptide: exact boundaries of mature peptide
Example 3: Eukaryotic gene • Can obtain this record by searching Nucleotide for AF018430 • General information: • LOCUS: same info as previous examples – note the locus name is different from the accession number this time • DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes • SEGMENT: indicates this is the second of 4; you’d need all 4 to reconstruct the mRNA that codes for the protein
Eukaryotic gene: FEATURES section • source subsection includes a /map section: • indicates chromosome (15) • arm (q means long arm) • cytogenetic band (q21.1)
Eukaryotic gene: FEATURES section • gene subsection: describes how to reconstruct the mRNAs found in this and separate entries: • the strings that begin “AF” refer to the GenBank entries (remember, this one was AF018430), and the numbers represent the nucleotide positions from the entries • if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it’s from the current entry • The < and > signs mark a partial feature: the true start (<) or end (>) lies beyond the coordinate shown
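The location strings in the gene and mRNA subsections follow a regular pattern, so they can be picked apart with a regular expression. The sketch below handles only the simple `join(ACC:start..end,...)` form described above. The example string uses made-up coordinates (it is not copied from AF018430), and `parse_join` is a hypothetical helper.

```python
import re

# Optional "ACCESSION.version:" prefix, optional < and > partial markers.
SEGMENT = re.compile(
    r"(?:(?P<acc>[A-Z]+\d+(?:\.\d+)?):)?"
    r"(?P<lt><)?(?P<start>\d+)\.\.(?P<gt>>)?(?P<end>\d+)"
)


def parse_join(location, current="current entry"):
    """Split a join(...) location into a list of segment dictionaries."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[5:-1]
    segments = []
    for part in inner.split(","):
        m = SEGMENT.fullmatch(part.strip())
        if m is None:
            raise ValueError("unsupported location segment: " + part)
        segments.append({
            "entry": m.group("acc") or current,   # no prefix = this record
            "start": int(m.group("start")),
            "end": int(m.group("end")),
            "partial_start": m.group("lt") is not None,
            "partial_end": m.group("gt") is not None,
        })
    return segments


# Hypothetical location string with invented coordinates.
segments = parse_join("join(AF018429.1:<1..50,1..1177,AF018431.1:1..>200)")
```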
Eukaryotic gene: FEATURES section • mRNA section: can be read in a similar manner to the gene section • note that there are two mRNA sections (each followed by a CDS section) • first section describes the mRNA for the mitochondrial form of the protein • second section describes the mRNA for the nuclear form • exon section: indicates position of exon(s) in sequence
Retrieving GenBank entries without accession numbers • Search Nucleotide for specific product you’re interested in; for example: human[organism] AND dUTPase[Protein name] • this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears • retrieves several more entries, some DNA and some mRNA • terms used in the titles of these entries can give us additional search criteria: human[organism] AND “dUTPpyrophosphatase”[Title] • yields somewhat different set of entries
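A fielded query like the first one above can also be sent programmatically. This sketch only builds the request URL. It assumes NCBI's E-utilities `esearch` endpoint with the `db`, `term`, and `retmax` parameters; the function name `esearch_url` and the default of 20 results are choices made for this example.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an esearch URL; urlencode escapes the brackets and spaces."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = "human[organism] AND dUTPase[Protein name]"
url = esearch_url(query)
```

Opening the URL returns a list of matching record IDs rather than the records themselves.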