Information organization • Oct 2, 2012 • Learning objectives: Demonstrate the Dotter program. Understand how information is stored in GenBank. Learn how to read a GenBank flat file. Learn how to search GenBank for information. Understand the difference between header, features, and sequence. Distinguish between a primary database and a secondary database. • Homework #2 due today. • Homework #3 due Tues., Oct. 9
What is GenBank? • Gene sequence database • Annotated records that represent single contiguous stretches of DNA or RNA-may have more than one coding region. • Generated from direct submissions to the DNA sequence databases from the authors. • Part of the International Nucleotide Sequence Database Collaboration.
History of GenBank • Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965) • Began sharing data with EMBL in 1986 and with DDBJ in 1987. • Primary database • Examples of secondary databases derived from GenBank: UniProt, EST database. • The GenBank Flat File (GBFF) is a human-readable form of a GenBank record.
[Figure: gene structure and expression. The DNA is drawn as a coding strand (5’ to 3’) and a template strand (3’ to 5’). Labels: promoter, transcription initiation site, start of gene, 5’ untranslated region (5’UTR), protein coding sequence (CDS), 3’ untranslated region (3’UTR), end of gene, transcription termination site. Upstream and downstream are defined relative to the CDS. Flow: DNA, transcription, RNA, translation, protein, protein folding, folded protein.]
Transcript splicing [Figure: DNA with numbered segments 1 to 4 separated by Intron 1, Intron 2, and Intron 3. Flow: DNA, transcription, primary transcript, splicing, mRNA (segments 1 to 4 joined), translation, protein.]
Alternative splicing [Figure: primary transcript with numbered segments 1, 2, 3, 4.]
General Comments on GBFF • Three sections: • 1) Header: information about the whole record • 2) Features: description of annotations, each represented by a key. • 3) Nucleotide sequence: each record ends with // on its last line. • DNA-centered • Translated sequence is a feature
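The three-section layout can be illustrated with a minimal sketch that splits one record on its section keywords. The function name `split_flat_file` and the toy record are hypothetical, made up for illustration; the sketch assumes a single record in which FEATURES starts the feature table and BASE COUNT or ORIGIN starts the sequence section.

```python
def split_flat_file(record: str) -> dict:
    """Split one GenBank flat-file record into header, features, and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        elif line.strip() == "//":
            break  # end-of-record marker on the last line
        sections[current].append(line)
    return sections


# Hypothetical toy record, not a real GenBank entry.
toy_record = "\n".join([
    "LOCUS       TOY00001     12 bp    DNA     PLN 01-JAN-2000",
    "DEFINITION  Hypothetical toy record.",
    "FEATURES             Location/Qualifiers",
    "     source          1..12",
    "ORIGIN",
    "        1 gatcctccat at",
    "//",
])

sections = split_flat_file(toy_record)
for name, lines in sections.items():
    print(name, len(lines))
```

Real records are better handled by an established parser; the point here is only that the section boundaries are visible in the text itself.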
Feature Keys • Purpose: • 1) Indicates biological nature of sequence • 2) Supplies information about changes to sequences
Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             (Protein) coding sequence
Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400 that has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".
Feature Keys-Terminology (Cont.)
Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
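A location string such as 23..400 or join(544..589,688..1032) can be broken into numeric spans with a few lines of code. The sketch below is illustrative only: `parse_location` is a hypothetical helper, and it handles just simple ranges, join(...), complement(...), and the partial markers < and >, not the full location syntax.

```python
import re


def parse_location(location: str) -> dict:
    """Parse a simple feature location into 1-based inclusive (start, end) spans."""
    loc = location.replace(" ", "")
    complement = loc.startswith("complement(") and loc.endswith(")")
    if complement:
        loc = loc[len("complement("):-1]
    if loc.startswith("join(") and loc.endswith(")"):
        loc = loc[len("join("):-1]
    spans = []
    for part in loc.split(","):
        match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", part)
        if match is None:
            raise ValueError(f"unsupported location: {location!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return {
        "spans": spans,
        "complement": complement,
        "partial": "<" in loc or ">" in loc,
        "length": sum(end - start + 1 for start, end in spans),
    }


print(parse_location("23..400"))
print(parse_location("join(544..589,688..1032)"))
```

The length field counts bases across all joined spans, so the first example covers 378 bases and the joined example covers 46 + 345 = 391 bases.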
Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
• LOCUS: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal), and modification date (21-JUN-1999).
• DEFINITION: "cds" stands for coding sequence.
• ACCESSION: accession number (never changes).
• VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence), followed by the GeneInfo identifier, GI (changes whenever there is a change).
• KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
• SOURCE: common name for organism.
• ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.
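As a rough illustration of how the header fields break down, the LOCUS and VERSION lines of this record can be split on whitespace. Both helper names are hypothetical, and the sketch assumes the seven-field LOCUS layout shown in this record; other records may carry additional LOCUS fields, in which case the function raises an error instead of guessing.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line laid out like the one in this record (7 whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("LOCUS line does not match the layout assumed here")
    return {
        "locus_name": fields[1],
        "length_bp": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "modification_date": fields[6],
    }


def parse_version(line: str) -> dict:
    """Split a VERSION line into accession, version number, and GI (if present)."""
    fields = line.split()
    accession, version = fields[1].split(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
version = parse_version("VERSION U49845.1 GI:1293613")
print(locus)
print(version)
```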
Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
• References are listed oldest first.
• MEDLINE: Medline UID.
• The submitter of the sequence is always the last reference.
Record from GenBank (cont.2)
There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
• Keys: source, CDS. Locations: 1..5028, <1..206. Qualifiers: the lines beginning with /, each followed by its value.
• <1..206: partial sequence on the 5’ end. The 3’ end is complete. Note: only a partial sequence.
• /codon_start: start of open reading frame.
• /product: descriptive free text must be in quotations.
• /protein_id: protein sequence ID #.
• /db_xref: database cross-refs.
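The effect of /codon_start=3 on the partial CDS <1..206 can be checked with a small translation sketch. It uses the first 60 bases from the ORIGIN section of this record and the standard genetic code; the `translate` function and the compact codon-table construction are illustrative assumptions, not part of GenBank.

```python
from itertools import product

# Standard genetic code, codons ordered with bases t, c, a, g at each position.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, codon_start: int = 1) -> str:
    """Translate DNA starting at the 1-based codon_start; unknown codons become X."""
    seq = dna.lower()[codon_start - 1:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing incomplete codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


# First 60 bases of the record (ORIGIN line 1), spaces removed.
first_60 = "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg"

print(translate(first_60, codon_start=3))  # SSIYNGISTSGLDLNNGTI
```

Reading from base 3 reproduces the first 19 residues of the /translation qualifier; reading from base 1 or 2 would give a different, incorrect frame.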
Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ. . ."
• 687..3158 and complement(3300..4037): another location.
• Both translations are cut off (. . .) on this slide.
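A complement(...) location means the feature is read from the strand opposite the one printed in the record, so the extracted bases must be reverse-complemented. The sketch below shows the idea on a short toy sequence, because the full 5028-base sequence is not reproduced on these slides; `extract_feature` is a hypothetical helper.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_feature(seq: str, start: int, end: int, on_complement: bool = False) -> str:
    """Extract bases start..end (1-based, inclusive), as GenBank locations are written."""
    sub = seq[start - 1:end]
    return reverse_complement(sub) if on_complement else sub


toy = "gatcctccat"  # first 10 bases of the record, used only as a toy example
print(extract_feature(toy, 3, 8))                      # tcctcc
print(extract_feature(toy, 3, 8, on_complement=True))  # ggagga
```

For the REV7 CDS, the same call with start 3300, end 4037, and on_complement set to True would apply to the full record sequence.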
Record from GenBank (cont.4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
. . .
//
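The sequence section is easy to process once the position numbers and spacing are stripped. The following sketch, with hypothetical helper names, parses the BASE COUNT line and cleans the two ORIGIN lines shown on the slide; it also checks that the four base counts add up to the 5028 bp given on the LOCUS line.

```python
import re
from collections import Counter


def parse_base_count(line: str) -> dict:
    """Read the per-base totals from a BASE COUNT line."""
    return {base: int(n) for n, base in re.findall(r"(\d+)\s+([acgt])", line)}


def clean_origin_lines(lines: list) -> str:
    """Drop position numbers and whitespace from ORIGIN sequence lines."""
    return "".join(re.sub(r"[\d\s]", "", line) for line in lines)


base_count = parse_base_count("BASE COUNT 1510 a 1074 c 835 g 1609 t")
print(sum(base_count.values()))  # 5028

origin_lines = [
    "1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]
sequence = clean_origin_lines(origin_lines)
print(len(sequence))  # 120
print(Counter(sequence))
```

Each full ORIGIN line holds 60 bases in six blocks of ten, which is why the second line starts at position 61.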
Databases derived from GenBank containing data for a single gene
• Protein databases: Non-redundant (nr), UniProtKB
• DNA databases: Non-redundant (nr), dbGSS, dbSTS
• RNA (cDNA) databases: dbEST, UniGene, RefSeq
Types of primary databases carrying biological information • GenBank/EMBL/DDBJ • dbEST: expressed sequence tags, single-pass cDNA sequences (high error freq.). It is non-redundant. • PDB: three-dimensional structure coordinates of biological molecules • PROSITE: database of protein domain/function relationships.
Summary • GenBank-longest running molecular biology database. • Three sections in every GenBank record • Primary databases and secondary databases. • RefSeq-contains unique record for each RNA variant. • UniProtKB-protein centered
Workshop • Do problem 1 in Chapter 2.
Homework • Do problems 2 and 3 in Chapter 2.