Chapter 6 Genomics and Gene Recognition Department of Computer Science and Information Engineering, National Chi Nan University 黃光璿 (HUANG, Guan Shieng) 2004/04/26
Motivation • Cells can determine the beginnings and ends of genes. • How can we identify genes algorithmically? • prokaryotic genomes • eukaryotic genomes
DNA Sequencing • Determine the order of nucleotides in a DNA fragment • Maxam-Gilbert method, 1977 • Sanger’s chain-termination method
Base-calling • Phred program • Developed at the University of Washington in 1998; converts traces (analog signals) into sequences (digital signals). • first ~50 bases of a read: noisy • beyond ~800 bases: signal declines
High-throughput Sequencing • Four-color fluorescent dyes have replaced the radioactive label. • Reads longer than 800 bp are possible, though 500~700 bp is more common. • Applied Biosystems' ABI Prism™ 3700 • six 96-well plates per day • 96 × 6 × 800 ≈ 0.5 M bases per day • Amersham Pharmacia's MegaBACE 1000™
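The throughput estimate on this slide is plain arithmetic. A minimal sketch, assuming every well yields one full-length read (the helper name `daily_throughput` is hypothetical):

```python
def daily_throughput(wells_per_plate: int, plates_per_day: int, read_length_bp: int) -> int:
    """Upper bound on bases sequenced per day, assuming one read per well."""
    return wells_per_plate * plates_per_day * read_length_bp


# Figures from the slide: 96 wells, 6 plates per day, 800 bp reads.
print(daily_throughput(96, 6, 800))  # 460800, i.e. roughly 0.5 M bases
```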
6.1 Prokaryotic Genomes • A prokaryotic genome should contain at least the information needed to • make and replicate its DNA; • make new proteins; • obtain and store energy.
TIGR (The Institute for Genomic Research) • has turned bacterial genome sequencing into a cottage industry • Example • anthrax strains from the bio-terrorism mailings of late 2001.
6.2.1 Promoter Elements • promoter • a site in a DNA chain where RNA polymerase binds to initiate transcription of one or more nearby structural genes into messenger RNA
6.2.1.1 RNA polymerases • β’: binds to the DNA template • β: links one nucleotide to another • α: holds all subunits together • σ: recognizes the specific nucleotide sequences of promoters (the least conserved subunit)
6.2.1.3 • consensus sequence • recognized by the same σ-factor • shared by the promoters of many different genes • operon • a set of genes with related functions • regulatory proteins • positive regulators enhance transcription • negative regulators repress or attenuate it
Lactose (lac) operon (in E. coli) • beta-galactosidase (z) • lactose permease (y) • lactose transacetylase (a) • A single long polycistronic RNA encodes all three proteins.
6.2.1.4 E. coli’s Lac Operon • σ70 • Expressed most efficiently only when the cell’s environment is rich in lactose and poor in glucose • lactose binds the negative regulator pLacI → the genes are expressed • when glucose is scarce, the positive regulator CRP is active → expression is enhanced
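The two regulatory inputs on this slide can be read as a small truth table. The sketch below is a qualitative simplification: the function name and the labels "off", "low" and "high" are illustrative assumptions, not measured expression levels.

```python
def lac_operon_expression(lactose: bool, glucose: bool) -> str:
    """Qualitative lac operon output for a given environment (simplified model)."""
    repressor_bound = not lactose  # pLacI leaves the operator once lactose binds it
    crp_active = not glucose       # CRP enhances transcription when glucose is scarce
    if repressor_bound:
        return "off"
    return "high" if crp_active else "low"


for lactose in (False, True):
    for glucose in (False, True):
        print(lactose, glucose, lac_operon_expression(lactose, glucose))
```

Only the lactose-rich, glucose-poor case gives "high", matching the condition stated on the slide.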
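6.2.2 Open Reading Frames • stop codons • UAA, UAG, UGA • In random sequence, (1 − 3/64)^N = 0.05 gives N ≈ 63, so a stop-free run of about 63 codons or more arises by chance less than 5% of the time. • E. coli • average gene length = 316.8 codons; only 1.8% of genes are shorter than 60 codons • Open Reading Frame (ORF) • a continuous run of triplet codons without a stop codon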
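The threshold of about 63 codons follows from solving (1 − 3/64)^N = 0.05 for N. A minimal sketch of that calculation, assuming all 64 codons are equally likely (the function name `orf_length_threshold` is hypothetical):

```python
import math


def orf_length_threshold(stop_codons: int = 3, total_codons: int = 64,
                         alpha: float = 0.05) -> float:
    """Solve (1 - stop_codons/total_codons)**N = alpha for N."""
    p_not_stop = 1 - stop_codons / total_codons
    return math.log(alpha) / math.log(p_not_stop)


n = orf_length_threshold()
print(round(n, 1), math.ceil(n))  # about 62.4, rounded up to 63 codons
```

Real genomes have biased codon frequencies, so the uniform assumption gives only a rough cut-off.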
start codon • AUG • E. coli • AUG ~ 83%; the alternatives GUG and UUG make up the remaining ~ 17% • How to determine the starting position for translation? • start codon • Shine-Dalgarno sequence • an A,G-rich region just upstream of the start codon that serves as the ribosome loading site • E.g., 5’ – AGGAGGT – 3’
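One way to combine the two signals is to keep only start codons that have the Shine-Dalgarno motif a short distance upstream. The sketch below works on the DNA alphabet (T instead of U). The gap limits of 4 to 12 nt, the exact-match requirement, and the inclusion of GTG as an alternative start codon are assumptions made for illustration, and the function name is hypothetical.

```python
START_CODONS = ("ATG", "GTG", "TTG")  # DNA spelling; GTG is an assumed addition
SD_MOTIF = "AGGAGGT"                  # example motif from the slide


def find_starts_with_sd(dna: str, min_gap: int = 4, max_gap: int = 12):
    """Return (motif_pos, start_pos) pairs where the motif ends
    min_gap..max_gap nucleotides upstream of a start codon."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2):
        if dna[i:i + 3] not in START_CODONS:
            continue
        lo = max(0, i - max_gap - len(SD_MOTIF))  # leftmost allowed motif start
        hi = i - min_gap                          # motif must end at or before here
        if hi <= lo:
            continue
        j = dna.rfind(SD_MOTIF, lo, hi)           # closest motif inside the window
        if j != -1:
            hits.append((j, i))
    return hits


print(find_starts_with_sd("CCAGGAGGTTTACAATGAAA"))  # [(2, 14)]
```

Real ribosome binding sites rarely match the consensus exactly, so a practical scanner would score partial matches rather than require the full motif.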
6.2.4 Termination Sequences (refer to transcription) • > 90% of prokaryotic operons contain intrinsic terminators • an inverted repeat (7~20 bp, G-C rich) (e.g., 5’- CGGATG|CATCCG-3’) • ~ 6 U’s following the inverted repeat • these cause RNA polymerase to pause for ~ 1 min (RNA polymerase otherwise incorporates ~ 100 nt/sec)
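The two features of an intrinsic terminator can be searched for directly: a stem whose reverse complement follows it, then a run of T's (U's in the RNA). This is a rough sketch on the DNA alphabet; the loop allowance `max_loop`, the exact-pairing requirement, and the function names are assumptions, and the G-C richness of the stem is not checked.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_intrinsic_terminators(dna: str, stem_min: int = 7, stem_max: int = 20,
                               max_loop: int = 8, min_t_run: int = 6):
    """Return (start, stem_length, loop_length) for each inverted repeat
    that is immediately followed by a run of at least min_t_run T's."""
    dna = dna.upper()
    hits = []
    for start in range(len(dna)):
        found = False
        for stem in range(stem_max, stem_min - 1, -1):  # prefer the longest stem
            left = dna[start:start + stem]
            if len(left) < stem:
                continue
            for loop in range(max_loop + 1):
                r0 = start + stem + loop
                right = dna[r0:r0 + stem]
                tail = dna[r0 + stem:r0 + stem + min_t_run]
                if right == revcomp(left) and tail == "T" * min_t_run:
                    hits.append((start, stem, loop))
                    found = True
                    break
            if found:
                break
    return hits


# The slide's example repeat is 6 bp per arm, so stem_min is lowered here.
print(find_intrinsic_terminators("AACGGATGCATCCGTTTTTTAA", stem_min=6))
```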
6.3 GC-Content in Prokaryotic Genomes • G/C to A/T relative ratio • recognized as a distinguishing attribute of bacterial genomes • GC: 25% ~ 75%, wide range • GC-content of each bacterial species • seems to be independently shaped by mutational biases
GC-content is generally uniform throughout a bacterium’s genome • horizontal gene transfer • the movement of genetic material between bacteria other than by descent, in which information travels through the generations as the cell divides • GC-content reflects the evolutionary history of the bacterium
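GC-content is easy to compute, both for a whole sequence and in windows along it. In the sketch below the function names, window size and step are illustrative choices; a window whose value differs sharply from the genome-wide figure might hint at horizontally acquired DNA, though that reading is a heuristic and not a test.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int):
    """GC-content of each full window, as (start, fraction) pairs."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


print(gc_content("GGCCAATT"))          # 0.5
print(windowed_gc("GGGGAAAA", 4, 4))   # [(0, 1.0), (4, 0.0)]
```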
Prokaryotic Gene Density • 85%~88% of the nucleotides are associated with coding regions • E. coli • 4288 genes, average length 950 bp, separated by 118 bp on average.
Finding genes in prokaryotic genomes is relatively easy. • Long open reading frames (> 60 codons); • Matches to simple promoter sequences; • Transcriptional termination signals; • Comparisons with the nucleotide sequences of known protein coding regions from other organisms.
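The first criterion, long open reading frames, can be sketched as a scan for stop-free codon runs. This minimal version checks only the three forward frames and uses the DNA spelling of the stop codons; the function name and the returned tuple layout are assumptions, and a fuller scanner would also cover the reverse complement strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA


def long_orfs(dna: str, min_codons: int = 60):
    """Return (frame, start, end) for each stop-free run of at least
    min_codons codons in the three forward reading frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3  # next run begins after the stop codon
            pos += 3
        if (pos - run_start) // 3 >= min_codons:  # run reaching the sequence end
            orfs.append((frame, run_start, pos))
    return orfs


toy = "TAA" + "GCA" * 70 + "TAG"
print(long_orfs(toy))
```

Each hit is a candidate to check against the other criteria on the slide, such as promoter matches and termination signals.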
6.5 Eukaryotic Genomes • Differences (from prokaryotic genomes) • Internal membrane-bound compartments allow eukaryotic cells to maintain a wide variety of chemical environments. • Eukaryotes are often multicellular organisms, and each cell type usually has a distinctive pattern of gene expression. • There is relatively little constraint on the size of their genomes, so gene expression is more complicated and flexible.
6.6 Eukaryotic Gene Structure • 1000 times harder than finding a needle in a haystack??? • Long open reading frames • Searching for them is not appropriate, since introns interrupt the coding sequence.
GrailEXP & GENSCAN • rely on neural networks and dynamic programming • prediction accuracy < 50%
Features to detect include • a promoter • a series of intron/exon boundaries • a putative ORF with codon usage bias
6.6.1 Promoter Elements • prokaryotes • single RNA polymerase • eukaryotes • three kinds of RNA polymerases
RNA polymerases I and III • transcribe genes whose products are needed at fairly constant levels in all eukaryotic cells at all times.
RNA polymerase II • basal promoter • where the RNA polymerase II initiation complex is assembled and transcription begins • upstream promoter elements • sites for protein binding • It has been estimated that at least 5 upstream promoter elements are required to uniquely identify a gene.
RNA polymerase II does not recognize the basal promoter directly. • basal transcription factors • TATA-binding protein (TBP) • at least 12 TBP-associated factors (TAFs) • TATA-box for eukaryotes (-25) • 5’ – TATAWAW – 3’ (W= A or T) • initiator (Inr) sequence • 5’ – YYCARR – 3’ (Y=C or T, R=A or G)
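Both consensus sequences use degenerate nucleotide codes, which translate directly into regular-expression character classes. A minimal sketch; the helper names are hypothetical and only the codes that appear on the slide (W, Y, R) are mapped.

```python
import re

# Only the degenerate codes used on the slide are included.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}


def iupac_to_regex(pattern: str) -> str:
    """Translate a consensus such as TATAWAW into a regular expression."""
    return "".join(IUPAC[ch] for ch in pattern.upper())


def find_motif(dna: str, pattern: str):
    """Start positions of all matches, overlapping ones included."""
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(dna.upper())]


print(find_motif("GGTATAAAAGG", "TATAWAW"))  # [2]
print(find_motif("TTCAGG", "YYCARR"))        # [0]
```

Short consensus patterns like these match by chance very often in a large genome, so a hit is weak evidence unless it sits at the expected distance from other signals.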
Transcription factor differences • cause tissue-specific expression of some genes.
6.6.2 Regulatory Protein Binding Sites • bacteria • RNA polymerases have high affinity for promoters. • emphasis on negative regulation • eukaryotes • RNA polymerases II & III do not assemble around promoters very efficiently. • additional emphasis on positive regulation
Transcription Factors • constitutive • do not respond to external signals • regulatory • do respond to external signals • sequence-specific DNA-binding proteins
6.7 Open Reading Frames • Nuclear membrane • separates the processes of transcription and translation. • DNA → hnRNA (heterogeneous nuclear RNA) → mRNA → translation • hnRNA is capped, spliced, and polyadenylated to become mRNA • capping: chemical alteration (e.g., methylation) of the 5’ end • splicing: removal of introns • polyadenylation: ~ 250 A’s added at the 3’ end
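The sequence-level effect of splicing and polyadenylation can be sketched as string operations. Intron coordinates are supplied by hand here because locating them is the hard part of gene recognition; the function name and the half-open coordinate convention are assumptions, and capping is a chemical modification that leaves the letters of the sequence unchanged.

```python
def process_transcript(pre_mrna_dna: str, introns, poly_a_length: int = 250) -> str:
    """Remove introns given as half-open (start, end) intervals, convert
    T to U, and append a poly-A tail."""
    seq = pre_mrna_dna.upper()
    exons = []
    pos = 0
    for start, end in sorted(introns):
        if start < pos or end > len(seq) or start >= end:
            raise ValueError("introns must be non-overlapping and within the sequence")
        exons.append(seq[pos:start])  # keep the exon before this intron
        pos = end
    exons.append(seq[pos:])           # final exon
    return "".join(exons).replace("T", "U") + "A" * poly_a_length


# Toy transcript: exon ATG, intron GTAAGTCAG (starts GT, ends AG), exon GCC.
print(process_transcript("ATGGTAAGTCAGGCC", [(3, 12)], poly_a_length=5))  # AUGGCCAAAAA
```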
Splicing causes a serious problem for gene recognition algorithms. Eukaryotic genes do not have to possess statistically significant long ORFs in the genomic sequence.