Gene Structure and Identification. Eukaryotic Genes and Genomes Gene Finding. Complex Genome DNA. ~10% highly repetitive ( Mbp) NOT GENES ~25% moderate repetitive ( Mbp) Some genes ~25% exons and introns ( Mbp) 40%=? Regulatory regions Intergenic regions.
Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding Chuck Staben
Complex Genome DNA • ~10% highly repetitive ( Mbp) • NOT GENES • ~25% moderate repetitive ( Mbp) • Some genes • ~25% exons and introns ( Mbp) • 40%=? • Regulatory regions • Intergenic regions How to tell??
Eukaryotic Gene Expression: Engraved on your Brain!!!!
Yeast: ORFs = genes! What don’t you find this way?
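The "ORFs = genes" idea for yeast can be sketched as a six-frame scan for ATG...stop stretches. This is a minimal illustration, not the method used in the lecture: the function names and the `min_codons` cutoff of 100 are assumptions chosen for the example, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=100):
    """Scan all six frames for ATG...stop ORFs.

    Returns (strand, start, end, n_codons) tuples, where n_codons counts
    the ATG but not the stop codon. The min_codons cutoff is an arbitrary
    illustrative threshold.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None                   # look for the next ORF
    return orfs

# Toy sequence: a 6-codon ORF (ATG + 5 x GCT) followed by TAA.
print(find_orfs("CCATGGCTGCTGCTGCTGCTTAAGG", min_codons=3))
```

A scan like this finds only uninterrupted reading frames, which is why it misses short genes, genes split by introns, and non-coding RNA genes.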
Eukaryotes, cont’d “Large” eukaryotes: intron average = ; exon average = ; promoter/enhancer: where/how arranged? Genome sparse. Fungi: introns; promoter/enhancer: where? Genome dense or sparse?
Intron Prevalence
Intron Size
Exon Size
Fungi: Sew together exons • ORF regions • consensus sequences • domain/polypeptide matches
Exon/Intron Structure CCACATT gt n(30-10,000) a n(5-20) ag CAGAA; with the intron removed, ...CCACATTCAGAA... encodes ...ProHisSerGlu...
Alternative Splice CCACATT gt n(30-10,000) a n(5-20) agcag AA; using the downstream ag gives ...CCACATTAA... which encodes ...ProHis_____ Rules for alternative splicing?
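The two splice outcomes on the slides above can be checked with a short sketch that removes a gt...ag intron and translates the result. The exon sequences come from the slides; the intron filler, the helper names, and the use of one-letter amino acid codes with `*` for stop are assumptions made for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def splice(pre_mrna, intron_start, intron_end):
    """Remove pre_mrna[intron_start:intron_end] and join the flanking exons."""
    intron = pre_mrna[intron_start:intron_end].lower()
    if not (intron.startswith("gt") and intron.endswith("ag")):
        raise ValueError("intron does not follow the gt...ag rule")
    return (pre_mrna[:intron_start] + pre_mrna[intron_end:]).upper()

def translate(mrna):
    """Translate in frame 0, stopping after the first stop codon (*)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

# Intron filler is made up; only the gt...a...ag landmarks matter here.
intron = "gt" + "c" * 30 + "a" + "t" * 10 + "ag"
pre = "CCACATT" + intron + "CAGAA"
start, end = 7, 7 + len(intron)

normal = splice(pre, start, end)           # acceptor at the first ag
alternative = splice(pre, start, end + 3)  # acceptor at the ag inside CAG

print(normal, translate(normal))
print(alternative, translate(alternative))
```

Shifting the acceptor by three nucleotides joins CCACATT directly to AA, which creates an in-frame TAA and truncates the peptide after ProHis.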
Codon Bias/Nucleotide Frequency: useful? • Bias=0.97 means______ • Bias=0.03 means______
Consensus Sequences • Promoter sites • Intron/Exon • Transcription Termination/PolyA • Translation initiation Position Weight Matrices
Finding Functional Sequences Known Consensus Sequences Consensus Sequence Generation Functional Tests
Consensus Inference • Position Weight Matrices • Sequence Logos • Hidden Markov Models ProfileScan
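A position weight matrix can be sketched as per-column log-odds scores built from aligned sites. The aligned sites below are hypothetical TATA-like examples, and the uniform background of 0.25 and pseudocount of 1 are assumptions; real matrices would be built from curated site collections.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background)
                    for b in "ACGT"})
    return pwm

def score(pwm, seq):
    """Sum the column scores for a sequence as long as the matrix."""
    return sum(col[base] for col, base in zip(pwm, seq))

def best_hit(pwm, seq):
    """Return (score, offset) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))

# Hypothetical aligned sites, for illustration only.
sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGCGTATAAAGGC"))
```

Unlike a plain consensus string, the matrix keeps the frequency of every base at every position, so near-matches get graded scores instead of a yes/no answer.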
Translation Initiation Sites
Functional Assay CCATGG 100; CCCTGG 0; CCTTGG 5; CCATAG 0; CTATGG 90; CCATGA 85 • Conservation • Correlated • Positions
Splicing Consensus Vertebrates: A64 G73 G T A62 A68 G84 T63 … Y80 N Y80 Y87 R75 A Y95 … C65 A G N N Fungi: GTRNGT (N){30-1000} CTRAC (N){5-15} YAG Alternate Splicing!??
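The fungal intron pattern on this slide, GTRNGT (N){30-1000} CTRAC (N){5-15} YAG, translates directly into a regular expression once the IUPAC codes are expanded. The sketch below is only a pattern scan, and the helper names are assumptions; `finditer` reports non-overlapping matches, and the lazy quantifiers return the shortest intron from each donor site.

```python
import re

# IUPAC codes used in the slide's consensus (R = purine, Y = pyrimidine).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Expand an IUPAC consensus string into a regex fragment."""
    return "".join(IUPAC[ch] for ch in pattern)

FUNGAL_INTRON = re.compile(
    iupac_to_regex("GTRNGT")      # 5' splice site
    + "[ACGT]{30,1000}?"          # spacer to the branch point
    + iupac_to_regex("CTRAC")     # branch point
    + "[ACGT]{5,15}?"             # spacer to the 3' splice site
    + iupac_to_regex("YAG")       # 3' splice site
)

def find_candidate_introns(seq):
    """Return (start, end) spans of non-overlapping pattern matches."""
    return [(m.start(), m.end()) for m in FUNGAL_INTRON.finditer(seq.upper())]

# Toy gene: two short exons around a made-up 62-nt intron.
intron = "GTAAGT" + "C" * 40 + "CTAAC" + "T" * 8 + "TAG"
print(find_candidate_introns("ATGGCC" + intron + "GGCTAA"))
```

A match is only a candidate: the pattern says nothing about whether removing the span extends an ORF, which is the "reasonable splice" test used in the linguistic approach.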
Linguistic Approach • Non-repetitive DNA!! • Long ORF • similar to known protein • ORF extended by “reasonable” splices • ORF begins with “good” ATG • Promoter/terminator flanks Looks like a duck...
DATABASE SEARCH • BLASTN • What? • Limitations? • BLASTX/TBLASTX • BLASTX does? • TBLASTX? www.ncbi.nlm.nih.gov
Protein Database Matches Great for the “known” What about the unknown???
Transcript Initiation • Basal Promoters • Enhancers/Silencers/Regulatory Sites • Boundary elements? • Transcription Initiation Prokaryotes vs Eukaryotes Organism-to-Organism
GC CAAT TATA Basal Promoter Analysis Myers and Maniatis, Genes VI, 831 • ATATAA -30 TBP • GGCCAATC -75 CTF/NF1 • GCCACACCC -90 SP1 +1
mRNA processing • Exon/Intron • Alternate splicing • Polyadenylation/Cleavage • Stability
Poly A sites • Metazoans • AATAAA • Yeast: different
Translation • Initiation site • (Frameshifting) • Translational regulatory elements • upstream ORFs • translational enhancers
Translation Sites • Initiate at 5’-ATG • upstream ORF…regulatory • (Frameshifting) • Translation enhancers….
Integrated Genefinding • Linguistic approach (our discussion) • Probabilistic approaches • Discriminant analyses • MARKOV MODELS
Tools-WWW • GRAIL II: integrated gene parsing • GenLang • GENIE • HMMGene (lock ESTs, etc.) • GENSCAN • GENEMARK HMM Probabilities
Hidden Markov Models • Probabilistic Models • Applicable to linear sequences • P(all states)=1, infer probabilities of all states from observed (hidden states unobserved) • Work best when local correlations unimportant • Genefinding, phylogeny, secondary structure, genetic mapping • Work best with “Training Set” • Quantitative probabilities
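How hidden states are inferred from an observed sequence can be sketched with the Viterbi algorithm on a toy two-state model: "C" for coding and "N" for non-coding. Every probability below is invented for illustration (coding is assumed to be GC-rich); a real gene finder would estimate these values from a training set and use many more states.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, computed in log space."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best previous state leading into s.
            prev, best = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = best + math.log(emit_p[s][symbol])
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# Hypothetical parameters: states rarely switch, coding is GC-rich.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}
print(viterbi("A" * 10 + "G" * 10, states, start_p, trans_p, emit_p))
```

Because each emission depends only on the current state, the model cannot represent correlations between distant positions, which matches the slide's note that HMMs work best when such correlations are unimportant.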
Accuracy Assessment PP = predicted coding; PN = predicted non-coding; AP = “real” positives; AN = “real” negatives; TP = number of correct positives; TN = number of correct negatives; FP = number of false positives; FN = number of false negatives. Sn = TP/AP; Sp = TP/PP; AC = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
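The accuracy measures above can be computed per nucleotide from a predicted and a "real" annotation. In this sketch the two annotations are assumed to be equal-length strings of "C" (coding) and "N" (non-coding); that encoding and the function names are illustrative choices, not part of the slide.

```python
def confusion_counts(predicted, actual, coding="C"):
    """Count TP, TN, FP, FN over paired per-nucleotide labels."""
    tp = tn = fp = fn = 0
    for p, a in zip(predicted, actual):
        if p == coding and a == coding:
            tp += 1
        elif p != coding and a != coding:
            tn += 1
        elif p == coding:
            fp += 1          # predicted coding, really non-coding
        else:
            fn += 1          # predicted non-coding, really coding
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) using the formulas from the slide."""
    sn = tp / (tp + fn)      # Sn = TP/AP, since AP = TP + FN
    sp = tp / (tp + fp)      # Sp = TP/PP, since PP = TP + FP
    ac = (tp / (tp + fn) + tp / (tp + fp)
          + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return sn, sp, ac

counts = confusion_counts("NNCCCCNN", "NCCCCNNN")
print(counts, accuracy(*counts))
```

Note that Sp as defined here is the fraction of predicted coding nucleotides that are correct, so it ignores true negatives; AC brings them back in and ranges from -1 to 1.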
Accuracy Levels DNA Sequence Error Rate!??
NEXT • Regulatory Sequences • Known Consensus Sequences • Consensus Sequence Generation • Functional (Lab) Data • Real examples
Gene Regulatory Sequences • Functional sites • Consensus • Experimental tests • Inferred sites • Transcriptome analysis
Regulatory Sites • Transcript initiation • mRNA processing • Translation sites
Regulatory Factors • lacI, trpR, CAP, araC…. • GAL4, NDT80… Known from experiment Infer from genome? Infer from expression data?
EUKARYOTES • More complex signals • More genes • More dispersed signals • Combinatoric regulation common
Enhancer Elements • Octamer OCT1, OCT2 • Name some… False +, False -
Consensus Sequence Databases • WWW-based • TFD (transcription factor database) • BCM Search launcher
Transcriptome Analyses • Microarray transcription analysis • MEME analysis of clusters More later....
Practical Gene Finding • Use ALL tools • Comparative • BLASTN, BLASTX • Predictive: Stitch together a consensus • HMM, GRAIL… • ORF finders • Findpatterns (and WWW pattern searches) • cDNA OR protein OR genetic evidence
FRAMES-aldolase gene
If aldolase is so tough, how do you really do it? Combine DNA sequence with other data!
Infer Promoter, Enhancer Test in cis Genome-cDNA P DNA sequencing Align (GAP) cDNA
Comparative Genomics • Conservation of coding regions • Identification of transcription signals • “words” in common