Explore the complexities of gene structures in eukaryotic organisms, including exon-intron regions, regulatory sequences, and gene identification methods. Learn about alternative splicing, codon bias, and functional sequence analysis. Delve into hidden Markov models, gene prediction tools, and accuracy assessment in gene-finding strategies.
Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding Chuck Staben
Complex Genome DNA • ~10% highly repetitive ( Mbp) • NOT GENES • ~25% moderate repetitive ( Mbp) • Some genes • ~25% exons and introns ( Mbp) • 40%=? • Regulatory regions • Intergenic regions How to tell??
Eukaryotic Gene Expression: Engraved on your Brain!!!!
Yeast ORFs=genes! What don’t you find this way?
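The "ORFs=genes" idea for yeast can be sketched as a six-frame scan for ATG...stop stretches. This is a minimal illustration, not the tool used in the lecture; the function names and the `min_codons` cutoff of 100 are assumptions chosen for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=100):
    """Return (strand, start, end, codons) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and for the "-" strand they
    refer to the reverse-complemented sequence. `codons` counts the ATG
    but not the stop. min_codons is an assumed, adjustable cutoff.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    n = (i - start) // 3
                    if n >= min_codons:
                        orfs.append((strand, start, i + 3, n))
                    start = None                   # look for the next ORF
    return orfs

# Toy sequence: ATG + 5 alanine codons + stop, padded on both sides
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

A scan like this only reports uninterrupted reading frames, so genes split by introns, ORFs shorter than the cutoff, and genes for non-coding RNAs do not show up.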
Eukaryotes, cont’d • “Large” eukaryotes: Intron average= Exon average= Promoter/enhancer: where/how arranged? Genome sparse • Fungi: introns; promoter/enhancer: where? Genome dense or sparse?
Intron Prevalence
Intron Size
Exon Size
Fungi Sew together exons • ORF regions • consensus sequences • domain/polypeptide matches
Exon/Intron Structure CCACATTgtn(30-10,000)an(5-20)agCAGAA …_______________... ...ProHisSerGlu...
Alternative Splice CCACATTgtn(30-10,000)an(5-20)agcagAA ...CCACATTAA... ...ProHis_____ Rules for alternative splicing?
Codon Bias/Nucleotide Frequency-useful? • Bias=0.97 means______ • Bias=0.03 means______
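To make "codon bias / nucleotide frequency" concrete, here is a small sketch that tallies codons in a coding sequence and reports the GC fraction at third codon positions (GC3). GC3 is used here only as one simple, commonly computed composition measure; it is not the "Bias" statistic quoted on the slide, whose definition the slide does not give. Function names are illustrative.

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in an in-frame coding sequence (trailing partial codon ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def gc3(cds):
    """Fraction of codons whose third base is G or C."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    gc = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc / total if total else 0.0

example = "ATGGCCGCAGCG"          # ATG GCC GCA GCG
print(codon_counts(example))
print(gc3(example))                # 3 of 4 codons end in G or C
```

Comparing such frequencies between known genes and candidate ORFs is one way composition statistics can help separate coding from non-coding sequence.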
Consensus Sequences • Promoter sites • Intron/Exon • Transcription Termination/PolyA • Translation initiation Position Weight Matrices
Finding Functional Sequences Known Consensus Sequences Consensus Sequence Generation Functional Tests
Consensus Inference • Position Weight Matrices • Sequence Logos • Hidden Markov Models ProfileScan
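A position weight matrix can be sketched in a few lines: count bases per column of aligned sites, convert to log-odds against a background, then slide the matrix along a sequence. The aligned sites below are made up for illustration, and the pseudocount of 1 and uniform background of 0.25 are assumptions, not values from the lecture.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds PWM from equal-length aligned sites (list of dicts, one per column)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds for a window the same length as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """(score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))

# Hypothetical aligned TATA-like sites, for illustration only
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGCGTATAAAGGC"))
```

Unlike a single consensus string, the matrix keeps the per-position frequencies, so a near-match scores higher than an unrelated window instead of simply failing.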
Translation Initiation Sites
Functional Assay CCATGG 100 CCCTGG 0 CCTTGG 5 CCATAG 0 CTATGG 90 CCATGA 85 • Conservation • Correlated • Positions
Splicing Consensus • Vert: A64 G73 G T A62 A68 G84 T63 … Y80 N Y80 Y87 R75 A Y95 … C65 A G N N • Fungi: GTRNGT (N){30-1000} CTRAC (N){5-15} YAG • Alternate Splicing!??
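The fungal pattern on this slide, GTRNGT(N){30-1000}CTRAC(N){5-15}YAG, translates almost directly into a regular expression once the IUPAC codes are expanded (R = A or G, Y = C or T, N = any base). The sketch below is one way to do that; the function names are illustrative, and the lazy quantifiers are a choice made here so that the nearest branch site and 3' splice site are reported.

```python
import re

# IUPAC ambiguity codes used in the slide's pattern
IUPAC = {"R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Expand IUPAC codes into regex character classes; A/C/G/T pass through."""
    return "".join(IUPAC.get(ch, ch) for ch in pattern)

FUNGAL_INTRON = re.compile(
    iupac_to_regex("GTRNGT")      # 5' splice site
    + "[ACGT]{30,1000}?"          # intron body
    + iupac_to_regex("CTRAC")     # branch site
    + "[ACGT]{5,15}?"             # branch site to 3' splice site spacing
    + iupac_to_regex("YAG")       # 3' splice site
)

def candidate_introns(seq):
    """(start, end) of non-overlapping candidate introns, 0-based, end-exclusive."""
    return [(m.start(), m.end()) for m in FUNGAL_INTRON.finditer(seq.upper())]

# Toy example: a 62-nt intron between two short exon fragments
intron = "GTAAGT" + "T" * 40 + "CTAAC" + "T" * 8 + "CAG"
print(candidate_introns("ATGGCC" + intron + "GCCTAA"))
```

A match is only a candidate: short consensus sequences like these also occur by chance, which is why pattern hits are combined with ORF and similarity evidence.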
Linguistic Approach • Non-repetitive DNA!! • Long ORF • similar to known protein • ORF extended by “reasonable” splices • ORF begins with “good” ATG • Promoter/terminator flanks Looks like a duck...
DATABASE SEARCH • BLASTN • What? • Limitations? • BLASTX/TBLASTX • BLASTX does? • TBLASTX? www.ncbi.nlm.nih.gov
Protein Database Matches Great for the “known” What about the unknown???
Transcript Initiation • Basal Promoters • Enhancers/Silencers/Regulatory Sites • Boundary elements? • Transcription Initiation Prokaryotes vs Eukaryotes Organism-to-Organism
GC CAAT TATA Basal Promoter Analysis Myers and Maniatis, Genes VI, 831 • ATATAA -30 TBP • GGCCAATC -75 CTF/NF1 • GCCACACCC -90 SP1 +1
mRNA processing • Exon/Intron • Alternate splicing • Polyadenylation/Cleavage • Stability
Poly A sites • Metazoans • AATAAA • Yeast-different
Translation • Initiation site • (Frameshifting) • Translational regulatory elements • upstream ORFs • translational enhancers
Translation Sites • Initiate at 5’-ATG • upstream ORF…regulatory • (Frameshifting) • Translation enhancers….
Integrated Genefinding • Linguistic approach (our discussion) • Probabilistic approaches • Discriminant analyses • MARKOV MODELS
Tools-WWW • GRAIL II: integrated gene parsing • GenLang • GENIE • HMMGene (lock ESTs, etc.) • GENSCAN • GENEMARK HMM Probabilities
Hidden Markov Models • Probabilistic Models • Applicable to linear sequences • P(all states)=1, infer probabilities of all states from observed (hidden states unobserved) • Work best when local correlations unimportant • Genefinding, phylogeny, secondary structure, genetic mapping • Work best with “Training Set” • Quantitative probabilities
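The idea of inferring unobserved states from an observed sequence can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. Everything numeric below is invented for the example: the two states, the GC-rich emissions for "coding", and the 0.9/0.1 transition probabilities are assumptions, not parameters from any real gene finder (real tools estimate them from a training set and use far richer state structures).

```python
import math

# Toy model: all probabilities are made up for illustration
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding":    {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "coding":    {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for obs, computed in log space."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position
            prev, best = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = best + math.log(emit_p[s][symbol])
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

print(viterbi("AAAATTTTGGGGCCCC", STATES, START, TRANS, EMIT))
```

On this toy input the decoded path stays in "noncoding" over the AT-rich half and switches once to "coding" for the GC-rich half, because one transition penalty costs less than emitting eight unlikely bases.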
Accuracy Assessment • PP=predicted coding • PN=predicted non-coding • AP=“real” positives • AN=“real” negatives • TP=number correct positive • TN=number correct negative • FP=number false positive • FN=number false negative • Sn=TP/AP • Sp=TP/PP • AC = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
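The measures on this slide can be computed per nucleotide from two parallel label lists. The sketch below follows the slide's definitions (Sn=TP/AP with AP=TP+FN, Sp=TP/PP with PP=TP+FP, and the AC formula as written); the function name and the 0/1 encoding of labels are assumptions for the example, and it assumes none of the four denominators is zero.

```python
def accuracy(predicted, actual):
    """Sn, Sp and AC from per-nucleotide labels (1 = coding, 0 = non-coding).

    Assumes TP+FN, TP+FP, TN+FP and TN+FN are all non-zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    sn = tp / (tp + fn)          # Sn = TP/AP
    sp = tp / (tp + fp)          # Sp = TP/PP
    ac = (tp / (tp + fn) + tp / (tp + fp)
          + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn, "Sn": sn, "Sp": sp, "AC": ac}

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(accuracy(predicted, actual))
```

Note that Sp as defined here is the fraction of predicted coding positions that are really coding, which differs from the "specificity" TN/(TN+FP) used in some other fields.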
Accuracy Levels DNA Sequence Error Rate!??
NEXT • Regulatory Sequences • Known Consensus Sequences • Consensus Sequence Generation • Functional (Lab) Data • Real examples
Gene Regulatory Sequences • Functional sites • Consensus • Experimental tests • Inferred sites • Transcriptome analysis
Regulatory Sites • Transcript initiation • mRNA processing • Translation sites
Regulatory Factors • lacI, trpR, CAP, araC…. • GAL4, NDT80… Known from experiment Infer from genome? Infer from expression data?
EUKARYOTES • More complex signals • More genes • More dispersed signals • Combinatoric regulation common
Enhancer Elements • Octamer OCT1, OCT2 • Name some… False +, False -
Consensus Sequence Databases • WWW-based • TFD (transcription factor database) • BCM Search launcher
Transcriptome Analyses • Microarray transcription analysis • MEME analysis of clusters More later....
Practical Gene Finding • Use ALL tools • Comparative • BLASTN, BLASTX • Predictive: Stitch together a consensus • HMM, GRAIL… • ORF finders • Findpatterns (and WWW pattern searches) • cDNA OR protein OR genetic evidence
FRAMES-aldolase gene
If aldolase is so tough, how do you really do it? Combine DNA sequence with other data!
Infer Promoter, Enhancer Test in cis Genome-cDNA P DNA sequencing Align (GAP) cDNA
Comparative Genomics • Conservation of coding regions • Identification of transcription signals • “words” in common