320 likes | 357 Views
Learn about gene structure identification in Eukaryotic genomes, including exons, introns, gene prediction, alternative splicing, and functional sequence analysis. Explore various tools and models for gene finding and assessment accuracy.
E N D
Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding Previous reading: 1.3, 9.1-9.6 Reading: 10.2, 10.4, 10.6-8 BIO520 Bioinformatics Jim Lund
~10% highly repetitive (300 Mbp) NOT GENES ~25% moderate repetitive (750 Mbp) Somegenes ~10% exons and introns (355 Mbp) 45% = ? Regulatory regions Intergenic regions Complex Genome DNA
Eukaryotic Gene Expression Transcribed Region Enhancer Promoter Terminator Transcription RNA Polymerase II Primary transcript 5’ 3’ Intron1 Exon1 Exon2, etc Cap Splice Cleave/Polyadenylate Translation 7mG An N C Transport 7mG An Polypeptide
Yeast ORFS = genes! Small ORFS (RNA genes) Regulatory Sequences
“large” Eukaryotes introns common, LONGER than exons promoter/enhancer genome sparse Fungi introns common, short relative to exons promoter/enhancer genome dense Eukaryotes, cont’d
Intron Prevalence % of genes Introns
Intron Size % of genes Introns
Exon Size % of genes Exon size (bps)
Sew together exons ORF regions consensus sequences domain/polypeptide matches Fungi
Exon/Intron Structure CCACATTgtn(30-10,000)an(5-20)agCAGAA ...CCACATTCAGAA... ...ProHisSerGlu...
Alternative Splice CCACATTgtn(30-10,000)an(5-20)agcagAA ...CCACATTAA... ...ProHisSTOP
Internal exons (donor-acceptor) Initial exons (5’-donor) Terminal exons (acceptor-3’) Single exon genes (5’-3’) Gene prediction targets
Sequence based Consensus sites Signal sequences Homology Confirm prediction is a protein Known coding sequences cDNAs, SAGE Comparative analysis Identify exons, promoter/enhancer elements Gene prediction
High bias = high confidence Low bias = low confidence Codon Bias/Nucleotide Frequency-useful?
Known Consensus Sequences Consensus Sequence Generation Functional Tests Finding Functional Sequences
Position Weight Matrices Sequence Logos Hidden Markov Models Describing consensus sequences
Splicing Consensus A64G73GTA62A68G84T63… Y80NY80Y87R75AY95…C65AGNN Vertebrate GTRNGT(N){30-1000} CTRAC(N){5-15}YAG Fungi Alternate Splicing!??
Non-repetitive DNA!! Long ORF similar to known protein ORF extended by “reasonable” splices ORF begins with “good” ATG Promoter/terminator flanks Linguistic approach to combining gene features
BLASTN DNA:DNA comparison (ALWAYS!) Not sensitive (DNA conservation low) BLASTX/TBLASTX 6 frame ORFS:polypeptide database 6 frames vs. 6 frames of a DNA database DATABASE SEARCH www.ncbi.nlm.nih.gov
Very helpful for the “known” What about the unknown??? Protein Database Matches
Basal Promoters Enhancers/Silencers/Regulatory Sites Boundary elements? Transcription Initation Transcript Initiation Prokaryotes vs Eukaryotes Organism-to-Organism
TATA-box -25 to -30 TBP CCAAT-box -212 to -57 CTF/NF1 GC-box -164 to +1 SP1 K C W K Y Y Y Y +1 to +5 cap signal GC CAAT TATA Basal Promoter Analysis Myers and Maniatis, Genes VI, 831 +1
Basal Promoter Analysis Cao and Moi, Ped Res 51:415-421 (2002)
Exon/Intron Alternate splicing Polyadenylation/Cleavage Stability mRNA processing
Metazoans AATAAA, ATTAAA 15-20 bps 5’ of polyA addition site. YGTGTTYY (diffusive GT-rich sequence) 100-700 bps 3’ UTR typical. Yeast -> different PolyA sites
Initiation site 1st AUG used 95% of the time. Translational regulatory elements translational enhancers upstream ORFs Translation
Genscan Genie GRAIL II: integrated gene parsing GenLang HMMGene (lock ESTs, etc.) GENEMARK Tools-WWW
Probabilistic Models Applicable to linear sequences P(all states)=1, infer probabilities of all states from observed (hidden states unobserved) Work best when local correlations unimportant Genefinding, phylogeny, secondary structure, genetic mapping Parameters are set using a “Training Set” of gene annotations Quantitative probabilities Hidden Markov Models
Accuracy Assessment PP=predicted coding AP=“real” positive TP=number correct positive TN=number correct negative FP=number false positive FN=number false negative Sensitivity=Sn=TP/AP Specificity=Sp=TP/PP Approximate Correlation (AC) = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
Accuracy Levels Bp Exon
Regulatory Sequences Known Consensus Sequences Consensus Sequence Generation Functional (Lab) Data A few examples NEXT