Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding Previous reading: 1.3, 9.1-9.6 Reading: 10.2, 10.4, 10.6-8 BIO520 Bioinformatics Jim Lund
Complex Genome DNA: ~10% highly repetitive (300 Mbp), NOT GENES; ~25% moderately repetitive (750 Mbp), some genes; ~10% exons and introns (355 Mbp); 45% = ? Regulatory regions, intergenic regions.
Eukaryotic Gene Expression. Gene: Enhancer, Promoter, Transcribed Region, Terminator. Transcription (RNA Polymerase II) -> Primary transcript (5’ Exon1, Intron1, Exon2, etc. 3’) -> Cap, Splice, Cleave/Polyadenylate -> 7mG ... An -> Transport -> 7mG ... An -> Translation -> Polypeptide (N ... C).
Yeast: ORFs = genes! Small ORFs (RNA genes); regulatory sequences.
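Because yeast genes are largely uninterrupted ORFs, a first-pass gene scan can simply look for long ATG-to-stop stretches. The following is a minimal sketch of that idea, not a method from the slides: it scans only the forward strand, and the function name `find_orfs` and the `min_codons` cutoff are illustrative assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start/end are 0-based, end-exclusive, and include the stop codon.
    min_codons is a hypothetical cutoff (ATG included, stop excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current candidate ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs
```

A real scan would also search the reverse complement, and the small ORFs mentioned on the slide would fall below any length cutoff, which is why they are hard to call.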
Eukaryotes, cont’d. “Large” eukaryotes: introns common, LONGER than exons; promoter/enhancer; genome sparse. Fungi: introns common, short relative to exons; promoter/enhancer; genome dense.
[Figure: Intron Prevalence; axes: % of genes vs. introns]
[Figure: Intron Size; axes: % of genes vs. introns]
[Figure: Exon Size; axes: % of genes vs. exon size (bp)]
Fungi. Sew together exons: ORF regions; consensus sequences; domain/polypeptide matches.
Exon/Intron Structure
Genomic: CCACATTgtn(30-10,000)an(5-20)agCAGAA
Spliced: ...CCACATTCAGAA...
Protein: ...ProHisSerGlu...
Alternative Splice
Genomic: CCACATTgtn(30-10,000)an(5-20)agcagAA
Spliced: ...CCACATTAA...
Protein: ...ProHisSTOP
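The two splicing examples can be reproduced with a short sketch. It assumes the slide's convention that intron bases are lowercase and exon bases uppercase, uses `n` as filler for the unspecified intron bases, and prints one-letter amino acid codes (`*` for STOP). The names `splice` and `translate` are illustrative, not from the slides.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def splice(pre_mrna):
    """Drop intron bases (lowercase on the slide), keep exon bases (uppercase)."""
    return "".join(base for base in pre_mrna if base.isupper())

def translate(cds):
    """Translate in frame 0 and stop after the first stop codon ('*')."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

intron_core = "gt" + "n" * 30 + "a" + "n" * 5  # gt n(30) a n(5), filler bases
normal = "CCACATT" + intron_core + "ag" + "CAGAA"
alternative = "CCACATT" + intron_core + "agcag" + "AA"  # acceptor 3 bases later

print(splice(normal), translate(splice(normal)))            # CCACATTCAGAA PHSE
print(splice(alternative), translate(splice(alternative)))  # CCACATTAA PH*
```

Using the downstream acceptor removes three more bases, which changes the codon after His from TCA (Ser) to TAA (stop).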
Gene prediction targets: internal exons (acceptor-donor); initial exons (5’-donor); terminal exons (acceptor-3’); single exon genes (5’-3’).
Gene prediction. Sequence based: consensus sites, signal sequences. Homology: confirm prediction is a protein. Known coding sequences: cDNAs, SAGE. Comparative analysis: identify exons, promoter/enhancer elements.
Codon Bias/Nucleotide Frequency: useful? High bias = high confidence; low bias = low confidence.
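One simple way to look at codon bias is to tabulate how often each codon occurs in a candidate coding sequence. The sketch below only computes those frequencies; how they are turned into a confidence score is not specified on the slide, and the function name `codon_usage` is an assumption.

```python
from collections import Counter

def codon_usage(cds):
    """Return {codon: fraction of all codons} for an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}

usage = codon_usage("ATGGCTGCTTAA")
print(usage["GCT"])  # 0.5: two of the four codons are GCT
```

A candidate whose frequencies match those of known genes from the same organism would count as high bias; a flat distribution gives little evidence either way.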
Finding Functional Sequences: known consensus sequences; consensus sequence generation; functional tests.
Describing consensus sequences: position weight matrices; sequence logos; hidden Markov models.
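As an illustration of the first of these, here is a minimal position weight matrix sketch. It assumes a uniform background (0.25 per base) and a pseudocount of 1; the aligned sites are made-up examples, and the function names are illustrative.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Log2-odds matrix from equal-length aligned sites (uniform background)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm

def score(pwm, seq):
    """Sum of per-position log-odds for a window the same length as the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))

def best_hit(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))

# Hypothetical aligned sites, for illustration only.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(sites)
print(best_hit(pwm, "CCCGTAAGTCCC"))  # 3
```

Unlike a plain consensus string, the matrix keeps the per-position frequencies, so a near match still gets a graded score.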
Splicing Consensus
Vertebrate: A64 G73 G T A62 A68 G84 T63 … Y80 N Y80 Y87 R75 A Y95 … C65 A G N N (numbers = % occurrence of that base at the position)
Fungi: GTRNGT(N){30-1000} CTRAC(N){5-15}YAG
Alternate splicing!??
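The fungal consensus is already close to a regular expression, so it can be searched for directly. The sketch below expands the IUPAC codes R, Y and N into character classes and uses non-greedy gaps; the names `iupac`, `FUNGAL_INTRON` and `find_introns` are illustrative, and treating every match as an intron is an assumption that real data would not support.

```python
import re

IUPAC = {"R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac(motif):
    """Expand IUPAC ambiguity codes in a motif into regex character classes."""
    return "".join(IUPAC.get(base, base) for base in motif)

# GTRNGT (N){30-1000} CTRAC (N){5-15} YAG, with non-greedy gaps
FUNGAL_INTRON = re.compile(
    iupac("GTRNGT") + "[ACGT]{30,1000}?"
    + iupac("CTRAC") + "[ACGT]{5,15}?"
    + iupac("YAG")
)

def find_introns(seq):
    """Return (start, end) spans of non-overlapping matches, end-exclusive."""
    return [(m.start(), m.end()) for m in FUNGAL_INTRON.finditer(seq.upper())]

# Synthetic test sequence: 4 bp exon, candidate intron, 4 bp exon.
seq = "AAAA" + "GTATGT" + "A" * 40 + "CTAAC" + "T" * 8 + "CAG" + "GGGG"
print(find_introns(seq))  # [(4, 66)]
```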
Linguistic approach to combining gene features: non-repetitive DNA!!; long ORF similar to known protein; ORF extended by “reasonable” splices; ORF begins with “good” ATG; promoter/terminator flanks.
DATABASE SEARCH (www.ncbi.nlm.nih.gov). BLASTN: DNA:DNA comparison (ALWAYS!); not sensitive (DNA conservation low). BLASTX: 6-frame ORFs vs. a polypeptide database. TBLASTX: 6 frames vs. 6 frames of a DNA database.
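The "6 frame" translation that BLASTX and TBLASTX search with can be sketched as follows. This only illustrates the translation step, not BLAST itself; the frame labels (`+1` to `-3`) and function names are illustrative choices, and codons with non-ACGT bases are shown as `X`.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translations(seq):
    """Translate all three frames of both strands; '*' marks a stop codon."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = (s[i:i + 3] for i in range(offset, len(s) - 2, 3))
            frames[f"{strand}{offset + 1}"] = "".join(
                CODON_TABLE.get(codon, "X") for codon in codons
            )
    return frames

frames = six_frame_translations("ATGGCCTAA")
print(frames["+1"], frames["-1"])  # MA* LGH
```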
Protein Database Matches: very helpful for the “known”. What about the unknown???
Transcription Initiation: basal promoters; enhancers/silencers/regulatory sites; boundary elements? Transcript initiation: prokaryotes vs eukaryotes; organism-to-organism.
Basal Promoter Analysis
Element      Position      Binding factor / consensus
TATA-box     -25 to -30    TBP
CCAAT-box    -212 to -57   CTF/NF1
GC-box       -164 to +1    SP1
Cap signal   +1 to +5      K C W K Y Y Y Y
[Diagram: GC, CAAT and TATA elements upstream of +1]
Myers and Maniatis, Genes VI, 831
Basal Promoter Analysis Cao and Moi, Ped Res 51:415-421 (2002)
mRNA processing: exon/intron; alternate splicing; polyadenylation/cleavage; stability.
Metazoans: AATAAA or ATTAAA, 15-20 bp 5’ of the polyA addition site; YGTGTTYY (diffuse GT-rich sequence); 3’ UTRs of 100-700 bp are typical. Yeast -> different polyA sites.
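Locating candidate metazoan polyadenylation signals comes down to finding the two hexamers. The sketch below only reports where AATAAA or ATTAAA occur; it does not check the downstream GT-rich element or predict the cleavage position, and the names `POLYA_SIGNAL` and `polya_signals` are illustrative.

```python
import re

# A lookahead lets overlapping occurrences be reported too.
POLYA_SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")

def polya_signals(seq):
    """Return (0-based start, hexamer) for each candidate polyA signal."""
    return [(m.start(), m.group(1)) for m in POLYA_SIGNAL.finditer(seq.upper())]

# Synthetic test sequence, for illustration only.
utr = "CCCC" + "AATAAA" + "G" * 18 + "ATTAAA"
print(polya_signals(utr))  # [(4, 'AATAAA'), (28, 'ATTAAA')]
```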
Translation. Initiation site: 1st AUG used 95% of the time. Translational regulatory elements: translational enhancers; upstream ORFs.
Tools-WWW: Genscan; Genie; GRAIL II: integrated gene parsing; GenLang; HMMGene (lock ESTs, etc.); GENEMARK.
Hidden Markov Models. Probabilistic models, applicable to linear sequences. P(all states)=1; the probabilities of all states are inferred from the observed sequence (hidden states are unobserved). Work best when long-range correlations are unimportant. Uses: gene finding, phylogeny, secondary structure, genetic mapping. Parameters are set using a “Training Set” of gene annotations. Output: quantitative probabilities.
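To make the hidden/observed distinction concrete, here is a toy two-state HMM decoded with the Viterbi algorithm, which finds the single most probable state path. All probabilities below are invented for illustration (a real gene finder estimates them from a training set and uses many more states), and Viterbi is one standard decoding choice rather than something the slides specify.

```python
import math

STATES = ("coding", "noncoding")
# Hypothetical parameters, for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}

def viterbi(seq):
    """Most probable hidden-state path for seq, computed in log space."""
    # v[s] = log-probability of the best path so far that ends in state s
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_v, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[p] + math.log(TRANS[p][s]))
            new_v[s] = v[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointer[s] = prev
        backpointers.append(pointer)
        v = new_v
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATATAT" + "GCGCGCGCGC")
print(path[9], path[10])  # noncoding coding
```

Each state depends only on the previous one, which is why such models capture local context well but cannot represent dependencies between distant positions.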
Accuracy Assessment
PP = predicted coding (predicted positives); AP = “real” (actual) positives
TP = number of correct positives; TN = number of correct negatives
FP = number of false positives; FN = number of false negatives
Sensitivity = Sn = TP/AP
Specificity = Sp = TP/PP
Approximate Correlation (AC) = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
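These measures are easy to compute from the four counts. The sketch below assumes AP = TP + FN and PP = TP + FP, which follows from the definitions above; the function name `accuracy` and the example counts are made up, and no guard against zero denominators is included.

```python
def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) as defined on the slide.

    Assumes AP = TP + FN and PP = TP + FP. Note that Sp here is TP/PP,
    the gene-finding convention, not TN/(TN+FP).
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = (tp / (tp + fn) + tp / (tp + fp)
          + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return sn, sp, ac

# Hypothetical base-level counts, for illustration only.
sn, sp, ac = accuracy(tp=80, tn=880, fp=20, fn=20)
print(round(sn, 3), round(sp, 3), round(ac, 3))  # 0.8 0.8 0.778
```

AC ranges from -1 to 1, and a perfect prediction (FP = FN = 0) gives Sn = Sp = AC = 1.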
Accuracy Levels [table not recovered; column headings: Bp, Exon]
Regulatory Sequences: known consensus sequences; consensus sequence generation; functional (lab) data. A few examples NEXT.