1 / 32

Gene Structure and Identification

Gene Structure and Identification. Eukaryotic Genes and Genomes Gene Finding. Previous reading: 1.3, 9.1-9.6 Reading: 10.2, 10.4, 10.6-8. BIO520 Bioinformatics Jim Lund. ~10% highly repetitive (300 Mbp) NOT GENES ~25% moderate repetitive (750 Mbp) Some genes

Download Presentation

Gene Structure and Identification

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding Previous reading: 1.3, 9.1-9.6 Reading: 10.2, 10.4, 10.6-8 BIO520 Bioinformatics Jim Lund

  2. ~10% highly repetitive (300 Mbp) NOT GENES ~25% moderate repetitive (750 Mbp) Somegenes ~10% exons and introns (355 Mbp) 45% = ? Regulatory regions Intergenic regions Complex Genome DNA

  3. Eukaryotic Gene Expression Transcribed Region Enhancer Promoter Terminator Transcription RNA Polymerase II Primary transcript 5’ 3’ Intron1 Exon1 Exon2, etc Cap Splice Cleave/Polyadenylate Translation 7mG An N C Transport 7mG An Polypeptide

  4. Yeast ORFS = genes! Small ORFS (RNA genes) Regulatory Sequences

  5. “large” Eukaryotes introns common, LONGER than exons promoter/enhancer genome sparse Fungi introns common, short relative to exons promoter/enhancer genome dense Eukaryotes, cont’d

  6. Intron Prevalence % of genes Introns

  7. Intron Size % of genes Introns

  8. Exon Size % of genes Exon size (bps)

  9. Sew together exons ORF regions consensus sequences domain/polypeptide matches Fungi

  10. Exon/Intron Structure CCACATTgtn(30-10,000)an(5-20)agCAGAA ...CCACATTCAGAA... ...ProHisSerGlu...

  11. Alternative Splice CCACATTgtn(30-10,000)an(5-20)agcagAA ...CCACATTAA... ...ProHisSTOP

  12. Internal exons (donor-acceptor) Initial exons (5’-donor) Terminal exons (acceptor-3’) Single exon genes (5’-3’) Gene prediction targets

  13. Sequence based Consensus sites Signal sequences Homology Confirm prediction is a protein Known coding sequences cDNAs, SAGE Comparative analysis Identify exons, promoter/enhancer elements Gene prediction

  14. High bias = high confidence Low bias = low confidence Codon Bias/Nucleotide Frequency-useful?

  15. Known Consensus Sequences Consensus Sequence Generation Functional Tests Finding Functional Sequences

  16. Position Weight Matrices Sequence Logos Hidden Markov Models Describing consensus sequences

  17. Translation Initiation Sites

  18. Splicing Consensus A64G73GTA62A68G84T63… Y80NY80Y87R75AY95…C65AGNN Vertebrate GTRNGT(N){30-1000} CTRAC(N){5-15}YAG Fungi Alternate Splicing!??

  19. Non-repetitive DNA!! Long ORF similar to known protein ORF extended by “reasonable” splices ORF begins with “good” ATG Promoter/terminator flanks Linguistic approach to combining gene features

  20. BLASTN DNA:DNA comparison (ALWAYS!) Not sensitive (DNA conservation low) BLASTX/TBLASTX 6 frame ORFS:polypeptide database 6 frames vs. 6 frames of a DNA database DATABASE SEARCH www.ncbi.nlm.nih.gov

  21. Very helpful for the “known” What about the unknown??? Protein Database Matches

  22. Basal Promoters Enhancers/Silencers/Regulatory Sites Boundary elements? Transcription Initation Transcript Initiation Prokaryotes vs Eukaryotes Organism-to-Organism

  23. TATA-box -25 to -30 TBP CCAAT-box -212 to -57 CTF/NF1 GC-box -164 to +1 SP1 K C W K Y Y Y Y +1 to +5 cap signal GC CAAT TATA Basal Promoter Analysis Myers and Maniatis, Genes VI, 831 +1

  24. Basal Promoter Analysis Cao and Moi, Ped Res 51:415-421 (2002)

  25. Exon/Intron Alternate splicing Polyadenylation/Cleavage Stability mRNA processing

  26. Metazoans AATAAA, ATTAAA 15-20 bps 5’ of polyA addition site. YGTGTTYY (diffusive GT-rich sequence) 100-700 bps 3’ UTR typical. Yeast -> different PolyA sites

  27. Initiation site 1st AUG used 95% of the time. Translational regulatory elements translational enhancers upstream ORFs Translation

  28. Genscan Genie GRAIL II: integrated gene parsing GenLang HMMGene (lock ESTs, etc.) GENEMARK Tools-WWW

  29. Probabilistic Models Applicable to linear sequences P(all states)=1, infer probabilities of all states from observed (hidden states unobserved) Work best when local correlations unimportant Genefinding, phylogeny, secondary structure, genetic mapping Parameters are set using a “Training Set” of gene annotations Quantitative probabilities Hidden Markov Models

  30. Accuracy Assessment PP=predicted coding AP=“real” positive TP=number correct positive TN=number correct negative FP=number false positive FN=number false negative Sensitivity=Sn=TP/AP Specificity=Sp=TP/PP Approximate Correlation (AC) = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1

  31. Accuracy Levels Bp Exon

  32. Regulatory Sequences Known Consensus Sequences Consensus Sequence Generation Functional (Lab) Data A few examples NEXT

More Related