Genome Annotation

Genome Annotation. BBSI July 14, 2005 Rita Shiang. Genome Annotation. Identification of important components in genomic DNA. What is a Gene?. Fundamental unit of heredity

Presentation Transcript


  1. Genome Annotation BBSI July 14, 2005 Rita Shiang

  2. Genome Annotation • Identification of important components in genomic DNA

  3. What is a Gene? • Fundamental unit of heredity • DNA involved in producing a polypeptide; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) • Entire DNA sequence including exons, introns, and noncoding transcription-control regions

  4. What Components are Important in Protein Coding Genes? • Sequences that initiate transcription • Sequences that process hnRNA to mRNA • Signals important in translation

  5. TATA Box Lodish et al., Molecular Cell Biology, 2000, Fig. 10.30.

  6. Other Promoters • Initiator consensus • 5’Py Py A(+1) N T/A Py Py Py • N = A, T, G or C • Py = pyrimidine = C or T • GC rich sequences • Stretch of 20-50 GC nucleotides ~100 bp upstream of start site (CpG not common in genome) • Housekeeping genes • Multiple initiation sites
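The initiator consensus on this slide maps directly onto a regular expression. The following is a minimal sketch, assuming a DNA string as input; the function name `find_initiators` and the example sequence are hypothetical and not part of the presentation.

```python
import re

# Initiator consensus from the slide: Py Py A(+1) N T/A Py Py Py
# Py = C or T, N = any base. A lookahead is used so overlapping hits are reported.
INITIATOR = re.compile(r"(?=([CT][CT]A[ACGT][AT][CT][CT][CT]))")

def find_initiators(dna):
    """Return 0-based positions of the +1 A for each initiator-like match."""
    dna = dna.upper()
    # The A(+1) is the third base of the 8-base motif, hence the offset of 2.
    return [m.start() + 2 for m in INITIATOR.finditer(dna)]

example = "GGGCTCAGTCTCCGGG"  # hypothetical sequence, for illustration only
print(find_initiators(example))
```

A match only says the sequence resembles the consensus; it does not show that transcription actually initiates there.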

  7. Polyadenylation & Cleavage • Addition of a string of As to mRNAs • Polyadenylation signal AAUAAA found before cleavage site • GU or UU rich region ~50 bp from the cleavage site • Stabilizes mRNA transcripts Lodish et al., Molecular Cell Biology, 2000, Fig. 11.23.
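Scanning for the AAUAAA signal is a simple substring search. This sketch assumes the input may be written as DNA (T) or RNA (U); the function name `find_polya_signals` is hypothetical.

```python
def find_polya_signals(seq):
    """Return 0-based start positions of every AAUAAA in the sequence."""
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA notation
    signal = "AAUAAA"
    positions = []
    start = rna.find(signal)
    while start != -1:
        positions.append(start)
        # Step by one base so overlapping occurrences are also found.
        start = rna.find(signal, start + 1)
    return positions

print(find_polya_signals("CCAATAAAGGGG"))
```

In practice a hit would be checked against the other features on the slide, such as a GU or UU rich region near the cleavage site, before being called a polyadenylation signal.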

  8. Splicing Electron micrograph of adenovirus DNA and hexon gene mRNA Lodish et al., Molecular Cell Biology, 2000, Fig. 11.13.

  9. Splice Reaction Lodish et al., Molecular Cell Biology, 2000, Fig. 11.15.

  10. Splice Sites Lodish et al., Molecular Cell Biology, 2000, Fig. 11.14.

  11. Additional Splice Sites • Consensus Py7NCAG-G(exon)AG – GUAAGU: 98.12% • Nonconsensus GC • U12 introns ACPuUAUCCUPy: 0.76% • Other rare sequences: 1% • Py = C or U • Pu = A or G

  12. Translation Signals • 5’ cap structure directs ribosomal binding • AUG codes for methionine. Translation usually starts at the first AUG in a transcript • Open reading frame (ORF) • Stretch of sequence that codes for amino acids before a stop codon • Translation stop codons UAG, UAA, UGA
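The ORF definition above can be sketched in a few lines: start at the first AUG and read codon by codon until an in-frame stop codon. This is an illustrative sketch only; the function name `first_orf` and the example transcript are hypothetical.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}  # stop codons listed on the slide

def first_orf(mrna):
    """Return (start, end, codons) for the ORF at the first AUG, or None.

    start is the 0-based index of the AUG; end is the index just past the stop codon.
    """
    mrna = mrna.upper().replace("T", "U")  # accept DNA or RNA notation
    start = mrna.find("AUG")
    if start == -1:
        return None  # no start codon
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            return start, i + 3, codons
        codons.append(codon)
    return None  # ran off the end without an in-frame stop codon

print(first_orf("GGAUGGCCUAAGG"))
```

Real gene finders do not rely on the first AUG alone, since the sketch ignores splicing and sequence context around the start codon.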

  13. Capping of the 5’ End of RNA with 7-methylguanylate (m7G) Lodish et al., Molecular Cell Biology, 2000, Fig. 11.8.

  14. Known Gene Components Lodish et al., Molecular Cell Biology, 2000, Fig. 10.34.

  15. Genome Annotation • What is in a genome besides protein coding genes?

  16. Repetitive DNA makes up at least 50% of the genome • Transposon-derived interspersed repeats • Inactive retroposed copies of genes (pseudogenes) • Simple short repeats • Segmental duplications • Blocks of tandemly repeated sequences • Centromeres • Telomeres • Short arm of acrocentric chromosomes • Ribosomal gene clusters

  17. Non-protein coding genes or non-coding RNA (ncRNA) • tRNA genes • rRNA genes • snRNA genes • Splicing • Telomere maintenance • snoRNA genes • Other • microRNA

  18. Annotation of Genomic DNA • Identifying Protein Coding Genes • Placing the genes on the genome (where are they?)

  19. How Many Genes in the Genome? • Early estimate based on reassociation kinetics: ~40,000 • Walter Gilbert estimated ~100,000 based on gene and genome size • 70,000 – 80,000 based on an extrapolated number of CpG islands • With the human genome sequence the estimate is 30,000 – 40,000

  20. Annotation of Genomic DNA Specifically for Genes that Code for Proteins • Match genomic DNA to genes that have been previously cloned and sequenced looking for sequence similarity using BLAST programs • Predict genes using computer programs to scan genomic DNA using known elements • Many strategies use a combination of both methods

  21. cDNA Library Construction Lodish et al., Molecular Cell Biology, 2000, Fig. 7.14.

  22. Lodish et al., Molecular Cell Biology, 2000, Fig. 7.15.

  23. Gene Annotation: Celera • Constructed gene models using sequence from cDNAs • Used UniGene database • Partitions GenBank sequences (mRNAs & ESTs) into a non-redundant set using 3’ UTRs • 111,064 UniGene clusters for human

  24. Gene Annotation: Celera cont. • Predicts gene boundaries by identifying overlapping sets of EST and protein matches • Known full-length genes were annotated on the map (matched with 50% of the length and >92% identity) • Clusters that did not match a full-length gene were evaluated using other references • Conservation of genomic sequence between mouse & human • Similarity between human & rodent transcripts • Similarity to known proteins
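The matching thresholds on this slide (50% of the length, >92% identity) amount to a simple filter over alignments. The sketch below is not Celera's code; the `Alignment` record, its field names, and the reading of "50% of the length" as "at least 50%" are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    """Hypothetical record for one known gene aligned to genomic sequence."""
    gene_id: str
    gene_length: int       # length of the known full-length gene, in bases
    aligned_length: int    # bases of the gene covered by the alignment
    percent_identity: float

def passes_threshold(aln, min_coverage=0.50, min_identity=92.0):
    """Apply the slide's thresholds: coverage of at least 50% (assumed) and identity above 92%."""
    coverage = aln.aligned_length / aln.gene_length
    return coverage >= min_coverage and aln.percent_identity > min_identity

# Hypothetical alignments, for illustration only
hits = [
    Alignment("geneA", 2000, 1200, 95.0),  # 60% coverage, 95% identity
    Alignment("geneB", 2000, 800, 99.0),   # 40% coverage: too short
    Alignment("geneC", 2000, 1500, 90.0),  # identity too low
]
print([h.gene_id for h in hits if passes_threshold(h)])
```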

  25. Validation • Validated by construction of known genes (RefSeq) • 6.1% of RefSeq genes were not annotated by Otto (Celera’s automated annotation system)

  26. Gene Annotation - Human Genome Sequencing Consortium • Start with Ensembl predicted genes • ab initio predictions using Genscan • Based on probabilistic model of genome sequence composition and gene structure • Confirm similarity to mRNAs, ESTs, protein motifs from all organisms • Extend protein matches using GeneWise • Compares protein based information to genomic sequence and allows for frameshifts and large introns • Produces partial gene predictions

  27. Consortium cont. • Merge Ensembl gene predictions with Genie predictions • Genie identifies matches of mRNAs and ESTs • Employs hidden Markov models (HMMs) to extend matches using ab initio statistical methods • Links information from 5’ and 3’ ESTs from the same cDNA clone to complete a sequence from the ATG to the stop codon • Can generate alternatively spliced products (though only the longest is used in this build) • Merge results with genes in RefSeq, SWISS-PROT and TrEMBL databases

  28. Validation • Validate method by comparing to a new set of known genes, a set of mouse cDNAs and genes on Chromosome 22 (Finished Sequence) • 85% Sensitivity • 13% spurious predictions
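The two validation figures can be computed from counts of confirmed and unconfirmed predictions. The slide does not say how "spurious" was defined, so the sketch below assumes it means the fraction of predictions with no support; the function names and the counts are hypothetical, chosen only to reproduce 85% and 13%.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of known genes that the method recovered."""
    return true_positives / (true_positives + false_negatives)

def spurious_fraction(true_positives, false_positives):
    """Fraction of predictions that are unsupported (assumed definition)."""
    return false_positives / (true_positives + false_positives)

# Hypothetical counts, not taken from the presentation
print(f"sensitivity: {sensitivity(85, 15):.0%}")
print(f"spurious: {spurious_fraction(87, 13):.0%}")
```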

  29. Factors Affecting Gene Annotation • Splice sites do not conform to consensus • Noncoding exons are common • Exon: what remains after splicing removes the introns; the term does not refer to a stretch of coding information • tRNAs are spliced but noncoding • >35% of human genes have noncoding exons • No statistical bias so they are difficult to identify

  30. Factors Affecting Gene Annotation Cont. • Internal exons can be very small • Avg. size of internal exons is ~130 bp • ~65% of vertebrate exons are 68-208 bp • >10% are <60 bp • Exons < 10 bp have been identified • invected gene in Drosophila • One of four exons is 6 bp (GTCGAA) • Flanked by introns of 27.6 and 1.1 kb • Not correctly recognized by cDNA alignment software, which creates a frameshift in the gene • Exons of size 0 • Resizing exons create an intermediate splice product

  31. Places to View Annotated Genomes • National Center for Biotechnology Information (NCBI) • Ensembl • The Golden Path (UCSC Genome Browser) • Celera

  32. Verification of Annotation in C. elegans by Experimentation • Complete genomic sequence • Small introns • Small intergenic regions

  33. Results • 11,984 cDNAs successfully cloned out of a prediction of 19,477 • 4,365 were not represented by cDNAs or ESTs • Failure of cloning could be due to: • Wrongly predicted exons • Genes expressed at very low levels • Not a real gene
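As a quick check on the scale of these results, the cloning success rate follows from the two counts on the slide. The variable names below are illustrative.

```python
predicted_genes = 19477  # predicted genes, from the slide
cloned_cdnas = 11984     # cDNAs successfully cloned, from the slide

success_rate = cloned_cdnas / predicted_genes
not_cloned = predicted_genes - cloned_cdnas

print(f"cloned: {success_rate:.1%} of predictions, not cloned: {not_cloned}")
```

Roughly three in five predicted genes yielded a cloned cDNA, which leaves several thousand predictions without direct experimental support.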

  34. Verification of intron/exon structures

  35. Comparison of a Single Transcript

  36. Greater than 50% of intron/exon structures need correcting?
