The Havana-Gencode annotation

GENCODE Consortium

Loci annotated in the 44 ENCODE regions

Experimental validations of the manual annotations
The annotations produced by the Havana team at Sanger are being verified experimentally through RT-PCRs and RACEs (University of Geneva).
Initial annotation
Experimental validations
- Experimental validation of the annotated single exons
- 5' RACEs to obtain full-length mRNA(s)
- RT-PCRs to check 360 junctions
Updated annotation
New set of confirmed genes
- Bidirectional RACEs to obtain full-length mRNAs

Experimental validations of the manual annotations: 5' RACEs to extend Known and Novel protein genes
- 214 / 426 loci gave a positive RACE for at least one primer (50%)
- About 10% of the successful RACEs extend the loci in 5' (and some provide new exon junctions)
(Some RACE products are still being analysed.)

Experimental validations of the manual annotations: RT-PCRs on VEGA Novel_transcript and Putative loci
• When more than one junction was submitted for the same transcript, all the junctions gave the same result in 2/3 of the cases (mostly all junctions negative).
• The Novel_transcript loci have a higher success rate than the Putative loci (in accordance with their definitions).

Experimental validations of the manual annotations: RT-PCRs on non-canonical splice sites
• 43 non-canonical splice sites (neither GT-AG nor GC-AG) were detected in the 13 training ENCODE regions
• 32 could be tested by RT-PCR (for the others, the exons were too short for primer picking):
  - 1 was confirmed: it is actually a canonical U12 intron (AT-AC)
  - 6 provided canonical junctions (already existing in other annotated splice forms)
  - 25 were negative
=> None of the non-canonical splice sites could be validated experimentally
(83 other splice sites are being checked in the 31 other regions)
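The splice-site categories used on this slide can be sketched in a few lines. This is only an illustration: the function name `classify_intron` is hypothetical, and it assumes the intron sequence is given 5' to 3' on the transcribed strand. It is not the procedure used by the annotators.

```python
def classify_intron(intron_seq: str) -> str:
    """Classify an intron by its donor (first two) and acceptor (last two) bases.

    Assumption: intron_seq is the full intron, 5'->3', on the transcribed strand.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to hold both splice-site dinucleotides")
    pair = (seq[:2], seq[-2:])
    if pair in {("GT", "AG"), ("GC", "AG")}:
        return "canonical"          # the two classes treated as canonical on the slide
    if pair == ("AT", "AC"):
        return "U12 (AT-AC)"        # the class of the single confirmed junction
    return "non-canonical"          # candidates for RT-PCR checking


if __name__ == "__main__":
    for s in ("GTAAGTCCTTTTAG", "GCAAGTCCTTTTAG", "ATATCCTTCCTTAC", "GGAAGTCCTTTTAG"):
        print(s, classify_intron(s))
```

Applied to every annotated intron, a filter of this kind would list the junctions worth submitting to RT-PCR.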

Gene predictions outside of Havana-Gencode annotations
6 computational gene prediction programs (geneid, genscan, SGP, twinscan, fgenesh, exonify); 3 EST-based methods (acembly, Ecgene, Ensembl EST)
In 13 ENCODE regions, 1255 introns predicted by one or more of the 9 methods are not annotated in VEGA:
- 380 (30%) extend VEGA objects (1)
- 530 (42%) are in introns of VEGA objects (2)
- 11 (1%) link exons from distinct VEGA objects (3)
- 334 (27%) are completely outside of VEGA annotations (4)
[Figure: schematic of categories (1) to (4), showing the predictions relative to the Havana-Gencode annotation]
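The four categories can be illustrated with a small interval sketch. The exact rules used in the study are not given on the slide, so the definitions below are assumptions: a locus is a list of exon intervals (1-based, inclusive), a predicted intron "touches" a locus when it overlaps or abuts the locus span, and the function name `classify_predicted_intron` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def annotated_introns(exons: List[Interval]) -> List[Interval]:
    """Gaps between consecutive exons of one locus."""
    exons = sorted(exons)
    return [(a[1] + 1, b[0] - 1) for a, b in zip(exons, exons[1:])]


def classify_predicted_intron(intron: Interval, loci: List[List[Interval]]) -> int:
    """Return the slide's category number (1-4) under the assumed definitions."""
    start, end = intron
    touched = []
    for i, exons in enumerate(loci):
        lo = min(s for s, _ in exons)
        hi = max(e for _, e in exons)
        # Overlaps the locus span, or directly abuts its first/last exon.
        if start <= hi + 1 and end >= lo - 1:
            touched.append(i)
    if not touched:
        return 4                      # completely outside the annotation
    if len(touched) >= 2:
        return 3                      # links distinct annotated objects
    exons = loci[touched[0]]
    if any(a <= start and end <= b for a, b in annotated_introns(exons)):
        return 2                      # lies inside an annotated intron
    return 1                          # otherwise: extends the annotated object


if __name__ == "__main__":
    loci = [[(100, 200), (300, 400)], [(1000, 1100), (1200, 1300)]]
    for intron in [(401, 500), (220, 280), (401, 999), (500, 600)]:
        print(intron, classify_predicted_intron(intron, loci))
```

Overlapping or nested loci would need a more careful rule than this span test.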

Gene predictions outside of Havana-Gencode annotations: RT-PCRs on exon junctions
1255 predicted introns tested:
=> 16 RT-PCRs confirmed the predicted junction, 9 provided another junction (excluding pseudogenes)
=> Only 3 are intergenic (new loci?) --> being extended by RACE
*1: RT-PCR successful; 2: RT-PCR provided a product with a wrong exon junction

Gene predictions outside of Havana-Gencode annotations: the last 31 regions
- About 3500 introns predicted by standard programs from UCSC tracks are outside of the Havana-Gencode annotation (about 900 intergenic). Very few of those are likely to be true positives (=> need to prioritize)
- Additionally, the EGASP predictions add about 7000 other new introns (about 1000 intergenic)

Description of the annotations: gene density

Description of the annotations: alternative splicing
Average: 4.2 transcripts per locus, 6.7 exons per transcript

Description of the annotations: coding loci
424 coding loci in the 44 ENCODE regions
On average, 44.6% of the transcripts are annotated as coding

Description of the annotations: lengths of exons, introns, cds, utrs…

Comparison between Havana-Gencode annotation and other sets ENSEMBL, REFSEQ, MGC, CCDS

Gene level
=> Most of the genes from the other sets are contained in the Havana-Gencode annotation (less so for ENSEMBL)

Transcript level
=> Very few full transcripts are exactly identical. The coding part of the transcripts is better conserved.

Relaxed criterion: allows transcripts from the other sets to be included in Havana-Gencode transcripts
[Figure: a Havana-Gencode transcript aligned with transcripts from other sets, some supporting the annotated transcript and some not supporting it]
=> Few transcripts are exactly identical, but most of the transcripts from other sets are included in transcripts from Havana-Gencode, especially MGC genes (these transcripts are not as extended as the annotation)
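A minimal sketch of what such an inclusion test could look like follows. The slide does not spell out the rule, so this assumes that "included" means: the other transcript's introns form a contiguous run of the annotated transcript's introns, and its two ends do not reach beyond the flanking annotated exons. The names `intron_chain` and `is_included` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def intron_chain(exons: List[Interval]) -> List[Interval]:
    """Ordered list of introns implied by a transcript's exons."""
    exons = sorted(exons)
    return [(a[1] + 1, b[0] - 1) for a, b in zip(exons, exons[1:])]


def is_included(other_exons: List[Interval], annotated_exons: List[Interval]) -> bool:
    """Assumed relaxed criterion: 'other' fits inside the annotated transcript."""
    other = sorted(other_exons)
    annot = sorted(annotated_exons)
    o_chain, a_chain = intron_chain(other), intron_chain(annot)
    if not o_chain:
        # Single-exon transcript: must sit inside one annotated exon.
        s, e = other[0]
        return any(a <= s and e <= b for a, b in annot)
    n = len(o_chain)
    for i in range(len(a_chain) - n + 1):
        if a_chain[i:i + n] == o_chain:
            # Matching introns i..i+n-1 are flanked by annotated exons i and i+n.
            return other[0][0] >= annot[i][0] and other[-1][1] <= annot[i + n][1]
    return False


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 650)]
    shorter = [(150, 200), (300, 400), (500, 600)]   # same introns, shorter ends
    print(sorted(shorter) == sorted(annotated))      # strict identity: False
    print(is_included(shorter, annotated))           # relaxed criterion: True
```

The example mirrors the situation described above: a transcript with the same splice junctions but shorter ends fails the strict identity test and passes the relaxed one.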

Transcript level: relaxed criterion

Exon/intron level => More common introns than exons: could be explained by the fact that most differences are in UTRs (last exons)
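The explanation above can be made concrete with a toy example: two transcripts that share every splice junction but differ at their UTR ends have all introns in common and only their internal exons in common. The coordinates and the helper names `introns_of` and `shared_fraction` are invented for illustration and do not come from the ENCODE data.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def introns_of(exons: Iterable[Interval]) -> Set[Interval]:
    """Set of introns implied by a transcript's exons."""
    ordered = sorted(exons)
    return {(a[1] + 1, b[0] - 1) for a, b in zip(ordered, ordered[1:])}


def shared_fraction(a: Set[Interval], b: Set[Interval]) -> float:
    """Fraction of the features in a that are found exactly in b."""
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    # Same splice junctions; only the outer (UTR) ends of the terminal exons differ.
    annotated = {(100, 200), (300, 400), (500, 650)}
    other = {(150, 200), (300, 400), (500, 600)}
    print("exons shared:  ", shared_fraction(annotated, other))
    print("introns shared:", shared_fraction(introns_of(annotated), introns_of(other)))
```

Exact exon matching penalises any difference in transcript ends, whereas intron matching depends only on the splice junctions.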

Nucleotide level

Conclusions
• The Havana-Gencode annotation is richer than the other data sets.
• REFSEQ, MGC and CCDS are almost completely contained in Havana-Gencode, especially CCDS (smaller set)
• ENSEMBL contains more "false positives" (bigger set)
• Transcripts from the other sets are less extended than transcripts from the Havana-Gencode annotation, especially MGC (very few transcripts are completely identical)

Exon pair level (exon-intron-exon)
