Comparison of array detected transcription map with GENCODE/HAVANA annotations in ENCODE regions


Presentation Transcript


  1. Comparison of array detected transcription map with GENCODE/HAVANA annotations in ENCODE regions

  2. Acknowledgements
     • AFFX Transcriptome Group (Computation, Molecular Biology): S. Bekiranov, P. Kapranov, S. Brubaker, I. Bell, J. Cheng, J. Drenkow, S. Ghosh, D. Kampa-Bailey, G. Helt, J. Long, G. Madhavan, J. Manak, S. Patel, V. Sementchenko, H. Tammana, A. Piccolboni
     • Harvard Medical School, NCI: K. Struhl, H. Hirsch, H. H. Ng, E. Sekinger
     • Broad Institute: B. Bernstein, M. Kamal, K. Lindblad-Toh, D. J. Huebert, S. McMahon, E. K. Karlsson, E. J. Kulbokas III, S. L. Schreiber, E. S. Lander
     • Support: NCI Contract (21XS019C Phases I-III) 2001-2006; NHGRI ENCODE Grant; AFFYMETRIX

  3. Transcription Map & Modification Site Generation…I
     Workflow (from the slide diagram): CEL files (RNA or IP) → compute median (M) of all chip medians (if multiple arrays in a set) → Median Scaling → Quantile Normalization → Probe Mapping to Genome → Wilcoxon Signed Rank Test → RNA: Transfrag Generation; Chromatin IP: Site Generation
     • Median Scaling: scale all features on a chip such that chip median = M.
     • Quantile Normalization (QN): QN feature intensities within replicates only. QN Treatment and Control separately.
     • Probe Mapping to Genome: map PM, MM pairs to the genome via exact 25-mer alignment of PM.
     • Wilcoxon Signed Rank Test:
       • Perform the test on probe-pair signal S = log2(PM-MM).
       • Apply a sliding window to estimate the intensity of each probe pair as the pseudo-median of all probes in the window.
       • The sliding window makes use of neighboring probes; this reduces the false positive rate and increases sensitivity.
       • Window size varies with experiment: RNA ~50 bp, IP ~250 bp.
     • Map and Site Generation:
       • RNA: join probes with intensity above the 5% FPR threshold, subject to maxgap and minrun, to generate transcribed fragments (transfrags).
       • Chromatin IP:
         • Generate the Hodges-Lehmann estimator to estimate expression level: logDiff = log2(max((PM-MM)_T, 1)) - log2(max((PM-MM)_C, 1)), where T is Treatment and C is Control.
         • Generate a p-value estimate per probe.
         • Join probes with p-value ≤ 10^-5, subject to maxgap and minrun, to generate modification/transcription factor binding sites.
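The median scaling and sliding-window pseudo-median steps can be sketched as follows. This is a minimal illustration, not the group's actual software: the function names are hypothetical, the window is assumed to be symmetric around each probe (`half_window` bp on either side), and the input signals are assumed to be already-computed probe-pair values.

```python
from itertools import combinations_with_replacement
from statistics import median


def median_scale(chips):
    """Scale every chip so its median equals M, the median of all chip medians."""
    target = median(median(chip) for chip in chips)
    return [[x * target / median(chip) for x in chip] for chip in chips]


def pseudo_median(values):
    """Hodges-Lehmann estimator: median of all pairwise (Walsh) averages."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


def windowed_signal(positions, signals, half_window):
    """Estimate each probe's intensity from all probes within half_window bp.

    Assumption: the window is centered on the probe position; the slides
    give only approximate window sizes, not the exact definition.
    """
    smoothed = []
    for center in positions:
        nearby = [s for p, s in zip(positions, signals)
                  if abs(p - center) <= half_window]
        smoothed.append(pseudo_median(nearby))
    return smoothed
```

For example, `windowed_signal([0, 35, 70, 500], [1.0, 2.0, 3.0, 9.0], 50)` smooths the first three probes using their neighbors, while the isolated probe at position 500 keeps its own value.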

  4. Filtration of 10 Chromosome Data
     • (Cheng, J., et al. Science Express; March 24, 2005)
     • (See UCSC Browser for 8 cell line data; see Version 33)
     • Low Complexity Repeats
     • Processed Pseudogenes
     • BLAT hits more than itself (lose some members of gene families)
     • Use of all filters reduces the transfrags by ~20%, ~30% of which are pseudogenes. With BLAT, the data reduction is 14%.

  5. RACE Model (Need isothermal RT for unannotated transfrags)

  6. RACE Analysis of Coding Gene DiGeorge Critical Region 14 gene

  7. Un-annotated transfrags of PISD are part of at least 9 different, yet overlapping sense-antisense transcripts
     Figure labels: Sense strand; Anti-sense strand

  8. RACE Regions Validated for 768 Loci

  9. Data sets analyzed
     • Part 1:
       a) Analysis done on v34 of the human genome. Total number of ENCODE regions analyzed = 12 (region Enm006 ignored for this analysis since no annotations are available for v34).
       b) Set of known/validated exons.
       c) Set of predicted exons (from multiple gene predictions).
       d) Array detected transcript maps from HL-60 cell lines at 4 time points after RA stimulation (i.e. one cell line at 4 biological states).
     • Part 2:
       a) Analysis done on v35 of the human genome. Total number of ENCODE regions analyzed = 44.
       b) Set of known/validated exons.
       c) Set of Vega putative exons.
       d) Set of predicted exons outside sets b & c (from multiple gene predictions).
       e) Array detected transcript maps from HL-60 cell lines at 4 time points after RA stimulation.

  10. How comparisons are carried out using arrays, annotations and predicted regions
      Figure labels: Repeats (RepeatMasker); Probes, 35 bp avg. distance; Genomic sequence; Annotation (e.g. Vega); Predicted exons; Exon 1 < 100% covered; Exon 2 is 100% covered
      • Coverage of interrogated regions using the algorithms used to call transfrags.
      • Analyses done only within interrogated regions.
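One way to compute per-exon coverage restricted to interrogated regions is sketched below. This is an illustration under stated assumptions, not the analysis code behind the slides: intervals are half-open `(start, end)` pairs on one chromosome, the interrogated intervals are assumed not to overlap each other, and all function names are hypothetical.

```python
def overlap(a, b):
    """Intersection of two half-open intervals, or None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def covered_bp(interval, transfrags):
    """Number of bp in interval overlapped by at least one transfrag."""
    start, end = interval
    covered, cursor = 0, start
    for t_start, t_end in sorted(transfrags):
        # Advancing the cursor avoids double-counting overlapping transfrags.
        lo, hi = max(t_start, cursor), min(t_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered


def exon_coverage(exon, transfrags, interrogated):
    """Fraction of the exon's interrogated bp covered by transfrags.

    Returns None when no part of the exon is interrogated by the array.
    """
    pieces = [p for p in (overlap(exon, r) for r in interrogated) if p]
    total = sum(hi - lo for lo, hi in pieces)
    if total == 0:
        return None
    return sum(covered_bp(p, transfrags) for p in pieces) / total
```

Here an exon that is only partly interrogated is scored against its interrogated bases only, so a repeat-masked stretch inside an exon does not count against its coverage.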

  11. Figure labels: Probes; Genomic sequence; Positive probes; X; Transfrags after minrun/maxgap parameters; Annotation; Exon 2; Predicted exons
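The step from positive probes to transfrags can be sketched as below. The slides do not define maxgap and minrun precisely, so this sketch assumes one plausible reading: two consecutive positive probes are joined when their start positions are at most `maxgap` bp apart, and a joined run is kept only if it spans at least `minrun` bp. The function name and the `probe_len` default (25, matching the 25-mer probes) are illustrative.

```python
def call_transfrags(positions, positive, maxgap, minrun, probe_len=25):
    """Join positive probes into transfrags under assumed maxgap/minrun rules.

    positions: sorted probe start coordinates.
    positive:  parallel booleans, True where the probe passed the threshold.
    Returns a list of half-open (start, end) intervals.
    """
    transfrags = []
    run_start = run_end = None

    def close_run():
        # Keep the current run only if it is long enough.
        if run_start is not None and run_end + probe_len - run_start >= minrun:
            transfrags.append((run_start, run_end + probe_len))

    for pos, is_positive in zip(positions, positive):
        if not is_positive:
            continue
        if run_start is not None and pos - run_end <= maxgap:
            run_end = pos  # extend the current run across the gap
        else:
            close_run()
            run_start = run_end = pos  # start a new run
    close_run()
    return transfrags
```

With these rules an isolated positive probe is discarded, because a single 25 bp probe is shorter than any `minrun` above 25 bp.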

  12. Coverage of Annotation by array detected transfrags from HL60 cell line in 13 ENCODE regions

  13. Analysis results of 12/13 ENCODE Regions

  14. Mode size of annotated exons is ~120bp
      • Detection of exons is not dependent upon size (bp) of the exon (i.e. small exons are not biased against).
      • If an exon is detected by transfrag, 65% of these are covered at >75%.

  15. Mode size of predicted exons is ~120bp
      • Approximately 30.5% of predicted exons are covered (i.e. at least 1bp coverage) by transfrags.
      • If an exon is detected by transfrag, 48.6% of these are covered at >75%.
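The two summary figures reported on these slides (percentage of exons detected, and percentage of detected exons covered at >75%) can be derived from per-exon coverage fractions as in the sketch below. The function name and the input values in the example are hypothetical and are not data from the presentation.

```python
def summarize_detection(coverages, threshold=0.75):
    """Summarize per-exon coverage fractions (each between 0 and 1).

    An exon counts as detected when at least 1 bp is covered, i.e. its
    coverage fraction is greater than 0.
    Returns (percent detected, percent of detected exons above threshold).
    """
    detected = [c for c in coverages if c > 0]
    pct_detected = 100 * len(detected) / len(coverages)
    well_covered = [c for c in detected if c > threshold]
    pct_well_covered = 100 * len(well_covered) / len(detected) if detected else 0.0
    return pct_detected, pct_well_covered


# Made-up coverage fractions for ten exons, for illustration only.
example = [0, 0, 0.1, 0.8, 1.0, 0.5, 0, 0.9, 0, 0]
print(summarize_detection(example))
```

Note that the second percentage is conditional on detection, so it can be high even when few exons are detected overall.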

  16. Coverage of Annotation by array detected transfrags from HL60 cell line in all 44 ENCODE regions

  17. Analysis results of 44 ENCODE regions

  18. Mode size of annotated exons is ~120bp
      • Detection of exons is not dependent upon size (bp) of the exon (i.e. small exons are not biased against).
      • If an exon is detected by transfrag, 61.4% of these are covered at >75%.

  19. Mode size of predicted exons is ~80bp
      • Approximately 18.2% of predicted exons are detected by transfrags (i.e. by at least 1 bp).
      • If an exon is detected by transfrag, 44.6% of these are covered at >75%.

  20. Important Caveats To Recall In Pondering the Prediction vs Array Results
      • Only one cell line was used in this evaluation.
      • We have set very conservative thresholds for transfrag prediction. Other thresholds can be used.
      • Strand information is not deducible from the transfrag map. TUFs (transcripts of unknown function) are collections of transfrags shown to be on the same molecule by RACE-RT/PCR-cloning/sequencing.
      • Array interrogation resolution is 20bp on average for the non-repeat portion of the genome, and probes are 25mers. Thus, the boundaries of transfrags are not as precise as with arrays at 5bp interrogation resolution, and some small exons will not be interrogated or detected.
      • Other functional features (e.g. TF binding), which would provide additional confidence in the transfrag data, have not been included. These will be added under the ENCODE project.

  21. Conclusions
      • The array based method detects ~53.9% of known/validated exons.
      • Similarly, the array based method provides evidence for ~18.2% of predicted exons. These detected exons should be analyzed further to improve the annotation.
      • A combination of array based RNA map generation followed by RACE experiments can significantly improve the rate of validation of gene predictions.
      • Transfrags that map outside validated and predicted exons can be used to improve gene prediction programs and can form the basis for further experiments.
