
Overrepresented Segment Strings (Aug/8/2011)

Bob Harris, Penn State Center for Comparative Genomics and Bioinformatics, rsharris@bx.psu.edu. Overview: analysis of segmentation sequences, incorporating longer local context; update of previous enrichment/depletion plots.


  1. Overrepresented Segment Strings (Aug/8/2011). Bob Harris, Penn State Center for Comparative Genomics and Bioinformatics, rsharris@bx.psu.edu

  2. Overview
     • Analysis of segmentation sequences, incorporating longer local context
     • Update of previous enrichment/depletion plots
     • For the round 8 segmentations

  3. Motivation
     Quick eyeball test using one-character class-encoding: A=class 0, B=class 1, …
     (so classes 2, 13, 24 are C, N, Y)

     > segway.k562.coordinated chr10:812820-872329
     AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNX
     CNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWI
     CYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXA
     GAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVL
     QVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL
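The one-character encoding on this slide can be sketched as follows. This is a minimal illustration, assuming class labels are the integers 0 to 25 mapped onto A to Z; the function names `encode_classes` and `decode_classes` are hypothetical and not taken from the talk.

```python
import string

LETTERS = string.ascii_uppercase  # 'A' encodes class 0, 'B' class 1, ...


def encode_classes(class_ids):
    """Turn a list of integer class labels into a one-character-per-segment string."""
    for k in class_ids:
        if not 0 <= k < len(LETTERS):
            raise ValueError(f"class label {k} has no single-letter code")
    return "".join(LETTERS[k] for k in class_ids)


def decode_classes(encoded):
    """Inverse mapping: one letter per segment back to integer class labels."""
    return [ord(ch) - ord("A") for ch in encoded.upper()]


print(encode_classes([2, 13, 24]))  # CNY
print(decode_classes("CNY"))        # [2, 13, 24]
```

Encoding each segment as one letter ignores segment length, which is what makes runs such as CYCYCY easy to spot by eye.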

  4. Redundancy Apparent, but…
     • How surprising are the C,N,Y (2,13,24) groups?
     • Together these classes have only average probability
     • But 1st and 2nd order probabilities favor continuing in this group

     > segway.k562.coordinated chr10:812820-872329
     AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNX
     CNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWI
     CYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXA
     GAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVL
     QVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL
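To make the contrast between overall frequency and first-order behavior concrete, here is a small sketch. The talk does not say how its probabilities were estimated, so the simple counting used here, and the names `group_frequency` and `stay_probability`, are assumptions for illustration only.

```python
from collections import Counter


def group_frequency(seq, group):
    """Fraction of segments in seq whose class letter belongs to group."""
    group = set(group)
    return sum(1 for ch in seq if ch in group) / len(seq)


def stay_probability(seq, group):
    """First-order estimate: given the current segment is in the group,
    how often is the next segment also in the group?"""
    group = set(group)
    pairs = Counter(zip(seq, seq[1:]))  # counts of adjacent letter pairs
    from_group = sum(n for (a, _), n in pairs.items() if a in group)
    within = sum(n for (a, b), n in pairs.items() if a in group and b in group)
    return within / from_group if from_group else float("nan")


# First line of the chr10 excerpt shown on the slide.
excerpt = "AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNX"
print(group_frequency(excerpt, "CNY"))
print(stay_probability(excerpt, "CNY"))
```

A stay probability well above the group frequency would mean that, once inside the C/N/Y group, the segmentation tends to remain there. A 50-segment excerpt is far too short for a reliable estimate; it only shows the mechanics.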

  5. Overrepresented Strings
     • String of 2N segments
     • Estimate expected probability with Nth order model
       • e.g. pr(ABCD) = pr(AB) pr(C|AB) pr(D|BC)
     • “Evaluate” strings with high observed:expected ratio
       • Comparison to “features”; in this case, RNAseq contigs
     • Caveat(?): length of segments ignored
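The expected-probability calculation on this slide can be sketched as below. This is a minimal illustration, assuming a single contiguous class sequence and plain maximum-likelihood counts with no smoothing; the helper names are hypothetical, and a real run would also need to handle chromosome boundaries.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count every window of n consecutive segments."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def expected_probability(string, seq, order):
    """pr(string) under an order-`order` model fitted to seq,
    e.g. pr(ABCD) = pr(AB) pr(C|AB) pr(D|BC) for order 2."""
    string = tuple(string)
    contexts = ngram_counts(seq, order)
    extended = ngram_counts(seq, order + 1)
    # How often each context is followed by any symbol at all.
    followed = Counter()
    for gram, n in extended.items():
        followed[gram[:-1]] += n
    total = sum(contexts.values())
    if total == 0:
        return 0.0
    prob = contexts[string[:order]] / total
    for i in range(order, len(string)):
        context = string[i - order:i]
        if followed[context] == 0:
            return 0.0
        prob *= extended[string[i - order:i + 1]] / followed[context]
    return prob


def obs_exp(string, seq, order):
    """Return (observed count, expected count, observed/expected)."""
    windows = len(seq) - len(string) + 1
    observed = ngram_counts(seq, len(string))[tuple(string)]
    expected = expected_probability(string, seq, order) * windows
    ratio = observed / expected if expected else float("nan")
    return observed, expected, ratio
```

For a length-4 string the call would be `obs_exp(s, seq, 2)`, matching the 2N-segments, Nth-order pairing on the slide. Both strings of letters and lists of integer class labels work as input.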

  6. Overrepresented Strings, Example
     • Length-4 strings in segway.k562.coordinated
     • Highest obs/exp ratio, after eliminating rare observations

     string        #obs’d    #exp’d    obs/exp
     21-10-0-21      3761    970.80   3.874112
     21-0-10-21      3561    966.65   3.683865
     13-23-20-13     5227   2386.44   2.190296
     13-20-23-13     5177   2371.56   2.182953
     13-23-17-13     3205   1530.04   2.094711
     13-17-23-13     3156   1535.76   2.055004
     16-21-11-16     4833   2466.86   1.959174
     14-23-17-14     3263   1711.13   1.906928
     16-11-21-16     4629   2443.15   1.894687
     10-6-0-10       6980   3686.84   1.893222
     14-17-23-14     3180   1686.41   1.885658
     10-0-6-10       6846   3632.72   1.884536
     23-0-6-23       3265   1748.77   1.867023
     23-6-0-23       3254   1749.80   1.859644
     23-6-14-23      8780   4821.21   1.821121
     23-14-6-23      8933   4927.23   1.812985
     24-13-3-24      5419   3007.67   1.801727
     23-0-14-23      7142   4023.34   1.775141
     24-3-13-24      5270   2987.69   1.763906
     23-6-10-3       3045   1734.93   1.755115
     24-3-10-3       3192   1832.07   1.742287
     3-10-6-23       3046   1751.86   1.738724
     23-14-0-23      7000   4028.87   1.737461
     3-10-3-24       3126   1809.36   1.727681
     …

  7. CSHL RNAseq contigs
     • ftp://genome.crg.es/pub/Encode/data_analysis/ForDeadZones/Contigs_IDR0.1_CSHL.tar.gz
     • Differentiated by cell line (14), compartment (6), RNA fraction (4)
     • and attributed to 11 biotypes (gencode v7 exons): non coding, protein coding, etc.
     • and a 12th type: empty, or “no exon”
     • From Sarah Djebali, Felix Schlesinger, Wei Lin

  8. Measuring Enrichment
     • V_{f,s} = enrichment of string s for feature f
     • {s} = set of bases covered by string s (in either direction)
     • {f} = set of bases covered by feature f
     • {fs} = intersection of {f} and {s}
     • {F} = union of {f’} for all features f’
     • # = size of set
     • I plot log2(V_{f,s}), the fold enrichment
     • Or, if negative, the fold depletion
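The slide defines the sets, but the formula for V_{f,s} itself is not present in this transcript. The sketch below therefore uses an assumed ratio built only from the four defined sizes, #{fs}·#{F} / (#{s}·#{f}); it may differ from the formula actually used in the talk. Representing base sets as Python sets of integer positions is also an illustrative choice that would not scale to a whole genome.

```python
import math


def enrichment(s_bases, f_bases, all_feature_bases):
    """ASSUMED form of V_{f,s}: (#{fs} / #{s}) / (#{f} / #{F}).

    s_bases           -- {s}, bases covered by string s
    f_bases           -- {f}, bases covered by feature f
    all_feature_bases -- {F}, union of bases over all features
    Returns None when the ratio is undefined.
    """
    if not s_bases or not f_bases or not all_feature_bases:
        return None
    fs = s_bases & f_bases  # {fs}, the intersection
    return (len(fs) / len(s_bases)) / (len(f_bases) / len(all_feature_bases))


def log2_enrichment(s_bases, f_bases, all_feature_bases):
    """log2 of the ratio: positive = fold enrichment, negative = fold depletion."""
    v = enrichment(s_bases, f_bases, all_feature_bases)
    if v is None or v == 0:
        return None  # no overlap at all, so the log is undefined
    return math.log2(v)


# Toy positions: half of {s} lies in {f}, while {f} is a quarter of {F}.
s = set(range(0, 10))
f = set(range(5, 15))
F = set(range(0, 40))
print(log2_enrichment(s, f, F))  # 1.0
```

With this form, a value of 1.0 on the log2 scale means the string overlaps the feature twice as often as the background rate would suggest.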

  9. Single-segment Enrichment. [Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  10. Length-4 Strings Enrichment. [Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs, highest observed/expected strings; white = no occurrences]

  11. Length-4 Strings Enrichment. [Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs, highest observed/expected strings]

  12. To Do
     • Incorporate single-segment enrichment into evaluation of multi-segment strings
     • Longer strings
     • Run on all 14 round 8 segmentations
     • And the bake-off composites

  13. Aligning Class Sequences
     • Work in progress, with these questions…
     • Do longer, highly similar sequences indicate similar function?
       • segway.k562.coordinated chr10:88422790-88427017
         CYCNCYNCNYNCNCNCNCN
       • segway.k562.coordinated chr13:113696011-113701344
         CYCNCYNCNYNCNCNCNCN
     • Or do small changes indicate functional differences?
       • segway.k562.coordinated chr10:133868081-133875219
         NCNXnXNXNXNCYNCNCNCNXNCN
       • segway.k562.coordinated chr13:113638232-113645027-
         NCNXoXNXNXNCYNCNCNCNXNCN

  14. Aligning Class Sequences
     • Do longer, highly similar sequences indicate similar function?

  15. Aligning Class Sequences
     • Or do small changes indicate functional differences?

  16. Alignments
     • Confounded by presence of 2- and 3-segment cycles
     • Implement separate search for short repeated cycles
     • Then align with those masked
     • Should incorporate segment lengths
     • May be better to align in peak space
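The proposed search for short repeated cycles, followed by masking, could look something like the sketch below. The slide lists this as future work and gives no algorithm, so the regex approach, the three-repeat threshold, and the `-` mask character are all assumptions made for illustration.

```python
import re


def find_cycles(seq, min_repeats=3):
    """Find runs where a 2- or 3-segment unit repeats back to back.

    Returns (start, end, unit) tuples, end exclusive. A trailing
    partial copy of the unit is not included in the run.
    """
    # (.{2,3}?) captures the shortest candidate unit; \1{k,} requires
    # at least k further copies immediately after it.
    pattern = re.compile(r"(.{2,3}?)\1{%d,}" % (min_repeats - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]


def mask_cycles(seq, min_repeats=3, mask_char="-"):
    """Replace every detected cycle run with mask_char, keeping the length."""
    out = list(seq)
    for start, end, _ in find_cycles(seq, min_repeats):
        out[start:end] = mask_char * (end - start)
    return "".join(out)


seq = "CYCNCYNCNYNCNCNCNCN"  # chr10:88422790-88427017 from slide 13
print(find_cycles(seq))  # [(10, 18, 'NC')]
print(mask_cycles(seq))  # CYCNCYNCNY--------N
```

Keeping the masked string the same length as the input preserves segment coordinates, so an aligner can skip masked positions without re-indexing.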

  17. Appendix
     • The following slides show single-segment enrichment heatmaps for all 14 round 8 segmentations

  18. Single-segment Enrichment. [Heatmap: segway.gm12878.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  19. Single-segment Enrichment. [Heatmap: segway.h1hesc.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  20. Single-segment Enrichment. [Heatmap: segway.helas3.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  21. Single-segment Enrichment. [Heatmap: segway.hepg2.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  22. Single-segment Enrichment. [Heatmap: segway.huvec.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  23. Single-segment Enrichment. [Heatmap: segway.k562.all vs CSHL RNAseq contigs; white = no occurrences]

  24. Single-segment Enrichment. [Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  25. Single-segment Enrichment. [Heatmap: segway.tier1-2.coordinated vs CSHL RNAseq contigs; white = no occurrences]

  26. Single-segment Enrichment. [Heatmap: chromhmm.GM12878_concatenate_25 vs CSHL RNAseq contigs; white = no occurrences]

  27. Single-segment Enrichment. [Heatmap: chromhmm.H1_concatenate_25 vs CSHL RNAseq contigs; white = no occurrences]

  28. Single-segment Enrichment. [Heatmap: chromhmm.HELA_concatenate_25 vs CSHL RNAseq contigs; white = no occurrences]

  29. Single-segment Enrichment. [Heatmap: chromhmm.HEPG2_concatenate_25 vs CSHL RNAseq contigs; white = no occurrences]

  30. Single-segment Enrichment. [Heatmap: chromhmm.HUVEC_concatenate_25 vs CSHL RNAseq contigs; white = no occurrences]

  31. Single-segment Enrichment. [Heatmap: chromhmm.K562_concatenate_25 vs CSHL RNAseq contigs; white = no occurrences]
