Overrepresented Segment Strings (Aug/8/2011)
Bob Harris
Penn State Center for Comparative Genomics and Bioinformatics
rsharris@bx.psu.edu

Overview
• Analysis of segmentation sequences, incorporating longer local context
• Update of previous enrichment/depletion plots
• For the round 8 segmentations

Motivation
Quick eyeball test using a one-character class encoding: A = class 0, B = class 1, … so classes 2, 13, 24 are C, N, Y.
> segway.k562.coordinated chr10:812820-872329
AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNX
CNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWI
CYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXA
GAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVL
QVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL

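The one-character encoding can be sketched in a few lines of Python. This is only an illustration of the A = class 0 convention stated on the slide; the function names are hypothetical, and it assumes at most 26 classes.

```python
import string

def encode_classes(classes):
    """Map class numbers to letters: 0 -> A, 1 -> B, ... (assumes < 26 classes)."""
    return "".join(string.ascii_uppercase[c] for c in classes)

def decode_classes(text):
    """Map letters back to class numbers; case is ignored."""
    return [ord(ch) - ord("A") for ch in text.upper()]

print(encode_classes([2, 13, 24]))  # CNY
print(decode_classes("CNY"))        # [2, 13, 24]
```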
Redundancy Apparent, but…
• How surprising are the C,N,Y (2,13,24) groups?
• Together these classes have only average probability
• But 1st- and 2nd-order probabilities favor continuing in this group
(Excerpt shown: segway.k562.coordinated chr10:812820-872329, the same sequence as on the Motivation slide.)

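One rough way to put a number on "favor continuing in this group" is an empirical first-order estimate: of all transitions that start on a group member, what fraction land on another group member? The sketch below is an assumption about how such a check could be done, not the computation behind the slide; the function name is hypothetical.

```python
def stay_in_group_probability(seq, group):
    """Fraction of transitions leaving a group member that land on a group member."""
    from_group = 0
    stay = 0
    for a, b in zip(seq, seq[1:]):
        if a in group:
            from_group += 1
            if b in group:
                stay += 1
    # No transitions out of the group: the estimate is undefined.
    return stay / from_group if from_group else float("nan")

# Toy example: C->N, N->C, C->Y stay in the group; Y->A leaves it.
print(stay_in_group_probability("CNCYA", set("CNY")))  # 0.75
```

Applied to a whole encoded segmentation, a value well above the combined frequency of C, N and Y would match the slide's observation.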
Overrepresented Strings
• String of 2N segments
• Estimate expected probability with an Nth-order model
• e.g. pr(ABCD) = pr(AB) pr(C|AB) pr(D|BC)
• “Evaluate” strings with high observed:expected ratio
• Comparison to “features”; in this case, RNAseq contigs
• Caveat(?): length of segments ignored

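The expected-probability bullet can be sketched as follows. This is a minimal illustration that assumes probabilities are estimated from simple n-gram counts over a single class sequence; the slides do not say how the estimates were actually obtained, and the function names are hypothetical. For order N = 2 it reproduces pr(ABCD) = pr(AB) pr(C|AB) pr(D|BC).

```python
from collections import Counter

def ngram_counts(seq, n):
    """Count every window of n consecutive segments."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def obs_exp(seq, order):
    """Return {string: (observed, expected, ratio)} for strings of 2*order segments."""
    length = 2 * order
    ctx = ngram_counts(seq, order)        # N-grams, for pr(first N segments)
    ext = ngram_counts(seq, order + 1)    # (N+1)-grams, for conditionals
    # How often each N-gram is followed by another segment.
    ctx_followed = Counter()
    for gram, k in ext.items():
        ctx_followed[gram[:order]] += k
    observed = ngram_counts(seq, length)
    total_ctx = sum(ctx.values())
    windows = sum(observed.values())
    results = {}
    for s, obs in observed.items():
        p = ctx[s[:order]] / total_ctx
        for i in range(order, length):
            # pr(s[i] | previous N segments)
            p *= ext[s[i - order:i + 1]] / ctx_followed[s[i - order:i]]
        expected = p * windows
        results[s] = (obs, expected, obs / expected)
    return results

res = obs_exp("ABCDABCD", order=2)
print(res[("A", "B", "C", "D")])
```

Ranking by the third element, after dropping strings with few observations, would give a table of high observed:expected strings; the cutoff for "rare" is not stated in the slides.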
Overrepresented Strings, Example
• Length-4 strings in segway.k562.coordinated
• Highest obs/exp ratio, after eliminating rare observations

string         #obs’d    #exp’d    obs/exp
21-10-0-21       3761    970.80   3.874112
21-0-10-21       3561    966.65   3.683865
13-23-20-13      5227   2386.44   2.190296
13-20-23-13      5177   2371.56   2.182953
13-23-17-13      3205   1530.04   2.094711
13-17-23-13      3156   1535.76   2.055004
16-21-11-16      4833   2466.86   1.959174
14-23-17-14      3263   1711.13   1.906928
16-11-21-16      4629   2443.15   1.894687
10-6-0-10        6980   3686.84   1.893222
14-17-23-14      3180   1686.41   1.885658
10-0-6-10        6846   3632.72   1.884536
23-0-6-23        3265   1748.77   1.867023
23-6-0-23        3254   1749.80   1.859644
23-6-14-23       8780   4821.21   1.821121
23-14-6-23       8933   4927.23   1.812985
24-13-3-24       5419   3007.67   1.801727
23-0-14-23       7142   4023.34   1.775141
24-3-13-24       5270   2987.69   1.763906
23-6-10-3        3045   1734.93   1.755115
24-3-10-3        3192   1832.07   1.742287
3-10-6-23        3046   1751.86   1.738724
23-14-0-23       7000   4028.87   1.737461
3-10-3-24        3126   1809.36   1.727681
…

CSHL RNAseq contigs
• ftp://genome.crg.es/pub/Encode/data_analysis/ForDeadZones/Contigs_IDR0.1_CSHL.tar.gz
• Differentiated by cell line (14), compartment (6), RNA fraction (4)
• and attributed to 11 biotypes (gencode v7 exons): non coding, protein coding, etc.
• and a 12th type: empty, or “no exon”
• From Sarah Djebali, Felix Schlesinger, Wei Lin

Measuring Enrichment
• V_{f,s} = enrichment of string s for feature f
• {s} = set of bases covered by string s (in either direction)
• {f} = set of bases covered by the feature
• {fs} = intersection of {f} and {s}
• {F} = union of {f’} for all features f’
• # = size of set
• I plot log2(V_{f,s}), fold enrichment
• Or, if negative, fold depletion

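The formula for V_{f,s} itself is not reproduced in this transcript. The sketch below therefore ASSUMES one plausible form built from the sets defined above, V = (#{fs} / #{s}) / (#{f} / #{F}); that is, the share of the string's bases that fall in the feature, relative to the feature's share of all feature bases. This is an assumption for illustration, not the author's confirmed definition. Only the log2 step and the "no occurrences" case come from the slides.

```python
import math

def log2_enrichment(f_bases, s_bases, all_feature_bases):
    """log2 of an ASSUMED enrichment ratio (see lead-in); None if no overlap.

    Positive values read as fold enrichment, negative values as fold depletion.
    """
    fs = f_bases & s_bases
    if not fs:
        return None  # no occurrences (drawn white on the heatmaps)
    frac_in_string = len(fs) / len(s_bases)                   # #{fs} / #{s}
    frac_background = len(f_bases) / len(all_feature_bases)   # #{f} / #{F}
    return math.log2(frac_in_string / frac_background)

# Toy example with base positions as integers.
f = set(range(0, 10))
s = set(range(5, 15))
F = set(range(0, 40))
print(log2_enrichment(f, s, F))  # 1.0
```

Python sets of integer positions keep the example short; genome-scale data would more likely use interval arithmetic.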
Single-segment Enrichment
[Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs; white = no occurrences]

Length-4 Strings Enrichment
[Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs, highest observed/expected strings; white = no occurrences]

Length-4 Strings Enrichment
[Heatmap: segway.k562.coordinated vs CSHL RNAseq contigs, highest observed/expected strings]

To Do
• Incorporate single-segment enrichment into evaluation of multi-segment strings
• Longer strings
• Run on all 14 round 8 segmentations
• And the bake-off composites

Aligning Class Sequences
• Work in progress, with these questions…
• Do longer, highly similar sequences indicate similar function?
  segway.k562.coordinated chr10:88422790-88427017
  CYCNCYNCNYNCNCNCNCN
  segway.k562.coordinated chr13:113696011-113701344
  CYCNCYNCNYNCNCNCNCN
• Or do small changes indicate functional differences?
  segway.k562.coordinated chr10:133868081-133875219
  NCNXnXNXNXNCYNCNCNCNXNCN
  segway.k562.coordinated chr13:113638232-113645027-
  NCNXoXNXNXNCYNCNCNCNXNCN

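For pairs like the two above, which already have equal length, the "small changes" can be listed with a plain position-by-position comparison. This is not an alignment (no gaps are handled) and is not the author's tool; it is a hypothetical helper that only shows what the lowercase letters appear to mark.

```python
def class_differences(a, b):
    """Positions where two equal-length class strings differ (case is ignored)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [(i, x, y)
            for i, (x, y) in enumerate(zip(a.upper(), b.upper()))
            if x != y]

chr10 = "NCNXnXNXNXNCYNCNCNCNXNCN"
chr13 = "NCNXoXNXNXNCYNCNCNCNXNCN"
print(class_differences(chr10, chr13))  # [(4, 'N', 'O')]
```

Under the A = class 0 encoding, the single difference is class 13 versus class 14.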
Alignments
• Confounded by presence of 2- and 3-segment cycles
• Implement separate search for short repeated cycles
• Then align with those masked
• Should incorporate segment lengths
• May be better to align in peak space

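The "search for short repeated cycles, then mask" step could look roughly like the sketch below. It is one possible approach under stated assumptions, not the planned implementation: a cycle is taken to be a run that repeats with period 2 or 3, and the `min_repeats` threshold and `mask_char` are hypothetical parameters.

```python
def mask_cycles(seq, periods=(2, 3), min_repeats=3, mask_char="-"):
    """Replace runs that repeat with a short period by mask_char."""
    masked = list(seq)
    for p in periods:
        i = 0
        while i + p < len(seq):
            # Extend j while the sequence keeps repeating with period p.
            j = i
            while j + p < len(seq) and seq[j] == seq[j + p]:
                j += 1
            run_len = j - i + p  # length of the periodic run starting at i
            if j > i and run_len >= p * min_repeats:
                for k in range(i, i + run_len):
                    masked[k] = mask_char
                i += run_len
            else:
                i += 1
    return "".join(masked)

print(mask_cycles("ALQLQLQLQB"))   # A--------B
print(mask_cycles("XCNYCNYCNYZ"))  # X---------Z
```

The masked string keeps its original length, so coordinates still line up when the unmasked portions are aligned afterwards.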
Appendix
• The following slides show single-segment enrichment heatmaps for all 14 round 8 segmentations

Single-segment Enrichment
[Heatmaps, one per segmentation, each vs CSHL RNAseq contigs; white = no occurrences]
• segway.gm12878.coordinated
• segway.h1hesc.coordinated
• segway.helas3.coordinated
• segway.hepg2.coordinated
• segway.huvec.coordinated
• segway.k562.all
• segway.k562.coordinated
• segway.tier1-2.coordinated
• chromhmm.GM12878_concatenate_25
• chromhmm.H1_concatenate_25
• chromhmm.HELA_concatenate_25
• chromhmm.HEPG2_concatenate_25
• chromhmm.HUVEC_concatenate_25
• chromhmm.K562_concatenate_25