1 / 84

Pinball: comparative genomics on regulatory regions

Albert Vilella Vertebrate Genomics EMBL-EBI Cambridge, UK. Pinball: comparative genomics on regulatory regions. EBI is an Outstation of the European Molecular Biology Laboratory. Assembling Chip-seq data. Aims: reconstruct chip-seq regions for species with no reference genome

eudora
Download Presentation

Pinball: comparative genomics on regulatory regions

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Albert Vilella Vertebrate Genomics EMBL-EBI Cambridge, UK Pinball: comparative genomics on regulatory regions EBI is an Outstation of the European Molecular Biology Laboratory.

  2. Assembling Chip-seq data • Aims: • reconstruct chip-seq regions for species with no reference genome • mapping regions to closest genomes

  3. Achieved so far

  4. Read clustering • Idea: divide reads into overlapping clusters • Groups of highly-overlapping reads at summit • SGA read-read overlap • finds clusters of reads: • Even clean TF chip-seq, only ~20% of sequenced reads are in peaks (CEBPA, CTCF, etc) • Reconstruct (1) many (2) small + isolated regions

  5. SGA vs kmer-based • SGA can handle error-rates in overlaps: • Uses BWT + FM indexing of the reads • Handles sequence errors (miscalls) • Handles SNPs and Fixed SNVs: • SNPs/fSNVs quickly break your kmers in a sliding window of size K: • Keeping kmers is important for genome assembly • SGA does not handle indels: • or PacBio/454/Ion Torrent indel miscalls

  6. SGA index, SGA filter, SGA merge • Don't bother keeping track of error-rate • Index files, use them for different purposes • SGA merge chip/sono/faire/RNA-seq • Pipeline: • sga preprocess • sga index • sga filter --no-kmer-check • sga cluster • sga merge chip1 sono1 chip2 sono2 faire1 rna1 > combined • sga filter combined • sga cluster –db=combined –extend=subset.cluster(s)

  7. Some results

  8. Min-overlap parameter • Error-rate = 0 • “I want read-read overlaps of 30bp or more” • Error-rate > 0 • “I want read-read overlaps of 30bp or more, allowing for N mismatches every X matches” • Min-overlap assessment: • Ran sga cluster with a range of overlaps • Compared results read-by-read to Swembl called peaks

  9. Some real data

  10. Axolotl H3K4me3 sga preprocess --min=34 --dust=1 ame.base.fq 1> ame.fq sga index -t 8 --disk=1000000 ame.fq ame.fq ame.bwt ame.rbwt ame.sai ame.rsai sga filter --no-kmer-check -t 8 ame.fq sga cluster -m 25 -c 10 -e 0.05 -t 8 ame.filter.pass.fa -o ame.clusters 3.0G Jul 3 16:06 ame.h3k4me3.base.fq 2.9G Jul 3 16:04 ame.h3k4me3.fq 442M Jul 3 17:03 ame.h3k4me3.bwt 264M Jul 3 17:03 ame.h3k4me3.sai 442M Jul 3 18:03 ame.h3k4me3.rbwt 264M Jul 3 18:03 ame.h3k4me3.rsai

  11. Axolotl H3K4me3 sga cluster -m 25 -c 10 -e 0.05 -t 8 ame.filter.pass.fa -o ame.clusters 27,090,026 initial reads 26,128,313 kept reads (0.964499) 940M bp 2,140,780 unique reads in 64845 clusters size 5 or + 3.0G Jul 3 16:06 ame.h3k4me3.base.fq 2.9G Jul 3 16:04 ame.h3k4me3.fq 442M Jul 3 17:03 ame.h3k4me3.bwt 264M Jul 3 17:03 ame.h3k4me3.sai 442M Jul 3 18:03 ame.h3k4me3.rbwt 264M Jul 3 18:03 ame.h3k4me3.rsai

  12. Axolotl H3K4me3 sga preprocess --min=34 --dust=1 ame.base.fq 1> ame.fq Preprocess stats: Reads parsed: 27,090,026 Reads kept: 26,128,313 (0.964499) Reads failed primer screen: 289 (1.06681e-05) Bases parsed: 975,240,936 Bases kept: 940,619,268 (0.964499) Number of incorrectly paired reads that were discarded: 0 Number of reads failed dust filter: 502,008 [timer - sga preprocess] wall clock: 503.94s CPU: 502.68s

  13. Large inserts sonication

  14. Suncus murinus • Sister species to Sorex araneus (~ 30MYA) • H3K4me3 long insert 75bp

  15. H3K4me3 (li/75bp+si/36bp) - Axolotl (ame1) si/36bp: 27,090,026 initial reads 26,128,313 kept reads (0.964499) 940M bp 2,140,780 unique reads in 64845 clusters size 5 or + 80 clusters mapping to human genome - Axolotl (ame1+2) li/75bp + si/36bp: 26,128,313 (0.964499) si/36bp 20,943,097 (0.843625) li/75bp 3,972,081 unique reads in 74722 clusters size 5 or + 783 clusters mapping to human genome (most fall within first coding exon) - Suncus murinus (smu1) li/75bp: 13,123,042 (0.552039) (redo lower threshold?) 3,648,150 unique reads in 68417 clusters size 5 or + 12,914 clusters mapping to sorex araneus genome

  16. ame1

  17. ame1 ame1+2

  18. ame1 ame1+2 smu1

  19. SGA walk –exhaustive reads.fa –walks=file.walks –desc=file.wdesc • Exhaustively walk through all possible paths (.walk.fa file) and record read order for each cluster (.wdesc file) • TF data: seconds per cluster • Mostly a handful of walks, tail with 10^2s -10^3s • Histone modification data: • Up to ~100.000 walks per cluster sometimes (36bp dataset), still doable Pinball::Walk job only takes 3 hours.

  20. Problematic walks • Caroli chip-seq: • car.do1030.CEBPA • car.do1191.CEBPA • car.do1029.HNF4a • car.do1032.HNF4a • car.do1031.FoxA1 • car.do1034.FoxA1 • Separate runs 36bp / m=28bp / 5+ cluster size • 1,150,000 clusters – 15% problematic walks

  21. sga-cluster-extend • Mix & match different read sets: • Chip/sono/FAIRE/RNA/any-kind-of-seq • De novo cluster regions of interest • Then “prop-up” with other readsets • Sorex araneus 2x Sanger genome assembly • 22 Illumina lanes resequenced (Broad) • (TO DO) sga-cluster-extend smu1 → sorex reseq

  22. sga-cluster-extend • Axolotl 75bp PE liver RNA-seq • Assembled with Trinity (56000 contigs) • (1) sga-cluster-extend rnaseq → chip-seq • (2) map extended transcripts to human

  23. thanks

  24. Pinball pipeline

  25. Pinball pipeline Preprocess (min) → index (hours/days 64CPU 64GB RAM) → cluster (hours 64CPU 64GB RAM) → walk (sec/hours* ~1-100K jobs) → search (sec/hours* ~1-100K jobs) → extend (min/hours ~ 1000 jobs) → BAMreport (min 1 job ) Start ------------> end (overnight**) perl init_pinball.pl \ -sets file1.fq.gz:file2.fq.gz:file3.fq.gz \ -url mysql://avilella:xxxx@mysql-pinball:4307/avilella_axolotl perl beekeeper.pl \ -url mysql://avilella:xxxx@mysql-pinball:4307/avilella_axolotl \ -loop * farm with N cpus or workstation 24CPUs ** overnight (farm mysql) or a few days (workstation sqlite) -url sqlite:///disk/avilella_axolotl -loop -local_cpus 24

  26. Pinball pipeline preprocess ( 1) DONE job(0/1 1226034ms) [0/100] index ( 2) WORKING job(1/2 7562521ms) [1/100] index_h2 ( 3) READY job(0/0 0ms) [0/100] cluster ( 5) DONE job(0/1 4179654ms) [0/20] clusterh2 ( 6) READY job(0/0 0ms) [0/20] clusterh3 ( 7) READY job(0/0 0ms) [0/20] walk ( 8) WORKING job(2/64840 fail:731)[1/200] search ( 9) DONE job(0/64107 fail:0) [0/200] extend (10) BLOCKED job(1/1 0ms) [1/8] BAMreport (11) BLOCKED job(1/1 0ms) [1/8] ====== Live workers according to Queen: walk : 1 workers search : 4 workers 5 total workers =========================== current hive load = 0.0250 walk ( 8) WORKING index ( 2) WORKING (0.960 remaining-hive-load) use need a total of 2 workers (availLoad=0.96000) hive 99.996% complete (< 2.103 CPU_hrs) (128953 total)

  27. Datasets • Axolotl H3K4me3 (27M reads) • Caroli chip-seq + rnaPol2 • CTCF • CEBPa and new CEBPa • Histone modifications – vertebrates+TC1 (Mike) • Inputs • Public datasets: • Xenopus histmod, Nrsf, prdm9, FAIRE islets, … • Encode 75bp BCLF2, Encode long-motif ATF3, … • VitaminD receptor (un)stimulated/CEPH

  28. V-saver indel calling xxx xxxxx xxxxxxxxxx xxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxx ------------------------------- ← individual-1 genome assembly region x xx xx xxx xxxx xxxxxx xxxxxx xxxxxxxx xxxxxxxxx xxxxxxxxxxx xxxxxxxxxx xxxxxxxxxxxxxx ----------====================---------------- ← reference genome

  29. Examples

  30. De novo motif discovery cisfinder / cfa3 / 1 walk x cluster

  31. De novo motif discovery (cisfinder cfa3 1 walk per cluster)

  32. SNPs and Fixed variants

  33. Small indels

  34. Cfam4 vs dog reference vsaver

  35. CTCF mml against human

  36. Mouse fixed variants & small indel

More Related