840 likes | 1.06k Views
Albert Vilella Vertebrate Genomics EMBL-EBI Cambridge, UK. Pinball: comparative genomics on regulatory regions. EBI is an Outstation of the European Molecular Biology Laboratory. Assembling Chip-seq data. Aims: reconstruct chip-seq regions for species with no reference genome
E N D
Albert Vilella Vertebrate Genomics EMBL-EBI Cambridge, UK Pinball: comparative genomics on regulatory regions EBI is an Outstation of the European Molecular Biology Laboratory.
Assembling Chip-seq data • Aims: • reconstruct chip-seq regions for species with no reference genome • mapping regions to closest genomes
Read clustering • Idea: divide reads into overlapping clusters • Groups of highly-overlapping reads at summit • SGA read-read overlap • finds clusters of reads: • Even clean TF chip-seq, only ~20% of sequenced reads are in peaks (CEBPA, CTCF, etc) • Reconstruct (1) many (2) small + isolated regions
SGA vs kmer-based • SGA can handle error-rates in overlaps: • Uses BWT + FM indexing of the reads • Handles sequence errors (miscalls) • Handles SNPs and Fixed SNVs: • SNPs/fSNVs quickly break your kmers in a sliding window of size K: • Keeping kmers is important for genome assembly • SGA does not handle indels: • or PacBio/454/Ion Torrent indel miscalls
SGA index, SGA filter, SGA merge • Don't bother keeping track of error-rate • Index files, use them for different purposes • SGA merge chip/sono/faire/RNA-seq • Pipeline: • sga preprocess • sga index • sga filter --no-kmer-check • sga cluster • sga merge chip1 sono1 chip2 sono2 faire1 rna1 > combined • sga filter combined • sga cluster –db=combined –extend=subset.cluster(s)
Min-overlap parameter • Error-rate = 0 • “I want read-read overlaps of 30bp or more” • Error-rate > 0 • “I want read-read overlaps of 30bp or more, allowing for N mismatches every X matches” • Min-overlap assessment: • Ran sga cluster with a range of overlaps • Compared results read-by-read to Swembl called peaks
Axolotl H3K4me3 sga preprocess --min=34 --dust=1 ame.base.fq 1> ame.fq sga index -t 8 --disk=1000000 ame.fq ame.fq ame.bwt ame.rbwt ame.sai ame.rsai sga filter --no-kmer-check -t 8 ame.fq sga cluster -m 25 -c 10 -e 0.05 -t 8 ame.filter.pass.fa -o ame.clusters 3.0G Jul 3 16:06 ame.h3k4me3.base.fq 2.9G Jul 3 16:04 ame.h3k4me3.fq 442M Jul 3 17:03 ame.h3k4me3.bwt 264M Jul 3 17:03 ame.h3k4me3.sai 442M Jul 3 18:03 ame.h3k4me3.rbwt 264M Jul 3 18:03 ame.h3k4me3.rsai
Axolotl H3K4me3 sga cluster -m 25 -c 10 -e 0.05 -t 8 ame.filter.pass.fa -o ame.clusters 27,090,026 initial reads 26,128,313 kept reads (0.964499) 940M bp 2,140,780 unique reads in 64845 clusters size 5 or + 3.0G Jul 3 16:06 ame.h3k4me3.base.fq 2.9G Jul 3 16:04 ame.h3k4me3.fq 442M Jul 3 17:03 ame.h3k4me3.bwt 264M Jul 3 17:03 ame.h3k4me3.sai 442M Jul 3 18:03 ame.h3k4me3.rbwt 264M Jul 3 18:03 ame.h3k4me3.rsai
Axolotl H3K4me3 sga preprocess --min=34 --dust=1 ame.base.fq 1> ame.fq Preprocess stats: Reads parsed: 27,090,026 Reads kept: 26,128,313 (0.964499) Reads failed primer screen: 289 (1.06681e-05) Bases parsed: 975,240,936 Bases kept: 940,619,268 (0.964499) Number of incorrectly paired reads that were discarded: 0 Number of reads failed dust filter: 502,008 [timer - sga preprocess] wall clock: 503.94s CPU: 502.68s
Suncus murinus • Sister species to Sorex araneus (~ 30MYA) • H3K4me3 long insert 75bp
H3K4me3 (li/75bp+si/36bp) - Axolotl (ame1) si/36bp: 27,090,026 initial reads 26,128,313 kept reads (0.964499) 940M bp 2,140,780 unique reads in 64845 clusters size 5 or + 80 clusters mapping to human genome - Axolotl (ame1+2) li/75bp + si/36bp: 26,128,313 (0.964499) si/36bp 20,943,097 (0.843625) li/75bp 3,972,081 unique reads in 74722 clusters size 5 or + 783 clusters mapping to human genome (most fall within first coding exon) - Suncus murinus (smu1) li/75bp: 13,123,042 (0.552039) (redo lower threshold?) 3,648,150 unique reads in 68417 clusters size 5 or + 12,914 clusters mapping to sorex araneus genome
ame1 ame1+2
ame1 ame1+2 smu1
SGA walk –exhaustive reads.fa –walks=file.walks –desc=file.wdesc • Exhaustively walk through all possible paths (.walk.fa file) and record read order for each cluster (.wdesc file) • TF data: seconds per cluster • Mostly a handful of walks, tail with 10^2s -10^3s • Histone modification data: • Up to ~100.000 walks per cluster sometimes (36bp dataset), still doable Pinball::Walk job only takes 3 hours.
Problematic walks • Caroli chip-seq: • car.do1030.CEBPA • car.do1191.CEBPA • car.do1029.HNF4a • car.do1032.HNF4a • car.do1031.FoxA1 • car.do1034.FoxA1 • Separate runs 36bp / m=28bp / 5+ cluster size • 1,150,000 clusters – 15% problematic walks
sga-cluster-extend • Mix & match different read sets: • Chip/sono/FAIRE/RNA/any-kind-of-seq • De novo cluster regions of interest • Then “prop-up” with other readsets • Sorex araneus 2x Sanger genome assembly • 22 Illumina lanes resequenced (Broad) • (TO DO) sga-cluster-extend smu1 → sorex reseq
sga-cluster-extend • Axolotl 75bp PE liver RNA-seq • Assembled with Trinity (56000 contigs) • (1) sga-cluster-extend rnaseq → chip-seq • (2) map extended transcripts to human
Pinball pipeline Preprocess (min) → index (hours/days 64CPU 64GB RAM) → cluster (hours 64CPU 64GB RAM) → walk (sec/hours* ~1-100K jobs) → search (sec/hours* ~1-100K jobs) → extend (min/hours ~ 1000 jobs) → BAMreport (min 1 job ) Start ------------> end (overnight**) perl init_pinball.pl \ -sets file1.fq.gz:file2.fq.gz:file3.fq.gz \ -url mysql://avilella:xxxx@mysql-pinball:4307/avilella_axolotl perl beekeeper.pl \ -url mysql://avilella:xxxx@mysql-pinball:4307/avilella_axolotl \ -loop * farm with N cpus or workstation 24CPUs ** overnight (farm mysql) or a few days (workstation sqlite) -url sqlite:///disk/avilella_axolotl -loop -local_cpus 24
Pinball pipeline preprocess ( 1) DONE job(0/1 1226034ms) [0/100] index ( 2) WORKING job(1/2 7562521ms) [1/100] index_h2 ( 3) READY job(0/0 0ms) [0/100] cluster ( 5) DONE job(0/1 4179654ms) [0/20] clusterh2 ( 6) READY job(0/0 0ms) [0/20] clusterh3 ( 7) READY job(0/0 0ms) [0/20] walk ( 8) WORKING job(2/64840 fail:731)[1/200] search ( 9) DONE job(0/64107 fail:0) [0/200] extend (10) BLOCKED job(1/1 0ms) [1/8] BAMreport (11) BLOCKED job(1/1 0ms) [1/8] ====== Live workers according to Queen: walk : 1 workers search : 4 workers 5 total workers =========================== current hive load = 0.0250 walk ( 8) WORKING index ( 2) WORKING (0.960 remaining-hive-load) use need a total of 2 workers (availLoad=0.96000) hive 99.996% complete (< 2.103 CPU_hrs) (128953 total)
Datasets • Axolotl H3K4me3 (27M reads) • Caroli chip-seq + rnaPol2 • CTCF • CEBPa and new CEBPa • Histone modifications – vertebrates+TC1 (Mike) • Inputs • Public datasets: • Xenopus histmod, Nrsf, prdm9, FAIRE islets, … • Encode 75bp BCLF2, Encode long-motif ATF3, … • VitaminD receptor (un)stimulated/CEPH
V-saver indel calling xxx xxxxx xxxxxxxxxx xxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxx ------------------------------- ← individual-1 genome assembly region x xx xx xxx xxxx xxxxxx xxxxxx xxxxxxxx xxxxxxxxx xxxxxxxxxxx xxxxxxxxxx xxxxxxxxxxxxxx ----------====================---------------- ← reference genome