EDACC Primary Analysis Pipelines

Cristian Coarfa, Bioinformatics Research Laboratory, Molecular and Human Genetics

Data Levels

Data Types Submitted To EDACC
- ChIP-Seq
- Shotgun Bisulfite Sequencing (Methyl-C)
- Reduced Representation Bisulfite Sequencing (RRBS)
- MRE-Seq
- MeDIP-Seq
- Chromatin Accessibility
- small RNA-Seq
- mRNA-Seq

Read Mapping
- Common processing step to all pipelines
- High throughput
  - Sequence space: Illumina
  - Color space: SOLiD
- Quick and accurate anchoring
  - Read size varies, 36-76 bp
- Short read aligners
  - 1st generation: Maq, SOAP
    - Ungapped alignment
  - 2nd generation: Bowtie, BWA, SOAP2
    - Trade-off between speed and sensitivity, good enough for many applications
- Mapping tools
  - Robust to indels
  - Sensitive to variable number of mismatches

Pash 3.0
- Positional Hashing
- Regular read mapping
- Bisulfite sequencing mapping
  - Integrate basepair variation with epigenetic variation
- SAM output, easy integration with other analysis tools
- Accuracy without sacrificing efficiency

Bisulfite Sequencing
- Current tools: BSMAP, RMAP-BS, mrsFAST, ZOOM
- Pash 3.0
  - Integrate mutation discovery with basepair-level methylation discovery
  - Speedup
- General approach
  - Convert C's to T's in reads and/or reference
  - Use mappings, reads and reference to determine methylated sites
- Pash 3
  - Generate and hash all possible k-mers for reads
    - CTT: CCC, CCT, CTC, CTT
  - Map against forward and reverse complement chromosome strands
  - Superior sensitivity to other tools, without loss of efficiency
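The k-mer expansion on this slide (CTT: CCC, CCT, CTC, CTT) can be sketched in a few lines of Python. This only illustrates the idea and is not Pash's implementation; the function name `expand_bisulfite_kmer` is hypothetical. It assumes that every T in a bisulfite-treated read may derive from either a T or an unmethylated C in the reference, while every other base is left unchanged.

```python
from itertools import product


def expand_bisulfite_kmer(kmer):
    """Return every reference k-mer that could have produced `kmer`
    after bisulfite conversion (illustrative helper, not part of Pash)."""
    # A T in the read may be a genuine T or a converted C; other bases are fixed.
    choices = [("C", "T") if base == "T" else (base,) for base in kmer.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_bisulfite_kmer("CTT"))  # ['CCC', 'CCT', 'CTC', 'CTT']
```

Reads that derive from the opposite strand show G-to-A changes instead; mapping against both the forward and the reverse complement strand, as the slide describes, covers that case with the same expansion.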

Galaxy/Genboree
http://genboree.org/galaxy
- Developed at Penn State University
- Benefits
  - Rapid deployment tool
  - Share pipelines with others
- Alan Harris, Sriram Raghuram
  - Deployed Galaxy/Genboree
  - Integration with Genboree API for upload/download
  - Adaptors for LFF file format support
  - EDACC XML validation tools
- Sriram Raghuram, Andrew Jackson, Cristian Coarfa
  - Integration with compute clusters
- Arpit Tandon, Sriram Raghuram
  - Deployed analysis tools

Primary Analysis Pipelines
- Implemented & exposed via Galaxy/Genboree
  - Read mapping
  - Bisulfite Sequencing read mapping
  - Peak calling (ChIP-Seq, MeDIP-Seq)
    - MACS (Harvard), FindPeaks (UBC)
  - Chromatin accessibility
    - HotSpot (UW)
  - Small RNA-Seq
- Coming soon
  - mRNA-Seq
    - Expression, alternative splicing
    - Gene fusion
- Typical user interaction
  - Use Galaxy for user input
  - Submit jobs to a cluster
  - Upload results to Genboree

Reads Mapping

ChIP-Seq
- Select uniquely mapping reads
- Build read density maps
  - Extend each read 200 bp along the mapping strand
  - Remove monoclonal reads
- Generate WIG data
  - Can be visualized in Genboree and UCSC
- Peak calling
  - FindPeaks, MACS
- Interpret peaks
  - Overlap with genomic features of interest: gene promoters, etc.
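The density-map steps above can be sketched as follows. This is a minimal illustration, not the EDACC pipeline code: the function names, the tuple layout `(start, end, strand)` with 0-based half-open coordinates, and the reading of "monoclonal" as exact duplicates (same start, end and strand) are all assumptions made for the example.

```python
def build_density(reads, chrom_length, extension=200):
    """Per-base fragment coverage for one chromosome.

    reads: iterable of (start, end, strand), 0-based half-open, strand '+' or '-'.
    """
    # Monoclonal reads are treated here as exact duplicates; keep one copy each.
    unique_reads = set(reads)
    diff = [0] * (chrom_length + 1)
    for start, end, strand in unique_reads:
        if strand == "+":
            # Extend downstream from the read start.
            frag_start = start
            frag_end = min(start + extension, chrom_length)
        else:
            # Minus-strand reads extend toward lower coordinates from the read end.
            frag_start = max(end - extension, 0)
            frag_end = min(end, chrom_length)
        diff[frag_start] += 1
        diff[frag_end] -= 1
    # Prefix sum over the difference array gives the coverage at each base.
    density, running = [], 0
    for delta in diff[:chrom_length]:
        running += delta
        density.append(running)
    return density


def to_wig(chrom, density):
    """Format non-zero positions as variableStep WIG text (1-based positions)."""
    lines = [f"variableStep chrom={chrom}"]
    lines.extend(f"{i + 1} {value}" for i, value in enumerate(density) if value)
    return "\n".join(lines)


reads = [(100, 136, "+"), (100, 136, "+"), (400, 436, "-")]
density = build_density(reads, chrom_length=1000)
wig_text = to_wig("chr1", density)
```

The duplicated plus-strand read counts once, and the region where the two extended fragments overlap reaches a depth of 2, which is the kind of local enrichment a peak caller then looks for.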

MeDIP-Seq
- Select uniquely mapping reads
- Build read density maps
- Determine methylated CpGs
  - FindPeaks

Finding methylated CpGs

MeDIP-Seq Signal Visualization

MRE-Seq
- Select uniquely mapping reads
- Determine unmethylated CpGs

Bisulfite Sequencing
- Shotgun Bisulfite Sequencing (Methyl-C)
  - Genome wide
- Reduced Representation Bisulfite Sequencing (RRBS)
  - Enzyme cocktail
- Map using Pash
- Build methylation maps

Bisulfite Sequencing Read Mapping

Methylation Maps

Position  Strand  CHHStatus  Methylation  Unmethylated  TotalReads
50100242  +       CG         1            0             1
50100243  -       CG         40           11            51
50100250  +       CG         1            0             1
50100251  -       CG         37           8             46
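A per-site methylation level can be derived from a table like this one. The sketch below assumes that the Methylation and Unmethylated columns are counts of reads supporting each state, and it divides by their sum rather than by TotalReads, since the two need not agree (the last row lists 37 and 8 against a total of 46). The function name `methylation_levels` is hypothetical.

```python
TABLE = """\
Position Strand CHHStatus Methylation Unmethylated TotalReads
50100242 + CG 1 0 1
50100243 - CG 40 11 51
50100250 + CG 1 0 1
50100251 - CG 37 8 46
"""


def methylation_levels(table_text):
    """Return (position, strand, level) per row; level is None without calls."""
    rows = []
    for line in table_text.strip().splitlines()[1:]:  # skip the header row
        position, strand, _context, meth, unmeth, _total = line.split()
        meth, unmeth = int(meth), int(unmeth)
        called = meth + unmeth
        # Fraction of called reads that support methylation at this cytosine.
        level = meth / called if called else None
        rows.append((int(position), strand, level))
    return rows


levels = methylation_levels(TABLE)
for position, strand, level in levels:
    print(position, strand, "NA" if level is None else f"{level:.2f}")
```

Sites covered by a single read, such as the two plus-strand rows, report a level of 1.00 from one observation, so in practice a minimum coverage filter would usually be applied before interpreting the levels.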

Small RNA-Seq
- Trim adapters
- Map reads onto target genome
  - Up to 100 locations per read
- Interpret
  - Overlap with miRNAs, piRNAs, sno/scaRNAs
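The adapter trimming step can be illustrated with a small exact-match trimmer. This is a sketch under stated assumptions, not the tool used in the pipeline: the function name `trim_adapter`, the `min_overlap` parameter and the adapter sequence are placeholders chosen for the example, and real trimmers also tolerate sequencing errors, which this one does not.

```python
ADAPTER = "TCGTATGCCGTCTTCTGCTTG"  # example 3' adapter; substitute the kit's own


def trim_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Remove a 3' adapter from `read` using exact matching only."""
    # Case 1: the whole adapter occurs inside the read.
    index = read.find(adapter)
    if index >= 0:
        return read[:index]
    # Case 2: the read ends partway into the adapter, so look for the longest
    # adapter prefix (at least min_overlap bases) that matches the read's end.
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


insert = "TGAGGTAGTAGGTTGTATAGTT"
trimmed = trim_adapter(insert + ADAPTER[:10])
print(trimmed)
```

Because small RNAs are shorter than the read length, most reads run into the adapter, which is why trimming comes before mapping in this pipeline.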

Exercise
- Download the input MeDIP-Seq file from the workshop wiki
- Analyze it using FindPeaks in Galaxy
- Obtain results in Genboree LFF format
- Upload the results to a Genboree database
- View the results in a tabular view
- Find the largest peaks
- Explore them in the Genboree browser
