EDACC Quality Characterization for Various Epigenetic Assays

EDACC Quality Characterization for Various Epigenetic Assays • Cristian Coarfa • Bioinformatics Research Laboratory • Molecular and Human Genetics

Data Types Submitted To EDACC • ChIP-Seq • Methyl-C • RRBS • MRE-Seq • MeDIP-Seq • Chromatin Accessibility • small RNA-Seq • mRNA-Seq

Quality Characterization • How to measure the quality of mapped reads? • Note: not quality of sequencing • statistics on this are provided by the sequencer • Most labs do some sort of visual inspection • Metrics for characterizing level 2 data quality • Apply it to various data types submitted to EDACC

Enrichment Based Protocols • ChIP-Seq, MeDIP-Seq, Chromatin Accessibility • Methods implemented • PTIH (percent tags in hotspots) • iROC (integral of ROC) • Percent tags in peaks (FindPeaks) • Poisson enrichment metric • Implemented in EDACC pipeline • Metrics computed on all submitted data

PTIH (percent tags in hotspots) • Detect enriched regions using “hotspot” algorithm • PTIH = percentage of all tags that fall in hotspots

Hotspot algorithm • Scan statistic gauging enrichment with a z-score based on the binomial distribution • Figure: a small 250 bp window holding n tags, inside a large 50 kb window holding N tags • The binomial distribution gives the probability of seeing n tags in the small window given N tags total in the large window • This adjusts for local background fluctuations (due to CNV, for instance)
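The z-score described on this slide can be sketched as follows. This is a minimal illustration, not the hotspot program itself: it assumes a tag lands in the small window with probability equal to the ratio of the window sizes (250 bp / 50 kb), and the function name and arguments are hypothetical.

```python
import math


def hotspot_z_score(n_small, n_large, small_bp=250, large_bp=50_000):
    """z-score for seeing n_small tags in the small window,
    given n_large tags in the surrounding large window.

    Assumption: under the null, each of the n_large tags falls in the
    small window independently with probability small_bp / large_bp.
    """
    if n_large <= 0:
        return 0.0
    p = small_bp / large_bp
    expected = n_large * p                      # binomial mean
    sd = math.sqrt(n_large * p * (1.0 - p))     # binomial standard deviation
    return (n_small - expected) / sd


# 400 tags in the 50 kb window give an expectation of 2 tags per 250 bp.
z_enriched = hotspot_z_score(n_small=20, n_large=400)
z_background = hotspot_z_score(n_small=2, n_large=400)
```

Because the expectation is derived from the local 50 kb window rather than from the genome-wide tag density, a region with elevated copy number raises both n and N, which is how the local background adjustment enters.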

PTIH values (figure panels): 0.48, 0.19, 0.72, 0.48

Ratio of Tags in Peaks • Determine uniquely mapping reads • Use FindPeaks to call peaks • Count reads mapping into peaks • Report the count as a percentage of total mapped reads
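The counting step above can be sketched as follows, once peaks have already been called. This is an illustration only: it does not run or parse FindPeaks, and it assumes reads are reduced to a single (chromosome, position) pair and that peaks are non-overlapping half-open intervals. All names are hypothetical.

```python
from bisect import bisect_right


def fraction_in_peaks(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, pos) for uniquely mapping reads.
    peaks: list of (chrom, start, end), half-open and non-overlapping.
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        if chrom not in by_chrom:
            continue
        # Rightmost peak that starts at or before this position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < by_chrom[chrom][i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0


example_peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
example_reads = [("chr1", 150), ("chr1", 199), ("chr1", 200),
                 ("chr1", 550), ("chr2", 150)]
ratio = fraction_in_peaks(example_reads, example_peaks)  # 3 of 5 reads
```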

Poisson Based Enrichment Method • Determine uniquely mapping reads • Remove duplicate reads • Bin the reads into 1kb windows • Infer parameters of a simple Poisson distribution • Filter enriched windows (p-value < 0.01) • Count reads mapping into enriched windows
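A minimal sketch of these steps for a single chromosome follows. Several details are assumptions, since the slide does not specify them: duplicates are taken to be reads with an identical start position, the Poisson rate is estimated as the mean number of reads per window, and the p-value is the upper tail P(X >= observed count). Function names are hypothetical.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # add P(X = 1) ... P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def poisson_enrichment(read_starts, chrom_length, window=1000, alpha=0.01):
    """Fraction of de-duplicated reads that fall in enriched windows."""
    unique_starts = set(read_starts)                    # remove duplicates
    if not unique_starts:
        return 0.0
    counts = Counter(pos // window for pos in unique_starts)
    n_windows = math.ceil(chrom_length / window)
    lam = len(unique_starts) / n_windows                # mean reads per window
    enriched = [w for w, c in counts.items() if poisson_sf(c, lam) < alpha]
    return sum(counts[w] for w in enriched) / len(unique_starts)


# 50 background reads (one per even-numbered window) plus 50 reads piled
# into window 5, with a few exact duplicates that should be discarded.
background = [w * 1000 + 10 for w in range(0, 100, 2)]
pileup = [5000 + i for i in range(50)]
frac = poisson_enrichment(background + pileup + [5000, 5000, 5001],
                          chrom_length=100_000)
```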

Next Step – Metrics Evaluation • Metrics probe different features of data • Use visual inspection to ascertain which (one or more) of the proposed methods captures useful aspects of data quality.

ChIP-Seq/Chromatin Accessibility/FindPeaks QC Metrics • Collaborative efforts between centers • ~330 lanes of verified ChIP-Seq, MeDIP-Seq, and chromatin accessibility data • Accessible in Epigenome Atlas

Going forward • EDACC will run continuously on all submitted data • Option to automatically flag data that fall below specified thresholds • For most data types we need further experience on what thresholds make sense • Include QC metrics in metadata • Provide downstream users with this information • Note that we are breaking new ground • uniform quality scoring is not being performed by other major consortia (ENCODE, modENCODE)

Pearson correlation for ChIP-Seq Histone Modification • Using raw density maps at 10kb resolution • Process • Select uniquely mapping reads • Extend 200bp in mapping strand direction • Remove monoclonal reads • Build density map • Pearson correlation with other submitted marks • Ideally: a mark correlates best with other experiments for the same assay • How well does Pearson correlation work? • It helped us identify 5 bad lanes; the REMCs retracted the data
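The density-map and correlation steps can be sketched as follows for one chromosome. The representation is an assumption: each read is a (5' position, strand) pair, monoclonal reads are taken to be reads with identical position and strand, and each window accumulates the number of extended-fragment bases that overlap it. Function names are hypothetical.

```python
import math


def density_map(reads, chrom_length, window=10_000, extend=200):
    """Per-window coverage from reads extended in their strand direction.

    reads: iterable of (five_prime_pos, strand) with strand "+" or "-".
    """
    bins = [0] * math.ceil(chrom_length / window)
    for start, strand in set(reads):          # drop monoclonal reads
        if strand == "+":
            lo, hi = start, min(start + extend, chrom_length)
        else:
            lo, hi = max(start - extend, 0), start
        # Add the overlap (in bp) with every window the fragment touches.
        for w in range(lo // window, (hi - 1) // window + 1):
            bins[w] += min(hi, (w + 1) * window) - max(lo, w * window)
    return bins


def pearson(x, y):
    """Pearson correlation of two equal-length density maps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0                            # undefined; reported as 0 here
    return cov / math.sqrt(vx * vy)


mark_a = density_map([(9_900, "+"), (9_900, "+"), (25_000, "-")], 30_000)
mark_b = density_map([(9_950, "+"), (25_100, "-")], 30_000)
r = pearson(mark_a, mark_b)
```

In practice each submitted experiment would be correlated against every other one, and a lane whose best match is a different mark would be a candidate for visual inspection.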

PCA Analysis • 10kb windows on chr20 • PCA using Pearson correlation metric

Figure: PCA using the Pearson correlation metric (label: PCA 53.8%) • Plotted experiments: Input, H3K36me3, H3K9me3, H3K79me1, H3K20me1, H3K27me3, H3K4me3, H3K9ac, H2AK5ac, H2BK120ac, H2BK12ac, H2BK15ac, H2BK20ac, H3K14ac, H3K18ac, H3K23ac, H3K27ac, H3K4ac, H3K56ac, H4K5ac, H4K8ac, H4K91ac

MRE-Seq • Reads are mapped onto reference genome • Uniquely mapping reads are kept • Build the fragment map of expected mapping locations based on the enzyme cocktail used • Count reads mapping within the expected digest fragments • 76-99% of reads map within expected fragments
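One way to picture the fragment-map step is the sketch below. It is an illustration under several assumptions that the slide does not state: the recognition motifs are placeholders rather than the actual enzyme cocktail, the start of each motif stands in for the true cut position, and expected fragments are taken to be the spans between adjacent cut sites within a chosen size range. All names are hypothetical.

```python
import re


def cut_sites(sequence, motifs):
    """Sorted start positions of every (possibly overlapping) motif match."""
    sites = set()
    for motif in motifs:
        # A lookahead lets overlapping occurrences be reported as well.
        for m in re.finditer(f"(?={re.escape(motif)})", sequence):
            sites.add(m.start())
    return sorted(sites)


def expected_fragments(sites, min_len, max_len):
    """Spans between adjacent cut sites whose length is in the size range."""
    return [(a, b) for a, b in zip(sites, sites[1:])
            if min_len <= b - a <= max_len]


def fraction_in_fragments(read_starts, fragments):
    """Fraction of reads that start inside any expected fragment."""
    if not read_starts:
        return 0.0
    hits = sum(any(a <= pos < b for a, b in fragments) for pos in read_starts)
    return hits / len(read_starts)


# Toy reference with placeholder motifs at positions 2, 16 and 50.
toy_reference = "AA" + "CCGG" + "T" * 10 + "CCGG" + "A" * 30 + "GCGC" + "TT"
sites = cut_sites(toy_reference, ["CCGG", "GCGC"])
fragments = expected_fragments(sites, min_len=10, max_len=20)
frac_mre = fraction_in_fragments([3, 10, 15, 16, 40], fragments)
```

The linear scan over fragments keeps the sketch short; a genome-scale version would index fragments per chromosome.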

mRNA-Seq • Reads are mapped onto reference genome • Uniquely mapping reads are kept • Count reads mapping within the exons of UCSC genes • 70-90% of reads map within gene exons • Gene sets: UCSC known genes, Entrez genes

Small RNA-Seq • Trim adaptors • Reads are mapped onto reference genome • Reads mapping up to 100 locations are kept • Count reads overlapping with known small RNAs • miRNAs, piRNAs, sno/scaRNAs, repeat RNAs • At least 30% of reads overlap with known small RNAs

Bisulfite Sequencing • Map using Pash • Methyl-C • Genome-wide QC • C->T conversion rates; typically 99% • RRBS • Enzyme cocktail QC • Map within expected cut sites • Ratio varies 40%-90%
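A C->T conversion rate can be estimated along the following lines. This is a sketch and not how Pash or the EDACC pipeline computes it: it assumes gapless forward-strand alignments, and it assumes that cytosines outside a CpG context are unmethylated, so that any C still read as C there reflects failed conversion. Names are hypothetical.

```python
def conversion_rate(reference, alignments):
    """Fraction of non-CpG reference cytosines that are read as T.

    reference: reference sequence for one region (forward strand).
    alignments: list of (offset, read_sequence), gapless, forward strand.
    """
    converted = unconverted = 0
    for offset, read in alignments:
        for i, base in enumerate(read):
            ref_pos = offset + i
            if ref_pos >= len(reference) or reference[ref_pos] != "C":
                continue
            if reference[ref_pos + 1:ref_pos + 2] == "G":
                continue                 # CpG site: may be truly methylated
            if base == "T":
                converted += 1
            elif base == "C":
                unconverted += 1
    total = converted + unconverted
    return converted / total if total else float("nan")


# Non-CpG cytosines sit at positions 1, 6 and 7; position 3 is a CpG.
toy_ref = "ACACGTCCAT"
toy_reads = [(0, "ATACGTTTAT"),   # all three non-CpG Cs converted
             (0, "ATACGTTCAT")]   # position 7 left unconverted
rate = conversion_rate(toy_ref, toy_reads)  # 5 converted out of 6
```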

QC for MeDIP-Seq Data Using Galaxy

Exercise • Download the input MeDIP-Seq file from the workshop wiki • Determine the ratio of reads in peaks using FindPeaks in Galaxy
