760 likes | 975 Views
High-Throughput Sequencing. Advanced Microarray Analysis BIOS 691-803, 2008 Dr. Mark Reimers, VCU. Quantitative HTS - Outline. Technology Preprocessing Quantitative analysis Applications ChIP-Seq RNA-Seq Methyl-Seq. The Technology.
E N D
High-Throughput Sequencing Advanced Microarray Analysis BIOS 691-803, 2008 Dr. Mark Reimers, VCU
Quantitative HTS - Outline • Technology • Preprocessing • Quantitative analysis • Applications • ChIP-Seq • RNA-Seq • Methyl-Seq
The Technology • Most sequencing proceeds by addition of fluor-labeled bases • Do this in parallel on a flat surface • Capture each stage with good camera • Align images
Roche - 454 • Parallel Pyrosequencing on beads
Resquencing each fragment with different primers Reconstruct each fragment separately ABI SOLiD
Issues • Pre-processing • Base calling • Mapping reads • QA • Quantitative analysis • Variation and noise • Biases • Models • Accuracy and validation
Pre-processing – Base Calling • Not all steps completed properly • Sequence can lag behind or skip ahead • Hence most light spots a mixture of different colors • Simple rule: use brightest signal
Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased Courtesy Thierry-Mieg
Position of single mismatch in uniquely mapped tags Courtesy Thierry-Mieg
Pre-processing – Mapping Reads • Huge numbers (10M – 70M) • BLAT (2002 high-speed method) • Eland (proprietary Illumina) • Other new methods: MAQ, SOAP
Fraction of reads mapping to targets Typically 5-10M reads per lane and 60-80% map to targets Some repetitive sequence Quality Assessment
Comparing Samples - A Simple Normalization • Different numbers of counts per lane • Divide counts in a region of interest (a genomic region or a gene or an exon) by all counts (total per million reads -TPM) • For comparing genomic regions of different lengths divide also by length of region TPKM (total per kilobase per million)
Quant. Analysis - Variation • Poisson model often used for random variation • Most HTS data ‘over-dispersed’ relative to Poisson • Negative Binomial often used • Parameter fitted
Quantitative Analysis - Biases • Not all regions represented equally • GC rich regions represented more • Independent of GC some chromosome regions represented more • Euchromatin bias • Sequence initiation site biases • ‘Mapability’ biases – some regions won’t have any uniquely mapped tags
Density of reads depends strongly on GC content of regions GC Bias GC content (%)
Genomic Position Biases • Count tags from randomly sheared DNA in red with GC content in blue
Consistent Start Position Bias Counts per start site in lane 1 vs lane 2
RNA-Seq Data Gene Model Kidney Reads Liver Reads From Marioni et al 2008
Issues How replicable is RNA-Seq? How consistent are the two technologies? Which is better? Marioni et al, Genome Research, 2008 Comparing RNA-Seq & Affy
Comparing Fold-Changes • D.E. by ILM • Red >250 • Green <250 • Black Not DE by ILM
Model for Variation • Poisson counts hypergeometric comparison • Make uniform p-values by adding random term • Use lower tails only
False Positive Rates • QQ-plots of p-values between tech. reps
Different Concentrations are NOT Comparable! • QQ-plots of p-values between 3pM and 1.5 pM
Normalization of RNA-Seq • Robinson et al noticed that most genes appeared less expressed in liver Fig 1 from Robinson & Oshlak, Genome Biology 2010
A Better Normalization for RNA-Seq - TMM • Drop extremes of ratios • Drop very high count genes • Compute trimmed means of samples • Center log-ratios between samples
New Things to do with RNA-Seq • Allele-specific expression • Splice variation • Between tissues • In disease • Alternate initiation sites • Select 5’ capped RNA fragments • Alternate termination
It is possible to compare allele-specific expression counts Sample from VCU Replicate samples P-values for binomial tests of equality About half show differential expression! Allelic Comparison
Detecting Splice Variation • Deep sequencing shows up clear variation in exon usage • Wang et al Nature 2008
Tissue Map of Splice Variation From Wang et al • Brain is most distinctive • Individuals seem to differ • Cell lines seem to have distinct splice patterns
Splicing is Complex • Many different splice operations exist • Only some of these characterized by counting exon reads
Issues in Detecting Splice Variants • Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance • Reads that bridge splice junctions would be definitive but mapping is very dubious with short (<40 base) reads • All possible splice junctions are not known • Hard to even search through the known ones
Methodology for Splice Variants • Count reads mapped to exons and and compare ratios across samples • Wang et al, and most others • Count reads that cross splice junctions
ChIP-Seq Workflow • Cross-link proteins to DNA • Fragment DNA • Extract with antibody • Reverse cross links • Sequence fragments • DO CONTROLS!
ChIP-Seq Data • From Rozowsky et al, Nature Biotech 2009
Peak-Finding - Simple • Extend tags and count overlap • How much to extend?
Peak Finding – Better • Tags starting on opposite strands are likely to start at opposite ends • Identifying the cross-over point leads to improved accuracy