Next Generation Sequencing

Sequencing techniques • ChIP-seq • MBD-seq (MIRA-seq) • BS-seq • RNA-seq • miRNA-seq

ChIP-seq • ChIP-seq is a frontier technology for analyzing in vivo protein-DNA interactions. • It combines chromatin immunoprecipitation (ChIP) with ultra-high-throughput, massively parallel sequencing • It allows mapping of protein-DNA interactions in vivo on a genome-wide scale

Workflow of ChIP-Seq • Mardis, E.R. Nat. Methods 4, 613-614 (2007)

The advantages of ChIP-seq • Microarray-based ChIP-chip designs require prior knowledge of the sequences of interest, such as promoters, enhancers, or RNA-coding domains; ChIP-seq does not • Lower cost • Higher resolution • Higher accuracy • Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.

Sequencers • Solexa (Illumina) • 1 GB of sequences in a single run • 35 bases in length • 454 Life Sciences (Roche Diagnostics) • 25-50 MB of sequences in a single run • Up to 500 bases in length • SOLiD (Applied Biosystems) • 6 GB of sequences in a single run • 35 bases in length

Illumina Genome Analysis System • 8 lanes • 100 tiles per lane

Sequencing

Sequencer Output • Sequence Files • Quality Scores

Sequence Files • 10-40 million reads per lane • ~500 MB files

Quality Score Files • Quality scores describe the confidence of bases in each read • Solexa pipeline assigns a quality score to the four possible nucleotides for each sequenced base • 9 million sequences (500MB file) → ~6.5GB quality score file
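To make "confidence of bases" concrete, the sketch below converts a quality score into the probability that a base call is wrong. It assumes the early Solexa log-odds definition Q = -10·log10(p / (1 - p)) and shows the Phred definition Q = -10·log10(p) for comparison. The function names are illustrative and are not part of the Solexa pipeline.

```python
def solexa_to_error_prob(q):
    """Error probability p for a Solexa log-odds score Q = -10*log10(p / (1 - p))."""
    odds = 10 ** (-q / 10.0)      # odds of the call being wrong: p / (1 - p)
    return odds / (1.0 + odds)


def phred_to_error_prob(q):
    """Error probability p for a Phred score Q = -10*log10(p)."""
    return 10 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (-5, 0, 10, 20, 40):
        print(q, solexa_to_error_prob(q), phred_to_error_prob(q))
```

The two scales nearly coincide for high scores and diverge for low ones, where a Solexa score can be negative.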

Bioinformatics Challenges • Rapid mapping of these short sequence reads to the reference genome • Visualization of mapping results • Thousands of enriched regions • Peak analysis • Peak detection • Finding exact binding sites • Comparison of results from different experiments • Normalization • Statistical tests

Mapping of Short Oligonucleotides to the Reference Genome • Mapping Methods • Need to allow mismatches and gaps • SNP locations • Sequencing errors • Reading errors • Indexing and hashing • genome • oligonucleotide reads • Use of quality scores • Use of SNP knowledge • Performance • Partitioning the genome or sequence reads

Mapping Methods: Indexing the Genome • Fast sequence similarity search algorithms (like BLAST) • Not specifically designed for mapping millions of query sequences • Take a very long time • e.g. 2 days to map half a million sequences to a 70MB reference genome (using BLAST) • Indexing the genome is memory expensive

SOAP (Li et al, 2008) • Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding • The reference genome is loaded into memory • For the human genome, 14GB RAM is required for storing reference sequences and index tables • 300 (gapped) to 1200 (ungapped) times faster than BLAST • Allows 2 mismatches or one continuous gap of 1-3bp • Errors accumulate during the sequencing process • Much higher number of sequencing errors at the 3’-end (sometimes making the reads unalignable to the reference genome) • Iteratively trims several basepairs at the 3’-end and redoes the alignment • Improves sensitivity
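A minimal sketch of 2-bits-per-base coding, to show why it makes comparisons cheap. The bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions for illustration; SOAP's actual encoding and data structures may differ.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed assignment
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode(value, length):
    """Unpack an integer produced by encode() back into a DNA string of given length."""
    return "".join(
        BITS_TO_BASE[(value >> shift) & 0b11]
        for shift in range(2 * (length - 1), -1, -2)
    )


def count_mismatches(a, b, length):
    """Count differing bases between two encoded sequences of equal length."""
    diff = a ^ b                                # non-zero bit pair = mismatching base
    low_bits = (4 ** length - 1) // 3           # binary 0101...01, one bit per base
    collapsed = (diff | (diff >> 1)) & low_bits  # fold each pair onto its low bit
    return bin(collapsed).count("1")


if __name__ == "__main__":
    read, ref = encode("ACGTACGT"), encode("ACGAACGT")
    print(count_mismatches(read, ref, 8))
```

A 32-base read fits in a single 64-bit word under this coding, so a mismatch count needs only an XOR, a shift, a mask and a bit count.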

Mapping Methods: Indexing the Oligonucleotide Reads • ELAND (Cox, unpublished) • “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.) • SeqMap (Jiang, 2008) • “Mapping massive amount of oligonucleotides to the genome” • RMAP (Smith, 2008) • “Using quality scores and longer reads improves accuracy of Solexa read mapping” • MAQ (Li, 2008) • “Mapping short DNA sequencing reads and calling variants using mapping quality scores”

Mapping Algorithm (2 mismatches) • Partition reads into 4 seeds {A,B,C,D} • With at most 2 mismatches spread over 4 seeds, at least 2 seeds must map with no mismatches • Scan genome to identify locations where the seeds match exactly • 6 possible combinations of the seeds to search • {AB, CD, AC, BD, AD, BC} • 6 scans to find all candidates • Do approximate matching around the exactly-matching seeds. • Determine all targets for the reads • Ins/del can be incorporated • The reads are indexed and hashed before scanning genome • Bit operations are used to accelerate mapping • Each nt encoded into 2 bits
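The seed strategy above can be sketched as follows. This is an illustrative toy, not the code of ELAND, SeqMap or any other listed program: it works on plain strings instead of 2-bit codes, searches the forward strand only, ignores ins/del, and all function names are hypothetical.

```python
from itertools import combinations


def split_seeds(read_len, n_seeds=4):
    """Return (start, end) offsets that cut a read into n_seeds contiguous seeds."""
    base, extra = divmod(read_len, n_seeds)
    bounds, start = [], 0
    for k in range(n_seeds):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(genome, reads, max_mismatches=2):
    """Return, per read, the sorted genome positions matching within max_mismatches."""
    read_len = len(reads[0])
    assert all(len(r) == read_len for r in reads), "reads must share one length"
    seeds = split_seeds(read_len)
    hits = [set() for _ in reads]
    # One genome scan per seed pair: AB, AC, AD, BC, BD, CD (6 scans).
    for i, j in combinations(range(len(seeds)), 2):
        (si, ei), (sj, ej) = seeds[i], seeds[j]
        index = {}  # hash the reads on the two seeds that must match exactly
        for rid, read in enumerate(reads):
            index.setdefault((read[si:ei], read[sj:ej]), []).append(rid)
        for pos in range(len(genome) - read_len + 1):
            key = (genome[pos + si:pos + ei], genome[pos + sj:pos + ej])
            for rid in index.get(key, ()):
                # Approximate matching around the exactly matching seeds.
                if hamming(reads[rid], genome[pos:pos + read_len]) <= max_mismatches:
                    hits[rid].add(pos)
    return [sorted(h) for h in hits]


if __name__ == "__main__":
    genome = "TTGACCAGTAGGCATCGATTACAGGA"
    print(map_reads(genome, ["CCAGTAGGCATC", "ACAGTTGGCATC"]))
```

Hashing the reads rather than the genome keeps memory proportional to the read set, at the price of scanning the genome once per seed pair.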

ELAND (Cox, unpublished) • Commercial sequence mapping program that comes with the Solexa machine • Allows at most 2 mismatches • Maps sequences up to 32 nt in length • All sequences have to be the same length

RMAP (Smith et al, 2008) • Improves mapping accuracy • Possible sequencing errors at 3’-ends of longer reads • Use of base-call quality scores • Quality cutoff • High quality positions are checked for mismatches • Low quality positions always induce a match • Quality control step eliminates reads with too many low quality positions • Allows any number of mismatches
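A small sketch of the quality-cutoff idea described above: positions below the cutoff never count as mismatches, and reads with too many such positions are dropped first. The parameter names `cutoff` and `max_low_quality` are hypothetical and do not correspond to RMAP's actual options.

```python
def passes_quality_control(quals, cutoff, max_low_quality):
    """Keep a read only if it has at most max_low_quality positions below cutoff."""
    return sum(q < cutoff for q in quals) <= max_low_quality


def quality_aware_mismatches(read, quals, window, cutoff):
    """Count mismatches at high-quality positions; low-quality positions always match."""
    return sum(
        1
        for base, q, ref in zip(read, quals, window)
        if q >= cutoff and base != ref
    )


if __name__ == "__main__":
    read, quals = "ACGTACGT", [30, 30, 30, 30, 30, 30, 8, 5]
    window = "ACGTACCA"  # differs from the read only at the two low-quality 3' positions
    if passes_quality_control(quals, cutoff=20, max_low_quality=3):
        print(quality_aware_mismatches(read, quals, window, cutoff=20))
```

Treating low-quality positions as wildcards is what makes the quality control step necessary: a read that is mostly low quality would otherwise match almost anywhere.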

Figure: read mapping flow • 12 M reads → Quality filter → Map to reference genome → Map to reference genome • Mapped to a unique location: 7.2 M • Mapped to multiple locations: 1.8 M • No mapping: 2.5 M • Low quality: 0.5 M • Additional count shown in the figure: 3 M

Visualization • BED files are built to summarize mapping results • BED files can be easily visualized in the UCSC Genome Browser (http://genome.ucsc.edu)
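A minimal sketch of turning mapping results into a BED file for upload as a custom track. It assumes each mapped read is available as a (chromosome, 0-based start, read length, strand, name) tuple; that input layout, the function name and the track name are illustrative. BED itself uses tab-separated columns with a 0-based start and an exclusive end.

```python
def reads_to_bed(mappings, track_name="mapped_reads"):
    """Format mapped reads as a 6-column BED string with a UCSC track line."""
    lines = ['track name="%s" description="Uniquely mapped reads"' % track_name]
    for chrom, start, length, strand, name in sorted(mappings, key=lambda m: (m[0], m[1])):
        end = start + length  # BED end coordinate is exclusive
        lines.append("\t".join([chrom, str(start), str(end), name, "0", strand]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [
        ("chr1", 1500, 35, "-", "read2"),
        ("chr1", 100, 35, "+", "read1"),
    ]
    with open("mapped_reads.bed", "w") as handle:
        handle.write(reads_to_bed(example))
```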

Visualization: Genome Browser Robertson, G. et al. Nat. Methods 4, 651-657 (2007)

Visualization: Custom 300 kb region from mouse ES cells Mikkelsen, T.S. et al. Nature 448, 553-562 (2007)

Screenshot of ZNF263 peaks • Frietze et al., JBC, 2010

ChIP-seq peak analysis programs • SISSRs (Site Identification from Short Sequence Reads): Jothi et al., NAR, 2008. • MACS (Model-based Analysis of ChIP-Seq): Zhang et al., Genome Biology, 2008. • QuEST (Genome-wide analysis of transcription factor binding sites based on ChIP–seq data): Valouev, A. et al., Nature Methods, 2008. • PeakSeq (PeakSeq enables systematic scoring of ChIP–seq experiments relative to controls): Rozowsky, J. et al., Nature Biotech., 2009. • FindPeaks (FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology): Fejes, A.P. et al., Bioinformatics, 2008. • Hpeak (An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data): Xu et al., Bioinformatics, 2008.

MBD-seq (MIRA-seq) • The methyl-CpG binding domain-based capture (MBDCap) technology is used to capture methylated sites. Double-stranded methylated DNA fragments can be detected. It is sensitive to different methylation densities • Genome-wide sequencing is used to obtain the sequence of each short fragment. • The sequenced reads are mapped to the human genome to find their locations.

BALM: High resolution program for MBD-seq (Lan et al, PLoS ONE, 2011, 6:e22226) • Panel A (experiment): Fragmentation → MBD2 enrichment → Elution (500mM, 1000mM, 2000mM; unenriched input) → Sequencing and Alignment • Panel B (BALM analysis): Initial scan for enriched regions using the tag shifting method • Set t > 0, s = 1 • Measure tag distribution around target sites • Estimate parameters of the Bi-asymmetric-Laplace distribution (MLE) • Scan genome for signal enriched regions • If s = t, continue; otherwise s = s + 1 and repeat • Decompose the mixture model using Expectation Maximization (EM) • Define hypermethylated regions and a methylation score for each CpG dinucleotide • Figure labels: Methylated CpG, Unmethylated CpG, Tags mapped to forward strand, Tags mapped to reverse strand, Tags distribution, Mixture model, BALM 1, BALM 2

Application on MBD-seq data (MCF7)

BS-seq • BS-seq: genomic DNA is treated with sodium bisulphite (BS), which converts cytosine, but not methylcytosine, to uracil, and is then subjected to high-throughput sequencing. • Truly single-base resolution
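The conversion logic can be illustrated in silico. The sketch below assumes forward-strand reads already aligned to the reference, and that converted uracil is read as T after amplification. The function names are illustrative and no real BS-seq aligner is modeled.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulphite treatment: unmethylated C reads as T, methylated C stays C."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read, offset=0):
    """Map each covered reference cytosine to True (methylated) or False (unmethylated)."""
    calls = {}
    for i, base in enumerate(read):
        ref_pos = offset + i
        if reference[ref_pos] != "C":
            continue
        if base == "C":
            calls[ref_pos] = True    # protected from conversion
        elif base == "T":
            calls[ref_pos] = False   # converted, so unmethylated
        # any other base is a sequencing error or variant: no call
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"
    read = bisulfite_convert(reference, methylated_positions=[1, 5])
    print(read, call_methylation(reference, read))
```

Each cytosine gets its own call, which is the sense in which the method has single-base resolution.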

RNA-seq • RNA-Seq is a new approach to transcriptome profiling that uses deep-sequencing technologies. • Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.

RNA-seq protocol

The advantages of RNA-seq • Single base resolution • High throughput • Low background noise • Ability to distinguish different isoforms and allelic expression • Relatively low cost
