320 likes | 855 Views
Lecture Outline. Odds and Ends from Next Generation Sequencing Lecture 1Third Generation SequencingThe Poisson Distribution and Fold CoverageAligning Reads to a Reference - A Practical Introduction. Read Length is Not As Important For Resequencing. Jay Shendure. Sequencing a complete cancer genome.
E N D
1. Next Generation Sequencing Technologies Rob Mitra
5488 Lecture
1/24/10
3. Read Length is Not As Important For Resequencing
4. Sequencing a complete cancer genome
5. What did we learn? Mutation spectrum carries a signature of damage by UV light.
Transcription coupled repair is a important DNA repair mechanism.
Plausible candidate genes.
6. What will we learn?
7. What will we learn part 2? In non-small cell lung cancer, the tyrosine kinase domain of EGFR is mutated in some patients (often non-smokers)
This mutation activates an anti-apoptotic pathway
The tyrosine kinase inhibitor Gefitinib is effective in patients with this mutation (60% show no growth or remission after 1 year versus 7% on chemotherapy)
Very few side effects!
Personalized Therapy
8. Pacific Biosciences: A Third Generation Sequencing Technology
12. Real Time Sequencing
13. How did they do? 150 bp circular template
~93% raw accuracy
15x coverage 99.3% accuracy
Still early days
14. Where are they going? Commercial specifications: 80,000 zero mode waveguides
10-15 minute runs
Throughput ~140 MB per hour* (currently Illumina is at ~100 MB per hour,expect 625 MB per hour this summer)
Methyl C?
Dark reads?
Phi29 so long read lengths possible (3kb now, up to 70kb later?)
Ease of sample prep
Camera costs
15. Math Aside: Sequencing coverage calculations Let’s say you need a base to be sequenced 5x for an accurate base call
If you sequence at 10x coverage how much of the genome will be sequenced at least 5 times?
16. Poisson Distribution
18. Example Average coverage = 5x
Probability of a given base being sequenced exactly 10 times is:
510e-5/10! = 0.018 or about 2% of bases will have 10x coverage.
19. Math Aside: Sequencing coverage calculations If you sequence at 10x coverage how much of the genome will be sequenced at least 5 times?
1 – [f(0,10) + f(1,10) + f(2,10) + f(3,10) + f(4,10)] = 0.97
20. Nuts and Bolts of Sequencing Resequencing
Map reads back to genome
Call bases
RNA-seq
Map reads back to genome
Count tags to determine gene expression levels
Chip Seq
Map reads back to genome
Peaks determine binding sites.
21. Mapping Reads Back Hash Table (Lookup table)
FAST, but requires perfect matches
Array Scanning
Can handle mismatches, but not gaps
Dynamic Programming (Smith Waterman, Forward, Viterbi)
Indels
Mathematically optimal solution
Slow (most programs use Hash Mapping as a prefilter)
Burrows-Wheeler Transform (BW Transform)
FAST (memory efficient)
But for gaps/mismatches, it lacks sensitivity
22. Aligners Evaluated
23. CPU Time 2M Reads to Hs36: SE/PE Benchmark: Maq (~8 hours)
24. Placement: 2M Reads on Hs36 Most aligners place ~80% of reads uniquely in Hs36.
25. PE Mode Increases Unique Mapping
26. What we find BW transform algorithms (Bowtie) are great for RNA-seq, ChIP-Seq
We prefer Maq, or an in-house alignment program that uses a seed-based approach followed by DP to find SNPs and INDELs
27. A Case Study in Resequencing
28. Exon Capture
29. Four individuals is enough!
30. A case study in RNA seq
31. Workflow Reads per kilobase of exonic sequence per million mapped readsReads per kilobase of exonic sequence per million mapped reads
33. RNA-Seq Versus Microarray Splice forms
Allele Specific Expression
Accuracy
Dynamic Range