Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis

Vanderbilt Center for Quantitative Sciences Summer Institute: Sequencing Analysis. Yan Guo. What is sequencing? Sequencing is the process of determining the precise order of nucleotides.

Presentation Transcript


  1. Vanderbilt Center for Quantitative Sciences Summer Institute • Sequencing Analysis • Yan Guo

  2. What is Sequencing? • Sequencing is the process of determining the precise order of nucleotides. • Non-high-throughput sequencing: Sanger sequencing, the basic chain-termination method developed by Frederick Sanger in 1977. It generates all possible single-stranded DNA molecules that are complementary to a given template and begin at a common 5' base.

  3. The Pros and Cons of Sanger Sequencing • Pros: • Highly accurate • Targetable • Cons: • Costs about $15 per 1,000 base pairs, so sequencing a whole human genome (about 3 billion base pairs) would cost roughly 3 billion / 1,000 x $15 = $45 million • Low detection rate for alternative alleles

  4. Current Generation Sequencing

  5. Sequencing Type By Source • RNA: mRNA, Small RNA, Total RNA • DNA: Whole Genome or targeted (Exome, mitochondrial, genes of interest, etc)

  6. Sequencing Data • Raw image data exceeds 2 TB per sample. • Raw sequence data is about 5-15 GB per single-end sample or 10-30 GB per paired-end sample for RNAseq or exome sequencing. Whole genome data can easily exceed 200 GB per sample. • In general, storage of about 5x the raw data size is needed to finish processing. • Raw data is usually in FASTQ format, with base quality on the Phred scale. • The older Illumina pipeline uses the Phred+64 scale; the newer CASAVA 1.8 pipeline uses the Sanger (Phred+33) scale.

  7. Single end vs Paired end • Paired-end data contains twice as much data as single-end data. • Paired-end sequencing is more expensive than single-end sequencing. • Quality control is easier with paired-end data (insert size, duplicate removal). • Paired-end data provides more opportunities to detect structural variants.

  8. What can you obtain from DNAseq • SNPs (require only a normal or a tumor sample) • Somatic mutations (require a tumor-normal pair) • Copy number variation (works best with whole genome sequencing) • Small structural variants: insertions, deletions • Large structural variants: translocations, inversions

  9. What can you obtain from RNAseq • Gene expression • SNPs (only for expressed genes) • Novel splicing variants • Gene fusions • RNAseq has been used primarily as a replacement for microarrays.

  10. How does RNAseq compare to Microarray? • Since 2008, people have been saying that RNAseq will replace microarrays for gene expression profiling. • VANTAGE stopped offering microarray service earlier this year. • 1. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63. • 2. Shendure, J., The beginning of the end for microarrays? Nat Methods, 2008. 5(7): p. 585-7.

  11. Data Distribution Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.

  12. Result Consistency Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.

  13. RNAseq vs Microarray - advantages

  14. Processing RNA

  15. Raw data • @HWI-ST508:203:D078GACXX:8:1101:1296:1011 1:N:0:ATCACG • NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC • + • #4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@

  16. @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
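
The read identifiers on slides 15 and 16 follow the colon-separated layout commonly documented for CASAVA 1.8 FASTQ files. The sketch below splits such a header into its parts. The field names (`instrument`, `run_id`, `is_filtered`, and so on) are labels chosen here for illustration and are not taken from the slides.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read: int          # 1 = single-end read or first mate, 2 = second mate
    is_filtered: bool  # "Y" means the read was filtered out (failed the filter)
    control_bits: int  # 0 when no control bits are set
    index: str         # index (barcode) sequence


def parse_header(line: str) -> IlluminaHeader:
    """Parse a CASAVA 1.8 style FASTQ header line (assumed layout)."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The header has two space-separated halves, each colon-delimited.
    location, description = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


header = parse_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(header.lane, header.is_filtered, header.index)
```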

  17. Phred Score
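
As a minimal sketch of the Phred scale: a quality score Q encodes the estimated base-call error probability P as Q = -10 * log10(P), and FASTQ stores each score as one ASCII character with code Q + offset (33 for the Sanger scale, 64 for the older Illumina scale). The function names below are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score into the estimated base-call error probability."""
    return 10 ** (-q / 10)


# Quality line of the read shown on slide 15 (Sanger scale, offset 33).
quality = "#4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@"
scores = phred_scores(quality)
print(scores[:4])              # first four scores: [2, 19, 28, 35]
print(error_probability(20))   # Q20 corresponds to a 1 in 100 error rate
```

Decoding a Phred+64 file with offset 33 (or the reverse) gives scores that are off by 31, so it is worth checking which scale a file uses before filtering on quality.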

  18. Quality Control • Quality control should be conducted at multiple steps during sequencing data processing • Raw data • Alignment • Results (Expression for RNA, and SNP/mutation for DNA) Guo, Y., et al., Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform, 2013.

  19. Raw Data QC - Tools • FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/ • QC3 https://github.com/slzhao/QC3 • NGS QC Toolkit http://59.163.192.90:8080/ngsqctoolkit/

  20. Raw Data QC - What to Look For

  21. Alignment QC - Tools • QC3 https://github.com/slzhao/QC3 • QPLOT http://genome.sph.umich.edu/wiki/QPLOT • SAMStat http://samstat.sourceforge.net/

  22. Alignment QC - What to Look For

  23. Expression QC - Tools • MultiRankSeq https://github.com/slzhao/MultiRankSeq

  24. Clustering Algorithms • Start with a collection of n objects, each represented by a p-dimensional feature vector x_i, i = 1, ..., n. • The goal is to divide these n objects into k clusters so that objects within a cluster are more "similar" to each other than to objects in other clusters. k is usually unknown. • Popular methods: hierarchical, k-means, SOM, mixture models, etc.

  25. Distance Calculation in Sequencing • Smith-Waterman algorithm • Sequence 1 = ACACACTA • Sequence 2 = AGCACACA • w(gap) = 0 • w(match) = +2 • w(a, −) = w(−, b) = w(mismatch) = −1
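
The Smith-Waterman scoring on this slide can be sketched as a small dynamic program. This is a minimal illustration, assuming a linear gap penalty of −1 per gap position (the w(a, −) = w(−, b) = −1 entry); the function name and its parameters are chosen here and do not come from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local score, aligned a, aligned b) using a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_1, aligned_2 = smith_waterman("ACACACTA", "AGCACACA")
print(score)      # 12 for these two sequences under the assumed scoring
print(aligned_1)
print(aligned_2)
```

A higher local alignment score means the two sequences are more similar, so a clustering method that needs a distance would have to transform this score first.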

  26. Distance Calculation in Microarray • Pearson correlation between two profiles (vectors) • +1 ≥ Pearson correlation ≥ −1
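
As a rough sketch of the Pearson correlation between two expression profiles: the covariance of the two vectors divided by the product of their standard deviations. The profile values are made up for illustration, and using 1 − r as a distance is one common convention, assumed here rather than taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


# Hypothetical expression profiles of two genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]
r = pearson(gene_a, gene_b)
print(r, 1 - r)  # correlation, and 1 - r as a correlation-based distance
```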

  27. Similarity Measurements • Euclidean Distance

  28. Linkage • Single Linkage: D(X, Y) = min(d(x, y)), x ∈ X, y ∈ Y • Complete Linkage: D(X, Y) = max(d(x, y)), x ∈ X, y ∈ Y • Average Linkage: D(X, Y) = (1 / (|X| |Y|)) Σ d(x, y), summed over all x ∈ X, y ∈ Y
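
The three linkage rules can be sketched on top of the Euclidean distance from the previous slide. Average linkage is taken here as the mean of all pairwise distances between the two clusters, which is the usual definition; the function names and the toy clusters are illustrative assumptions.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linkage_distance(X, Y, method="average", d=euclidean):
    """Distance between clusters X and Y under the chosen linkage rule."""
    pairwise = [d(x, y) for x, y in product(X, Y)]
    if method == "single":
        return min(pairwise)   # closest pair of points
    if method == "complete":
        return max(pairwise)   # farthest pair of points
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)


# Two hypothetical one-dimensional clusters.
X = [(0.0,), (1.0,)]
Y = [(3.0,), (5.0,)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(X, Y, method))
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice of rule changes the shape of the resulting tree.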

  29. Expression QC - What to Look For

  30. Batch Effect

  31. Correction of Batch Effect Guo, Y., et al., Statistical strategies for microRNAseq batch effect reduction. Translational Cancer Research, 2014. 3(3): p. 260-265.

  32. Normalization of RNAseq • Reads Per Kilobase per Million mapped reads (RPKM)
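
RPKM divides a gene's read count by the gene length in kilobases and by the library size in millions of mapped reads, which gives RPKM = 10^9 x C / (N x L) for read count C, total mapped reads N, and gene length L in base pairs. A minimal sketch, with hypothetical numbers:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 1,000 reads on a 2 kb gene in a library of 10 million mapped reads.
print(rpkm(1000, 2000, 10_000_000))  # 50.0
```

The same read count on a gene twice as long, or in a library twice as deep, gives half the RPKM, which is the point of the normalization.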

  33. RNAseq Data Alignment • TopHat2 http://ccb.jhu.edu/software/tophat/index.shtml • MapSplice http://www.netlab.uky.edu/p/bioinfo/MapSplice

  34. Gene Quantification • Cufflinks for RPKM http://cufflinks.cbcb.umd.edu/ • HTSeq for read count http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html

  35. Data

  36. Example of Quantile Normalization • Color key: Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5 • [Tables: Original → (sort S1, sort S2, sort S3) → Sorted]

  37. Take Average for Each Row • [Tables: Sorted → Averaged]

  38. Reorder • Color key: Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5 • [Table: Averaged, restored to the original gene order]
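
The three steps on slides 36 to 38 (sort each sample, average each row of the sorted table, put the averages back in the original gene order) can be sketched as follows. The numeric tables from the slides did not survive in the transcript, so the values below are made up, and ties are simply broken by row order, which is a simplifying assumption.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: sort each sample; orders[j][r] is the row holding the r-th smallest value.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    sorted_cols = [[col[i] for i in order] for col, order in zip(columns, orders)]

    # Step 2: average across samples within each row of the sorted table.
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]

    # Step 3: reorder, so each gene receives the mean that belongs to its rank.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, order in enumerate(orders):
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# Hypothetical expression values: rows are G1..G5, columns are S1..S3.
data = [
    [5, 4, 2],
    [2, 1, 8],
    [3, 6, 1],
    [4, 2, 11],
    [1, 8, 5],
]
for gene, row in zip(("G1", "G2", "G3", "G4", "G5"), quantile_normalize(data)):
    print(gene, row)
```

After normalization every sample contains the same set of values, only assigned to different genes, so the samples share one distribution.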

  39. Differential Expression Analysis • Cuffdiff from the Cufflinks package: Trapnell, C., et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 2012. 7(3): p. 562-78. • DESeq http://bioconductor.org/packages/release/bioc/html/DESeq.html • edgeR http://www.bioconductor.org/packages/release/bioc/html/edgeR.html • NBPSeq http://cran.r-project.org/web/packages/NBPSeq/index.html • TSPM http://omictools.com/sequencing/rna-seq/normalization-de/tspm-r-s2496.html • baySeq http://www.bioconductor.org/packages/release/bioc/html/baySeq.html

  40. Which Method Is the Best? Guo, Y., et al., Evaluation of read count based RNAseq analysis methods. BMC Genomics, 2013. 14 Suppl 8: p. S2.

  41. Consistency

  42. Consistency
