Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA)


Presentation Transcript


  1. Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo

  2. Alignment ATCGGGAATGCCGTTAACGGTTGGCGT Reference genome The human genome is about 3 billion base pairs (3,000,000,000) in length. If a read is 100 bp long, what is the probability of unique alignment? $1/(4 \times 4 \times \cdots \times 4) = 1/4^{100} \approx 1/(1.60694 \times 10^{60})$
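
The arithmetic on this slide can be checked directly. This is a minimal sketch, assuming every base is independent and equally likely (a simplification, since real genomes contain repeats) and ignoring the reverse strand. The variable names are illustrative only.

```python
# Chance that a specific 100 bp read matches one random genome position,
# assuming independent, equally likely bases (a simplifying assumption).
READ_LENGTH = 100
GENOME_SIZE = 3_000_000_000  # approximate human genome length in bp

sequence_space = 4 ** READ_LENGTH           # number of distinct 100-mers
p_match_one_position = 1 / sequence_space   # probability of a chance match

# Expected number of chance matches over all start positions in the genome.
expected_chance_hits = GENOME_SIZE * p_match_one_position

print(f"4^{READ_LENGTH} = {sequence_space:.5e}")
print(f"expected chance hits = {expected_chance_hits:.3e}")
```

Under these assumptions the expected number of chance matches is far below one, which is why a 100 bp read from a non-repetitive region usually has a single best placement.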

  3. Alignment Tools • BWA http://bio-bwa.sourceforge.net/ • Bowtie http://bowtie-bio.sourceforge.net/index.shtml • Doing an accurate alignment of 30 million reads by brute force would take about 30 million x 3 billion time units. • Both tools are based on the Burrows-Wheeler transform.
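
To illustrate the idea behind these tools, here is a toy sketch of the Burrows-Wheeler transform and exact-match counting by backward search. It is not how BWA or Bowtie are implemented: the function names are hypothetical, the transform is built from sorted rotations (only practical for tiny inputs), and real aligners add compressed indexes and mismatch handling.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search on the BWT."""
    # first_row_start[c]: number of characters in the text smaller than c
    first_row_start, total = {}, 0
    for c in sorted(set(bwt_text)):
        first_row_start[c] = total
        total += bwt_text.count(c)

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_text[:i] (linear scan; real tools index this)
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)
    for c in reversed(pattern):
        if c not in first_row_start:
            return 0
        lo = first_row_start[c] + rank(c, lo)
        hi = first_row_start[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Toy reference: the sequence shown on the alignment slide.
reference = "ATCGGGAATGCCGTTAACGGTTGGCGT"
index = bwt(reference)
print(count_occurrences(index, "GG"))
```

The search touches the index once per pattern character, so the cost depends on the read length rather than on the genome length.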

  4. Alignment Results – BAM files • SAM – uncompressed • BAM – compressed • http://samtools.github.io/hts-specs/SAMv1.pdf • Sort and index before performing analysis • Don’t forget to perform QC on alignment
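
As an illustration of what a SAM record contains, the sketch below parses the 11 mandatory tab-separated fields defined in the SAM specification and checks two flag bits. The record itself is made up for the example, and the helper names are hypothetical; in practice a library such as samtools or pysam would be used instead.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flag
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)   # 0x4: segment unmapped


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)  # 0x10: reverse complemented


# Hypothetical record, invented for illustration.
example_line = "\t".join([
    "read1", "16", "chr1", "1000", "60", "27M", "*", "0", "0",
    "ATCGGGAATGCCGTTAACGGTTGGCGT", "I" * 27,
])
record = parse_sam_line(example_line)
print(record.rname, record.pos, is_reverse_strand(record))
```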

  5. How to call SNPs http://www.broadinstitute.org/igv/

  6. Local Realignment

  7. Recalibration Why do we need realignment and recalibration for DNA but not RNA?

  8. SNP calling • GATK https://www.broadinstitute.org/gatk/ • VarScan http://varscan.sourceforge.net/

  9. VCF files
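
For orientation, a VCF data line has eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sketch below parses one such line; the example variant and the function name are hypothetical, and it skips header lines and per-sample genotype columns.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of one VCF data line."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_fields = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        # Flag-style INFO entries (no "=value") are stored as True.
        info_fields[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),                  # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),            # one or more alternate alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_fields,
    }


# Hypothetical variant line, invented for illustration.
variant = parse_vcf_line("chr1\t1000\t.\tA\tG\t50\tPASS\tDP=30;DB")
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
```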

  10. Annotation using ANNOVAR http://www.openbioinformatics.org/annovar/

  11. Somatic Mutation • Different from SNP (not germline) • Both tumor and normal samples are needed to accurately define a somatic mutation • Tumor sample is almost never 100% tumor
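
The last point matters for interpreting allele fractions. As a hedged sketch, assuming a clonal mutation and a normal component that is diploid and carries no mutant copies, the expected variant allele fraction (VAF) follows from tumor purity and copy number. The function and parameter names are illustrative, not taken from any caller.

```python
def expected_vaf(purity: float, mutant_copies: int = 1,
                 tumor_copy_number: int = 2) -> float:
    """Expected variant allele fraction of a clonal somatic mutation.

    Assumes the normal cells are diploid and carry no mutant copies.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    mutant_alleles = purity * mutant_copies
    total_alleles = purity * tumor_copy_number + (1.0 - purity) * 2
    return mutant_alleles / total_alleles


# A heterozygous mutation in a copy-neutral region of a 60% pure tumor.
print(expected_vaf(0.6))
```

With 60% purity the mutation is expected in about 30% of reads rather than 50%, which is one reason a matched normal sample helps separate somatic from germline variants.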

  12. Somatic mutation callers • MuTect http://www.broadinstitute.org/cancer/cga/mutect • VarScan http://varscan.sourceforge.net/

  13. Quality Control on SNPs • Number of novel non-synonymous SNPs ~ 100 – 200 • Transition / transversion ratio • Heterozygous / non reference homozygous ratio • Heterozygous consistency • Strand Bias • Cycle Bias

  14. Ti/Tv ratio
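
A minimal sketch of how the Ti/Tv ratio is computed from a list of single-nucleotide changes. Transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) changes; everything else is a transversion. The input format and function names are assumptions made for the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs) -> float:
    """Transition / transversion ratio for (ref, alt) single-base changes."""
    transitions = transversions = 0
    for ref, alt in snvs:
        if ref == alt:
            continue  # not a variant
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Hypothetical calls: two transitions and two transversions.
calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```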

  15. Heterozygous / non reference homozygous ratio
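
A minimal sketch of this ratio, assuming genotypes are given as VCF-style GT strings for one diploid sample ("0/1", "1/1", and so on). Missing genotypes and homozygous-reference calls are ignored. The function name is hypothetical.

```python
import re


def het_nonref_hom_ratio(genotypes) -> float:
    """Heterozygous / non-reference homozygous ratio from GT strings."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)  # unphased "/" or phased "|"
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    if hom_alt == 0:
        raise ValueError("no non-reference homozygous calls; ratio is undefined")
    return het / hom_alt


# Hypothetical genotype calls for one sample.
sample_gts = ["0/1", "0/1", "1/1", "0/0", "./.", "1|0"]
print(het_nonref_hom_ratio(sample_gts))
```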

  16. Ti/Tv ratio by race and regions

  17. Heterozygous / non reference homozygous ratio by race and regions

  18. Heterozygous Genotype Consistency

  19. Strand Bias
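
The slide does not say which test is used. One common choice is Fisher's exact test on a 2x2 table of read counts (reference vs. alternate allele, forward vs. reverse strand). The sketch below is a plain-Python two-sided version for small counts, with hypothetical read counts; it is an illustration, not the procedure of any particular caller.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd: int, ref_rev: int,
                           alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    row1 = ref_fwd + ref_rev
    row2 = alt_fwd + alt_rev
    col1 = ref_fwd + alt_fwd
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x reference reads on the forward strand
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(ref_fwd)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical site: alternate allele seen only on the forward strand.
print(fisher_exact_two_sided(10, 10, 10, 0))
```

A small p-value suggests the alternate allele is unevenly supported across strands, which is a common sign of a false positive call.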

  20. Cycle Bias

  21. Pooled Analysis • Pool samples together without barcodes • Saves money • Can only be used to evaluate allele frequency
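
Without barcodes, reads cannot be assigned back to individuals, so only a pool-level estimate is available. A minimal sketch, assuming every sample contributes equally to the pool and ignoring sequencing error; the function name and counts are hypothetical.

```python
def pooled_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Estimate the pool's allele frequency at one site from read counts."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


# Hypothetical site: 30 of 1000 reads carry the alternate allele.
print(pooled_allele_frequency(30, 1000))
```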

  22. Pooled Analysis - Conclusion

  23. Advanced Data Mining

  24. The known and unknown of sequencing data

  25. The known and unknown of sequencing data

  26. The known and unknown of sequencing data

  27. Known – Things we have always known that sequencing data can do • SNV, mutation • CNV (Xie et al. BMC Bioinformatics, 2009) • Structural variants (Alkan et al. Nature Reviews Genetics, 2011)

  28. Known Unknown – Other information we found that sequencing data contain • SNVs and mutations in non-targeted regions • Mitochondria • Virus and microbe

  29. How is additional data mining possible? • Data mining is possible because capture techniques are not perfect.

  30. Capture Efficiency of The Three Major Capture Kits

  31. Potential Functions of Intron and Intergenic • ENCODE suggested that over 80% of the human genome may be functional. • The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic)

  32. Coverage of the Unintended Regions • Coverage doesn’t drop off suddenly where the capture region ends. • Capture region example: chr1 1000 1500

  33. Reads Aligned to Non Target Regions Can Be Used to Detect SNPs • Tibetan exome study: Through exome sequencing of 50 Tibetan subjects, 2 intron SNPs were identified as associated with high altitude. (Yi, et al. Science, 2010) • Non-capture region study: Reads from non-capture regions were studied to show that they can be used to infer reliable SNPs. (Guo, et al. BMC Genomics)

  34. Known unknown - Mitochondria • However, the mitochondrial genome is only 16,569 bp • Assumptions: 40 million reads, 100 bp read length
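
Using the slide's assumptions, a back-of-the-envelope calculation shows how few mitochondrial reads genome size alone would predict. This sketch additionally assumes that reads are spread uniformly over a 3 billion bp genome, which is an assumption of the example and not a claim from the slide.

```python
GENOME_SIZE = 3_000_000_000  # approximate human genome length in bp
MITO_SIZE = 16_569           # mitochondrial genome length in bp
N_READS = 40_000_000         # slide assumption
READ_LENGTH = 100            # slide assumption, in bp

# Reads expected on the mitochondrial genome if placement were uniform.
expected_mito_reads = N_READS * MITO_SIZE / GENOME_SIZE

# Corresponding average depth over the mitochondrial genome.
expected_mito_depth = expected_mito_reads * READ_LENGTH / MITO_SIZE

print(f"expected reads: {expected_mito_reads:.0f}")
print(f"expected depth: {expected_mito_depth:.2f}x")
```

In practice observed mitochondrial read counts are usually much higher than this uniform estimate, because each cell carries many copies of mitochondrial DNA.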

  35. Dealing with nuMTs

  36. Alignment Results

  37. Extract mitochondria from exome sequencing Tools: • Picardi et al. Nature Methods, 2012 • Guo et al. Bioinformatics, 2013 (MitoSeek) Diagnosis: • Dinwiddie et al. Genomics, 2013 • Nemeth et al. Brain, 2013

  38. Virus • Virus sequences can be captured through high throughput sequencing of human samples • HBV in liver cancer samples (Sung, et al. Nature Genetics, 2012) (Jiang, et al. Genome Research, 2012) • HPV in head and neck cancer (Chen, et al. Bioinformatics, 2012)

  39. HPV Alignment Example

  40. Tools for Detecting Virus from Sequencing Data • PathSeq (Kostic, et al. Nature Biotechnology, 2011) • VirusSeq (Chen, et al. Bioinformatics, 2012) • ViralFusionSeq (Li, et al. Bioinformatics, 2012) • VirusFinder (Wang, et al. PLOS ONE, 2013)

  41. The Data Mining Ideas Applied to RNA • RNAseq has been used as a replacement for microarrays. • Other applications of RNAseq include detection of alternative splicing and fusion genes. • Additional data mining opportunities are also available for RNAseq data

  42. SNV and Indel • Difficulty due to high false positive rate • RNAMapper (Miller, et al. Genome Research, 2013) • SNVQ (Duitama, et al. BMC Genomics, 2013) • FX (Hong, et al. Bioinformatics, 2012) • OSA (Hu, et al. Bioinformatics, 2012)

  43. Microsatellite instability Examples: • Yoon, et al. Genome Research 2013 • Zheng, et al. BMC Genomics, 2013

  44. RNA Editing and Allele-specific expression • RNA editing tools and databases: DARNED, REDIdb, dbRES, RADAR • Allele-specific expression: asSeq (Sun, et al. Biometrics, 2012), AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)

  45. Exogenous RNA • Virus (same as DNA) • Food RNA (you are what you eat) (Wang, et al. PLOS ONE, 2012)

  46. Noncoding RNA

  47. Unknown Unknown • Contamination • Unknown treasures • Reference is not perfect

  48. Exome (Samuels, et al. Trends in Genetics, 2013)

  49. RNAseq
