NGS cancer genomics data processing and analysis. Somak Roy, MD Clinical fellow Division of Urologic Surgical Pathology University of Pittsburgh Medical Center. Outline. Introduction to NGS technology Buzz words Bioinformatics analysis Laboratory workflow and information management QA.
Outline • Introduction to NGS technology • Buzz words • Bioinformatics analysis • Laboratory workflow and information management • QA
Background • Next generation sequencing (NGS) technology is rapidly evolving. • Massively parallel processing. • A dramatic decrease in the cost of sequencing has led to widespread use. http://www.genome.gov/sequencingcosts/
Applications of NGS in Cancer Genomics • Gene fusion detection • Mutation profiling • Structural variants • Copy number variations • Epigenetic profiling
Theme of DNA Sequencing • Sequence the sample DNA to obtain a string of characters (A, T, G, C). • Compare the obtained sequence to the reference sequence (the expected normal). • Any deviation from the reference, whether a single base or multiple bases, is a variant.
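The comparison step above can be sketched in a few lines. This is a minimal illustration only: it assumes the sample and the reference are already aligned and of equal length, so it reports substitutions but not insertions or deletions. The function name and the two short sequences are made up for the example.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        # Assumption of this sketch: inputs are pre-aligned and equal in length.
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Hypothetical 30-base reference and sample differing at a single position.
reference = "ATTGCGCTATTATAGCTCTAGAGAAAAGCG"
sample = "ATTGCGCTATTATAGCTCTAGGGAAAAGCG"

for pos, ref_base, alt_base in find_substitutions(reference, sample):
    print(f"{pos}{ref_base}>{alt_base}")
```

Real pipelines do not compare strings position by position like this; reads first have to be mapped to the reference, which is what the alignment slides cover.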
Evolution of Sequencing • Sanger • Shotgun approach • Next generation sequencing
Semiconductor Sequencing • Robison. Nat Biotechnol 2011;29:805-7 • Rothberg et al. Nature. 2011;475:348-52
Optics-based Sequencing • Arch Pathol Lab Med. 2012;136:000–000; doi: 10.5858/arpa.2012-0107-RA
NGS data processing elements • Signal processing • Alignment / mapping • Assembly / de novo • Variant calling • Annotation / Visualization • Reporting, storage and sharing of results
Signal Processing – Non-optical • CCGCTAGCTATATTATATCGGGGCCCTAGATAGCTAGATATAGAGGGCTCTAGAGATCGATAGCTAGAG • CTAGCTCGCCGGGGCCCTAGAGTATATTATAGGCTCTAGAGATCGATAGCTGATAGCTAGATATAAGAG • ATATAAGCGCGGCTCGATCGGTCTAGAGAGGCCCTAGAGTATATTACTAGCTTAAGCTGATAGCTAGAG
Signal Processing - Optical • CCGCTAGCTATATTATATCGGGGCCCTAGATAGCTAGATATAGAGGGCTCTAGAGATCGATAGCTAGAG • CTAGCTCGCCGGGGCCCTAGAGTATATTATAGGCTCTAGAGATCGATAGCTGATAGCTAGATATAAGAG • ATATAAGCGCGGCTCGATCGGTCTAGAGAGGCCCTAGAGTATATTACTAGCTTAAGCTGATAGCTAGAG
Signal Processing - Homopolymer Semiconductor sequencing and Pyrosequencing technology
Take a Peek into FASTQ! • Header: sequence ID, additional info • Sequence • Optional header • Quality score: Phred score / Phred-like score, one score per base call • Q = -10*log10(p), where p is the probability that the base call is wrong
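The four-line record layout listed above can be read with a short parser. This is a sketch under simplifying assumptions: each record occupies exactly four lines (no wrapped sequences), and the read name and bases in the example are invented. The names `FastqRecord` and `read_fastq` are illustrative, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # sequence ID plus any additional info, without the leading '@'
    sequence: str  # base calls
    quality: str   # one ASCII-encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' line (optional repeated header), ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Invented single-record example.
example = io.StringIO("@read1 sample=demo\nGATTACA\n+\nIIIIHH5\n")
records = list(read_fastq(example))
print(records[0])
```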
Take a Peek into FASTQ! • Q = -10*log10(p) • 30 = -10*log10(10^-3) • 20 = -10*log10(10^-2) • What are these characters? Quality scores in ASCII format • 67 = -10*log10(p = ?)
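The arithmetic on this slide can be checked directly. The sketch below converts between Phred scores and error probabilities and decodes quality characters. The ASCII offset of 33 (Sanger / Phred+33 encoding) is an assumption, since the slide does not say which encoding is in use, and the function names are illustrative.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability p for a Phred score Q, from Q = -10*log10(p)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score Q for an error probability p."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Decode ASCII quality characters to Phred scores (offset 33 is assumed)."""
    return [ord(char) - offset for char in quality]


print(error_to_phred(1e-3))    # Q for p = 10^-3
print(phred_to_error(20))      # p for Q = 20
print(decode_quality("I5C"))   # per-base scores under the assumed Phred+33 encoding
print(phred_to_error(67))      # p if 67 is read as a Phred score
```

Under the Phred+33 assumption, ASCII code 67 is the character `C`, which decodes to Q = 34; read as a Phred score instead, 67 corresponds to p = 10^-6.7.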
Mapping, Assembly & Variant Identification • Read: ATTGCGCTATTATAGCTCTAGAGAAAAGCGCTAGCGGGCCCGCGATAGCTAGCG • Pile-up: • ATTGCGCTATTATAGCTCTAGGGAAAAGCGCTAGCGGGCCCGCGATAGCTAGCG • ATTATAGCTCTAGAGAAAAGCGCTAGCGGGCCCGCGATAGCTAGCGCTT • GGCCAATCGATTGCGCTATTATAGCTCTAGAGAAAAGCGCTAGCGGGCCCGCGATAGCTAGCG • ATTGCGCTATTATAGCTCTAGGGAAAAGCGCTAGCGGGCCCGC • CTAGGGAAAAGCGCTAGCGGGCCCGCGATAGCTAGCGCTTA • Depth of coverage: 5x • Variant frequency: Var (G) seen in 3x of 5x reads = 3/5 (60%)
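The depth and variant-frequency figures on this slide follow from counting bases in one pile-up column. A minimal sketch, assuming the reads are already aligned and the bases covering the variant position have been extracted (here G, A, A, G, G, matching the 3-of-5 example); the function name is illustrative.

```python
from collections import Counter


def variant_frequency(column_bases: list[str], reference_base: str) -> tuple[int, dict[str, float]]:
    """Return (depth, {non-reference base: fraction of reads}) for one pile-up column."""
    counts = Counter(column_bases)
    depth = sum(counts.values())
    if depth == 0:
        return 0, {}
    frequencies = {
        base: count / depth
        for base, count in counts.items()
        if base != reference_base
    }
    return depth, frequencies


# Bases observed at the variant position in the five piled-up reads.
column = ["G", "A", "A", "G", "G"]
depth, frequencies = variant_frequency(column, reference_base="A")
print(depth, frequencies)
```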
Mapping / Alignment • Mapping algorithms • Dynamic programming algorithms • Needleman-Wunsch • Smith-Waterman • Heuristic algorithms • BLAST • Newer algorithms for NGS data • Modified hash-table method • Modified seed-and-extend method • Burrows-Wheeler transformation • Next-Generation DNA Sequencing Informatics. Ed. Brown SM. Cold Spring Harbor Laboratory Press. 2013
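Of the methods listed, the Burrows-Wheeler transformation is the easiest to show compactly. The sketch below builds the transform by sorting all rotations of the text, which is fine for illustration but far too slow and memory-hungry for a genome; production aligners use suffix-array-based construction plus an index for searching. The `$` terminator and both function names are conventions chosen for this example.

```python
def burrows_wheeler_transform(text: str, terminator: str = "$") -> str:
    """Naive BWT: sort all rotations of text + terminator, take the last column."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(bwt: str, terminator: str = "$") -> str:
    """Naive inverse: repeatedly prepend the BWT column and re-sort."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = burrows_wheeler_transform("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes compressed, searchable indexes of a reference genome practical.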
Mapping / Alignment • Mapping algorithms • Ungapped • Gapped: better for indel detection • Mapping applications • BWA • Bowtie • SOAP2 • ELAND • MAQ • TMAP • Pabinger et al. Briefings in Bioinformatics. Jan 2013
Mapping / Alignment - QC • P value assignment for each aligned read based on MAPPING QUALITY SCORE • Base quality scores • Position of mismatch • Issues with mapping short reads • Gaps due to true indels • Heterogeneity in coverage across the genome (Poisson distribution) • Next-Generation DNA Sequencing Informatics. Ed. Brown SM. Cold Spring Harbor Laboratory Press. 2013
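Two of the quantities on this slide are easy to compute. The sketch below assumes the usual Phred-style scaling for mapping quality (MAPQ = -10*log10 of the probability that the mapping position is wrong) and treats per-base coverage as an idealised Poisson variable. Real coverage is usually more variable than a Poisson model predicts, so the second function gives an optimistic estimate. All function names are illustrative.

```python
import math


def mapq_to_error_probability(mapq: float) -> float:
    """Probability that a read is mapped to the wrong position, assuming Phred scaling."""
    return 10 ** (-mapq / 10)


def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def prob_coverage_below(threshold: int, mean_depth: float) -> float:
    """P(depth < threshold) at one position under the idealised Poisson model."""
    return sum(poisson_pmf(k, mean_depth) for k in range(threshold))


print(mapq_to_error_probability(20))
# Fraction of positions expected to fall below 10x when the mean depth is 30x.
print(prob_coverage_below(10, 30.0))
```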
Variant identification • Pertains to detection of SNVs, indels, structural variants and CNVs • Different applications exist • Standalone applications • Input: aligned reads (BAM / SAM) • Integrated with the alignment process
Variant identification • Variant callers • GATK • VCFtools • SAMtools • Dindel • ATLAS-2 • CONTRA • ExomeCNV • BreakDancer • CLEVER • BreakPointer • ……. • Pabinger et al. Briefings in Bioinformatics. Jan 2013
Variant identification - QC • Variety of filters • Depth of coverage • Base quality score • Mapping quality score • Presence of gaps and homopolymer runs • Forward/reverse (F/R) strand bias • Next-Generation DNA Sequencing Informatics. Ed. Brown SM. Cold Spring Harbor Laboratory Press. 2013
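The filters above are typically applied as simple threshold checks on each candidate call. The sketch below is one way that could look. The record fields, the filter names and every threshold value are hypothetical placeholders for illustration; appropriate cut-offs depend on the assay and have to be set during validation.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    # Hypothetical summary of one candidate call; field names are illustrative.
    depth: int
    mean_base_quality: float
    mean_mapping_quality: float
    alt_forward_reads: int
    alt_reverse_reads: int
    in_homopolymer_run: bool


def failed_filters(
    variant: CandidateVariant,
    min_depth: int = 100,                # placeholder threshold
    min_base_quality: float = 20.0,      # placeholder threshold
    min_mapping_quality: float = 30.0,   # placeholder threshold
    min_strand_fraction: float = 0.1,    # placeholder threshold
) -> list[str]:
    """Return the names of the QC filters this candidate fails (empty list = pass)."""
    failures = []
    if variant.depth < min_depth:
        failures.append("low_depth")
    if variant.mean_base_quality < min_base_quality:
        failures.append("low_base_quality")
    if variant.mean_mapping_quality < min_mapping_quality:
        failures.append("low_mapping_quality")
    alt_total = variant.alt_forward_reads + variant.alt_reverse_reads
    minor_strand = min(variant.alt_forward_reads, variant.alt_reverse_reads)
    if alt_total == 0 or minor_strand / alt_total < min_strand_fraction:
        failures.append("strand_bias")
    if variant.in_homopolymer_run:
        failures.append("homopolymer_run")
    return failures


good = CandidateVariant(500, 32.0, 58.0, 60, 55, False)
suspect = CandidateVariant(40, 32.0, 58.0, 12, 0, True)
print(failed_filters(good))
print(failed_filters(suspect))
```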
Annotation • Crucial step in the analysis pipeline • 1522648G>A ? • 8124526T>A • 44512584G>C • 55124785GA>CC • 2544856_2544860 AATGC .. • Public / custom databases • Nomenclature • Biological implication (Gene, transcript and protein level) • Genotype-phenotype correlations • Prognostic implication • Predictive implication
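Before a raw call such as `1522648G>A` can be looked up in a database, it has to be split into its parts. The sketch below parses the simple position-plus-substitution strings shown on this slide. The pattern and the `parse_substitution` name are assumptions made for the example; this is not an implementation of any formal nomenclature, and entries in other shapes (such as the range example on the slide) are returned as `None`.

```python
import re
from typing import Optional

# Matches strings like "1522648G>A" or "55124785GA>CC": position, reference bases, observed bases.
SUBSTITUTION_PATTERN = re.compile(r"^(\d+)([ACGT]+)>([ACGT]+)$")


def parse_substitution(raw: str) -> Optional[tuple[int, str, str]]:
    """Return (position, reference bases, observed bases), or None if the shape differs."""
    match = SUBSTITUTION_PATTERN.match(raw.strip())
    if match is None:
        return None
    position, ref, alt = match.groups()
    return int(position), ref, alt


for raw in ["1522648G>A", "8124526T>A", "55124785GA>CC", "2544856_2544860 AATGC"]:
    print(raw, "->", parse_substitution(raw))
```

The parsed tuple is only the starting point: annotation then attaches the gene, transcript and protein-level consequence from public or custom databases.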
Visualization Easily interpret the data
User interface << Level 1 << Level 2 << Level 3 << Level 4 << Level 5
Result Reporting, Management and Sharing • An area of active development in the clinical laboratory • No consensus yet on format and data points • CAP / CDC / ACMG / AMP: recommendations for reporting • Major issues in clinical implementation of NGS • Variant management • LIS / NGS system interoperability • Transmission of results to the EMR • Knowledgebase development • Generation of a data warehouse • Proficiency testing
Whole genome sequencing Whole exome sequencing Targeted sequencing
Future directions • Huge scope for NGS in the research and clinical domains • Better technology: quantum bioinformatics? • Better information management systems • Large and highly curated public-domain knowledgebase • Better and affordable healthcare
Thank You Questions