Introduction to next-generation sequencing technologies and bioinformatics Nicholas D. Socci Bioinformatics Core Memorial Sloan Kettering Cancer Center
Part i Disclosure
Disclosure (being honest) [Venn diagram: Bioinformatics/Computational Biology overlapping Next Generation Sequencing; the subject area covered in this talk is the intersection of the two circles.]
Useful fact #1: Understand the data • Because bioinformatics draws on many disciplines: • Biology • Mathematics/Statistics • Computer Science • It can be hard to understand all parts of a project/problem, but you have to, either individually or in collaboration.
BIOinformatics • For the computer scientists, mathematicians, statisticians, physicists: • Need to learn some biology, both: • General: DNA vs RNA • Specific to the problem you are working on • No time for me to talk about it here, though
Part ii Sequencing Technologies (lite)
Part of understanding the data • Is understanding how it was measured
DNA sequencing "history" (efficiency in bp/person-year) • 1870: Miescher discovers DNA (1) • 1940: Avery proposes DNA as the "genetic material" (15) • 1953: Watson & Crick, double-helix structure (150) • 1977-1980: Sanger/dideoxy termination; Maxam & Gilbert/chemical degradation (1,500 to 15,000) • 1986: PCR sequencing concept introduced; partial automation (25,000; 50 kb to 100 kb) • 1995: Human Genome Project; full automation (120 MB/person-week) • 2005: Next generation sequencing (60 GB/week) • 2009: Next-next generation sequencing
Sanger sequencing • 1. Labeling: DNA fragments are labeled using fluorescently labeled ddNTPs • 2. Capillary electrophoresis • 3. Reading
Sanger sequencing • Two main problems: • Not very high throughput • Expensive • Example: the Human Genome Project • 13 years • $2.7 billion • Impossible to use if one thinks about sequencing patients' genomes, for example • Need for new sequencing technologies to reach the $1000 genome goal. • NEXT GENERATION SEQUENCING INSTRUMENTS
Major NextGen Technologies • Sequencing by ligation • SOLiD (ABI/LifeTech); di-base probes read each base twice • In some tests the most accurate • Short reads (75x35), medium throughput • Sequencing by synthesis • 454 Roche: pyrosequencing • Homopolymer issues, very expensive • Long reads, ~400 bp • Ion Torrent/Proton (LifeTech): uses pH changes instead of fluorophores • Homopolymer issues; expense could go down • Medium reads, ~150 bp • Throughput could scale rapidly • Illumina HiSeq/MiSeq • Seems to be the best trade-off of accuracy, throughput, cost, and read length
Different platforms but same concept • DNA → fragmented DNA → ligation of adaptors → "sequencing library" • Clonal amplification of the different fragments: cluster PCR (bridge amplification) for Illumina; emulsion PCR for 454, SOLiD, PGM • Sequencing flavors: sequencing by synthesis, pyrosequencing, sequencing by ligation, sequencing by measuring pH changes
Cluster Generation by Bridge Amplification In contrast to the 454 and ABI methods, which use a bead-based emulsion PCR to generate "polonies", Illumina utilizes a unique "bridged" amplification reaction that occurs on the surface of the flow cell. The flow cell surface is coated with single-stranded oligonucleotides that correspond to the sequences of the adapters ligated during the sample preparation stage. Single-stranded, adapter-ligated fragments are bound to the surface of the flow cell and exposed to reagents for polymerase-based extension. Priming occurs as the free/distal end of a ligated fragment "bridges" to a complementary oligo on the surface. Repeated denaturation and extension results in localized amplification of single molecules in millions of unique locations across the flow cell surface. This process occurs in what is referred to as Illumina's "cluster station", an automated flow cell processor. http://seqanswers.com/forums/showthread.php?t=21
Sequencing by Synthesis A flow cell containing millions of unique clusters is loaded into the 1G sequencer for automated cycles of extension and imaging. The first cycle of sequencing consists of the incorporation of a single fluorescent nucleotide, followed by high-resolution imaging of the entire flow cell. These images represent the data collected for the first base. Any signal above background identifies the physical location of a cluster (or polony), and the fluorescent emission identifies which of the four bases was incorporated at that position. This cycle is repeated, one base at a time, generating a series of images, each representing a single-base extension at a specific cluster. Base calls are derived with an algorithm that identifies the emission color over time. At this time, reports of useful Illumina reads range from 26-50 bases.
Pac-Bio (next-next gen) Long reads!
Part iii Algorithms/Pipelines
Data sizes per analysis phase • Raw data (images): terabytes/run; primary analysis runs on a dedicated cluster on the instrument • Primary analysis output: BCLs (~250 GB, temporary), reads (+ quals) • Secondary analysis output: map files, BAMs (1-100 GB) • Tertiary analysis output: SNP tables, expression values, profiles • Moving from raw data toward tertiary analysis shifts the work from computer science toward biology
Next Generation Resequencing Steps • Secondary analysis: • Mapping: reads to a known genome or reference database • DNA mappers: BWA, SHRiMP, Bowtie • RNA mappers (spliced): TopHat, rnaStar • BAM processing: MarkDups, indel realignment, BaseQ recalibration • QC reports • New BAM compression • Tertiary analysis: • SNP/variant runs: Calling: UnifiedGenotyper, HaplotypeCaller. Annotation: coding/non-coding, synonymous/non-synonymous, functional (HUGE). Structural: copy number, rearrangements • RNA-seq: expression matrix (genes, transcripts, exons), splicing, ?fusions? • ChIP-seq/Methylation: MACS, custom
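The calling step listed above can be illustrated with a minimal sketch of a GATK 2.x-era UnifiedGenotyper invocation driven from Python; the file names are placeholders and exact flags vary by GATK version, so treat this as an assumed command shape rather than the pipeline actually used at MSKCC.

```python
import subprocess

# Hedged sketch: call variants on a processed BAM with GATK's UnifiedGenotyper
# (GATK 2.x-era walker). "ref.fa", "recal.bam" and "raw_variants.vcf" are placeholders.
subprocess.run([
    "java", "-jar", "GenomeAnalysisTK.jar",
    "-T", "UnifiedGenotyper",
    "-R", "ref.fa",           # reference the BAM was mapped against
    "-I", "recal.bam",        # realigned + recalibrated BAM
    "-o", "raw_variants.vcf"  # raw calls, before any filtering/annotation
], check=True)
```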
RNAseq programs/workflows Alamancos, et al. http://arxiv.org/abs/1304.5952
Two very important points • Impossible to get a comprehensive list of algorithms/programs • Too many • Constantly changing • Updating • New ones added/old ones go away • Huge job staying current; do your own research • However, you should try to stick to standard data formats: • FASTQ • SAM/BAM • VCF • Resist the temptation to invent your own
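To make the FASTQ point concrete, here is a minimal sketch of iterating over the standard 4-line FASTQ records in plain Python; the file name is a placeholder and the snippet assumes an uncompressed file (in practice a library such as pysam or Biopython would usually be used).

```python
# Minimal sketch: iterate over records in an uncompressed FASTQ file.
# Standard 4-line layout: @name / sequence / '+' separator / quality string.
def read_fastq(path):
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break                      # end of file
            seq = fh.readline().rstrip()   # the bases
            fh.readline()                  # '+' separator line (ignored)
            qual = fh.readline().rstrip()  # one quality character per base
            yield header[1:], seq, qual

# "sample.fastq" is a placeholder file name
for name, seq, qual in read_fastq("sample.fastq"):
    print(name, len(seq))
```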
Missing huge area • de novo, looking for new or novel sequences: what most people think sequencing is • Sequencing the human genome • Lots of work on sequencing new organisms • Not my specialty (1 yeast project; will last a lifetime) • Some "de novo" stuff in resequencing: • Splicing, fusions, structural rearrangements • But most resequencing is either • Small changes (variations) relative to the reference • Counting: RNAseq, ChIPseq, X-seq
mini de novo: looking for novel exons • Not new sequence but new “structure” • Androgen Receptor in Prostate Cancer Cell-line
Focus on one pipeline • Detection of variants • SNVs or small insertions/deletions • My primary work is in somatic variants (differential) • Tumor vs normal • Metastasis vs primary • However, most of the pipeline is the same for germline events
List of useful websites: especially for variant analysis • http://samtools.sourceforge.net/ • http://picard.sourceforge.net/ • http://www.broadinstitute.org/gatk/ • http://seqanswers.com/ • http://www.biostars.org/ • For GERMLINE studies • http://www.1000genomes.org/ • http://hapmap.ncbi.nlm.nih.gov/ • For Cancer • http://cancergenome.nih.gov/ Not even close to a comprehensive list; just some jumping-off points and must-see places
Side Note • Bioinformatics is very much a science of the internet • Much of the "knowledge" is not in "paper" (published) form • Need to read blogs as much as journals • Google: • perhaps one of the most important tools for bioinformatics research • Both a blessing and a curse
State of the Art for Variant Detection • GATK pipeline from the Broad • At MSKCC we have pipelines that use both: • the GATK 1.6 branch • the new GATK 2.x branch
non-GATK Box • Actually pretty complex
non-GATK Box • 3 (4) key steps • Adapter clipping (FASTX or cutadapt) • Mapping to genome (BWA: genome issues) • SAM/BAM massaging • Add ReadGroups (PICARD) • Sort by coordinates at this step • MarkDuplicates (PICARD) • Filter on MAPQ
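As a concrete illustration of these steps, here is a minimal sketch that drives the usual command-line tools (cutadapt, BWA-MEM, Picard, samtools) from Python; all file names, the adapter sequence, read-group fields, and the MAPQ cutoff are placeholders, and exact flags differ between tool versions, so check each tool's documentation before reusing this.

```python
import subprocess

def run(cmd):
    # Run one pipeline step, echoing the command and failing loudly on error.
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Adapter clipping (cutadapt); the adapter sequence is a placeholder
run(["cutadapt", "-a", "AGATCGGAAGAGC", "-o", "clipped.fastq", "raw.fastq"])

# 2. Map to the genome with BWA-MEM (reference must already be indexed with 'bwa index')
with open("aln.sam", "w") as sam:
    subprocess.run(["bwa", "mem", "ref.fa", "clipped.fastq"], stdout=sam, check=True)

# 3. SAM/BAM massaging: add read groups and sort by coordinate, then mark duplicates (Picard)
run(["java", "-jar", "picard.jar", "AddOrReplaceReadGroups",
     "I=aln.sam", "O=rg.bam", "SO=coordinate",
     "RGID=run1", "RGLB=lib1", "RGPL=ILLUMINA", "RGPU=unit1", "RGSM=sample1"])
run(["java", "-jar", "picard.jar", "MarkDuplicates",
     "I=rg.bam", "O=dedup.bam", "M=dup_metrics.txt"])

# 4. Filter on MAPQ with samtools; the cutoff of 20 is an arbitrary example
run(["samtools", "view", "-b", "-q", "20", "-o", "filtered.bam", "dedup.bam"])
```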
Which mapper • There are many; we have settled on BWA because the pros seem to use it most widely • GATK/1000 Genomes • TCGA • Many mappers do not compute MAPQ, which our algorithms need
Genome • Some controversy/discussion over what to use when mapping • For SNP detection the key is not to have misplaced reads due to homologous regions missing from the main build • Map to all chromosomes, including: • random (unplaced) • unassigned • the decoy genome • http://www.cureffi.org/2013/02/01/the-decoy-genome/ • everything but the haplotype chromosomes
Intermediate Output BAM • BAM is a standard format for representing alignments. • A very useful tool for visualizing BAMs is IGV • http://www.broadinstitute.org/igv/
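For programmatic access (as opposed to visual inspection in IGV), here is a minimal sketch using the pysam library; pysam itself, the file name, and the region are assumptions for illustration, and fetch() requires a coordinate-sorted, indexed BAM.

```python
import pysam  # assumed third-party dependency, not part of the original talk

# Peek at alignments in a small region of a coordinate-sorted, indexed BAM.
bam = pysam.AlignmentFile("dedup.bam", "rb")
for read in bam.fetch("chr1", 1000000, 1001000):
    if read.is_unmapped or read.is_duplicate:
        continue  # skip unmapped reads and marked duplicates
    # name, leftmost position, MAPQ, and CIGAR come straight from the BAM record
    print(read.query_name, read.reference_start, read.mapping_quality, read.cigarstring)
bam.close()
```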
GATK Stuff • Currently we stick as closely as possible to the Best Practices guidelines from GATK: • http://www.broadinstitute.org/gatk/guide/topic?name=best-practices • Currently at version 4 • STRONGLY encourage people to go through the slides from the GATK BP series. • Much of what I show here is an excerpt from those. • Send someone to one of the Broad courses.
GATK Stuff • Lots of stuff; I will cover 3 critical steps here • In/Del Realignment • BaseQ Recalibration • Variant Calling
Realignment • InDels in reads (especially near the ends) will often be misaligned by most mappers as reads with mismatches. • These false mismatches can degrade base quality recalibration and lead to false-positive SNP calls
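A minimal sketch of the two-step indel realignment from the GATK 2.x-era Best Practices (RealignerTargetCreator followed by IndelRealigner), again driven from Python; the file names are placeholders, flags vary by GATK version, and these walkers were retired in later GATK releases.

```python
import subprocess

# GATK 2.x-era indel realignment sketch; "ref.fa" and the BAM names are placeholders.
gatk = ["java", "-jar", "GenomeAnalysisTK.jar"]

# Step 1: find intervals that look like they need realignment
subprocess.run(gatk + ["-T", "RealignerTargetCreator",
                       "-R", "ref.fa", "-I", "dedup.bam",
                       "-o", "realign.intervals"], check=True)

# Step 2: locally realign reads over those intervals
subprocess.run(gatk + ["-T", "IndelRealigner",
                       "-R", "ref.fa", "-I", "dedup.bam",
                       "-targetIntervals", "realign.intervals",
                       "-o", "realigned.bam"], check=True)
```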
Base Quality Recalibration • The Q-score is a measure provided by the machine of how accurately "it" thinks it read the base: • Q = -10 * log10(P_error) • Referred to as the Phred score • Lots and lots of problems with the vendors' estimates of this value • Different vendors are not always on the same scale • Even the same vendor is not always on the same scale • The Illumina FASTQ debacle
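To make the scale issue concrete, here is a minimal sketch of the Phred relationship and of decoding an ASCII quality character under the two common offsets; the character offsets (33 vs 64) are the heart of the "Illumina FASTQ debacle", and the function names are just illustrative.

```python
import math

def phred_from_error(p_error):
    # Q = -10 * log10(P_error)
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    # Inverse: P_error = 10^(-Q/10)
    return 10.0 ** (-q / 10.0)

def decode_qual_char(ch, offset=33):
    # offset=33 for Sanger/current Illumina FASTQ; older Illumina files used 64,
    # which is why qualities from different eras are not on the same scale.
    return ord(ch) - offset

print(phred_from_error(0.001))   # 30.0
print(error_from_phred(20))      # 0.01
print(decode_qual_char("I"))     # 40 under Phred+33, but only 9 under Phred+64
```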