Bioinformatics: Variant Identification, Focusing on SVs

Bioinformatics:Variant Identification, Focusing on SVs Mark Gerstein, Yale Universitygersteinlab.org/courses/452 (last edit in spring ’17)

Main Steps in Genome Resequencing [Snyder et al. Genes & Dev. ('10)]

Bayes’ Theorem to detect genomic variant • In the above equation: • refers to the observed data • is the genotype whose probability is being calculated • refers to the th possible genotype, out of n possibilities

Calculating the conditional distribution Assuming an error free model, for each heterozygous SNP site of the diploid genome, covered by K reads, the number of reads representing one of the two alleles follows binomial distribution. With errors, the calculation is more complicated. In general:

Main Steps in Genome Resequencing [Snyder et al. Genes & Dev. ('10)]

1. Paired ends Methods toFind SVs Deletion Reference Mapping Genome Reference Sequenced paired-ends 2. Split read 3. Read depth (or aCGH) Deletion Deletion Reference Reference Genome Genome Read Reads Mapping Mapping Read count Reference Zero level 4. Local Reassembly [Snyder et al. Genes & Dev. ('10)]

Read Depth

Patient 98-135 Patient 99-199 Patient 97-237 LCR A B C D Array Signal Read depth Individual genome Reads Mapping Reference genome [Urban et al. ('06) PNAS; Wang et al. Gen. Res. ('09); Abyzov et al. Gen. Res. (’11)] Counting mapped reads Read depth signal Zero level

Reads to Signal Track Reads (fasta) + quality scores (fastq) + mapping (BAM) Reads => Signal (Intermediate file) Accumulating @ >1 Pbp/yr (currently), ~20% of tot. HiSeq output [PLOS CB 4:e1000158]

Example of Application to RD data [Abyzov et al. Gen. Res. (’11)] NA12878, Solexa 36 bp paired reads, ~30x coverage

0.5 0 Fluorescence log2 ratio -0.5 ACGTGACAC AT AAGCACACCA A TTGCTTGAGGGACCT T AGGCACAGT T AAC A TG AT AAGCACACCA A TTGCTTGAGGTGAC DNA NO T T O SCALE sequence HMM • To get highest resolution on breakpoints need to smooth & segment the signal • BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models Korbel*, Urban* et al., PNAS (2007)

alues v y a r r A alues S equen c e v y a Transition B Transition A r r A S equen c e Duplication Normal Deletion Transition A’ Transition B’ Statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM) Korbel*, Urban* et al., PNAS (2007)

CNVnator Mean-shift-based (MSB) segmentation: no explicit model • For each bin attraction (mean-shift) vector points in the direction of bins with most similar RD signal • No prior assumptions about number, sizes, haplotype, frequency and density of CNV regions • Not Model-based (e.g. like HMM) with global optimization, distr. assumption & parms. (e.g. num. of segments). • Achieves discontinuity-preserving smoothing • Derived from image-processing applications [Abyzov et al. Gen. Res. (’11)]

Observed depth of coverage counts as samples from PDF Kernel-based approach to estimate local gradient of PDF Iteratively follow grad to determine local modes Region of interest Intuitive Description of MSB Center of mass [ Adapted from S Ullman et al. "Advanced Topics in Computer Vision," www.wisdom.weizmann.ac.il/~vision/courses/2004_2 ] Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls

Split Read

Read-depth works well on a variety of sequencing platforms but provides imprecise breakpoints [Abyzov et al. Gen. Res. (’11)] [NA18505]

Split-read Analysis Breakpoint Breakpoint Reference Deletion Read Target Genome Breakpoint Reference Read Target Genome Insertion

Deletions are the Easiest to Identify [Zhang et al. ('11) BMC Genomics]

Creative application of dynamic programming to a new problem • Problem: Map insertions and deletions to a reference genome: • Solution: SW alignment from both ends; combine max scoring alignments AGEAlignment with Gap Excision [Abyzov et al. Bioinfo. (’11)] • much more detail in SV section later

Difficulties in Defining Exact Breakpoints [Abyzov & Gerstein (’11) Bioinfo.]

Paired-End

Paired-End Mapping Breakpoint Breakpoint Breakpoint Breakpoint Breakpoint Reference Inversion Deletion Target Insertion Inversion Paired-End Sequencing and Mapping Reference Span >> expected Altered end orientation Span << expected • Both paired-ends map within repeats. • Limited the distance between pairs; therefore, neither large nor very small rearrangements can be detected

High-Resolution Paired-End Mapping (HR-PEM) Bio Bio Bio Bio • Shear to 3 kb • Adaptor ligation • Circularize Bio Bio Genomic DNA Fragments Select Random Cleavage 200-300bp 454 Massively Parallel Sequencing (250bp/reads, 400k reads/run) Map paired ends to human reference genome Korbel et al., 2007 Science 3kb

Overall Strategy for Analysis of NextGen Seq. Data to Detect Structural Variants [Korbel et al., Science ('07); Korbel et al., GenomeBiol. (‘09)]

Pseudogenes & Genomic Duplications

Pseudogenes are among the most interesting intergenic elements • Formal Properties of Pseudogenes (G) • Inheritable • Homologous to a functioning element – ergo a repeat! • Non-functional • No selection pressure so free to accumulate mutations • Frameshifts & stops • Small Indels • Inserted repeats (LINE/Alu) • What does this mean? no transcription, no translation?… [Mighell et al. FEBS Letts, 2000]

Identifiable Features of a Pseudogene (yRPL21) [Gerstein & Zheng. Sci Am 295: 48 (2006).]

Two Major Genomic Remodeling Processes Give Rise to Distinct Types of Pseudogenes [Gerstein & Zheng. Sci Am 295: 48 (2006).]

Impact of Genetic Variability: Loss-of-function Gene Polymorphic Pseudogene • Previous LoFs are considered as having high probability of being deleterious • Surprisingly, ~ 100 LoF variants per genome, 20 genes are completely inactivated • Among ~100 LoFs, we estimate 2 recessive, close to 0 dominant disease nonsense variants per healthy genome. • - Truncating nonsense SNPs • - Splice-disrupting SNPs • - Frameshift-causing indels • - Disrupting structural variants

Genomic Variation Alu Gene Ancestral State Gene Alu Gene Alu The Genome Remodeling Process

Genomic Variation Gene Dup. Gene Alu Gene Ancestral State Non-allelic homologous recombination (NAHR) Gene Alu Gene Alu The Genome Remodeling Process Segmental Duplication (SD)

Genomic Variation Gene Dup. Gene Dup. Gene Gene Gene Gene Dup. Gene Alu Gene Ancestral State Non-allelic homologous recombination (NAHR) Gene Alu Gene Alu The Genome Remodeling Process Segmental Duplication (SD) Syntenic Ortholog SD duplicate Paralog Dup. Gene family Dup. ψgene

Genomic Variation Gene Dup. Gene Dup. Gene Gene Gene Gene Dup. Gene Alu Gene Ancestral State Non-allelic homologous recombination (NAHR) Gene Alu Gene Alu The Genome Remodeling Process Segmental Duplication (SD) Syntenic Ortholog SD duplicate Paralog Dup. Gene family Pssd. ψgene Dup. ψgene Retro-transpose

GenomicVariation Gene Dup. Gene Dup. Gene Gene Gene Gene Dup. Gene Alu Gene Ancestral State Non-allelic homologous recombination (NAHR) Gene Alu Gene Alu The Genome Remodeling Process Segmental Duplication (SD) Syntenic Ortholog SD Insertion Insertion VNTR Pssd. ψgene duplicate Deletion Paralog Insertion VNTR L1 Dup. Gene Deletion Retro-elements Inversion family VNTR Pssd. ψgene Dup. ψgene Retro-transpose

Genomic Variation Gene Dup. Gene Dup. Gene Gene Gene Gene Dup. Gene Alu Gene Ancestral State Non-allelic homologous recombination (NAHR) Gene Alu Gene Alu The Genome Remodeling Process Segmental Duplication (SD) Syntenic Ortholog CNV (type of SV) Insertion Insertion VNTR Pssd. ψgene duplicate Deletion Insertion VNTR L1 Dup. Gene Deletion Retro-elements Inversion VNTR Pssd. ψgene Dup. ψgene Retro-transpose "Polymorphic" Genes & Pseudogenes

RDV & Mobile Elements

Retroduplication variation (RDV) Retroduplications (pot. retro-pseudogenes or retro-genes) mRNA Gene … Reference … Person 1 … Person 2 Known retroduplication … Person 3 . . . … Reference … Person 1 … Person 2 Novel retroduplication … Person 3 . . . mRNA [Abyzov et al. Gen. Res. ('13) ]

Novel retroduplication Gene Read pairs 1 2 3 4 … Alignment to the reference Reference … … 3 1 Unaligned reads Aligned reads Evidence from alignment Splice-junction library 2 … Evidence from cluster 4 1 [Abyzov et al. Gen. Res. ('13) ] 3 … Evidence from read depth 2 3 4 1 Pipeline to identify novel retro-dups. from 3 evidence sources Zero level

Frequency of novel retroduplications by populations. Abyzov A et al. Genome Res. 2013;23:2042-2052

Can Alkan, Bradley P. Coe & Evan E. Eichler Nature Reviews Genetics 12, 363-376 (May 2011)

1000G summary

1000G SV (Pilot, Phase I & III) • Many different callers compared & used • including SRiC & CNVnator but also VariationHunter, Cortex, NovelSeq, PEMer, BreakDancer, Mosaik, Pindel, GenomeSTRiP, mrFast…. • Merging • Genotyping (GenomeSTRiP) • Breakpoint assembly (AGE & Tigra_SV) • Mechanism Classification [1000 Genomes Consortium, Nature (2010, 2012); Mills et al., Nature (2011)]

SummaryStats of 1000GP SV Phase3 •68,818SVs •2,504 unrelated individuals •26 populaSons •37,250 SVs with resolved breakpoints [2] 1000GP Phase3 SV paper. Submided to Nature, 2015. [3] 1000GP ConsorSum. Submided to Nature, 2015. 8

Phase 3: MedianAutosomal Variant Sites PerGenome 33 [3] 1000GP Consortium. Submitted to Nature, 2015.

A Typical Genome •A typical genome diﬀers from thereferencegenomeat4.09 –5.02million sites. •The typical genomecontains2,100 – 2,500 SVs,covering~20million bases. •A typical genomecontains149 – 182 siteswith protein truncatingvariants,10 – 12 thousand siteswith peptidesequence alteringvariants, and459 – 565 thousand variantsitesoverlappingregulatoryregions. 5 [3] 1000GP ConsorSum. Submided to Nature, 2015.

Human Genetic Variation Population of 2,504 peoples A Typical Genome A Cancer Genome Class of Variants Origin of Variants Prevalence of Variants Common Common Passenger Rare (~75%) Rare* (1-4%) Driver (~0.1%) * Variants with allele frequency < 0.5% are considered as rare variants in 1000 genomes project. The 1000 Genomes Project Consortium, Nature. 2015. 526:68-74 Khurana E. et al. Nat. Rev. Genet. 2016. 17:93-108

Association of Variants with Diseases Common Variants Healthy Rare or Somatic Variants Pooled Variants High Function Impact Diseased Burden Test GWAS Positive

Structural Variations (SVs) •SVs make up themajorityof varyingnucleotides among humans. •More base pairs are altered as a resultof SVs, than of single-nucleotidevariations. –On the haploid reference assembly, a mediumof 8.9Mbp are aﬀected by SVs,while3.6Mbp aﬀected by SNPs. [1]Weischenfeldt J,et al. NatRevGenet,2013. [2] 1000GP Phase3 SV paper. Submided to Nature, 2015. 6

Distribution of Different SVsin Normal Human Populations Total ~70K SVs from over 2,500 normal individuals (the 1000 Genomes Project)

Bioinformatics: Variant Identification, Focusing on SVs

Bioinformatics: Variant Identification, Focusing on SVs

Presentation Transcript

“This is a Test. This is Only a Test!”

Software Testing

3D Test Issues

Test and Test Equipment December 2012 Hsin -Chu , Taiwan

Who wants to be a Millionaire?

Test Preparation, Test Taking Strategies, and Test Anxiety

Test Automation Tools: QF-Test and Selenium

System Test Specification

TDC ( Test Description Code)

Engine Condition Diagnosis

Chi-square test or c 2 test

200

Test del Software, con elementi di Verifica e Validazione, Qualità del Prodotto Software

Test of Significance

System Test Tools

Lesson 7