A Robust Framework for Detecting Structural Variations

February 6, 2008. Seunghak Lee, Elango Cheran, and Michael Brudno (University of Toronto, Canada)

What are structural variations? (1)
• Variations of 10^3 to 10^6 base pairs in the genome
• Insertion: a large contiguous fragment of DNA is inserted
• Deletion: a large contiguous fragment of DNA is deleted
• Inversion: a large contiguous fragment of DNA is inverted
• Translocation: a large contiguous fragment of DNA is moved from one chromosome to another
• Copy number variations

What are structural variations? (2) Various examples of structural variations

Outline
• Introduction
  • Types of Structural Variations
  • Sequencing Approaches to Detect Structural Variations
  • Motivation & Research Objectives
• Probabilistic Framework for Detecting Structural Variations
  • Probabilistic Framework
  • Flow of our Framework
  • Hierarchical Clustering of Matepairs (2nd phase)
  • Choosing a Unique Mapped Location for Each Matepair (3rd phase)
• Experiments
  • Comparison with Three Previous Studies
  • DMBT1 Gene for Deletion
  • Centromere and Translocations
• Conclusions

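Type of Structural Variations (1): Insertion [Figure: sequence A shown against the reference, REF]

Type of Structural Variations (2): Deletion [Figure: sequence A shown against the reference, REF]

Type of Structural Variations (3): Inversion [Figure: sequence A shown against the reference, REF, with 5’ and 3’ strand ends labeled]

Type of Structural Variations (4): Translocation [Figure: chr1 and chr2]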

Sequencing Approaches
1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]
• Mapped matepairs onto the reference genome
• Insertion and deletion: inconsistent mapped distance
• Inversion: the same orientation of both reads
2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]
• Proposed a high-throughput, massive paired-end mapping technique
• Described detailed types of structural variations
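The two signals on this slide can be sketched as a small classifier. This is only an illustration, not the procedure of either cited paper: the function name, the strand encoding, and the cutoff of k standard deviations around the mean insert size are assumptions. The direction of the distance test follows the usual reading of matepair data: reads that map closer together than the insert size suggest an insertion in the sequenced genome, and reads that map farther apart suggest a deletion.

```python
def classify_matepair(mapped_distance, strand1, strand2,
                      mean_insert, sd_insert, k=2.0):
    """Label one mapped matepair (hypothetical helper, not from the slides)."""
    # Both reads in the same orientation: inversion signal.
    if strand1 == strand2:
        return "inversion"
    lower = mean_insert - k * sd_insert
    upper = mean_insert + k * sd_insert
    if mapped_distance < lower:
        # Reads map closer together than the insert size allows.
        return "insertion"
    if mapped_distance > upper:
        # Reads map farther apart than the insert size allows.
        return "deletion"
    return "concordant"
```

For example, with an assumed mean insert size of 40,000 bp and standard deviation of 2,000 bp, a matepair mapped 50,000 bp apart on opposite strands would be labeled a deletion candidate.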

Motivation & Research Objectives (1) How can we map reads onto the reference genome? Tuzun et al. used scores that combine several factors (e.g., length, identity, and quality of the sequences).

Motivation & Research Objectives (2)
• Sequencing-based methods are effective for detecting structural variants.
• Shown by Tuzun et al. and Korbel et al.
• However, each read can have multiple mappings.
• Previous research used a priori mapped locations.
• Why not develop a probabilistic model without such assumptions?
• Ideally, it can also be applied to short reads from NGS machines.

Probabilistic Framework (1) We use p(Y) to describe our probabilistic framework. p(Y): the distribution of mapped distances of “uniquely mapped” matepairs of various sizes

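Probabilistic Framework (4): Inversion
[Figure: density $p(|Y_1 - Y_2|)$ with the interval $\mu_{|Y_1 - Y_2|} \pm \delta$ marked]
$c - d = s(X_1) - s(X_2)$
$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$,
where $\delta = |\mu_{|Y_1 - Y_2|} - (c - d)|$ and $s(X_i)$ is the insert size of $X_i$.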

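Probabilistic Framework (5): Translocation
[Figure: density $p(|Y_1 - Y_2|)$ with the interval $\mu_{|Y_1 - Y_2|} \pm \delta$ marked]
$(c - a) - (d - b) = s(X_1) - s(X_2)$
$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$,
where $\delta = |\mu_{|Y_1 - Y_2|} - ((c - a) - (d - b))|$ and $s(X_i)$ is the insert size of $X_i$.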
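As a rough illustration of the inversion and translocation formulas above, the probability can be estimated from an empirical sample of |Y1 - Y2| values. The sketch assumes such a sample is available as a plain list; the function name `pair_probability` and its arguments are hypothetical, and the slides do not say how the distribution is actually represented.

```python
from statistics import mean

def pair_probability(abs_diffs, observed):
    """Empirical estimate of 1 - P(mu - delta <= |Y1 - Y2| <= mu + delta).

    abs_diffs: sample of |Y1 - Y2| values (assumed input).
    observed:  the observed quantity, e.g. c - d for an inversion or
               (c - a) - (d - b) for a translocation.
    """
    mu = mean(abs_diffs)
    delta = abs(mu - observed)
    # Fraction of the sample that falls inside [mu - delta, mu + delta].
    inside = sum(1 for d in abs_diffs if mu - delta <= d <= mu + delta)
    return 1.0 - inside / len(abs_diffs)
```

Under this formula the probability is largest when the observed value sits at the mean of |Y1 - Y2| (delta near zero) and falls toward zero as the observed value moves away from it.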

Flow of our Framework (1)
1. Preprocessing step
• Discard matepairs consistent with insert size
• Remove short mappings
• Get top K mappings
• Mask repeats
• Remove very similar mappings
• Remove invalid strands (-,+)
• Make all possible combinations of mappings
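Several of the preprocessing steps above can be sketched as a filter over the candidate mappings of the two reads in a matepair. Everything concrete here is an assumption made for illustration: the dictionary keys, the default thresholds, and the reading of "consistent with insert size" as "some combination maps within k standard deviations of the mean insert size". Repeat masking and removal of very similar mappings are left out.

```python
from itertools import product

def candidate_pairs(hits1, hits2, mean_insert, sd_insert,
                    top_k=3, min_len=50, k=2.0):
    """Return candidate (hit1, hit2) combinations for one matepair.

    Each hit is a dict with hypothetical keys:
    chrom, pos, strand, length, score.
    """
    def prepare(hits):
        # Remove short mappings, then keep the top K by score.
        kept = [h for h in hits if h["length"] >= min_len]
        return sorted(kept, key=lambda h: h["score"], reverse=True)[:top_k]

    pairs = []
    # Make all possible combinations of the remaining mappings.
    for a, b in product(prepare(hits1), prepare(hits2)):
        if (a["strand"], b["strand"]) == ("-", "+"):
            continue  # invalid strand combination
        pairs.append((a, b))

    # Discard the matepair if any combination is consistent with insert size.
    for a, b in pairs:
        same_chrom = a["chrom"] == b["chrom"]
        opposite = a["strand"] != b["strand"]
        distance = abs(b["pos"] - a["pos"])
        if same_chrom and opposite and abs(distance - mean_insert) <= k * sd_insert:
            return []
    return pairs
```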

Flow of our Framework (2)
2. Clustering
• Do hierarchical clustering for each structural variation (insertion, deletion, inversion, translocation)
3. Finding structural variations
• Find an initial configuration in a greedy manner
• Learn parameters for the objective function
• Find a locally optimal configuration

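Hierarchical Clustering (1)
[Figure: example for an insertion, with matepairs X1 and X2 shown on sequence A and on the reference, REF, forming the cluster C = {X1, X2}]
• A cluster, C, is a set of matepairs explaining the same structural variation
• Linkage distance: $D(X_1, X_2) = -\ln P(X_1, X_2 \mid C)$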

Hierarchical Clustering (2)
• Generally, linkage distance is given by [equation missing]
• We do hierarchical clustering for each structural variation.
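A minimal sketch of agglomerative clustering with the pairwise distance D = -ln P might look like the following. The general linkage rule is not recoverable from the slides, so average linkage is used here purely as an assumption, as are the stopping threshold `max_distance` and the caller-supplied `pair_prob` function.

```python
import math
from itertools import combinations

def linkage_distance(p):
    """D = -ln P for a pairwise probability P in [0, 1]."""
    return math.inf if p <= 0 else -math.log(p)

def agglomerate(items, pair_prob, max_distance):
    """Merge the closest clusters until none is within max_distance.

    pair_prob(x, y) returns P(x, y | C) (assumed to be supplied).
    Average linkage between clusters is an assumption of this sketch.
    """
    clusters = [[x] for x in items]

    def dist(a, b):
        total = sum(linkage_distance(pair_prob(x, y)) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), d = min(
            (((i, j), dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With a toy probability that decays with the distance between positions, such as `math.exp(-abs(a - b) / 100)`, nearby items merge and distant groups stay separate.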

Choosing a Unique Mapped Location (1) Each matepair must be assigned to a unique pair of BLAT hits and a unique cluster. [Figure labels: R1, R2; 1 to 5; M1,4, M2,4, M3,5; C1, C2]

Choosing a Unique Mapped Location (2)
• We define an objective function J(ω)
• ƒ1 corresponds to BLAT hit scores
• ƒ2 corresponds to the probability
• ƒ3 corresponds to the size of clusters

Choosing a Unique Mapped Location (3)
• Find the initial configuration greedily
• Learn parameters for the objective function J(ω)
• We used hill climbing search to maximize the log likelihood of P(ω|λi)
• Finally, find a configuration that locally maximizes J(ω) using hill climbing search
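The greedy start followed by hill climbing can be sketched generically. The exact form of J(ω) is not given in this transcript, so the objective is passed in as a function; a configuration is assumed to be a list holding one chosen candidate per matepair, and all names here are hypothetical.

```python
def greedy_init(candidates, score):
    """Pick, for each matepair i, the candidate with the best score(i, c)."""
    return [max(options, key=lambda c, i=i: score(i, c))
            for i, options in enumerate(candidates)]

def hill_climb(config, candidates, objective):
    """Change one matepair's choice at a time while the objective improves.

    candidates[i] lists the allowed choices for matepair i.
    objective(config) stands in for J; its form is an assumption.
    """
    current = list(config)
    best = objective(current)
    improved = True
    while improved:
        improved = False
        for i, options in enumerate(candidates):
            for choice in options:
                if choice == current[i]:
                    continue
                trial = current[:i] + [choice] + current[i + 1:]
                value = objective(trial)
                if value > best:
                    current, best, improved = trial, value, True
    return current, best
```

The search stops at the first configuration that no single change improves, so the result is a local optimum and depends on the starting configuration.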

P-values
• We assign p-values to give confidence to our clusters.
• The p-value is the probability that the cluster is generated by the reference genome rather than by structural variants.
• $\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}})$, where E is the expected number of matepairs mapped to the location of the cluster.
• P-values depend on the length of the cluster, the number of matepairs involved, and the probabilities.
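The p-value formula above can be evaluated directly. In this sketch the expected count E is rounded to an integer so that the binomial coefficient is defined; that rounding, and the function name, are assumptions rather than details from the slides.

```python
import math

def cluster_pvalue(expected_matepairs, null_probs):
    """Pval(C_k) = (E choose |C_k|) * product of P(X_i | C_null).

    expected_matepairs: E, rounded to an integer here (assumption).
    null_probs: one P(X_i | C_null) per matepair in the cluster.
    """
    e = round(expected_matepairs)
    n = len(null_probs)
    return math.comb(e, n) * math.prod(null_probs)
```

For instance, with E = 10 and two matepairs that each have null probability 0.1, the value is 45 × 0.01 = 0.45.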

Clustering Results
We started with ~360,000 matepairs
• ~90% were uniquely mapped
• ~90% had a concordant position (mapped at μ ± 2σ)
Through the clustering procedure above (FDR 0.2) we found
• 82 insertion clusters (53 had a uniquely mapped read)
• 175 deletion clusters (135)
• 103 inversion clusters (24)
• 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)

Example Deletion

Agreement with Previous Results All of the correlations (besides the zero) are significant (p-values < 0.001) via Monte Carlo simulations. The DMBT1 deletion was also found in the Tuzun et al. dataset (but not the Levy dataset). We have compared

Translocations
• A large fraction (69%) of the translocations were close to the centromeres
• She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart
• These could also be mis-assemblies.

Conclusions
• Introduced a novel framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.
• Introduced a probabilistic model for structural variants.
• Isolated 82 insertions, 175 deletions, and 103 inversions between the public reference human genome and the JCVI donor.
• These results show statistically significant correlation with previous variation studies.
• Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair).
