
A Robust Framework for Detecting Structural Variations

February 6, 2008. Seunghak Lee, Elango Cheran, and Michael Brudno, University of Toronto, Canada.


Presentation Transcript


  1. A Robust Framework for Detecting Structural Variations. February 6, 2008. Seunghak Lee, Elango Cheran, and Michael Brudno (University of Toronto, Canada)

  2. What are structural variations? (1) • Variations of 10^3 – 10^6 basepairs in the genome • Insertion: a large contiguous fragment of DNA is inserted • Deletion: a large contiguous fragment of DNA is deleted • Inversion: a large contiguous fragment of DNA is inverted • Translocation: a large contiguous fragment of DNA is moved from one chromosome to another • Copy number variations

  3. What are structural variations? (2) Various examples of structural variations

  4. Outline • Introduction • Type of Structural Variations • Sequencing Approaches to Detect Structural Variations • Motivation & Research Objectives • Probabilistic Framework for Detecting Structural Variations • Probabilistic Framework • Flow of our Framework • Hierarchical Clustering of Matepairs (2nd phase) • Choosing a Unique Mapped Location for Each Matepair (3rd phase) • Experiments • Comparison with Three Previous Studies • DMBT1 Gene for Deletion • Centromere and Translocations • Conclusions

  5. Type of Structural Variations (1) Insertion (figure: A compared with REF)

  6. Type of Structural Variations (2) Deletion (figure: A compared with REF)

  7. Type of Structural Variations (3) Inversion (figure: A and REF with the 5’ and 3’ strand ends labeled)

  8. Type of Structural Variations (4) Translocation (figure: chr1 and chr2)

  9. Sequencing Approaches • 1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005] • Mapping matepairs onto the reference genome • Insertion and deletion: mapped distance inconsistent with the insert size • Inversion: both reads map in the same orientation • 2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007] • Proposed a high-throughput, massive paired-end mapping technique • Reported detailed types of structural variations
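
The matepair signatures listed on this slide can be sketched as a small classifier. This is a minimal illustration, not the procedure of either cited paper: the function name, the use of a mean and standard deviation for the insert size, and the cutoff `n_sd` are all assumptions made for the example.

```python
def classify_matepair(mapped_distance, same_orientation,
                      mean_insert, sd_insert, n_sd=3.0):
    """Label one matepair from its mapped distance and read orientation.

    Assumption: a distance is "inconsistent" when it is more than n_sd
    standard deviations from the mean insert size (cutoff is hypothetical).
    """
    if same_orientation:
        # Both reads map in the same orientation: inversion signature.
        return "inversion"
    if mapped_distance < mean_insert - n_sd * sd_insert:
        # Reads map closer together on the reference than the insert size:
        # the sample carries extra sequence (insertion).
        return "insertion"
    if mapped_distance > mean_insert + n_sd * sd_insert:
        # Reads map farther apart on the reference than the insert size:
        # the sample is missing sequence (deletion).
        return "deletion"
    return "concordant"


if __name__ == "__main__":
    for distance in (30000, 40000, 50000):
        print(distance, classify_matepair(distance, False, 40000, 2500))
```

A classifier like this presumes each read has a single known mapped location, which is the assumption the later slides set out to remove.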

  10. Motivation & Research Objectives (1) How can we map reads onto the reference genome? Tuzun et al. used scores that combine several factors (e.g., length, identity, and quality of the sequences).

  11. Motivation & Research Objectives (2) • Sequencing methods are effective for detecting structural variants. • Proven by Tuzun et al. and Korbel et al. • However, each read can have multiple mappings. • Previous research used a priori mapped locations. • Why don’t we develop a probabilistic model without such assumptions? • Hopefully, it can be applied to short reads from NGS machines.

  12. Probabilistic Framework (1) We use p(Y) to describe our probabilistic framework. p(Y): distribution of the mapped distances of “uniquely mapped” matepairs of various sizes

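  13. Probabilistic Framework (2) Insertion. $X_1, X_2$ = matepairs 1 and 2; $Y$ = random variable for the mapped distances of “uniquely mapped” matepairs. (Figure: density $p(Y)$ with $\mu_Y - \delta$ marked and $\mu_Y = (s + r)$.) $P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\,P(X_j \mid \mathrm{ins}=r)$, with $P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s + r)|$ and $s$ = mapped distance.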

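  14. Probabilistic Framework (3) Deletion. (Figure: density $p(Y)$ with $\mu_Y - \delta$ marked and $\mu_Y = (s - r)$.) $P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\,P(X_j \mid \mathrm{del}=r)$, with $P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s - r)|$ and $s$ = mapped distance.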
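
The insertion and deletion probabilities on these two slides can be evaluated numerically once p(Y) is fixed. The sketch below assumes Y is normally distributed with mean `mu` and standard deviation `sigma`; the slides do not state this, and describe p(Y) only as the distribution of mapped distances of uniquely mapped matepairs. All function names are hypothetical.

```python
import math


def tail_probability(mu, sigma, delta):
    """Return 1 - P(mu - delta <= Y <= mu + delta), assuming Y ~ Normal(mu, sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # For a normal variable, P(|Y - mu| <= delta) = erf(delta / (sigma * sqrt(2))).
    return 1.0 - math.erf(delta / (sigma * math.sqrt(2.0)))


def p_matepair(s, r, mu, sigma, kind):
    """P(X_i | ins=r) or P(X_i | del=r) for a matepair with mapped distance s."""
    if kind == "ins":
        implied = s + r   # delta = |mu - (s + r)|
    elif kind == "del":
        implied = s - r   # delta = |mu - (s - r)|
    else:
        raise ValueError("kind must be 'ins' or 'del'")
    delta = abs(mu - implied)
    return tail_probability(mu, sigma, delta)


def p_pair(s_i, s_j, r, mu, sigma, kind):
    """P(X_i, X_j | event) as the product of the two single-matepair terms."""
    return p_matepair(s_i, r, mu, sigma, kind) * p_matepair(s_j, r, mu, sigma, kind)


if __name__ == "__main__":
    # Two matepairs mapped 35 kb and 35.5 kb apart, candidate 5 kb insertion.
    print(p_pair(35000, 35500, 5000, mu=40000, sigma=2000, kind="ins"))
```

Under this assumption the probability equals 1 when the event size r exactly explains the gap between the mapped distance and the mean, and falls toward 0 as the mismatch grows.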

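  15. Probabilistic Framework (4) Inversion. (Figure: density $p(|Y_1 - Y_2|)$ with $\mu_{|Y_1-Y_2|} - \delta$ marked.) $c - d = s(X_1) - s(X_2)$. $P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1-Y_2|} - (c - d)|$ and $s(X_i)$ = insert size of $X_i$.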

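  16. Probabilistic Framework (5) Translocation. (Figure: density $p(|Y_1 - Y_2|)$ with $\mu_{|Y_1-Y_2|} - \delta$ marked.) $(c - a) - (d - b) = s(X_1) - s(X_2)$. $P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1-Y_2|} - [(c - a) - (d - b)]|$ and $s(X_i)$ = insert size of $X_i$.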
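
For inversions and translocations the probability is defined on |Y1 - Y2| rather than on Y. As a rough sketch, the distribution of |Y1 - Y2| can be approximated empirically from the mapped distances of uniquely mapped matepairs. The sampling scheme (all unordered pairs) and the function names are assumptions for illustration; `observed` stands for c - d in the inversion case or (c - a) - (d - b) in the translocation case.

```python
from itertools import combinations
from statistics import mean


def abs_difference_samples(mapped_distances):
    """Empirical sample of |Y1 - Y2| over all unordered pairs of distances."""
    return [abs(a - b) for a, b in combinations(mapped_distances, 2)]


def empirical_tail(samples, observed):
    """Estimate 1 - P(mu - delta <= Z <= mu + delta) with delta = |mu - observed|.

    Z is the quantity the samples were drawn from (here |Y1 - Y2|), and mu is
    the sample mean. The estimate is the fraction of samples lying strictly
    farther from mu than the observed value does.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mu = mean(samples)
    delta = abs(mu - observed)
    outside = sum(1 for z in samples if abs(z - mu) > delta)
    return outside / len(samples)


if __name__ == "__main__":
    distances = [39800, 40100, 40350, 39900, 40600, 40050]
    samples = abs_difference_samples(distances)
    print(empirical_tail(samples, observed=300))
    print(empirical_tail(samples, observed=5000))
```

An observed difference close to the typical |Y1 - Y2| gives a probability near 1, while a difference far outside the empirical range gives a probability near 0.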

  17. Flow of our Framework (1) 1. Preprocessing step • Discard matepairs consistent with the insert size • Remove short mappings • Keep the top K mappings • Mask repeats • Remove very similar mappings • Remove invalid strands (-,+) • Make all possible combinations of mappings
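
Several of the preprocessing steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the `Mapping` record, the length and top-K defaults, the reading of "invalid strands (-,+)" as a pair whose first read is on the minus strand and whose second is on the plus strand, and the concordance cutoff are all hypothetical. Repeat masking and removal of near-identical mappings are omitted.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mapping:
    """One candidate alignment of a read (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    strand: str
    score: float


def top_k_long_mappings(mappings, min_length, k):
    """Remove short mappings, then keep the k best-scoring ones."""
    kept = [m for m in mappings if m.end - m.start >= min_length]
    return sorted(kept, key=lambda m: m.score, reverse=True)[:k]


def candidate_pairs(left, right, min_length=50, k=3):
    """All combinations of the filtered mappings, minus (-,+) strand pairs."""
    lefts = top_k_long_mappings(left, min_length, k)
    rights = top_k_long_mappings(right, min_length, k)
    return [(a, b) for a, b in product(lefts, rights)
            if (a.strand, b.strand) != ("-", "+")]


def is_concordant(pair, mu, sigma, n_sd=2.0):
    """True when a (+,-) pair on one chromosome spans roughly the insert size."""
    a, b = pair
    if a.chrom != b.chrom or (a.strand, b.strand) != ("+", "-"):
        return False
    distance = max(a.end, b.end) - min(a.start, b.start)
    return abs(distance - mu) <= n_sd * sigma


if __name__ == "__main__":
    left = [Mapping("chr1", 1000, 1500, "+", 90.0)]
    right = [Mapping("chr1", 40500, 41000, "-", 80.0),
             Mapping("chr2", 100, 600, "+", 70.0)]
    pairs = candidate_pairs(left, right)
    print([is_concordant(p, mu=40000, sigma=2000) for p in pairs])
```

A matepair with any concordant combination would be discarded at this stage, leaving only matepairs whose candidate mappings all disagree with the insert size.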

  18. Flow of our Framework (2) 2. Clustering • Perform hierarchical clustering for each type of structural variation (insertion, deletion, inversion, translocation) 3. Finding structural variations • Find an initial configuration in a greedy manner • Learn the parameters of the objective function • Find a locally optimal configuration

  19. Hierarchical Clustering (1) • A cluster, C, is a set of matepairs explaining the same structural variation • Linkage distance: $D(X_1, X_2) = -\ln P(X_1, X_2 \mid C)$ (Figure, insertion example: matepairs X1 and X2, shown on A and REF, form the cluster C = {X1, X2}.)
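
The linkage distance on this slide turns a joint probability into a distance, so matepairs that are jointly well explained by one event end up close together. Below is a minimal agglomerative sketch. The single-linkage rule between clusters and the stopping threshold `max_distance` are assumptions, since the transcript gives only the two-matepair distance; `pair_probability` is a caller-supplied stand-in for P(X1, X2 | C).

```python
import math
from itertools import combinations


def linkage_distance(p_joint):
    """D = -ln P(X1, X2 | C); a zero probability maps to infinite distance."""
    return math.inf if p_joint <= 0.0 else -math.log(p_joint)


def agglomerate(items, pair_probability, max_distance):
    """Greedy agglomerative clustering with single linkage (assumed rule)."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: the closest pair of members defines the distance.
            d = min(linkage_distance(pair_probability(a, b))
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy stand-in: matepairs are positions, probability decays with separation.
    positions = [100, 105, 500, 510]
    prob = lambda a, b: math.exp(-abs(a - b) / 10.0)
    print(agglomerate(positions, prob, max_distance=2.0))
```

Running the clustering once per variation type, each with its own probability model, matches the per-type clustering described in the flow of the framework.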

  20. Hierarchical Clustering (2) • Generally, the linkage distance is given by a formula shown on the slide (equation not preserved in the transcript) • We perform hierarchical clustering separately for each type of structural variation.

  21. Choosing a Unique Mapped Location (1) Each matepair should be mapped onto a unique pair of BLAT hits and a unique cluster. (Figure: reads R1 and R2, BLAT hits 1 to 5, candidate mappings M1,4, M2,4, and M3,5, and clusters C1 and C2.)

  22. Choosing a Unique Mapped Location (2) • We define an objective function J(ω) • ƒ1 corresponds to the BLAT hit scores • ƒ2 corresponds to the probability • ƒ3 corresponds to the size of the clusters

  23. Choosing a Unique Mapped Location (3) • Find the initial configuration greedily • Learn the parameters of the objective function J(ω) • We used hill-climbing search to maximize the log likelihood of P(ω|λi) • Finally, find a configuration that locally maximizes J(ω) using hill-climbing search
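
The final search step can be illustrated with a generic steepest-ascent hill climber. This is a sketch under assumptions: a configuration is represented as a tuple holding one chosen candidate index per matepair, a neighbor differs in exactly one choice, and the toy objective is a plain sum of per-choice scores. The actual form of J(ω) and its learned parameters are not given in the transcript.

```python
def single_change_neighbors(config, n_choices):
    """Yield every configuration that differs from config in exactly one slot."""
    for i, n in enumerate(n_choices):
        for choice in range(n):
            if choice != config[i]:
                yield config[:i] + (choice,) + config[i + 1:]


def hill_climb(initial, neighbors, objective):
    """Move to the best strictly improving neighbor until none exists."""
    current, current_score = initial, objective(initial)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):
            score = objective(candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best is None:
            # No neighbor improves the objective: a local optimum.
            return current, current_score
        current, current_score = best, best_score


if __name__ == "__main__":
    # Hypothetical scores: scores[i][c] rates candidate c for matepair i.
    scores = [[1.0, 3.0], [2.0, 0.5, 4.0]]
    objective = lambda cfg: sum(scores[i][c] for i, c in enumerate(cfg))
    neighbors = lambda cfg: single_change_neighbors(cfg, [2, 3])
    print(hill_climb((0, 0), neighbors, objective))
```

Because every accepted move strictly increases the objective over a finite set of configurations, the loop always terminates, though only at a local optimum that depends on the greedy starting point.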

  24. P-values • We assign p-values to give confidence to our clusters. • The p-value is the probability that the cluster is generated by the reference genome rather than by structural variants. • $\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}})$, where $E$ = expected number of matepairs mapped to the location of the cluster • P-values depend on the length of the cluster, the number of matepairs involved, and the probabilities.
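
The p-value formula on this slide is straightforward to compute. The sketch below makes two assumptions not stated in the transcript: E is passed as an integer (for example, a rounded expectation), and the result is capped at 1 so it can be read as a probability. The function and parameter names are hypothetical.

```python
import math


def cluster_p_value(expected_matepairs, null_probabilities):
    """Pval(C_k) = C(E, |C_k|) * product of P(X_i | C_null) over the cluster.

    expected_matepairs: E, assumed here to be a non-negative integer.
    null_probabilities: one P(X_i | C_null) per matepair in the cluster.
    """
    k = len(null_probabilities)
    value = math.comb(expected_matepairs, k) * math.prod(null_probabilities)
    # Capping at 1.0 is an added safeguard, not part of the slide's formula.
    return min(1.0, value)


if __name__ == "__main__":
    # Three matepairs, each unlikely under the null, where 10 are expected.
    print(cluster_p_value(10, [0.01, 0.02, 0.01]))
```

The binomial coefficient grows with the expected coverage of the region, so a cluster of a given size is less convincing where many matepairs are expected to map.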

  25. Clustering Results We started with ~360,000 matepairs • ~90% were uniquely mapped • ~90% had a concordant position (mapped at μ ± 2σ) Through the clustering procedure above (FDR 0.2) we found • 82 insertion clusters (53 had a uniquely mapped read) • 175 deletion clusters (135) • 103 inversion clusters (24) • 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)

  26. Example Deletion

  27. Agreement with Previous Results All of the correlations (besides the zero) are significant (p-values < 0.001) via Monte Carlo simulations. The DMBT1 deletion was also found in the Tuzun et al. dataset (but not the Levy dataset). We have compared

  28. Translocations • A large fraction (69%) of the translocations were close to the centromeres • She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart • These could also be mis-assemblies.

  29. Conclusions • Introduced a novel framework for finding structural variants that does not rely on a priori mapping of matepairs to genomic positions. • Introduced a probabilistic model for structural variants. • Isolated 82 insertions, 175 deletions, and 103 inversions between the public reference human genome and the JCVI donor. • These results show statistically significant correlation with previous variation studies. • Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair).
