A Robust Framework for Detecting Structural Variations
February 6, 2008
Seunghak Lee (1), Elango Cheran (1), and Michael Brudno (1)
(1) University of Toronto, Canada
What are structural variations? (1)
• Variations of 10^3 – 10^6 base pairs in the genome
• Insertion: a large contiguous fragment of DNA is inserted
• Deletion: a large contiguous fragment of DNA is deleted
• Inversion: a large contiguous fragment of DNA is inverted
• Translocation: a large contiguous fragment of DNA is moved from one chromosome to another
• Copy number variations
What are structural variations? (2)
[Figure: various examples of structural variations]
Outline
• Introduction
  • Types of Structural Variations
  • Sequencing Approaches to Detect Structural Variations
  • Motivation & Research Objectives
• Probabilistic Framework for Detecting Structural Variations
  • Probabilistic Framework
  • Flow of our Framework
  • Hierarchical Clustering of Matepairs (2nd phase)
  • Choosing a Unique Mapped Location for Each Matepair (3rd phase)
• Experiments
  • Comparison with Three Previous Studies
  • DMBT1 Gene for Deletion
  • Centromere and Translocations
• Conclusions
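Type of Structural Variations (1): Insertion
[Figure: insertion, with tracks labeled A and REF]
Type of Structural Variations (2): Deletion
[Figure: deletion, with tracks labeled A and REF]
Type of Structural Variations (3): Inversion
[Figure: inversion, with tracks labeled A and REF and the 5’/3’ strand orientations marked]
Type of Structural Variations (4): Translocation
[Figure: translocation, with tracks labeled chr1 and chr2]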
Sequencing Approaches
1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]
• Maps matepairs onto the reference genome
• Insertion and deletion: inconsistent mapped distance
• Inversion: both reads map in the same orientation
2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]
• Proposed a high-throughput, massive paired-end mapping technique
• Described detailed types of structural variations
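The matepair signatures listed above can be sketched as a small classifier. This is only an illustration of the general idea, not the procedure of either cited study: the `MatePairMapping` type, the `classify` function, and the three-standard-deviation cutoff `n_sd` are all assumptions made for this example. It relies on the usual convention that a mapped distance shorter than the insert size suggests an insertion in the sequenced genome, and a longer one suggests a deletion.

```python
from dataclasses import dataclass


@dataclass
class MatePairMapping:
    """Hypothetical record for one matepair mapped onto the reference."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str


def classify(m, mean_insert, sd_insert, n_sd=3.0):
    """Label a mapping by its signature (n_sd is an assumed cutoff)."""
    if m.chrom1 != m.chrom2:
        return "translocation"      # reads land on different chromosomes
    if m.strand1 == m.strand2:
        return "inversion"          # both reads in the same orientation
    distance = abs(m.pos2 - m.pos1)
    if distance < mean_insert - n_sd * sd_insert:
        return "insertion"          # reads map closer together than expected
    if distance > mean_insert + n_sd * sd_insert:
        return "deletion"           # reads map farther apart than expected
    return "concordant"


if __name__ == "__main__":
    pair = MatePairMapping("chr1", 1000, "+", "chr1", 61000, "-")
    print(classify(pair, mean_insert=40000, sd_insert=2000))
```

A real pipeline would also have to deal with reads that map to several locations, which is the problem the rest of the talk addresses.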
Motivation & Research Objectives (1)
How can we map reads onto the reference genome? Tuzun et al. used scores that combine several factors (e.g., length, identity, and quality of the sequences).
Motivation & Research Objectives (2)
• Sequencing-based methods are effective for detecting structural variants.
  • Shown by Tuzun et al. and Korbel et al.
• However, there are multiple mappings for each read
• Previous research used a priori mapped locations.
• Why not develop a probabilistic model without such assumptions?
• Hopefully, it can be applied to short reads from NGS machines.
Probabilistic Framework (1)
We build our probabilistic framework around p(Y).
p(Y): the distribution of mapped distances of “uniquely mapped” matepairs of various sizes
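Probabilistic Framework (2)
$X_1, X_2$ = matepairs 1 and 2
$Y$ = random variable for the mapped distances of “uniquely mapped” matepairs
Insertion
[Figure: the density $p(Y)$, annotated with $\mu_Y = (s + r)$ and $\mu_Y - \delta$]
$P(X_i, X_j \mid \mathrm{ins} = r) = P(X_i \mid \mathrm{ins} = r)\, P(X_j \mid \mathrm{ins} = r)$
$P(X_i \mid \mathrm{ins} = r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$,
where $\delta = |\mu_Y - (s + r)|$ and $s$ = mapped distance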
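Probabilistic Framework (3)
Deletion
[Figure: the density $p(Y)$, annotated with $\mu_Y - \delta$ and $\mu_Y = (s - r)$]
$P(X_i, X_j \mid \mathrm{del} = r) = P(X_i \mid \mathrm{del} = r)\, P(X_j \mid \mathrm{del} = r)$
$P(X_i \mid \mathrm{del} = r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$,
where $\delta = |\mu_Y - (s - r)|$ and $s$ = mapped distance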
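The insertion and deletion probabilities above can be evaluated in closed form if p(Y) is taken to be a normal distribution. That is an assumption made only for this sketch; the slides define p(Y) from the uniquely mapped matepairs and do not state its shape. Under normality, $P(\mu_Y - \delta \le Y \le \mu_Y + \delta) = \mathrm{erf}(\delta / (\sigma\sqrt{2}))$, so the quantity of interest is the complementary error function. The function names and the parameter `sigma` are hypothetical.

```python
import math


def outside_band(delta, sigma):
    """1 - P(mu - delta <= Y <= mu + delta) for Y ~ Normal(mu, sigma^2).

    Normality of p(Y) is an assumption of this sketch.
    """
    return math.erfc(delta / (sigma * math.sqrt(2.0)))


def p_insertion(s, r, mu, sigma):
    """P(X_i | ins = r), with delta = |mu_Y - (s + r)|."""
    return outside_band(abs(mu - (s + r)), sigma)


def p_deletion(s, r, mu, sigma):
    """P(X_i | del = r), with delta = |mu_Y - (s - r)|."""
    return outside_band(abs(mu - (s - r)), sigma)


def p_pair_insertion(s_i, s_j, r, mu, sigma):
    """P(X_i, X_j | ins = r) as the product of the two single terms."""
    return p_insertion(s_i, r, mu, sigma) * p_insertion(s_j, r, mu, sigma)


if __name__ == "__main__":
    # Two matepairs mapped about 2 kb closer than a 40 kb mean insert size.
    print(p_pair_insertion(38000, 38200, r=2000, mu=40000, sigma=1000))
```

The probability is largest when the proposed event size r exactly accounts for the gap between the mapped distance and the mean, and it falls off as the corrected distance moves away from $\mu_Y$.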
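Probabilistic Framework (4)
Inversion
[Figure: the density $p(|Y_1 - Y_2|)$, annotated with $\mu_{|Y_1 - Y_2|} - \delta$]
$c - d = s(X_1) - s(X_2)$
$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$,
where $\delta = |\mu_{|Y_1 - Y_2|} - (c - d)|$ and $s(X_i)$ = insert size of $X_i$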
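Probabilistic Framework (5)
Translocation
[Figure: the density $p(|Y_1 - Y_2|)$, annotated with $\mu_{|Y_1 - Y_2|} - \delta$]
$(c - a) - (d - b) = s(X_1) - s(X_2)$
$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$,
where $\delta = |\mu_{|Y_1 - Y_2|} - ((c - a) - (d - b))|$ and $s(X_i)$ = insert size of $X_i$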
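For inversions and translocations the relevant distribution is that of $|Y_1 - Y_2|$. One way to approximate it, sketched here under the assumption that it is estimated empirically from all pairs of uniquely mapped distances, is shown below. The function names are hypothetical, and `observed` stands for $c - d$ (inversion) or $(c - a) - (d - b)$ (translocation).

```python
from itertools import combinations
from statistics import mean


def abs_differences(distances):
    """Empirical sample of |Y1 - Y2| over all pairs of mapped distances."""
    return [abs(a - b) for a, b in combinations(distances, 2)]


def outside_probability(samples, observed):
    """1 - P(mu - delta <= |Y1 - Y2| <= mu + delta), estimated by counting.

    mu is the sample mean and delta = |mu - observed|.
    """
    mu = mean(samples)
    delta = abs(mu - observed)
    inside = sum(1 for d in samples if mu - delta <= d <= mu + delta)
    return 1.0 - inside / len(samples)


if __name__ == "__main__":
    diffs = abs_differences([100, 110, 90, 105])
    print(outside_probability(diffs, observed=12))
```

With a real library the sample would come from the uniquely mapped matepairs; enumerating all pairs is quadratic, so subsampling would probably be needed at scale.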
Flow of our Framework (1)
1. Preprocessing step
• Discard matepairs whose mapped distance is consistent with the insert size
• Remove short mappings
• Get the top K mappings
• Mask repeats
• Remove very similar mappings
• Remove invalid strands (-,+)
• Make all possible combinations of mappings
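Part of the preprocessing step above can be sketched as follows. The `Hit` record, the function names, and the default thresholds are assumptions for illustration. Repeat masking, removal of near-duplicate mappings, and the insert-size filter are left out, and the (-,+) strand rule is copied from the slide without interpreting why that combination is invalid.

```python
from itertools import product
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one candidate mapping of a read."""
    chrom: str
    pos: int
    strand: str   # '+' or '-'
    length: int   # aligned length in bases
    score: float  # mapping score, higher is better


def top_hits(hits, k, min_length):
    """Drop short mappings, then keep the k best-scoring ones."""
    long_enough = [h for h in hits if h.length >= min_length]
    return sorted(long_enough, key=lambda h: h.score, reverse=True)[:k]


def candidate_pairs(hits1, hits2, k=3, min_length=100):
    """All combinations of the top hits of both reads, minus (-,+) pairs."""
    pairs = product(top_hits(hits1, k, min_length),
                    top_hits(hits2, k, min_length))
    return [(a, b) for a, b in pairs if (a.strand, b.strand) != ("-", "+")]


if __name__ == "__main__":
    read1 = [Hit("chr1", 100, "+", 500, 0.9), Hit("chr1", 900, "-", 500, 0.8)]
    read2 = [Hit("chr1", 5000, "-", 500, 0.95)]
    for a, b in candidate_pairs(read1, read2):
        print(a.pos, a.strand, b.pos, b.strand)
```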
Flow of our Framework (2)
2. Clustering
• Do hierarchical clustering for each structural variation (insertion, deletion, inversion, translocation)
3. Finding structural variations
• Find an initial configuration in a greedy manner
• Learn the parameters of the objective function
• Find a locally optimal configuration
Hierarchical Clustering (1)
[Figure: example for an insertion, with matepairs X1 and X2 drawn against A and REF and grouped as C = {X1, X2}]
• A cluster, C, is a set of matepairs explaining the same structural variation
• Linkage distance: $D(X_1, X_2) = -\ln P(X_1, X_2 \mid C)$
Hierarchical Clustering (2)
• Generally, the linkage distance is given by [equation not preserved in this transcript]
• We perform hierarchical clustering separately for each structural variation.
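A minimal agglomerative clustering loop built on the pairwise distance $D(X_1, X_2) = -\ln P(X_1, X_2 \mid C)$ might look like the sketch below. The general linkage formula is not available in this transcript, so average linkage is used purely as a stand-in; the stopping threshold `max_distance` and the toy pair probability in the usage example are likewise assumptions.

```python
import math
from itertools import combinations


def linkage_distance(p):
    """D = -ln P; a zero probability gives an infinite distance."""
    return -math.log(p) if p > 0 else math.inf


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between two clusters (assumed linkage rule)."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster(items, dist, max_distance):
    """Merge the closest clusters until none is within max_distance."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), average_linkage(clusters[i], clusters[j], dist))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


if __name__ == "__main__":
    # Toy stand-in for P(X1, X2 | C): decays with the gap between the
    # event sizes implied by two matepairs.
    def toy_dist(a, b):
        return linkage_distance(math.exp(-abs(a - b) / 1000.0))

    print(cluster([2000, 2100, 9000], toy_dist, max_distance=1.0))
```

Following the slides, such a loop would be run once per event type, each with its own pair probability.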
Choosing a Unique Mapped Location (1)
Each matepair should be mapped onto a unique pair of BLAT hits and a unique cluster.
[Figure labels: R1, R2; hits 1–5; M1,4, M2,4, M3,5; C1, C2]
Choosing a Unique Mapped Location (2)
• We define an objective function J(ω) [equation not preserved in this transcript]
• ƒ1 corresponds to the BLAT hit scores
• ƒ2 corresponds to the probability
• ƒ3 corresponds to the size of the clusters
Choosing a Unique Mapped Location (3)
• Find the initial configuration greedily
• Learn the parameters of the objective function J(ω)
  • We used hill-climbing search to maximize the log likelihood of P(ω|λi)
• Finally, find a configuration that locally maximizes J(ω), using hill-climbing search
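The greedy start and the final hill-climbing search can be sketched as below. The exact form of J(ω) is not available in this transcript, so the example assumes a weighted sum of a BLAT score term, a log-probability term, and a cluster-size term. The candidate layout (`blat_score`, `prob`, `cluster`), the function names, and the fixed weights are hypothetical, and the parameter-learning step is omitted.

```python
import math
from collections import Counter


def objective(config, candidates, weights):
    """Assumed J: sum over matepairs of w1*f1 + w2*f2 + w3*f3."""
    w1, w2, w3 = weights
    sizes = Counter(candidates[m][c]["cluster"] for m, c in config.items())
    total = 0.0
    for m, c in config.items():
        cand = candidates[m][c]
        total += (w1 * cand["blat_score"]          # f1: BLAT hit score
                  + w2 * math.log(cand["prob"])    # f2: log probability
                  + w3 * sizes[cand["cluster"]])   # f3: cluster size
    return total


def greedy_start(candidates):
    """Initial configuration: best BLAT score for every matepair."""
    return {m: max(range(len(c)), key=lambda i: c[i]["blat_score"])
            for m, c in candidates.items()}


def hill_climb(config, candidates, weights):
    """Change one matepair's choice at a time while J keeps increasing."""
    config = dict(config)
    best = objective(config, candidates, weights)
    improved = True
    while improved:
        improved = False
        for m in candidates:
            for c in range(len(candidates[m])):
                if c == config[m]:
                    continue
                trial = dict(config)
                trial[m] = c
                score = objective(trial, candidates, weights)
                if score > best:
                    config, best, improved = trial, score, True
    return config, best


if __name__ == "__main__":
    cands = {
        "m1": [{"blat_score": 0.9, "prob": 0.5, "cluster": "C1"},
               {"blat_score": 0.8, "prob": 0.5, "cluster": "C2"}],
        "m2": [{"blat_score": 0.9, "prob": 0.5, "cluster": "C2"}],
    }
    print(hill_climb(greedy_start(cands), cands, (1.0, 1.0, 1.0)))
```

In the toy data the search moves m1 to its slightly lower-scoring hit because doing so places both matepairs in the same cluster, which is the kind of trade-off the three terms are meant to balance.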
P-values
• We assign p-values to give confidence to our clusters.
• The p-value is the probability that the cluster is generated by the reference genome, not by structural variants
• $\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}})$, where $E$ = expected number of matepairs mapped to the location of the cluster
• P-values depend on the length of the cluster, the number of matepairs involved, and the probabilities.
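The p-value formula above translates directly into a few lines of code. The sketch assumes that E has been rounded to an integer so that the binomial coefficient is defined, and that the null probabilities $P(X_i \mid C_{\mathrm{null}})$ have already been computed; the function name is hypothetical.

```python
import math


def cluster_p_value(expected_matepairs, null_probs):
    """Pval(C_k) = (E choose |C_k|) * product of P(X_i | C_null).

    expected_matepairs: E, assumed here to be an integer.
    null_probs: one P(X_i | C_null) per matepair in the cluster.
    """
    k = len(null_probs)
    return math.comb(expected_matepairs, k) * math.prod(null_probs)


if __name__ == "__main__":
    # Three matepairs, each unlikely under the reference genome,
    # in a region where about 10 matepairs are expected.
    print(cluster_p_value(10, [0.01, 0.02, 0.01]))
```

The binomial coefficient grows with E, so the same set of matepairs is less significant in a region where many matepairs are expected to map.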
Clustering Results
We started with ~360,000 matepairs
• ~90% were uniquely mapped
• ~90% had a concordant position (mapped at ± 2)
Through the clustering procedure above (FDR 0.2) we found
• 82 insertion clusters (53 had a uniquely mapped read)
• 175 deletion clusters (135)
• 103 inversion clusters (24)
• 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)
Agreement with Previous Results
All of the correlations (besides the zero) are significant (p-values < 0.001) via Monte Carlo simulations
The DMBT1 deletion was also found in the Tuzun et al. dataset (but not the Levy dataset).
We have compared
Translocations
• A large fraction (69%) of the translocations were close to the centromeres
• She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart, so at that rate up to about 40 such events would be expected.
• These could also be mis-assemblies.
Conclusions
• Introduced a novel framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.
• Introduced a probabilistic model for structural variants
• Isolated 82 insertions, 175 deletions, and 103 inversions between the public reference human genome and the JCVI donor.
• These results show statistically significant correlation with previous variation studies
• Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair)