390 likes | 514 Views
A New Model of Multi-Marker Correlation for Genome-Wide Tag SNP Selection. Wei-Bung Wang Tao Jiang. Introduction. Outline. Introduction Problem Related Work Our Approach Result. Introduction. C T T A G C T T. 94%. C T T A G T T T. 6%. SNP. Single Nucleotide Polymorphism.
E N D
A New Model of Multi-Marker Correlation for Genome-Wide Tag SNP Selection Wei-Bung Wang Tao Jiang
Introduction Outline • Introduction • Problem • Related Work • Our Approach • Result
Introduction C T T A G C T T 94% C T T A G T T T 6% SNP Single Nucleotide Polymorphism • Single Nucleotide Polymorphism (SNP) • A genetic variation Modified from slide by Yao-Ting Huang, National Taiwan University Department of Computer Science and Information Engineering
Introduction C T T A G C T T 94% C T T A G T T T 6% SNP SNPs • SNPs are usually bi-allelic • Major allele • Minor allele • Minor allele frequency > 1% (or 5%) • Tri-allelic: very rare
Introduction CTC Haplotype 1 -A C T T T G C T C- -A C T T A G C T T- CAT Haplotype 2 ATC -A A T T T G C T C- Haplotype 3 SNP1 SNP2 SNP3 SNP1 SNP2 SNP3 Haplotype Modified from slide by Yao-Ting Huang, National Taiwan University Department of Computer Science and Information Engineering
Introduction Tag SNP • What is a tag SNP? • Here I use some slides by Yao-Ting Huang and Kun-Mao Chao
Examples of Tag SNPs Haplotype patterns An unknown haplotype sample P1 P2 P3 P4 S1 • Suppose we wish to distinguish an unknown haplotype sample. • We can genotype all SNPs to identify the haplotype sample. S2 S3 S4 S5 S6 SNP loci S7 S8 S9 : Major allele S10 S11 : Minor allele S12
Examples of Tag SNPs Haplotype pattern P1 P2 P3 P4 S1 • In fact, it is not necessary to genotype all SNPs. • SNPs S3, S4, and S5 can form a set of tag SNPs. S2 S3 S4 S5 S6 SNP loci P1 P2 P3 P4 S7 S8 S3 S9 S4 S10 S5 S11 S12
Examples of Wrong Tag SNPs Haplotype pattern P1 P2 P3 P4 S1 • SNPsS1, S2, and S3 can not form a set of tag SNPs because P1 and P4 will be ambiguous. S2 S3 S4 S5 S6 SNP loci P1 P2 P3 P4 S7 S1 S8 S2 S9 S3 S10 S11 S12
Examples of Tag SNPs Haplotype pattern • SNPs S1 and S12 can form a set of tag SNPs. • This set of SNPs is the minimum solution in this example. P1 P2 P3 P4 S1 S2 S3 S4 S5 S6 SNP loci S7 S8 P1 P2 P3 P4 S9 S1 S10 S12 S11 S12
Problem Problem • Tag SNP selection • How to select representatives? • Many different ways
What we do Flowchart A group of individuals(all SNPs are known) Select A set of SNPs(tag SNPs) Relationships between tag SNPsand other SNPs ? Assay Haplotype: tag SNPs Haplotype: all SNPs Save money here
Problem Problem • Perfect world • Minimum set of tag SNPs • Save most money • NP-hard • Real life • Relatively small set • Sufficient accuracy/confidence
A very frequently used method is Linkage Disequilibrium (LD) A group of individuals(all SNPs are known) Select What we do A set of SNPs(tag SNPs) Relationships between tag SNPsand other SNPs ? Assay Haplotype: tag SNPs Haplotype: all SNPs Save money here
Related Work 2 r Linkage Disequilibrium (LD) • Non-random association of alleles at two or more loci • Correlated coefficient: estimation of dependency • LD = correlated coefficient =
Related Work 2 ( [ ] ) b A B 0 1 2 r a , , ; ; 2 P P P 1 r = , = = A B A B 2 ( ) P P P ¡ A B A B 2 r = P P P P b A B a Linkage Disequilibrium • r2 = 1: perfect correlation • r2 = 0.9: strong correlation (0.95, etc.) • r2 = 0: no correlation
Related Work An Example A A T T C C
Related Work Minimum Dominating Set Problem Highly correlated SNP
Examples of Tag SNPs Haplotype patterns An unknown haplotype sample P1 P2 P3 P4 S1 • Suppose we wish to distinguish an unknown haplotype sample. S2 S3 S4 S5 S6 SNP loci S7 S8 S9 : Major allele S10 S11 : Minor allele S12
Examples of Tag SNPs Haplotype patterns An unknown haplotype sample P1 P2 P3 P4 S1 • Suppose we wish to distinguish an unknown haplotype sample. S2 S3 S4 S5 S6 SNP loci S7 S8 S9 : Major allele S10 S11 : Minor allele S12
Examples of Tag SNPs Haplotype pattern • SNPs S1 and S12 can form a set of tag SNPs. • This set of SNPs is the minimum solution in this example. P1 P2 P3 P4 S1 S2 S3 S4 S5 SNPs can work together and help each other S6 SNP loci S7 S8 P1 P2 P3 P4 S9 S1 S10 S12 S11 S12
We introduce new allele AC and AC Only one mistake Our Approach A C A C : T 0 7 0 0 7 . . G 0 1 0 2 0 3 . . . 0 8 0 2 . . Our Approach A C G else T
(snp1, snp2) vs. snp3 (snp1, snp2) vs. snp4 Our Approach ( ) A C A C C T G A _ , , ( ) A C A C C T T C _ : : , , Our Approach
Our Approach In the Right Order A group of individuals(all SNPs are known) Select A set of SNPs(tag SNPs) Relationships between tag SNPsand other SNPs Second First
Our Approach If SNP 1, 4, 10 are tag SNPsPredict SNP 17 with patterns …Accuracy / LD: 0.97 . If SNP 5, 8, 13 are tag SNPsPredict SNP 11 with patterns …Accuracy / LD: 0.62 . Our Approach • Generate relationships …………
Our Approach ( ) b b b b A B C P A C A P C M D j _ _ _ _ > c a a m c a o r ) = ) d A B C D A B C ( ) d P A B P B C B i _ _ < c a a m c n o r m ) = ) d A B D A B c c ¢ ¢ ¢ How to Predict / Determine the Alleles? • LD: (tag) SNP 1, 2, 3 vs. SNP 4 • Allele A/a, B/b, C/c, D/d abc abC Abc aBc AbC aBC SNP[123] becomes bi-allelic ABC ABc majorbucket minorbucket
Our Approach Similar Work • Ke Hao also did a similar work • The same LD model • Different way to determine alleles for composite SNPs • Less flexibility • A special case of our model • Related paper: “Genome-wide selection of tag SNPs using multiple-marker correlation,” Bioinformatics, 2007
Our Approach Sketch • Get r2 value for all possible combinations • Find a small subset of SNPs according to LD
Our Approach Sketch • Find a small subset of SNPs according to LD Tag SNPs Tag SNPs are also covered by themselves covered partialcovered
Our Approach Sketch • Simple greedy algorithm (Ke Hao) • Cover more SNPs in each iteration • Modified greedy algorithm (my work) • A SNP that can’t be covered by others • High priority • A SNP that is not picked but covered • OK • Break tie: partial cover
Our Approach Supersede No longer contributes
Our Approach Supersede
Our Approach l i h A T M M M T 1 t w o a r k e r a g g e r g o r m - i f l R i t t t e q u r e : s e o r p e s h i l h d d S N P t 1 w e o e r e a r e s u n c o v e r e : i f h h d h S N P i i i i t t t 2 e n e r e s a s w n o n c o m n g e g e s : ¤ 3 s s à : l 4 e s e : ¤ h h h d S N P S N P t t t t t 5 s e a c o v e r s e m o s u n c o v e r e s à : ( ) f h l f f d ¤ i t t t B 6 o r e a c o r p e s o o r m s s s : i j ; d d d i i t t 7 r e m o v e a n s c o r r e s p o n n g e g e s : / * * / ¤ ¤ \ k d " P S N P i i i t t t t 8 u s n o a g s e s s p c e : ( ) ( ) f h l f f d ¤ ¤ i t t t B B 9 o r e a c o r p e s o o r m s s s o r s s s : i j i j ; ; i f k d h i i t 1 0 e n s s p c e : i d S N P i t t t 1 1 p u s n o c o v e r e s e : j d d d i i t t 1 2 r e m o v e a n s c o r r e s p o n n g e g e s : l 1 3 e s e : ( 0 ) ( 0 ) l l l f f i t t B B 1 4 r e m o v e a r p e s o o r m s s s o r s s s : i j i j ; ; My Program: MMTagger Pick a SNP Data structure
Our Approach ( ) T Complexity • Computing r2 value • O(nk+1) for k-marker • Picking tag SNPs • where T is the number of relationships • O(T log T) time algorithm
Result Result • Our program: MMTagger • Vs. Single-marker approach (LRTag) • A state-of-the-art program • Single-marker • Vs. Hao’s program (MultiTag) • Multi-marker
Result Vs. Single-Marker Approach
Result MMTagger Vs. MultiTag
Conclusion • We provide a new multi-marker model • Size of tag SNP set • 2- vs. 1-marker: apparently better • 3- vs. 2-marker: slightly better • 4-marker or more: slow, unacceptable • Performance • Our program outperforms the only other program with similar model