Linear Reduction Method for Tag SNPs Selection
Jingwu He, Alex Zelikovsky
Outline
• SNPs, haplotypes and genotypes
• Haplotype tagging problem
• Linear reduction method for tagging
• Maximizing tagging separability
• Conclusions & future work
Human Genome and SNPs
• Length of the human genome ≈ 3 × 10^9 base pairs
• Difference between any two people ≈ 0.1% of the genome ≈ 3 × 10^6 SNPs
• Total number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
• SNPs are mostly bi-allelic, e.g., alleles A and C
• Minor allele frequency should be considerable, e.g., > 1%
• Diploid = two different copies of each chromosome
• Haplotype = description of a single copy (0, 1)
• Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
Two haplotypes per individual: 0 1 1 1 0 0 1 1 0 and 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
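The genotype encoding above (0 = 00, 1 = 11, 2 = 01) can be illustrated with a minimal Python sketch. The function name `genotype` and the two nine-SNP example haplotypes are illustrative choices, not part of the original slides.

```python
def genotype(hap_a, hap_b):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (0=00, 1=11, 2=01)."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Homozygous sites keep the shared allele; heterozygous sites become 2.
    return [a if a == b else 2 for a, b in zip(hap_a, hap_b)]


# Two example haplotypes of one individual (illustrative values).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
g = genotype(h1, h2)
print(g)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Note that the mapping loses phase information: a 2 says the two copies differ at that site, but not which copy carries which allele.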
Haplotype and Disease Association
• Haplotypes/genotypes define our individuality
• Genetically engineered athletes might win at the Beijing Olympics (Time, 07/2004)
• Haplotypes contribute to risk factors of complex diseases (e.g., diabetes)
• International HapMap project: http://www.hapmap.org
• Disease-causing SNPs are hidden among 10 million SNPs
• Searching all of them is too expensive
• HapMap tries to identify 1 million tag SNPs that provide almost as much mapping information as the entire 10 million SNPs
Tagging Reduces Cost
• Decrease SNP haplotyping cost:
  • sequence only a small number of SNPs (the tag SNPs)
  • infer the rest of the (certain) SNPs from the sequenced tag SNPs
• Number of SNPs: m; number of tags: k
• Cost-saving ratio = m / k (infinite population)
• Traditional tagging based on linkage disequilibrium (LD) needs too many SNPs; its cost-saving ratio is too small (≈ 2)
• Proposed linear reduction method: cost-saving ratio ≈ 20
Haplotype Tagging Problem
• Given the full pattern of all SNPs for a sample
• Find the minimum number of tag SNPs that allows the complete haplotype of each individual to be reconstructed
Linear Rank of Recombinations
• Human haplotype evolution =
  • Mutations: introduce SNPs
  • Recombinations: propagate SNPs over the entire population
• Replace the notation (0, 1) with (-1, 1)
• Theorem: A haplotype population generated from l haplotypes with recombinations at k spots has linear rank (l-1)(k+2)
• This is much less than the number of all haplotypes, l^k
• Conclusion: use only linearly independent SNPs as tags
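The rank statement can be checked on a toy population. The sketch below is an illustration under stated assumptions, not the authors' code: it reads the theorem as an upper bound on the rank, models recombination as a mosaic in which each segment between consecutive spots is copied from one founder, and uses three made-up founder haplotypes in (-1, 1) notation. The helper names `rank` and `recombinants` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def rank(rows):
    """Rank of a matrix over the rationals via Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def recombinants(founders, spots):
    """All mosaics: each segment between consecutive spots copies one founder."""
    n = len(founders[0])
    bounds = [0] + sorted(spots) + [n]
    segments = list(zip(bounds, bounds[1:]))
    population = []
    for choice in product(range(len(founders)), repeat=len(segments)):
        hap = []
        for (lo, hi), f in zip(segments, choice):
            hap.extend(founders[f][lo:hi])
        population.append(hap)
    return population


# Assumed toy data: l = 3 founders over 12 SNPs, k = 2 recombination spots.
founders = [
    [1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1],
    [-1, 1, 1, -1, -1, 1, 1, 1, -1, 1, 1, -1],
    [1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1],
]
spots = [4, 8]
l, k = len(founders), len(spots)
pop = recombinants(founders, spots)
print(len(pop), rank(pop), (l - 1) * (k + 2))
```

In this model the population is much larger than its rank, which is the property the tagging method relies on: only a small set of linearly independent SNP columns is needed to span the rest.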
Tag SNPs Selection
• Tag selection algorithm:
  • Using Gauss-Jordan elimination, find the row reduced echelon form (RREF) X of the sample matrix S
  • Extract the basis T of the sample S
  • Factorize the sample as S = T X
  • Output the set of tags T
• Fact: in the sample, each SNP is a linear combination of tag SNPs
• Conjecture: in the entire population, each SNP is the same linear combination of tags as in the sample
(Diagram: sample S = tags T × RREF X)
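The selection steps can be sketched in plain Python with exact rational arithmetic. This is a minimal illustration, assuming the sample matrix S has one row per haplotype and one column per SNP in (-1, 1) notation; the helper names `rref`, `select_tags` and `matmul` and the 4 × 5 sample are made up for the example.

```python
from fractions import Fraction


def rref(rows):
    """Gauss-Jordan elimination over the rationals.

    Returns (nonzero rows of the RREF, list of pivot column indices).
    """
    m = [[Fraction(x) for x in row] for row in rows]
    pivots = []
    r = 0
    for c in range(len(m[0])):
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        lead = m[r][c]
        m[r] = [x / lead for x in m[r]]  # scale the pivot to 1
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    return m[:r], pivots


def select_tags(sample):
    """Return (tag SNP indices, basis T, RREF X) such that sample = T X."""
    X, tags = rref(sample)
    T = [[row[c] for c in tags] for row in sample]  # pivot columns of the sample
    return tags, T, X


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Assumed toy sample: 4 haplotypes (rows) by 5 SNPs (columns).
S = [
    [1, 1, -1, 1, -1],
    [1, -1, -1, -1, 1],
    [-1, 1, 1, 1, -1],
    [-1, -1, 1, -1, 1],
]
tags, T, X = select_tags(S)
print(tags)               # [0, 1]
print(matmul(T, X) == S)  # True
```

The pivot columns of the RREF identify the tag SNPs, and each non-tag column of X holds the coefficients that express that SNP as a linear combination of the tags.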
Haplotype Reconstruction
• Given the tags t of an unknown haplotype h and the RREF X of the sample matrix S
• Find the unknown haplotype h
• Predict h’ = t X
• Errors are possible, since the predicted h’ may not equal the unknown haplotype h. We assign -1 if a predicted value is negative and +1 otherwise (RLRP)
• Variant: randomly reshuffle SNPs before choosing tags (RLR)
(Diagram: tag set of the unknown haplotype h × RREF X = predicted haplotype h’)
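The prediction step can be sketched as follows. The matrix X and the tag indices below are a small hand-written example standing in for the RREF of some sample; they are assumptions for illustration, as is the function name `predict`.

```python
def predict(tag_values, X):
    """Compute h' = t X, then round each entry to -1 (negative) or +1 (otherwise)."""
    raw = [sum(t * x for t, x in zip(tag_values, col)) for col in zip(*X)]
    return [-1 if v < 0 else 1 for v in raw]


# Assumed RREF of a sample with 5 SNPs and 2 tags (pivot columns 0 and 1).
X = [
    [1, 0, -1, 0, 0],
    [0, 1, 0, 1, -1],
]
tags = [0, 1]


def count_errors(h):
    """Predict h from its tag values only and count mismatched SNPs."""
    t = [h[c] for c in tags]
    h_pred = predict(t, X)
    return sum(p != a for p, a in zip(h_pred, h))


h_in_span = [1, -1, -1, -1, 1]  # follows the same linear relations as the sample
h_outside = [1, 1, 1, 1, -1]    # SNP 2 breaks the relation "SNP 2 = -SNP 0"
print(count_errors(h_in_span))  # 0
print(count_errors(h_outside))  # 1
```

The second haplotype shows where prediction errors come from: if an unseen haplotype does not obey the linear relations learned from the sample, the affected SNPs are predicted incorrectly.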
Results for Simulated Data
• Cost-saving ratio for 2% error: 3.9 for LR and 13 for RLRP
• P = 1000 different haplotypes
• m = 25000 sites
• Sample size = k (number of tag SNPs) = 50, 100, …, 750
Results for Real Data
• Cost-saving ratio for 5% error: 2.1 for LR and 2.8 for RLRP
• P = 158 different haplotypes (Daly et al.)
• m = 103 sites
• Sample size = k (number of tag SNPs) = 10, 15, 20, …, 90
Tag Separability
• The number of zeros in an SNP's column of the RREF X correlates with the number of errors in predicting that column
• A greedy heuristic gives a more separable basis: for 5% error, the cost-saving ratio is 3.3, versus 2.8 for RLRP
Conclusions and Future Work
• Our contributions:
  • new SNP tagging problem formulation
  • linear reduction method for SNP tagging
  • enhancement of linear reduction using a separable basis
• Future work:
  • application of tagging to genotype and haplotype disease association