Our contributions
• A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
• Multi-SNP combinations significantly associated with diseases were found.
• For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
• For an autoimmune disorder dataset (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
• For tick-borne encephalitis virus-induced disease, a multi-SNP combination significantly associated with the severity of the disease was found within a group of genes showing a high degree of linkage disequilibrium.
• Model-fitting disease susceptibility prediction methods based on the developed search methods were proposed.

SNP and Disease
• SNP: single nucleotide polymorphism, a site where two or more different nucleotides occur in a large percentage of the population
• 0 = wild type/major (frequency) allele
• 1 = mutation/minor (frequency) allele
• 2 = heterozygous allele
• Searching for genetic risk factors for diseases
• Monogenic diseases: a mutated gene is entirely responsible for the disease
• Complex diseases: affected by the interaction of multiple genes
• Significance of a risk factor is usually measured by Risk Rate or Odds Ratio
• We measure significance by the p-value of the set of genotypes defined by the risk factor

MSC
MSC:       x x 1 x x 2 x x x
Genotypes: 0 1 1 0 1 2 1 0 2   sick
           0 1 1 1 0 2 0 0 1   sick
           0 0 1 0 0 0 0 2 1   sick
           0 1 1 1 1 2 0 0 1   sick
           0 0 1 0 1 2 1 0 2   sick
           0 1 0 0 1 1 0 0 2   healthy
           0 1 1 0 1 2 0 0 2   healthy
Genotypes selected by the MSC: 4 sick : 1 healthy (check significance)

Statistical significance
• A multi-SNP combination (MSC) defines a set of case and control individuals
• An MSC is considered statistically significant if the distribution of cases and controls in that set has a p-value < 0.05
• Many reported findings are not reproducible on different populations. It is believed that this happens because the p-values are not adjusted for multiple testing

Disease-Associated Multi-SNP Combinations Search

Disease association analysis
• Given: a population of n genotypes (or haplotypes), each containing values of m SNPs from {0,1,2}, and a disease status (case or control)
• Find: all multi-SNP combinations whose multiple-testing-adjusted p-value of the frequency distribution is below 0.05
• Analysis of variation in suspected genes in case and control individuals is aimed at identifying SNPs with considerably higher frequencies among the case individuals than among the control individuals
• Most searches are done on a SNP-by-SNP basis
• Recently, two-SNP analysis has shown promising results (Marchini et al., 2005)
• Multi-SNP analyses are expected to find even stronger disease associations
• Common diseases can be caused by combinations of several unlinked gene (SNP) variations
• We address the computational challenge of searching for such multi-gene causal combinations
• The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs).
• How to find associated multi-SNP combinations without checking all of them?
• Disease association analysis searches for SNPs or multi-SNP combinations with frequency among cases considerably higher than among controls.
• If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).
• Bonferroni is too crude (e.g., 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6)
• We adjust the resulting p-values via randomization
• Unadjusted p-value: probability of the case/control distribution in the set defined by an MSC, computed from the binomial distribution
• Multiple-testing-adjusted p-value: randomization
• Randomly permute the disease status of the population to generate 10000 instances.
• Apply searching methods on each instance to get MSCs.
• Compute the fraction of instances in which the MSCs found reach an unadjusted p-value at least as low as the observed p-value.
• In our search we report only MSCs with adjusted p-value < 0.05

Combinatorial Search (CS) for Disease Association
• Checks all one-SNP, two-SNP, ..., m-SNP case-closed MSCs
• The case-closure of an MSC C is an MSC C’ with the maximum number of SNPs that contains the same set of cases and the minimum number of controls.
• Case-closure allows statistically significant MSCs to be found at an earlier stage of the search.
• Trivial MSCs and MSCs that coincide after case-closure are avoided, which significantly speeds up the search.
• Faster than exhaustive search
• Finds more significant associations at an early stage of the search
• Still slow for genome-wide studies

Clustering-based Model-Fitting Algorithm for Disease Susceptibility Prediction
• For the given training dataset and tested genotype, consider two cases:
  • the tested genotype is added to the training dataset as sick
  • the tested genotype is added to the training dataset as healthy
• For both cases, obtain a clustering by applying CGS to find:
  • the most disease-associated MSC (defines a set of sick genotypes)
  • the most disease-resistant MSC (defines a set of healthy genotypes)
• Remove from the dataset whichever of the two sets is larger
• Repeat this procedure until all genotypes are removed
• Predict the susceptibility of the tested genotype according to the case with the lower clustering entropy.

Results for Disease Susceptibility Prediction
• Quality measure

Maximum Case(Control)-Free Cluster Problem
• Find a maximum-size cluster C containing only cases or only controls
• Complimentary Greedy Search (CGS):
1. Find the SNP and allele value whose selection removes the set of genotypes with the highest ratio of controls to cases.
2. Add the SNP to the resulting MSC.
3. Repeat 1-2 until all controls are removed. The resulting MSC defines a subset of sick genotypes.
4. Adjust the p-value of the resulting MSC for multiple testing.
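Steps 1-3 of the greedy search above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `greedy_case_cluster`, the `+1` smoothing of the controls-to-cases ratio, the tie-breaking by number of controls removed, and the rule that a choice must leave at least one case behind are all assumptions made here for illustration. Step 4 (multiple-testing adjustment) is not covered.

```python
from typing import List, Sequence, Tuple


def greedy_case_cluster(
    genotypes: Sequence[Sequence[int]], is_case: Sequence[bool]
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Greedy sketch of steps 1-3: returns (msc, indices of selected genotypes)."""
    remaining = list(range(len(genotypes)))
    msc: List[Tuple[int, int]] = []  # (SNP index, required value) pairs
    n_snps = len(genotypes[0])
    while any(not is_case[i] for i in remaining):
        best = None
        for j in range(n_snps):
            for v in (0, 1, 2):
                kept = [i for i in remaining if genotypes[i][j] == v]
                removed = [i for i in remaining if genotypes[i][j] != v]
                controls = sum(1 for i in removed if not is_case[i])
                cases = sum(1 for i in removed if is_case[i])
                # Skip choices that remove no control or leave no case behind.
                if controls == 0 or not any(is_case[i] for i in kept):
                    continue
                # Assumption: +1 smoothing so that removing zero cases is allowed.
                key = (controls / (cases + 1), controls)
                if best is None or key > best[0]:
                    best = (key, j, v, kept)
        if best is None:
            break  # no remaining choice separates controls from cases
        _, j, v, kept = best
        msc.append((j, v))
        remaining = kept
    return msc, remaining


# Six of the example genotypes shown earlier (True = sick, False = healthy).
GENOTYPES = [
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
IS_CASE = [True, True, True, True, False, False]

msc, cluster = greedy_case_cluster(GENOTYPES, IS_CASE)
print("MSC constraints (SNP index, value):", msc)
print("Selected genotype indices:", cluster)
```

Each pass scans every (SNP, value) pair, so one iteration costs on the order of n·m genotype lookups. If the loop stops early because no choice separates the remaining controls from the cases, the returned cluster may still contain controls, and the caller should check for that.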
• Leave-one-out cross-validation results

Data Sets
• [3] Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
• [10] Autoimmune disorder: 1024 genotypes with 108 SNPs covering the genes CD28, CTLA4, and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
• [4] Tick-borne encephalitis: 75 genotypes with 41 SNPs covering the genes TLR3, PKR, OAS1, OAS2, and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).

Disease Susceptibility Prediction Problem
• Given a sample population S (a training set) and one more individual t ∉ S with known SNPs but unknown disease status (the testing individual), find (predict) the unknown disease status
• Disease Clustering Problem: given a population sample S, find a partition P of S into clusters S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P) for a given bound on the number of individuals who are assigned incorrect status in clusters of the partition P, error(P) < *|P|.
• Comparison of 5 prediction methods on the [4] data using all SNPs: the area under the CSP’s ROC curve is 0.87 vs 0.52 under the SVM’s curve

Results/comparison of searching methods
• Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest PPV.
• Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before
• Complimentary greedy search can be used in susceptibility prediction
• Optimization approach to prediction
• The new susceptibility prediction is 8% higher than the best previously known
• MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility
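The significance claims above rest on the randomization-based multiple-testing adjustment described earlier, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unadjusted p-value is taken to be a one-sided upper binomial tail with the overall case fraction as success probability, and the "search method" is a stand-in that checks one-SNP MSCs only. The names `binomial_tail`, `best_single_snp_pvalue`, and `adjusted_pvalue` are hypothetical.

```python
import random
from math import comb


def binomial_tail(k, n, q):
    """P[X >= k] for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))


def best_single_snp_pvalue(genotypes, is_case):
    """Stand-in search: smallest unadjusted p-value over all one-SNP MSCs."""
    q = sum(is_case) / len(is_case)  # overall fraction of cases
    best = 1.0
    for j in range(len(genotypes[0])):
        for v in (0, 1, 2):
            cluster = [c for g, c in zip(genotypes, is_case) if g[j] == v]
            if cluster:
                best = min(best, binomial_tail(sum(cluster), len(cluster), q))
    return best


def adjusted_pvalue(genotypes, is_case, permutations=1000, seed=0):
    """Fraction of label permutations whose best p-value is at least as low as observed."""
    rng = random.Random(seed)
    observed = best_single_snp_pvalue(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(labels)  # randomly permute the disease status
        # Small tolerance guards against floating-point ties.
        if best_single_snp_pvalue(genotypes, labels) <= observed + 1e-12:
            hits += 1
    return hits / permutations


# Six of the example genotypes shown earlier (True = sick, False = healthy).
GENOTYPES = [
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
IS_CASE = [True, True, True, True, False, False]

print("unadjusted:", best_single_snp_pvalue(GENOTYPES, IS_CASE))
print("adjusted:  ", adjusted_pvalue(GENOTYPES, IS_CASE, permutations=200))
```

Because the whole search is rerun on every permuted instance, the adjustment accounts for how many MSCs the search examined. Swapping in a multi-SNP search only requires replacing `best_single_snp_pvalue`.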