Combinatorial Methods for Disease Association Search and Susceptibility Prediction

1. Combinatorial Methods for Disease Association Search and Susceptibility Prediction
Alexander Zelikovsky, joint work with Dumitru Brinza, Department of Computer Science
First of all, I would like to thank Dr. Srinivas for giving me the opportunity to make a presentation here at Cal Poly Pomona, and thank you, everybody, for coming to listen to my talk. The topic of my presentation is combinatorial methods for computational problems in genetic epidemiology.

2. Outline
SNPs, haplotypes and genotypes
Disease association search: genome-wide association search challenges; problem formulation; exhaustive and combinatorial search; optimization formulation and complimentary greedy search
Predicting susceptibility to complex diseases: problem formulation and cross-validation; previous methods (SVM, RF, LP); optimum clustering and prediction via model-fitting
Conclusions
In the first section I want to introduce some basics of human genetics and some terminology, such as SNPs, haplotypes and genotypes. Then I will talk about phasing, which is a very important computational problem. In the first section I also want to give an overview of genetic epidemiology. In the second section I will talk about my research, including phasing for family trios and genetic susceptibility to complex diseases. The last section is conclusions and future plans.

3. SNPs, Haplotypes, Genotypes
Length of the human genome ≈ 3 × 10^9 base pairs
Number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
SNPs are mostly biallelic, e.g., A↔C
Minor allele frequency should be considerable, e.g., > 0.1%
Difference between all people ≈ 0.25% (between any two ≈ 0.1%)
Diploid = two different copies of each chromosome
Haplotype = description of a single copy (expensive); example: 00110101 (0 is the major allele, 1 is the minor allele)
Genotype = description of the two copies mixed; example: 01122110 (0 = 00, 1 = 11, 2 = 01)
International HapMap Project: www.hapmap.org
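To make the 0/1/2 genotype encoding on this slide concrete, here is a minimal sketch of how two haplotypes combine into one genotype. The function name and the example haplotypes are made up for illustration; only the encoding (0 = 00, 1 = 11, 2 = 01) comes from the slide.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two haplotype strings over '0'/'1' into a genotype string.

    Encoding from the slide: 0 = both copies major (00),
    1 = both copies minor (11), 2 = heterozygous (01 or 10).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their symbol; different alleles become '2'.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Hypothetical pair of haplotypes over 8 SNPs.
g = genotype_from_haplotypes("00110101", "01100111")
print(g)
```

Going the other way, from a genotype back to its two haplotypes, is the phasing problem: every '2' leaves open which copy carries the minor allele.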

4. Challenges of Disease Association
Monogenic disease: a mutated gene is entirely responsible for the disease. Typically rare in the population: < 0.1%.
Complex disease: interaction of multiple genes. A 2-SNP interaction analysis for a genome-wide scan with 1 million SNPs (3 kb coverage) has 10^12 pairwise tests. Multiple independent causes; each cause explains < 10-20% of cases. Common: > 0.1%. In New York City, 12% of the population has type 2 diabetes.
Multiple-testing adjustment. Reason for non-reproducible findings.
In fact, phasing is a preliminary step for genetic epidemiology studies. Genetic epidemiology searches for genetic risk factors for diseases. These factors can be SNPs, genotypes or haplotypes. If the disease is caused by a single gene, the disease is monogenic, and it occurs very rarely, in less than 0.1% of the population. If the disease is caused by the interaction of multiple genes, the disease is complex. Complex diseases occur in more than 0.1% of the population, so they are also called common diseases. In New York City, 12% of the population has type 2 diabetes. The significance of a risk factor is measured by its risk rate. For example, everybody knows that smoking is a risk factor for lung cancer; smoking is an environmental factor for the disease. To evaluate smoking as a risk factor, we compare the chance of cancer among smokers with the chance of cancer among non-smokers; the ratio is the risk rate. The higher the risk rate, the more confident we are about the risk factor.
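As a quick check on the size of the 2-SNP search space, the sketch below counts SNP pairs for 1 million SNPs. The variable names are illustrative. The count of unordered pairs comes out at about half of the 10^12 on the slide, which matches the number of ordered pairs; either way the order of magnitude is the point.

```python
import math

num_snps = 1_000_000

# Unordered pairs of distinct SNPs: n * (n - 1) / 2.
unordered_pairs = math.comb(num_snps, 2)

# Ordered pairs (n * n), which gives the round 10^12 figure.
ordered_pairs = num_snps ** 2

print(unordered_pairs)  # roughly 5 * 10^11
print(ordered_pairs)    # exactly 10^12
```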

5. Disease Association Search Problem
Genetic susceptibility to complex diseases is a key problem in epidemiology. Here is the problem formulation. Given the genotypes of sick and healthy persons, and a testing person's genotype, find the disease status of the testing person. In epidemiology, the sick and healthy persons are classified as case and control. In computation, they are represented as 1 and -1, respectively. You may have heard of case/control studies, which is what I'm doing.

6. Significance of Risk/Resistance Factors
Measured by relative risk (RR), odds ratio (OR), and their p-values.
Unadjusted p-value: probability of the case/control distribution among those exposed to the risk factor, computed from the binomial distribution.
Multiple-testing adjustment:
Bonferroni: easy to compute, overly conservative
Randomization: computationally expensive, more accurate
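The measures on this slide can be sketched from a 2×2 exposure table. This is a sketch under stated assumptions, not the authors' code: the counts are invented, the function names are illustrative, and the p-value is taken as an upper binomial tail with the overall case frequency as the null probability, which is one plausible reading of "computed from the binomial distribution".

```python
from math import comb


def relative_risk(a, b, c, d):
    """a, b = exposed cases/controls; c, d = unexposed cases/controls."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(p_value, num_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * num_tests)


# Invented counts: 30 of 40 exposed are cases, 20 of 60 unexposed are cases.
a, b, c, d = 30, 10, 20, 40
rr = relative_risk(a, b, c, d)
oratio = odds_ratio(a, b, c, d)

# Assumed null: an exposed person is a case with the overall case frequency.
p0 = (a + c) / (a + b + c + d)
p_unadjusted = binomial_upper_tail(a, a + b, p0)
p_adjusted = bonferroni(p_unadjusted, 10**6)
print(rr, oratio, p_unadjusted, p_adjusted)
```

With a million tests the adjusted value here saturates at 1, which illustrates why Bonferroni is called overly conservative on the slide.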

7. Exhaustive & Combinatorial Search
Exhaustive search is infeasible: a sample with n genotypes and m SNPs requires O(n·3^m).
Combinatorial search. Definition: the disease-closure of a multi-SNP combination (MSC) C is a multi-SNP combination C′ with the maximum number of SNPs that contains the same set of diseased individuals and the minimum number of non-diseased individuals.
Searches only closed clusters: the closure of cluster C is C′ with d(C′) = d(C) and h(C′) minimized, where d and h count the diseased and non-diseased individuals in a cluster.
Avoids checking trivial MSCs; small d(C) implies not looking in subclusters.
Finds associated MSCs faster, but is still too slow.
Tagging: compress S by extracting the most informative SNPs; restore the other SNPs from the tag SNPs (multiple regression method).
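The disease-closure definition can be illustrated with a small sketch. The data layout is an assumption made for the example: genotypes are tuples over {0, 1, 2}, an MSC is a dict from SNP index to required value, and `diseased` is a list of booleans. The closure is built by adding every SNP on which all diseased members of the cluster agree, which keeps the diseased set unchanged while removing as many non-diseased members as possible.

```python
def cluster(msc, genotypes):
    """Indices of individuals matching every (SNP, value) pair of the MSC."""
    return [i for i, g in enumerate(genotypes)
            if all(g[s] == v for s, v in msc.items())]


def disease_closure(msc, genotypes, diseased):
    """Largest MSC whose cluster has the same diseased members as msc's."""
    cases = [i for i in cluster(msc, genotypes) if diseased[i]]
    closed = dict(msc)
    if not cases:
        return closed  # nothing to preserve; leave the MSC unchanged
    for s in range(len(genotypes[0])):
        values = {genotypes[i][s] for i in cases}
        if len(values) == 1:  # all diseased members agree on SNP s
            closed[s] = values.pop()
    return closed


# Toy sample: 2 diseased and 3 non-diseased individuals, 3 SNPs.
genotypes = [(0, 1, 2), (0, 1, 0), (0, 2, 2), (0, 1, 1), (1, 1, 2)]
diseased = [True, True, False, False, False]

c = {0: 0}                                   # d(C) = 2, h(C) = 2
c_closed = disease_closure(c, genotypes, diseased)
print(c_closed, cluster(c_closed, genotypes))
```

In this toy sample the closure adds SNP 1 and drops one non-diseased individual, so d stays at 2 while h falls from 2 to 1.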

8. MLR Tagging

9. Data Sets
Crohn's disease (Daly et al.), an inflammatory bowel disease (IBD). Location: 5q31. Number of SNPs: 103. Population size: 387 (case: 144, control: 243).
Autoimmune disorders (Ueda et al.). Location: region containing genes CD28, CTLA4 and ICOS. Number of SNPs: 108. Population size: 1024 (case: 378, control: 646).
Tick-borne encephalitis (Barkash et al.). Location: region containing genes TLR3, PKR, OAS1, OAS2 and OAS3. Number of SNPs: 41. Population size: 75 (case: 21, control: 54).

10. Disease association search results
IES(30): exhaustive search on 30 indexed SNPs chosen by the MLR-based tagging method.
ICS(30): combinatorial search on 30 indexed SNPs chosen by the MLR-based tagging method.

11. Disease Association Search
Optimum Association Search Problem: find the MSC that is most associated with the disease.
Measure: positive predictive value, i.e., find the (non-)diseased-free cluster of maximum size.
Bad news: this generalizes maximum independent set, so it is NP-complete and cannot be well approximated.
Hope: the sample S is not arbitrary.

12. Complimentary Greedy Algorithm
Algorithm:
Start with C = S (resp. the MSC is empty).
Repeat until h(C) = 0 (C is non-diseased-free):
Find the 1-SC s maximizing (h(C) − h(C ∩ {s})) / (d(C) − d(C ∩ {s})), i.e., minimizing the payment in diseased individuals for the removal of non-diseased individuals.
Add s to the SNPs of C's MSC.
Analogy: finding an independent set by greedily removing the highest-degree vertices.
Extremely fast but inaccurate.
Can be used in susceptibility prediction.
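The greedy loop above can be sketched as follows. Several details are assumptions the slide does not spell out: a 1-SC is taken to be a single (SNP, value) constraint, a zero denominator with a positive numerator counts as the best possible ratio, candidates that remove no non-diseased individual or lose every diseased one are skipped, and ties go to the first candidate found. The function name and toy data are illustrative.

```python
def complimentary_greedy(genotypes, diseased):
    """Greedily grow an MSC until its cluster holds no non-diseased member."""
    members = set(range(len(genotypes)))   # C = S
    msc = {}                               # the MSC starts empty

    def counts(group):
        d = sum(1 for i in group if diseased[i])
        return d, len(group) - d           # d(C), h(C)

    while True:
        d_c, h_c = counts(members)
        if h_c == 0:
            break
        best = None                        # (ratio, snp, value, new cluster)
        for s in range(len(genotypes[0])):
            if s in msc:
                continue
            for v in (0, 1, 2):
                sub = {i for i in members if genotypes[i][s] == v}
                d_s, h_s = counts(sub)
                removed_h, removed_d = h_c - h_s, d_c - d_s
                if removed_h == 0 or d_s == 0:
                    continue               # no progress, or no diseased left
                ratio = float("inf") if removed_d == 0 else removed_h / removed_d
                if best is None or ratio > best[0]:
                    best = (ratio, s, v, sub)
        if best is None:
            break                          # no usable 1-SC remains
        _, s, v, members = best
        msc[s] = v
    return msc, sorted(members)


genotypes = [(0, 1, 2), (0, 1, 0), (0, 2, 2), (0, 1, 1), (1, 1, 2)]
diseased = [True, True, False, False, False]
msc, members = complimentary_greedy(genotypes, diseased)
print(msc, members)
```

Each round scans every remaining (SNP, value) pair once, which is why the method is fast; committing to the locally best constraint is also why it can miss the best cluster.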

13. Most disease-associated & disease-resistant MSC
Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest PPV. The starred values refer to results of the runtime-constrained exhaustive search.

14. Genetic Susceptibility Prediction
Given: genotypes of diseased and non-diseased individuals, and the genotype of a testing person.
Find: the disease status of the testing person.
Genetic susceptibility to complex diseases is a key problem in epidemiology. Here is the problem formulation. Given the genotypes of sick and healthy persons, and a testing person's genotype, find the disease status of the testing person. In epidemiology, the sick and healthy persons are classified as case and control. In computation, they are represented as 1 and -1, respectively. You may have heard of case/control studies, which is what I'm doing.

15. Cross-validation
Leave-one-out test: the disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.
The cross-validation method is used to compute the four numbers TP, TN, FP and FN. Cross-validation includes the leave-one-out test and the leave-many-out test. In the leave-one-out test, the disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set. Let's see the example. This is the data set. One genotype is left out, and the others are regarded as the training set. Based on the information in the training set, we predict the disease status of the left-out genotype. We repeat this until all genotypes in the data set have been tested. Then we compare the real disease status with the predicted status to get the accuracy. Leave-many-out is another kind of cross-validation test. In leave-many-out tests, we use 2/3 of the data set as the training set and predict the disease status of the rest. We repeat the process many times and take the average prediction rate.
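The leave-one-out loop can be sketched generically. The predictor here is a stand-in chosen only to make the example runnable (nearest neighbour by Hamming distance); it is not one of the prediction methods from the talk, and the data and names are invented.

```python
def leave_one_out(genotypes, statuses, predict):
    """Return (TP, TN, FP, FN); statuses are 1 (case) or -1 (control)."""
    tp = tn = fp = fn = 0
    for i, (g, actual) in enumerate(zip(genotypes, statuses)):
        # Everything except individual i is the training set.
        train_g = genotypes[:i] + genotypes[i + 1:]
        train_s = statuses[:i] + statuses[i + 1:]
        predicted = predict(train_g, train_s, g)
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == -1 and actual == -1:
            tn += 1
        elif predicted == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def nearest_neighbour(train_g, train_s, g):
    """Placeholder predictor: status of the closest training genotype."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    best = min(range(len(train_g)), key=lambda j: hamming(train_g[j], g))
    return train_s[best]


genotypes = [(0, 0, 1), (0, 0, 2), (2, 1, 0), (2, 1, 1)]
statuses = [1, 1, -1, -1]
result = leave_one_out(genotypes, statuses, nearest_neighbour)
print(result)
```

A leave-many-out variant would replace the single held-out index with a random third of the data and average over many repetitions.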

16. Quality Measures of Prediction (confusion table)
Sensitivity: the ability to correctly detect disease. sensitivity = TP/(TP+FN)
Specificity: the ability to avoid calling normal as disease. specificity = TN/(FP+TN)
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Risk rate: measurement for risk factors.
To measure the quality of prediction methods, we need to compare the prediction result with the original disease status of the population. In general, a population of tested individuals may be divided into four groups. The first column is the sick population and the second column is the healthy population. The first row is the population that is predicted as sick and the second row is the population that is predicted as healthy. The prediction can produce two kinds of errors: a false positive result or a false negative result. False positive means you are predicted sick but you are healthy. False negative means you are predicted healthy but you are sick. True positives and true negatives are correctly predicted. Sensitivity is the ability to correctly detect a disease. Specificity is the ability to avoid calling normal as disease. Accuracy is the percentage of the population that is correctly predicted.
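The three formulas translate directly into code. The counts in the example are invented purely to exercise the formulas; they are not results from the talk.

```python
def quality_measures(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from a confusion table."""
    return {
        "sensitivity": tp / (tp + fn),            # detected among the sick
        "specificity": tn / (fp + tn),            # cleared among the healthy
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Invented confusion table: 50 sick and 50 healthy individuals.
q = quality_measures(tp=40, tn=45, fp=5, fn=10)
print(q)
```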

17. Prediction Methods
Support vector machine
Random forest
LP-based prediction

18. Prediction via Clustering
Drawback of the prediction problem formulation: it needs cross-validation ⇒ no optimization.
Clustering: P = partition into clusters defined by MSCs, minimizing the number of errors subject to bounded information entropy −Σ (|S_i|/|S|) log(|S_i|/|S|).
Model-fitting prediction:
Set the status of the testing genotype to diseased; find the number of errors.
Set the status of the testing genotype to non-diseased; find the number of errors.
Predict the status that implies the smaller number of errors.
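Two pieces of this slide lend themselves to a short sketch: the entropy of a partition and the model-fitting decision rule. The slide does not say how the errors of the optimum clustering are computed, so `count_errors` is a hypothetical callable supplied by the caller; the toy version below (minority count among identical genotypes) exists only to make the example run. The natural logarithm and the tie-break toward non-diseased are also assumptions.

```python
from collections import Counter, defaultdict
from math import log


def partition_entropy(cluster_sizes):
    """-sum (|S_i|/|S|) * log(|S_i|/|S|) over the clusters of a partition."""
    total = sum(cluster_sizes)
    return -sum((s / total) * log(s / total) for s in cluster_sizes if s > 0)


def model_fitting_predict(train, test_genotype, count_errors):
    """Try both statuses for the test genotype; keep the one with fewer errors."""
    errors_if_diseased = count_errors(train + [(test_genotype, 1)])
    errors_if_healthy = count_errors(train + [(test_genotype, -1)])
    return 1 if errors_if_diseased < errors_if_healthy else -1


def toy_count_errors(sample):
    """Stand-in error count: minority statuses among identical genotypes."""
    groups = defaultdict(Counter)
    for genotype, status in sample:
        groups[genotype][status] += 1
    return sum(min(c[1], c[-1]) for c in groups.values())


train = [((0, 1), 1), ((0, 1), 1), ((2, 2), -1)]
prediction = model_fitting_predict(train, (0, 1), toy_count_errors)
print(prediction, partition_entropy([5, 5]))
```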

19. Leave-1-out cross-validation results
Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.

20. ROC curve
Comparison of 5 prediction methods on the (Barkash et al., 2006) data on all SNPs. The area under the CSP's curve is 0.81 vs. 0.52 under the SVM's curve.

21. Conclusions
Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before.
Complimentary greedy search can be used in susceptibility prediction.
Optimization approach to prediction.
The new susceptibility prediction rate is 8% higher than the best previously known.
MLR tagging efficiently reduces the data sets, making it possible to find associated multi-SNP combinations and predict susceptibility.

24. Support Vector Machine (SVM) Algorithm
Learning task. Given: genotypes of patients and healthy persons. Compute: a model distinguishing whether a person has the disease.
Classification task. Given: the genotype of a new patient and a learned model. Determine: whether the patient has the disease or not.

25. Random Forest Algorithm
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each tree in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Growing trees, split selection and prediction.
Random sub-sample of the training data, random splitter selection.
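The voting step described above is simple enough to sketch on its own. Tree growing and random split selection are left out; the three "trees" below are hand-written one-SNP rules invented for illustration, not trained classifiers.

```python
from collections import Counter


def forest_predict(trees, genotype):
    """Each tree votes for a class; the class with the most votes wins."""
    votes = Counter(tree(genotype) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each looks at a single SNP.
trees = [
    lambda g: 1 if g[0] == 2 else -1,
    lambda g: 1 if g[1] >= 1 else -1,
    lambda g: 1 if g[2] == 0 else -1,
]

vote_a = forest_predict(trees, (2, 1, 1))   # votes: 1, 1, -1
vote_b = forest_predict(trees, (0, 0, 1))   # votes: -1, -1, -1
print(vote_a, vote_b)
```

With two classes, an odd number of trees avoids tied votes.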

26. LP-based Prediction Algorithm
Model: certain haplotypes are susceptible to the disease while others are resistant to the disease. The genotype susceptibility is assumed to be the sum of the susceptibilities of its two haplotypes.
Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of the weights of its haplotypes is negative and for any case genotype it is positive.
For each vertex-haplotype h_i assign the weight p_i such that, for any genotype-edge e_{i,j} = (h_i, h_j), the sign of p_i + p_j agrees with s(e_{i,j}), where s(e_{i,j}) ∈ {-1, 1} is the disease status of genotype e_{i,j}. The sum of the absolute values of the genotype weights is maximized.
In this algorithm, we assume that certain haplotypes are susceptible to the disease while others are resistant to the disease, and the genotype susceptibility is assumed to be the sum of the susceptibilities of its two haplotypes. We want to assign a weight to each haplotype such that all case genotypes are positive and all control genotypes are negative. For a testing genotype, we just need to add up the weights of its two haplotypes.
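Once haplotype weights are available, the prediction step of this model is just a sign check. The sketch below covers only that step and a consistency check against the training genotypes; solving the linear program that produces the weights is not shown. The haplotype labels, weights and function names are hypothetical.

```python
def genotype_score(weights, edge):
    """Sum of the weights of the two haplotypes forming the genotype."""
    h_i, h_j = edge
    return weights[h_i] + weights[h_j]


def is_consistent(weights, training):
    """True if every case scores positive and every control scores negative."""
    return all(status * genotype_score(weights, edge) > 0
               for edge, status in training)


def predict_status(weights, edge):
    """1 (case) if the summed weight is positive, otherwise -1 (control)."""
    return 1 if genotype_score(weights, edge) > 0 else -1


# Hypothetical weights, as an LP solver might return them.
weights = {"h1": 0.8, "h2": -0.5, "h3": -0.9}
training = [(("h1", "h1"), 1), (("h1", "h2"), 1), (("h2", "h3"), -1)]

ok = is_consistent(weights, training)
status = predict_status(weights, ("h1", "h3"))
print(ok, status)
```

A testing genotype containing a haplotype that never appears in the training data has no weight in this sketch; how to handle that case is left open here.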
