1. Combinatorial Methods for Disease Association Search and Susceptibility Prediction
Alexander Zelikovsky
joint work with Dumitru Brinza
Department of Computer Science
First of all, I would like to thank Dr. Srinivas for giving me the opportunity to make a presentation here at Cal Poly Pomona,
and thank you, everybody, for coming to listen to my talk.
The topic of my presentation is combinatorial methods for computational problems in genetic epidemiology.
2. Outline
SNPs, Haplotypes and Genotypes
Disease Association Search
Genome-wide association search challenges
Problem formulation
Exhaustive & Combinatorial Search
Optimization formulation & complimentary greedy search
Predicting susceptibility to complex diseases
Problem formulation/cross-validation
Previous methods: SVM, RF, LP
Optimum clustering and prediction via model-fitting
Conclusions
In the first section I want to introduce some basics of human genetics and some terminology,
such as SNPs, haplotypes, and genotypes. Then I will talk about phasing, which is a very important
computational problem. In the first section I also want to give an overview of genetic epidemiology.
In the second section I will talk about my research, including phasing for family trios and
genetic susceptibility to complex diseases.
The last section is conclusions and future plans.
3. SNP, Haplotypes, Genotypes
Length of human genome ≈ 3 × 10^9
Number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
SNPs are mostly biallelic, e.g., A↔C
Minor allele frequency should be considerable, e.g., > 0.1%
Difference b/w ALL people ≈ 0.25% (b/w any 2 ≈ 0.1%)
Diploid = two different copies of each chromosome
Haplotype = description of a single copy (expensive)
example: 00110101 (0 is for major, 1 is for minor allele)
Genotype = description of the mixed two copies
example: 01122110 (0=00, 1=11, 2=01)
International HapMap project: www.hapmap.org
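The genotype encoding on this slide (0 = both copies major, 1 = both copies minor, 2 = the two copies differ) can be illustrated with a minimal Python sketch. The function name and the second haplotype are made up for illustration; they are not from the talk.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two haplotype strings into one genotype string.

    Encoding from the slide: 0 = both copies carry the major allele (00),
    1 = both carry the minor allele (11), 2 = the copies differ (01).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    out = []
    for a, b in zip(h1, h2):
        if a not in ("0", "1") or b not in ("0", "1"):
            raise ValueError("haplotype alleles must be 0 or 1")
        # Equal alleles keep their symbol; different alleles become 2.
        out.append(a if a == b else "2")
    return "".join(out)


# The first haplotype is the slide's example; the second is hypothetical.
print(genotype_from_haplotypes("00110101", "01100110"))
```

Going the other way, from a genotype back to its two haplotypes, is ambiguous whenever more than one position is 2, which is why a haplotype is more expensive to obtain than a genotype.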
4. Challenges of Disease Association
Monogenic disease
A mutated gene is entirely responsible for the disease.
Typically rare in population: < 0.1%.
Complex disease
Interaction of multiple genes
2-SNP interaction analysis for a genome-wide scan with 1 million SNPs (3 kb coverage) has 10^12 pairwise tests
Multiple independent causes
Each cause explains < 10-20% of cases
Common: > 0.1%.
In NY city, 12% of the population has Type 2 Diabetes
Multiple testing adjustment
Reason for non-reproducible findings
In fact, phasing is a preliminary step for genetic epidemiology studies. Genetic epidemiology searches for
genetic risk factors for diseases. These factors can be SNPs, genotypes, or haplotypes.
If the disease is caused by a single gene, then the disease is monogenic, and it occurs very rarely, in less than 0.1% of the population.
If the disease is caused by the interaction of multiple genes, then the disease is complex. Complex diseases occur in more than 0.1% of the population, so they are also called common diseases. In NY city, 12% of the population has type 2 diabetes.
For example, everybody knows that smoking is a risk factor for lung cancer; smoking is an environmental factor for the disease.
The significance of a risk factor is measured by its risk rate. To evaluate smoking as a risk factor, we
compare the chance of cancer among smokers with the chance of cancer among non-smokers; the ratio is the risk rate. The
higher the risk rate, the more confident we are about the risk factor.
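To give a feel for the scale of the multiple-testing problem on this slide, here is a small Python sketch that counts the unordered SNP pairs in a scan of 1 million SNPs. It also shows how small a per-test threshold becomes when an overall significance level is divided by that count. The 0.05 level and the function names are assumptions for illustration only.

```python
from math import comb


def pairwise_tests(n_snps: int) -> int:
    """Number of unordered SNP pairs in a scan of n_snps SNPs."""
    return comb(n_snps, 2)


def per_test_threshold(alpha: float, n_tests: int) -> float:
    """Divide an overall significance level evenly across n_tests tests."""
    return alpha / n_tests


n_tests = pairwise_tests(1_000_000)             # scan size quoted on the slide
threshold = per_test_threshold(0.05, n_tests)   # 0.05 is an assumed level
print(n_tests, threshold)
```

The count of unordered pairs is about 5 × 10^11, the same order of magnitude as the 10^12 quoted on the slide (which matches the count of ordered pairs).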
5. Disease Association Search Problem
Genetic susceptibility to complex diseases is a key problem in epidemiology.
Here is the problem formulation.
Given the genotypes of sick and healthy persons and the genotype of a testing person, find the
disease status of the testing person.
In epidemiology, the sick and healthy persons are classified as case and control.
In computation, they are represented as 1 and -1, respectively.
You may have heard of case/control studies, which is what I'm doing.
6. Significance of Risk/Resistance Factors
Measured by
Relative risk (RR)
Odds ratio (OR)
Their p-values
Unadjusted p-value: probability of the observed case/control distribution among those exposed to the risk factor, computed from the binomial distribution
Multiple-testing adjustment:
Bonferroni
easy to compute
overly conservative
Randomization
computationally expensive
more accurate
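As a rough illustration of the measures listed on this slide, the following Python sketch computes relative risk, odds ratio, and a one-sided binomial tail probability from a 2x2 table. The table layout (a, b, c, d), the function names, and the choice of an upper-tail probability are assumptions made for this sketch; the talk does not specify the exact test.

```python
from math import comb


def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """a: exposed cases, b: exposed controls, c: unexposed cases, d: unexposed controls."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds of disease among the exposed divided by odds among the unexposed."""
    return (a * d) / (b * c)


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p).

    Assumed usage: n exposed individuals, k of them cases, and p the
    fraction of cases in the whole sample.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts, for illustration only.
a, b, c, d = 30, 10, 20, 40
p_case = (a + c) / (a + b + c + d)
print(relative_risk(a, b, c, d), odds_ratio(a, b, c, d))
print(binomial_tail(a, a + b, p_case))
```

A Bonferroni adjustment would multiply such an unadjusted p-value by the number of tests performed, which is easy to compute but overly conservative, as the slide notes.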
7. Exhaustive & Combinatorial Search
Exhaustive search is infeasible
sample with n genotypes/m SNPs requires O(n 3^m)
Combinatorial search
Definition: Disease-closure of a multi-SNP combination (MSC) C is a multi-SNP combination C', with maximum number of SNPs, which consists of the same set of disease individuals and minimum number of nondisease individuals.
Searches only closed clusters
Closure of cluster C = C'
d(C') = d(C) and h(C') is minimized
Avoids checking of trivial MSCs
Small d(C) implies not looking in subclusters
Finds associated MSCs faster but is still too slow
Tagging:
compress S by extracting most informative SNPs
restore other SNPs from tag SNPs
multiple regression method
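The disease-closure definition on the Exhaustive & Combinatorial Search slide can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes an MSC is a mapping from SNP index to a required genotype value, that status 1 marks a diseased individual and -1 a non-diseased one, and that d(C) and h(C) count the diseased and non-diseased members of the cluster. All names are hypothetical.

```python
from typing import Dict, List, Sequence

MSC = Dict[int, int]  # SNP index -> required genotype value (0, 1, or 2)


def cluster(msc: MSC, genotypes: Sequence[Sequence[int]]) -> List[int]:
    """Indices of individuals whose genotypes match every SNP value in the MSC."""
    return [i for i, g in enumerate(genotypes)
            if all(g[s] == v for s, v in msc.items())]


def disease_closure(msc: MSC, genotypes: Sequence[Sequence[int]],
                    status: Sequence[int]) -> MSC:
    """Largest MSC that keeps the same diseased individuals as `msc`.

    Every SNP on which all diseased members of the cluster agree is added,
    so the diseased set is unchanged and non-diseased members can only drop out.
    """
    diseased = [i for i in cluster(msc, genotypes) if status[i] == 1]
    if not diseased:
        return dict(msc)  # closure is not meaningful without diseased members
    closed: MSC = {}
    for s in range(len(genotypes[0])):
        values = {genotypes[i][s] for i in diseased}
        if len(values) == 1:
            closed[s] = values.pop()
    return closed


# Hypothetical toy sample: 4 individuals, 3 SNPs.
genotypes = [[0, 1, 2], [0, 1, 0], [0, 2, 2], [0, 1, 2]]
status = [1, 1, -1, -1]
closed = disease_closure({0: 0}, genotypes, status)
print(closed, cluster(closed, genotypes))
```

In the toy sample the closure adds a second SNP, which keeps both diseased individuals and drops one of the two non-diseased ones.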
8. MLR Tagging
9. Data Sets
Crohn's disease (Daly et al.): inflammatory bowel disease (IBD).
Location: 5q31
Number of SNPs: 103
Population Size: 387
case: 144 control: 243
Autoimmune disorders (Ueda et al.):
Location: containing genes CD28, CTLA4, and ICOS
Number of SNPs: 108
Population Size: 1024
case: 378 control: 646
Tick-borne encephalitis dataset of (Barkash et al.):
Location: containing genes TLR3, PKR, OAS1, OAS2, and OAS3.
Number of SNPs: 41
Population Size: 75
case: 21 control: 54
10. Disease association search results
IES(30):
exhaustive search
30 indexed SNPs with MLR based tagging method
ICS(30):
combinatorial search
30 indexed SNPs with MLR based tagging method.
11. Disease Association Search
Optimum Association Search Problem:
Find MSC that is the most associated with the disease
Measure: positive predictive value
= find (non-)diseased-free cluster of maximum size
Bad news: Generalization of max independent set
NP-complete and cannot be well approximated
Hope: sample S is not arbitrary
12. Complimentary Greedy Algorithm
Algorithm
Start with C=S (resp. MSC is empty)
Repeat until h(C)=0 (non-diseased-free)
Find 1-SC s maximizing (h(C) - h(C ∩ {s})) / (d(C) - d(C ∩ {s})) = minimize payment with diseased for removal of non-diseased
Add s to SNPs of C's MSC
Analogy: finding independent set by greedily removing highest-degree vertices
Extremely fast but inaccurate
Can be used in susceptibility prediction
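The steps of the complimentary greedy algorithm can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not the authors' code. A 1-SNP combination is taken to be a (SNP index, genotype value) pair, status 1 marks diseased and -1 non-diseased, and a candidate that removes no diseased individuals gets an infinite score. Candidates that would leave no diseased individuals are skipped, and the loop stops early when no candidate removes a non-diseased individual. The last two rules are additions made here so that the sketch terminates sensibly.

```python
from typing import Dict, List, Sequence, Tuple


def complimentary_greedy(genotypes: Sequence[Sequence[int]],
                         status: Sequence[int]) -> Tuple[Dict[int, int], List[int]]:
    """Greedily grow an MSC until its cluster holds no non-diseased individuals."""
    members = list(range(len(genotypes)))   # C = S
    msc: Dict[int, int] = {}                # the MSC starts empty
    n_snps = len(genotypes[0]) if genotypes else 0

    def counts(idx: List[int]) -> Tuple[int, int]:
        d = sum(1 for i in idx if status[i] == 1)
        return d, len(idx) - d              # (diseased, non-diseased)

    d_cur, h_cur = counts(members)
    while h_cur > 0:
        best = None                         # (score, snp, value, new members)
        for s in range(n_snps):
            if s in msc:
                continue
            for v in sorted({genotypes[i][s] for i in members}):
                kept = [i for i in members if genotypes[i][s] == v]
                d_new, h_new = counts(kept)
                removed_h, removed_d = h_cur - h_new, d_cur - d_new
                if removed_h == 0 or d_new == 0:
                    continue                # no progress, or nothing diseased left
                score = float("inf") if removed_d == 0 else removed_h / removed_d
                if best is None or score > best[0]:
                    best = (score, s, v, kept)
        if best is None:
            break                           # cannot remove more non-diseased
        _, s, v, members = best
        msc[s] = v
        d_cur, h_cur = counts(members)
    return msc, members


# Hypothetical toy sample: two diseased (1) and two non-diseased (-1) individuals.
toy_genotypes = [[1, 0], [1, 0], [0, 0], [1, 2]]
toy_status = [1, 1, -1, -1]
print(complimentary_greedy(toy_genotypes, toy_status))
```

Each iteration scans every remaining SNP once, which is why the method is fast. Because it never backtracks, it can miss the MSC with the best positive predictive value.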
13. Most disease-associated & disease-resistant MSC
Comparison of three methods for searching the disease-associated and disease-resistant multi-SNP combinations with the largest PPV. The starred values refer to
results of the runtime-constrained exhaustive search.
14. Genetic Susceptibility Prediction
Given: Genotypes of diseased and non-diseased individuals,
Genotype of a testing person.
Find: The disease status of the testing person
15. Cross-validation
Leave-one-out test: The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.
The cross-validation method is used to compute the four numbers TP, TN, FP, and FN.
Cross-validation includes the leave-one-out test and the leave-many-out test.
Let's see the example. This is the data set. One genotype is left out, and the others are regarded as the training set.
Based on the information in the training set, we predict the disease status of the left-out genotype. We repeat this until
all genotypes of the data set have been tested. Then we compare the real disease status with the predicted status
to get the accuracy.
Leave-many-out is another kind of cross-validation test. In leave-many-out tests, we use 2/3 of the data set as the training set and predict
the disease status of the rest. We repeat the process many times and take the average of the prediction rate.
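The leave-one-out test described on this slide can be sketched in Python as below. The predictor is passed in as a function, and the nearest-neighbour predictor used here is only a hypothetical stand-in to make the sketch runnable; it is not one of the methods from the talk. Status 1 means diseased and -1 non-diseased.

```python
from typing import Callable, List, Sequence, Tuple

Genotype = Sequence[int]
Predictor = Callable[[List[Genotype], List[int], Genotype], int]


def leave_one_out(genotypes: Sequence[Genotype], status: Sequence[int],
                  predict: Predictor) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) from predicting each genotype with the rest as training set."""
    tp = tn = fp = fn = 0
    for i, (g, s) in enumerate(zip(genotypes, status)):
        train_g = [x for j, x in enumerate(genotypes) if j != i]
        train_s = [x for j, x in enumerate(status) if j != i]
        guess = predict(train_g, train_s, g)
        if guess == 1:
            if s == 1:
                tp += 1
            else:
                fp += 1
        else:
            if s == 1:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def nearest_neighbor(train_g: List[Genotype], train_s: List[int], g: Genotype) -> int:
    """Hypothetical stand-in predictor: status of the closest training genotype."""
    def hamming(a: Genotype, b: Genotype) -> int:
        return sum(x != y for x, y in zip(a, b))
    best = min(range(len(train_g)), key=lambda j: hamming(train_g[j], g))
    return train_s[best]


# Hypothetical toy data set.
cv_genotypes = [[0, 0], [0, 0], [2, 2], [2, 2]]
cv_status = [1, 1, -1, -1]
print(leave_one_out(cv_genotypes, cv_status, nearest_neighbor))
```

A leave-many-out test would follow the same pattern, except that a random 1/3 of the data is held out at each repetition and the results are averaged.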
16. Quality Measures of Prediction (confusion table)
Sensitivity: The ability to correctly detect disease.
sensitivity = TP/(TP+FN)
Specificity: The ability to avoid calling normal as disease.
specificity = TN/(FP+TN)
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Risk Rate: Measurements for risk factors.
To measure the quality of prediction methods, we need to compare the prediction result with the original
disease status for the population.
The first column is the sick population and the second column is the healthy population.
The first row is the population that is predicted as sick and the second row is the population that is predicted as healthy.
The prediction can produce two kinds of errors: a false positive result or a false negative result.
False positive means you are predicted sick but you are healthy.
False negative means you are predicted healthy but you are sick.
True positives and true negatives are correctly predicted.
In general, a population of tested individuals may be divided into these four groups.
Sensitivity is the ability to correctly detect a disease. Specificity is the ability to avoid calling normal as disease.
Accuracy is the percentage of the population that is correctly predicted.
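The three formulas on this slide translate directly into a few lines of Python. The function name and the example counts are hypothetical. The sketch assumes the sample contains at least one diseased and one non-diseased individual, so that no denominator is zero.

```python
from typing import Dict


def quality_measures(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from the confusion-table counts."""
    return {
        "sensitivity": tp / (tp + fn),             # diseased correctly detected
        "specificity": tn / (fp + tn),             # healthy correctly cleared
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts: 50 diseased and 50 healthy individuals.
measures = quality_measures(tp=40, tn=45, fp=5, fn=10)
print(measures)
```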
17. Prediction Methods
Support vector machine
Random forest
LP-based prediction
18. Prediction via Clustering
Drawback of the prediction problem formulation = need of cross-validation ⇒ no optimization
Clustering P = partition into clusters defined by MSCs
Minimizing number of errors
S.t. bounded information entropy -Σ_i (|S_i|/|S|) log(|S_i|/|S|)
Model-fitting prediction
Set status of testing genotype to diseased
Find number of errors
Set status of testing genotype to non-diseased
Find number of errors
Predict status that implies lesser number of errors
19. Leave-1-out cross-validation results
Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.
20. ROC curve
Comparison of 5 prediction methods on (Barkash et al., 2006) data on all SNPs.
Area under the CSP's curve is 0.81 vs. 0.52 under the SVM's curve.
21. Conclusions
Combinatorial search is able to find statistically significant multi-gene interactions for data where no significant association was detected before
Complimentary greedy search can be used in susceptibility prediction
Optimization approach to prediction
New susceptibility prediction is 8% higher than the best previously known
MLR-tagging efficiently reduces the datasets, allowing us to find associated multi-SNP combinations and predict susceptibility
24. Support Vector Machine (SVM) Algorithm
Learning Task
Given: Genotypes of patients and healthy persons.
Compute: A model distinguishing if a person has the disease.
Classification Task
Given: Genotype of a new patient + a learned model
Determine: If a patient has the disease or not.
25. Random Forest Algorithm
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree votes for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Growing Tree, Split selection and Prediction.
Random sub-sample of training data, Random splitter selection.
26. LP-based Prediction Algorithm
Model:
Certain haplotypes are susceptible to the disease while others are resistant to the disease.
The genotype susceptibility is assumed to be a sum of susceptibilities of its two haplotypes.
Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of weights of its haplotypes is negative and for any case genotype it is positive.
For each vertex-haplotype h_i assign the weight p_i,
such that for any genotype-edge e_i,j = (h_i, h_j) the sum p_i + p_j has the same sign as s(e_i,j),
where s(e_i,j) ∈ {-1, 1} is the disease status of genotype e_i,j. The sum of absolute values of genotype weights is maximized.
In this algorithm, we assume that certain haplotypes are susceptible to the disease while others are resistant to the disease, and the genotype susceptibility is assumed to be the sum of the susceptibilities of its two haplotypes.
We want to assign a weight to each haplotype such that all genotypes
for cases are positive and all genotypes for controls are negative. For a testing genotype, we just need to
add up the weights of its two haplotypes.
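The prediction step of the LP-based method, adding up the weights of the two haplotypes of a testing genotype, can be sketched in Python as below. The weights here are invented for illustration; in the method they would come from solving the linear program. Giving an unseen haplotype weight 0 and predicting -1 when the sum is exactly 0 are assumptions of this sketch.

```python
from typing import Dict, Tuple


def predict_status(hap_pair: Tuple[str, str], weights: Dict[str, float]) -> int:
    """Predict 1 (case) if the summed haplotype weights are positive, else -1 (control)."""
    # Unseen haplotypes get weight 0.0 (an assumption of this sketch).
    score = weights.get(hap_pair[0], 0.0) + weights.get(hap_pair[1], 0.0)
    return 1 if score > 0 else -1


# Hypothetical weights: positive = susceptible, negative = resistant.
weights = {"0011": 0.7, "0101": -0.4, "1100": -0.9}
print(predict_status(("0011", "0101"), weights))
print(predict_status(("0101", "1100"), weights))
```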