Allelic Pattern Sampler: Genetic Combinations Underlying Complex Diseases. Polygenic diseases (traits). Susceptibility to a polygenic disease arises from the contribution of a set of genes. Heterogeneity: different genetic backgrounds give rise to the same disease.
Allelic Pattern Sampler: Genetic Combinations Underlying Complex Diseases
Polygenic diseases (traits)
• Susceptibility to a polygenic disease arises from the contribution of a set of genes.
• Heterogeneity: different genetic backgrounds give rise to the same disease.
• The disease outcome is correlated with the genetic background rather than determined by it.
Environmental effect or heterogeneity: gang-specific eyebrows. A common signature is improbable.
Polygenic contribution
The genes can contribute independently, in an additive way.
The genes can interact (epistasis).
The genes can behave as interacting only with respect to the disease.
• Complementary alleles. The trait manifestation of one allele requires an allele of another gene.
• Alternative pathways.
The pattern concept. An example: image recognition.
[Figure: image patterns labeled (1,0), (1,1/2), (1/2,1/2), (1,1), (1/2,1), (0,1)]
Allelic (genetic) pattern
• We know the levels of a trait (i.e. a disease) for a group of persons, and we know the alleles of candidate genes that these persons carry.
• A pattern is a set of alleles of the genes whose presence in a genome, as a whole, is associated with the trait.
• Any subset of the pattern is associated less reliably than the whole pattern is. Any superset, too. So, a pattern is a locally minimal subset satisfying the statements above.
• A pattern may contain only one allele.
Example of a genetic pattern for a complex polygenic disease
Cross-sectional comparison of MS patients and controls among carriers and non-carriers of alleles of the DRB1 HLA gene, of the CCR5 chemokine receptor gene deletion, and of their combination. The solid line marks the ratio expected for an independent combination.
[Figure annotation: OR 20.1, p<0.0001]
Favorova OO, Andreewski TV, Boiko AN, Sudomoina MA, Alekseenkov AD, Kulakova OG, Slanova AV, Gusev EI. 2002. The chemokine receptor CCR5 deletion mutation is associated with MS in HLA-DR4-positive Russians. Neurology 59(10):1652-5.
Patterns hide each other
The union of two combinations may require more than two alleles at one locus:
....| 0 0 | a b | 0 0 |....
....| 0 0 | c 0 | 0 0 |....
The strongest association (not necessarily the most reliable one) statistically shadows all the others.
[Figure axis: disease level]
Independence question
We cannot define a proper space of patterns, because the operation of addition (as a union of allelic sets) is not defined for every pair of patterns. Thus we cannot apply a component analysis technique.
Mutual isolation of patterns
• Since we cannot consider one pattern on its own, we consider a set of patterns simultaneously.
• We call a pattern isolated from a set of other patterns if we remove the influence of all the other patterns before we assess this pattern's association with the trait.
• This is analogous to an adjustment procedure.
Data
• We have genotypic data and phenotypic trait level data for some individuals.
• The trait levels are comparative characteristics. They cannot be measured, they can only be compared.
• We want to obtain the allelic patterns that best characterize the relation between the genotypic and phenotypic data.
• We look for a whole set of patterns that maximises the probability that all the patterns are associated with the disease in a mutually isolated manner.
• A good pattern set forms a kind of “gradient basis” in the genome-trait association.
Data structures
The set of patterns is the variable to be optimized.

Trait level | Incidence matrix | Gene data
0.1 | 1 0 0 | a c | d d | f s |....
0.4 | 0 1 1 | c f | a b | b a |....
0.7 | 0 0 0 | a a | c b | a c |....
0.9 | 0 0 1 | c f | f b | b s |....
0.2 | 1 1 1 | a f | a d | b c |....
…   | ....... | ........................

Set of patterns
0 0 | d 0 | 0 0 |....
0 0 | a 0 | 0 0 |....
0 f | 0 0 | b 0 |....

The correspondence between the trait levels and the incidence matrix shows the quality of the set of patterns.
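The incidence matrix on this slide can be reproduced from the gene data and the set of patterns. The sketch below is only an illustration: the function names `carries` and `incidence_matrix` are hypothetical, and it assumes that "0" in a pattern means "any allele" and that a pattern naming the same allele twice at a locus requires two copies of it.

```python
from typing import List, Tuple

Locus = Tuple[str, str]  # the two alleles at one locus; "0" in a pattern means "any"

def carries(genotype: List[Locus], pattern: List[Locus]) -> bool:
    """True if the genotype contains every non-"0" allele of the pattern, locus by locus."""
    for alleles, required in zip(genotype, pattern):
        present = list(alleles)
        for allele in required:
            if allele == "0":
                continue
            if allele not in present:
                return False
            present.remove(allele)  # assumption: a repeated allele needs two copies
    return True

def incidence_matrix(genotypes: List[List[Locus]], patterns: List[List[Locus]]) -> List[List[int]]:
    """One row per individual, one 0/1 column per pattern."""
    return [[int(carries(g, p)) for p in patterns] for g in genotypes]

# Gene data and pattern set copied from the slide.
genotypes = [
    [("a", "c"), ("d", "d"), ("f", "s")],
    [("c", "f"), ("a", "b"), ("b", "a")],
    [("a", "a"), ("c", "b"), ("a", "c")],
    [("c", "f"), ("f", "b"), ("b", "s")],
    [("a", "f"), ("a", "d"), ("b", "c")],
]
patterns = [
    [("0", "0"), ("d", "0"), ("0", "0")],
    [("0", "0"), ("a", "0"), ("0", "0")],
    [("0", "f"), ("0", "0"), ("b", "0")],
]

for row in incidence_matrix(genotypes, patterns):
    print(*row)
```

With the slide's data, the rows come out as 1 0 0, 0 1 1, 0 0 0, 0 0 1 and 1 1 1, matching the incidence matrix shown above.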
The incidence classification
All the cases are classified into 2^n possible classes, where n is the number of patterns, based on their row in the incidence matrix.

Incidence matrix
1 0 0
0 1 1
0 0 0
0 0 1
1 1 1
1 0 1
0 1 0
1 0 1
0 0 1
.......

The classes can be represented by the vertices of a hypercube (for n = 3: 000, 001, 010, 011, 100, 101, 110, 111). A set of parallel edges of the cube corresponds to a pattern.
[Figure annotation: the direction of the second pattern]
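A minimal sketch of this classification, assuming each incidence row is used directly as the class label (a hypercube vertex). The names `classify` and `parallel_edges` are illustrative and not taken from APSampler.

```python
from collections import defaultdict
from itertools import product

def classify(trait_levels, incidence_rows):
    """Group trait levels by incidence row; each distinct row is one hypercube vertex."""
    classes = defaultdict(list)
    for level, row in zip(trait_levels, incidence_rows):
        classes[tuple(row)].append(level)
    return dict(classes)

def parallel_edges(n_patterns, k):
    """All edges (vertex without pattern k, same vertex with pattern k) of the n-cube."""
    edges = []
    for vertex in product((0, 1), repeat=n_patterns):
        if vertex[k] == 0:
            partner = vertex[:k] + (1,) + vertex[k + 1:]
            edges.append((vertex, partner))
    return edges

levels = [0.1, 0.4, 0.7, 0.9, 0.2]
rows = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1], [1, 1, 1]]
classes = classify(levels, rows)
edges_of_second_pattern = parallel_edges(3, 1)
```

For three patterns there are eight classes, and each pattern corresponds to four parallel edges; `edges_of_second_pattern` lists the four edges along the direction of the second pattern.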
A pair of classes comparison
[Figure: cube with vertices 000, 001, 010, 011, 100, 101, 110, 111; classes x and y lie at the two ends of one edge]
Two classes of trait levels that lie on the same edge differ due to the “isolated” influence of the edge’s pattern. So, we base the assessment of a pattern set on such pairwise comparisons. We can only compare the disease (trait) levels, so the appropriate statistic for the comparison is the number of inversions.
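The inversion count for one pair of classes can be sketched as follows. The slide does not define the direction of an inversion or the treatment of ties, so both are assumptions here: a pair (x, y) counts as an inversion when x > y, and a tie counts as one half.

```python
def count_inversions(x_levels, y_levels):
    """Count pairs (x, y) with x > y; ties contribute 0.5 (an assumption, see lead-in)."""
    inversions = 0.0
    for x in x_levels:
        for y in y_levels:
            if x > y:
                inversions += 1.0
            elif x == y:
                inversions += 0.5
    return inversions

# Only comparisons between levels are used, never their numeric differences.
print(count_inversions([1, 3], [2, 4]))  # one pair is inverted: 3 > 2
```

Because only order comparisons enter the count, the statistic fits trait levels that can be compared but not measured.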
A pair of classes. Alternative hypotheses.
• To test a pair of adjacent classes, we formulate three hypotheses about the corresponding pattern:
• null hypothesis: X and Y have the same median, i.e. X≡Y
• “positive” hypothesis: median (Y) > median (X) (predisposing pattern)
• “negative” hypothesis: median (Y) < median (X) (protecting pattern).
• We compare the hypotheses in a Bayesian paradigm.
The likelihoods for a pair: example
[Figure: likelihood p (axis up to 0.25) against the number of inversions (0 to 8) for the “+”, “-”, “null” and “const” lines]
The larger the minor class is, the sharper all the likelihoods are. If its size is 1 or 0, all 4 lines are equal.
The null-hypothesis posterior for a pattern
• A pattern’s likelihood for a hypothesis is the product of the likelihoods of all the corresponding class pairs.
• If a pattern is carried by all the genomes in the data or by none of them (it is uninformative), the null-hypothesis prior for the pattern is 1. For informative patterns, we use a uniform prior.
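The two rules above can be sketched as below. The slides do not give the per-pair likelihood formula, so the likelihoods are taken as inputs. Reading "uniform prior" as equal weight on the three hypotheses is an assumption, and the name `pattern_posteriors` is illustrative.

```python
import math

HYPOTHESES = ("null", "positive", "negative")

def pattern_posteriors(pair_likelihoods, informative=True):
    """Posterior of each hypothesis for one pattern.

    pair_likelihoods: one dict per class pair, mapping hypothesis name to likelihood.
    """
    if not informative:
        # Uninformative pattern: the null prior is 1, so the null posterior is 1 as well.
        return {"null": 1.0, "positive": 0.0, "negative": 0.0}
    # Pattern likelihood = product over all corresponding class pairs.
    totals = {h: math.prod(pair[h] for pair in pair_likelihoods) for h in HYPOTHESES}
    # With equal priors the posterior is the normalised likelihood.
    norm = sum(totals.values())
    return {h: totals[h] / norm for h in HYPOTHESES}

# Hypothetical likelihood values, for illustration only.
pairs = [
    {"null": 0.1, "positive": 0.3, "negative": 0.05},
    {"null": 0.2, "positive": 0.4, "negative": 0.1},
]
posteriors = pattern_posteriors(pairs)
```

The sketch does not guard against all likelihoods being zero; a real implementation would also work in log space to avoid underflow with many class pairs.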
The quality of a set of patterns
[Figure: cube with vertices 000, 001, 010, 011, 100, 101, 110, 111]
• The pairwise comparisons for all the classes that correspond to parallel edges together qualify a pattern.
• All patterns together qualify a set of patterns.
• A good pattern set is one without bad patterns.
[formula] is the quality of a set of patterns.
Optimization of the pattern set quality
• Direct enumeration is inefficient.
• Gradient-like maximisation is prone to getting stuck in local maxima.
Thus, we use the Markov chain Monte Carlo (MCMC) method. Specifically, it is a hybrid Metropolis-Hastings-Gibbs sampler with a random choice of updates.
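As a rough illustration of sampling with a random choice of updates, here is a plain Metropolis loop on a toy state space. It is not the APSampler algorithm: it assumes symmetric proposals and omits the Gibbs component, and `metropolis`, `step_up`, `step_down` and the toy quality function are all hypothetical.

```python
import math
import random
from collections import Counter

def metropolis(initial, quality, updates, steps, rng):
    """Sample states with probability proportional to quality(state)."""
    state = initial
    q = quality(state)
    chain = []
    for _ in range(steps):
        update = rng.choice(updates)        # random choice of the update type
        candidate = update(state, rng)
        q_new = quality(candidate)
        # Accept improvements always, worse candidates with probability q_new / q.
        if q_new >= q or rng.random() < q_new / q:
            state, q = candidate, q_new
        chain.append(state)
    return chain

# Toy problem: integers 0..9 with a single quality peak at 5.
def step_up(x, rng):
    return min(x + 1, 9)

def step_down(x, rng):
    return max(x - 1, 0)

def toy_quality(x):
    return math.exp(-(x - 5) ** 2)

chain = metropolis(0, toy_quality, [step_up, step_down], 5000, random.Random(0))
most_visited = Counter(chain).most_common(1)[0][0]
```

Unlike greedy maximisation, the chain can accept a worse state, which is what lets it leave a local maximum.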
Possible updating steps

A mutation:
Before:
0 0 | d 0 | 0 0
0 0 | a 0 | 0 0
0 f | 0 0 | b 0
After:
0 0 | d 0 | 0 0
0 0 | a 0 | 0 0
0 f | c 0 | b 0

A recombination:
Before:
0 0 | d 0 | 0 0
0 0 | a 0 | 0 0
0 f | 0 0 | b 0
After:
0 0 | d 0 | 0 0
0 0 | a 0 | b 0
0 f | 0 0 | 0 0
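The two updating steps can be written as small functions on a pattern set stored as nested lists. This is a sketch with hypothetical names `mutate` and `recombine`. The indices are passed explicitly so that the slide's examples can be reproduced; a sampler would presumably draw them at random.

```python
from copy import deepcopy

def mutate(pattern_set, pattern, locus, slot, allele):
    """Return a copy with one allele slot of one pattern overwritten."""
    updated = deepcopy(pattern_set)
    updated[pattern][locus][slot] = allele
    return updated

def recombine(pattern_set, first, second, locus):
    """Return a copy in which two patterns have exchanged the content of one locus."""
    updated = deepcopy(pattern_set)
    updated[first][locus], updated[second][locus] = (
        updated[second][locus],
        updated[first][locus],
    )
    return updated

# The starting pattern set from the slide ("0" marks an unconstrained slot).
start = [
    [["0", "0"], ["d", "0"], ["0", "0"]],
    [["0", "0"], ["a", "0"], ["0", "0"]],
    [["0", "f"], ["0", "0"], ["b", "0"]],
]

mutated = mutate(start, pattern=2, locus=1, slot=0, allele="c")
recombined = recombine(start, first=1, second=2, locus=2)
```

Both functions return a new pattern set and leave the input untouched, so a rejected proposal can simply be discarded.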
Output statistics

*** Patternsets statistics: ***
| alpha-fibr  | fibr-249  | fibr-148  | ApoE-491  | Apoe-427  | ApoE-epsilon  | Hind_III  | LPL-Ser447Ter | ACE | CMA/B | gender  |
+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+
|     0 0     |    0 0    |    0 0    |    T 0    |    0 0    |      0 0      |    0 0    |      0 0      | 0 0 |  0 0  |   0 0   |
|     0 0     |    0 0    |    0 0    |    0 0    |    C T    |      0 0      |    0 0    |      0 0      | 0 0 |  0 0  |   0 0   |
Registered 64 times.
Pattern posteriors to be positive: 3.709e-10 7.143e-11
Pattern posteriors to be negative: 0.001556 0.03835
Point reliability = 5.9658e-05
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Patterns statistics:
| alpha-fibr  | fibr-249  | fibr-148  | ApoE-491  | Apoe-427  | ApoE-epsilon  | Hind_III  | LPL-Ser447Ter | ACE | CMA/B | gender  |
+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+
|     0 0     |    0 0    |    0 0    |    0 0    |    C 0    |      0 0      |    0 0    |      0 0      | 0 0 |  0 0  |   0 0   |
Occured 5927 times. +/- : 0/5927 (Mentioned 41 times. +/- : 0/41 )
maximal reliabilities as + and - are 4.81058e-10 and 0.0172151 .
| alpha-fibr  | fibr-249  | fibr-148  | ApoE-491  | Apoe-427  | ApoE-epsilon  | Hind_III  | LPL-Ser447Ter | ACE | CMA/B | gender  |
+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+
|     0 0     |    0 0    |    0 0    |    T 0    |    0 0    |      0 0      |    0 0    |      0 0      | 0 0 |  0 0  |   0 0   |
Occured 3022 times. +/- : 0/3022 (Mentioned 19 times. +/- : 0/19 )
maximal reliabilities as + and - are 4.74783e-06 and 0.00205254 .
A(llelic) P(attern) Sampler
APSampler software was developed …
Favorov AV, Andreewski TV, Sudomoina MA, Favorova OO, Parmigiani G, Ochs MF: A Markov chain Monte Carlo technique for identification of combinations of allelic variants underlying complex diseases in humans. Genetics 2005, 171(4):2113-2121.
… and applied to real data
Favorova OO, Favorov AV, Boiko AN, Andreewski TV, Sudomoina MA, Alekseenkov AD, Kulakova OG, Gusev EI, Parmigiani G, Ochs MF: Three allele combinations associated with multiple sclerosis. BMC Med Genet 2006, 7:63.
Sudomoina MA, Nikolaeva TY, Parfenov MG, Alekseenkov AD, Favorov AV, Gekht AB, Gusev EI, Favorova OO: Genetic risk factors of arterial hypertension: analysis of ischemic stroke patients from the Yakut ethnic group. Dokl Biochem Biophys 2006 Sep-Oct, 410:324-6 (Rus).
Chikhladze NM, Samedova KhF, Sudomoina MA, Thant M, Htut ZM, Litonova GN, Favorov AV, Chazova IE, Favorova OO: Contribution of CYP11B2, REN and AGT genes in genetic predisposition to arterial hypertension associated with hyperaldosteronism. Kardiologiia 2008, 48(1):37-42 (Rus).
Validation I: Fisher’s exact test
pattern → p (pattern)
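For reference, a two-sided Fisher's exact test on a 2×2 carriers/non-carriers table can be computed with the standard library alone. Whether APSampler uses this two-sided convention is not stated on the slides, so treat the sketch as an assumption. The function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p_value = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )
    return min(1.0, p_value)

# Classic "lady tasting tea" table as a check: p = 34/70.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

A table from the slides, with carriers in one row and non-carriers in the other, can be passed in the same way.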
Validation II: permutation
[Figure: the disease data are permuted N times against the fixed genetic data; each permuted data set gives a null distribution (1st, 2nd, 3rd, ..., N-th) of p]
Pfail [pattern] = Pfail [p (pattern)]
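The permutation scheme can be sketched as follows: the genetic data stay fixed while the disease labels are shuffled, and the observed statistic is compared with its permuted values. The statistic (larger means stronger association) and the add-one correction are assumptions of this sketch, not details taken from APSampler.

```python
import random

def carriers_with_disease(carrier_flags, disease):
    """Toy statistic: number of pattern carriers that have the disease."""
    return sum(1 for c, d in zip(carrier_flags, disease) if c and d)

def permutation_p_value(carrier_flags, disease, statistic, n_permutations, rng):
    """Fraction of label permutations with a statistic at least as large as observed."""
    observed = statistic(carrier_flags, disease)
    shuffled = list(disease)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute disease data, keep genetic data fixed
        if statistic(carrier_flags, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

carriers = [1] * 10 + [0] * 10
disease = [1] * 10 + [0] * 10   # perfectly associated toy data
p_assoc = permutation_p_value(carriers, disease, carriers_with_disease, 1000, random.Random(1))

no_carriers = [0] * 20          # uninformative pattern: the statistic is always 0
p_none = permutation_p_value(no_carriers, disease, carriers_with_disease, 1000, random.Random(1))
```

In the slide's scheme the compared quantity is the pattern's p-value rather than a raw count; the same loop applies with the comparison direction reversed.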
Validation III: FDR
FDR ≈ FP/(FP+TP)
p ≈ FP/(FP+TN)
Validation III: FDR: calculation
[Figure: the original distribution of p is obtained from the genetic data with the real disease data; the null distributions (1st, 2nd, 3rd, ..., N-th) are obtained from the same genetic data with permuted disease data]
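One way to read this calculation is that FDR at a threshold T is the share of null p-values at or below T divided by the share of original p-values at or below T, capped at 1. This interpretation is consistent with the ratios printed in the example output (for instance 2.5e-06/0.001394), but it is an assumption, as are the function names below.

```python
def fraction_below(p_values, threshold):
    """Share of p-values that do not exceed the threshold."""
    if not p_values:
        return 0.0
    return sum(1 for p in p_values if p <= threshold) / len(p_values)

def fdr_from_fractions(null_fraction, observed_fraction):
    """FDR estimate as a ratio of fractions, capped at 1."""
    if observed_fraction == 0:
        return 1.0  # assumption: report the conservative bound when nothing passes
    return min(1.0, null_fraction / observed_fraction)

def estimate_fdr(observed_p, null_p, threshold):
    """observed_p: p-values from the real data; null_p: pooled p-values from permutations."""
    return fdr_from_fractions(
        fraction_below(null_p, threshold),
        fraction_below(observed_p, threshold),
    )

# Hypothetical p-values, for illustration only.
observed = [0.001, 0.01, 0.2, 0.5]
null = [0.3, 0.6, 0.04, 0.9, 0.5, 0.7, 0.8, 0.2]
print(estimate_fdr(observed, null, 0.05))  # (1/8) / (2/4) = 0.25
```

When the null fraction exceeds the observed one the ratio is above 1, and the estimate only says that FDR is at most 1.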
Validation III: FDR: evaluation II
[Figure: FDR against the threshold T, approximated vs. evaluated directly; FDR(T1) > FDR(T2)]
Validation: FDR: example
• 61 markers and gender
• 120 controls and 255 MS patients
• Among the 255 patients, 155 respond to a medication

Pattern contains 3 informative alleles: 21:G; 37:T; 53:C.
The pattern is mentioned in statistics as occurred 1 times at line: 3227.
Occurred in 1 patternsets 1 times.
Mentioned in patternsets at lines: 427.
Fisher 4-pole table:
    0    1   levels
    1   19   carriers
   89  118   noncarriers
p-value = 0.000368247913041713
FDR <=1 (0.0067765/1e-06)

Pattern contains 3 informative alleles: Gender:1; 27:T; 42:C.
The pattern is mentioned in statistics as occurred 1 times at line: 3011.
Occured in 1 patternsets 1 times.
Mentioned in patternsets at lines: 731.
Fisher 4-pole table:
    1    2   levels
   51   51   carriers
   60  171   noncarriers
p-value = 1.98632243779503e-05
FDR=0.00179340028694405 (2.5e-06/0.001394)
Authors
Alexander Favorov 1,3
Olga Favorova 2
Marina Sudomoina 2
Giovanni Parmigiani 3
Michael Ochs 3

Acknowledgements
• Alexey Alexeenkov 2
• Alexey Boiko 2
• Evgeniy Gusev 2
• Mikhail Parfenov 2
• Tatiana Nikolaeva 5
• Mikhail Gelfand 6
• Vsevolod Makeev 1
• Andrew Mironov 4
• Koen Vanderbroek 7

Affiliations
1. State Scientific Centre “GosNIIGenetica”, Moscow, Russia.
2. Russian State Medical University, Moscow, Russia.
3. The Sidney Kimmel Cancer Center at Johns Hopkins, Baltimore, MD, USA
4. Faculty of Bioinformatics and Biotechnology, MSU, Moscow
5. Yakut Research Center, Russian Academy of Medical Sciences and Government of the Sakha Republic (Yakutia), Yakutsk
6. Institute of Information Transmission Problems RAS, Moscow, Russia
7. School of Pharmacy - CCRCB – QUB, Belfast, UK

Thank you for your attention.
MS case-control study
• The method was applied to a database that contains the genotyping results for DNA from 237 unrelated patients with clinically defined MS and from 358 healthy unrelated controls (all of them Russians).
• 15 polymorphic sites of candidate loci for MS development were analyzed.
• The phenotypic trait (i.e. MS susceptibility) levels were 1 for patients and 0 for controls.
• There were two runs: one for two patterns, one for three.
APSampler identified the following patterns as MS-associated:
• DRB1*15(2)
• TNFa9
• CCR5Δ32 + DRB1*04
• -509 TGFβ1*C + DRB1*18 + +49 CTLA4*G (trio 1)
• -238 TNF*B1 + -308 TNF*A2 + +49 CTLA4*G (trio 2)
Results of Fisher’s exact test of association (4-pole, i.e. 2×2, table) for the trios and their two-element subsets.
The values given by the permutation test for the trios were less than 0.3%.
Analysis of genetic background of ischemic stroke (IS) patients of Yakut descent
IS genetic background analysis Associations identified
Carriership of the allele 495T LPL
[Figure: trait levels 0, 1, 2, 3] p<0.0001*
*The p-value is calculated by Fisher’s exact test on an 8-pole (eight-cell) table.
3-allelic pattern: carriership of -249C FGB, ε4 APOE and -1903A CMA
[Figure: trait levels 3, 2, 1, 0] p=0.0003*
-249C FGB + -1903A CMA: p=0.017
ε4 APOE + -1903A CMA: p=0.023