
FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

Xiang Zhang, Fei Zou, Wei Wang. University of North Carolina at Chapel Hill. Speaker: Xiang Zhang. Genotype-phenotype association study. Goal: finding genetic factors causing phenotypic difference.


Presentation Transcript


  1. FastANOVA: an Efficient Algorithm for Genome-Wide Association Study
     Xiang Zhang, Fei Zou, Wei Wang
     University of North Carolina at Chapel Hill
     Speaker: Xiang Zhang

  2. Genotype-phenotype association study
     • Goal: finding genetic factors causing phenotypic difference
     [Figures: mouse genome; phenotype variation. Image sources: http://www.bcgsc.ca, http://www.jax.org/]

  3. Genotype-phenotype association study
     • Single Nucleotide Polymorphism (SNP)
       - Mutation of a single nucleotide (A, C, T, G)
       - The most abundant source of genotypic variation
       - Serve as genetic markers of locations in the genome
       - High-throughput genotyping: thousands to millions of SNPs
     [Figure: aligned sequences from 12 individuals that differ at two SNP sites (A/T and C/G), labeled Chrom1 bp 3,568,717 and Chrom6 bp 120,323,342]

  4. Genotype-phenotype association study
     • Genotype
       - SNPs can be represented as binary {0,1} (e.g. inbred mouse strains)
     • Quantitative phenotypes
       - Body weight, blood pressure, tumor size, cancer susceptibility, ……
     • Question
       - Which SNPs are the most highly associated with the phenotype?

     SNPs                   Phenotype value
     …… 0 0 0 1 0 1 ……      8
     …… 0 0 0 0 0 0 ……      7
     …… 0 1 1 0 0 1 ……      12
     …… 0 1 0 0 1 0 ……      11
     …… 0 1 0 1 0 1 ……      9
     …… 0 1 0 0 0 0 ……      13
     …… 1 0 1 1 1 1 ……      6
     …… 1 0 0 0 1 0 ……      4
     …… 1 1 1 1 1 1 ……      2
     …… 1 0 0 1 0 0 ……      5
     …… 1 0 0 1 0 1 ……      0
     …… 1 0 1 1 0 0 ……      3

  5. A simple example: single marker association study
     • Partition individuals into groups according to the genotype of a SNP
     • Perform a statistical test (t-test, ANOVA)
     • Repeat for each SNP
     [Table: the same SNP and phenotype data as on slide 4]
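The single-marker procedure on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the data are the 12 individuals and 6 SNP columns shown on the slide, the function names `f_statistic` and `single_marker_scan` are made up for this example, and the test statistic is the standard one-way ANOVA F-statistic.

```python
# Example data transcribed from the slide: 12 individuals (rows), 6 binary SNPs (columns).
GENOTYPES = [
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
]
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(groups):
    """Standard one-way ANOVA F-statistic for a list of non-empty groups.

    Assumes at least two groups and a non-zero within-group sum of squares.
    """
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    # Between-group sum of squares: group size times squared mean offset.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviation from each group's mean.
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ssb / (len(groups) - 1)) / (ssw / (len(values) - len(groups)))


def single_marker_scan(genotypes, phenotype):
    """Return one F-statistic per SNP column (hypothetical helper)."""
    scores = []
    for snp in range(len(genotypes[0])):
        group0 = [y for row, y in zip(genotypes, phenotype) if row[snp] == 0]
        group1 = [y for row, y in zip(genotypes, phenotype) if row[snp] == 1]
        scores.append(f_statistic([group0, group1]))
    return scores


scores = single_marker_scan(GENOTYPES, PHENOTYPE)
best_snp = max(range(len(scores)), key=scores.__getitem__)
print(best_snp, [round(s, 2) for s in scores])
```

On this toy data the first SNP column (index 0) has the largest F-statistic, because it splits the individuals into a high-phenotype group and a low-phenotype group.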

  6. Two-locus association mapping
     • Many phenotypes are complex traits
       - Due to the joint effect of multiple genes
       - Single marker approach may not suffice
     • Consider SNP-SNP interactions
       - Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11
       - Split mice into four groups according to the genotype of each SNP-pair
       - Perform a statistical test for each SNP-pair

  7. Statistical issue
     • Multiple testing problem
       - With n independent tests, each at Type I error α, the family-wise error rate is 1 − (1 − α)^n
     • Example
       - Performing 20 tests with Type I error = 0.05 gives a family-wise error rate of 0.64
       - 64% probability of getting at least one spurious result
     • Solution
       - Permutation test
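The slide's example can be checked with the standard family-wise error rate formula for independent tests, 1 − (1 − α)^n. The function name below is chosen for this sketch only.

```python
def family_wise_error_rate(n_tests, alpha):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The slide's example: 20 tests, each at Type I error 0.05.
fwer = family_wise_error_rate(20, 0.05)
print(round(fwer, 2))  # 0.64
```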

  8. Permutation test
     • K permutations of phenotype values
     • For each permutation, find the maximum test value
     • Given Type I error α, the critical value Fα is the αK-th largest value among the K maximum values
     • SNP-pairs whose test values are greater than Fα are significant
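Picking the critical value from the K per-permutation maxima is a small ranking step. The sketch below assumes the rank αK is rounded down to an integer and is at least 1; the slide does not say how a non-integer αK is handled, so that rounding rule and the function name are assumptions.

```python
def critical_value(max_values, alpha):
    """Return the (alpha * K)-th largest of the K per-permutation maxima."""
    k = len(max_values)
    # 1-based rank; the small epsilon guards against float round-off in alpha * k.
    rank = max(1, int(alpha * k + 1e-9))
    return sorted(max_values, reverse=True)[rank - 1]


# Toy input: pretend the K = 100 maxima are simply 1, 2, ..., 100.
toy_maxima = list(range(1, 101))
f_alpha = critical_value(toy_maxima, 0.05)
print(f_alpha)  # 96, the 5th largest value
```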

  9. Genome-wide association study
     • What’s GWA?
       - Simple idea: search for associations across the whole genome
       - Hard to implement
       - Enormous search space: with 10,000 SNPs and 1,000 permutations, about 5 × 10^10 SNP-pair tests are needed
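The size of the search space follows from counting unordered SNP-pairs and multiplying by the number of permutations. A quick check with the slide's numbers:

```python
import math

n_snps = 10_000
n_permutations = 1_000

# Number of unordered SNP-pairs: N choose 2.
n_pairs = math.comb(n_snps, 2)
# Every pair is tested once per permutation.
n_tests = n_pairs * n_permutations

print(n_pairs)   # 49995000
print(n_tests)   # 49995000000, i.e. about 5 x 10^10
```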

  10. Preliminary: ANOVA test and F-statistic
     • ANOVA test
       - Determines whether the group means are significantly different
       - Partitions the total sum of squares (SST) into the between-group sum of squares (SSB) and the within-group sum of squares (SSW): SST = SSB + SSW
     • F-statistic
       - SNPs {X1, X2, …, XN}
       - A quantitative phenotype Y
       - Single SNP test: F(Xi, Y)
       - SNP-pair test: F(XiXj, Y)
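The partition SST = SSB + SSW can be verified numerically for a SNP-pair test. The sketch below uses the first two SNP columns and the phenotype values from the example table on the earlier slides; the helper names are hypothetical. The slide's own F-statistic formula did not survive in the transcript, so the code uses the standard one-way ANOVA definition, F = (SSB / (g − 1)) / (SSW / (M − g)), with g groups and M individuals.

```python
def split_by_pair(xi, xj, y):
    """Group phenotype values by the genotype combination (xi, xj)."""
    groups = {}
    for a, b, value in zip(xi, xj, y):
        groups.setdefault((a, b), []).append(value)
    return groups


def anova_partition(groups):
    """Return (SST, SSB, SSW) for a list of non-empty groups."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    sst = sum((v - grand_mean) ** 2 for v in values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return sst, ssb, ssw


# First two SNP columns and phenotype values from the example table.
X1 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
X2 = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
Y = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

pair_groups = list(split_by_pair(X1, X2, Y).values())
sst, ssb, ssw = anova_partition(pair_groups)

n_groups = len(pair_groups)   # 4 genotype combinations occur here
n_individuals = len(Y)
f_pair = (ssb / (n_groups - 1)) / (ssw / (n_individuals - n_groups))

print(abs(sst - (ssb + ssw)) < 1e-9)  # True: the partition holds
print(f_pair)
```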

  11. Problem Formalization
     • Dataset: M individuals, N SNPs {X1, X2, …, XN}, a quantitative phenotype Y, and its K permutations {Y1, Y2, …, YK}
     • Maximum ANOVA test (F-statistic) value of permutation Yk: FYk = max {F(XiXj, Yk) | 1 ≤ i < j ≤ N}
     • Problem 1: Given Type I error threshold α, find the critical value Fα, which is the αK-th largest value among {FYk | 1 ≤ k ≤ K}
     • Problem 2: Given the threshold Fα, find all significant SNP-pairs, i.e. those with F(XiXj, Y) ≥ Fα

  12. Brute force approach
     • Problem 1: Permutation test to find the critical value
       - For permutation Yk, test all SNP-pairs to find the maximum test value FYk
       - Repeat for all permutations
       - Report the αK-th largest value in {FYk | 1 ≤ k ≤ K}
     • Problem 2: Finding significant SNP-pairs
       - For phenotype Y, test all SNP-pairs and report the SNP-pairs whose test values are above Fα
     • Problem 1 is more demanding due to the large number of permutations
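The brute-force baseline for both problems can be sketched as follows. This is an illustration of the baseline only, not FastANOVA itself. It assumes the standard one-way ANOVA F-statistic, random shuffles as the K permutations, and a rank of αK rounded down (at least 1); all function names are hypothetical, and the data are the example table from the earlier slides.

```python
import itertools
import random

ROWS = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0], [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]
SNPS = [list(column) for column in zip(*ROWS)]  # one list per SNP
PHENOTYPE = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]


def f_statistic(groups):
    """One-way ANOVA F-statistic, with guards for degenerate inputs."""
    values = [v for g in groups for v in g]
    if len(groups) < 2 or len(values) <= len(groups):
        return 0.0
    grand_mean = sum(values) / len(values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    if ssw == 0:
        return float("inf")
    return (ssb / (len(groups) - 1)) / (ssw / (len(values) - len(groups)))


def pair_f(xi, xj, y):
    """F(XiXj, Y): group individuals by genotype combination, then test."""
    groups = {}
    for a, b, value in zip(xi, xj, y):
        groups.setdefault((a, b), []).append(value)
    return f_statistic(list(groups.values()))


def brute_force(snps, y, n_permutations, alpha, seed=0):
    """Return (critical value, significant SNP-pairs) by exhaustive testing."""
    pairs = list(itertools.combinations(range(len(snps)), 2))
    rng = random.Random(seed)
    maxima = []
    # Problem 1: maximum pair statistic for each permuted phenotype.
    for _ in range(n_permutations):
        permuted = list(y)
        rng.shuffle(permuted)
        maxima.append(max(pair_f(snps[i], snps[j], permuted) for i, j in pairs))
    rank = max(1, int(alpha * n_permutations + 1e-9))
    f_alpha = sorted(maxima, reverse=True)[rank - 1]
    # Problem 2: pairs whose statistic on the real phenotype reaches F_alpha.
    significant = [(i, j) for i, j in pairs
                   if pair_f(snps[i], snps[j], y) >= f_alpha]
    return f_alpha, significant


f_alpha, significant = brute_force(SNPS, PHENOTYPE, n_permutations=200, alpha=0.05)
print(f_alpha, significant)
```

The inner loop runs once per permutation per SNP-pair, which is the cost the slides aim to avoid at genome-wide scale.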

  13. Overview of FastANOVA
     • Goal: scale large permutation tests to genome-wide studies
     • Question: do we have to perform ANOVA tests for every SNP-pair and repeat for all permutations?
     • Idea:
       - Develop an upper bound to filter out SNP-pairs that have no chance of becoming significant (all nodes are on the same level of the search tree, so no sub-tree pruning; how?)
       - Efficiently compute the upper bound: calculate the upper bound for a group of SNP-pairs together (possible?)
       - Identify redundant computations in the permutation tests (reuse computations; how?)

  14. The upper bound
     • For any SNP-pair (XiXj), F(XiXj, Y) ≥ Fα is equivalent to SSB(XiXj, Y) ≥ θ, where θ is fixed for a given Fα
     • Bound on SSB: the bound needs to be greater than θ for (XiXj) to be significant
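The equivalence between a threshold on F and a threshold on SSB follows from SSW = SST − SSB, since SST does not change when the phenotype values are permuted. The slide does not give θ explicitly; the sketch below derives it from the standard F-statistic under the assumption that all g = 4 genotype groups are non-empty, which gives θ = Fα (g − 1) SST / ((M − g) + Fα (g − 1)). The function names are made up for this example.

```python
def ssb_threshold(f_alpha, sst, n_individuals, n_groups=4):
    """Smallest SSB for which the F-statistic reaches f_alpha (derived, assumed form)."""
    df_between = n_groups - 1
    df_within = n_individuals - n_groups
    return f_alpha * df_between * sst / (df_within + f_alpha * df_between)


def f_from_ssb(ssb, sst, n_individuals, n_groups=4):
    """F-statistic written as a function of SSB, using SSW = SST - SSB."""
    df_between = n_groups - 1
    df_within = n_individuals - n_groups
    return (ssb / df_between) / ((sst - ssb) / df_within)


# Toy numbers: 12 individuals, an arbitrary SST, and a critical value of 10.
sst, m, f_alpha = 184.0, 12, 10.0
theta = ssb_threshold(f_alpha, sst, m)

print(theta)
print(f_from_ssb(theta, sst, m))        # equals f_alpha up to float round-off
print(f_from_ssb(theta + 1.0, sst, m))  # above f_alpha
print(f_from_ssb(theta - 1.0, sst, m))  # below f_alpha
```

Because F increases with SSB when SST is held fixed, any upper bound on SSB that falls below θ rules the pair out without computing its exact F-statistic.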

  15. The upper bound
     • Given Xi, Xj, and Y
     [Formula shown as an image on the slide. Annotations: one part is constant; the terms f(na) and f(nb) depend only on the genotype of Xj]

  16. Applying the upper bound
     • For a given Xi, let AP = {(XiXj) | i+1 ≤ j ≤ N}
     • Index the SNP-pairs in AP in the 2D space of (na, nb)
     [Figure: example index for X1. Entry (1,3): (X1X3), (X1X5); entry (3,3): (X1X6); entry (2,1): (X1X2), (X1X4)]

  17. Key properties
     • Maximum possible size: [formula not preserved in the transcript]
     • Many SNP-pairs share the same entry
     • All SNP-pairs in the same entry have the same upper bound value, because f(na) and f(nb) are the same
     • The indexing structure does not depend on the phenotype permutations
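The indexing idea can be illustrated with a small sketch. The transcript does not define na and nb, so the definition used here is an assumption: within the group of individuals with Xi = 0, na is the size of the smaller of the two subgroups induced by Xj, and nb is the same quantity within the group with Xi = 1. With that assumption, the example table from the earlier slides gives entries (1,3), (3,3), and (2,1) for X1, the same entries that appear in the slide's example figure. The function name `build_index` is hypothetical.

```python
from collections import defaultdict

ROWS = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0], [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]
SNPS = [list(column) for column in zip(*ROWS)]  # SNPS[0] is X1, SNPS[1] is X2, ...


def build_index(snps, i):
    """Bucket the SNP-pairs (Xi, Xj), j > i, by the assumed key (na, nb)."""
    xi = snps[i]
    index = defaultdict(list)
    for j in range(i + 1, len(snps)):
        xj = snps[j]
        # Values of Xj inside each of the two groups defined by Xi.
        in_a = [b for a, b in zip(xi, xj) if a == 0]
        in_b = [b for a, b in zip(xi, xj) if a == 1]
        # Assumed definition: size of the smaller Xj-subgroup in each Xi group.
        na = min(sum(in_a), len(in_a) - sum(in_a))
        nb = min(sum(in_b), len(in_b) - sum(in_b))
        index[(na, nb)].append(j)
    return dict(index)


# Index for X1 (position 0); stored values are 0-based positions of Xj.
index_x1 = build_index(SNPS, 0)
print(index_x1)  # {(2, 1): [1, 3], (1, 3): [2, 4], (3, 3): [5]}
```

The phenotype never enters `build_index`, which is why the same structure can be reused across all permutations, and five SNP-pairs collapse into three entries even on this tiny example.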

  18. Schema of FastANOVA (for permutation test)
     • For each Xi, index the SNP-pairs {(XiXj) | i+1 ≤ j ≤ N} in the 2D space of (na, nb)
     • For each permutation, find the candidate SNP-pairs by accessing the indexing structure
       - Candidates are SNP-pairs whose upper bounds are above the threshold
       - The dynamic threshold is the maximum test value found so far

  19. Complexity of FastANOVA
     • Time complexity
       - FastANOVA: O(N^2 M + K N M^2 + C M)
       - Brute force: O(K N^2 M)
     • Space complexity
       - O((N + K) M)
     • Notation: N = # SNPs, M = # individuals, K = # permutations, C = # candidates; M << N

  20. Brute force vs. FastANOVA
     • FastANOVA is two orders of magnitude faster than the brute force alternative
     • Dataset: #SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake)
     • SNP and phenotype data available at http://www.jax.org

  21. Pruning power of the bound

  22. Runtime of each component
     [Figure annotation: one-time cost]

  23. Future work
     • Association study involving more than two SNPs
       - Computationally much more demanding
       - Three loci vs. two loci: the number of combinations grows by a factor on the order of the number of SNPs
     • Association study for the heterozygous case
       - SNPs are encoded as ternary variables {0, 1, 2}
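The growth from two loci to three can be made concrete by comparing the number of SNP-triples with the number of SNP-pairs. The ratio is exactly (N − 2) / 3, which is on the order of N. The value N = 10,000 below is borrowed from the search-space example earlier in the talk.

```python
import math

n_snps = 10_000

n_pairs = math.comb(n_snps, 2)    # two-locus combinations
n_triples = math.comb(n_snps, 3)  # three-locus combinations

# C(N, 3) / C(N, 2) simplifies to (N - 2) / 3.
ratio = n_triples / n_pairs
print(n_pairs, n_triples, ratio)
```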

  24. Thank You ! Questions?
