
A coalescent computational platform to predict strength of association for clinical samples

A coalescent computational platform to predict strength of association for clinical samples. Genomic studies and the HapMap March 15-18, 2005 Oxford, United Kingdom. Gabor T. Marth. Department of Biology, Boston College marth@bc.edu. 1. Required marker density.


Presentation Transcript


  1. A coalescent computational platform to predict strength of association for clinical samples. Genomic studies and the HapMap, March 15-18, 2005, Oxford, United Kingdom. Gabor T. Marth, Department of Biology, Boston College, marth@bc.edu

  2. Focal questions about the HapMap: 1. Required marker density. 2. How to quantify the strength of allelic association in a genome region. 3. How to choose tagging SNPs. 4. How general the answers to these questions are among different human populations. (Figure panels: CEPH European samples; Yoruban samples.)

  3. Across samples from a single population? (random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)
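The subsampling described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `random_subset`, the `seed` parameter, and the placeholder chromosome labels are assumptions, not part of the talk.

```python
import random

def random_subset(chromosomes, size=60, seed=None):
    """Draw `size` chromosomes at random, without replacement."""
    rng = random.Random(seed)  # a seed makes a particular draw reproducible
    return rng.sample(chromosomes, size)

# Placeholder labels standing in for the 120 CEPH chromosomes (hypothetical).
chromosomes = [f"chrom_{i}" for i in range(120)]
subset_a = random_subset(chromosomes, seed=1)
subset_b = random_subset(chromosomes, seed=2)
```

Comparing association statistics between draws such as `subset_a` and `subset_b` is one way to look at variability across samples from a single population.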

  4. Possible consequence for marker performance: markers selected based on the allele structure of the HapMap reference samples … may not work well in another set of samples, such as those used for a clinical study.

  5. How to assess sample-to-sample variability? 1. Understanding fundamental characteristics of a given genome region, e.g. estimating local recombination rate from the data (McVean et al., Science 2004). 2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly. 3. It would be a desirable alternative to generate such additional sets with computational means.

  6. Towards a marker selection tool: 1. select markers (tag SNPs) with standard methods. 2. generate computational samples. 3. test the performance of markers across consecutive sets of computational samples.

  7. Generating additional computational haplotypes: 1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population. 2. Enforce data-relevance by requiring that the first set reproduces the observed haplotype structure of the HapMap reference samples. Calculate the “degree of relevance” as the data likelihood (the probability that the genealogy does produce the observed haplotypes). 3. Use the second haplotype set induced by the same mutations as our computational samples. 4. In subsequent statistics, weight each such set in proportion to the data likelihood calculated in step 2.
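The likelihood weighting in the last step can be illustrated with a small Python sketch. It assumes the statistic of interest has already been computed for each computational sample set and that the data likelihoods are available as plain numbers; the function name `likelihood_weighted_mean` and the example values are hypothetical.

```python
def likelihood_weighted_mean(values, likelihoods):
    """Average a per-set statistic, weighting each set by its data likelihood."""
    if len(values) != len(likelihoods):
        raise ValueError("need one likelihood per value")
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive number")
    return sum(v * w for v, w in zip(values, likelihoods)) / total

# Hypothetical numbers: a statistic measured in two computational sets,
# the second set being three times as likely to produce the reference data.
mean_stat = likelihood_weighted_mean([0.2, 0.8], [1.0, 3.0])
```

Sets that reproduce the reference haplotypes with higher probability therefore contribute more to any summary statistic.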

  8. Generating computational samples. Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem. We propose a method to generate “approximative” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K). (Figure labels: M, N.)

  9. Approximating M-site haplotypes as composites of overlapping K-site haplotypes: 1. generate K-site sets. 2. build M-site composites. (Figure label: M.)

  10. Piecing together neighboring K-site sets: hope that the constraint at overlapping markers preserves long-range marker association. (Figure: the eight possible 3-site haplotypes, 000 through 111, listed for two neighboring K-site sets.)
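One way to picture the piecing-together step is the Python sketch below. The talk does not give the algorithm, so several things here are assumptions: consecutive windows are shifted by one marker (so neighbors overlap at K-1 sites), each window holds the same number of haplotypes, and a composite is extended by any unused haplotype from the next window that agrees at the overlapping sites. The function name `build_composites` is hypothetical.

```python
import random
from collections import defaultdict

def build_composites(window_sets, k, seed=0):
    """Chain K-site haplotype sets into longer composite haplotypes.

    window_sets[w] lists the K-site haplotypes (0/1 strings) for markers
    w .. w+k-1. Assumes a one-marker shift between windows and k >= 2.
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that windows overlap")
    rng = random.Random(seed)
    composites = list(window_sets[0])
    for next_set in window_sets[1:]:
        # Group the next window's haplotypes by their first k-1 alleles.
        buckets = defaultdict(list)
        for hap in next_set:
            buckets[hap[:k - 1]].append(hap)
        for bucket in buckets.values():
            rng.shuffle(bucket)  # random pairing among compatible haplotypes
        extended = []
        for comp in composites:
            key = comp[-(k - 1):]  # alleles at the overlapping markers
            if not buckets[key]:
                raise ValueError(f"no compatible haplotype for overlap {key!r}")
            # Only the last allele of the matched haplotype is new.
            extended.append(comp + buckets[key].pop()[-1])
        composites = extended
    return composites

# Two neighboring 3-site sets for three chromosomes (made-up data).
composites = build_composites([["000", "011", "110"], ["001", "110", "100"]], k=3)
```

Each composite inherits its short-range structure exactly from the K-site sets; whether long-range association survives is the open question the slide raises.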

  11. Building composite haplotypes

  12. Initial results: 3-site composite haplotypes. (Figure panels: 30 CEPH HapMap reference individuals (60 chr); a typical 3-site composite.)

  13. 3-site composite vs. data

  14. 3-site composites: the “best case”. The “best-case” 3-site scenario: composite of exact 3-site sub-haplotypes. (Figure labels: “short-range”, “long-range”.)

  15. Variability across sets: The purpose of the composite haplotype sets … is to model sample variance across consecutive data sets. But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-site haplotypes are used.

  16. 4-site composite haplotypes. (Figure panel: 4-site composite.)

  17. “Best-case” 4-site composites: composite of exact 4-site sub-haplotypes.

  18. Variability across 4-site composites

  19. Variability across 4-site composites … is comparable to the variability across data sets.

  20. Technical/algorithmic improvements: 1. un-phased genotypes (figure: (AC)(CG)(AT)(CT); A G A C; C C T T ?). 2. markers with unknown ancestral state (figure: A C). 3. dealing with uninformative markers (figure: 01101000010101110, 11101000001010101, 11101000010101110, 01101000010101110). 4. taking into account local recombination rate.

  21. Software engineering aspects: efficiency. Currently, we run fresh Coalescent simulations at each K-site (several hours per region). This discards most Coalescent genealogies as irrelevant. The total number of genotyped SNPs is ~1 million, so there are ~1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Haplotype sets resulting from matches can be loaded into, stored in, and retrieved from a database efficiently. Storage estimate: 4 HapMap populations x 1 million K-sites x 1,000 comp sets x 50 bytes < 200 Gigabytes.
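The storage estimate on this slide is a plain product, which a short Python check makes explicit. The variable names are ours; the four factors are the ones quoted on the slide.

```python
populations = 4              # HapMap populations
k_sites = 1_000_000          # one K-site per genotyped SNP (~1 million)
sets_per_k_site = 1_000      # computational sets stored per K-site
bytes_per_set = 50           # assumed size of one stored haplotype set

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gigabytes (GiB)
print(f"{total_bytes:,} bytes = {total_gb:.0f} GB = {total_gib:.1f} GiB")
```

The product is 2 x 10^11 bytes, which is 200 decimal gigabytes or roughly 186 binary gigabytes, in line with the bound quoted on the slide when gigabytes are read in the binary sense.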

  22. Acknowledgements: Eric Tsung, Aaron Quinlan, Ike Unsal, Eva Czabarka (Dept. Mathematics, William & Mary)

  23. Testing markers with composite sets

  24. Using the HapMap: 1. genotype a set of reference samples. 2. compute strength of association. 3. select a smaller set of markers that capture most of the information present in the complete set of markers. 4. use these markers in clinical studies.
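The slide does not say which measure of association is meant. As one common choice, the Python sketch below computes the r² statistic between two biallelic markers from phased haplotypes coded as 0/1 strings; the coding, the function name `r_squared`, and the example data are assumptions made for illustration.

```python
def r_squared(haplotypes, i, j):
    """Return r^2 between markers i and j, or None if either is monomorphic."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n    # allele frequency at i
    p_j = sum(h[j] == "1" for h in haplotypes) / n    # allele frequency at j
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return None  # r^2 is undefined when a marker does not vary
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    return d * d / denom

# Made-up haplotypes: the two markers are perfectly correlated here.
perfect = r_squared(["00", "11", "00", "11"], 0, 1)
```

A tag SNP selection method would typically keep a marker when its r² with an already chosen tag falls below some threshold, though the threshold and method used in the talk are not stated.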

  25. Allele structure varies among populations. (Figure panels: CEPH European samples; Yoruban samples.)

  26. Data probability for composite haplotypes: Pr(composite) = Pr(K-site1) Pr(K-site1 ~ K-site2) Pr(K-site2) Pr(K-site2 ~ K-site3) Pr(K-site3) (motivation from composite likelihood methods for recombination rate estimation, e.g. by Hudson, Clark, Wall)
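The product above is easy to evaluate numerically, and for many K-sites it is safer to work with logarithms. The Python sketch below assumes the individual factors are already known: `site_probs` holds the Pr(K-site) terms and `link_probs` the Pr(K-site ~ next K-site) terms. Both names, the function `composite_log_prob`, and the example numbers are hypothetical, and how the link terms are obtained is not specified on the slide.

```python
import math

def composite_log_prob(site_probs, link_probs):
    """Log of prod Pr(K-site_i) * prod Pr(K-site_i ~ K-site_{i+1})."""
    if len(link_probs) != len(site_probs) - 1:
        raise ValueError("need exactly one link term between neighboring K-sites")
    factors = list(site_probs) + list(link_probs)
    if any(p < 0 or p > 1 for p in factors):
        raise ValueError("probabilities must lie in [0, 1]")
    if any(p == 0 for p in factors):
        return float("-inf")  # an impossible factor makes the composite impossible
    return sum(math.log(p) for p in factors)

# Three K-sites and the two links between them (made-up values).
log_p = composite_log_prob([0.5, 0.5, 0.5], [0.5, 0.5])
prob = math.exp(log_p)
```

With three K-sites this reproduces the five-factor product written on the slide.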

  27. Generating K-site haplotypes (K = 3, 4) against the reference data: 1 match per 100 – 10,000 Coalescent genealogies.

  28. Example: CFTR gene (Hinds et al., Science 2005)

  29. 4-site composite haplotypes. (Figure panels: HapMap data; 4-site composite #1; 4-site composite #2.)

  30. 4-site composites vs. data

  31. Why should this work? Tease apart two questions: (1) to what degree K-site composites preserve long-range correlations between markers (really, the quality of the approximation) and (2) the variability across different sets (what we are interested in).

  32. Example: 4-site approximation. (Figure panels: 4-site composite #1; 4-site composite #2; 4-site composite #3; 4-site composite #4.)
