
Fast Imputation Using Medium- or Low-Coverage Sequence Data


nedaa



Presentation Transcript


  1. Fast Imputation Using Medium- or Low-Coverage Sequence Data

  2. Topics • Cost of chip vs. sequence data • Chips: Nonlinear increase with SNP density • Sequence: Linear increase with read depth • Imputation methods for sequence data • Few programs designed for low read depth • Value of including HD chip in sequence data

  3. Analysis of chip vs. sequence data

  4. Imputation algorithm (findhap v4) • Prior allele probabilities = population frequency • Compute Prob(nA, nB | genotypes, errate) • Test ancestor haplotype likelihoods first • Find the 2 most likely haplotypes from the library • Compute haplotype posteriors from priors • Test long, then medium, then short segments
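The likelihood and posterior steps on this slide can be illustrated for a single SNP. The sketch below is an assumption-laden illustration, not findhap's actual code: it assumes a binomial read model (each read shows allele B with probability errate, 0.5, or 1 - errate for genotypes AA, AB, BB) and a Hardy-Weinberg prior built from the population allele frequency. The function names are hypothetical, and findhap itself works on haplotype segments rather than isolated SNPs.

```python
from math import comb

def count_likelihoods(n_a, n_b, errate):
    """Prob(nA, nB | genotype) for genotypes AA, AB, BB.

    Assumed model: reads are independent and each shows allele B with
    probability errate (AA), 0.5 (AB), or 1 - errate (BB).
    """
    p_b_given_geno = (errate, 0.5, 1.0 - errate)
    n_reads = n_a + n_b
    return [comb(n_reads, n_b) * p ** n_b * (1.0 - p) ** n_a
            for p in p_b_given_geno]

def genotype_posteriors(n_a, n_b, freq_b, errate=0.01):
    """Posterior Prob(AA), Prob(AB), Prob(BB) given read counts.

    The prior comes from the population frequency of allele B
    (Hardy-Weinberg proportions, an assumption of this sketch).
    """
    prior = [(1.0 - freq_b) ** 2, 2.0 * freq_b * (1.0 - freq_b), freq_b ** 2]
    like = count_likelihoods(n_a, n_b, errate)
    joint = [p * l for p, l in zip(prior, like)]
    total = sum(joint)
    if total == 0.0:
        # Counts are impossible under the prior; fall back to the prior.
        return prior
    return [j / total for j in joint]
```

With no reads at a SNP the posterior equals the prior, which is why low read depth leans so heavily on population frequencies and haplotype information.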

  5. Data sets and imputation tests

  6. Computation required • Bulls: 250 sequenced + 250 HD, 1 chromosome • Time (10 processors): findhap 10 min, Beagle v4 3 days • Memory: findhap 5 Gbytes, Beagle <5 Gbytes • Input data: findhap 0.5 Gbytes, Beagle 5 Gbytes • findhap: 2 bytes / SNP [A, B counts stored as hexadecimal] • Beagle: 20 bytes / SNP [Prob(AA), Prob(AB), Prob(BB)] • Output data: findhap 1 byte vs. Beagle 20 bytes / SNP
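The storage figures on this slide follow from the encoding: two characters per SNP for the A and B counts versus about 20 characters for three genotype probabilities. Below is a minimal sketch, assuming each count is written as one hexadecimal digit and capped at 15. The cap and the function names are assumptions for illustration, not findhap's documented file format.

```python
def encode_counts(n_a, n_b):
    """Pack one SNP's A and B read counts into 2 hex characters.

    Assumption: counts above 15 are capped so each fits one hex digit.
    """
    return format(min(n_a, 15), "x") + format(min(n_b, 15), "x")

def decode_counts(code):
    """Recover (nA, nB) from a 2-character hex code."""
    return int(code[0], 16), int(code[1], 16)

# Ratios implied by the numbers on the slide.
bytes_ratio = 20 / 2               # Beagle vs. findhap input bytes per SNP
speed_ratio = (3 * 24 * 60) / 10   # 3 days vs. 10 minutes, in minutes
```

The 10-fold byte ratio matches the 5 Gbytes vs. 0.5 Gbytes of input data, and the time ratio of 432 is in line with the "up to 400 times faster" claim in the conclusions.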

  7. Accuracy of findhap vs. Beagle • 250 bulls had sequence + HD, 250 others were imputed from HD

  8. Accuracy from HD for bulls × depth • Sequences had 1% error, HD imputed using findhap

  9. Accuracy including HD in sequence • Correlations of estimated with true genotypes for 500 bulls sequenced with 1% error and 250 bulls with HD only

  10. Imputation from 10K, 60K, 1X, or 2X • Reference population is 500 bulls, 8X read depth, 1% error

  11. Sequenced human read depth × error • 884 humans sequenced for 394,724 SNPs on chromosome 22

  12. Software at http://aipl.arsusda.gov • Simulate genotypes (programs written 2007) • pedsim.f90, markersim.f90, genosim.f90 • Simulate A and B counts, Poisson plus error • geno2seq.f90 • Impute using haplotype likelihood ratios • findhap.f90 version 4
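The "Poisson plus error" simulation of A and B counts can be sketched as follows. This is a hypothetical illustration, not a translation of geno2seq.f90. It assumes the number of reads at a SNP is Poisson with mean equal to the read depth, each read samples one of the animal's two alleles at random, and each read is miscalled as the other allele with probability errate.

```python
import math
import random

def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's method (fine for low mean depth)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_counts(n_b_alleles, depth, errate, rng):
    """Simulate (nA, nB) read counts at one SNP.

    n_b_alleles is the true genotype coded as the number of B alleles
    (0 = AA, 1 = AB, 2 = BB); depth is the mean read depth.
    """
    n_a = n_b = 0
    for _ in range(poisson(depth, rng)):
        # Sample one of the two alleles, then apply a possible read error.
        is_b = rng.random() < n_b_alleles / 2.0
        if rng.random() < errate:
            is_b = not is_b
        if is_b:
            n_b += 1
        else:
            n_a += 1
    return n_a, n_b

# Example: a heterozygous animal at 2X coverage with 1% read error.
example = simulate_counts(1, 2.0, 0.01, random.Random(2024))
```

Passing the random generator explicitly keeps a simulated data set reproducible from its seed.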

  13. Actual HD genotype correlations²

  14. Simulated HD correlations²

  15. Conclusions • High read depth is expensive (linear cost) • Low read depth requires additional math • Haplotype probabilities | (A B counts, error) • Imputation improved with findhap version 4 • Up to 400 times faster than Beagle • findhap more accurate for low coverage • Some gain from including HD in sequence

  16. Acknowledgments • Jeff O’Connell and Derek Bickhart provided helpful advice on sequence analysis and software design and testing
