Parallel Genehunter: Implementation of a linkage analysis package for distributed memory architectures
Michael Moran
CMSC 838T Presentation
May 9, 2003
Introduction
• Goals
  • Link genes to specific loci in the genome
  • Decrease time and memory requirements through parallelization
• Motivation
  • Locate genes for specific phenotypes
  • Test for inherited diseases and risk factors
  • Gene therapy
CMSC 838T – Presentation
Talk Overview
• Introduction
• Talk Overview
• Genetic Linkage Problem
• Previous Work
• Parallel Genehunter
• Evaluation
• Observations
Genetic Linkage Problem
• Sexual reproduction
  • Offspring are created from two haploid gametes
  • Gametes are produced from diploid/polyploid cells during meiosis
www.blc.arizona.edu/courses/181gh/rick/genetics1/
Genetic Linkage Problem
• Recombination occurs in two ways
  • Random segregation of chromatids
    • 2 x 23 human chromosomes => $2^{23}$ possible haploid combinations
    • Genes on different chromosomes recombine with probability 1/2
www.gen.umn.edu/faculty_staff/hatch/1131/
Genetic Linkage Problem
• Recombination occurs in two ways
  • Random segregation of chromatids
  • Crossover between homologous pairs of chromosomes
    • Genes on the same chromosome recombine with a probability that depends on their distance and location on the chromosome
Genetic Linkage Problem
Given
• This model of recombination
• Data for a particular pedigree (family)
  • Phenotype information for each individual
  • Genetic markers for each individual
• Recombination frequencies for each pair of markers
Can we apply probabilistic methods to
• Reconstruct the inheritance patterns
• Link phenotypes to the markers
Previous Work
• Fisher, Haldane, Smith, Morton (1935-1955)
  • Methods to infer genetic maps using maximum likelihood estimators
• Elston, Stewart (1971): genetic linkage algorithm
  • Linear in pedigree size
  • Exponential in number of markers
• Lander, Green (1987): genetic linkage algorithm
  • Linear in number of markers
  • Exponential in pedigree size
Previous Work
• Genehunter (2001)
  • Implementation of Lander & Green
  • Analyzes a pedigree containing $n$ non-founders
  • The inheritance of a gene by one non-founder can be summarized by two bits
  • The entire pedigree’s inheritance pattern can be summarized by $2n$ bits
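The two-bits-per-non-founder encoding can be sketched as follows. This is only an illustration: the bit order (paternal bit before maternal bit, first non-founder in the highest bits) and the function names are assumptions, not Genehunter's actual layout.

```python
def encode_inheritance(bits_per_child):
    """Pack [(paternal_bit, maternal_bit), ...] into one integer index.

    A bit is 0 if that parent passed on the grandfather's copy and
    1 if that parent passed on the grandmother's copy.
    """
    index = 0
    for paternal, maternal in bits_per_child:
        index = (index << 2) | (paternal << 1) | maternal
    return index


def decode_inheritance(index, n):
    """Inverse of encode_inheritance for a pedigree with n non-founders."""
    pairs = []
    for _ in range(n):
        pairs.append(((index >> 1) & 1, index & 1))
        index >>= 2
    return pairs[::-1]


n = 3                                   # hypothetical pedigree with 3 non-founders
pattern = [(0, 1), (1, 1), (1, 0)]      # one inheritance pattern
idx = encode_inheritance(pattern)       # position of this pattern in a probability vector
num_patterns = 2 ** (2 * n)             # number of distinct inheritance patterns
```

With this layout, every inheritance pattern maps to a unique index between 0 and 2^(2n) - 1, which is why a single vector of that length can hold one probability per pattern.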
Previous Work
• 3 steps of Genehunter, Step 1: for each marker, calculate the probability of each possible inheritance pattern
  • Store the probabilities in a vector of size $2^{2n}$
  • Bit values: 0 = grandfather’s chromatid, 1 = grandmother’s chromatid
  • Example: Pr([0,0]) = .5, Pr([0,1]) = .5, Pr([1,0]) = 0, Pr([1,1]) = 0
Previous Work
• 3 steps of Genehunter, Step 2: for each marker, calculate the conditional probability of each inheritance pattern, conditional on all of the markers to the left and to the right
  • For two markers’ inheritance vectors, each disagreeing bit requires a crossover event
  • The probability of transitioning between inheritance vectors $i$ and $j$ differing in $d$ bits is $\theta^{d} (1-\theta)^{2n-d}$, where $\theta$ is the recombination fraction between the two markers
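A minimal sketch of the transition probabilities, assuming the standard model in which each of the 2n bits flips independently with probability theta (the recombination fraction between the two markers), so that vectors differing in d bits are connected with probability theta^d (1 - theta)^(2n - d). The value of `theta` and the function names are hypothetical.

```python
def hamming_distance(i, j):
    """Number of bits in which inheritance vectors i and j disagree."""
    return bin(i ^ j).count("1")


def transition_probability(i, j, theta, num_bits):
    """theta**d * (1 - theta)**(num_bits - d), with d the Hamming distance."""
    d = hamming_distance(i, j)
    return theta ** d * (1.0 - theta) ** (num_bits - d)


num_bits = 4      # 2n bits for n = 2 non-founders
theta = 0.1       # hypothetical recombination fraction
size = 2 ** num_bits

# Full transition matrix; only feasible for tiny pedigrees.
M = [[transition_probability(i, j, theta, num_bits) for j in range(size)]
     for i in range(size)]
row_sums = [sum(row) for row in M]
```

Each row sums to 1 because the terms expand (theta + (1 - theta))^(2n). The matrix has 2^(2n) x 2^(2n) entries, which is what makes an explicit matrix-vector product impractical for larger pedigrees.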
Previous Work
• 3 steps of Genehunter, Step 2: for each marker, calculate the conditional probability of each inheritance pattern, conditional on all of the markers to the left and to the right
  • $M_{i,j}$ = probability of transitioning between inheritance vectors $i$ and $j$
  • $P_1$, $P_2$ = probability vectors over every inheritance pattern given markers 1 and 2, respectively
  • $P_{2|1} = P_2 \circ (M P_1)$, where $\circ$ is the element-wise product
  • Calculate the probabilities of each marker’s inheritance conditional on all others by Markov chain or FFT convolution
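The conditioning step can be sketched as below. The sketch assumes that M[i][j] depends only on the number of differing bits (theta^d (1 - theta)^(bits - d)), in which case M P1 is a convolution over bit patterns and the matching "FFT" is the Walsh-Hadamard transform. That reading, the function names, and the sample numbers are assumptions for illustration, not Genehunter's code.

```python
def walsh_hadamard(vec):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    return out


def kernel_entry(d, theta, bits):
    """Transition probability for vectors that differ in d of `bits` bits."""
    return theta ** d * (1.0 - theta) ** (bits - d)


def apply_transition_direct(p, theta):
    """Explicit matrix-vector product M p, quadratic in len(p)."""
    size, bits = len(p), len(p).bit_length() - 1
    return [sum(kernel_entry(bin(i ^ j).count("1"), theta, bits) * p[j]
                for j in range(size)) for i in range(size)]


def apply_transition_fast(p, theta):
    """Same product via transform, pointwise multiply, inverse transform."""
    size, bits = len(p), len(p).bit_length() - 1
    kernel = [kernel_entry(bin(i).count("1"), theta, bits) for i in range(size)]
    spectrum = [a * b for a, b in zip(walsh_hadamard(kernel), walsh_hadamard(p))]
    return [x / size for x in walsh_hadamard(spectrum)]


theta = 0.1                              # hypothetical recombination fraction
p1 = [0.5, 0.5, 0.0, 0.0]                # marker 1 probabilities
p2 = [0.25, 0.25, 0.25, 0.25]            # marker 2 probabilities (uninformative)
direct = apply_transition_direct(p1, theta)
fast = apply_transition_fast(p1, theta)
p2_given_1 = [a * b for a, b in zip(p2, fast)]   # element-wise product
```

The direct version touches every pair of patterns, while the transform version needs only on the order of N log N operations for a vector of length N, which is the motivation for replacing the matrix product.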
Previous Work
• 3 steps of Genehunter, Step 3: for each marker, calculate the probability of the unknown gene being located at specific locations
  • Hypothesizes that the phenotype has a gene located at a particular location
  • By default, tries 5 evenly spaced locations between each consecutive pair of markers
  • Calculates $P_D$, the probabilities of each inheritance pattern based on this phenotype (as in step 1)
  • For a location $x$ between markers $i$ and $i+1$, $p = P_D \cdot P_{x|1 \ldots i} \cdot P_{x|i+1 \ldots m}$
• Space requirement: $O(2^{2n})$, reduced to $O(2^{2n-f})$ by exploiting the symmetry of $f$ founders
• Time requirement: $O(m 2^{2n})$, reduced to $O(m 2^{2n-f})$ with $f$ founders
Parallel Genehunter
• Approach
  • Parallelize the 3 Genehunter steps separately
  • Divide each $2^{2n}$-sized marker vector evenly among the $P$ processors
    • Allows greater distribution of memory than assigning $O(m/P)$ entire vectors to each processor
Parallel Genehunter
• Parallelization of step 1: for each marker, calculate the probability of each possible inheritance pattern
  • Each processor calculates the probabilities for a particular set of $2^{2n}/P$ inheritance patterns for every marker
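One way to picture the even split of a vector among P processors is a contiguous block partition, sketched below. The contiguous layout and the name `local_range` are assumptions; the slides only say that the vector is divided evenly.

```python
def local_range(rank, num_procs, num_bits):
    """Half-open index range [start, stop) owned by processor `rank`.

    Assumes num_procs divides 2**num_bits, as it does for any power-of-two
    processor count no larger than the vector length.
    """
    size = 2 ** num_bits
    if size % num_procs != 0:
        raise ValueError("num_procs must divide the vector length")
    block = size // num_procs
    return rank * block, (rank + 1) * block


# Hypothetical example: a 6-bit problem (64 patterns) on 4 processors.
ranges = [local_range(rank, 4, 6) for rank in range(4)]
```

Because every processor owns the same index range of every marker's vector, any operation that combines vectors entry by entry needs no communication.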
Parallel Genehunter
• Parallelization of step 2: for each marker, calculate the conditional probability of each inheritance pattern, conditional on all of the markers to the left and to the right
• FFT convolution
  • As in serial Genehunter, the $2^{2n} \times 2^{2n}$ matrix-vector multiplication is replaced by FFT-based convolution:
    • 2 forward 1D FFTs on $2^{2n}$-length vectors
    • Element-by-element multiplication
    • Inverse FFT
  • Each 1D FFT is equivalent to a 2D FFT on a $P \times 2^{2n}/P$ matrix
  • There are well-known distributed algorithms for this FFT using all-to-all communication
• Element-wise product in $P_{2|1} = P_2 \circ (M P_1)$
  • Trivially parallelized: each processor has the same portion of each vector
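The equivalence between a 1D transform and a 2D transform on a P x (N/P) matrix can be checked with a single-process simulation. The sketch assumes the transform is the Walsh-Hadamard transform over bit patterns; plain lists stand in for processors and a matrix transpose stands in for the all-to-all exchange. None of this is taken from the Parallel Genehunter source.

```python
def walsh_hadamard(vec):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    return out


def transpose(rows):
    """Stand-in for the all-to-all exchange between processors."""
    return [list(col) for col in zip(*rows)]


def distributed_transform(vec, num_procs):
    """Transform a vector stored as num_procs contiguous blocks."""
    block = len(vec) // num_procs
    rows = [vec[r * block:(r + 1) * block] for r in range(num_procs)]
    rows = [walsh_hadamard(row) for row in rows]   # local work, no communication
    cols = transpose(rows)                         # all-to-all
    cols = [walsh_hadamard(col) for col in cols]   # transform across the processor index
    rows = transpose(cols)                         # all-to-all back to the block layout
    return [x for row in rows for x in row]


vec = [float((7 * i) % 5) for i in range(16)]      # arbitrary test data
serial = walsh_hadamard(vec)
parallel = distributed_transform(vec, 4)
```

The two results agree because this transform factors exactly into a pass over the low-order bits (within a block) and a pass over the high-order bits (across blocks), so the only communication is the pair of transposes.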
Parallel Genehunter
• Parallelization of step 3: for each marker, calculate the probability of the unknown gene being located at specific locations
• Computing $P_{x|1 \ldots i}$ and $P_{x|i+1 \ldots m}$
  • FFTs parallelized as in step 2
• Final dot product $p = P_D \cdot P_{x|1 \ldots i} \cdot P_{x|i+1 \ldots m}$
  • Parallelized as in step 2
  • Each processor holds the same portion of each vector
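A sketch of the final product, reading it as the sum over inheritance patterns of the three vectors multiplied entry by entry (an assumption about the notation). Each simulated processor sums its own block, and a final sum stands in for a reduction across processors. The vector values and function names are made up for illustration.

```python
def local_partial_sum(pd, left, right, start, stop):
    """Contribution of one processor's block to the final product."""
    return sum(pd[k] * left[k] * right[k] for k in range(start, stop))


def parallel_likelihood(pd, left, right, num_procs):
    """Simulate num_procs processors, each owning one contiguous block."""
    block = len(pd) // num_procs
    partials = [local_partial_sum(pd, left, right, r * block, (r + 1) * block)
                for r in range(num_procs)]
    return sum(partials)   # stands in for a sum-reduction across processors


# Hypothetical 3-bit example (8 inheritance patterns).
pd = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
left = [0.2, 0.1, 0.3, 0.1, 0.1, 0.1, 0.05, 0.05]
right = [0.125] * 8

serial = sum(a * b * c for a, b, c in zip(pd, left, right))
parallel = parallel_likelihood(pd, left, right, 4)
```

Only one number per processor has to be communicated, which is why this step is cheap compared with the transforms.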
Evaluation
• Experimental environment
  • Input data sets
    • 51-family-member pedigree
    • {19, 21, 24}-bit data sets (# bits = $2n-f$)
  • Computing facilities
    • Cplant cluster (Sandia National Laboratories)
    • DEC Alpha EV6 processors
    • Myrinet connection
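For a sense of scale, the size of a single probability vector for these problem sizes can be estimated as below. The 8 bytes per entry (one double-precision value) is an assumption, and a real run holds many such vectors at once, so these are lower bounds per vector rather than measured footprints.

```python
BYTES_PER_ENTRY = 8   # assumption: one double-precision value per inheritance pattern


def vector_mebibytes(num_bits, num_procs=1):
    """Memory for one 2**num_bits-entry vector, per processor, in MiB."""
    return (2 ** num_bits) * BYTES_PER_ENTRY / num_procs / 2 ** 20


sizes = {bits: vector_mebibytes(bits) for bits in (19, 21, 24)}
per_proc_24 = vector_mebibytes(24, num_procs=64)
```

Under these assumptions one vector takes 4, 16, and 128 MiB for the 19-, 21-, and 24-bit problems, and splitting the 24-bit vector over 64 processors leaves 2 MiB per processor.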
Evaluation
• Runtimes for the 19-, 21-, and 24-bit problems (plots not reproduced)
Observations
• Pro: Performs the Genehunter computation exactly
• Pro: Effective for “multipoint linkage” of phenotypes
• Con: Old-fashioned compared to protein-based methods (?)
• Pro: Distributes memory requirements
• Pro: More computers allow larger feasible inputs
• Con: Experiments based on 1 pedigree
• Pro: Efficient parallelization up to 32 or 64 processors
• Con: Allows pedigrees to grow by only 3 or 4 individuals in equal time
References
• Genetic Recombination: Dr. Craig Woodworth. Genetic Recombination in Eukaryotes. Lecture Notes (www.clarkson.edu/class/by214/powerpoint).
• Genehunter: K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis Through Reduction of Inheritance Space. American Journal of Human Genetics 68, 2001.
• Parallel Genehunter: G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory Architectures. Proceedings of the First IEEE Workshop on High Performance Computational Biology, International Parallel and Distributed Computing Symposium, 2002.
Questions?