200 likes | 304 Views
Genotype Calling. Matt Schuerman. Biological Problem. How do we know an individual’s SNP values (genotype)? Each SNP can have two values (A/B) Each individual has two copies of the SNP Probes can be used to measure how well a particular SNP matches values
E N D
Genotype Calling Matt Schuerman
Biological Problem • How do we know an individual’s SNP values (genotype)? • Each SNP can have two values (A/B) • Each individual has two copies of the SNP • Probes can be used to measure how well a particular SNP matches values • Need a reliably way to declare values based on probe measurements
Computational Problem • Given a set of data points how can we partition them to maximize similarity within subsets? • The clustering problem • Similarity function arbitrary, but often based on statistical or distance measures • Several accepted algorithms
Standard Solutions • Algorithms exist which call HapMap genotypes with >99% accuracy • Not general, many hidden parameters tuned to work on existing data • Other algorithms require prior knowledge such as how many clusters are present • Again, not general
My Solution • Wanted a more general method with few tuned parameters • Mine has almost no “tuned” parameters • Wanted a fast solution • Many accepted clustering algorithm have exponential run times • Mine is O(n2), but closer to linear in practice
My Solution • Convolve gaussian kernel over data to find initial cluster candidates • Iteratively re-calculate cluster parameters and then re-assign data points to clusters • Assign calls to clusters based on ratio of probe measurements
Phase 1: Initial clusters • Bin data points to grid • Convolve with a 5x5 gaussian kernel • All peaks are considered potential clusters
Phase 2: Cluster Iteration • While the clusters are changing … • Calculate the mean position and covariance matrix of each cluster • Merge clusters within 3 standard deviations of each other using Mahalanobis distance • Assign each data point to the cluster with the shortest Mahalanobis distance
Phase 2: Cluster Iteration Iteration 1 …
Phase 2: Cluster Iteration Iteration 2 …
Phase 2: Cluster Iteration Iteration 3 …
Phase 2: Cluster Iteration Iteration 4, no change so done!
Phase 3: Assigning calls • Based on the ratio of x to y at the center of each cluster • If y/x ~ 1.3, then call as BB • If y/x ~ 1, then call as AB • If y/x ~ 0.7, then call as AA • If 2 or 3 clusters are present, then find which is closest to these values
Results • Clustering works much better when done within populations • Algorithm’s performance is comparable across all populations • Testing 1111 SNPs in the Affy 100K XBA CEU dataset found to be 96.47% accurate
Results: Example Assignment Ignore point at (10,10). One incorrect call in black.
Results • Sometimes assigning calls is problematic • Sometimes clusters get improperly split • Sometimes clusters get improperly merged • Sometimes the grouping is right, but one of the clusters was miscalled • Could probably be fixed if set ratios more precisely
Conclusions • Accuracy is close to that of best published algorithms • Faster run time • Simpler approach with less tuning • Need to run more data