Merlin is a multipoint engine for rapid likelihood inference, providing improved pedigree analysis, efficient computations, and memory optimization. Learn what's wrong with Genehunter and the advantages of using Merlin. Explore its algorithms and interface.
Merlin - Multipoint Engine for Rapid Likelihood Inference
Ido Feldman
Agenda
• What’s wrong with Genehunter
• What’s Merlin
• Merlin from the user’s viewpoint
• Merlin: Pros and Cons
• Algorithms and Ideas
What’s wrong with Genehunter?
• Running time and memory are both exponential in the size of the pedigree.
• Intractable for pedigrees larger than 22-23.
• Ideal solution: an algorithm that is polynomial in both pedigree size and number of markers.
• No such luck so far… We’ll try to improve the constants.
Merlin is an improved Genehunter
• Pedigree error detection
• Pedigree simplification
• Smart inheritance vectors
• Efficient and approximate computations
Merlin – Getting Started
• Input consists of 3 files:
  • Pedigree file
  • Data file (elaborated later)
  • Map file (contains marker positions)
Merlin Input – Pedigree Example
Family  Person  Parent1  Parent2  Sex  Diabetes  Glucose  HLA-DR  HLA-DQ
1       1       0        0        1    1         x        3 3     4 3
1       2       0        0        2    1         3.000    4 4     1 1
1       3       0        0        1    1         8.000    1 2     x x
1       4       1        2        2    1         3.500    4 3     1 4
1       5       3        4        2    2         1.234    1 3     3 4
1       6       3        4        1    2         4.321    2 4     1 1
1       7       0        0        1    1         5.500    1 2     4 2
1       8       7        4        2    1         6.231    1 4     4 1
2       1       0        0        1    1         6.000    4 3     1 4
2       2       0        0        2    2         7.000    3 4     5 3
2       3       0        0        1    1         7.700    1 2     2 4
2       4       1        2        2    1         4.000    4 3     1 5
2       5       0        0        2    1         3.600    3 5     2 4
2       6       3        4        1    2         1.234    1 3     1 4
2       7       3        4        1    2         3.321    2 4     2 5
2       8       3        4        2    1         5.175    1 4     1 5
2       9       5        6        2    2         0.512    3 3     4 4
Merlin Input – Data file
• Links each column name to a type:
  • A diabetes (A = Affected/Not Affected)
  • T glucose (T = Trait)
  • M HLA-DR (M = Marker)
  • M HLA-DQ
• Twin status can also be encoded.
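To make the link between the two files concrete, here is a minimal sketch that uses the data-file entries to label the fields of one pedigree row. It assumes the layout shown on the slides: five fixed columns, then one column per A or T item and two allele columns per M item, with x marking a missing value. The function names `parse_data_file` and `parse_row` are hypothetical and are not part of Merlin.

```python
# Illustrative sketch only; this is not Merlin's own parser.
DATA_FILE = """\
A diabetes
T glucose
M HLA-DR
M HLA-DQ
"""

# Assumed fixed columns at the start of every pedigree row.
FIXED = ["family", "person", "parent1", "parent2", "sex"]


def parse_data_file(text):
    """Return a list of (type_code, name) pairs, one per data-file line."""
    items = []
    for line in text.splitlines():
        if line.strip():
            code, name = line.split(None, 1)
            items.append((code.upper(), name.strip()))
    return items


def parse_row(row, items):
    """Label the whitespace-separated fields of one pedigree row."""
    fields = row.split()
    record = dict(zip(FIXED, fields[:len(FIXED)]))
    pos = len(FIXED)
    for code, name in items:
        width = 2 if code == "M" else 1  # assumed: a marker takes two allele columns
        values = [None if v == "x" else v for v in fields[pos:pos + width]]
        record[name] = tuple(values) if code == "M" else values[0]
        pos += width
    if pos != len(fields):
        raise ValueError("row has %d fields, expected %d" % (len(fields), pos))
    return record


items = parse_data_file(DATA_FILE)
record = parse_row("1 5 3 4 2 2 1.234 1 3 3 4", items)
print(record)
```

Values are kept as strings here; a real reader would also convert trait values to numbers and validate the affection codes.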
Advantages and Disadvantages
• Pros:
  • Merlin is VERY fast.
  • Multipoint IBD calculations are exact.
• Cons:
  • Can’t handle very large pedigrees (but still better than Genehunter!)
Merlin is fast! – Example pedigrees
Merlin is fast! – Results
Merlin needs less memory
• Thanks to the compact storage of inheritance vectors, Merlin uses less memory than Genehunter.
• Example pedigree:
  • Genehunter: 1024MB
  • Exact Merlin: 100MB
  • Approximate Merlin: 4MB-54MB
• Automatic disk-swapping.
Pedigree Simplifications
• Tear families apart…
(Figure: example pedigree diagram)
More Pedigree Simplifications
• Remove unneeded people…
(Figure: example pedigree diagram)
Input Error Correction
• Discovers impossible genotypes.
• Reports unlikely recombinations.
• When an error is ambiguous, reports the most likely mistake.
(Figure: pedigree with genotypes C/D, A/B, A/C and A/B)
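The simplest kind of impossible genotype can be illustrated with a single-marker Mendelian check: a child must carry one allele from each parent. The sketch below assumes that, in the slide's figure, C/D and A/B are the parents and A/C and A/B are the children; the function name `is_mendelian` is hypothetical. It does not cover the likelihood-based detection of unlikely recombinations.

```python
from itertools import product


def parse(genotype):
    """Split a genotype such as 'A/B' into its two alleles."""
    return tuple(genotype.split("/"))


def is_mendelian(child, father, mother):
    """True if the child can receive one allele from each parent."""
    wanted = sorted(parse(child))
    return any(
        sorted((a, b)) == wanted
        for a, b in product(parse(father), parse(mother))
    )


# Assumed reading of the figure: parents C/D and A/B.
print(is_mendelian("A/C", "C/D", "A/B"))  # True: C from one parent, A from the other
print(is_mendelian("A/B", "C/D", "A/B"))  # False: neither A nor B can come from C/D
```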
From IV to packed tree
(Figure: an inheritance vector with entries 0, 1, ..., n drawn as a packed binary tree. The first level branches on the 1st meiosis (0 or 1), the second level on the 2nd meiosis, and so on.)
From packed trees to sparse trees
• Idea: Prune unneeded subtrees.
  • Subtrees with zero likelihood
  • Symmetric nodes: seeing one is like seeing the rest
• Pruning at level i removes a subtree of size $O(2^{n-i})$.
• IV order is important!
Case 1: Zero Likelihood
(Figure: pedigree with genotypes a/A, A/A and A/A)
Any IV with IV[0]=0 has zero likelihood!
Case 2: Symmetric Nodes
(Figure: pedigree with genotypes a/A, A/A and A/A)
A vector with IV[1]=0 has a twin with IV[1]=1 and the same likelihood.
From Packed Tree to Sparse Trees
(Figure: the same tree in packed and in sparse form. Legend: node with zero likelihood; node identical to its sibling; L1, L2 = likelihood for this branch.)
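The two pruning rules can be sketched on a toy example. The code below builds a sparse tree from a likelihood function over 3-bit inheritance vectors, collapsing all-zero subtrees and storing identical siblings once. The likelihood function, the tuple representation and all names are assumptions made for illustration. Unlike Merlin, this sketch still visits every leaf while building, so it shows only the storage saving, not the time saving.

```python
ZERO = "zero"  # marker for a subtree whose vectors all have zero likelihood


def build_sparse(likelihood, n, prefix=()):
    """Return a sparse tree covering all n-bit vectors that extend `prefix`."""
    if len(prefix) == n:
        value = likelihood(prefix)
        return ZERO if value == 0 else ("leaf", value)
    left = build_sparse(likelihood, n, prefix + (0,))
    right = build_sparse(likelihood, n, prefix + (1,))
    if left == ZERO and right == ZERO:
        return ZERO                # case 1: prune a zero-likelihood subtree
    if left == right:
        return ("same", left)      # case 2: identical siblings, keep one copy
    return ("node", left, right)


def count_nodes(tree):
    """Number of nodes actually stored."""
    if tree == ZERO or tree[0] == "leaf":
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])


def lookup(tree, iv):
    """Recover the likelihood of one inheritance vector from the sparse tree."""
    for bit in iv:
        if tree == ZERO:
            return 0.0
        tree = tree[1] if tree[0] == "same" else tree[1 + bit]
    return 0.0 if tree == ZERO else tree[1]


def toy_likelihood(iv):
    """Assumed toy model: bit 0 = 0 is impossible, bit 1 is irrelevant."""
    if iv[0] == 0:
        return 0.0
    return 0.25 if iv[2] == 0 else 0.5


n = 3
tree = build_sparse(toy_likelihood, n)
print("packed nodes:", 2 ** (n + 1) - 1)   # 15
print("sparse nodes:", count_nodes(tree))  # 6
```

Because pruning near the root removes the largest subtrees, the order in which meioses are assigned to tree levels affects how much is saved.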
Reminder: The forward algorithm
(Figure: hidden Markov model with hidden states H_1, H_2, ..., H_i and observations X_1, X_2, ..., X_i)
Every member in the matrix is of the form (step i):
$$P(x_1,\dots,x_i,h_i) = \sum_{h_{i-1}} P(x_1,\dots,x_{i-1},h_{i-1})\, P(h_i \mid h_{i-1})\, P(x_i \mid h_i)$$
Note that in step i of the forward algorithm, we multiply a transition matrix of size $2^{2n} \times 2^{2n}$ with vectors of size $2^{2n}$.
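The recursion can be written out directly for a small generic hidden Markov model. The sketch below is a plain-Python illustration on an assumed two-state toy model. In the pedigree setting the hidden states would be inheritance vectors, and the state space would have $2^{2n}$ entries rather than two.

```python
def forward(init, trans, emit, observations):
    """Return alpha, where alpha[h] = P(x_1, ..., x_i, h_i = h) at the last step.

    init[h]     = P(h_1 = h)
    trans[g][h] = P(h_i = h | h_{i-1} = g)
    emit[h][x]  = P(x_i = x | h_i = h)
    """
    states = range(len(init))
    alpha = [init[h] * emit[h][observations[0]] for h in states]
    for x in observations[1:]:
        # One forward step: a matrix-vector product followed by the emission term.
        alpha = [
            sum(alpha[g] * trans[g][h] for g in states) * emit[h][x]
            for h in states
        ]
    return alpha


# Assumed toy model with two hidden states and two observation symbols.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.3, 0.7]]

alpha = forward(init, trans, emit, [0, 1])
print(alpha)       # joint probabilities P(x_1, x_2, h_2)
print(sum(alpha))  # likelihood of the observations, P(x_1, x_2)
```

The inner sum is the matrix-vector product whose cost dominates each step when the number of states is large.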
Transition matrix is a bottleneck
• Matrix-vector multiplication: $\Theta(N^2)$
• In our case, $N = 2^{2n}$.
• If the matrix were sparse (k non-zero entries, $k \ll N^2$), the multiplication would be cheap.
• Trivial implementation: list of lists
(Figure: example of a sparse 6 x 6 matrix)
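A list-of-lists representation stores, for each row, only the non-zero entries as (column, value) pairs, so a matrix-vector product touches k entries instead of $N^2$. The sketch below is a generic illustration with an assumed 4 x 4 matrix; the function names are hypothetical.

```python
def to_list_of_lists(dense):
    """Keep, for each row, only the (column, value) pairs that are non-zero."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]


def sparse_matvec(rows, vector):
    """Multiply a list-of-lists matrix by a vector, visiting non-zero entries only."""
    return [sum(v * vector[j] for j, v in row) for row in rows]


# Assumed example matrix: 6 non-zero entries out of 16.
dense = [
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 4, 0],
    [5, 0, 0, 6],
]
rows = to_list_of_lists(dense)
result = sparse_matvec(rows, [1, 1, 1, 1])
print(rows[0])  # [(0, 1), (3, 2)]
print(result)   # [3, 3, 4, 11]
```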
Multipoint analysis in dense maps
• Idea: Close markers → negligible chance of consecutive recombinations.
• Used for approximate solutions.
• Allowing <3 recombinants gives an almost exact solution, yet it is 3 times faster and uses half the memory.
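To see why capping the number of recombinants loses little when markers are close, consider one row of the transition matrix between inheritance vectors. The sketch assumes the standard form in which moving between two m-bit vectors that differ in d bits has probability $\theta^d (1-\theta)^{m-d}$, with a single recombination fraction $\theta$ for all meioses. The values m = 10 and $\theta$ = 0.01 and the name `truncated_row` are illustrative assumptions, and the slides do not say that Merlin implements the approximation this way.

```python
from itertools import combinations


def truncated_row(v, m, theta, max_recomb=2):
    """Sparse transition row for vector v (an m-bit int), keeping <= max_recomb flips."""
    row = []
    for d in range(max_recomb + 1):
        p = theta ** d * (1 - theta) ** (m - d)  # assumed transition probability
        for flipped in combinations(range(m), d):
            w = v
            for bit in flipped:
                w ^= 1 << bit  # flip one meiosis outcome
            row.append((w, p))
    return row


m, theta = 10, 0.01
row = truncated_row(0, m, theta)
kept_mass = sum(p for _, p in row)
print(len(row), "of", 2 ** m, "entries kept")  # 56 of 1024
print("probability mass kept:", kept_mass)
```

With these values the row keeps only a small fraction of its entries while retaining almost all of the probability mass, which is the sense in which the approximation is "almost exact".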
Summary
• Detects data errors and unlikely data.
• Simplifies pedigrees.
• Uses sparse trees to exploit symmetries and impossible data.
• Uses sparse matrices to speed up matrix-vector multiplication.
• Open source.
More info
• http://bioinformatics.well.ox.ac.uk/Merlin
• "Merlin - rapid analysis of dense genetic maps using sparse gene flow trees", Gonçalo R. Abecasis, Stacey S. Cherny, William O. Cookson, and Lon R. Cardon. Nat Genet. 2002 Jan;30(1):97-101.
Intermission…