Inference of Complex Genealogical Histories in Populations and Application in Mapping Complex Traits
Yufeng Wu
Dept. of Computer Science and Engineering, University of Connecticut
DIMACS 2008
Genealogy: Evolutionary History of Genomic Sequences
• Tells how sequences in a population are related
• Helps to explain diseases: a disease mutation occurs on a branch, and all descendants of that branch carry the mutation
• The genealogy is unknown; only SNP haplotypes (binary sequences) are available
• Problem: infer the genealogy of “unrelated” haplotypes
• Not easy, partly due to recombination
[Figure: genealogy with a disease mutation on one branch; the leaves are sequences in the current population, labeled diseased (case) or healthy (control)]
Recombination
• One of the principal genetic forces shaping sequence variation within species
• Two equal-length sequences generate a third sequence of the same length in the genealogy: a prefix from one sequence joined to a suffix from the other at a breakpoint
• Spatial order is important: different parts of the genome are inherited from different ancestors
[Figure: sequences 110001111111001, 1100 00000001111, and 000110000001111, with prefix, suffix, and breakpoint marked]
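The prefix/suffix picture can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `recombine`, the 0-based breakpoint convention, and the two example parents are assumptions, not taken from the slides.

```python
def recombine(prefix_parent: str, suffix_parent: str, breakpoint: int) -> str:
    """Join a prefix of one parent to a suffix of the other.

    `breakpoint` is the number of sites taken from `prefix_parent`
    (a hypothetical convention chosen for this sketch).
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# Hypothetical parents: the first 3 sites come from one, the rest from the other.
child = recombine("11110000", "00001111", 3)
print(child)
```

The recombinant always has the same length as its parents, and each site is inherited from exactly one of them, which is why the spatial order of sites matters.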
Ancestral Recombination Graph (ARG)
• Assumption: at most one mutation per site
[Figure: example ARGs over two-site sequences with mutation and recombination events marked; leaf sequences shown as S1 = 00, S2 = 01, S3 = 10, S4 = 11 and as S1 = 00, S2 = 01, S3 = 10, S4 = 10]
What is the Use of an ARG?
• One may look at the ARG directly.
• But for noisy data, there is another way of using ARGs: an ARG represents a set of local trees!
• Local trees: evolutionary history for different genomic regions between recombination breakpoints.
[Figure: example data and ARG with four-site node sequences such as 0000, 0101, 0110, 1110, 1010; the local tree near site 3 is highlighted]
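One way to see why an ARG represents a set of local trees is to store, on every edge, the range of sites inherited through it; the local tree at a site then keeps only the edges whose range covers that site. The tiny ARG below, its node names, and the edge encoding are all hypothetical and chosen only for illustration.

```python
# Each node maps to its parent edges: (parent, first_site, end_site_exclusive).
# A recombination node has two parent edges covering disjoint site ranges.
ARG = {
    "x": [("p", 0, 2)],
    "y": [("q", 0, 2)],
    "r": [("p", 0, 1), ("q", 1, 2)],  # recombinant: breakpoint between sites 0 and 1
    "p": [("root", 0, 2)],
    "q": [("root", 0, 2)],
}


def local_tree(arg, site):
    """Return the child -> parent map of the local tree at `site`."""
    parents = {}
    for node, edges in arg.items():
        for parent, lo, hi in edges:
            if lo <= site < hi:
                parents[node] = parent
    return parents


print(local_tree(ARG, 0))  # r hangs under p, next to x
print(local_tree(ARG, 1))  # r hangs under q, next to y
```

In this toy example the two local trees differ only in where the recombinant leaf `r` attaches.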
At which Local Tree Did Disease Mutations Occur?
• Clear separation of cases and controls is not expected for complex diseases
[Figure: local tree with a possible disease mutation on a branch; leaves labeled case or control]
How to infer ARGs?
• But we do not know the true ARG!
• Goal: infer ARGs from haplotypes
• First practical ARG association mapping method (Minichiello and Durbin, 2006)
  • Uses plausible ARGs: heuristic
  • Less complex disease model: implicitly assumes one disease mutation with major effect
• My results (Wu, RECOMB 2007)
  • Generates ARGs with a provable property and works with a well-defined complex disease model
  • Focuses on parsimonious histories
Simulation Results (Wu, 2007)
• TMARG/MARGARITA: sample ARGs, decompose them into local trees, and look for association signals.
• LATAG: infers local trees at focal points.
• Average mapping error for 50 simulated datasets from Zollner and Pritchard
• Comparison: TMARG (minARGs), TMARG (near minARGs), LATAG (Z. P.), MARGARITA (M. D.)
• TMARG (my program) and MARGARITA are much faster than LATAG.
Preliminary Results: GAW16 Data
• GAW16 data from the North American Rheumatoid Arthritis Consortium (NARAC): 868 cases and 1194 controls. Chromosome one: 40929 SNPs.
• Running TMARG on large-scale data
  • Break into non-overlapping windows
  • Run fastPHASE (Scheet and Stephens, 2006) to obtain haplotypes
  • Run TMARG in chi-square mode
• Caution: more investigation needed.
[Figure annotation: SNP rs2476601, reported in Begovich et al., 2004 and Carlton et al., 2005 ?]
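The windowing step and a chi-square score can be sketched as follows. This is not TMARG's implementation: the window size, the function names, and the use of a plain 2x2 Pearson chi-square on one tree branch (cases and controls below the branch versus elsewhere) are assumptions made for illustration.

```python
def windows(n_sites, size):
    """Yield (start, end) index pairs of non-overlapping windows over the SNPs."""
    for start in range(0, n_sites, size):
        yield start, min(start + size, n_sites)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Here a, b = cases and controls below a tree branch,
    and c, d = cases and controls in the rest of the tree.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / denom


print(list(windows(10, 4)))
print(chi_square_2x2(3, 1, 1, 3))
```

A branch whose descendants are enriched for cases gives a larger statistic; for a complex disease the enrichment is expected to be partial rather than a clean case/control separation.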
A Related Problem: Inference of Local Tree Topologies Directly (Wu, 2008, Submitted)
Inference of Local Tree Topologies
• Recall: an ARG represents a set of local trees.
• Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site, ignoring branch lengths)
• Hein (1990, 1993)
• Song and Hein (2003, 2005): enumerate all possible tree topologies at each site
• Parsimony-based
Local Tree Topologies
• Key technical difficulty: enumerating all tree topologies
• Brute-force enumeration of local tree topologies: not feasible when the number of sequences > 9
• Trivial solution: create a tree for a SNP containing the single split induced by the SNP.
  • Always correct (assuming one mutation per site)
  • But not very informative: need more refined trees!
[Figure: SNP with alleles A: 0, B: 0, C: 1, D: 0, E: 1, F: 0, G: 1, H: 0 and the corresponding single-split tree; leaf labels A C B E D F G H]
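The trivial solution amounts to partitioning the sequences by their allele at the SNP. A minimal sketch using the alleles from the slide's example (the function name `induced_split` and the dictionary encoding are assumptions):

```python
def induced_split(alleles):
    """Return the split (ones, zeros) that a SNP induces on the sequences."""
    ones = frozenset(name for name, allele in alleles.items() if allele == 1)
    zeros = frozenset(alleles) - ones
    return ones, zeros


# Alleles of sequences A..H at one SNP, as in the slide's example.
snp = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}
ones, zeros = induced_split(snp)
print(sorted(ones), sorted(zeros))
```

The resulting tree has a single internal edge separating {C, E, G} from {A, B, D, F, H}; everything within each side is left unresolved, which is why more refined trees are needed.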
How to do better? Neighboring Local Trees are Similar!
• Nearby SNP sites provide hints!
• Nearby local trees are often topologically similar
• Recombination often alters only small parts of the trees
• Key idea: reconstruct local trees by combining information from multiple nearby SNPs
RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly non-binary) tree topology
• Initialize to a tree containing the split induced by the SNP
• Gradually refine the trees by adding new splits to them
  • Splits are found by a set of rules (later)
  • Splits added early may be more reliable
• Stop when the trees are binary or enough information is recovered
A Little Background: Compatibility
  M   1  2  3
  a   0  0  0
  b   1  0  0
  c   0  0  1
  d   1  0  1
  e   0  1  1
Sites 1 and 2 are compatible, but 1 and 3 are incompatible.
• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.
• A split s is incompatible with tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.
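The four-gamete condition is easy to check mechanically. The sketch below encodes the matrix M from this slide (rows a to e, three sites) and uses 0-based site indices; the function name and the encoding are assumptions.

```python
# Rows a..e of the matrix M; character k of each string is site k+1.
M = {"a": "000", "b": "100", "c": "001", "d": "101", "e": "011"}


def compatible(rows, p, q):
    """True unless columns p and q together show all four gametes 00, 01, 10, 11."""
    gametes = {(row[p], row[q]) for row in rows}
    return len(gametes) < 4


rows = list(M.values())
print(compatible(rows, 0, 1))  # sites 1 and 2
print(compatible(rows, 0, 2))  # sites 1 and 3
```

For sites 1 and 3 the rows a, b, c, d contribute the gametes 00, 10, 01, 11, so the pair is incompatible; sites 1 and 2 only show 00, 10, and 01.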
Fully-Compatible Region: Simple Case
• A region of consecutive SNP sites where these SNPs are pairwise compatible.
• May indicate that no topology-altering recombination occurred within the region
• Rule: for site s, add the split induced by any site in such a region containing s to the tree at s.
• Compatibility is a very strong property and is unlikely to arise by chance.
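A sketch of how such a region might be found around a site. The slide does not say how the region is chosen when several maximal regions contain s; extending left first and then right, as done here, is an assumption, as are the function names. The rows are those of the compatibility matrix M (sequences a to e), with 0-based site indices.

```python
def compatible(rows, p, q):
    """Four-gamete test: True unless columns p, q show all of 00, 01, 10, 11."""
    return len({(row[p], row[q]) for row in rows}) < 4


def compatible_region(rows, s):
    """Grow a pairwise-compatible run of consecutive sites around site s.

    Extends to the left as far as possible, then to the right (an assumed
    tie-breaking choice). Returns inclusive 0-based bounds (lo, hi).
    """
    n_sites = len(rows[0])
    lo = hi = s
    while lo > 0 and all(compatible(rows, lo - 1, t) for t in range(lo, hi + 1)):
        lo -= 1
    while hi < n_sites - 1 and all(compatible(rows, hi + 1, t) for t in range(lo, hi + 1)):
        hi += 1
    return lo, hi


rows = ["000", "100", "001", "101", "011"]  # sequences a..e
print(compatible_region(rows, 0))
print(compatible_region(rows, 2))
```

Under the rule on this slide, the splits induced by all sites between `lo` and `hi` would then be added to the tree at s.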
Split Propagation: More General Rule
• Three consecutive sites 1, 2 and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
• Trees at sites 1 and 2 are different.
• Suppose site 3 is compatible with sites 1 and 2. Then?
• Site 3 may indicate a shared subtree in both trees at sites 1 and 2.
• Rule: a split propagates in both directions until reaching an incompatible tree.
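The propagation rule can be sketched with each tree represented as a set of splits, and each split as the set of leaves on one side. This is an illustrative reading of the rule, not the RENT implementation; the data structures and function names are assumptions, and the example reuses the splits of the five-sequence matrix from the compatibility slide.

```python
def splits_compatible(a, b, leaves):
    """Two splits are compatible unless all four intersections are non-empty."""
    a_c, b_c = leaves - a, leaves - b
    return not (a & b and a & b_c and a_c & b and a_c & b_c)


def propagate(trees, i, split, leaves):
    """Add `split` from site i to neighboring trees, in both directions,
    stopping in each direction at the first tree incompatible with it."""
    for step in (-1, 1):
        j = i + step
        while 0 <= j < len(trees) and all(
            splits_compatible(split, t, leaves) for t in trees[j]
        ):
            trees[j].add(split)
            j += step


leaves = frozenset("abcde")
s1, s2, s3 = frozenset("bd"), frozenset("e"), frozenset("cde")
trees = [{s1}, {s2}, {s3}]  # one single-split tree per site

propagate(trees, 0, s1, leaves)  # reaches site 2, blocked at site 3
propagate(trees, 1, s2, leaves)  # reaches both neighbors
print(trees)
```

In this example the split {b, d} stops before site 3 because it is incompatible with {c, d, e}, while the split {e} is compatible with everything and ends up in all three trees.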
One Subtree-Prune-Regraft (SPR) Event
• Recombination: simulated by SPR.
• The rest of the two trees (without the pruned subtrees) remains the same
• Rule: find a compatible subtree Ts in neighboring trees T1 and T2 such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding back Ts.
• More complex rules are possible.
[Figure: subtree to prune]
Simulation
• Hudson’s program MS (with known coalescent local tree topologies): 100 datasets for each setting.
• Handles much larger data than Song and Hein’s method, and performs better or similarly on small data.
• Tests local tree topology recovery, scored by Song and Hein’s shared-split measure
[Figure: two result panels, labeled “= 15” and “= 50”]
Acknowledgement
• More information available at: http://www.engr.uconn.edu/~ywu
• I want to thank
  • Dan Gusfield
  • Yun S. Song
  • Charles Langley
  • Dan Brown
• And the National Science Foundation and UConn Research Foundation