Inferring Local Tree Topologies for SNP Sequences Under Recombination in a Population
Yufeng Wu
Dept. of Computer Science and Engineering, University of Connecticut, USA
MIEP 2008
Genetic Variations
• Single-nucleotide polymorphism (SNP): a site (genomic location) where two types of nucleotides occur frequently in the population.
• Haplotype: a binary vector of SNPs (encoded as 0/1).
• Haplotypes offer hints on genealogy.
• Each SNP induces a split.
[Figure: DNA sequences AATGTAGCCGA, AATATAACCTA, AATGTAGCCGT, AATGTAACCTA, CATATAGCCGT, with their haplotypes over the five SNP sites: 00100, 01010, 00101, 00010, 11101]
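As a minimal sketch of how each SNP induces a split, the Python below takes the five haplotypes shown on this slide and partitions the sequence indices by allele at each site. The name `site_split` is illustrative and does not come from the talk.

```python
# Haplotypes from the slide: one string per sequence, one character per SNP site.
HAPLOTYPES = ["00100", "01010", "00101", "00010", "11101"]


def site_split(haplotypes, site):
    """Return (zeros, ones): indices of the haplotypes carrying 0 or 1 at `site` (0-based)."""
    zeros = frozenset(i for i, h in enumerate(haplotypes) if h[site] == "0")
    ones = frozenset(i for i, h in enumerate(haplotypes) if h[site] == "1")
    return zeros, ones


if __name__ == "__main__":
    for s in range(len(HAPLOTYPES[0])):
        zeros, ones = site_split(HAPLOTYPES, s)
        print(f"site {s + 1}: {sorted(zeros)} | {sorted(ones)}")
```

In this data, sites 3 and 4 carry complementary columns, so they induce the same bipartition of the five sequences.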
Genealogy: Evolutionary History of Genomic Sequences
• Tells how individuals in a population are related.
• Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations.
• Problem: how to determine the genealogy for “unrelated” individuals?
• Complicated by recombination.
[Figure: genealogy with a disease mutation on one branch; the leaves are individuals in the current population, labeled diseased (case) or healthy (control)]
Recombination
• One of the principal genetic forces shaping sequence variations within species.
• Two equal-length sequences generate a third, new equal-length sequence in the genealogy.
• Spatial order is important: different parts of the genome are inherited from different ancestors.
[Figure: parent sequences 110001111111001 and 000110000001111; the recombinant joins the prefix 11000 to the suffix 0000001111 at the breakpoint]
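The prefix/suffix picture can be sketched in a few lines of Python. The function name `recombine` and the 0-based breakpoint convention are assumptions made for illustration; the two parent sequences and the resulting prefix and suffix are the ones shown on the slide.

```python
def recombine(prefix_parent, suffix_parent, breakpoint):
    """Single-crossover recombination of two equal-length sequences.

    The child copies sites [0, breakpoint) from `prefix_parent` and
    sites [breakpoint, end) from `suffix_parent`.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


if __name__ == "__main__":
    # Prefix 11000 from the first parent, suffix 0000001111 from the second.
    print(recombine("110001111111001", "000110000001111", 5))
```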
Ancestral Recombination Graph (ARG)
• Assumption: at most one mutation per site.
[Figure: an ARG over two sites with mutation and recombination events labeled; sequences listed as S1 = 00, S2 = 01, S3 = 10, S4 = 11, and in a second listing as S1 = 00, S2 = 01, S3 = 10, S4 = 10]
Local Trees
• An ARG represents a set of local trees.
• Each tree covers a contiguous genomic region.
• No recombination between two sites implies the same local tree at both sites.
• Local tree topology: informative and useful.
[Figure: an ARG and its local trees, captioned “local tree near sites 1 and 2”, “local tree near site 2”, and “local tree to the right of site 3”]
Inference of Local Tree Topologies
• Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site, ignoring branch lengths).
• Hein (1990, 1993): enumerate all possible tree topologies at each site.
• Song and Hein (2003, 2005): parsimony-based.
• Local tree reconstruction can be formulated as inference on a hidden Markov model.
Local Tree Topologies
• Key technical difficulty: brute-force enumeration of local tree topologies is not feasible when the number of sequences exceeds 9, so we cannot enumerate all tree topologies.
• Trivial solution: for each SNP, create a tree containing only the single split induced by that SNP.
• Always correct (assuming one mutation per site).
• But not very informative: we need more refined trees!
[Figure: a SNP with alleles A: 0, B: 0, C: 1, D: 0, E: 1, F: 0, G: 1, H: 0, and the corresponding single-split tree over leaves A, C, B, E, D, F, G, H]
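To give a sense of scale for the enumeration problem, the sketch below uses the standard counting result that there are (2n - 3)!! rooted binary leaf-labeled topologies on n leaves; this formula is not stated on the slide. It also builds the trivial single-split tree for the SNP in the figure. Both function names are illustrative, and treating allele 1 as the derived (mutant) state is an assumption.

```python
def num_rooted_binary_topologies(n_leaves):
    """Count rooted binary leaf-labeled tree topologies: (2n - 3)!! for n >= 2."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    for k in range(3, 2 * n_leaves - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count


def trivial_tree(alleles):
    """Single-split tree for one SNP, represented as a set of clades.

    Assumption: allele 1 is derived, so its carriers form the only internal clade.
    """
    return {frozenset(name for name, a in alleles.items() if a == 1)}


if __name__ == "__main__":
    for n in (4, 9, 10):
        print(n, num_rooted_binary_topologies(n))
    snp = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}
    print(trivial_tree(snp))
```

With 9 sequences there are already 2,027,025 such topologies per site, and 34,459,425 with 10, which is consistent with the slide's point that enumeration stops being practical around that size.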
How to do better? Neighboring Local Trees are Similar!
• Nearby SNP sites provide hints!
• Nearby local trees are often topologically similar.
• Recombination often alters only small parts of the trees.
• Key idea: reconstruct local trees by combining information from multiple nearby SNPs.
RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly non-binary) tree topology.
• Initialize it to a tree containing only the split induced by the SNP.
• Gradually refine the trees by adding new splits to them.
• Splits are found by a set of rules (described later).
• Splits added early may be more reliable.
• Stop when the trees are binary or enough information has been recovered.
A Little Background: Compatibility
Matrix M (rows: sequences a to g; columns: sites 1 to 5):
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1
Sites 1 and 2 are compatible, but sites 1 and 3 are incompatible.
• Two sites (columns) p and q are incompatible if columns p and q together contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.
• A split s is incompatible with a tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.
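The pairwise test above (often called the four-gamete test) is short enough to write out. The sketch below encodes the matrix M from this slide as strings and checks the two pairs mentioned on it; the function name `compatible` and the 0-based column indices are choices made here for illustration.

```python
# Matrix M from the slide: rows a..g, columns are sites 1..5.
M = [
    "00010",  # a
    "10010",  # b
    "00100",  # c
    "10100",  # d
    "01100",  # e
    "01101",  # f
    "00101",  # g
]


def compatible(matrix, p, q):
    """Four-gamete test: columns p and q (0-based) are compatible
    unless all four pairs 00, 01, 10, 11 occur among the rows."""
    gametes = {(row[p], row[q]) for row in matrix}
    return len(gametes) < 4


if __name__ == "__main__":
    print("sites 1 and 2 compatible:", compatible(M, 0, 1))
    print("sites 1 and 3 compatible:", compatible(M, 0, 2))
```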
Fully-Compatible Region: Simple Case
• A fully-compatible region is a run of consecutive SNP sites that are pairwise compatible.
• May indicate that no topology-altering recombination occurred within the region.
• Rule: for site s, add the split of any site in a fully-compatible region containing s to the tree at s.
• Compatibility is a very strong property and is unlikely to arise by chance.
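One way to locate such regions is to grow an interval from each starting site while every new site stays compatible with all sites already in it, keeping only intervals not contained in a longer one. This is a sketch under that assumption, not the talk's implementation; the data is the 7-by-5 example matrix from the compatibility slide, and the slides do not say how overlapping regions are reconciled.

```python
# Example matrix from the compatibility slide (rows a..g, sites 1..5).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]


def compatible(matrix, p, q):
    """Four-gamete test on columns p and q (0-based)."""
    return len({(row[p], row[q]) for row in matrix}) < 4


def maximal_compatible_regions(matrix):
    """Return maximal intervals (start, end), inclusive and 0-based,
    of consecutive sites that are pairwise compatible."""
    n_sites = len(matrix[0])
    regions = []
    prev_end = -1
    for start in range(n_sites):
        end = start
        # Extend while the next site is compatible with every site in the interval.
        while end + 1 < n_sites and all(
            compatible(matrix, k, end + 1) for k in range(start, end + 1)
        ):
            end += 1
        # Intervals that do not reach past the previous one are contained in it.
        if end > prev_end:
            regions.append((start, end))
            prev_end = end
    return regions


if __name__ == "__main__":
    print(maximal_compatible_regions(M))
```

On the example matrix this yields three overlapping regions: sites 1 to 2, sites 2 to 4, and sites 3 to 5.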
Split Propagation: More General Rule
• Three consecutive sites 1, 2 and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
• The trees at sites 1 and 2 are different.
• Suppose site 3 is compatible with sites 1 and 2. Then?
• Site 3 may indicate a shared subtree in both trees at sites 1 and 2.
• Rule: a split propagates in both directions until it reaches an incompatible tree.
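The propagation rule can be sketched with a deliberately simplified tree representation: each local tree is just the set of site indices whose splits it currently contains, and a split may enter a tree only if it passes the four-gamete test against every split already there. Processing sites left to right is an assumption made for this sketch; the slides do not specify the order used in RENT. The data is the example matrix from the compatibility slide.

```python
# Example matrix from the compatibility slide (rows a..g, sites 1..5).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]


def compatible(matrix, p, q):
    """Four-gamete test on columns p and q (0-based)."""
    return len({(row[p], row[q]) for row in matrix}) < 4


def propagate_splits(matrix):
    """Propagate each site's split left and right until an incompatible tree is met.

    trees[t] is the set of site indices whose splits are in the tree at site t.
    """
    n_sites = len(matrix[0])
    trees = [{s} for s in range(n_sites)]  # start from the trivial single-split trees
    for s in range(n_sites):
        for step in (-1, 1):  # walk left, then right
            t = s + step
            while 0 <= t < n_sites and all(
                compatible(matrix, s, u) for u in trees[t]
            ):
                trees[t].add(s)
                t += step
    return trees


if __name__ == "__main__":
    for t, splits in enumerate(propagate_splits(M), start=1):
        print(f"tree at site {t}: splits of sites {sorted(u + 1 for u in splits)}")
```

Because each tree is checked in its current state, the result depends on processing order, which fits the earlier remark that splits added early may be more reliable.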
Unique Refinement
• Consider the subtree with leaves 1, 2 and 3.
• Which refinement is more likely?
• Add the split of 1 and 2: it is the only split that is compatible with the neighboring tree T2.
• Rule: refine a non-binary node by the only split that is compatible with the neighboring trees.
[Figure: an unresolved node, marked “?”, over leaves 1, 3 and 2]
One Subtree-Prune-Regraft (SPR) Event
• Recombination: simulated by SPR.
• The rest of the two trees (without the pruned subtrees) remains the same.
• Rule: find an identical subtree Ts in neighboring trees T1 and T2 such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding back Ts.
• More complex rules are possible.
[Figure: tree with the subtree to prune marked]
Simulation
• Hudson’s program MS (with known coalescent local tree topologies): 100 datasets for each setting.
• Handles much larger data than Song and Hein’s method, and performs better or similarly on small data.
• Local tree topology recovery is scored by Song and Hein’s shared-split measure.
[Figure: simulation results; the surviving panel labels read “= 15” and “= 50”]
Acknowledgement
• Software available upon request.
• More information available at: http://www.engr.uconn.edu/~ywu
• I want to thank Yun S. Song and Dan Gusfield.