Reconstructing Kinship Relationships in Wild Populations
Bhaskar DasGupta (UIC)
"I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage." Maya Angelou
Mary Ashley (UIC), Tanya Berger-Wolf (UIC), W. Art Chaovalitwongse (Rutgers), Ashfaq Khokhar (UIC), Chun-An (Joe) Chou (Rutgers), Priya Govindan (Rutgers), Saad Sheikh (Ecole Polytechnique), Isabel Caballero (UIC), Alan Perez-Rathke (UIC)
Microsatellites (STR)
Alleles (CA repeats of different lengths): #1 CACACACA, #2 CACACACACACA, #3 CACACACACACACA
Genotypes: 1/1, 2/2, 1/2, 1/3, 2/3, 3/3
• Advantages:
  • Codominant (easy inference of genotypes and allele frequencies)
  • Many heterozygous alleles per locus
  • Possible to estimate other population parameters
  • Cheaper than SNPs
• But:
  • Few loci
• And:
  • Large families
  • Self-mating
  • …
Diploid Siblings
Siblings: two children with the same parents
Question: given a set of children, find the sibling groups
father: (.../...),(a/b),(.../...),(.../...)
mother: (.../...),(c/d),(.../...),(.../...)
child: (.../...),(e/f),(.../...),(.../...)
Each pair (x/y) is a locus and each entry is an allele. At every locus the child has one allele from the father and one from the mother.
Why Reconstruct Sibling Relationships?
• Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
• Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness
• But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier
The Problem
Sibling Groups: {2, 4, 5, 6}, {1, 3}, {7, 8}
Inheritance Rules
father: (.../...),(a/b),(.../...),(.../...)
mother: (.../...),(c/d),(.../...),(.../...)
child 1: (.../...),(e1/f1),(.../...),(.../...)
child 2: (.../...),(e2/f2),(.../...),(.../...)
child 3: (.../...),(e3/f3),(.../...),(.../...)
…
child n: (.../...),(en/fn),(.../...),(.../...)
4-allele rule: siblings have at most 4 distinct alleles in a locus
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
  a = number of distinct alleles
  R = number of alleles that appear with 3 others or are homozygous
Our Approach: Mendelian Constraints
4-allele rule: siblings have at most 4 different alleles in a locus
  Yes: 3/3, 1/3, 1/5, 1/6
  No: 3/3, 1/3, 1/5, 1/6, 3/2
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
  a = number of distinct alleles
  R = number of alleles that appear with 3 others or are homozygous
  Yes: 3/3, 1/3, 1/5
  No: 3/3, 1/3, 1/5, 1/6
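The two rules can be checked mechanically for one locus. The sketch below is only an illustration of the rules as stated on the slide, not the authors' implementation; the function names are made up here, and "appears with 3 others" is assumed to mean "is paired with at least 3 distinct other alleles".

```python
from itertools import chain


def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at this locus."""
    return len(set(chain.from_iterable(genotypes))) <= 4


def two_allele_ok(genotypes):
    """2-allele rule as stated on the slide: a + R <= 4."""
    alleles = set(chain.from_iterable(genotypes))
    homozygous = {x for x, y in genotypes if x == y}
    # For each allele, the distinct other alleles it is paired with.
    partners = {al: set() for al in alleles}
    for x, y in genotypes:
        if x != y:
            partners[x].add(y)
            partners[y].add(x)
    # R counts an allele once if it is homozygous somewhere
    # or is paired with at least 3 other alleles (assumed reading).
    r = sum(1 for al in alleles
            if al in homozygous or len(partners[al]) >= 3)
    return len(alleles) + r <= 4


# The slide's examples, one locus each, genotypes as (allele, allele) pairs.
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))          # Yes case
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)]))  # No case
print(two_allele_ok([(3, 3), (1, 3), (1, 5)]))                   # Yes case
print(two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))           # No case
```

A candidate sibling group would have to pass the check at every sampled locus.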
Our Approach: Sibling Reconstruction
Given: n diploid individuals sampled at l loci
Find: minimum number of 2-allele sets that contain all individuals
• NP-complete even when we know sibsets are at most 3; 1.0065 approximation gap (Ashley et al. ’09)
• ILP formulation (Chaovalitwongse et al. ’07, ’10)
• Minimum Set Cover based algorithm with optimal solution, using CPLEX (Berger-Wolf et al. ’07)
• Parallel implementation (Sheikh, Khokhar, BW ’10)
Canonical families
[Figure: genotype sets of the canonical families over alleles 1 to 4, for example 1/1, 1/2, 2/2 and 1/3, 1/4, 2/3, 2/4]
Aside: Minimum Set Cover
Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U
Find: the smallest number of sets in S whose union is the universe U
• Minimum Set Cover is NP-hard
• (1 + ln n)-approximable (sharp)
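The (1 + ln n) guarantee is the one achieved by the standard greedy heuristic: repeatedly pick the set that covers the most still-uncovered elements. The sketch below shows that generic heuristic only; it is not the CPLEX-based optimal approach cited in the talk, and the function name and sample sets are invented for illustration.

```python
def greedy_set_cover(universe, sets):
    """Return indices of chosen sets; greedy (1 + ln n)-approximation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set covering the most uncovered elements (first on ties).
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Hypothetical instance: individuals 1..7 and candidate groups.
candidates = [{1, 2, 3}, {4, 5}, {6, 7}, {3, 4}, {5, 6, 7}]
cover = greedy_set_cover(range(1, 8), candidates)
print([candidates[i] for i in cover])
```

In the sibling setting, the universe would be the sampled individuals and the candidate sets would be groups that satisfy the 2-allele rule.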
Are we done?
Challenges:
• No ground truth available
• Growing number of methods
• Biologists need (one) reliable reconstruction
• Genotyping errors
Answer: Consensus
"Consensus is what many people say in chorus but do not believe as individuals." Abba Eban (1915 - 2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990
Consensus Methods
• Combine multiple solutions to a problem to generate one unified solution, C: S* → S
• Based on Social Choice Theory
• Commonly used where the real solution is not known, e.g. phylogenetic trees
[Diagram: solutions S1, S2, ..., Sk → Consensus → S]
Error-Tolerant Approach (Sheikh et al. ’08)
[Diagram: Locus 1, Locus 2, Locus 3, ..., Locus l → Sibling Reconstruction Algorithm → solutions S1, S2, ..., Sk → Consensus → S]
Distance-based Consensus
[Diagram: solutions S1, S2, ..., Sk, consensus S and search result Ss, with quality function fq and distance function fd]
• Algorithm
  • Compute a consensus solution S = {g1, ..., gk}
  • Search for a good solution near S
• NP-hard for any fd, fq or an arbitrary linear combination (Sheikh et al. ’08)
A Greedy Approach - Algorithm
• Compute a strict consensus
• While total distance is not too large:
  • Merge two sibgroups with minimal (total) distance
Quality: fq = n - |C|
Distance function from solution C to C’:
  fd(C, C’) = sum of costs of merging groups in C to obtain C’
            = sum of costs of assigning individuals to groups
Cost of assigning an individual to a group:
  • Benefit: alleles and allele pairs shared
  • Cost: minimum edit distance
Auto Greedy Consensus
• Change costs to average per-locus costs
• Compare max group error on a per-locus basis
• Treat cost and benefit independently
• In order to qualify a merge:
  • Cost <= maxcost
  • Benefit >= minbenefit
  • Benefit = max benefit among possible merges
A Greedy Approach
• S1 = { {1,2,3}, {4,5}, {6,7} }
• S2 = { {1,2,3}, {4}, {5,6,7} }
• S3 = { {1,2}, {3,4,5}, {6,7} }
Strict consensus: S = { {1,2}, {3}, {4}, {5}, {6,7} }
After merging {3} with {6,7}: S = { {1,2}, {3,6,7}, {4}, {5} }
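The strict consensus in this example can be reproduced with a few lines: two individuals stay together only if every input solution puts them in the same group. This is a minimal sketch of that definition, with an illustrative function name, and it does not include the greedy merge step or its cost function.

```python
def strict_consensus(solutions):
    """Group individuals that share a group in every input solution."""
    individuals = sorted(set().union(*solutions[0]))
    groups = {}
    for ind in individuals:
        # Signature: index of the group containing `ind` in each solution.
        signature = tuple(
            next(i for i, g in enumerate(sol) if ind in g)
            for sol in solutions
        )
        groups.setdefault(signature, set()).add(ind)
    return sorted(groups.values(), key=min)


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3}, {4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
consensus = strict_consensus([S1, S2, S3])
print(consensus)  # [{1, 2}, {3}, {4}, {5}, {6, 7}]
```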
Testing and Validation: Protocol
• Get a dataset with known sibgroups (real or simulated)
• Find sibgroups using our algorithm
• Compare the solutions
  • Partition distance, Gusfield ’03 = assignment problem
• Compare to other sibship methods
  • Family Finder, COLONY
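Partition distance is the minimum number of individuals that must be removed so that two partitions become identical, which equals n minus the best one-to-one matching of groups by overlap. The sketch below solves that assignment by brute force over permutations, so it is only suitable for a handful of groups; a real evaluation would use a polynomial-time assignment solver. The function name and test partitions are illustrative.

```python
from itertools import permutations


def partition_distance(p, q):
    """Minimum number of elements to delete so partitions p and q agree."""
    n = sum(len(g) for g in p)
    k = max(len(p), len(q))
    # Pad with empty groups so the overlap matrix is square.
    p = list(p) + [set()] * (k - len(p))
    q = list(q) + [set()] * (k - len(q))
    overlap = [[len(a & b) for b in q] for a in p]
    # Assignment problem: maximize total overlap of matched groups.
    best = max(
        sum(overlap[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return n - best


true_groups = [{1, 2, 3}, {4, 5}, {6, 7}]
found_groups = [{1, 2, 3}, {4}, {5, 6, 7}]
print(partition_distance(true_groups, found_groups))  # 1 (remove individual 5)
```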
Test Data
• Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles
• Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles
• Ants (Leptothorax acervorum), Hammond et al., 2001: ants are a haplodiploid species. The data consists of 377 diploid worker ants
• Simulated populations of juveniles for a range of values of number of parents, offspring per parent, alleles per locus, number of loci, and the distributions of those
Experimental Protocol
• Generate F females and M males (F = M = 5, 10, 20)
  • Each with l loci (l = 2, 4, 6, 8, 10)
  • Each locus with a alleles (a = 10, 15)
• Generate f families (f = 5, 10, 20)
  • For each family select female + male uniformly at random
• For each parent pair generate o offspring (o = 5, 10)
  • For each offspring, for each locus, choose the allele outcome uniformly at random
• Introduce random errors
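A minimal sketch of this protocol follows. It assumes parental alleles are drawn uniformly from the a alleles at each locus and it leaves out the final error-injection step; the function and parameter names are illustrative, not taken from the authors' code.

```python
import random


def simulate(num_females=5, num_males=5, loci=4, alleles=10,
             families=5, offspring=5, seed=0):
    """Return (children, labels); a label is the (mother, father) index pair."""
    rng = random.Random(seed)

    def parent():
        # One genotype (pair of alleles) per locus, alleles numbered 1..alleles.
        return [(rng.randint(1, alleles), rng.randint(1, alleles))
                for _ in range(loci)]

    females = [parent() for _ in range(num_females)]
    males = [parent() for _ in range(num_males)]
    children, labels = [], []
    for _ in range(families):
        m = rng.randrange(num_females)  # mother chosen uniformly at random
        f = rng.randrange(num_males)    # father chosen uniformly at random
        for _ in range(offspring):
            # At each locus: one paternal and one maternal allele, each uniform.
            child = [(rng.choice(males[f][l]), rng.choice(females[m][l]))
                     for l in range(loci)]
            children.append(child)
            labels.append((m, f))
    return children, labels


children, labels = simulate()
print(len(children), len(set(labels)))
```

Labelling offspring by parent pair rather than by family number keeps the ground truth correct when two families happen to draw the same parents.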
Conclusions
• Combinatorial algorithms with minimal assumptions
• Behaves well on real and simulated data
• Better than others with few loci, few large families
• Error tolerant
• Useful, high demand
New and improved:
• Efficient implementation (Perez-Rathke et al., in submission)
• Other objectives, bio vs math (Ashley et al. ’10)
• Other genealogical relationships (Sheikh et al. ’09, ’10)
• Different combinatorial approach (Brown & B-W ’10)
• Pedigree amalgamation