Fitting Tree Metrics: Hierarchical Clustering and Phylogeny
Nir Ailon, Moses Charikar
Princeton University
Data with dissimilarity information
• Represented by matrix D
• Complete information
[Figure: complete graph on u, v, w, x, y with a dissimilarity on every edge, e.g. D(u,v)=1; big number = high dissimilarity]
Goal: Fit data to tree structure
• Preserve dissimilarity info
• Tree metric $d_T$ close to D
[Figure: tree T with leaves u, v, w, x, y; $d_T(u,v)$ is the length of the path between u and v in T]
Objective function
• Minimize: $\mathrm{cost}(T) = \|D - d_T\|_p$
• D and $d_T$ are treated as $\binom{n}{2}$-dimensional real vectors
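To make the objective concrete, here is a minimal sketch that evaluates the cost for two given matrices. The function name `lp_cost` and the toy matrices are illustrative assumptions, not part of the talk; both inputs are assumed symmetric with a zero diagonal, so only the unordered pairs are counted.

```python
from itertools import combinations

def lp_cost(D, d_T, p=1.0):
    """Return ||D - d_T||_p over the n-choose-2 unordered pairs.

    Pass p=float("inf") for the max norm.
    """
    n = len(D)
    diffs = [abs(D[u][v] - d_T[u][v]) for u, v in combinations(range(n), 2)]
    if p == float("inf"):
        return max(diffs, default=0.0)
    return sum(x ** p for x in diffs) ** (1.0 / p)

# Hypothetical 3-point example: D and a candidate tree metric d_T.
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
d_T = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
print(lp_cost(D, d_T, p=1))
print(lp_cost(D, d_T, p=float("inf")))
```

Only the pair (1, 2) disagrees in this example, so every norm reports the same single unit of error.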
Applications
• Evolutionary biology
• Molecular phylogeny: dissimilarity information from DNA
• Gene expression analysis
• Historical linguistics
• ...
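Special case: Ultrametrics (Hierarchical clustering)
[Figure: tree T with M=3 levels over leaves y, u, v, x, w, shown next to the equivalent nested clustering; $d_T(v,x)=1$, $d_T(u,w)=3$]
• Equivalently: the two largest distances in every triangle are equal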
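The triangle characterization can be checked directly. The sketch below is illustrative only (the name `is_ultrametric` is an assumption) and expects a symmetric matrix with a zero diagonal.

```python
from itertools import combinations

def is_ultrametric(D):
    """True if, in every triangle, the two largest distances are equal."""
    n = len(D)
    for u, v, w in combinations(range(n), 3):
        _, second, largest = sorted((D[u][v], D[u][w], D[v][w]))
        if second != largest:
            return False
    return True

print(is_ultrametric([[0, 1, 3], [1, 0, 3], [3, 3, 0]]))  # sides 1, 3, 3
print(is_ultrametric([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))  # sides 1, 2, 3
```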
Previous results
• Fitting ultrametrics under $\|\cdot\|_\infty$: in P [FKW95]
• Fitting trees under $\|\cdot\|_\infty$: APX-hard [ABFPT99]
• Fitting ultrametrics under $\|\cdot\|_1$: APX-hard [W93]; under $\|\cdot\|_2$: NP-hard
• An f(n)-approximation algorithm for ultrametrics implies a 3f(n)-approximation algorithm for trees (under any $\|\cdot\|_p$) [ABFPT99]
Previous results
• $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$-approximation for trees under $\|\cdot\|_p$ [HKM05]
• Fitting ultrametrics for M=2 under $\|\cdot\|_1$: Correlation Clustering [BBC02, CGW03, ACN05, ...]
• ...
Our results
• (M+1)-approximation for fitting level-M ultrametrics under $\|\cdot\|_1$
• $O((\log n \log\log n)^{1/p})$-approximation for general weighted trees under $\|\cdot\|_p$
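Reconstructing T from ultrametric D
• Given ultrametric $D \in \{1,\dots,M\}^{n \times n}$
• Pick pivot vertex u
• Recursively solve for neighbor-classes (vertices grouped by their distance to u)
[Figure: pivot u and its neighbor classes at distances 1, 2, 3; levels M=3 and M=2 marked]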
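The pivot recursion can be sketched as follows. This is one possible reading of the slide rather than the authors' code: the names `build_tree`, `leaves` and `induced_metric` are hypothetical, the pivot is simply the first vertex, and a tree node is encoded as a tuple `(height, left, right)` whose height is the distance between any leaf on the left and any leaf on the right.

```python
def build_tree(vertices, D):
    """Return a leaf (vertex id) or a node (height, left, right)."""
    u, rest = vertices[0], vertices[1:]  # pivot = first vertex (assumption)
    classes = {}
    for v in rest:
        classes.setdefault(D[u][v], []).append(v)  # neighbor classes of u
    node = u
    for dist in sorted(classes):
        # The class at distance `dist` hangs off the pivot's path at that height.
        node = (dist, node, build_tree(classes[dist], D))
    return node

def leaves(node):
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[1]) + leaves(node[2])

def induced_metric(node, n):
    """Distance of two leaves = height of the node where they first split."""
    d = [[0] * n for _ in range(n)]
    def fill(t):
        if isinstance(t, tuple):
            height, left, right = t
            for a in leaves(left):
                for b in leaves(right):
                    d[a][b] = d[b][a] = height
            fill(left)
            fill(right)
    fill(node)
    return d

# Hypothetical ultrametric with M = 3 on five points.
D = [[0, 1, 2, 3, 3],
     [1, 0, 2, 3, 3],
     [2, 2, 0, 3, 3],
     [3, 3, 3, 0, 1],
     [3, 3, 3, 1, 0]]
T = build_tree(list(range(5)), D)
print(T)
print(induced_metric(T, 5) == D)
```

The resulting tree is binary and may stack several nodes of equal height; contracting those would give the usual hierarchy without changing the induced distances.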
Minimizing $\|\cdot\|_1$ for inconsistent $D \in \{1,\dots,M\}^{n \times n}$
• Same algorithm!
• Pick pivot vertex u (uniformly at random)
• Freeze distances incident to u
• Fix inter-class distances
• Fix intra-class distances
• Recurse...
• Lemma: no cancellations
• Theorem: (M+1)-approximation
[Figure: worked example around pivot u in which inconsistent distances are crossed out and replaced; total cost contribution: 4]
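A minimal sketch of these steps, under stated assumptions: the slide does not spell out the two "fix" rules, so the code uses the ones forced by the triangle condition once the pivot's distances are frozen (a pair split across classes a < b is set to b; a pair inside class a is capped at a). The names `pivot_fit` and `l1_cost` and the `seed` parameter are hypothetical, and D is assumed symmetric with a zero diagonal.

```python
import random

def pivot_fit(D, seed=0):
    """Return a matrix W in which the two largest sides of every triangle agree."""
    rng = random.Random(seed)
    W = [row[:] for row in D]  # working copy; D itself is left untouched

    def solve(vertices):
        if len(vertices) <= 1:
            return
        u = rng.choice(vertices)  # pivot, uniformly at random
        # Distances incident to u are frozen; group the rest by that distance.
        classes = {}
        for v in vertices:
            if v != u:
                classes.setdefault(W[u][v], []).append(v)
        levels = sorted(classes)
        for i, a in enumerate(levels):
            # Inter-class: a pair split across classes a < b goes to distance b.
            for b in levels[i + 1:]:
                for v in classes[a]:
                    for w in classes[b]:
                        W[v][w] = W[w][v] = b
            # Intra-class: a pair inside class a is capped at distance a.
            for v in classes[a]:
                for w in classes[a]:
                    if v != w:
                        W[v][w] = min(W[v][w], a)
            solve(classes[a])  # recurse inside the class

    solve(list(range(len(D))))
    return W

def l1_cost(D, W):
    n = len(D)
    return sum(abs(D[u][v] - W[u][v]) for u in range(n) for v in range(u + 1, n))

# Hypothetical inconsistent input with M = 3.
D = [[0, 1, 3, 2], [1, 0, 2, 3], [3, 2, 0, 1], [2, 3, 1, 0]]
W = pivot_fit(D, seed=1)
print(W, l1_cost(D, W))
```

On an input that is already an ultrametric neither rule changes anything, so the input comes back unchanged at cost 0, which matches the "Same algorithm!" remark.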
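Proof idea
• Triangle (u, v, w) with side lengths $\Delta_1 \ge \Delta_2 \ge \Delta_3$ is violating if $\Delta_1 > \Delta_2 \ge \Delta_3$
• Optimal solution pays $\ge \Delta_1 - \Delta_2$
• Algorithm charging scheme: when one of u, v, w is chosen as pivot, the triangle is charged for the change to the opposite side
[Figure: violating triangle on u, v, w with the charge for each pivot choice: $\Delta_1 - \Delta_2$ when the side changed is $\Delta_1$ or $\Delta_2$, and $(\Delta_2 - \Delta_3) + (\Delta_1 - \Delta_2)$ when it is $\Delta_3$]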
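The per-triangle charges can be tabulated with a small sketch. It assumes the fixing rules implied by the triangle condition (two different frozen sides force the third to their maximum; two equal frozen sides cap the third), which is a reading of the slide and not a quoted rule. `pivot_charge` and `triangle_charges` are hypothetical names.

```python
def pivot_charge(inc1, inc2, opposite):
    """Change made to the side opposite the pivot.

    inc1, inc2: the pivot's (frozen) distances to the other two vertices.
    opposite:   the current distance between those two vertices.
    """
    if inc1 != inc2:
        fixed = max(inc1, inc2)       # two largest sides must be equal
    else:
        fixed = min(opposite, inc1)   # third side may not exceed the equal pair
    return abs(opposite - fixed)

def triangle_charges(d_uv, d_uw, d_vw):
    """Charge incurred on triangle (u, v, w) for each possible pivot."""
    return {
        "u": pivot_charge(d_uv, d_uw, d_vw),
        "v": pivot_charge(d_uv, d_vw, d_uw),
        "w": pivot_charge(d_uw, d_vw, d_uv),
    }

# Violating triangle with sides 3 > 2 >= 1: any solution pays at least 3 - 2 = 1.
print(triangle_charges(d_uv=1, d_uw=3, d_vw=2))
```

In this example two pivot choices are charged 1 and the third is charged 2 = (2 - 1) + (3 - 2).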
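General ultrametrics
• $D \in \mathbb{R}_+^{n \times n}$
• Fit D to weighted ultrametric
• M possible distances: $\delta_1 = L_1$, $\delta_2 = L_1 + L_2$, ..., $\delta_M = L_1 + \dots + L_M$
• Ex: $d_T(v,w) = L_1 + L_2$
[Figure: tree T with level weights $L_1, \dots, L_M$ (bottom to top) over leaves y, u, v, x, w]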
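The M possible distances are just running sums of the level weights. A minimal illustration, with a hypothetical function name and made-up weights:

```python
from itertools import accumulate

def level_distances(L):
    """Given level weights [L_1, ..., L_M], return [L_1, L_1 + L_2, ...]."""
    return list(accumulate(L))

# Hypothetical weights for M = 3 levels.
print(level_distances([1.0, 0.5, 2.0]))
```

Two leaves that first share a cluster at the second level are then at distance `level_distances(L)[1]`, i.e. L_1 + L_2, as in the slide's example.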
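Fitting D to M-level weighted ultrametric under $\|\cdot\|_1$
• Integer program formulation: $x^t_{uv} \in \{0,1\}$ (linear relaxation: $x^t_{uv} \in [0,1]$)
• $x^t_{uv} = 1 \iff$ u, v separated at level t
• $0 \le x^M_{uv} \le x^{M-1}_{uv} \le \dots \le x^1_{uv} = 1$
• Triangle inequality at each level: $x^t_{uv} \le x^t_{uw} + x^t_{wv}$
• Cost: $\min \sum_{t=1}^{M} L_t \Big( \sum_{D(u,v) \le \delta_t} x^t_{uv} + \sum_{D(u,v) > \delta_t} (1 - x^t_{uv}) \Big)$, where $\delta_t = L_1 + \dots + L_t$
[Figure: tree T with level weights $L_1, \dots, L_M$ over leaves y, u, v, x, w; example: $x^1_{uy} = 1$, $x^2_{uy} = 0$, $x^M_{uy} = 0$]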
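The constraints of the relaxation can be checked mechanically for a candidate solution. The sketch below is illustrative only: `is_feasible`, the tolerance `tol`, and the layout `x[t-1][u][v]` for the variable of level t are assumptions made for this example.

```python
def is_feasible(x, n, tol=1e-9):
    """Check the [0,1], monotonicity and per-level triangle constraints.

    x[t-1][u][v] holds the variable for level t = 1..M and the pair u != v.
    """
    for t in range(len(x)):
        for u in range(n):
            for v in range(n):
                if u == v:
                    continue
                val = x[t][u][v]
                if val < -tol or val > 1 + tol:
                    return False                 # outside [0, 1]
                if t == 0 and abs(val - 1) > tol:
                    return False                 # level 1 must equal 1
                if t > 0 and val > x[t - 1][u][v] + tol:
                    return False                 # must not grow with the level
                for w in range(n):
                    if w != u and w != v and val > x[t][u][w] + x[t][w][v] + tol:
                        return False             # triangle inequality at level t
    return True

# Hypothetical instance: n = 3, M = 2; at level 2 the clusters are {0, 1} and {2}.
level1 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
level2 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
print(is_feasible([level1, level2], 3))

# 0 and 2 separated while 0-1 and 1-2 are not: breaks the triangle inequality.
bad_level2 = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
print(is_feasible([level1, bad_level2], 3))
```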
Rounding the LP: An $O(\log n \log\log n)$-approximation
• A divisive (top-down) algorithm
• At each level t = M, M-1, ..., 1:
• Solve a multi-cut-like problem
• Cluster so as to separate pairs u, v such that $x^t_{uv} \ge 2/3$
• Danger: high levels influence low ones!
General $\|\cdot\|_p$ cost
• Similar analysis gives the same bound for $\|\cdot\|_p^p$
• Therefore: $O((\log n \log\log n)^{1/p})$-approximation
• By [ABFPT99], applies also to fitting trees
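The step from the bound on $\|\cdot\|_p^p$ to the stated ratio is a p-th root. Writing $\mathrm{OPT}_p$ for the optimal $\|\cdot\|_p$ cost and $c$ for the ratio obtained on the p-th power (both are notation introduced here, not taken from the slides):

```latex
\[
\|D - d_T\|_p^p \;\le\; c \cdot \mathrm{OPT}_p^{\,p}
\quad\Longrightarrow\quad
\|D - d_T\|_p \;\le\; c^{1/p} \cdot \mathrm{OPT}_p,
\qquad c = O(\log n \log\log n).
\]
```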
Future work
• $O(\log n)$-approximation algorithm? Better?
• Stronger lower bounds
• Derandomize the (M+1)-approximation algorithm
• Aggregation [ACN05]
• Applications
Thank you!