This talk explores building effective websites by solving the Constrained Subtree Selection (CSS) problem: choosing an access structure that keeps navigation natural while minimizing the expected cost of reaching information. It covers optimal solutions, weighted path costs, degree costs, and k-favorability in constraint-free graphs, along with complexity results, highlighted theorems, and related work on website optimization and coding problems.
Building Optimal Websites with the Constrained Subtree Selection Problem
Brent Heeringa (joint work with Micah Adler)
09 November 2004
A website design problem (for example: a new kitchen store) • Given products, their popularity, and their organization: how do we create a good website? • Navigation should be natural • Access to information should be timely [Figure: category DAG for the kitchen store — Knives, Maker, Type, Wüstof, Henkels, paring, chef, bread, steak — with leaf popularities 0.26, 0.33, 0.27, 0.14]
Good website: Natural Navigation • The organization is a DAG • The transitive closure (TC) of the DAG enumerates all viable categorical relationships and introduces shortcuts • A subgraph of the TC preserves the logical relationships between categories (a small closure sketch follows) [Figure: a constraint DAG, its transitive closure, and a subgraph of the TC]
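To make the "enumerates all viable categorical relationships" step concrete, here is a minimal sketch of computing the transitive closure of a small DAG by reachability; the adjacency structure below is a made-up stand-in, not the kitchen-store example from the talk.

```python
# Minimal sketch: transitive closure of a DAG via DFS reachability.
# The edge (u, v) is in the closure whenever v is reachable from u.
# The example graph below is hypothetical, not the talk's kitchen-store DAG.

def transitive_closure(dag):
    """dag: dict node -> list of children. Returns dict node -> set of reachable nodes."""
    closure = {}

    def reach(u):
        if u not in closure:
            closure[u] = set()
            for v in dag.get(u, []):
                closure[u].add(v)
                closure[u] |= reach(v)
        return closure[u]

    for node in dag:
        reach(node)
    return closure

if __name__ == "__main__":
    dag = {"root": ["a", "b"], "a": ["x", "y"], "b": ["y", "z"], "x": [], "y": [], "z": []}
    tc = transitive_closure(dag)
    print(tc["root"])  # {'a', 'b', 'x', 'y', 'z'} -- shortcuts root->x, root->y, root->z appear
```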
Good website: Timely Access to Info • Two obstacles to finding info quickly: time scanning a page for the correct link, and time descending the DAG • Associate a cost with each obstacle: page cost (a function of a node's out-degree) and path cost (the sum of page costs on a path) • A good access structure minimizes the expected path cost • The optimal subgraph is always a full tree (a small evaluator is sketched below) [Figure: a leaf with weight 1/2; page cost = # links; path cost = 3 + 2 = 5; weighted path cost = 5/2]
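As a concrete restatement of these definitions, here is a small sketch that evaluates page cost, path cost, and weighted path cost for an explicit tree; the tree, the degree cost δ(x) = x, and the leaf weights are illustrative assumptions chosen to reproduce the 3 + 2 = 5 and 5/2 numbers above, not data from the talk.

```python
# Sketch: page cost, path cost, and weighted (expected) path cost of a tree.
# Page cost of a node = delta(out-degree); path cost of a leaf = sum of page
# costs on its root-to-leaf path; the objective is the weight-averaged path cost.
# The example tree and weights below are illustrative, not from the talk.

def expected_path_cost(tree, weights, delta, root):
    """tree: dict node -> list of children (leaves map to []).
    weights: dict leaf -> probability. Returns the expected path cost."""
    total = 0.0

    def walk(node, cost_so_far):
        nonlocal total
        children = tree[node]
        if not children:                      # leaf: its path cost is now fixed
            total += weights[node] * cost_so_far
            return
        page_cost = delta(len(children))      # cost of scanning this page
        for child in children:
            walk(child, cost_so_far + page_cost)

    walk(root, 0.0)
    return total

if __name__ == "__main__":
    delta = lambda d: d                       # delta(x) = x: page cost = number of links
    # The root page has 3 links; one child page has 2 links; leaf "a" sits below both.
    tree = {"root": ["a_page", "b", "c"], "a_page": ["a", "d"],
            "a": [], "b": [], "c": [], "d": []}
    weights = {"a": 0.5, "b": 0.2, "c": 0.2, "d": 0.1}
    # Leaf "a": path cost 3 + 2 = 5; its weighted contribution is 5 * 1/2 = 5/2.
    print(expected_path_cost(tree, weights, delta, "root"))
```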
Constrained Subtree Selection (CSS) • An instance of CSS is a triple (G, δ, w) • G is a rooted DAG with n leaves (the constraint graph) • δ is a function of the out-degree of each internal node (the degree cost) • w is a probability distribution over the n leaves (the weights) • A solution is any directed subtree of the transitive closure of G that includes the root and all the leaves • An optimal solution is one that minimizes the expected path cost
[Example, δ(x) = x, four leaves of weight 1/4 each: one solution has leaf path costs 3, 5, 5, 3, giving expected cost 3(1/4) + 5(1/4) + 5(1/4) + 3(1/4) = 4]
[Example, δ(x) = x, leaf weights 1/2, 1/6, 1/6, 1/6: placing the heavy leaf at path cost 2 and the other three at path cost 5 gives expected cost 2(1/2) + 5(1/6) + 5(1/6) + 5(1/6) = 3 1/2]
(A brute-force check of these two examples appears below.)
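Since the constraint graph in these two examples imposes no real restriction, a brute-force search over all full trees on the leaf set can confirm the optimal costs. The sketch below is an exhaustive reference computation under the assumption of a constraint-free instance; it is not the dynamic-programming algorithm described later in the talk.

```python
# Sketch: exhaustive optimal expected path cost for a *constraint-free* CSS
# instance (every full tree over the leaves is allowed). Each internal node
# with out-degree k contributes delta(k) times the weight below it, so
#   OPT(S) = min over partitions of S into k >= 2 blocks of
#            delta(k) * weight(S) + sum of OPT(block).
# Only practical for a handful of leaves; a sanity check, not the talk's DP.
from functools import lru_cache
from fractions import Fraction

def set_partitions(items):
    """Yield all partitions of `items` (a tuple) into unordered nonempty blocks."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        for i in range(len(partial)):            # put `first` into an existing block
            yield partial[:i] + [(first,) + partial[i]] + partial[i + 1:]
        yield [(first,)] + partial               # or into a block of its own

def optimal_cost(weights, delta):
    """weights: dict leaf -> weight. Returns the minimum expected path cost."""
    leaves = tuple(sorted(weights))

    @lru_cache(maxsize=None)
    def opt(subset):
        if len(subset) == 1:
            return Fraction(0)
        mass = sum(weights[x] for x in subset)
        best = None
        for blocks in set_partitions(subset):
            if len(blocks) < 2:                  # root must have out-degree >= 2
                continue
            cost = delta(len(blocks)) * mass + sum(opt(tuple(sorted(b))) for b in blocks)
            best = cost if best is None or cost < best else best
        return best

    return opt(leaves)

if __name__ == "__main__":
    delta = lambda k: Fraction(k)                # linear degree cost
    uniform = {c: Fraction(1, 4) for c in "abcd"}
    skewed = {"a": Fraction(1, 2), "b": Fraction(1, 6), "c": Fraction(1, 6), "d": Fraction(1, 6)}
    print(optimal_cost(uniform, delta))          # 4
    print(optimal_cost(skewed, delta))           # 7/2
```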
Constraint-Free Graphs and k-favorability • Constraint-free graph: every directed full tree with n leaves is a subtree of the TC, so CSS is no longer constrained by the graph • k-favorable degree cost: δ is k-favorable if there exists a k > 1 such that, for any constraint-free instance of CSS under δ, some optimal tree has maximal out-degree k
Linear Degree Cost: δ(x) = x [Figure: candidate trees compared by path cost — one with 5 paths of cost 5, one with 3 paths of cost 5 and 2 paths of cost 4]
Linear Degree Cost: δ(x) = x • Prefer a binary structure when one leaf has at least half the mass • Prefer a ternary structure when the mass is uniformly distributed • CSS with a 2-favorable degree cost and a constraint-free graph is the Huffman coding problem (see the sketch below) • Examples of such degree costs: quadratic, exponential, ceiling of log
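When δ is 2-favorable and the graph is constraint-free, some optimal tree is binary, and in a full binary tree every internal node has out-degree exactly 2, so the cost is δ(2) times the Huffman weighted depth. Under that reading of the slide, a minimal sketch is shown below; the heapq-based construction is standard Huffman coding, not code from the talk.

```python
# Sketch: constraint-free CSS with a 2-favorable degree cost, assuming an
# optimal tree is binary. In a full binary tree every internal node has
# out-degree 2, so the expected path cost is delta(2) * sum(w_i * depth_i),
# and Huffman's algorithm minimizes the weighted depth.
import heapq
from fractions import Fraction

def huffman_weighted_depth(weights):
    """weights: list of leaf weights (len >= 2). Returns sum(w_i * depth_i)."""
    heap = list(weights)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b            # every merge adds one level above mass a + b
        heapq.heappush(heap, a + b)
    return total

def css_cost_2favorable(weights, delta):
    return delta(2) * huffman_weighted_depth(weights)

if __name__ == "__main__":
    delta = lambda k: Fraction(k * k)        # quadratic degree cost (listed on the slide)
    skewed = [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]
    print(css_cost_2favorable(skewed, delta))  # delta(2) times the weighted Huffman depth
```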
Results • Complexity: NP-complete for equal weights and many degree costs δ • A sufficient condition on δ • Hardness depends on the constraint graph • Highlighted results: • Theorem: an O(n^(δ(k)+k))-time DP algorithm when δ is integer-valued and k-favorable and G is constraint-free (e.g., δ(x) = x) • Theorem: a poly-time constant-factor approximation when δ ≥ 1 is k-favorable and G has constant out-degree (cf. Approximate Hotlink Assignment [Kranakis et al.]) • Other results: characterizations of optimal trees for uniform probability distributions
Related Work • Adaptive Websites [Perkowitz & Etzioni]: a challenge to the AI community; novel views of websites (the page synthesis problem) • Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]: add one hotlink per page to minimize the expected distance from the root to the leaves; more recently, pages have cost proportional to their size, but hotlinks don't change page cost • Optimal Prefix-Free Codes [Golin & Rote]: minimum-cost code for n words over r symbols where symbol a_i has cost c_i; resembles CSS without a constraint graph
Exact Cover by 3-Sets (X3C) • INPUT: (X, C) where X = (x_1, …, x_n) with n = 3k, and C = (C_1, …, C_m) with each C_i ⊆ X and |C_i| = 3 • OUTPUT: C' ⊆ C with |C'| = k that covers X • QUESTION: given k and (X, C), is there a cover of size k? • Sufficient condition on δ: for every integer k, there exists an integer s(k) such that …
Lopsided Trees • Recall: δ(x) = x and G is constraint-free • Node level = path cost • Adding an edge increases the level • Grow lopsided trees level by level
Lopsided Trees • We know the exact cost of the tree up to the current level i: the exact cost of the m leaves already placed, plus the remaining n−m leaves, each of which must have path cost at least i
Lopsided Trees • Exact cost of leaf C: 3 × (1/3) = 1 • Remaining mass charged up to level 4: (2/3) × 4 = 8/3 • Total: 1 + 8/3 = 11/3
Lopsided Trees • Tree cost at level 5 in terms of tree cost at level 4: add in the mass of the remaining leaves • Cost at level 5 (no new leaves): 11/3 + 2/3 = 13/3 (see the sketch below)
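The level-by-level accounting above can be written as a tiny helper: the running cost at level i is the exact cost of the finished leaves plus the unfinished mass charged i, and advancing one level simply adds the unfinished mass. The inputs below (one finished leaf of mass 1/3 at level 3, unfinished mass 2/3) are my reading of the figure and reproduce the 11/3 and 13/3 values.

```python
# Sketch: running cost of a lopsided tree grown level by level
# (delta(x) = x, so a node's level equals its path cost). Finished leaves
# contribute their exact path cost; leaves not yet placed must have path
# cost at least the current level, so they are charged the level for now.
from fractions import Fraction

def cost_at_level(finished, unfinished_mass, level):
    """finished: list of (weight, path_cost) pairs for leaves already placed."""
    exact = sum(w * c for w, c in finished)
    return exact + unfinished_mass * level

if __name__ == "__main__":
    finished = [(Fraction(1, 3), 3)]          # leaf C: weight 1/3 at path cost 3
    remaining = Fraction(2, 3)                # mass of leaves not yet placed
    c4 = cost_at_level(finished, remaining, 4)
    c5 = cost_at_level(finished, remaining, 5)
    print(c4)                                 # 11/3
    print(c5, c4 + remaining)                 # 13/3 13/3 -- one level adds the remaining mass
```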
Lopsided Trees • Equality on trees: equal number of leaves at or above the frontier, and equal number of leaves at each relative level below the frontier • Nodes have out-degree ≤ 3, so a node sits at most δ(3) = 3 levels below the frontier • Signature: (m; l_1, l_2, l_3) • Example signature (2; 3, 2, 0): 2 — C and F are leaves; 3 — G, H, I are 1 level past the frontier; 2 — J and K are 2 levels past the frontier
Inductive Definition • Let CSS(m, l_1, l_2, l_3) = the minimum cost of a tree with signature (m; l_1, l_2, l_3) • Can we define CSS(m, l_1, l_2, l_3) in terms of optimal substructures? • Which trees, when grown by one level, have signature (m; l_1, l_2, l_3)? • Which signatures (m'; l'_1, l'_2, l'_3) lead to (m; l_1, l_2, l_3)?
The Other Direction • Growing a tree only affects the frontier, so only l_1 affects the next level • Choose which frontier nodes become leaves; the remaining nodes are internal • Of the internal ones, choose which have degree 2 (d_2); the rest have degree 3 (d_3) • O(n^2) choices in total [Figure: growing a tree with signature (0; 2, 0, 0) into one with signature (1; 0, 0, 3)]
The Original Question (warning: here be symbols) • Which signatures (m'; l'_1, l'_2, l'_3) lead to (m; l_1, l_2, l_3)? • Knowing l'_1 and d_2 is sufficient, and both are O(n), so there are O(n^2) possibilities for (m'; l'_1, l'_2, l'_3) • CSS(m, l_1, l_2, l_3) = min over 1 ≤ d_2 ≤ l'_1 ≤ n of CSS(m', l'_1, l'_2, l'_3) + c_{m'}, where c_{m'} is the sum of the smallest n − m' weights • CSS(n, 0, 0, 0) = cost of the optimal tree • Analysis: table size O(n^4), each cell takes O(n^2) lookups, giving an O(n^6) algorithm (a restatement follows)
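For readability, the recurrence and the running-time count can be written out as follows; this is my transcription of the pieces on this slide (the predecessor signatures, the c_{m'} charge, and the table dimensions), not a formula copied from the paper.

```latex
% Transcription of the recurrence and running-time count from this slide.
\[
  \mathrm{CSS}(m, l_1, l_2, l_3)
    = \min_{1 \le d_2 \le l'_1 \le n}
      \Bigl[\, \mathrm{CSS}(m', l'_1, l'_2, l'_3) + c_{m'} \,\Bigr]
\]
% The predecessor signature $(m'; l'_1, l'_2, l'_3)$ is fixed by how many of the
% $l'_1$ frontier nodes become leaves, degree-2 nodes ($d_2$ of them), or degree-3
% nodes; $c_{m'}$ is the sum of the smallest $n - m'$ leaf weights.  The answer is
% $\mathrm{CSS}(n, 0, 0, 0)$.  Each of $m, l_1, l_2, l_3$ is at most $n$, so the
% table has $O(n^4)$ cells, each minimizing over $O(n^2)$ choices of $(l'_1, d_2)$,
% for an $O(n^6)$ algorithm overall.
```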
Lower Bound on Cost • Lemma: H(w)/log(k) is a lower bound on the cost of an optimal tree, for any k-favorable degree cost δ with δ ≥ 1 and G constraint-free [Figure: a tree T and a tree T' with every page cost replaced by 1; c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k), the last step by Shannon]
A Simple Lemma • Lemma 2: for any tree with m weighted nodes, there exists one node (the splitter) which, when removed, divides the tree into subtrees each with at most half the weight of the original tree (a search sketch follows) [Figure: a splitter node whose removal leaves components of weight < 1/2 each]
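A standard way to find such a splitter is a weighted-centroid walk: compute subtree weights, then descend from the root into any child whose subtree carries more than half the total weight. The sketch below is one such implementation under my own tree representation; it is not code from the talk.

```python
# Sketch: find a "splitter" node (Lemma 2) in a node-weighted rooted tree.
# Removing the returned node leaves components (its child subtrees and the
# part containing the root) that each carry at most half of the total weight.

def find_splitter(children, weight, root):
    """children: dict node -> list of children; weight: dict node -> weight."""
    subtree = {}

    def fill(u):                               # total weight in the subtree rooted at u
        subtree[u] = weight.get(u, 0) + sum(fill(v) for v in children.get(u, []))
        return subtree[u]

    total = fill(root)
    node = root
    while True:                                # descend while some child subtree is too heavy
        heavy = [v for v in children.get(node, []) if subtree[v] > total / 2]
        if not heavy:
            return node                        # all remaining components weigh <= total/2
        node = heavy[0]

if __name__ == "__main__":
    # Hypothetical example: a weight-heavy branch, so the splitter is not the root.
    children = {"r": ["a"], "a": ["b", "c"], "b": [], "c": []}
    weight = {"r": 1, "a": 1, "b": 5, "c": 1}
    print(find_splitter(children, weight, "r"))  # "b"
```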
Approximation Algorithm • Let G be a DAG where the out-degree of every node is at most d • Choose a spanning tree T from G • Balance-Tree(T): find a splitter node in T (Lemma 2); stop if the splitter is a child of the root; otherwise disconnect the splitter and reconnect it to the root (so the root has degree at most d+1); then call Balance-Tree on all subtrees [Figure: reattaching the splitter to the root]
Approximation Algorithm • Analysis: the mass under any node is at most half the mass under its grandparent, so the path length to a leaf with weight w_i is at most -2·log(w_i) • Theorem: an O(m)-time, O(log(k)·δ(d+1))-approximation to the optimal solution, for any DAG G with m nodes and out-degree d, and for every k-favorable degree cost δ ≥ 1 (a sketch of how the bound assembles follows)
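One way to assemble the stated ratio from the pieces on this slide and the lower-bound lemma is sketched below; the constant factors are my own bookkeeping and only the asymptotic form O(log(k)·δ(d+1)) is taken from the slide.

```latex
% Sketch of the approximation ratio (my bookkeeping; only the O(.) form is from the slide).
% Every node in the output tree has out-degree at most d+1, so each page cost is
% at most \delta(d+1); the path to a leaf of weight w_i has length at most -2\log_2 w_i.
\[
  \mathrm{cost}(T_{\mathrm{alg}})
    \;\le\; \sum_i w_i \,\delta(d+1)\,\bigl(-2\log_2 w_i\bigr)
    \;=\; 2\,\delta(d+1)\,H(w).
\]
% Combining with the lower bound H(w)/\log_2 k on the optimum gives
\[
  \frac{\mathrm{cost}(T_{\mathrm{alg}})}{\mathrm{cost}(T_{\mathrm{opt}})}
    \;\le\; \frac{2\,\delta(d+1)\,H(w)}{H(w)/\log_2 k}
    \;=\; 2\,\delta(d+1)\,\log_2 k
    \;=\; O\bigl(\log(k)\,\delta(d+1)\bigr).
\]
```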
Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights) • Question: is there a poly-time algorithm for CSS with constraint-free graphs, equal leaf weights, and an increasing degree cost? • Good news: characterizations exist for linear and log degree costs, and there are near-linear-time algorithms for r-ary Varn codes (Huffman codes with r unequal letter costs and a uniform probability distribution)
Varn Codes (infinite lopsided tree) • Symbol costs = (3, 3, 3, 8, 8) • 5 leaves • Note: not the 5 highest leaves!
Varn Codes (infinite lopsided tree) • Symbol costs = (3, 3, 3, 8, 8) • 6 leaves • Note: the m internal nodes are the highest m nodes in the infinite tree
Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights) • Bad news: there is no notion of an infinite lopsided tree in CSS — a degree change is a structure change • The optimal CSS tree is fairly balanced • Property: no leaf may appear above the level of any internal node • Proof sketch: if it were the case, we could switch branches and decrease the cost of the tree • Intuition: there is some k which optimizes the breadth-to-depth tradeoff, and the optimal tree repeats this structure; only the fringe requires some computation time
Proposed Problem 2 (Dynamic CSS) • CSS often applies to environments that are inherently dynamic: web pages change popularity, and access patterns change on file systems • Question: given a CSS tree with property P, how much time does it take to maintain P after an update? • P = minimum cost, or an approximation ratio of the minimum cost • Restrict attention to integer leaf weights (rational distributions) and unit updates
Proposed Problem 2 (Dynamic CSS) • Good news: Knuth (and later Vitter) studied Dynamic Huffman Codes (DHC) • Motivation: one-pass encoding • Protocol: both parties maintain an optimal tree for the first t characters, encode and decode the (t+1)st character, then update the tree • Optimality of the tree is maintained in time proportional to encoding
DHC: Sibling Property • A binary tree with n leaves is a Huffman tree iff: the n leaves have nonnegative weights w_1 … w_n; the weight of each internal node is the sum of the weights of its children; and the nodes can be numbered in non-decreasing order by weight so that siblings are numbered consecutively and their common parent has a higher number • The numbering corresponds to the order of merges in the greedy algorithm • What happens if we increase the weight of B? Node 4 violates the sibling property • Fix, before updating: exchange the current node with the highest-numbered node having the same weight — this corresponds to a different, but still optimal, greedy choice when merging nodes • Now it is safe to increase B, because its weight cannot become greater than that of the next-highest-numbered node (a checker for the sibling property is sketched below) [Figure: a Huffman tree over leaves A–F with node numbers 1–11 and weights, shown before and after the exchange]
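To make the definition concrete, here is a small checker for the sibling property over an explicitly numbered tree; the representation (an array indexed by node number, with parent pointers) is my own choice for illustration, not the structure Knuth or Vitter use.

```python
# Sketch: check the sibling property of a numbered binary tree.
# Representation (illustrative): nodes are numbered 1..2n-1; for each node we
# store its weight and its parent's number (None for the root).  Assuming each
# internal weight already equals the sum of its children's weights, the
# property holds iff weights are non-decreasing in the numbering, siblings
# have consecutive numbers, and each parent's number exceeds its children's.

def has_sibling_property(weight, parent):
    """weight, parent: dicts keyed by node number (1..2n-1)."""
    nums = sorted(weight)
    # 1. Weights non-decreasing in the numbering.
    if any(weight[a] > weight[b] for a, b in zip(nums, nums[1:])):
        return False
    # 2. Siblings consecutive, and parent numbered higher than both children.
    kids = {}
    for v, p in parent.items():
        if p is not None:
            kids.setdefault(p, []).append(v)
    for p, cs in kids.items():
        if len(cs) != 2:                       # binary tree: exactly two children
            return False
        lo, hi = sorted(cs)
        if hi != lo + 1 or p <= hi:
            return False
    return True

if __name__ == "__main__":
    # Hypothetical 3-leaf example: leaves 1, 2, 3 (weights 1, 1, 2),
    # internal node 4 = parent of 1 and 2 (weight 2), root 5 (weight 4).
    weight = {1: 1, 2: 1, 3: 2, 4: 2, 5: 4}
    parent = {1: 4, 2: 4, 3: 5, 4: 5, 5: None}
    print(has_sibling_property(weight, parent))   # True
    weight[2] = 3                                 # bump a leaf without re-numbering
    print(has_sibling_property(weight, parent))   # False: weights no longer sorted
```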
Proposed Problem 2 (Dynamic CSS) • Good news: DHC generalizes to k-ary alphabets • Claim: DHC is an O(δ(k))-approximation for CSS when δ is k-favorable, δ(x) ≥ 1, and the graph is constraint-free