Dynamic Structures for Top-k Queries on Uncertain Data

Jiang Chen (Columbia University), Ke Yi (HKUST)

Motivation
• Uncertain data arises naturally in many applications: sensor data, fuzzy data integration, data cleaning, etc.
• Items are associated with a “confidence”
  • may or may not be true
  • may or may not exist
• Very hot topic in the database community

Motivation
• The top-k answer depends on the interplay between score and confidence
• Examples of (score, confidence) pairs: (sensor reading, reliability), (page rank, how well the page matches the query)

Problem Definition [Soliman et al. 07]
• The top-k answer: the k items with the maximum probability of being the top-k
• Example with k = 2:
  • {t3, t5}: 0.2*0.8 = 0.16
  • {t3, t4}: 0.2*(1-0.8)*0.9 = 0.036
  • {t5, t4}: (1-0.2)*0.8*0.9 = 0.576
  • ...
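
The arithmetic on this slide can be reproduced with a small sketch. It assumes the score order t3 > t5 > t4 with probabilities 0.2, 0.8, 0.9, which is what the three products imply; the function name `prob_is_topk` and the 0-based indexing are illustrative choices, not part of the talk.

```python
def prob_is_topk(probs, chosen):
    """Probability that exactly the items in `chosen` form the top-k answer.

    probs  -- existence probabilities, listed in decreasing score order
    chosen -- set of 0-based positions in that order
    Every item ranked at or above the lowest chosen item must appear if it
    is chosen and must be absent otherwise; lower-ranked items do not matter.
    """
    last = max(chosen)
    result = 1.0
    for idx in range(last + 1):
        p = probs[idx]
        result *= p if idx in chosen else 1.0 - p
    return result


# Assumed score order: t3, t5, t4 (positions 0, 1, 2).
probs = [0.2, 0.8, 0.9]
for name, chosen in [("{t3, t5}", {0, 1}), ("{t3, t4}", {0, 2}), ("{t5, t4}", {1, 2})]:
    print(name, prob_is_topk(probs, chosen))
```

Up to floating-point rounding, the three printed values match the 0.16, 0.036 and 0.576 on the slide.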

One-Time Computation
• Assume items are already sorted by score
  t1   t2   t3   t4   t5   t6   t7   t8   ...
  0.2  0.8  0.7  0.2  0.1  1    0.1  0.8  ...
• {t2, t5} being the top-2 ⇔ t2, t5 appearing and t1, t3, t4 not appearing
• Consider the i-th item ti:
  • Question: Among t1, ..., ti, which k items have the maximum prob. of appearing while the rest do not appear?
  • Answer: The k items with the largest prob.
• Just need to answer the question for all i
• Time: O(n log k)
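
A minimal sketch of this one-pass scan follows. It keeps the k largest probabilities seen so far in a min-heap and, for each prefix t1, ..., ti, evaluates the probability that those k items appear while the rest of the prefix does not. The function name `u_topk` is hypothetical. For simplicity the sketch recomputes the product over the heap at every step, so it runs in O(nk) rather than the O(n log k) stated on the slide.

```python
import heapq
from math import prod


def u_topk(probs, k):
    """Scan items in decreasing score order and return (best_prob, positions).

    probs -- existence probabilities in decreasing score order
    """
    heap = []        # min-heap of (p, position): the k largest probs so far
    out_prod = 1.0   # product of (1 - p) over prefix items not in the heap
    best_prob, best_items = 0.0, []
    for i, p in enumerate(probs):
        if len(heap) < k:
            heapq.heappush(heap, (p, i))
        elif p > heap[0][0]:
            # New item displaces the smallest kept probability.
            q, _ = heapq.heapreplace(heap, (p, i))
            out_prod *= 1.0 - q
        else:
            out_prod *= 1.0 - p
        if len(heap) == k:
            cand = prod(q for q, _ in heap) * out_prod
            if cand > best_prob:
                best_prob = cand
                best_items = sorted(pos for _, pos in heap)
    return best_prob, best_items


# The eight probabilities listed on the slide, k = 2.
print(u_topk([0.2, 0.8, 0.7, 0.2, 0.1, 1, 0.1, 0.8], 2))
```

On the slide's eight items with k = 2, the scan selects positions 1 and 2 (items t2 and t3) with probability (1-0.2)*0.8*0.7 = 0.448.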

The Data Structure Problem
• Build a data structure, such that:
  • Query
    • Given j, return the top-j answer
  • Update
    • Insert an item
    • Delete an item
    • Update the probability of an item
  • Construction

Our Results
• A data structure of size O(n)
• Query: O(log n + j)
  • Given j, return the top-j answer, j = 1,...,k
• Update: O(k log n) (better than paper)
  • Insert an item
  • Delete an item
  • Update the probability of an item
• Construction: O(n log k) (better than paper)

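Overall Structure
• $\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \varphi_{j'}^v \, \rho_{j-j'}^w\}$, $j = 1, \dots, k$
• [Figure: node u with children v and w. $\rho_j^u$: top-j prob. at u; $\varphi_{j'}^v$: prob. for the j' largest-prob. items in v; $\rho_{j-j'}^w$: top-(j-j') prob. in w]
• Top-j query: O(log n + j)
• Each leaf has k ~ 2k items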
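
The recurrence can be evaluated directly with a double loop, which makes the roles of the three arrays concrete. This is a naive O(k^2) sketch, not the O(k) update the talk goes on to describe. The function name `merge_naive` and the array layout (index 0 of the rho arrays unused) are assumptions made for illustration.

```python
def merge_naive(rho_v, phi_v, rho_w, k):
    """Combine the children's tables into rho_u by trying every split j'.

    rho_v[j], rho_w[j] -- top-j probabilities in the subtrees of v and w,
                          for j = 1..k (index 0 unused)
    phi_v[jp]          -- probability that the jp largest-probability items
                          of v appear and the rest of v do not, jp = 0..k-1
    """
    rho_u = [0.0] * (k + 1)
    for j in range(1, k + 1):
        best = rho_v[j]  # the whole top-j answer lies inside v
        for jp in range(j):
            # jp items come from v, the remaining j - jp from w
            best = max(best, phi_v[jp] * rho_w[j - jp])
        rho_u[j] = best
    return rho_u


# Toy instance (real leaves hold k ~ 2k items): v holds one item with
# probability 0.2, w holds two lower-scored items with 0.8 and 0.9, k = 2.
rho_v = [None, 0.2, 0.0]    # v has only one item, so its top-2 prob. is 0
phi_v = [0.8, 0.2]          # no item of v appears / its single item appears
rho_w = [None, 0.8, 0.72]
print(merge_naive(rho_v, phi_v, rho_w, 2))
```

On this toy instance the top-2 entry is 0.8 * 0.72 = 0.576, obtained by taking no item from v and both items from w.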

Update an Internal Node
• $\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \varphi_{j'}^v \, \rho_{j-j'}^w\}$, $j = 1, \dots, k$
• Monotone: the last item of the top-(j+1) answer can’t be in front of the last item of the top-j answer

Total Monotonicity
• A matrix is totally monotone if all its sub-matrices are monotone
• Enough to check all 2x2 sub-matrices: A > B ⇒ C > D
• For a k*k totally monotone matrix, the SMAWK algorithm [Aggarwal et al. 87] can find all row maxima in time O(k).
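
The sketch below checks the 2x2 condition and finds all row maxima of a monotone matrix. Two assumptions are made. First, the convention used is one common form for row maxima: for rows i < i' and columns c < c', M[i][c] < M[i][c'] implies M[i'][c] < M[i'][c']. The slide's A, B, C, D layout may correspond to a different orientation. Second, the row-maxima routine is a simpler divide-and-conquer, not SMAWK. It runs in O((rows + cols) log rows) and relies only on the leftmost row-maximum column being non-decreasing from top to bottom.

```python
from itertools import combinations


def is_totally_monotone(M):
    """Check every 2x2 sub-matrix under the assumed convention."""
    rows, cols = len(M), len(M[0])
    for i, i2 in combinations(range(rows), 2):
        for c, c2 in combinations(range(cols), 2):
            if M[i][c] < M[i][c2] and not (M[i2][c] < M[i2][c2]):
                return False
    return True


def row_argmax_monotone(M):
    """Leftmost argmax of every row of a monotone matrix (divide and conquer)."""
    rows, cols = len(M), len(M[0])
    result = [0] * rows

    def solve(r_lo, r_hi, c_lo, c_hi):
        if r_lo > r_hi:
            return
        mid = (r_lo + r_hi) // 2
        best = c_lo
        for c in range(c_lo, c_hi + 1):
            if M[mid][c] > M[mid][best]:  # strict: keeps the leftmost maximum
                best = c
        result[mid] = best
        solve(r_lo, mid - 1, c_lo, best)   # rows above look at or left of best
        solve(mid + 1, r_hi, best, c_hi)   # rows below look at or right of best

    solve(0, rows - 1, 0, cols - 1)
    return result


M = [[3, 2, 1],
     [2, 3, 1],
     [1, 2, 3]]
print(is_totally_monotone(M), row_argmax_monotone(M))
```

SMAWK removes the logarithmic factor by also discarding columns that can never hold a row maximum, which is what gives the O(k) bound quoted on the slide.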

Total Monotonicity
• Lemma: The matrix $(\varphi_{j'}^v \, \rho_{j-j'}^w)$ is totally monotone.
• An internal node can be updated in time O(k).

Update (Recompute) a Leaf
• Goal: Compute $\rho_j$, j = 1,…,n, where n = Θ(k)
• Define $\varphi_{j,i} = p(e_{1,i}) \cdot p(e_{2,i}) \cdots p(e_{j,i}) \cdot (1-p(e_{j+1,i})) \cdot (1-p(e_{j+2,i})) \cdots (1-p(e_{i,i}))$, where $e_{1,i}, \dots, e_{i,i}$ are the first i items sorted by decreasing probability
• $\rho_j = \max_{1 \le i \le n} \varphi_{j,i}$
• Compute the row maxima for the matrix $(\varphi_{j,i})_{k \times n}$!
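
As a reference point, the definitions on this slide can be evaluated by brute force: sort each prefix, form every φ entry, and take the row maxima. The sketch is far slower than the O(k log k) leaf update the talk derives and is only meant to pin down what the matrix entries are. The function name `leaf_rho_bruteforce` is hypothetical, and entries with j > i are skipped because φ is undefined there.

```python
from math import prod


def leaf_rho_bruteforce(probs, k):
    """Return rho[1..k] for one leaf (index 0 unused).

    probs -- existence probabilities of the leaf's items, in decreasing
             score order
    """
    n = len(probs)
    rho = [0.0] * (k + 1)
    for i in range(1, n + 1):
        # e_{1,i}, ..., e_{i,i}: the first i items by decreasing probability
        ranked = sorted(probs[:i], reverse=True)
        for j in range(1, min(i, k) + 1):
            # phi_{j,i}: the j most probable items appear, the other i - j do not
            phi = prod(ranked[:j]) * prod(1.0 - p for p in ranked[j:])
            rho[j] = max(rho[j], phi)
    return rho


print(leaf_rho_bruteforce([0.2, 0.8, 0.9], 2))
```

For the three-item example with probabilities 0.2, 0.8, 0.9, this gives a top-1 probability of 0.8 * 0.8 = 0.64 and a top-2 probability of 0.8 * 0.8 * 0.9 = 0.576.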

Total Monotonicity, Again
• Lemma: The matrix $(\varphi_{j,i})_{k \times n}$ is totally monotone.
• Are we done yet?
• The SMAWK algorithm probes O(k) entries in the matrix $(\varphi_{j,i})_{k \times n}$, but we still need to retrieve $\varphi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1-p(e_{j+1,i})) \cdots (1-p(e_{i,i}))$ on demand

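Retrieve $\varphi_{j,i}$
• Rewrite
  $\varphi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1-p(e_{j+1,i})) \cdots (1-p(e_{i,i}))$
  $= \frac{p(e_{1,i})}{1-p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1-p(e_{j,i})} \cdot (1-p(e_{1,i})) \cdots (1-p(e_{i,i}))$
  $= \frac{p(e_{1,i})}{1-p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1-p(e_{j,i})} \cdot (1-p(t_1)) \cdots (1-p(t_i))$
• The products $(1-p(t_1)) \cdots (1-p(t_i))$ can be pre-computed for all i in time O(k)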
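
The rewriting can be checked numerically. The sketch below compares the direct product with the factored form, using a prefix array for the products of (1 - p(t)). It assumes every probability is strictly below 1 so that the ratios p/(1-p) are finite. All function names are illustrative, and the sort inside `phi_factored` stands in for the lookup that the talk delegates to a persistent structure.

```python
from math import prod


def absent_prefix_products(probs):
    """prefix[i] = (1 - p(t_1)) * ... * (1 - p(t_i)); built in one pass."""
    prefix = [1.0]
    for p in probs:
        prefix.append(prefix[-1] * (1.0 - p))
    return prefix


def phi_direct(probs, j, i):
    """phi_{j,i} straight from its definition."""
    ranked = sorted(probs[:i], reverse=True)
    return prod(ranked[:j]) * prod(1.0 - p for p in ranked[j:])


def phi_factored(probs, j, i, prefix):
    """phi_{j,i} as (product of the j largest odds) * prefix[i]."""
    ranked = sorted(probs[:i], reverse=True)
    odds = prod(p / (1.0 - p) for p in ranked[:j])
    return odds * prefix[i]


probs = [0.2, 0.8, 0.7, 0.2, 0.1]   # all strictly below 1
prefix = absent_prefix_products(probs)
for i in range(1, len(probs) + 1):
    for j in range(1, i + 1):
        assert abs(phi_direct(probs, j, i) - phi_factored(probs, j, i, prefix)) < 1e-12
print("factored form agrees with the definition")
```

Only the odds product depends on both j and i, which is why the talk concentrates on retrieving that factor quickly.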

Retrieve $\varphi_{j,i}$
• Focus on $\frac{p(e_{1,i})}{1-p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1-p(e_{j,i})}$
• To support all i, make the structure partially persistent
  • Insertion: O(log k)
  • Query: O(log k)
• [Figure: version i stores $e_{1,i}, \dots, e_{6,i}$; version i+1 stores $e_{1,i+1}, \dots, e_{7,i+1}$]

Update (Recompute) a Leaf
• Goal: Compute $\rho_j$, j = 1,…,n, where n = Θ(k)
• $\rho_j = \max_{1 \le i \le n} \varphi_{j,i}$
• Compute the row maxima for the matrix $(\varphi_{j,i})_{k \times n}$!
• The SMAWK algorithm probes O(k) $\varphi_{j,i}$’s
• Using a persistent (2,3)-tree
  • Construction: O(k log k)
  • Query: O(log k)
• Total time for a leaf: O(k log k)

Summary
• Update (recompute) an internal node: O(k)
  • O(log n) such nodes
• Update (recompute) a leaf node: O(k log k)
• Total update time: O(k log n)
• Insertions/deletions can be handled using standard techniques (rebalancing)
• Construction time: O(n log k)
  • Construction is as efficient as the one-time computation

Final Remarks
• Conjecture: Ω(k) is a lower bound for the update time
• Other top-k definitions?
  • for each item, compute its prob. of being one of the top-k
  • return the k items with the largest such prob.
• k-nearest neighbors in uncertain geometric data
  • each point has a pdf

The END
