Dynamic Structures for Top-k Queries on Uncertain Data

Jiang Chen (Columbia University), Ke Yi (HKUST)

Presentation Transcript


  1. Jiang Chen (Columbia University) and Ke Yi (HKUST): Dynamic Structures for Top-k Queries on Uncertain Data

  2. Motivation
     • Uncertain data naturally arises in many applications: sensor data, fuzzy data integration, data cleaning, etc.
     • Items are associated with a “confidence”
        • may or may not be true
        • may or may not exist
     • Very hot topic in the database community

  3. Motivation
     • The top-k answer depends on the interplay between score and confidence
     • Example (score, confidence) pairs: (sensor reading, reliability), (page rank, how well the page matches the query)

  4. Problem Definition [Soliman et al. 07]
     • Return the k items with the maximum probability of being the top-k
     • Example:
        • {t3, t5}: 0.2 * 0.8 = 0.16
        • {t3, t4}: 0.2 * (1 - 0.8) * 0.9 = 0.036
        • {t5, t4}: (1 - 0.2) * 0.8 * 0.9 = 0.576
        • ...
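
The arithmetic in this example can be reproduced with a short Python sketch. It assumes, as the example suggests, that items are listed in decreasing score order (here t3, t5, t4 with probabilities 0.2, 0.8, 0.9) and that they exist independently. The function names `topk_probability` and `best_topk_bruteforce` are illustrative and do not come from the talk.

```python
from itertools import combinations
from math import prod


def topk_probability(probs, chosen):
    """Probability that exactly the items in `chosen` form the top-k.

    probs:  existence probabilities, listed in decreasing score order.
    chosen: set of indices into probs (assumed non-empty).
    Every chosen item must appear, and every unchosen item ranked above
    the last chosen item must not appear.
    """
    last = max(chosen)
    return prod(p if i in chosen else 1.0 - p
                for i, p in enumerate(probs[:last + 1]))


def best_topk_bruteforce(probs, k):
    """Try every k-subset; exponential, for illustration only."""
    return max(combinations(range(len(probs)), k),
               key=lambda c: topk_probability(probs, set(c)))


probs = [0.2, 0.8, 0.9]                     # t3, t5, t4 in score order
print(topk_probability(probs, {0, 1}))      # {t3, t5}
print(topk_probability(probs, {0, 2}))      # {t3, t4}
print(topk_probability(probs, {1, 2}))      # {t5, t4}
```

The brute-force search is only there to make the definition concrete; it is not the method proposed in the slides.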

  5. One-Time Computation
     • Assume items are already sorted by score:
         item:   t1   t2   t3   t4   t5   t6   t7   t8   ...
         prob.:  0.2  0.8  0.7  0.2  0.1  1    0.1  0.8  ...
     • {t2, t5} being the top-2 means t2 and t5 appear while t1, t3, t4 do not
     • Consider the i-th item ti
        • Question: Among t1, ..., ti, which k items have the maximum prob. of appearing while the rest do not appear?
        • Answer: The k items with the largest prob.
     • Time: O(n log k)
     • Just need to answer the question for all i
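
The scan described on this slide can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: it assumes independent items with probabilities in (0, 1], keeps the k largest probabilities of the current prefix in a min-heap, and maintains the two products incrementally. The name `best_topk` is hypothetical.

```python
import heapq


def best_topk(probs, k):
    """Return (probability, sorted indices) of the most probable top-k set.

    probs: existence probabilities in decreasing score order.
    Assumes 0 < p <= 1 for every item, so dividing by an evicted
    probability is safe.
    """
    heap = []        # min-heap of (p, index): k largest probabilities so far
    chosen = 1.0     # product of p over the items in the heap
    rest = 1.0       # product of (1 - p) over prefix items not in the heap
    best, best_i = 0.0, None
    for i, p in enumerate(probs):
        if len(heap) < k:
            heapq.heappush(heap, (p, i))
            chosen *= p
        elif p > heap[0][0]:
            q, _ = heapq.heapreplace(heap, (p, i))  # evict smallest chosen
            chosen *= p / q
            rest *= 1.0 - q
        else:
            rest *= 1.0 - p
        if len(heap) == k and chosen * rest > best:
            best, best_i = chosen * rest, i
    if best_i is None:
        return 0.0, []
    # Rebuild the winning set: the k largest probabilities in the best prefix.
    winners = heapq.nlargest(k, range(best_i + 1), key=probs.__getitem__)
    return best, sorted(winners)


probs = [0.2, 0.8, 0.7, 0.2, 0.1, 1, 0.1, 0.8]   # the slide's example row
print(best_topk(probs, 2))
```

Each item costs one heap operation, which gives the O(n log k) bound quoted on the slide. Working with plain products may underflow for long inputs; a log-space variant would be a natural refinement.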

  6. The Data Structure Problem
     • Build a data structure that supports:
        • Query
           • Given j, return the top-j answer
        • Update
           • Insert an item
           • Delete an item
           • Update the probability of an item
        • Construction

  7. Our Results
     • A data structure of size O(n)
     • Query: O(log n + j)
        • Given j, return the top-j answer, j = 1, ..., k
     • Update: O(k log n) (better than in the paper)
        • Insert an item
        • Delete an item
        • Update the probability of an item
     • Construction: O(n log k) (better than in the paper)

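  8. Overall Structure
     • Internal node u with children v and w:
       $\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \phi_{j'}^v \, \rho_{j-j'}^w\}$, j = 1, …, k
     • [Figure: node u with its top-j prob. $\rho_j^u$; child v contributes its j' largest-prob. items, $\phi_{j'}^v$; child w contributes its top-(j − j') prob., $\rho_{j-j'}^w$]
     • Each leaf has between k and 2k items
     • Top-j query: O(log n + j)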
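
The recurrence for an internal node can be evaluated directly in O(k²) time, as the hedged Python sketch below shows. It assumes that v holds the higher-scored items, that `phi_v[j']` is the probability that the j' largest-probability items of v appear while the rest of v do not (so `phi_v[0]` is the probability that nothing in v appears), and that index 0 of the `rho` arrays is unused. The function name `combine_rho` is illustrative only.

```python
def combine_rho(rho_v, phi_v, rho_w, k):
    """Evaluate rho_j^u = max(rho_j^v, max_{0<=j'<j} phi_{j'}^v * rho_{j-j'}^w).

    All three inputs are lists indexed 0..k; rho_v[0] and rho_w[0] are unused.
    Returns a list rho_u indexed 0..k (rho_u[0] is left at 0.0).
    """
    rho_u = [0.0] * (k + 1)
    for j in range(1, k + 1):
        best = rho_v[j]                      # whole answer comes from v
        for jp in range(j):                  # jp items from v, j - jp from w
            best = max(best, phi_v[jp] * rho_w[j - jp])
        rho_u[j] = best
    return rho_u


# v holds probabilities [0.2, 0.8]; w holds [0.9, 0.5] (lower scores).
rho_v = [0.0, 0.64, 0.16]
phi_v = [0.16, 0.64, 0.16]
rho_w = [0.0, 0.9, 0.45]
print(combine_rho(rho_v, phi_v, rho_w, 2))
```

This quadratic loop is a swapped-in baseline. The O(k) bound claimed in the talk relies on the total monotonicity argument and the SMAWK algorithm, which the sketch does not implement.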

  9. Update an Internal Node
     • $\rho_j^u = \max\{\rho_j^v,\ \max_{0 \le j' \le j-1} \phi_{j'}^v \, \rho_{j-j'}^w\}$, j = 1, …, k
     • Monotone: the last item of the top-(j+1) answer can’t be in front of the last item of the top-j answer

  10. Total Monotonicity
     • A matrix is totally monotone if all its sub-matrices are monotone
     • Enough to check all 2x2 sub-matrices: A > B ⇒ C > D (A, B, C, D are the entries of the sub-matrix)
     • For a k*k totally monotone matrix, the SMAWK algorithm [Aggarwal et al. 87] can find all row maxima in time O(k).

  11. Total Monotonicity
     • Lemma: The matrix $(\phi_{j'}^v \, \rho_{j-j'}^w)$ is totally monotone.
     • An internal node can be updated in time O(k).

  12. Update (Recompute) a Leaf
     • Goal: Compute $\rho_j$, j = 1, …, n, where n = Θ(k)
     • Define $\phi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1 - p(e_{j+1,i})) \cdots (1 - p(e_{i,i}))$, where $e_{1,i}, \ldots, e_{i,i}$ are the first i items sorted by decreasing probability
     • $\rho_j = \max_{1 \le i \le n} \phi_{j,i}$
     • Compute the row maxima of the k*n matrix $(\phi_{j,i})$!
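
To make the definition concrete, here is a deliberately naive Python sketch that recomputes a leaf by evaluating every $\phi_{j,i}$ from scratch. It assumes the leaf's items are given in decreasing score order and restricts the maximum to i ≥ j, since fewer than j items cannot form a top-j answer. The names `phi` and `leaf_rho` are illustrative.

```python
from math import prod


def phi(probs, j, i):
    """phi_{j,i}: the j largest-probability items among the first i appear,
    and the remaining i - j items do not. Requires 0 <= j <= i."""
    e = sorted(probs[:i], reverse=True)      # e_{1,i}, ..., e_{i,i}
    return prod(e[:j]) * prod(1.0 - p for p in e[j:])


def leaf_rho(probs, k):
    """rho[j] = max over i >= j of phi_{j,i}, for j = 1..k (rho[0] unused)."""
    n = len(probs)
    rho = [0.0] * (k + 1)
    for j in range(1, k + 1):
        rho[j] = max((phi(probs, j, i) for i in range(j, n + 1)), default=0.0)
    return rho


print(leaf_rho([0.2, 0.8, 0.7], 2))
```

Sorting inside every call makes this far slower than the O(k log k) bound discussed in the talk; it serves only as a reference against which a faster implementation could be checked.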

  13. Total Monotonicity, Again
     • Lemma: The k*n matrix $(\phi_{j,i})$ is totally monotone.
     • Are we done yet?
     • The SMAWK algorithm probes O(k) entries of the matrix $(\phi_{j,i})$, but we still need to retrieve $\phi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1 - p(e_{j+1,i})) \cdots (1 - p(e_{i,i}))$ on demand

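  14. Retrieve $\phi_{j,i}$
     • Rewrite:
       $\phi_{j,i} = p(e_{1,i}) \cdots p(e_{j,i}) \cdot (1 - p(e_{j+1,i})) \cdots (1 - p(e_{i,i}))$
       $= \frac{p(e_{1,i})}{1 - p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1 - p(e_{j,i})} \cdot (1 - p(e_{1,i})) \cdots (1 - p(e_{i,i}))$
       $= \frac{p(e_{1,i})}{1 - p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1 - p(e_{j,i})} \cdot (1 - p(t_1)) \cdots (1 - p(t_i))$
     • The products $(1 - p(t_1)) \cdots (1 - p(t_i))$ can be pre-computed in time O(k)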
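
The rewrite can be checked numerically with the Python sketch below. It assumes every probability is strictly below 1 so that the ratios p / (1 − p) are defined, and it uses a plain sort as a stand-in for the persistent search structure discussed in the talk. The names `phi_direct`, `prefix_none`, and `phi_rewritten` are hypothetical.

```python
from math import prod


def phi_direct(probs, j, i):
    """phi_{j,i} straight from the definition (requires 0 <= j <= i)."""
    e = sorted(probs[:i], reverse=True)
    return prod(e[:j]) * prod(1.0 - p for p in e[j:])


def prefix_none(probs):
    """none[i] = (1 - p(t_1)) * ... * (1 - p(t_i)); one linear pass."""
    none = [1.0]
    for p in probs:
        none.append(none[-1] * (1.0 - p))
    return none


def phi_rewritten(probs, none, j, i):
    """phi_{j,i} as (product of the j largest odds p/(1-p)) * none[i].

    The sort below stands in for a query on a persistent structure;
    assumes p < 1 for all items.
    """
    e = sorted(probs[:i], reverse=True)
    return prod(p / (1.0 - p) for p in e[:j]) * none[i]


probs = [0.2, 0.8, 0.7, 0.2, 0.1]
none = prefix_none(probs)
print(phi_direct(probs, 2, 3), phi_rewritten(probs, none, 2, 3))
```

The point of the rewrite is that the second factor depends only on i, so each probe reduces to a prefix-product query over the sorted odds p / (1 − p).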

  15. Retrieve $\phi_{j,i}$
     • Focus on $\frac{p(e_{1,i})}{1 - p(e_{1,i})} \cdots \frac{p(e_{j,i})}{1 - p(e_{j,i})}$
     • To support all i, make the structure partially persistent
        • Insertion: O(log k)
        • Query: O(log k)
     • [Figure: version i of the structure holding $e_{1,i}, \ldots, e_{6,i}$ and version i+1 holding $e_{1,i+1}, \ldots, e_{7,i+1}$]

  16. Update (Recompute) a Leaf
     • Goal: Compute $\rho_j$, j = 1, …, n, where n = Θ(k)
     • $\rho_j = \max_{1 \le i \le n} \phi_{j,i}$
     • Compute the row maxima of the k*n matrix $(\phi_{j,i})$!
     • The SMAWK algorithm probes O(k) of the $\phi_{j,i}$’s
     • Using a persistent (2,3)-tree
        • Construction: O(k log k)
        • Query: O(log k)
     • Total time for a leaf: O(k log k)

  17. Summary
     • Update (recompute) an internal node: O(k)
        • O(log n) such nodes
     • Update (recompute) a leaf node: O(k log k)
     • Total update time: O(k log n)
     • Insertions/deletions can be handled using standard techniques (rebalancing)
     • Construction time: O(n log k)
        • Construction is as efficient as the one-time computation

  18. Final Remarks
     • Conjecture: Ω(k) is a lower bound on the update time
     • Other top-k definitions?
        • For each item, compute its prob. of being one of the top-k
        • Return the k items with the largest such prob.
     • k-nearest neighbors in uncertain geometric data
        • Each point has a pdf

  19. The END
