Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions

Yufei Tao, Reynold Cheng, Xiaokui Xiao, Wang Kai Ngai, Ben Kao, Sunil Prabhakar. City University of Hong Kong, Hong Kong Polytechnic University, University of Hong Kong, Purdue University

Multi-dimensional Uncertain Data • Moving objects • An object sends its location to a server whenever its distance from the previously reported location exceeds a certain threshold. • Sensor readings • Each sensor periodically reports the temperature, humidity, UV index, …, in its neighborhood. • Querying the (uncertain) data stored in the server directly is meaningless.

Uncertainty Modeling An object’s location is described by a probability density function.

Probabilistic Range Search Find the clients that are currently in CityU with at least 50% probability. This is a probabilistic range query; 50% is its probability threshold.

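Appearance Probability E.g., uniform pdf: $o.pdf(x) = 1/\mathrm{AREA}(o.ur)$ for $x$ inside the uncertainty region $o.ur$ of object $o$, and $0$ elsewhere. Appearance probability: $P_{app}(o, q) = \int_{o.ur \cap r_q} o.pdf(x)\,dx$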

Appearance Probability The appearance probability must be calculated numerically. Calculation time of one appearance probability in 2D space: 1.3 ms. Time for a random access: 10 ms.
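To illustrate what "calculated numerically" can mean, here is a minimal Monte Carlo sketch. It assumes a uniform pdf over a circular uncertainty region and an axis-parallel rectangular search region; the function name, parameters, and sample count are illustrative choices, not taken from the slides, and the slides do not say which numerical method was actually timed.

```python
import math
import random


def appearance_probability(center, radius, rq, samples=20_000, seed=0):
    """Monte Carlo estimate of the probability that an object with a uniform
    pdf over a disk lies inside the axis-parallel rectangle rq.

    rq is ((x_lo, y_lo), (x_hi, y_hi)). All names here are hypothetical.
    """
    rng = random.Random(seed)
    (x_lo, y_lo), (x_hi, y_hi) = rq
    hits = 0
    for _ in range(samples):
        # Sample uniformly inside the disk; the square root on the radius
        # keeps the density constant per unit area.
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        x = center[0] + r * math.cos(theta)
        y = center[1] + r * math.sin(theta)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits += 1
    return hits / samples
```

Each call draws thousands of samples for a single object, which is why the later slides try to avoid this computation whenever a cheaper test can decide the outcome.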

A good solution should… • Support any pdf. • Minimize the number of page accesses. • Minimize the number of appearance probability calculations. • Minimize the total cost (I/O + CPU)

Main Idea • Pre-compute some “auxiliary information” that can be used to • efficiently decide whether an object appears in a region with at least a certain probability • without calculating its actual appearance probability.

Quick Examples pq=20%

Probabilistically Constrained Regions (PCR)

Probabilistically Constrained Regions (PCR) For a query q with search region rq and probability pq = 0.2 • Observation 1.1 (pruning): an object o cannot satisfy q if rq does not intersect o.pcr(0.2)

Probabilistically Constrained Regions (PCR) For a query q with search region rq and probability pq = 0.8 (= 1 – 0.2) • Observation 1.2 (pruning): an object o cannot satisfy q if rq does not fully contain o.pcr(0.2)
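The two pruning observations can be sketched as a pair of rectangle tests. This is a minimal illustration, not the authors' implementation: boxes are assumed to be `(lo, hi)` tuples of coordinate tuples, and `uniform_pcr` is a hypothetical helper that is only valid for a uniform pdf over a rectangle, where o.pcr(p) is the rectangle shrunk by a fraction p on every side.

```python
def intersects(a, b):
    """True if axis-parallel boxes a and b share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi))


def contains(outer, inner):
    """True if box outer fully contains box inner."""
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    return all(ol <= il and ih <= oh
               for ol, oh, il, ih in zip(o_lo, o_hi, i_lo, i_hi))


def uniform_pcr(mbr, p):
    """Hypothetical PCR for a uniform pdf over the rectangle mbr:
    a fraction p of the mass lies outside each face of the returned box."""
    lo, hi = mbr
    new_lo = tuple(l + p * (h - l) for l, h in zip(lo, hi))
    new_hi = tuple(h - p * (h - l) for l, h in zip(lo, hi))
    return new_lo, new_hi


def can_prune(rq, pq, pcr):
    """Apply Observation 1.1 (pq <= 0.5) or Observation 1.2 (pq > 0.5).
    pcr is a callable mapping a probability p to the box o.pcr(p)."""
    if pq <= 0.5:
        return not intersects(rq, pcr(pq))    # Observation 1.1
    return not contains(rq, pcr(1 - pq))      # Observation 1.2
```

For example, with `mbr = ((0, 0), (10, 10))` and `pcr = lambda p: uniform_pcr(mbr, p)`, a search region lying entirely to the left of x = 2 is pruned for pq = 0.2 without computing any appearance probability.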

Probabilistically Constrained Regions (PCR) A query q with search region rq and probability pq = 0.2 • Observation 1.3 (validating): an object o definitely satisfies q if rq fully contains the part of o.MBR on the left of l1- (or on the right of l1+, or below l2-, or above l2+)

Probabilistically Constrained Regions (PCR) A query q with search region rq and probability pq = 0.8 • Observation 1.4 (validating): an object o definitely satisfies q if rq fully contains the part of o.MBR on the left of l1+ (or on the right of l1-, or below l2+, or above l2-)

Probabilistically Constrained Regions (PCR) A query q with search region rq and probability pq = 0.6 (= 1 – 2 * 0.2) • Observation 1.5 (validating): an object o must satisfy q if rq fully contains the part of o.MBR between l1- and l1+ (or between l2- and l2+)
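The three validating observations can be sketched in the same style. This is an illustration under assumptions: boxes are `(lo, hi)` tuples of coordinate tuples, `pcr` is a callable returning o.pcr(p), and `uniform_pcr` is a hypothetical helper valid only for a uniform pdf over a rectangle. Observation 1.5 is applied with the PCR parameter c = (1 - pq)/2, because the slab between l- and l+ of o.pcr(c) carries probability 1 - 2c.

```python
def contains(outer, inner):
    """True if box outer fully contains box inner."""
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    return all(ol <= il and ih <= oh
               for ol, oh, il, ih in zip(o_lo, o_hi, i_lo, i_hi))


def uniform_pcr(mbr, p):
    """Hypothetical PCR for a uniform pdf over the rectangle mbr."""
    lo, hi = mbr
    return (tuple(l + p * (h - l) for l, h in zip(lo, hi)),
            tuple(h - p * (h - l) for l, h in zip(lo, hi)))


def slab(mbr, dim, lo, hi):
    """Part of mbr whose coordinate on axis dim lies in [lo, hi]."""
    m_lo, m_hi = list(mbr[0]), list(mbr[1])
    m_lo[dim], m_hi[dim] = max(m_lo[dim], lo), min(m_hi[dim], hi)
    return tuple(m_lo), tuple(m_hi)


def can_validate(rq, pq, mbr, pcr):
    """True if one of Observations 1.3, 1.4, 1.5 proves that o satisfies q."""
    dims = range(len(mbr[0]))
    inf = float("inf")
    if pq <= 0.5:
        # Observation 1.3: each outer slab of o.pcr(pq) carries mass pq.
        box = pcr(pq)
        pieces = [slab(mbr, i, -inf, box[0][i]) for i in dims]
        pieces += [slab(mbr, i, box[1][i], inf) for i in dims]
    else:
        # Observation 1.4: the part left of l+ (or right of l-) of
        # o.pcr(1 - pq) carries mass pq.
        box = pcr(1 - pq)
        pieces = [slab(mbr, i, -inf, box[1][i]) for i in dims]
        pieces += [slab(mbr, i, box[0][i], inf) for i in dims]
    # Observation 1.5: the middle slab of o.pcr((1 - pq) / 2) carries mass pq.
    mid = pcr((1 - pq) / 2)
    pieces += [slab(mbr, i, mid[0][i], mid[1][i]) for i in dims]
    return any(contains(rq, piece) for piece in pieces)
```

A return value of False does not mean the object fails the query; it only means the appearance probability still has to be computed.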

Probabilistically Constrained Regions (PCR)
o.pcr(0.2) provides 5 heuristics to reduce CPU cost. In general, for a prob-range query with probability threshold pq:
• if pq <= 0.5
  • o may be pruned using o.pcr(pq) (Observation 1.1)
  • o may be validated using o.pcr(pq) (Observation 1.3)
  • o may be validated using o.pcr((1 - pq)/2) (Observation 1.5)
• if pq > 0.5
  • o may be pruned using o.pcr(1 - pq) (Observation 1.2)
  • o may be validated using o.pcr(1 - pq) (Observation 1.4)
  • o may be validated using o.pcr((1 - pq)/2) (Observation 1.5), since the region between l- and l+ of o.pcr(c) carries probability 1 - 2c
pq in [0, 1] → infinitely many values of pq → infinitely many PCRs: impractical! It is possible to use a finite number of PCRs to achieve pruning and validating.
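As a compact summary of the general rules, the following sketch maps a threshold pq to the PCR parameter each observation needs. The function name and the tuple layout are illustrative assumptions. For Observation 1.5 the parameter is derived from the requirement that the middle slab of o.pcr(c) carries probability 1 - 2c >= pq, which gives c = (1 - pq)/2 for every threshold and matches the pq = 0.6, o.pcr(0.2) example.

```python
def pcr_parameters(pq):
    """Return (observation, kind, p) triples: which o.pcr(p) each rule uses."""
    if not 0.0 <= pq <= 1.0:
        raise ValueError("pq must lie in [0, 1]")
    if pq <= 0.5:
        rules = [("1.1", "prune", pq), ("1.3", "validate", pq)]
    else:
        rules = [("1.2", "prune", 1 - pq), ("1.4", "validate", 1 - pq)]
    # Middle slab of o.pcr(c) has probability 1 - 2c, so c = (1 - pq) / 2.
    rules.append(("1.5", "validate", (1 - pq) / 2))
    return rules
```

Every returned parameter lies in [0, 0.5], which is why a catalog of PCR values restricted to that interval is enough.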

Using PCRs in a Conservative Way E.g., U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }, for a query q with search region rq and probability pq = 0.25 • Observation 1.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.25) • Observation 2.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.2)

Using PCRs in a Conservative Way U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 } for a query q with search region rq and probability pq= 0.75 • Observation 1.2 an object o cannot satisfy q if rq does not fully contain o.pcr(0.25) • Observation 2.2 an object o cannot satisfy q if rq does not fully contain o.pcr(0.3)
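The conservative choice of a catalog value for pruning can be sketched as a lookup in a sorted list. The function name is hypothetical, and the sketch assumes the catalog is sorted and contains both 0 and 0.5, as in the example catalogs on these slides. For pq <= 0.5 it picks the largest catalog value not exceeding pq (a larger PCR, so "does not intersect" still implies pruning); for pq > 0.5 it picks the smallest catalog value not below 1 - pq (a smaller PCR, so "does not fully contain" still implies pruning).

```python
import bisect

U_CATALOG = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]


def pruning_catalog_value(pq, catalog=U_CATALOG):
    """Pick the catalog value whose PCR can stand in for the exact one."""
    if pq <= 0.5:
        # Observation 2.1: largest catalog value <= pq.
        return catalog[bisect.bisect_right(catalog, pq) - 1]
    # Observation 2.2: smallest catalog value >= 1 - pq.
    # Rounding guards against float noise such as 1 - 0.7 = 0.30000000000000004.
    target = round(1 - pq, 9)
    return catalog[bisect.bisect_left(catalog, target)]
```

With the catalog above, pq = 0.25 maps to o.pcr(0.2) and pq = 0.75 maps to o.pcr(0.3), as in Observations 2.1 and 2.2.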

U-catalog Size m
{0, 0.5}, m = 2
{0, 0.25, 0.5}, m = 3
{0, 0.1, 0.2, 0.3, 0.4, 0.5}, m = 6
…
larger m → more PCRs → greater pruning/validating power → less CPU cost
larger m → higher space consumption → larger I/O cost
m = 9

Conservative Functional Boxes (CFB)
U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }
o.pcr: 2m values for each dimension
o.cfbout: 4 values for each dimension
o.cfbin: 4 values for each dimension
total for the two CFBs: 8 values for each dimension
For m = 9, CFB : PCR = 8 : 18 values per dimension

Conservative Functional Boxes (CFB) For a query q with search region rq and probability pq = 0.25, U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 } • Observation 1.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.25) • Observation 2.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.2) • Observation 3.1: an object o cannot satisfy q if rq does not intersect o.cfbout(0.2)

Conservative Functional Boxes (CFB) For a query q with search region rq and probability pq = 0.75, U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 } • Observation 1.2: an object o cannot satisfy q if rq does not fully contain o.pcr(0.25) • Observation 2.2: an object o cannot satisfy q if rq does not fully contain o.pcr(0.3) • Observation 3.2: an object o cannot satisfy q if rq does not fully contain o.cfbin(0.3)

Comparing CFBs with PCRs • CFBs have weaker pruning/validating power than PCRs • But CFBs require less space than PCRs

Finding Conservative Functional Boxes • goal: minimize • for the i-th dimension, minimize with the following constraints: • Linear Programming: Simplex Method
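To give a feel for the kind of linear program involved, here is a small sketch for one boundary in one dimension. Everything specific in it is an assumption, since the slide's formulas are not reproduced here: the lower boundary of o.cfbout on a dimension is modeled as a linear function a + b*p, it must not exceed the lower boundary of o.pcr(p) at any catalog value p (so that cfbout encloses every PCR), and tightness is measured by the sum of a + b*p over the catalog. Because this LP has only two variables, the sketch enumerates candidate vertices by brute force instead of running the Simplex method named on the slide.

```python
from itertools import combinations


def fit_outer_lower_bound(ps, lows, eps=1e-9):
    """Find (a, b) with a + b*p_j <= lows[j] for every catalog value p_j,
    maximizing sum_j (a + b*p_j).

    An optimal vertex of this two-variable LP makes two constraints tight,
    so it is a line through two of the points (p_j, lows[j]).
    """
    best = None
    for (p1, l1), (p2, l2) in combinations(zip(ps, lows), 2):
        if p1 == p2:
            continue
        b = (l2 - l1) / (p2 - p1)
        a = l1 - b * p1
        # Keep the line only if it stays below every PCR boundary.
        if all(a + b * p <= low + eps for p, low in zip(ps, lows)):
            score = sum(a + b * p for p in ps)
            if best is None or score > best[0]:
                best = (score, a, b)
    if best is None:
        raise ValueError("need at least two distinct catalog values")
    return best[1], best[2]
```

The two numbers (a, b) found this way for the lower boundary, plus two more for the upper boundary, would account for the 4 values per dimension quoted for o.cfbout; o.cfbin would be fitted with the inequalities reversed.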

More in Our Paper • The U-tree: a dynamic index designed to accelerate prob-range queries.

Experimental Results
• data space: [0, 10000]^d
• uncertainty region shape: circle (sphere)
• uncertainty region radius: 250
• data sets:
  • Long Beach County (LB): 53k 2D objects, uniform pdf
  • California (CA): 62k 2D objects, Gaussian pdf
  • Aircraft: 100k 3D objects, uniform pdf
• query set: 100 queries for each data set, with various sizes of rq and different pq

Experimental Results

Experimental Results Query performance vs. search region size (LB, pq = 0.6)

Experimental Results Query performance vs. search region size (CA, pq = 0.6)

Experimental Results Query performance vs. search region size (Aircraft, pq = 0.6)

Experimental Results Query performance vs. probability threshold (LB, qs = 1500)

Experimental Results Query performance vs. probability threshold (CA, qs = 1500)

Experimental Results Query performance vs. probability threshold (Aircraft, qs = 1500)

Summary • A fast method for answering probabilistic range search queries.
