Query Assurance on Data Streams. Ke Yi (AT&T Labs, now at HKUST), Feifei Li (Boston U, now at Florida State), Marios Hadjieleftheriou (AT&T Labs), Divesh Srivastava (AT&T Labs), George Kollios (Boston U). Outsourcing: manufacturing, software development, service, data. TRUST?
Data Outsourcing Model • Owner: owns data • Servers: host (or process) the data and provide query services • Clients: query the owner’s data through servers (possibly = owner) • [Diagram: owner, servers, clients; treating clients as the owner gives the unified client model]
Outsourced Database for Better Query Services • Company with headquarters in US • Servers that are close to local clients and maintained by local business partners
Data Outsourcing Model • Owner/client: owns data and issues queries • Servers: host (or process) the data and provide query services • [Diagram: the unified client model, with one owner/client and the servers]
Data Stream Outsourcing [Figure: an IP traffic stream (0 1 1 0 0 1 …) coming from a small business crosses the network to Gigascope (analysis tool by …), which returns result statistics]
Concrete Example • IP stream: …, pm, …, p3, p2, p1, where each packet p carries (srcIP, destIP) • Query: SELECT COUNT(*) FROM IP_trace GROUP BY srcIP, destIP • Answer: one count per group
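As a point of reference for what the server is asked to compute, the group-by count above can be sketched in a few lines. This is only an illustration: the function name `group_counts` and the sample addresses are made up here, and a real stream would not fit in a list.

```python
from collections import Counter

def group_counts(packets):
    """Exact answer to the GROUP BY query: one count per (srcIP, destIP) group."""
    counts = Counter()
    for src_ip, dest_ip in packets:
        counts[(src_ip, dest_ip)] += 1   # each packet increments exactly one group
    return counts

# Hypothetical three-packet stream, used only to exercise the function.
stream = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.3", "10.0.0.2"),
]
answer = group_counts(stream)
```

The memory held by `counts` grows with the number of distinct groups, which is the space issue the following slides address.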
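The Model for the Stream [Figure: the stream S delivers one update per time step (T=1, T=2, T=3, …), each tagged with a group_id (1, 1, i, …); the query result is a vector V = (V1, V2, V3, …, Vi, …, Vn) with one entry per group] Major issue: space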
Information Security Issues • The third party (server) cannot be trusted • Lazy service provider • Malicious intent • Compromised equipment • Unintentional errors (e.g. bugs)
A Simple Solution [Sion, VLDB 05] • Accumulate b queries • The owner computes r of them itself • Compute the hashes of these results, with some fake ones • Ask the server to identify these r queries • Problems: • Can only prevent a (very) lazy service provider • How about malicious attacks? • Need to accumulate enough queries • What if there is only one query? • High cost: r queries need to be processed locally • High failure probability: 10%-30% (typically)
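Continuous Query Verification: CQV [Figure: as the stream S arrives (T=1, T=2, T=3), the server updates the result vector V = (V1, …, Vn) while the client updates a small synopsis X; at time T the client checks the vector reported by the server against the synopsis XT. A correct vector gives no alarm; a modified vector gives an alarm]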
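PIRS: Polynomial Identity Random Synopsis • Choose a prime $p > \max\{m/\delta, n\}$ • Choose a random number $a \in \mathbb{Z}_p$ • Keep $X(V) = (a-1)^{V_1}(a-2)^{V_2}\cdots(a-n)^{V_n} \bmod p$ • Given the server's answer $W$, compute $X(W)$ the same way: raise an alarm if $X(W) \neq X(V)$, otherwise no alarm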
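Incremental Update to PIRS [Figure: at T=1 a tuple for group 1 arrives and triggers an update to v1; at T=2 a tuple for group i triggers an update to vi] An update that increments vi multiplies X by (a − i) modulo p, so X is maintained without storing V.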
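The update and check steps can be sketched as follows. This is a minimal illustration, assuming the synopsis has the product form X(V) = ∏ (a − i)^Vi mod p and that increments are non-negative; the class name `PIRS`, the method names, and the choice of modulus are ours, not the authors' code.

```python
import random

class PIRS:
    """Sketch of a polynomial-identity synopsis: X(V) = prod_i (a - i)^V_i mod p."""

    def __init__(self, p, rng=None):
        self.p = p                              # prime modulus (assumed > m/delta and > n)
        self.a = (rng or random).randrange(p)   # secret random evaluation point
        self.x = 1                              # synopsis of the all-zero vector

    def update(self, i, u=1):
        """Fold the stream tuple (group i, non-negative increment u) into X."""
        self.x = self.x * pow((self.a - i) % self.p, u, self.p) % self.p

    def matches(self, w):
        """Recompute X(W) for a claimed vector W (groups numbered from 1) and compare."""
        xw = 1
        for i, wi in enumerate(w, start=1):
            xw = xw * pow((self.a - i) % self.p, wi, self.p) % self.p
        return xw == self.x


P61 = 2**61 - 1                       # a Mersenne prime, chosen only for illustration
synopsis = PIRS(P61, random.Random(7))
for group in [1, 3, 3, 2, 3]:         # five count-query tuples
    synopsis.update(group)
honest = [1, 1, 3]                    # the true V after this stream
tampered = [1, 2, 2]                  # same total, two wrong entries
```

Here `synopsis.matches(honest)` always holds, while `synopsis.matches(tampered)` can only hold if the random point `a` happens to be a root of the difference polynomial, which is the low-probability event bounded on the next slide.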
It Solves the CQV Problem! Theorem: Given any W ≠ V, PIRS raises an alarm with probability at least 1−δ; if W = V it raises no alarm. Proof idea: a polynomial with 1 as the leading coefficient is completely determined by its zeroes (and the corresponding multiplicities), due to the fundamental theorem of algebra. So when W ≠ V the two polynomials differ, and X(V)=X(W) happens at no more than m values of x. Since we have p > m/δ choices for a, the probability that X(V)=X(W) is at most δ.
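The counting argument can be written out as a short derivation. This is a sketch under stated assumptions: the synopsis is the product form $X(V) = \prod_i (a-i)^{V_i} \bmod p$, entries are non-negative, $m$ bounds the sum of the entries of both $V$ and $W$, and $p > n$ so that the group ids $1,\dots,n$ are distinct modulo $p$. The notation $f_V$ is introduced here for convenience.

```latex
\begin{align*}
f_V(x) &= \prod_{i=1}^{n} (x - i)^{V_i}, \qquad X(V) = f_V(a) \bmod p, \\
V \neq W &\;\Longrightarrow\; f_V \not\equiv f_W \text{ over } \mathbb{Z}_p
  \quad \text{(the multiplicity of some root } i \text{ differs)}, \\
\deg(f_V - f_W) \le m &\;\Longrightarrow\;
  \bigl|\{\, x \in \mathbb{Z}_p : f_V(x) = f_W(x) \,\}\bigr| \le m, \\
\Pr_{a}\bigl[\, X(V) = X(W) \,\bigr] &\le \frac{m}{p} < \delta
  \quad \text{(since } p > m/\delta \text{)}.
\end{align*}
```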
Optimality of PIRS Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words: p, a, X(V)), spends O(1) time to process a tuple for a count query, and O(log u) time to process a tuple for a sum query. Theorem: Any synopsis for solving the CQV problem with error probability at most δ has to keep Ω(log(min{n,m}/δ)) bits.
In Practice • Failure probability • Choose the largest p that fits in a word • E.g., if we use 64-bit words, then the failure probability is $\delta = m/p < 2^{-32}$ (assuming $m < 2^{32}$) • Space requirement • p, a, X(V): 3 words! • Time requirement • For count queries / selection queries: • One subtraction, one multiplication, one mod • For sum queries: • log(u) multiplications: exponentiation by squaring
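The per-tuple costs listed above can be made concrete with a small sketch. The function names `count_update` and `sum_update` are hypothetical, and the constant below assumes that 2^64 − 59 is the largest prime that fits in a 64-bit word; Python integers are unbounded, so the word size is only mimicked.

```python
from fractions import Fraction

P = 2**64 - 59   # assumed here: the largest prime below 2**64

def count_update(x, a, i, p=P):
    """Count query: one subtraction, one multiplication, one mod."""
    return (x * (a - i)) % p

def sum_update(x, a, i, u, p=P):
    """Sum query: multiply X by (a - i)^u using exponentiation by squaring."""
    base = (a - i) % p
    power = 1
    while u > 0:                 # about log2(u) iterations
        if u & 1:
            power = power * base % p
        base = base * base % p
        u >>= 1
    return x * power % p

# Failure bound m/p for the largest allowed m; it stays below 2**-32.
failure_bound = Fraction(2**32 - 1, P)
```

Applying `count_update` u times gives the same synopsis as one call to `sum_update` with increment u, which is why the two query types can share one structure.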
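Multiple Queries [Figure: instead of keeping a vector V1..n1 with synopsis X1 for Q1 and a vector V1..n2 with synopsis X2 for Q2, the two result vectors are concatenated into one vector V1..(n1+n2) with a single synopsis X; e.g. one stream tuple (1, 8, …) triggers an update to v1 and an update to v8] Theorem: our synopses use constant space for multiple queries.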
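One way to read the figure is that each query's groups are shifted into a shared id range so that a single synopsis covers all queries. The sketch below follows that reading; the function name `combined_synopsis`, the tuple layout `(query, local_id, increment)`, and the sample sizes are assumptions for illustration, and the synopsis is assumed to be the product form ∏ (a − i)^Vi mod p.

```python
def combined_synopsis(tuples, sizes, a, p):
    """Maintain one synopsis X over the concatenated result vectors of several queries.

    tuples: iterable of (query index, local group id starting at 1, increment)
    sizes:  number of groups of each query, e.g. [n1, n2]
    """
    offsets = [0]
    for n in sizes[:-1]:
        offsets.append(offsets[-1] + n)      # query q starts after the previous queries
    x = 1
    for query, local_id, u in tuples:
        gid = offsets[query] + local_id      # position in V[1..(n1+n2+...)]
        x = x * pow((a - gid) % p, u, p) % p
    return x

p = 2**61 - 1                                # illustrative prime modulus
a = 1234567890123                            # fixed here only to keep the example reproducible
# With n1 = 5, group 3 of the second query lands at global position 8.
x = combined_synopsis([(0, 1, 1), (1, 3, 1)], sizes=[5, 4], a=a, p=p)
```

The state is still a single value `x` (plus `p` and `a`), regardless of how many queries are registered.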
Some Experiments • We use real streams: • World Cup Data (WC) • IP traces from the AT&T network (IP) • We perform the following query: • WC: Aggregate on response size and group by client id/object id (50M groups) • IP: Aggregate on packet size and group by source IP/destination IP (7M groups) • Hardware for the client: • 2.8GHz Intel Pentium 4 CPU • 512 MB memory • Linux Machine
Memory Usage of Exact • Exact’s memory usage grows linearly and is expensive. • PIRS uses a constant 3 words (27 bytes) at all times.
Update Time (per tuple) of Exact [Figure annotation: cache misses] • Exact is fast when memory usage is small. • It becomes extremely slow due to cache misses.
Running Time Analysis: Average Update Time • The IP stream exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.
Multiple Queries: Exact Memory Usage • Exact’s memory usage is linear in the number of queries and increases over time. • PIRS always uses only 3 words.
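PIRSγ: An Exact Solution [Figure: log 1/δ layers, each with k buckets; in each layer, group i with value vi is hashed to one bucket (e.g. bi=2), and every bucket keeps its own PIRS. A layer raises an alarm if at least γ of its buckets raise alarms; the synopsis raises an alarm if at least one layer raises an alarm]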
PIRSγ: An Exact Solution Theorem: PIRSγ requires $O(\gamma^2 \log\frac{1}{\delta} \log n)$ bits, spends $O(\gamma \log\frac{1}{\delta})$ time to process a tuple, and solves CQV with semantic load shedding.
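The layered structure can be sketched as below. This is a rough illustration rather than the authors' construction: the class name `LayeredPIRS`, the choice of k = γ² buckets per layer (suggested only by the γ² factor in the space bound), the linear hash, and one random point per bucket are all assumptions made here.

```python
import random

class LayeredPIRS:
    """Hypothetical sketch: layers of hashed buckets, each holding a small PIRS value."""

    def __init__(self, gamma, layers, p, seed=0):
        rng = random.Random(seed)
        self.gamma, self.p = gamma, p
        self.k = gamma * gamma                      # assumed bucket count per layer
        # One linear hash (c1, c2) per layer and one random point per bucket.
        self.hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(layers)]
        self.points = [[rng.randrange(p) for _ in range(self.k)] for _ in range(layers)]
        self.x = [[1] * self.k for _ in range(layers)]

    def _bucket(self, layer, i):
        c1, c2 = self.hashes[layer]
        return ((c1 * i + c2) % self.p) % self.k

    def _fold(self, table, layer, i, u):
        b = self._bucket(layer, i)
        a = self.points[layer][b]
        table[b] = table[b] * pow((a - i) % self.p, u, self.p) % self.p

    def update(self, i, u=1):
        """Fold the stream tuple (group i, non-negative increment u) into every layer."""
        for layer in range(len(self.hashes)):
            self._fold(self.x[layer], layer, i, u)

    def alarm(self, w):
        """w maps group id -> claimed value; alarm if some layer has >= gamma bad buckets."""
        for layer in range(len(self.hashes)):
            xw = [1] * self.k
            for i, wi in w.items():
                self._fold(xw, layer, i, wi)
            bad = sum(1 for b in range(self.k) if xw[b] != self.x[layer][b])
            if bad >= self.gamma:
                return True
        return False


truth = {1: 2, 2: 1, 3: 4, 4: 1, 5: 3}
sketch = LayeredPIRS(gamma=3, layers=5, p=2**61 - 1, seed=1)
for group, value in truth.items():
    sketch.update(group, value)
two_errors = dict(truth)
two_errors[1] += 1          # first wrong entry
two_errors[2] -= 1          # second wrong entry (still below gamma = 3)
```

A bucket can only mismatch if at least one of its groups is wrong, so fewer than γ wrong entries never trigger an alarm in this sketch; with γ or more wrong entries, an alarm depends on the hashing spreading them over enough buckets in at least one layer.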
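Intuition on Approximation [Figure: probability of raising an alarm (vertical axis) against the number of errors (horizontal axis, marked γ−, γ, γ+); the ideal synopsis is a step at γ, and the approximation makes the transition between γ− and γ+]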
PIRS±γ: An Approximate Solution Theorem: PIRS±γ requires $O(\gamma \log\frac{1}{\delta} \log n)$ bits and spends $O(\gamma \log\frac{1}{\delta})$ time to process a tuple.
PIRS±γ: An Approximate Solution Theorem: PIRS±γ 1. raises no alarm with probability at least 1−δ on any W with at most γ− errors; 2. raises an alarm with probability at least 1−δ on any W with at least γ+ errors; for any c > −ln ln 2 ≈ 0.367. The proof uses the intuition of the coupon collector problem and the Chernoff bound.
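PIRS±γ: An Approximate Solution [Figure: log 1/δ layers, each with k buckets; in each layer, group i with value vi is hashed to one bucket (e.g. bi=2), and every bucket keeps its own PIRS. A layer raises an alarm if all k of its buckets raise alarms; the synopsis raises an alarm if the majority of layers raise alarms]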
Related Techniques to PIRS • Incremental Cryptography • Block operations (insert, delete); cannot support arithmetic operations • Sketches • Provide approximate estimates • We want absolute accuracy • Often much more costly • Space $O(1/\epsilon)$ or $O(1/\epsilon^2)$ • Fingerprinting Technique • PIRS is a fingerprinting technique • Polynomial identity verification
Thanks! • Questions