PIRS: Query Verification on Data Streams. Ke Yi, Hong Kong University of Science and Technology Feifei Li, Florida State University Marios Hadjieleftheriou, AT&T Labs George Kollios, Boston University Divesh Srivastava, AT&T Labs.
Work done while the first and second authors were at AT&T Labs.
Publishing Data and Outsourcing Query Service. [Figure: an IP traffic stream (0 1 1 0 0 1 …) coming from the network is fed to Gigascope, an analysis tool, which returns statistics as results.]
Revisiting the CISCO – AT&T Example. [Figure: the network's IP traffic stream (0 1 1 0 0 1 …) is fed to Gigascope, which returns statistics.] Lawyers: sign the trust agreement. Could we (computer scientists) help?
Concrete Example. IP stream: …, pm, …, p3, p2, p1, where each packet is a tuple (srcIP, destIP, packet_size). Continuous query: SELECT SUM(packet_size) FROM IP_trace GROUP BY srcIP, destIP. Answer: one running sum per (srcIP, destIP) group, changing over time.
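To make the query concrete, here is a minimal sketch of the exact answer the server would maintain. The function name `group_by_sum` and the sample packets are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def group_by_sum(stream):
    """Return {(src_ip, dest_ip): total packet_size} for the tuples seen so far."""
    totals = defaultdict(int)
    for src_ip, dest_ip, packet_size in stream:
        totals[(src_ip, dest_ip)] += packet_size
    return dict(totals)

# Hypothetical packets: (srcIP, destIP, packet_size)
packets = [
    ("10.0.0.1", "10.0.0.2", 1),
    ("10.0.0.1", "10.0.0.2", 9),
    ("10.0.0.3", "10.0.0.4", 7),
]
answer = group_by_sum(packets)
```

This exact table grows with the number of groups, which is the cost the client-side synopsis is meant to avoid.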
Continuous Query Verification (CQV) on Data Streams. Both the client and the server monitor the same stream. • The client registers a query, e.g. SELECT SUM(packet_size) FROM IP_Trace GROUP BY src_ip, dest_ip • The server maintains the exact answer (Group 1, Group 2, Group 3, …) and reports it upon request • The client maintains a synopsis X
The Model for the Stream. The stream S is a sequence of tuples agg_attribute | group_id, e.g. 1|1 at T=1, 9|1 at T=2, 7|i at T=3, … The answer is the vector V = (V1, V2, V3, …, Vi, …, Vn) of per-group aggregates; after these three tuples V1 = 10, Vi = 7, and all other entries are 0.
Continuous Query Verification: CQV. For each tuple of S (1|1 at T=1, 9|1 at T=2, 7|i at T=3, …) the server updates V and the client updates its synopsis X. When the server reports an answer, the client checks it against the synopsis: no alarm if the reported vector equals V (V1 = 10, Vi = 7, all others 0), alarm if it differs.
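PIRS: Polynomial Identity Random Synopsis. • Choose a prime $p > \max\{m/\delta,\, n\}$ • Choose a number $a$ uniformly at random from $\mathbb{Z}_p$ • Synopsis: $X(V) = \prod_{i=1}^{n} (a - i)^{v_i} \bmod p$ • Given a reported answer $W$: raise an alarm if $X(W) \neq X(V)$, otherwise no alarm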
Incremental Update to PIRS. Each tuple of the stream S (1|1, 9|1, 7|i, …) triggers one update (to v1, to v1, to vi, …). An update to group $i$ with value $u$ sets $X(V) \leftarrow X(V) \cdot (a - i)^{u} \bmod p$, which takes $O(\log u)$ time using exponentiation by squaring.
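The update rule can be sketched in a few lines. This is an illustrative sketch, not the authors' library: the class and method names are assumptions, the modulus 2^61 − 1 is an assumed choice of prime (it must exceed both m/δ and n), and the synopsis form X(V) = ∏ (a − i)^{v_i} mod p is the one implied by the decomposability and polynomial-root arguments in these slides.

```python
import random

P = 2**61 - 1  # a Mersenne prime; assumes m/delta and n are both below it

class PIRS:
    """Client-side synopsis X(V) = prod_i (a - i)^{v_i} mod p (illustrative names)."""

    def __init__(self, p=P, a=None):
        self.p = p
        # a is the secret random seed; it can be fixed for reproducible tests
        self.a = random.randrange(p) if a is None else a
        self.x = 1  # synopsis of the all-zero vector

    def update(self, group, u):
        """Fold in a tuple u|group (u >= 0) using modular exponentiation."""
        self.x = (self.x * pow(self.a - group, u, self.p)) % self.p

    def raises_alarm(self, answer):
        """Recompute X(W) for a reported answer {group: value} and compare with X(V)."""
        xw = 1
        for group, w in answer.items():
            xw = (xw * pow(self.a - group, w, self.p)) % self.p
        return xw != self.x

pirs = PIRS()
for u, group in [(1, 1), (9, 1), (7, 4)]:  # tuples 1|1, 9|1, 7|4
    pirs.update(group, u)
print(pirs.raises_alarm({1: 10, 4: 7}))  # the correct answer never alarms: False
```

Python's three-argument `pow` performs the square-and-multiply exponentiation, so each update costs O(log u) multiplications. A wrong answer is flagged only with high probability, so no fixed output is shown for that case.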
It Solves the CQV Problem! Theorem: Given any W ≠ V, PIRS raises an alarm with probability at least 1 − δ. Proof idea: a polynomial with leading coefficient 1 is completely determined by its zeroes (by the fundamental theorem of algebra), so the polynomials for V and W are different, and X(V) = X(W) happens at no more than m values of x. Since we have p > m/δ choices for a, the probability that X(V) = X(W) is at most δ.
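The counting step of the argument can be written out as follows. This is a sketch under the assumption that both vectors have total weight at most $m$, so that the polynomials below have degree at most $m$; the names $f_V$ and $f_W$ are introduced here for illustration.

```latex
\[
f_V(x) = \prod_{i=1}^{n} (x - i)^{v_i}, \qquad
f_W(x) = \prod_{i=1}^{n} (x - i)^{w_i}, \qquad
X(V) = f_V(a) \bmod p .
\]
\[
V \neq W \;\Longrightarrow\; f_V - f_W \not\equiv 0
\quad\text{and}\quad \deg(f_V - f_W) \le m ,
\]
\[
\Pr_{a}\bigl[X(V) = X(W)\bigr]
= \Pr_{a}\bigl[(f_V - f_W)(a) \equiv 0 \pmod p\bigr]
\le \frac{m}{p} < \delta .
\]
```

The last inequality uses $p > m/\delta$ and the fact that a nonzero polynomial of degree at most $m$ over the field $\mathbb{Z}_p$ has at most $m$ roots.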
Optimality of PIRS. Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words, i.e., p, a, X(V)), and spends O(1) time to process a tuple for a count query or O(log u) time to process a tuple for a sum query. Theorem: Any synopsis that solves the CQV problem with error probability at most δ has to keep Ω(log(min{n, m}/δ)) bits.
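Multiple Queries. Instead of keeping a separate synopsis X1 for Q1 (over V1..n1) and X2 for Q2 (over V1..n2), treat the two answers as one concatenated vector V1..(n1+n2) and keep a single synopsis X. A tuple such as 9|1,8 then triggers two updates, one to v1 and one to v8. Theorem: our synopses use constant space for multiple queries.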
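One way to picture the concatenation is an index offset per query, sketched below. The helper names `combined_index` and `fold`, the sizes n1 = 5 and n2 = 8, the fixed seed, and the modulus are all hypothetical; the slides only state that a single synopsis over the concatenated vector suffices.

```python
P = 2**61 - 1   # assumed prime modulus
A = 123456789   # fixed seed for illustration; in practice random and secret

def combined_index(query, group, sizes):
    """Map a 1-based group of a 0-based query into the concatenated vector."""
    return sum(sizes[:query]) + group

def fold(x, index, u, a=A, p=P):
    """Multiply the synopsis by (a - index)^u mod p."""
    return (x * pow(a - index, u, p)) % p

sizes = [5, 8]  # hypothetical n1 and n2
x = 1           # one synopsis shared by both queries
# A tuple with value 9 that falls in group 1 of Q1 and group 3 of Q2
for query, group in [(0, 1), (1, 3)]:
    x = fold(x, combined_index(query, group, sizes), 9)
```

With these sizes the two updates land on entries 1 and 8 of the combined vector, and the client still stores a single value `x`.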
Handling Load Shedding. • Semantic load shedding: drop tuples from certain groups, so a small number of groups have errors • Random load shedding: drop tuples at random, so all groups have a small amount of error
CQV with Semantic Load Shedding. The server drops certain tuples according to their groups (stream tuples 9|1, 7|i, 2|j, 1|1, 4|k, 5|1, …) and claims that at most γ groups have errors. Goal: detect whether more than γ groups have errors. We have designed synopses that use O(γ log(1/δ) log n) bits of space and achieve error probability at most δ.
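PIRSγ: An Exact Solution. [Figure: log(1/δ) layers, each with k buckets, each bucket holding its own PIRS. In every layer a mapping b assigns each group to one bucket, e.g. b(8) = 2 sends updates to v8 to bucket 2. A layer raises an alarm if sufficiently many of its buckets raise alarms; the synopsis raises an alarm if at least one layer does.]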
PIRSγ: An Exact Solution. Theorem: PIRSγ requires O(γ² log(1/δ) log n) bits, spends O(log(1/δ)) time to process a tuple, and solves CQV with semantic load shedding.
Intuition on Approximation. [Figure: probability of raising an alarm versus number of errors. The ideal synopsis is a step at γ; the approximation rises gradually between γ⁻ and γ⁺.]
PIRS±γ: An Approximate Solution. Theorem: PIRS±γ requires O(γ log(1/δ) log n) bits and spends O(γ log(1/δ)) time to process a tuple.
CQV with Random Load Shedding. The server drops tuples at random, so all groups have small errors. Goal: detect whether any group has an error greater than a claimed threshold. Theorem: Any synopsis that solves this problem with error probability at most δ requires Ω(n) bits (by reduction from the problem of estimating the infinite frequency moment, i.e., the number of occurrences of the most frequent item).
Sliding Window and Other Queries. • It is easy to extend PIRS to the sliding window model since it is decomposable, i.e., X(v1 + v2) = X(v1) · X(v2). • Other queries can be handled if they can be transformed into group-by aggregation queries. • Details are in the paper.
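The decomposability property can be checked numerically with a small sketch. The synopsis form (a product of (a − i)^{v_i} mod p), the fixed seed, the modulus, and the example vectors are assumptions chosen for illustration.

```python
P = 2**61 - 1   # assumed prime modulus
A = 123456789   # fixed seed for reproducibility; in practice random and secret

def synopsis(v, a=A, p=P):
    """X(v) = prod_i (a - i)^{v_i} mod p for a sparse vector {group: value}."""
    x = 1
    for i, v_i in v.items():
        x = (x * pow(a - i, v_i, p)) % p
    return x

v1 = {1: 10, 4: 7}            # aggregates from one sub-window
v2 = {1: 2, 3: 5}             # aggregates from the next sub-window
merged = {1: 12, 3: 5, 4: 7}  # entry-wise sum v1 + v2

holds = synopsis(merged) == (synopsis(v1) * synopsis(v2)) % P
```

Because the synopses of sub-windows multiply, a window's synopsis can be assembled from per-sub-window synopses without revisiting the tuples.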
Some Experiments. • We use real streams: • World Cup data (WC) • IP traces from the AT&T network (IP) • We run the following queries: • WC: aggregate on response size, grouped by client id/object id (50M groups) • IP: aggregate on packet size, grouped by source IP/destination IP (7M groups) • Hardware for the client: • 2.8 GHz Intel Pentium 4 CPU • 512 MB memory • Linux machine
Detection Accuracy. Over 100,000 random attacks, PIRS identified all of them.
Memory Usage of Exact. Exact's memory usage is linear and expensive. PIRS uses only a constant 3 words (27 bytes) at all times.
Update Time (per tuple) of Exact. • Exact is fast when its memory usage is small. • It becomes extremely slow once cache misses and memory swap operations set in.
Running Time Analysis: Average Update Time. IP exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.
Multiple Queries: Exact Memory Usage. Exact's memory usage is linear in the number of queries and increases over time. PIRS always uses only a constant 3 words (27 bytes).
The Library Download PIRS and other synopses at: http://www.cs.fsu.edu/~lifeifei/pirs/
Conclusion. • A space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data; • It can be generalized to handle selection queries and sliding-window semantics; • How about more complicated queries?
Thanks! • Questions
Problem and Goals. • Assumption: • The client and the DSMS observe the same stream • Problem: • The client needs to verify the results • Goals: • Be memory- and update-efficient • Tolerate a limited number of errors • Tolerate small errors • Support multiple queries
Related Techniques to PIRS. • Incremental cryptography • Supports block operations (insert, delete) but not arithmetic operations • Program verification • The server may pass the program-execution check but simply return random outputs • Fingerprinting techniques • PIRS is a fingerprinting technique
PIRS±γ: An Approximate Solution. Theorem: PIRS±γ 1. raises no alarm with probability at least 1 − δ on any answer with at most γ⁻ erroneous groups; 2. raises an alarm with probability at least 1 − δ on any answer with at least γ⁺ erroneous groups. This holds for any c > −ln ln 2 ≈ 0.367. The proof uses the intuition of the coupon collector problem and the Chernoff bound.
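PIRS±γ: An Approximate Solution. [Figure: log(1/δ) layers, each with k buckets, each bucket holding its own PIRS; group i is assigned to a bucket, e.g. bi = 2 for vi. A layer raises an alarm if all k of its buckets raise alarms; the synopsis raises an alarm if a majority of layers do.]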
Information Disclosure on Multiple Attacks. PIRS: X(V) depends on a secret random seed r. Insight: from each failed attack it is notified of, the server could potentially rule out a δ portion of the seeds; it learns nothing else about r.
Information Disclosure on Multiple Attacks. Theorem: For a total of k attacks made by Bob on PIRS, the probability that none of them succeeds is at least 1 − kδ.