Proof-Infused Streams: Authenticating Sliding Window Queries on Data Streams. Feifei Li, Florida State University; Ke Yi, Hong Kong University of Science & Technology; Marios Hadjieleftheriou, AT&T Labs Research; George Kollios, Boston University.
Outsourced stream model: stock trading monitoring • Provider: a stock broker • Servers (e.g. Bloomberg) • Clients register queries Q: sliding window queries and/or one-shot queries
Data Publishing Model [HIM02] • Owner: publishes data • Servers: host (or monitor) the data and provide query services • Clients: query the owner's data through servers [HIM02] H. Hacigumus, B. R. Iyer, and S. Mehrotra, ICDE 2002
Information Security Issues • The third-party (server) cannot be trusted • Lazy server • Malicious intent • Compromised equipment • Unintentional errors (e.g. bugs)
Problem 1: Injection Select * from T where 5<A<11 client owner Returns 7, 8, 9 server
Problem 2: Drop Select * from T where 5<A<11 client owner Returns 7 9 ri+1 server
Query Authentication: Goals • Query Correctness: results do exist in the owner's database • Query Completeness: no records have been omitted from the result
General Approach Authenticated Structures Verification Object (VO) Query results clients servers owner
Sliding Window Query • Tuple-based window: SELECT SUM(stock_price) FROM Stock_trace WHERE stock_name = A in last 100 Trades SLIDES every 1 trade • Time-based window: SELECT SUM(stock_price) FROM Stock_trace WHERE stock_name = A in last 5 Minutes SLIDES every 1 minute • Stream: … 2, A 2, B 4, A 9, C 5, A 8, A 7, C 7, B, followed by 2, D (figure labels: x_{t-n}, x_{t-n+1}, x_t, x_{t+1}; the window holds the recent n tuples) • This talk concentrates on tuple-based windows; the generalization to time-based windows is in the paper. For a tuple-based window, the timestamp is simply the arrival id of the tuple.
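The tuple-based SUM query on this slide can be sketched in a few lines. This is only an illustration of the window semantics, not of the authentication scheme: the function name `sliding_sum`, the (price, name) tuple layout, and the arrival order of the example trades are assumptions read off the slide's figure.

```python
from collections import deque

def sliding_sum(stream, name, n):
    """After each arrival, yield SUM(price) over tuples with stock == name
    among the n most recent tuples (tuple-based window, slide = 1 tuple)."""
    window = deque()
    total = 0
    for price, stock in stream:
        window.append((price, stock))
        if stock == name:
            total += price
        if len(window) > n:
            # The oldest tuple falls out of the window.
            old_price, old_stock = window.popleft()
            if old_stock == name:
                total -= old_price
        yield total

# Example stream from the slide (arrival order assumed), window of n = 4 trades.
trades = [(2, "A"), (2, "B"), (4, "A"), (9, "C"),
          (5, "A"), (8, "A"), (7, "C"), (7, "B")]
sums = list(sliding_sum(trades, "A", 4))
print(sums)
```

Each arrival changes the answer by at most one added and one removed tuple, which is the incremental behaviour the authenticated structures later in the talk have to support.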
One Shot Query • Tuple-based window over the recent n tuples (x_{t-n} … x_t): SELECT SUM(stock_price) FROM Stock_trace WHERE stock_name = A in last 100 Trades • Stream: … 2, A 2, B 4, A 9, C 5, A 8, A 7, C 7, B
Merkle Hash Tree [M89]: Amortizing Signature Cost • Collision-resistant hash function: any change in the tree will lead to a different hash value for the root • Digital signature of the root: no one except the owner could produce the signature • A single signature signs many messages • The hash function is publicly known • Figure: binary hash tree over messages m1 … m8 with leaf hashes h1 … h8, internal nodes such as h12 = H(h1|h2), h34, h56, h78, h1..4, h5..8, and root h1..8; the owner computes σ = Sign(h1..8, SK) and a verifier checks Ver(h1..8, σ, PK) = valid? [M89] R. C. Merkle, CRYPTO 1989
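A minimal sketch of the root computation described on this slide, assuming a binary tree over a power-of-two number of messages. SHA-256 is swapped in for the hash function, and the helper names `H` and `merkle_root` are illustrative. Signing the root with the owner's secret key is left out.

```python
import hashlib

def H(data: bytes) -> bytes:
    # Publicly known collision-resistant hash (SHA-256 chosen here as an assumption).
    return hashlib.sha256(data).digest()

def merkle_root(messages):
    """Hash each message, then repeatedly hash adjacent pairs: h12 = H(h1 | h2).
    Assumes len(messages) is a power of two."""
    level = [H(m) for m in messages]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

msgs = [f"m{i}".encode() for i in range(1, 9)]   # m1 ... m8
root = merkle_root(msgs)                          # this is what the owner would sign

tampered = list(msgs)
tampered[4] = b"m5-changed"
print(root.hex() != merkle_root(tampered).hex())  # any change alters the root
```

One signature on `root` then covers all eight messages, which is the amortization the slide title refers to.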
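Extends to Range Query: f=2 (f is the fanout) • Select * from T where 5<A<11 • Leaves (sorted): 1 2 3 4 5 6 9 12, with leaf hashes h1 … h8, internal nodes h12, h34, h56, h78, h1..4, h5..8, and root h1..8 signed as Sign(h1..8, SK) • The query q covers 6 and 9; LB(q) = 5 and RB(q) = 12 are the left and right boundary values • VO: 5, 12, h1..4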
Client Side Verification • Select * from T where 5<A<11 • Query results: 6, 9 • VO: 5, 12, h1..4 (the subtree under h1..4 is unknown to the client) • Reconstruct the query subtree: from 5, 6, 9, 12 compute h5 … h8, then h56, h78, h5..8, and finally h1..8 from h1..4 and h5..8 • Check Ver(h1..8, PK, σ) = valid?
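The range-query example from these two slides can be replayed as a sketch. The assumptions are a binary tree over a power-of-two number of sorted integer leaves, SHA-256 as the hash, and a root the client has already checked against the owner's signature. All function names are illustrative, and the case where the range runs past the first or last leaf is not handled.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf(value: int) -> bytes:
    return H(str(value).encode())

def build_levels(values):
    """levels[0] = leaf hashes, levels[-1] = [root]."""
    levels = [[leaf(v) for v in values]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def range_vo(levels, l, r):
    """Server side: sibling hashes needed to rebuild the root from leaves l..r."""
    vo = []
    for level in levels[:-1]:
        left = level[l - 1] if l % 2 == 1 else None
        right = level[r + 1] if r % 2 == 0 else None
        vo.append((left, right))
        l, r = l // 2, r // 2
    return vo

def root_from_run(values, vo):
    """Client side: rebuild the root from a contiguous run of leaf values."""
    run = [leaf(v) for v in values]
    for left, right in vo:
        if left is not None:
            run.insert(0, left)
        if right is not None:
            run.append(right)
        if len(run) % 2:          # shape does not match the VO: reject
            return None
        run = [H(run[i] + run[i + 1]) for i in range(0, len(run), 2)]
    return run[0] if len(run) == 1 else None

def verify_range(values, vo, trusted_root, lo, hi):
    """values = [LB, results..., RB]; checks lo < results < hi and boundaries outside."""
    if len(values) < 2:
        return False
    sorted_ok = all(a <= b for a, b in zip(values, values[1:]))
    bounds_ok = values[0] <= lo and values[-1] >= hi
    inner_ok = all(lo < v < hi for v in values[1:-1])
    return sorted_ok and bounds_ok and inner_ok and root_from_run(values, vo) == trusted_root

levels = build_levels([1, 2, 3, 4, 5, 6, 9, 12])
root = levels[-1][0]                 # assumed already verified with Ver(root, PK, sigma)
vo = range_vo(levels, 4, 7)          # leaves 5, 6, 9, 12; VO holds only h1..4
print(verify_range([5, 6, 9, 12], vo, root, 5, 11))   # honest answer
print(verify_range([5, 6, 12], vo, root, 5, 11))      # record 9 dropped
```

The boundary values 5 and 12 are what lets the client detect a dropped record: without them the server could return a shorter but internally consistent run.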
Solution Overview • Sign every tuple (with query attribute(s) and timestamp) • Expensive update cost for the data provider • Expensive communication cost between server and clients, as the VO size is large • But it provides timely answers on a per-tuple basis • Amortize the signing cost by “proof-infusing” on a group of tuples: • A delayed response can often be tolerated. • A query with d query attributes is a query in d+1 dimensions. • N: maximum window size; n: window size for a particular query; b: the delay
Tumbling Merkle Tree (TM-tree) • One Merkle binary search tree for every b tuples along the time axis • Each tree is signed as Sign(hroot|t1|tb) • ti: timestamp of the ith tuple
TM-tree Continues • Sort each group of b tuples by A (the query attribute) • Build the Merkle tree over the sorted group • Figure axes: query attribute A vs. time
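A rough sketch of the tumbling construction: collect b tuples, sort them by the query attribute A, build a Merkle tree, and bind the root to the first and last timestamps of the group. The byte encoding of tuples and of hroot|t1|tb, the power-of-two b, and the names below are assumptions. The actual Sign(…, SK) call is omitted, so the sketch only produces the message the provider would sign.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(hashes):
    # Assumes a power-of-two number of leaf hashes.
    level = list(hashes)
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def tumbling_trees(stream, b):
    """stream yields (timestamp, a). Emit one summary per full group of b tuples."""
    group = []
    for t, a in stream:
        group.append((t, a))
        if len(group) == b:
            by_a = sorted(group, key=lambda ta: ta[1])           # sort by A
            root = merkle_root(H(f"{a}|{t}".encode()) for t, a in by_a)
            t1, tb = group[0][0], group[-1][0]                   # boundary timestamps
            yield {
                "t1": t1,
                "tb": tb,
                "sorted": by_a,
                "to_sign": root + f"|{t1}|{tb}".encode(),        # hroot | t1 | tb
            }
            group = []

stream = [(1, 9), (2, 4), (3, 7), (4, 2), (5, 5), (6, 8), (7, 1), (8, 3), (9, 6)]
trees = list(tumbling_trees(stream, b=4))
print([(tr["t1"], tr["tb"]) for tr in trees])
```

The ninth tuple stays buffered until its group fills, which is the delay of up to b tuples that the solution overview trades for a single signature per group.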
Sliding window query on the TM-tree • 1. Initialization: query n/b trees • 2. Window slides • 3. Incremental update: query four boundary trees • Figure labels: tuples to be added to results; tuples to be removed from results
Query the TM-tree • Figure (value vs. time): the query rectangle Q before and after it shifts by b; regions are labeled "sent to clients", "removed from results", "added to results", and "false positives"
Correctness and Completeness • Correctness: • Guaranteed by each individual Merkle tree • Completeness: • Completeness in each small Merkle tree is guaranteed by what we have studied in the first part of this talk • Overall completeness: • Check that the results returned are obtained by querying consecutive trees that fall within the query range on time dimension and they completely cover the query range on time dimension. • This is possible as two boundary tuples’ timestamps have been signed in each tree (hence these timestamps have to be included in the VO by the server).
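The overall completeness check can be sketched for tuple-based windows, where timestamps are arrival ids and consecutive trees therefore have adjacent boundary timestamps. The function name and the (t1, tb) pair representation are assumptions; the pairs are taken to come from VOs whose signatures have already been verified.

```python
def covers_window(tree_spans, start, end):
    """tree_spans: signed (t1, tb) boundary timestamps of the trees in the VO.
    Returns True if the trees are consecutive and cover [start, end] in time."""
    spans = sorted(tree_spans)
    if not spans:
        return False
    if spans[0][0] > start or spans[-1][1] < end:
        return False                      # window sticks out past the returned trees
    # Consecutive trees: the next tree starts right after the previous one ends.
    return all(nxt[0] == cur[1] + 1 for cur, nxt in zip(spans, spans[1:]))

print(covers_window([(1, 4), (5, 8), (9, 12)], 3, 10))   # complete cover
print(covers_window([(1, 4), (9, 12)], 3, 10))           # middle tree withheld
```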
Limitation of TM-tree • Only supports one dimensional query • False positives lead to large VO size, especially when each tuple has non-trivial size.
Merkle kd tree (Mkd-tree) • To get rid of false positives, we need a multi-dimensional indexing structure • KD-tree: an excellent candidate, with bounded query cost and efficient bulk-loading • A space-partitioning structure: partition along each dimension in turn.
Mkd-tree and TMkd-tree • Incorporating the Merkle tree into the KD-tree: • Leaf node: H(p), where p is the point contained in this node • Index node u with children v, w and dividing line lu: H(hv|hw|lu) • Tumbling Merkle kd-tree (TMkd-tree) • Same idea as the TM-tree, but each small tree is an Mkd-tree. • Boundary trees no longer introduce false positives!
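The two hashing rules above can be sketched on a small kd-tree built by median splits. The dictionary node layout, the split at the median point, the byte encoding of the dividing line, and SHA-256 are all assumptions made for the illustration.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_mkd(points, depth=0):
    """Build a kd-tree over `points` (tuples) and attach Merkle hashes.
    Leaf: H(p). Index node u with children v, w and dividing line l_u: H(h_v | h_w | l_u)."""
    if len(points) == 1:
        p = points[0]
        return {"hash": H(repr(p).encode()), "point": p}
    axis = depth % len(points[0])                 # partition along each dimension in turn
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    line = (axis, pts[mid][axis])                 # dividing line l_u
    v = build_mkd(pts[:mid], depth + 1)
    w = build_mkd(pts[mid:], depth + 1)
    h = H(v["hash"] + w["hash"] + f"{line[0]}:{line[1]}".encode())
    return {"hash": h, "line": line, "left": v, "right": w}

# Points are (timestamp, value) pairs, i.e. a 1-attribute query in 2 dimensions.
pts = [(1, 5), (2, 9), (3, 2), (4, 7)]
tree = build_mkd(pts)
print(tree["line"], tree["hash"].hex()[:16])
```

Including the dividing line in the hash lets a verifier confirm that a pruned subtree really lies outside the query rectangle, which is why boundary trees no longer need to return false positives.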
Is this good enough? • Tumbling trees are good for maintaining the update to sliding window queries • They both have space linear in N and log b update cost • But they are expensive for answering one-shot queries (or the initialization of sliding window queries) • A query with window size n has to query n/b trees: linear in n, which could be expensive for large values of n.
Dyadic Merkle kd-tree (DMkd-tree): 1D queries • Figure: dyadic levels of trees covering b, 2b, 4b, …, N+b tuples; legend: Merkle tree, Mkd-tree, discarded; Q marks the query
Exponential Merkle kd-tree (EMkd-tree): Multi-dimensional queries • Figure: trees T0, T1, …, Tl covering b, 2b, 4b, … tuples, with counterparts T'0, T'1, …, T'l; "new" marks the newest T0; legend: materialized Mkd-tree, non-materialized Mkd-tree; Q marks the query
Some Experiments • We use real streams: • World Cup Data (WC) • IP traces from the AT&T network (IP) • We perform the following query: • WC: Query attribute is the response size • IP: Query attribute is the packet size • Hardware: • 2.8GHz Intel Pentium 4 CPU • Linux Machine
Tumbling trees: update cost 1. b=1,000 is a sweet spot 2. This delay is small: in real streams it spans less than one or two seconds
Tumbling trees: size Both have size linear in the number of tuples covered by the maximum window size N
Query cost per sliding period, b=1,000; sliding period fixed at b • The linear scan of the TM-tree at the leaf level has good locality, which greatly improves its performance
VO size per sliding period, b=1,000; sliding period fixed at b • TM-tree incurs roughly 4γb false positives
Summary • All trees support aggregations • TM-tree and DMkd-tree support only one-dimensional queries • TMkd-tree and EMkd-tree support multi-dimensional queries • Tumbling trees are good for maintaining updates to sliding window queries, while DMkd-tree and EMkd-tree are good for one-shot queries.
Thanks! • Questions
Intuition on Authenticating Aggregation Query • Naïve solution: answer query q as a range selection query; the authentication cost is linear in k (k tuples in the range)! • Find the canonical cover of q: authentication cost log k!
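A minimal sketch of a canonical cover, assuming leaves are indexed 0, 1, 2, … in a complete binary tree so that every internal node spans an aligned power-of-two block. The function name and the half-open interval convention are illustrative. If each node also stored the aggregate of its block, the server would return one authenticated node per interval instead of k individual tuples.

```python
def dyadic_cover(lo, hi):
    """Cover the leaf range [lo, hi) with maximal aligned blocks [j*2^i, (j+1)*2^i)."""
    cover = []
    while lo < hi:
        # Largest power of two that keeps the block aligned at lo ...
        size = (lo & -lo) if lo else 1 << hi.bit_length()
        # ... shrunk until the block also fits inside the remaining range.
        while size > hi - lo:
            size >>= 1
        cover.append((lo, lo + size))
        lo += size
    return cover

print(dyadic_cover(3, 11))
print(len(dyadic_cover(1, 1001)), "nodes cover a range of k = 1000 leaves")
```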
Public key digital signature schemes • KeyGen → (SK, PK) • Sender: σ = Sign(m, SK); sends (m, σ) over an insecure channel • Recipient: Ver(m, PK, σ) = valid?
Merkle Tree: Verifying A Single Value • SELECT Airline FROM Flights WHERE price = $600 • Apply the Merkle tree to database authentication [DGMS03] • Leaves: t1 = $250, t2 = $320, t3 = $410, t4 = $600, with leaf hashes h1 … h4, internal nodes h12, h34, and root hroot • Query result: t4 • Verification Object: h3, h12 (sibling hash values along the query path) • Check Ver(hroot, σ, PK) = valid? [DGMS03] P. Devanbu, M. Gertz, C. Martel, and S. G. Stubblebine, Journal of Computer Security, 2003
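The single-value check on this slide can be sketched as follows, with SHA-256 as the hash and the four prices from the figure as leaves. The helper names and the (hash, sibling-is-left) VO encoding are assumptions, and the root is taken as already verified against the owner's signature.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """levels[0] = leaf hashes, levels[-1] = [root]; assumes a power-of-two leaf count."""
    levels = [[H(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def make_vo(levels, index):
    """Server side: sibling hashes along the path from leaf `index` to the root."""
    vo = []
    for level in levels[:-1]:
        sibling = index ^ 1
        vo.append((level[sibling], sibling < index))   # (hash, sibling is on the left)
        index //= 2
    return vo

def verify(value, vo, trusted_root):
    """Client side: fold the siblings back in and compare with the signed root."""
    h = H(value)
    for sibling, sibling_is_left in vo:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h == trusted_root

prices = [b"250", b"320", b"410", b"600"]       # t1 ... t4
levels = build_levels(prices)
root = levels[-1][0]
vo = make_vo(levels, 3)                          # t4: siblings are h3, then h12
print(verify(b"600", vo, root), verify(b"601", vo, root))
```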
Reduce S/C communication cost [MNT04] • Aggregation signature: Condensed RSA • Messages m1 … mk carry signatures σ1 … σk; the server sends a single σ = combine(σ1, …, σk) • Overhead: computation cost of modular multiplication with a large modulus, close to 100 μs [MNT04] E. Mykletun, M. Narasimha, and G. Tsudik, NDSS 2004
Condensed RSA [MNT04] • KeyGen: • Choose two large primes $p$ and $q$, $p \neq q$ • Set $n = pq$ • Compute $\varphi(n) = (p-1)(q-1)$ • Choose $e$ s.t. $1 < e < \varphi(n)$ and $e$ is coprime to $\varphi(n)$ • Compute $d$ s.t. $de \equiv 1 \pmod{\varphi(n)}$ • $(d, n)$ is the secret key and $(e, n)$ is the public key
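Condensed RSA [MNT04] • Sign: • Given $m_i$, compute $h_i = H(m_i)$ • Compute $\sigma_i = h_i^{d} \bmod n$ • Compute $\sigma = \prod_{i=1}^{k} \sigma_i \bmod n$ • Verify: • Given $m_i$, compute $h_i = H(m_i)$ • Check that $\sigma^{e} \equiv \prod_{i=1}^{k} h_i \pmod{n}$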
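A toy walk-through of the KeyGen, Sign, and Verify steps, for illustration only. The two Mersenne primes are far too small for real use, e = 65537 is an arbitrary choice, the hash is SHA-256 reduced modulo n, and no padding is applied; all of these are assumptions. Requires Python 3.8+ for `pow(e, -1, phi)` and `math.prod`.

```python
import hashlib
from math import gcd, prod

# KeyGen with toy primes (assumption: 2^31 - 1 and 2^61 - 1, both Mersenne primes).
p, q = 2**31 - 1, 2**61 - 1
n = p * q
phi = (p - 1) * (q - 1)
e = 65537
assert 1 < e < phi and gcd(e, phi) == 1
d = pow(e, -1, phi)                      # d * e ≡ 1 (mod phi(n))

def h(m: bytes) -> int:
    # h_i = H(m_i), mapped into Z_n (toy encoding, no padding).
    return int.from_bytes(hashlib.sha256(m).digest(), "big") % n

def sign(m: bytes) -> int:
    return pow(h(m), d, n)               # sigma_i = h_i^d mod n

def condense(sigs) -> int:
    return prod(sigs) % n                # sigma = product of sigma_i mod n

def verify(msgs, sigma: int) -> bool:
    # sigma^e ≡ product of h_i (mod n)
    return pow(sigma, e, n) == prod(h(m) for m in msgs) % n

msgs = [b"7", b"8", b"9"]
sigma = condense(sign(m) for m in msgs)  # one value sent instead of three signatures
print(verify(msgs, sigma), verify([b"7", b"9"], sigma))
```

The client receives one condensed value regardless of how many tuples are in the result, at the price of one modular multiplication per tuple on each side.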
Tool 1: Collision-Resistant Hash Functions • Example SHA1: variable input size → 20 bytes (can also plug in any newer replacement) • Hard to find a collision, i.e. x1 ≠ x2 with H(x1) = H(x2) • Observations: • Computation cost: 2-3 μs (for up to 500 bytes input) • Storage cost: 20 bytes
Tool 2: Public Key Digital Signature Schemes • Formally defined by [GMR88] • The message has not been changed in any way • The message is indeed from the sender (corresponding to the public key) • No one except the secret key owner could produce a signature • One such scheme: RSA [RSA78] • Observations • Computation cost: about 3-4 ms for signing and more than 100 μs for verifying • Storage cost: 128 bytes [GMR88] S. Goldwasser, S. Micali, and R. Rivest, SIAM Journal on Computing, 1988 [RSA78] R. Rivest, A. Shamir, and L. Adleman, Commun. ACM, 1978
Problem 3: Omission Select * from T where 5<A<11 client owner Returns 7,9 Update server
Roadmap • Solution overview • Efficient authentication of sliding window queries when the window slides • Efficient authentication of one-shot queries (also the sliding window query initialization) • Experiment • Conclusion
Query cost per sliding period, b=1,000; query selectivity fixed at 0.1 • Query up to 2/b+2 boundary trees
VO size per sliding period, b=1,000; query selectivity fixed at 0.1 • TM-tree incurs roughly (2/b+2)b false positives