Trustworthy Keyword Search for Regulatory Compliant Record Retention
Soumyadeb Mitra (University of Illinois / IBM Almaden), Windsor W. Hsu (IBM Almaden), Marianne Winslett (University of Illinois)
Spending on eDiscovery is growing at 65% CAGR
• The average F500 company has 125 non-frivolous lawsuits pending at any given time
• IDC forecasts 60B business emails annually by 2006
• Drivers: digital information explosion (email, instant messaging, files), corporate misconduct, soaring discovery costs, and a focus on compliance (e.g., HIPAA)
• There is a need for trustworthy record keeping
Sources: IDC, Network World (2003), Socha/Gelbmann (2004)
What is trustworthy record keeping?
• Establish solid proof of events that have occurred
• Timeline: Alice commits a record to a storage device; she later regrets it and, as an adversary, tries to destroy it; when Bob queries, he should still get back Alice’s data
This leads to a unique threat model
• Commit is trustworthy: the record is created properly
• Query is trustworthy: the record is queried properly
• In between, the adversary has super-user privileges: access to the storage device and to any keys
• The adversary could be Alice herself
Traditional schemes do not work
• We cannot rely on Alice’s signature: the adversary may be Alice herself, with access to her keys
WORM (Write Once Read Many) storage helps address the problem
• Once a record is written, it cannot be overwritten or deleted
• So the adversary cannot delete Alice’s record
An index is required due to the high volume of records
• On commit, the record is stored and the index is updated; later queries are answered from the index
In effect, records can be hidden or altered by modifying the index
• Hide record B from the index, or replace B with B′
• The index must also be secured (“fossilized”)
Most business records are unstructured and are searched via an inverted index
• One WORM file for each posting list
• Example posting lists: Query → 1, 3, 11, 17; Data → 3, 9; Base → 3, 19; Worm → 7, 36; Index → 3
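The layout above can be sketched as a toy inverted index. The class and method names (`InvertedIndex`, `add_document`, `search`) are illustrative, and an in-memory dict stands in for the one-WORM-file-per-posting-list layout:

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: one append-only posting list per keyword."""
    def __init__(self):
        self.postings = defaultdict(list)  # keyword -> doc IDs, in arrival order

    def add_document(self, doc_id, text):
        # Document IDs arrive in increasing order, so each list stays sorted.
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def search(self, keyword):
        return self.postings.get(keyword.lower(), [])
```

On WORM storage, each `self.postings[word]` would be an append-only file; appends are allowed, but earlier entries can never be rewritten.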
The index must be updated as new documents arrive
• Each new document (e.g., Doc 79 containing “Query”, “Data”, “Index”) appends its ID to one posting list per keyword
• 500 keywords = 500 disk seeks, i.e., ~1 sec per document
Amortize the cost by updating in batch
• Buffer the postings of many documents (e.g., Docs 79–83), then flush: 1 seek per keyword in the batch
• A large buffer is needed to benefit infrequent terms
• Over 100,000 documents must be buffered to achieve 2 docs/sec
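The batching idea can be sketched as a small write buffer. The names (`BatchedIndexWriter`, `appender`) and the flush-threshold parameter are hypothetical; the point is that a flush groups all buffered postings by keyword, so each keyword in the batch costs one sequential append instead of one seek per document:

```python
from collections import defaultdict

class BatchedIndexWriter:
    """Buffer postings in memory; flush them grouped by keyword."""
    def __init__(self, flush_every, appender):
        self.flush_every = flush_every
        self.appender = appender          # callback: (keyword, [doc_ids])
        self.buffer = defaultdict(list)
        self.pending_docs = 0

    def add_document(self, doc_id, keywords):
        for w in set(keywords):
            self.buffer[w].append(doc_id)
        self.pending_docs += 1
        if self.pending_docs >= self.flush_every:
            self.flush()

    def flush(self):
        for w, ids in self.buffer.items():
            self.appender(w, ids)         # one append (one seek) per keyword
        self.buffer.clear()
        self.pending_docs = 0
```

The tradeoff the next slide raises follows directly: until `flush()` runs, the buffered postings are not yet on WORM.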
But with buffering, the index is not updated immediately
• While postings sit in the buffer, the adversary can alter or omit records from the index
• Prevailing practice: email must be committed before it is delivered
Can the storage server cache help?
• Storage servers have a huge cache
• Data committed into the cache is effectively on disk: the cache is battery-backed and inside the WORM box, so it is trustworthy
Caching works in blocks
• Frequent terms (e.g., “Query” in Docs 79 and 80) hit the cache after the first access; infrequent terms (e.g., “Base”, “Index”) miss
• Caching does not benefit infrequent terms
Simulation results show caching alone is not enough
• But what if the number of posting lists ≤ the number of cache blocks?
• Then every update would hit the cache
So, merge posting lists so that the tail blocks fit in cache
• Document IDs from several keywords share one merged list; each posting carries a short keyword encoding (00, 01, 10, …) identifying its keyword
• Result: only 1 random I/O per document, for a 4K block size
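A merged list with keyword encodings can be sketched as follows. The class name `MergedPostingList` and the use of small integers as codes are illustrative; a lookup scans the shared list and filters on the code:

```python
class MergedPostingList:
    """Several keywords share one append-only list; each posting
    carries a small per-keyword code."""
    def __init__(self, keywords):
        self.code = {w: i for i, w in enumerate(keywords)}  # keyword -> code
        self.entries = []                                   # (code, doc_id)

    def add(self, keyword, doc_id):
        self.entries.append((self.code[keyword], doc_id))

    def lookup(self, keyword):
        c = self.code[keyword]
        return [d for k, d in self.entries if k == c]
```

Because all the keywords append to one list, only its tail block is hot, which is what lets the tails of all merged lists fit in cache.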
The tradeoff is longer lists to scan during lookup
• Let tw = length of the posting list for keyword w, and qw = # of times w is queried in the workload
• Workload lookup cost before merging: ∑w tw·qw
• After merging into A = {A1, …, An}: ∑A∈A (∑w∈A tw)·(∑w∈A qw), where ∑w∈A tw is the length of merged list A and ∑w∈A qw is the # of times A is searched
• Example: the query “VLDB 2006” is now answered by scanning both (longer) merged posting lists
Which lists to merge? We need a heuristic solution
• Choose A = {A1, A2, …, An}, where n = # of cache blocks
• Minimize ∑A (∑w∈A tw)·(∑w∈A qw)
• The problem is NP-complete (reduction from minimum-sum-of-squares)
• So, try some merging heuristics on a real-world workload: 1 million documents from IBM’s intranet, 300,000 queries
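The objective being minimized can be written as a one-line cost function, which a heuristic would evaluate for candidate partitions. The list lengths and query counts below are made-up numbers, not the IBM workload:

```python
def merge_cost(groups, t, q):
    """Workload lookup cost for a partition of keywords into merged lists:
    for each group A, its scan length (sum of t_w) times how often
    it is searched (sum of q_w)."""
    return sum(
        sum(t[w] for w in A) * sum(q[w] for w in A)
        for A in groups
    )

# Hypothetical workload statistics.
t = {"query": 17, "data": 9, "base": 19, "worm": 36, "index": 3}  # list lengths
q = {"query": 50, "data": 20, "base": 5, "worm": 2, "index": 1}   # query counts

unmerged = [{w} for w in t]                                # one list per keyword
merged = [{"query"}, {"data", "index"}, {"base", "worm"}]  # 3 cache blocks
```

Merging never lowers this cost (it only adds cross terms), which is why the heuristics keep high-contributor terms in lists of their own.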
A few terms contribute most of the query workload cost (tw·qw)
Different merging heuristics were tried
• Separate lists for high-contributor terms
• Merging heuristics based on qw and tw, as well as random merging
• Details of the heuristics and their evaluation are in the paper
Additional index support is needed to answer conjunctive queries quickly
• Example: “VLDB and 2006” requires intersecting the posting lists for “VLDB” (length m) and “2006” (length n)
• Merge join: O(m + n); index join: O(m log n)
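The two join strategies can be sketched over sorted posting lists. The function names are illustrative, and `bisect` stands in for an index probe in the `index_join` variant:

```python
import bisect

def merge_join(a, b):
    """Intersect two sorted posting lists by scanning both: O(m + n)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def index_join(short, long_sorted):
    """Probe the longer list once per element of the shorter one:
    O(m log n), with binary search standing in for an index lookup."""
    out = []
    for x in short:
        k = bisect.bisect_left(long_sorted, x)
        if k < len(long_sorted) and long_sorted[k] == x:
            out.append(x)
    return out
```

The index join wins when one list is much shorter than the other; that is exactly the case that motivates keeping a lookup structure (B+-tree or jump index) over each posting list.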
How to maintain B+-trees on WORM
• B+-trees normally require node splits and joins
• A B+-tree on a posting list is a special case: document IDs are inserted in increasing order
• So it can be built bottom-up, without splits/joins (please refer to our paper)
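Because the IDs arrive sorted, a bottom-up build reduces to packing fixed-fanout nodes level by level; no node is ever split. This is a simplified sketch (nested lists stand in for on-disk nodes, and using each child's first key as the separator is one common convention), not the paper's construction:

```python
def build_bplus_bottom_up(sorted_ids, fanout):
    """Build B+-tree levels bottom-up from already-sorted doc IDs.
    Returns the levels, leaves first; tree[-1][0] is the root node."""
    level = [sorted_ids[i:i + fanout] for i in range(0, len(sorted_ids), fanout)]
    tree = [level]                       # leaf level
    while len(level) > 1:
        keys = [node[0] for node in level]   # separator = first key of each child
        level = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
        tree.append(level)
    return tree
```

Each node is written exactly once, after all of its contents are known, which is what makes the construction compatible with WORM.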
A B+-tree index is insecure, even on WORM
• The path to an element depends on elements inserted later, so an adversary can attack it
Our solution is jump indexes
• The path to an element depends only on elements inserted before it, so the jump index is provably trustworthy
• Leverages the fact that document IDs are increasing
• O(log N) lookup, where N = # of documents; supports range queries too
• Reasonable performance compared to B+-trees for conjunctive queries, in experiments with a real workload (for details, see our paper)
Conclusions
• WORM storage by itself is not enough; we need a trustworthy index too
• Trustworthy inverted indexes can be built efficiently: 10–15% slowdown for non-conjunctive queries, and within 1.5x of optimal B-tree performance for conjunctive queries
• Other possible uses: indexing time
Future work
• Ranking attacks: an adversary can insert a lot of junk documents to skew rankings
• A secure generic index structure, in which the path to an element is fixed and range queries are supported
A new index structure is required: the jump index
• Each element n keeps an array of jump pointers
• The i-th pointer points to an element ni with n + 2^i ≤ ni < n + 2^(i+1)
• Example: a pointer to n+2 goes in slot 1, since n + 2^1 ≤ n+2 < n + 2^2
The jump index in action: inserting 7 after 1, 2, 5
• At 1: 1 + 2^2 ≤ 7 < 1 + 2^3, but pointer 2 is already set (to 5, since 1 + 2^2 ≤ 5 < 1 + 2^3), so follow it
• At 5: 5 + 2^1 ≤ 7 < 5 + 2^2, and pointer 1 is unset, so set it to 7
• (Earlier, inserting 2 set pointer 0 of 1, since 1 + 2^0 ≤ 2 < 1 + 2^1)
• Each element keeps log(N) pointers for N documents
Lookup(7): the path to an element does not depend on future elements
• Start at 1: 1 + 2^2 ≤ 7 < 1 + 2^3, so follow pointer 2 to 5
• At 5: 5 + 2^1 ≤ 7 < 5 + 2^2, so follow pointer 1; got 7
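The insert and lookup walks above can be sketched as a minimal in-memory jump index, assuming the simple power-of-2 pointer scheme of these slides (the p = 1, B = 2 case; the next slide's block layout is not modeled, and on real WORM storage the pointers would live in write-once blocks). Class and method names are illustrative:

```python
class JumpNode:
    """One element plus its write-once jump pointers: pointer i leads
    to the first later element x with value + 2**i <= x < value + 2**(i+1)."""
    def __init__(self, value, max_ptrs):
        self.value = value
        self.ptrs = [None] * max_ptrs

class JumpIndex:
    """Append-only jump index over strictly increasing document IDs."""
    def __init__(self, max_id):
        self.k = max(1, max_id.bit_length())  # log2(max_id) pointers per node
        self.head = None
        self.last = None

    def _slot(self, frm, to):
        # The i with frm + 2**i <= to < frm + 2**(i+1).
        return (to - frm).bit_length() - 1

    def insert(self, doc_id):
        node = JumpNode(doc_id, self.k)
        if self.head is None:
            self.head = self.last = node
            return
        assert doc_id > self.last.value, "IDs must be increasing"
        # Walk the future lookup path of doc_id; set the first unset
        # pointer along the way. Pointers are never overwritten.
        cur = self.head
        while cur is not None:
            i = self._slot(cur.value, doc_id)
            if cur.ptrs[i] is None:
                cur.ptrs[i] = node
                break
            cur = cur.ptrs[i]
        self.last = node

    def lookup(self, doc_id):
        cur = self.head
        while cur is not None and cur.value != doc_id:
            if doc_id < cur.value:
                return False          # overshot: doc_id was never inserted
            cur = cur.ptrs[self._slot(cur.value, doc_id)]
        return cur is not None
```

Because `insert` only ever fills a pointer that was previously empty, the path `lookup` follows is fixed at insertion time, matching the slide's claim that the path to an element does not depend on future elements.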
Jump index elements are stored in blocks
• Storing pointers with every element is inefficient, so p entries are grouped together, with branch factor B
• Pointer (i, j) from a block starting at l points to the block b′ containing the smallest x with l + j·B^i ≤ x < l + (j+1)·B^i
Jump index evaluation parameters
• p: # of elements grouped together; B: branching factor; L: block size, with p + (B−1)·logB N ≤ L
• Evaluation: index update performance (pointers have to be set) and query performance
Is this a real threat? Would someone want to delete a record a day after it is created?
• Intrusion detection logging: once an adversary gains control, he would like to delete the records of his initial attack, so a record may be regretted moments after creation
• Email best practice: a message must be committed before it is delivered