470 likes | 571 Views
Auditing Compliance with a Hippocratic Database Rakesh Agrawal Roberto Bayardo Christos Faloutsos Jerry Kiernan Ralf Rantzau Ramakrishnan Srikant Intelligent Information Systems Research IBM Almaden Research Center. Outline. Introduction and motivation Problem statement Foundations
E N D
Auditing Compliance with a Hippocratic DatabaseRakesh AgrawalRoberto BayardoChristos FaloutsosJerry KiernanRalf RantzauRamakrishnan SrikantIntelligent Information Systems ResearchIBM Almaden Research Center
Outline • Introduction and motivation • Problem statement • Foundations • System organization and algorithms • Performance • Summary
Motivation • Hippocratic databases advocate policy directed data management for privacy sensitive data • Need reinforced by legislations and regulations: • Health Insurance Portability & Accountability Act • Gramm-Leach Bliley Act – Consumer Privacy Rule • Goal • Build a system to assist with auditing compliance with the stated policy • Event driven - privacy complaint • Periodic - monitor exposure to privacy violation
Audit Scenario The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over the counter diabetes tests The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes Jane complains to the department of Health and Human Services saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes Jane has not been feeling well and decides to consult her doctor
Audit Expression Who has accessed Jane’s disease information? audit T.disease from Customer C, Treatment T where C.cid=T.pcid and C.name = ‘Jane’
Outline • Introduction and motivation • Problem statement • Foundations • System organization and algorithms • Performance • Summary
Problem Statement • Given • A log of queries executed over a database • An audit expression specifying sensitive data • Precisely identify • Those queries that accessed the data specified by the audit expression
“Suspicious” Queries A query Qi has accessed information contained in the Customer table The audit expression A specifies the data to the audited Customer table If query Qi accesses all the cells specified by the audit expression A for any row, Qi is suspicious
Issues • Convenient language • Audit expression (essentially SPJ query) • Fast and precise on audits • Non disruptive • Minimal performance impact on normal database operation • Fine grained
Assumptions • Disclosures stemming from multiple query executions is not considered • No use of outside knowledge to deduce information without detection • Queries considered include • Joins and aggregation, but not nested subqueries • Note that existential subqueries can be converted into joins [SIGMOD92]
Outline • Introduction and motivation • Problem statement • Foundations • System organization and algorithms • Performance • Summary
Informal Definitions • “Candidate” query • Logged query that accesses all columns specified by the audit expression • “Indispensable” tuple (for a query) • A tuple whose omission makes a difference to the result of a query • “Suspicious” query • A candidate query that shares an indispensable tuple with the audit expression
Definition 1 - A virtual tuple v cT is indispensable for an SPJ query Q if the result of Q changes when we delete v: Indispensable Tuple Output columns in Q Duplicate preserving projection operator Predicates in Q Tables common to Q and A The SPJ query Q and the audit expression A are of the form: Columns appearing anywhere in Q
“Candidate” Query Definition 6 - Q is a candidate query with respect to A if: Only candidate queries can be suspicous queries
Definition 5 - Maximal virtual tuple (MVT): A tuple v is a MVT for queries Q1 and Q2 if it belongs to the cross product of common tables in their from clauses Definition 7 - Q is suspicious with respect to A if they share an indispensable MVT v “Suspicious” Query For example, Query Q: Addresses of people with diabetes Audit A: Jane’s diagnosis Jane’s tuple is indispensable for both; hence query Q is “suspicious” with respect to A
Outline • Introduction and motivation • Problem statement • Foundations • System organization and algorithms • Performance • Summary
System Overview Query with purpose, recipient Audit expression IDs of log queries having accessed data specified by the audit query Updates, inserts, delete Generate audit query Database Layer Audit Static analysis Database triggers track updates to base tables Database Layer Data Tables Backlog Audit query Query Log
Static Analysis Query Log Audit expression Eliminates queries that could not possibly have violated the audit expression Insures that Accomplished by examining only the queries themselves (i.e., without running the queries) Filter Queries Candidate queries
Audit Query Generation • Goal • Build a query which, when run, returns the id’s of suspicious queries with respect to an audit expression A
Union T1 T2 Generating the Audit Query Audit Expression Replace each table with it’s backlog to restore the version of the table to the time of each query Combine the audit expression with individual candidate queries to identify suspicious queries Combine individual candidate queries and the audit expression into a single query graph Candidate Query 1 Candidate Query 2 Lines represent input/output relationships between operators QGM is a graphical representation of a query Boxes represent operators, such as select Boxes with no inputs are tables
Suspicious SPJ Query The candidate SPJ query Q and the audit expression A are of the form: Proof of correctness is based upon Definition 7 (suspicious query) and given in the paper QGM rewrites, shown in previous slide, transform Q and A into: Theorem 2 - A candidate SPJ query Q is suspicious with respect to an audit expression A if and only if:
Suspicious Aggregate Query (Including Having) • Solution in the paper
Example Jane’s audit
Audit Expression Who has accessed Jane’s disease information? audit T.disease from Customer C, Treatment T where C.cid=T.pcid and C.name = ‘Jane’
Query Log Query 1 was executed at time T3
Backlog Table (Time Stamp) Operation on a tuple among Insert, Update and Delete Jane’s record was inserted at time T2 and updated at time T4. The backlog table records both versions of her information Timestamp of the operation C. S. Jensen, L. Mark, and N. Roussopoulos [TKDE 1991] Attributes also in the source table Attributes only in the backlog table
Merge Logged Queries and Audit Expression Merge logged queries and audit expression into a single query graph T.s C.n, C.a, C.z Select := T.s=‘diabetes’ and T.p=C.c audit expression := T.p=C.c and C.n= ‘Jane’ C T T C p, r, …, t c, n, …, t Treatment Customer
Transform Query Graph into an Audit Query ‘Q1’ The audit expression now ranges over the logged query. If the logged query is suspicious, the audit query will output the id of the logged query audit expression := X.n= ‘Jane’ X C.n Select := T.s=‘diabetes’ and C.c=T.p C T View of Customer (Treatment) is a temporal view at the time of the query was executed p, r, ..., t c, n, …, t Treatment Customer
Scenario Outcome • The audit uncovers that Query 1 in the query log accessed Jane’s information
Outline • Introduction and motivation • Problem statement • Foundations • System organization and algorithms • Performance • Summary
Empirical Evaluation: Goals • Cost of maintaining backlog tables • Understand the impact of maintaining backlog tables on ongoing database operations • Cost of running audits • Understand whether audits can run in reasonable time
Experimental Setup • IBM M Pro 6868 Intellistation • 800 MHz Pentium III processor • 512 MB of memory • 16.9 GB disk drive • Windows 2000 Version 5, SP 4 • DB2 v7 with default settings • TPC-H database • Supplier table • 100,000 tuples
System Structures • Indexing • Eager indexing • Maintain an index over the backlog table • Maintained during ongoing database operations • Lazy indexing • No index over the backlog table • Create indices at the time of audit • Choice of index • Simple index • Primary key of source table • Composite index • Primary key of source table • Time stamp
Impact on Ongoing Operations • Queries • Additionally log the query string • Already performed in many application environments • Updates • For each updated tuple, • Insert a tuple to the backlog table • Inserts and deletes are handled similarly • In a majority of environments, queries are much more frequent than updates
Update Performance • 100,000 tuples in Supplier table • Update statement updates all tuples • Each update statement fires triggers which inserts an additional 100,000 tuples in backlog • Evaluate impact of multiple versions on performance
Overhead on Updates Simple wins over Composite Number of version of each tuple in the Supplier backlog table 7x if all tuples are updates 3x if a single tuple is updated Eager indexing doesn’t add much cost
Audit Query Performance Audit query: select ‘Q’ from Supplier where skey = k Experiment: Evaluate the impact of the number of versions of tuples in the backlog table on performance
Audit Query Execution Time Composite wins over simple if initial version is selected Simple wins over composite if the current version is selected
Takeaways • The composite index • Enhances the performance of audits, but • Additionally burdens updates when using eager indexing • The system supports • Efficient auditing • Without substantially burdening normal query processing
Related Work • Oracle Privacy Security Auditing • Facility for logging queries with timestamp • Flash-back queries • Restores the version of the data at the time of the query • No support for automated auditing • User manually selects queries from the log and runs them • The user to decide if the query is suspicious • G. Miklau D. Suciu [SIGMOD 2004] • Formal analysis of information disclosure in data exchange • Is information about a secret query S revealed by views V1,…,Vn • Considers all possible instances of a database schema • Assumes tuple independence • We’re interested in given instances (temporal versions) • Nonetheless, it will be interesting to explore the connection between the two works • Active enforcement of policies by limiting disclosure [VLDB’04] • Literature on multi-query optimization
Summary • In light of new privacy legislation • The problem of auditing usage of information represents an important opportunity for database research • Formalized the problem through the fundamental concepts of indispensable tuple and suspicious queries • Achieved our design goals:
Design Goals • Convenient language • Fast and precise on audits • Non disruptive • Minimal performance impact on normal database operation • Fine grained
Multiple Candidate Queries Union ‘Q1’ ‘Q2’ audit expression := C.n= ‘Jane’ audit expression := C.n= ‘Jane’
‘Q1’ select:= q1.c1=q2.c1 and … and q1.ci=q2.ci q1 q1 c1, …, ck c1, …, ck audit expression := … audit expression := … Aggregate Queries with Having The join on aggregate columns ensures that the group being tracked by the audit has not been eliminated by the having clause c1, …, ci Qh select:= … Qg c1, …, ci, agg1, …, aggn group:= c1, …, ci Qs
Dynamic Temporal Views View of Customer table at time t c = id n = name a = address h = phone z = zip o = contact t = marketing ts = ts op = opr C1 Time stamp of the logged query c, n, a, h, z, o, t Select :=ts <= t andop <> ‘delete’ andnot(C5) C5 * Exists :=C4.ts <= t andC3.c = C4.c andC4.ts > C3.ts C3 C4 c, n, a, h, z, o, t, ts, op Customer_backlog