This research aims to address the challenge of publishing XML documents without leaking sensitive data, even when users can infer information using common knowledge. The study proposes methods to define sensitive data, describe common knowledge, compute inferred documents, and prevent information leakage.
Secure XML Publishing without Information Leakage in the Presence of Data Inference
Xiaochun Yang, Northeastern University, Liaoning, China
Chen Li
Example: Hospital XML data
[Figure: XML tree rooted at hospital; four of its patient children are shown, each with pname, ward, and disease.]
• patient (1): pname Alice, ward W305, disease leukemia
• patient (2): pname Betty, ward W305, disease leukemia
• patient (3): pname Cathy, ward W305, disease leukemia
• ...
• patient (4): pname Tom, ward W403, disease cancer
Goal: hide Alice’s disease
Common knowledge: patients in the same ward have the same disease
Problem statement: How to publish an XML document without leaking sensitive data, even if public users can do inference using common knowledge?
Outline:
• Information Leakage
• Defining sensitive data
• Describing common knowledge
• Computing inferred documents
• Prevent information leakage
• Experiments
Defining sensitive data using XQuery
[Figure: regulating query A1, a tree pattern with a patient node, the value Alice, and a disease node marked * (S), shown next to the hospital tree with patients (1) Alice, (2) Betty, and (3) Cathy, each in ward W305 with disease leukemia.]
• Map the query to the XML tree
• For each mapping, the target of the * node is sensitive.
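The mapping step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' XQuery machinery: it assumes the hospital tree uses the element names `hospital`, `patient`, `pname`, `ward`, and `disease`, and it stands in for the regulating query with an `xml.etree.ElementTree` path expression. The helper name `sensitive_nodes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hospital data from the example slide (element names are assumed).
HOSPITAL_XML = """
<hospital>
  <patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Betty</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Cathy</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Tom</pname><ward>W403</ward><disease>cancer</disease></patient>
</hospital>
"""


def sensitive_nodes(root, name):
    """Return the disease elements of every patient whose pname equals `name`.

    The path plays the role of the regulating query: the patient/pname part
    is the pattern, and the trailing disease step is the node marked *.
    """
    return root.findall(f"./patient[pname='{name}']/disease")


root = ET.fromstring(HOSPITAL_XML)
hits = sensitive_nodes(root, "Alice")
print([(d.tag, d.text) for d in hits])
```

Each element returned is the target of the * node for one mapping of the query, so it is one sensitive node.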
Common Knowledge
• Represented as XML constraints
• Could be obtained in various ways, e.g.,
  • possible schema
  • analysis from the published data
Common Constraints
• Child constraints: //p → //p/c, e.g., //patient → //patient/pname
• Descendant constraints: //p → //p//d, e.g., //patient → //patient//disease
• Functional dependencies: //p/a → //p/b, e.g., //patient/ward → //patient/disease
  If w1 = w2, then d1 = d2 (value equal)
[Figure: a Patient node with a pname child; a Patient node with a disease descendant; two Patient nodes with ward values w1, w2 and disease values d1, d2.]
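The three constraint kinds can be written down as small data structures with a checker. The sketch below is an illustration only: the class names `Child`, `Descendant`, and `FD` and the function `holds` are hypothetical, and each constraint is restricted to a single element name on each side rather than a general path.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:          # //parent -> //parent/child
    parent: str
    child: str


@dataclass(frozen=True)
class Descendant:     # //ancestor -> //ancestor//desc
    ancestor: str
    desc: str


@dataclass(frozen=True)
class FD:             # //elem/lhs -> //elem/rhs (value equality)
    elem: str
    lhs: str
    rhs: str


def holds(root, c):
    """Check whether the document rooted at `root` satisfies constraint `c`."""
    if isinstance(c, Child):
        return all(e.find(c.child) is not None for e in root.iter(c.parent))
    if isinstance(c, Descendant):
        return all(e.find(f".//{c.desc}") is not None for e in root.iter(c.ancestor))
    if isinstance(c, FD):
        seen = {}  # lhs value -> rhs value observed so far
        for e in root.iter(c.elem):
            left, right = e.findtext(c.lhs), e.findtext(c.rhs)
            if left is None or right is None:
                continue  # the FD says nothing about incomplete elements
            if seen.setdefault(left, right) != right:
                return False
        return True
    raise TypeError(f"unknown constraint: {c!r}")


doc = ET.fromstring(
    "<hospital>"
    "<patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><pname>Betty</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "</hospital>"
)
print(holds(doc, Child("patient", "pname")))
print(holds(doc, Descendant("patient", "disease")))
print(holds(doc, FD("patient", "ward", "disease")))
```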
Modify partial document using constraints
[Figure: partial document with a hospital root and two patient nodes; the nodes shown include pname, ward (W305), and disease (leukemia). Applying C1 to P gives C1(P).]
C1: //patient → //patient/pname
Floating branch
[Figure: partial document with a hospital root and two patient nodes, both in ward W305, with disease nodes valued leukemia; one disease node hangs as a floating branch.]
Apply a sequence of constraints: <C2, C3>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
[Figure: the same partial document, with two patient nodes in ward W305 and disease nodes valued leukemia, after applying the constraints in the other order.]
Another sequence of constraints: <C3, C2>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
[Figure: two inferred documents side by side. P1: result of <C2, C3>. P2: result of <C3, C2>.]
They look different!
• They have the same amount of “information”
• Introduced a concept called “m-equivalence” (see the paper)
Theorem
• Given a partial document P and a set of constraints C, there is a document M that can be inferred from P using a sequence of constraints, such that M m-contains the inferred document of any constraint sequence.
• M is computable using a greedy approach.
• M is unique under m-equivalence.
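A greedy computation of this kind can be illustrated with a fixpoint loop: keep applying constraints until none of them adds anything. The sketch below is a heavily simplified stand-in, not the authors' algorithm. It assumes each patient is a flat record (a Python dict), models the descendant constraint C2 as "every patient has a disease field", uses a hypothetical `"?"` placeholder for a node that is known to exist but whose value is unknown, and does not model floating branches or m-equivalence.

```python
UNKNOWN = "?"  # assumed placeholder: node exists, value not known


def apply_child(patients, field):
    """Constraint 'every patient has `field`': add the field with unknown value."""
    changed = False
    for p in patients:
        if field not in p:
            p[field] = UNKNOWN
            changed = True
    return changed


def apply_fd(patients, lhs, rhs):
    """Functional dependency lhs -> rhs: copy known rhs values across equal lhs."""
    known = {p[lhs]: p[rhs] for p in patients
             if p.get(lhs, UNKNOWN) != UNKNOWN and p.get(rhs, UNKNOWN) != UNKNOWN}
    changed = False
    for p in patients:
        if p.get(lhs, UNKNOWN) in known and p.get(rhs, UNKNOWN) == UNKNOWN:
            p[rhs] = known[p[lhs]]
            changed = True
    return changed


def infer(patients, rules):
    """Apply the rules in order, repeatedly, until a full pass changes nothing."""
    doc = [dict(p) for p in patients]  # work on a copy
    while any([rule(doc) for rule in rules]):  # list: run every rule each pass
        pass
    return doc


c2 = lambda doc: apply_child(doc, "disease")      # //patient -> //patient//disease
c3 = lambda doc: apply_fd(doc, "ward", "disease")  # //patient/ward -> //patient/disease

P = [{"ward": "W305", "disease": "leukemia"}, {"ward": "W305"}]
m1 = infer(P, [c2, c3])
m2 = infer(P, [c3, c2])
print(m1)
print(m1 == m2)
```

In this flat model the two orders reach the same fixpoint exactly. On real trees the intermediate results can differ in shape, which is why the slides compare them under m-equivalence rather than equality.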
Information leakage
[Diagram: Partial document P → (inference) → maximal inferred document M; regulating query A → (mapping) → M.]
Talk Outline • Information Leakage • Prevent information leakage • Experiments
Formal Problem • Given an XML document D, a regulating query A, constraints C1,…,Ck. • Find a partial document P without information leakage (“valid partial document”). • P has as much data as possible • Developed an algorithm for solving this problem
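For a tiny document the problem statement can be made concrete with exhaustive search: try removal sets of increasing size and return the first one whose published remainder does not leak. This brute-force sketch is not the authors' algorithm and would not scale. It assumes flat patient records, a single functional dependency ward → disease, and a regulating query that matches when a record named Alice still has (or regains) a disease. All function names are hypothetical.

```python
from itertools import combinations

PATIENTS = [
    {"pname": "Alice", "ward": "W305", "disease": "leukemia"},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
    {"pname": "Cathy", "ward": "W305", "disease": "leukemia"},
]


def publish(doc, removed):
    """Partial document: doc without the (patient index, field) pairs in `removed`."""
    return [{f: v for f, v in p.items() if (i, f) not in removed}
            for i, p in enumerate(doc)]


def infer(partial):
    """Fill in missing diseases using the FD ward -> disease."""
    known = {p["ward"]: p["disease"] for p in partial
             if "ward" in p and "disease" in p}
    return [dict(p, disease=known[p["ward"]])
            if "disease" not in p and p.get("ward") in known else dict(p)
            for p in partial]


def leaks(partial, name="Alice"):
    """True if the query 'disease of patient `name`' maps onto the inferred document."""
    return any(p.get("pname") == name and "disease" in p for p in infer(partial))


def smallest_removal(doc):
    """Smallest set of nodes whose removal yields a partial document with no leak."""
    nodes = [(i, f) for i, p in enumerate(doc) for f in p]
    for k in range(len(nodes) + 1):
        for removed in combinations(nodes, k):
            if not leaks(publish(doc, set(removed))):
                return set(removed)


print(leaks(publish(PATIENTS, {(0, "disease")})))  # only the sensitive node removed
print(smallest_removal(PATIENTS))
```

Removing only Alice's disease still leaks, because the ward dependency restores it from Betty or Cathy. The search shows that a valid partial document has to give up something else as well, or something else instead.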
Example
[Figure: hospital tree with patients (1) Alice, (2) Betty, and (3) Cathy, each with ward W305 and disease leukemia.]
Regulating query A: patient, Alice, disease marked * (S)
Functional dependency: //patient/ward → //patient/disease
Remove sensitive data A(D)
[Figure: the hospital tree with patients (1) Alice, (2) Betty, and (3) Cathy, next to regulating query A (patient, Alice, disease marked *).]
Remaining document: D - A(D)
Compute the maximal inferred document M of D - A(D)
[Figure: the hospital tree with patients (1) Alice, (2) Betty, and (3) Cathy, next to regulating query A (patient, Alice, disease marked *).]
There is a mapping from A to the inferred document, so information is leaked.
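The three steps on these slides (remove A(D), infer, test for a mapping) can be traced on the example with `xml.etree.ElementTree`. This is a sketch under stated assumptions: the element names are assumed, the only constraint applied is the functional dependency ward → disease, and the helpers `remove_sensitive`, `infer_by_ward`, and `leaks` are hypothetical names.

```python
import copy
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<hospital>"
    "<patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><pname>Betty</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><pname>Cathy</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "</hospital>"
)


def remove_sensitive(doc, name):
    """Return a copy of doc without the disease of the named patient: D - A(D)."""
    out = copy.deepcopy(doc)
    for patient in out.findall(f"./patient[pname='{name}']"):
        for disease in patient.findall("disease"):
            patient.remove(disease)
    return out


def infer_by_ward(partial):
    """Add missing disease nodes using the FD //patient/ward -> //patient/disease."""
    inferred = copy.deepcopy(partial)
    ward_to_disease = {}
    for p in inferred.findall("patient"):
        ward, disease = p.findtext("ward"), p.findtext("disease")
        if ward is not None and disease is not None:
            ward_to_disease[ward] = disease
    for p in inferred.findall("patient"):
        ward = p.findtext("ward")
        if p.find("disease") is None and ward in ward_to_disease:
            ET.SubElement(p, "disease").text = ward_to_disease[ward]
    return inferred


def leaks(partial, name):
    """True if the regulating query still maps onto the inferred document."""
    return bool(infer_by_ward(partial).findall(f"./patient[pname='{name}']/disease"))


partial = remove_sensitive(DOC, "Alice")
print(partial.find("./patient[pname='Alice']/disease"))  # removed from the partial document
print(leaks(partial, "Alice"))                           # but recoverable by inference
```

Alice's ward is still published, and Betty and Cathy share it, so the dependency puts her disease back.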
AND/OR Graphs
[Figure: the hospital tree with patients (1) Alice, (2) Betty, and (3) Cathy, and regulating query A (patient, Alice, disease marked *). The AND/OR graph begins at START, which leads to an OR connector over the nodes Alice (1) and leukemia (1).]
[Figure: the AND/OR graph extended. START leads to an OR connector over Alice (1) and leukemia (1); below it, an AND connector leads to two OR connectors over the nodes W305 (1), W305 (2), W305 (3), leukemia (2), and leukemia (3).]
Solution graphs
Requirements:
• Connected subgraph including START.
• Each node in the subgraph keeps its successor connectors.
• OR connector: keep one of its successors.
• AND connector: keep all its successors.
[Figure: two solution graphs. One contains START, OR, and Alice (1). The other contains START, OR, leukemia (1), AND, two OR connectors, and W305 (1).]
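The requirements translate directly into a recursive enumeration. In the sketch below, the graph is a reading of the flattened figure and should be treated as an assumption: START has one OR connector over Alice's pname and Alice's disease, and Alice's disease has an AND connector over two OR connectors, one per other patient in the ward. The node labels and the function names are hypothetical.

```python
from itertools import product

# A connector is ("OR" | "AND", [children]); a child is a node name or a connector.
# Assumed reconstruction of the example graph from the slides.
GRAPH = {
    "START": [("OR", ["pname(1)=Alice", "disease(1)=leukemia"])],
    "disease(1)=leukemia": [("AND", [
        ("OR", ["ward(1)=W305", "ward(2)=W305", "disease(2)=leukemia"]),
        ("OR", ["ward(1)=W305", "ward(3)=W305", "disease(3)=leukemia"]),
    ])],
}


def solve_node(node):
    """Node sets of all solution subgraphs rooted at `node`."""
    options = [frozenset([node])]
    for conn in GRAPH.get(node, []):  # a node keeps every one of its connectors
        options = [a | b for a in options for b in solve_conn(conn)]
    return set(options)


def solve_child(child):
    return solve_conn(child) if isinstance(child, tuple) else solve_node(child)


def solve_conn(conn):
    kind, children = conn
    per_child = [solve_child(c) for c in children]
    if kind == "OR":  # keep one successor
        return set().union(*per_child)
    # AND: keep all successors, combining one choice from each
    return {frozenset().union(*combo) for combo in product(*per_child)}


# Each solution graph, minus START, is a set of nodes to remove.
removals = {s - {"START"} for s in solve_node("START")}
for r in sorted(removals, key=lambda s: (len(s), sorted(s)))[:2]:
    print(sorted(r))
```

Under this reading, the two smallest removal sets correspond to the two solution graphs in the figure: remove Alice's name, or remove her disease together with her ward.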
Talk Outline • Information leakage • Prevent Information Leakage • Experiments
Experiments
• Evaluate the effect of data inference on security and our technique
• XML constraints
• Data sets
  • course_washington.xml
    • http://anhai.cs.uiuc.edu/archive/data/courses/washington
    • 3,904 courses, 162,102 nodes
  • dblp.xml
    • http://www.informatik.uni-trier.de/ley/db
    • About 427,000 publications, 8,728,000 nodes
Information Leakage
• Sensitive nodes defined by regulating queries
  • A1: In course_washington.xml, “Hide codes of all courses.”
  • A2: In dblp.xml, “Hide authors who published papers in 2001.”
//dblp/pub/title → //dblp/pub/author
Effect of number of sensitive nodes
• Sensitive nodes randomly selected from the tree
[Plot: child and descendant constraints in course_washington.xml]
Effect of number of constraints
[Plot: course_washington.xml]
Removing nodes to prevent leakage
[Plots: course data set; DBLP data set]
Conclusion and Future Work
• Contributions:
  • Formulated the problem of publishing XML documents without information leakage due to data inference
  • Showed effect of constraints on inference
  • Algorithm for finding a valid partial document of a given document
• Future work:
  • Positive regulating queries
  • Quantify information amount/importance