320 likes | 482 Views
M-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. by Tyrone Cadenhead. Main Contribution. Paper presents study on privacy preserving publication of fully-dynamic datasets. These datasets can be modified by any sequence of insertions and deletions.
E N D
M-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets by Tyrone Cadenhead
Main Contribution • Paper presents study on privacy preserving publication of fully-dynamic datasets. • These datasets can be modified by any sequence of insertions and deletions. • Use of two concepts: m-invariance and counterfeited generalization.
Generalization • A generalization is a popular methodology of privacy preservation. • Divide the tuples into several QI groups, and then generalize the QI values in each group to a uniform format. • For example a generalized format of table 1a is table 1b. • A generalized table is considered privacy preserving if it satisfies a generalization principle, for example k-anonymity and l-diversity. • K-anonymity requires each QI group to include at least k tuples (table 1b is 2-anonymous). • L-diversity requires every QI group to contain at least l “well-represented” sensitive values (table 1b is 2-diverse)
Example 1 • Consider a hospital that releases patients’ records quarterly, but each publication includes only the results of diagnoses in the 6 months preceding the publication time. • Table 1a is the microdata first release. Table 1b is the generalized relation. • Likewise, table 2a is the microdata for the second release and table 2b is the generalized relation.
Example 1 cont’d • The tuples of Alice, Andy, Helen, Ken and Paul have been deleted (they describe diagnoses over six months ago). • The tuples Emily, Mary, Ray, Tom and Vince have been added. • The hospital then publishes the generalized table relation in table 2b.
Example 1 cont’d • Even though both published relations of table 1a and table 1b are 2-anonymous and 2-diverse, adversary can still precisely determine the disease of a patient by exploiting the correlation between the two snapshots. • Assume an adversary knows Bob’s age and zipcode, and also knows that Bob has a record in tables 1b and 2b. • Table 1b would suggest Bob has either dyspepsia or bronchitis. While table 2b would suggest Bob has either dyspepsia or gastritis. By combining the two knowledge, the adversary correctly captures Bob’s real disease.
M-invariance and counterfeited generalization • Instead of publishing table 2b, publish table 3a and a auxiliary table 3b. • The adversary cannot distinguished between a counterfeit tuple and a real one. • From the QI details of Bob, the adversary can conclude that Bob has been generalized to the first QI group of tables 1b and 3a. But cannot infer Bob real disease.
Historical union • At time n ≥ 1, the historical union U(n) contains all the tuples in T at timestamps 1, 2, ..., n, respectively. Formally: • Each tuple t ε U(n) is implicitly associated with a lifespan [x, y], where x (y) is the smallest (largest) integer j such that t appears in T(j). • U(n) can be regarded as a table with the same schema as T. Note that, if a tuple appears in several T(j) with different j, it is included only once in U(n). Now we can define the background knowledge that can be tackled by our technique. • U(2) includes all the tuples in T(1) and T(2) after eliminating duplicates. • The lifespan of the tuple <Bob, 21,12k,dyspepsia> is [1,2], and the lifespan of <Alice,22,14k, bronchitis> is [1,1].
Background Knowledge B(n) • tuple t= <Bob,21,12k,dyspepsia> in U(2) correspond to b ε B(n) in the first tuple in table 4. • B[Group-ID]=* means the adversary is not sure about the generalized hosting groups of t. • The adversary also knows that c1 and c2 are counterfeits from table 3b. • The adversary also knows the lifespan of each tuple.
Privacy Breach • A privacy breach occurs if an adversary correctly finds out the sensitive value of any tuple t ε U(n), utilizing T*(1),…,T*(n) and B(n). • The objective of privacy preserving re-publication is to compute a pair of {T*(n),R(n)}, that minimizes the risk of privacy disclosure, yet captures as much information in the microdata as possible.
Microdata Reconstruction • Deals with how an adversary can reconstruct the microdata tables T(1),…,T(n) from the published T*(1),…,T*(n) and the knowledge table B(n). • Generalized Historical Union: Given a generalized relation T*(j) (1≤ j ≤n), we convert each row t* ε T*(j) to a timestamped tuple <t*,j>, which augments t* with another attribute Atm, called Timestamp, storing j. • The generalized historical union U*(n) includes all the timestamped tuples covering T*(1),…,T*(n)
Microdata Reconstruction • Generalized Historical Union • Rebuilding Surjection: Mapping f: U*(n) to B(n) is a rebuilding surjective function if it fulfils these requirements: • Superscript l is the lifespan.
Microdata Reconstruction • The arrows depict five mappings in a surjective function f, namely f(t*1) = b1, f(t*2) = b2, etc. • These mappings reconstruct four tuples in the microdata tables T(1) and T(2). • We can reconstruct tuple <Bob,21,12k,dyspepsia> in T(1) and T(2); and tuple <Alice,22,14k,bronchitis> in T(2) • Reconstruction is not always accurate as tuple <Helen,36,27k,flu> in T(1) real disease is gastritis. • There is a genuine surjective function, which exactly reconstructs the original microdata. If the adversary were able to discover this, then he could get all sensitive information on all individuals. Fortunately , as the tuples in U*(n) have been generalized , there exist a huge number of possible surjective functions from U*(n) to B(n).
Reasonable Surjection • A function f is reasonable if it satisfies these conditions:
Privacy Disclosure Risk • Let t be a tuple in the historical union U(n). The privacy disclosure risk risk(t) of t is
Persistent Invariance • Let t be a tuple in U(n). The candidate sensitive set t.CSS(j) of t at time j is the union of the sensitive values in each QI group QI*.
Persistent Invariance • Let n=2, and T*(1) = table 1b, and T*(2) = table 2b. • Let t = <Bob,21,12k,dyspepsia> in U(2) and have lifespan [1,2]. • T.CSS(1) at time 1 includes the sensitive values dyspepsia and bronchitis in QI group 1 of T*(1) • T.CSS(2) is {dyspepsia, gastritis}. • By lemma 1 the privacy risk of t is 1.
Persistent Invariance • To protect privacy, re-publication must ensure a sufficiently largeat each publication timestamp , until t is deleted from the metadata.
M-Invariance • If a sensitive value is carried by multiple tuples in QI*, the value appears only once in the signature.
M-Invariance • M-invariance demands that each sensitive value appear at most once in every QI group. • The publisher can simply set m to a sufficiently large value to achieve the target extent of privacy preservation.
M-Invariance • This points to an incremental approach for performing re-publication • To construct T*(n), the publisher only needs to consult the microdata tables T(n-1) and T(n) and the last release of T*(n-1)
Algorithm by example • The algorithm has three phases: Division, Balancing and Assignment. • Divide the tuples in T(n) into two Disjoint sets: • For any tuple t in Sn, t.QI*(n-1) and t.QI*(n) have the same signature. • For any tuple t in S_, t* in T*(n) is in a QI group, which has at least m tuples, and all the tuples have distinct sensitive values.
Algorithm by example • Each bucket BUC is inspected in turn. If BUC is not balanced, there is a shortage of some sensitive value. • S_ = {Emily, Mary, Ray, Tom, Vince} • In this case we move tuples from S_ into BUC, as long as S_ is still m-eligible • In figure 2a, BUC2 is unbalanced, we move Ray from S_ to BUC2 because there are 2 flu and 2 gastritis in S_ that are 2-eligible.
Algorithm by example • If S_ cannot be used to fix an unbalanced bucket BUC, there are two possibilities: • No tuple in S_ carries the required sensitive value. • S_ is no longer m-eligible • In both cases we insert counterfeit to balance BUC • Both BUC3 and BUC4 are unbalanced, but neither can be remedied with S_. BUC3 needs a bronchitis and BUC4 needs flu. • So we add c1 and c2.
Algorithm by example • In the Assignment phase, we assign the remaining tuples in S_ to buckets, subject to two rules: • Each tuple t in S_ can be placed only in a bucket whose signature includes t[As], the sensitive attribute of t. • At the end of the phase, all buckets are still balanced. • Here S_ = {Emily, Mary, Tom, Vince} • These are assigned to BUC1