Watermarking Relational Databases

Rakesh Agrawal and Jerry Kiernan

Why Watermark Databases • Watermark -- Intentionally introduced pattern in the data • hard to occur by chance • hard to find => hard to destroy (robust against malicious attack) • Increasing use of databases in applications beyond "behind-the-firewall data processing" involving data publication • Data providers require technical solutions to deter data theft and assert ownership of pirated copies

Assumption • Value of the database is significantly reduced if all of the k least significant bits of an attribute are dropped or perturbed, but it is acceptable to perturb a small number of attribute values • Datasets from many data publishers satisfy the above assumption (it is acceptable to trade off a small decrease in quality to assert ownership) • Tables of parametric specifications (mechanical, electrical, electronic, chemical, etc.), surveys (geological, climatic, etc.), life sciences (e.g. gene expressions) • Historical precedent: Logarithm tables, Astronomical Ephemerides, H.P. • Inappropriate dataset: Online bank balances

Desiderata • Detectability • Using a subset of the tuples and attributes • Robustness • Updates and malicious attacks • Incremental Updatability • On tuple insert/update/delete • Imperceptibility • Hard to infer the presence of a watermark • Blind System • Detection requires neither the original data nor the watermark • Key-Based System • Algorithm is public • Security resides in the choice of secret key

Related Work • Images [BGM95,HG98,M98,DR00] • Audio [BTH96] • Text [M94] • Software [CT00]

Relational data is different from multimedia data
• Multimedia object: consists of a large number of bits, with considerable redundancy => watermark has a large cover to hide in.
  Database relation: consists of tuples, each of which represents a separate object => watermark needs to be spread over these separate objects.
• Multimedia object: relative spatial/temporal positioning of the various pieces of an object does not change.
  Database relation: tuples of a relation constitute a set, and there is no implied ordering between them.
• Multimedia object: portions of an object cannot be dropped or replaced arbitrarily without causing perceptual changes in the object.
  Database relation: a pirate can easily drop some tuples/attributes or substitute them with tuples/attributes from other relations.
Need watermarking techniques designed to take into account the special characteristics of relational data

Techniques • Introduce watermarks across a fraction of the tuples in a database relation • Detect the watermark by retrieving a subset of the tuples • Use statistical hypothesis testing to locate the watermark even in the presence of updates to the data

Message Authentication Code • h = H(M), where H is a hash function and M is a message • Given M, easy to compute h • Given h, hard to compute M • Given M, hard to find M' such that H(M) = H(M') • MD5 and SHA are good choices for H • MAC is a one-way hash function which depends on a key K • We use: F(r.P) = H(K o H(K o r.P)), where r.P is the primary key attribute of tuple r, and o is concatenation
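The construction F(r.P) = H(K o H(K o r.P)) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mac`, the choice of SHA-256 for H, and the byte encoding of the key and primary key are all assumptions made here.

```python
import hashlib


def mac(key: bytes, primary_key: bytes) -> int:
    """Return F(r.P) = H(K o H(K o r.P)) as a non-negative integer.

    Assumptions for this sketch: H is SHA-256 and "o" is plain byte
    concatenation of the secret key and the primary-key bytes.
    """
    inner = hashlib.sha256(key + primary_key).digest()   # H(K o r.P)
    outer = hashlib.sha256(key + inner).digest()          # H(K o H(K o r.P))
    return int.from_bytes(outer, "big")


# The same key and primary key always yield the same value, so the owner
# can recompute it at detection time without storing anything per tuple.
value = mac(b"owner-secret", b"PK1")
```

Returning an integer makes it easy to take the value modulo the gap, the number of attributes, or the number of candidate bits in later steps.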

Watermarking Algorithm • Determine the attribute(s) to be watermarked, the Gap, and the LSBs • For each tuple r, compute MAC: • Establish whether r falls into a gap (tuples in a gap are not marked) • Select attribute to be marked • Determine bit position to contain the mark • Compute the mark's value • Update the attribute's value to reflect the watermark, if necessary
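The steps above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' exact procedure: the names `locate_mark` and `watermark`, the use of SHA-256, the way the MAC value is sliced into a gap test, an attribute index, and a bit index, and the use of the inner hash's parity as the mark value are all choices made for this example. Attribute values are assumed to be non-negative integers.

```python
import hashlib


def locate_mark(key, pk, gap, num_attrs, num_lsb):
    """Return (attribute index, bit index, mark bit), or None if in a gap."""
    inner = hashlib.sha256(key + pk).digest()              # H(K o r.P)
    f = int.from_bytes(hashlib.sha256(key + inner).digest(), "big")  # F(r.P)
    if f % gap != 0:
        return None                        # tuple falls into a gap: skip it
    # Assumption: use disjoint "digits" of f so the three choices
    # do not interfere with one another.
    attr = (f // gap) % num_attrs
    bit = (f // (gap * num_attrs)) % num_lsb
    mark = inner[-1] & 1                   # parity of the inner hash
    return attr, bit, mark


def watermark(rows, key, gap, num_attrs, num_lsb):
    """rows maps primary key (bytes) -> list of int attribute values."""
    marked = {}
    for pk, values in rows.items():
        values = list(values)              # do not mutate the caller's data
        loc = locate_mark(key, pk, gap, num_attrs, num_lsb)
        if loc is not None:
            attr, bit, mark = loc
            # Force the chosen bit to the mark value; a no-op if it
            # already holds that value.
            values[attr] = (values[attr] & ~(1 << bit)) | (mark << bit)
        marked[pk] = values
    return marked


rows = {str(i).encode(): [1000 + i, 2000 + 3 * i] for i in range(400)}
marked = watermark(rows, b"owner-secret", gap=4, num_attrs=2, num_lsb=2)
```

Because every decision is derived from the key and the primary key alone, re-running the procedure on already marked data leaves it unchanged.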

Technique

Before Watermarking
      A1         A2         A3         A4
PK1   011001100  110000100  100011100  110000101
PK2   101000111  010101111  111010110  100110011
PK3   110001110  100010101  000010101  101010000
PK4   111000010  010001010  010000010  111110010
PK5   110011001  010000111  100011001  110000110
Annotations: • B2 of A1 selected for PK1 • Mark = 1 • PK5 Not selected because in gap

After Watermarking
      A1         A2         A3         A4
PK1   011001100  110000100  100011100  110000101
PK2   101000111  010101111  111010110  100110011
PK3   110001110  100010101  000010101  101010010
PK4   111000010  010000010  010000010  111110010
PK5   110011001  010000011  100011001  110000110
Annotations: • Value not changed because Mark = 1 • Value changed

Without the Private Key, the Watermark is Hard to Destroy • Which tuple contains a mark • Which attribute got marked • Which bit position got marked • The expected value of a mark

Detection Algorithm • Locate suspicious data and extract a sample which might contain the watermark • For each tuple r, compute MAC: • If r doesn't fall into a gap, extract the mark bit value • Count the number of successes and the number of Bernoulli trials • Apply statistical analysis to establish presence of the watermark
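A minimal sketch of the detection loop follows, assuming marks were placed by a keyed rule like the hypothetical `locate_mark` below (SHA-256, digit-slicing of the MAC, and inner-hash parity are illustrative assumptions, not the authors' code). Each selected tuple is one Bernoulli trial; the function reports the number of trials, the number of matches, and the probability of seeing at least that many matches by chance when each trial succeeds with probability 1/2.

```python
import hashlib
from math import comb


def locate_mark(key, pk, gap, num_attrs, num_lsb):
    """Hypothetical keyed rule: (attribute, bit, expected mark) or None."""
    inner = hashlib.sha256(key + pk).digest()
    f = int.from_bytes(hashlib.sha256(key + inner).digest(), "big")
    if f % gap != 0:
        return None                               # tuple is in a gap
    return (f // gap) % num_attrs, (f // (gap * num_attrs)) % num_lsb, inner[-1] & 1


def binomial_tail(k, n):
    """P(at least k successes in n fair Bernoulli trials)."""
    if n == 0:
        return 1.0
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def detect(rows, key, gap, num_attrs, num_lsb):
    """rows maps primary key (bytes) -> list of int attribute values."""
    total = matches = 0
    for pk, values in rows.items():
        loc = locate_mark(key, pk, gap, num_attrs, num_lsb)
        if loc is None:
            continue
        attr, bit, mark = loc
        total += 1                                # one Bernoulli trial
        if ((values[attr] >> bit) & 1) == mark:
            matches += 1                          # one success
    return total, matches, binomial_tail(matches, total)
```

A caller would declare the watermark present when the returned probability falls below the chosen significance level. Detection needs only the suspicious data and the key, which is what makes the scheme blind.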

Extensions to the Algorithm • Relations with no primary keys • Null values

Evaluation • Analysis • Experiments • Forest Cover Type dataset from UCI repository

Attacks • Bit attacks • Randomize, zero-out, bit flipping, rounding, translation • Subset attack • Select subset of tuples and attributes • Mix-and-match attack • Combine data from multiple sources • Additive attack • Insert new watermark over existing watermark • Invertibility attack • Counterfeit watermark • Benign updates

Cumulative Binomial Probability Distribution • $b(k; n, p) = \binom{n}{k} p^{k} (1-p)^{n-k}$ • $B(k; n, p) = \sum_{i=k}^{n} b(i; n, p)$
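To make the role of B(k; n, p) concrete, the sketch below finds the smallest number of matching marks k for which B(k; n, 1/2) drops below a significance level alpha. The function names and the use of p = 1/2 (each examined bit matches by chance with probability one half) are assumptions of this example; exact integer arithmetic is used so that large n does not overflow floating point.

```python
from fractions import Fraction
from math import comb


def binomial_tail(k, n):
    """B(k; n, 1/2): probability of at least k successes in n fair trials."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def min_matches(n, alpha):
    """Smallest k with B(k; n, 1/2) < alpha (n + 1 if no such k <= n)."""
    alpha = Fraction(str(alpha))
    bound = alpha.numerator * 2 ** n       # compare tail * den >= num * 2^n
    tail = 0                               # running sum of C(n, i), i >= k
    c = 1                                  # C(n, n)
    for k in range(n, -1, -1):
        tail += c
        if tail * alpha.denominator >= bound:
            return k + 1                   # k is too small, k + 1 still works
        c = c * k // (n - k + 1)           # C(n, k - 1) from C(n, k)
    return 0


k = min_matches(1000, 0.01)                # 1000 marks, alpha = 0.01
proportion = k / 1000                      # fraction of marks that must match
```

Calling `min_matches` with a larger n gives a smaller required proportion k / n, which is the effect the detectability slides describe.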

Parameters and Defaults • Number of tuples: 1 million • Number of marked attributes: 1 • Number of least significant bits: 1 • Fraction of tuples marked: 1/1000 • Significance level for hypothesis test: 0.01

Proportion of correctly marked tuples required for detectability • The proportion of correctly marked tuples needed for detectability decreases as the number of marks increases • For 1M tuples and 10% of tuples marked, that proportion < 51% • Illustrates the tolerance of the watermark to updates

Proportion of correctly marked tuples needed for decreasing alpha • The data can tolerate a large number of updates while maintaining detectability with high confidence

Excess Error in an Attack • The attacker can be forced to make orders of magnitude more errors than the owner, making the attacker's data economically much less attractive than the owner's

Samples in Which the Watermark Could be Detected When the Attacker has Dropped Tuples • Watermark detected in a subset of the tuples of a watermarked relation • Selectivity gives the sample size • Each experiment repeated 100 times • Results show the percentage of trials in which the watermark could be detected

Samples in Which Watermark was Detected When the Attacker has Dropped some Attributes • Watermark detected in a subset of the attributes and tuples of a watermarked relation • Watermark spread across 10 attributes • Selectivity gives the sample size • Each experiment repeated 100 times • Results show the percentage of trials in which the watermark could be detected

Mix-and-Match Attack • Minimum fraction of tuples from the watermarked relation needed for detectability • N is the relation size • N x f = tuples from marked relation • N x (1 - f) = tuples from other relations
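One rough way to see where a minimum fraction f comes from is an expected-value argument. This is a back-of-the-envelope sketch, not the analysis from the talk. It assumes the detector examines $\omega$ tuples of the mixed relation, that marks in tuples taken from the watermarked relation are intact, that the examined bit in a tuple from another relation matches by chance with probability 1/2, and that $\tau$ is the minimum number of matches required at the chosen significance level. The symbols $\omega$ and $\tau$ are introduced here for illustration.

```latex
% Assumed symbols (not from the slides):
%   \omega = number of tuples the detector examines for a mark
%   \tau   = minimum number of matches needed at significance level \alpha
\begin{align*}
\mathbb{E}[\text{matches}]
  &= \underbrace{\omega f}_{\text{from the marked relation}}
   + \underbrace{\tfrac{1}{2}\,\omega\,(1-f)}_{\text{from other relations}}
   = \tfrac{1}{2}\,\omega\,(1+f), \\[4pt]
\tfrac{1}{2}\,\omega\,(1+f) \ge \tau
  &\iff f \ge \frac{2\tau}{\omega} - 1 .
\end{align*}
```

Since $\tau/\omega$ approaches 1/2 as the number of examined tuples grows, this estimate suggests that the required fraction f shrinks for larger relations.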

Summary • Provided desiderata for a system for watermarking database relations • First watermarking algorithm for database relations • No dependence on tuple ordering • Robust against attacks • Watermark can be incrementally updated • Requires neither the original relation nor the watermark for detection

Future Work • Watermarking extensions to handle non-numeric attributes • New algorithms for fingerprinting to track multiple sources of piracy
