Explore the evolving concept of privacy in data management, from information security to anonymized data publishing. Delve into case studies on US Census, Netflix Prize, and AOL Search Data for insights into privacy practices and challenges.
Privacy in Data Management Sharad Mehrotra
Privacy - definitions Generic • Privacy is the interest that individuals have in sustaining a 'personal space', free from interference by other people and organizations. Information privacy • The degree to which an individual can determine which personal information is to be shared, with whom, and for what purpose. • The evolving relationship between technology and the legal right to, or public expectation of, privacy in the collection and sharing of data. Identity privacy (anonymity) • The anonymity of an element (belonging to a set) is the property of not being identifiable within the set, i.e., of being indistinguishable from the other elements of the set.
Means of achieving privacy Information Security is the process of protecting data from unauthorized access, use, disclosure, destruction, modification, or disruption. Enforcing security in information processing applications: • Law • Access control • Data encryption • Data transformation – statistical disclosure control Techniques used depend on • Application semantics/functionality requirements • Nature of data • Privacy requirement/metrics Privacy is contextual
Overview Study the nature of privacy in the context of data-centric applications • Privacy-preserving data publishing for data mining applications • Secure outsourcing of data: “Database as a Service (DAS)” • Privacy-preserving implementation of pervasive spaces • Secure data exchange and sharing between multiple parties
Privacy-Preserving / Anonymized Data Publishing
Why Anonymize? For Data Sharing Give real(istic) data to others to study without compromising privacy of individuals in the data Allows third parties to try new analysis and mining techniques not thought of by the data owner For Data Retention and Usage Various requirements prevent companies from retaining customer information indefinitely E.g. Google progressively anonymizes IP addresses in search logs Internal sharing across departments (e.g. billing and marketing)
Why Privacy? Data subjects have inherent right and expectation of privacy “Privacy” is a complex concept (beyond the scope of this tutorial) What exactly does “privacy” mean? When does it apply? Could there exist societies without a concept of privacy? Concretely: at collection, the “small print” outlines privacy rules Most companies have adopted a privacy policy E.g. AT&T privacy policy: att.com/gen/privacy-policy?pid=2506 Significant legal framework relating to privacy UN Declaration of Human Rights, US Constitution; HIPAA, Video Privacy Protection Act, Data Protection Acts
Case Study: US Census Raw data: information about every US household Who, where; age, gender, racial, income and educational data Why released: determine representation, planning How anonymized: aggregated to geographic areas (Zip code) Broken down by various combinations of dimensions Released in full after 72 years Attacks: no reports of successful deanonymization Recent attempts by FBI to access raw data rebuffed Consequences: greater understanding of US population Affects representation, funding of civil projects Rich source of data for future historians and genealogists
Case Study: Netflix Prize Raw data: 100M dated ratings from 480K users on 18K movies Why released: improve prediction of ratings for unlabeled examples How anonymized: exact details not described by Netflix All direct customer information removed Only a subset of the full data; dates modified; some ratings deleted; movie title and year published in full Attacks: dataset is claimed vulnerable [Narayanan Shmatikov 08] Attack links data to IMDB, where the same users also rated movies Find matches based on similar ratings or dates in both (sketched below) Consequences: rich source of user data for researchers unclear if attacks are a threat: no lawsuits or apologies yet
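A minimal, hypothetical sketch of this style of linkage attack (not the published attack code): score each released record against an auxiliary record scraped from a public site such as IMDB, and pick the released user with the best match. Data, function names, and the scoring rule are illustrative.

```python
from datetime import date

# Hypothetical released record: {movie_id: (rating, date)}; aux is the attacker's
# partial view of the same user scraped from a public site.
def linkage_score(released, aux, date_slack_days=3):
    """Count movies where the rating matches exactly and the dates are close."""
    score = 0
    for movie, (rating, when) in aux.items():
        if movie in released:
            r_rating, r_when = released[movie]
            if r_rating == rating and abs((r_when - when).days) <= date_slack_days:
                score += 1
    return score

def best_match(dataset, aux):
    """Return the released user whose record best matches the auxiliary record."""
    return max(dataset, key=lambda uid: linkage_score(dataset[uid], aux))

# Toy usage
dataset = {
    "u1": {101: (5, date(2005, 3, 1)), 102: (3, date(2005, 4, 2))},
    "u2": {101: (2, date(2005, 3, 9)), 103: (4, date(2005, 5, 5))},
}
aux = {101: (5, date(2005, 3, 2)), 102: (3, date(2005, 4, 1))}
print(best_match(dataset, aux))  # -> "u1"
```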
Case Study: AOL Search Data Raw data: 20M search queries for 650K users from 2006 Why released: allow researchers to understand search patterns How anonymized: user identifiers removed All searches from same user linked by an arbitrary identifier Attacks: many successful attacks identified individual users Ego-surfers: people typed in their own names Zip codes and town names identify an area NY Times identified 4417749 as 62yr old GA widow [Barbaro Zeller 06] Consequences: CTO resigned, two researchers fired Well-intentioned effort failed due to inadequate anonymization
Three Abstract Examples “Census” data recording incomes and demographics Schema: (SSN, DOB, Sex, ZIP, Salary) Tabular data—best represented as a table “Video” data recording movies viewed Schema: (Uid, DOB, Sex, ZIP), (Vid, title, genre), (Uid, Vid) Graph data—graph properties should be retained “Search” data recording web searches Schema: (Uid, Kw1, Kw2, …) Set data—each user has different set of keywords Each example has different anonymization needs
Models of Anonymization Interactive Model (akin to statistical databases) Data owner acts as “gatekeeper” to data Researchers pose queries in some agreed language Gatekeeper gives an (anonymized) answer, or refuses to answer “Send me your code” model Data owner executes code on their system and reports result Cannot be sure that the code is not malicious Offline, aka “publish and be damned” model Data owner somehow anonymizes data set Publishes the results to the world, and retires Our focus in this tutorial – seems to model most real releases
Objectives for Anonymization Prevent (high confidence) inference of associations Prevent inference of salary for an individual in “census” Prevent inference of individual’s viewing history in “video” Prevent inference of individual’s search history in “search” All aim to prevent linking sensitive information to an individual Prevent inference of presence of an individual in the data set Satisfying “presence” also satisfies “association” (not vice-versa) Presence in a data set can violate privacy (e.g. STD clinic patients) Have to model what knowledge might be known to the attacker Background knowledge: facts about the data set (X has salary Y) Domain knowledge: broad properties of data (illness Z rare in men)
Utility Anonymization is meaningless if utility of data not considered The empty data set has perfect privacy, but no utility The original data has full utility, but no privacy What is “utility”? Depends what the application is… For fixed query set, can look at max, average distortion Problem for publishing: want to support unknown applications! Need some way to quantify utility of alternate anonymizations
Measures of Utility Define a surrogate measure and try to optimize Often based on the “information loss” of the anonymization Simple example: number of rows suppressed in a table Give a guarantee for all queries in some fixed class Hope the class is representative, so other uses have low distortion Costly: some methods enumerate all queries, or all anonymizations Empirical Evaluation Perform experiments with a reasonable workload on the result Compare to results on original data (e.g. Netflix prize problems) Combinations of multiple methods Optimize for some surrogate, but also evaluate on real queries
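A minimal sketch of one simple surrogate measure from the list above (number of suppressed values), assuming generalization/suppression is marked with a "*" symbol; the data and function name are illustrative.

```python
def suppression_loss(original, anonymized, suppressed="*"):
    """Fraction of cells replaced by the suppression symbol.
    Both inputs are lists of equal-length tuples (rows)."""
    total = sum(len(row) for row in original)
    lost = sum(1 for row in anonymized for cell in row if cell == suppressed)
    return lost / total

original   = [("1970-01-01", "M", "92697"), ("1980-05-05", "F", "92614")]
anonymized = [("197*", "M", "926*"), ("*", "F", "926*")]
print(suppression_loss(original, anonymized))  # 1 fully suppressed cell out of 6
```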
Definitions of Technical Terms Identifiers–uniquely identify, e.g. Social Security Number (SSN) Step 0: remove all identifiers Was not enough for AOL search data Quasi-Identifiers (QI)—such as DOB, Sex, ZIP Code Enough to partially identify an individual in a dataset DOB+Sex+ZIP unique for 87% of US Residents [Sweeney 02] Sensitive attributes (SA)—the associations we want to hide Salary in the “census” example is considered sensitive Not always well-defined: only some “search” queries sensitive In “video”, association between user and video is sensitive SA can be identifying: bonus may identify salary…
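A minimal sketch of how one might quantify quasi-identifier risk on a table: the fraction of rows that are unique on their QI values. The toy "census" rows and attribute positions are illustrative, not real data.

```python
from collections import Counter

def unique_fraction(rows, qi_indices):
    """Fraction of rows uniquely identified by their quasi-identifier values."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    unique = sum(1 for row in rows
                 if counts[tuple(row[i] for i in qi_indices)] == 1)
    return unique / len(rows)

# Toy "census" rows: (SSN, DOB, Sex, ZIP, Salary); QI = DOB, Sex, ZIP
rows = [
    ("111", "1970-01-01", "M", "92697", 70000),
    ("222", "1970-01-01", "M", "92697", 80000),
    ("333", "1985-06-15", "F", "92614", 90000),
]
print(unique_fraction(rows, qi_indices=(1, 2, 3)))  # 1 of 3 rows is unique
```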
Summary of Anonymization Motivation Anonymization needed for safe data sharing and retention Many legal requirements apply Various privacy definitions possible Primarily, prevent inference of sensitive information Under some assumptions of background knowledge Utility of the anonymized data needs to be carefully studied Different data types imply different classes of query
Privacy issues in data outsourcing (DAS) and cloud computing applications
Example: DAS - Secure outsourcing of data management Issues: • Confidential information in data needs to be protected • Features - support queries on data: SQL, keyword-based search queries, XPath queries, etc. • Performance - bulk of the work should be done on the server; minimize communication overhead, client-side storage, and post-processing of results [Diagram: data owner/client sends queries to the service provider's DB server]
Security model for DAS applications Adversaries (A): • Inside attackers: authorized users with malicious intent • Outside attackers: hackers, snoopers Attack models: • Passive attacks: A wants to learn confidential information • Active attacks: A wants to learn confidential information + actively modifies data and/or queries Trust on server: • Untrusted: normal hardware, data & computation visible • Semi-trusted: trusted co-processors + limited storage • Trusted: All hardware is trusted & tamper-proof
Secure data storage & querying in DAS Security concern: “ssn”, “salary” & “credit rating” are confidential Approach: the data owner/client encrypts the sensitive column values of table R before shipping it to the service provider's DB server How to execute queries on encrypted data? e.g. Select * from R where salary ∈ [25K, 35K] Trivial solution: retrieve all rows to the client, decrypt them, and check the predicate We can do better: use secure indices for query evaluation on the server
Data storage • Encrypt the rows of the original plaintext table R into “etuples” (encrypted tuples) • Partition salary values into buckets, e.g. B0: [0, 20K), B1: [20K, 30K), B2: [30K, 40K), B3: [40K, 50K) • Index the etuples by their bucket labels • Client-side metadata: the bucket boundaries • Server-side data: RS, the server-side table (encrypted + indexed)
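A minimal sketch of this storage step, using the Fernet cipher from the `cryptography` package as a stand-in for whatever cipher an actual DAS deployment would use; the bucket boundaries and row layout are illustrative.

```python
import json
from cryptography.fernet import Fernet

# Client-side metadata: bucket boundaries (illustrative ranges in dollars)
BUCKETS = [(0, 20_000, "B0"), (20_000, 30_000, "B1"),
           (30_000, 40_000, "B2"), (40_000, 50_000, "B3")]

def bucket_of(salary):
    for lo, hi, label in BUCKETS:
        if lo <= salary < hi:
            return label
    raise ValueError("salary out of range")

def encrypt_table(rows, key):
    """Build the server-side table RS: (etuple, bucket label) pairs."""
    f = Fernet(key)
    return [(f.encrypt(json.dumps(row).encode()), bucket_of(row["salary"]))
            for row in rows]

key = Fernet.generate_key()              # stays with the data owner/client
R = [{"ssn": "111", "salary": 27_000}, {"ssn": "222", "salary": 45_000}]
RS = encrypt_table(R, key)               # shipped to the untrusted server
```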
Querying encrypted data Client-side query: Select * from R where sal ∈ [25K, 35K] Server-side query: Select etuple from RS where bucket = B1 ∨ B2 Client post-processing: decrypt the returned etuples and discard false positives (rows in B1 or B2 whose salary lies outside [25K, 35K])
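Continuing the storage sketch above (BUCKETS, RS, and key are defined there), a minimal sketch of how the client could translate the range predicate into bucket labels, the server answers on labels only, and the client filters out false positives after decryption.

```python
import json
from cryptography.fernet import Fernet  # BUCKETS, RS, key come from the sketch above

def translate_range(lo, hi):
    """Client-side: map a salary range to the set of overlapping bucket labels."""
    return {label for b_lo, b_hi, label in BUCKETS if b_lo < hi and lo < b_hi}

def server_query(RS, labels):
    """Server-side: return etuples whose bucket label matches (no decryption)."""
    return [etuple for etuple, label in RS if label in labels]

def client_filter(etuples, key, lo, hi):
    """Client-side: decrypt and drop false positives introduced by bucketization."""
    f = Fernet(key)
    rows = [json.loads(f.decrypt(e)) for e in etuples]
    return [r for r in rows if lo <= r["salary"] <= hi]

labels = translate_range(25_000, 35_000)        # {"B1", "B2"}
result = client_filter(server_query(RS, labels), key, 25_000, 35_000)
```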
Problems to address • Security analysis • Goal: To hide away the confidential information in data from server-side adversaries (DB admins etc.) • Quantitative measures of disclosure-risk • Quality of partitioning (bucketization) • Data partitioning schemes • Cost measures • Tradeoff • Balancing the two competing goals of security & performance Continued later …
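As a hypothetical illustration of a quantitative disclosure-risk measure and of the security/performance tradeoff: one could compare the entropy of plaintext values within each bucket. Coarser buckets leak less to the server per observed label but force the client to discard more false positives per query. The metric and data below are illustrative, not the tutorial's actual measures.

```python
import math
from collections import Counter

def bucket_entropy(values_by_bucket):
    """Average entropy (in bits) of the value distribution within each bucket.
    Higher entropy -> the adversary learns less from seeing a bucket label."""
    entropies = []
    for values in values_by_bucket.values():
        counts = Counter(values)
        n = len(values)
        entropies.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return sum(entropies) / len(entropies)

# Fine buckets: good performance (few false positives), lower entropy (less privacy)
fine   = {"B1": [25, 26], "B2": [33, 34]}
# One coarse bucket: higher entropy (more privacy), more false positives per query
coarse = {"B1": [25, 26, 33, 34]}
print(bucket_entropy(fine), bucket_entropy(coarse))  # 1.0 vs 2.0 bits
```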
Privacy in Cloud Computing What is cloud computing? Many definitions exist Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [NIST] Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically re-configured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized service-level agreements. [Luis M. Vaquero et al., Madrid Spain]
Privacy in Cloud Computing Actors Service Providers Provide software services (Ex: Google, Yahoo, Microsoft, IBM, etc…) Service Users Personal, business, government Infrastructure Providers Provide the computing infrastructure required to host services Three cloud services Cloud Software as a Service (SaaS) Use provider’s applications over a network Cloud Platform as a Service (PaaS) Deploy customer-created applications to a cloud Cloud Infrastructure as a Service (IaaS) Rent processing, storage, network capacity, and other fundamental computing resources
Privacy in Cloud Computing Examples of cloud computing services Web-based email Photo storing Spreadsheet applications File transfer Online medical record storage Social network applications
Privacy in Cloud Computing Privacy issues in cloud computing Cloud increases security and privacy risks Data creation, storage, and communication growing at an exponential rate Data replicated across large geographic distances Data contain personally identifiable information Data stored at untrusted hosts Create enormous risks for data privacy Loss of control of sensitive data Risk of sharing sensitive data with marketing Other problem: technology ahead of law Does the user or the hosting company own the data? Can the host deny a user access to their own data? If the host company goes out of business, what happens to the users' data it holds? How does the host protect the user's data?
Privacy in Cloud Computing Solutions The cloud does not offer any privacy Awareness: some effort Efforts: ACM Cloud Computing Security Workshop, November 2009; ACM Symposium on Cloud Computing, June 2010 Privacy in cloud computing at UCI Recently launched a project on privacy preservation in cloud computing General approach: personal privacy middleware
Example: Detecting a pre-specified set of events • No ordinary coffee room, but one that is monitored! • There are rules that apply • If a rule is violated, penalties may be imposed • But all is not unfair: individuals have a right to privacy! “Until an individual has had more than his quota of coffee, his identity will not be revealed” Just like a coffee room!!
Issues to be addressed • Modeling pervasive spaces: How to capture events of interest • E.g., “Tom had his 4th cup of coffee for the day” • Privacy goal: Guarantee anonymity to individuals • What are the necessary and sufficient conditions? • Solution • Design should satisfy the necessary and sufficient conditions • Practical/scalable
Basic events, Composite events & Rules • Model of pervasive space: a stream of basic events, e.g. e1:<Tom, coffee-room, *, enter>, e2:<Tom, coffee-room, coffee-cup, dispense>, …, ek:<Bill, coffee-room, coffee-maker, exit> • Composite event: a sequence of one or more basic events • Rule: (Composite event, Action) • Rules apply to groups of individuals, e.g.: • Coffee room rules apply to everyone • Server room rule applies to everyone except administrators, etc.
Composite-events & automaton templates Composite-event templates • “A student drinks more than 3 cups of coffee” e1 ≡ <u ∈ STUDENT, coffee_room, coffee_cup, dispense> • “A student tries to access the IBM machine in the server room” e1 ≡ <u ∈ STUDENT,server_room,*, entry> e2 ≡ <ū, server_room, *, exit> e3 ≡ <ū, server_room, IBM-mc, login-attempt>
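A minimal sketch of such an automaton, instantiated per individual and advanced on matching basic events; the event format and the "4 cups of coffee" rule encoding are illustrative, not the system's actual implementation.

```python
class Automaton:
    """Tracks one composite-event rule for one individual.
    Each step is a predicate over basic events (user, location, object, action)."""
    def __init__(self, rule_id, user, steps):
        self.rule_id, self.user, self.steps, self.state = rule_id, user, steps, 0

    def advance(self, event):
        """Advance on a matching event; return True once the composite event fires."""
        if self.state < len(self.steps) and self.steps[self.state](event):
            self.state += 1
        return self.state == len(self.steps)

# "A student drinks more than 3 cups of coffee" for user Tom:
# the same basic event (coffee dispensed to Tom) must occur 4 times.
coffee = lambda e: e == ("Tom", "coffee_room", "coffee_cup", "dispense")
rule = Automaton("R1", "Tom", [coffee] * 4)

stream = [("Tom", "coffee_room", "*", "enter"),
          ("Tom", "coffee_room", "coffee_cup", "dispense")] * 4
fired = any(rule.advance(e) for e in stream)
print(fired)  # True after the 4th dispense event
```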
System architecture & adversary [Architecture: secure sensor nodes (SSNs) send events through a thin trusted middleware, which obfuscates the origin of events, to a server that stores the rules DB and the state information] Basic assumptions about SSNs • Trusted hardware (sensors are tamper-proof) • Secure data capture & generation of basic events by the SSN • Limited computation + storage capacity: can carry out encryption/decryption with a secret key common to all SSNs, and automaton transitions
Privacy goal & Adversary’s knowledge Passive adversary (A): server-side snooper who wants to deduce the identity of the individual associated with a basic event • A knows all rules of the space & automaton structures • A can observe all server-side activities • A has unlimited computation power Ensure k-anonymity for each individual (k-anonymity is achieved when each individual is indistinguishable from at least k-1 other individuals associated with the space) Minimum requirement to ensure anonymity: state information (automatons) is always kept encrypted on the server
Basic protocol 1. The secure sensor node (SSN) generates a basic event e 2. The SSN sends the server an encrypted query for the automatons that make a transition on e 3. The server returns the automatons that (possibly) match e (encrypted match) 4. The SSN decrypts the automatons, advances their state if necessary, and associates an encrypted label with the new state 5. The SSN writes back the encrypted automatons; the server stores the updated automatons Question: Does encryption ensure anonymity? NO! The pattern of automaton access may reveal identity
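A minimal sketch of this protocol loop, with the Fernet cipher as a stand-in and the "encrypted match" simplified to an opaque-label lookup; a full implementation would also re-label state on every write, as step 4 above describes. Class and function names are illustrative.

```python
import pickle
from cryptography.fernet import Fernet

KEY = Fernet.generate_key()   # shared secret held only by the secure sensor nodes

class Server:
    """Untrusted server: stores and returns only ciphertext blobs."""
    def __init__(self):
        self.store = {}                       # opaque label -> encrypted automaton state

    def fetch(self, labels):                  # "encrypted match", simplified to label lookup
        return {l: self.store[l] for l in labels if l in self.store}

    def write_back(self, items):
        self.store.update(items)

def advance(automaton, event):
    """Toy transition: count matching events until the rule's threshold is reached."""
    if event == automaton["trigger"]:
        automaton["count"] += 1
    return automaton["count"] >= automaton["threshold"]

def handle_event(server, event, labels):
    """Runs on the SSN: fetch candidate automatons, transition locally, write back."""
    f = Fernet(KEY)
    updated = {}
    for label, blob in server.fetch(labels).items():
        automaton = pickle.loads(f.decrypt(blob))
        if advance(automaton, event):          # state is only visible inside the SSN
            print(f"rule behind label {label!r} fired")  # take the rule's action here
        # a real scheme would attach a fresh encrypted label to the new state
        updated[label] = f.encrypt(pickle.dumps(automaton))
    server.write_back(updated)
```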
Example Rules: R1 = (U enters kitchen, U takes coffee), R2 = (U enters kitchen, U opens fridge), R3 = (U enters kitchen, U opens microwave) R1, R2, R3 apply to Tom: when Tom enters the kitchen, 3 automatons fire R1, R2 apply to Bill: when Bill enters the kitchen, 2 automatons fire On an event, the # of rows retrieved from the state table can disclose the identity of the individual
Characteristic access patterns of automatons The set of rules applicable to an individual may be unique and can potentially identify the individual The characteristic access patterns of rows can potentially reveal the identity of an automaton in spite of encryption Example: rules applicable to Tom, as automatons x: Tom enters kitchen → Tom takes coffee y: Tom enters kitchen → Tom takes coffee → Tom leaves coffee pot empty, or Tom enters kitchen → Tom opens fridge → Tom leaves fridge open z: Tom enters kitchen → Tom opens fridge Characteristic patterns (sets of automatons fetched on each successive event) Patterns of x: P1: {x,y,z} {x,y} Patterns of y: P2: {x,y,z} {x,y} {y} and P3: {x,y,z} {y,z} {y} Patterns of z: P4: {x,y,z} {y,z}
Solution scheme • Formalized the notion of indistinguishability of automatons in terms of their access patterns • Identified “event clustering” as a mechanism for inducing indistinguishability for achieving k-anonymity • Proved the difficulty of checking for k-anonymity • Characterized the class of event-clustering schemes that achieve k-anonymity • Proposed an efficient clustering algorithm to minimize average execution overhead for protocol • Implemented a prototype system • Challenges: • Designing a truly secure sensing-infrastructure is challenging • Key management issues • Are there other interesting notions of privacy in pervasive space?
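A hypothetical illustration (not the paper's actual algorithm) of how event clustering induces indistinguishability: when an event occurs, the server only observes its cluster, so anonymity requires that the set of individuals who could have generated an access to each cluster has size at least k. The users, events, and clusterings below are illustrative.

```python
from collections import defaultdict

def anonymity_set_sizes(user_events, cluster_of):
    """For each event cluster, the number of users who have some rule event in it.
    When an event occurs, the server learns only the cluster, so any user in that
    cluster's anonymity set could have caused the observed access."""
    users_per_cluster = defaultdict(set)
    for user, events in user_events.items():
        for e in events:
            users_per_cluster[cluster_of[e]].add(user)
    return {c: len(us) for c, us in users_per_cluster.items()}

def is_k_anonymous(user_events, cluster_of, k):
    return all(size >= k
               for size in anonymity_set_sizes(user_events, cluster_of).values())

# Tom's rules involve an extra "fridge" event that Bill's do not
user_events = {"Tom": {"enter", "coffee", "fridge"}, "Bill": {"enter", "coffee"}}
no_cluster = {"enter": "C1", "coffee": "C2", "fridge": "C3"}  # fridge access identifies Tom
merged     = {"enter": "C1", "coffee": "C2", "fridge": "C2"}  # coffee/fridge fetched together
print(is_k_anonymous(user_events, no_cluster, k=2))  # False
print(is_k_anonymous(user_events, merged,     k=2))  # True
```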