Privacy on the Web Li Xiong Department of Mathematics and Computer Science Emory University
Definitions of Privacy • Right to be left alone (1890s, Brandeis, future US Supreme Court Justice) • a: The quality or state of being apart from company or observation; b: freedom from unauthorized intrusion (Merriam-Webster) • The right of the individual to be protected against intrusion into his personal life or affairs, or those of his family, by direct physical means or by publication of information (Calcutt committee, UK)
Aspects of Privacy • Information privacy • Bodily privacy • Privacy of communications • Territorial privacy
Information Privacy • Establishment of rules governing the collection and handling of personal data • Data about individuals should not be automatically available to other individuals and organizations • The individual must be able to exercise a substantial degree of control over that data and its use.
Information privacy on the web • Large amount of (personal) data collected on the web • Search engine logs • Personal data and blogs on social network sites • … • The data are of great value for both individuals and our society. • The data also pose a significant threat to individuals’ privacy.
Data privacy on the web – some case studies • A comparison of privacy practices of internet service companies • Query log retention and its privacy implications • Information revelation patterns and their privacy implications on social network sites
A race to the bottom: privacy ranking of Internet service companies • Privacy International, 2007 • Studied and ranked the privacy practices of key Internet-based companies • Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube
A Race to the Bottom: Methodologies • Corporate administrative details • Data collection and processing • Data retention • Openness and transparency • Customer and user control • Privacy enhancing innovations and privacy invasive innovations
Why Google • Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure • Maintains records of all search strings with associated IP addresses and time stamps for at least 18-24 months • Additional personal information from user profiles in Orkut • Uses an advanced profiling system for ads
Data privacy on the web – some case studies • A comparison of privacy practices of internet service companies • Query log retention and its privacy implications • Information revelation patterns and their privacy implications on social network sites
Query Log
AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com
(Source: AOL Query Log)
A Face is exposed for AOL searcher No. 4417749 • Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her dogs • 20 million Web search queries released by AOL • User 4417749 • “numb fingers” • “60 single men” • “dog that urinates on everything” • “landscapers in Lilburn, Ga” • Several people’s names with the last name Arnold • “homes sold in shadow lake subdivision gwinnett county georgia”
Privacy Risks of Query Log • Accidental or malicious disclosure • Disclosure of information that users intended to keep private, or that may harm them when released • Compelled disclosure to third parties • Query logs may be subject to subpoena as part of civil litigation between individuals or organizations • Disclosure to the government • Query logs may be subject to government demands in the context of law enforcement or intelligence investigations • Misuse of user profiles • The retention of query logs may allow the creation of detailed profiles of individuals’ interests, preferences, and behaviors.
Query Log Retention Rationale • Improving ranking algorithms and quality of search results • Language-based applications such as spelling correction • Query refinement • Personalization • Combating fraud and abuse • Sharing data for academic research • Sharing data for marketing and other commercial purposes
Query log retention • Analyze potential techniques/practices for query log retention • how well the technique protects privacy • how well the technique preserves the utility of the query logs • how well the technique might be implemented as a user control (a mechanism that allows users to choose to apply the technique)
1. Log Deletion • Erase users’ complete query logs • may occur as early as when the search engine returns search results to the user. • Privacy: the most privacy-enhancing technique available • Utility: drops to zero once the logs are erased • If query logs are kept longer before erasure, the search engine can still gain some of the benefits of log analysis and storage • User control: straightforward - users either have their logs retained or deleted
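The trade-off between deleting immediately and deleting after a delay can be sketched as a retention window. This is a minimal illustration, not any search engine's actual procedure; the record layout and the `purge_expired` helper are assumptions made for the example.

```python
from datetime import datetime, timedelta

def purge_expired(log, now, retention):
    """Return only the records that are still inside the retention window.

    `log` is a list of dicts with a "time" key (hypothetical layout).
    A retention of zero models deletion as soon as results are returned.
    """
    cutoff = now - retention
    return [record for record in log if record["time"] >= cutoff]

# Sample records modeled on the AOL query log excerpt.
log = [
    {"anon_id": 217, "query": "lottery", "time": datetime(2006, 3, 1, 11, 58, 51)},
    {"anon_id": 217, "query": "lottery", "time": datetime(2006, 3, 27, 14, 10, 38)},
    {"anon_id": 1268, "query": "gallstones", "time": datetime(2006, 5, 11, 2, 13, 2)},
]

now = datetime(2006, 5, 12)
kept_30_days = purge_expired(log, now, timedelta(days=30))   # delayed erasure
kept_immediate = purge_expired(log, now, timedelta(0))       # immediate erasure
```

A longer window keeps more records available for analysis, at the cost of a longer period during which they can be disclosed.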
2. Hashing queries • Two approaches • Entire queries could be hashed, so that the resulting log contains a hash value • tokenize the query, and hash each token, resulting in a set of hash values
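The two approaches can be sketched as follows. This is only an illustration under stated assumptions: SHA-256, lowercasing, and whitespace tokenization are choices made for the example, and the function names are hypothetical.

```python
import hashlib

def hash_text(text):
    """Hex SHA-256 digest of a string (the hash function is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def hash_whole_query(query):
    """Approach 1: the log stores a single hash value for the entire query."""
    return hash_text(query.strip().lower())

def hash_tokens(query):
    """Approach 2: tokenize on whitespace and hash each token separately."""
    return [hash_text(token) for token in query.lower().split()]

whole = hash_whole_query("california lottery")
tokens = hash_tokens("california lottery")
```

Token-level hashing keeps more utility, because queries that share a word still share a hash value. For the same reason it leaks more: frequent tokens can be guessed from their frequencies, and an unsalted hash of a guessable query can be reversed by hashing candidate queries.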
2. Hashing queries (user control) • Straightforward • The technique’s effectiveness in protecting privacy may actually increase for those who choose to adopt it if not all individuals make use of it (because reverse-engineering attacks rely on statistics)
3. Identifier Deletion • Identifier: IP address, cookie IDs
4. Hashing Identifiers • Identifier: IP address, cookie IDs
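A minimal sketch of identifier hashing, assuming a keyed hash (HMAC-SHA-256) with a secret held by the search engine. The key value and the `hash_identifier` name are hypothetical. A keyed hash is used here because the IPv4 address space is small enough that a plain, unkeyed hash could be reversed by hashing every possible address.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be random and stored separately.
SECRET_KEY = b"replace-with-a-random-secret"

def hash_identifier(identifier, key=SECRET_KEY):
    """Replace an IP address or cookie ID with a keyed hash of it."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

pseudonym = hash_identifier("192.0.2.17")  # documentation-range IP address
```

The same identifier always maps to the same pseudonym, so a user's queries remain linkable to one another even though the raw identifier is no longer stored.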
5. Scrubbing Query Content • remove identifying information • phone numbers, Social Security numbers, credit card numbers, addresses, and names • distinguish identifying information from the remainder of the query content [Xiong and Agichtein 2007]
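For identifiers with a fixed shape, scrubbing can be sketched with pattern matching. The patterns below are simplified, US-style assumptions chosen for illustration, and the placeholder labels are hypothetical. Names and addresses have no fixed shape and are not handled here, which is where the harder problem of distinguishing identifying content from the rest of the query begins.

```python
import re

# Order matters: more specific patterns are applied first.
PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[CARD]", re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]

def scrub(query):
    """Replace each matched identifier with a placeholder label."""
    for placeholder, pattern in PATTERNS:
        query = pattern.sub(placeholder, query)
    return query

scrubbed = scrub("call 404-555-0123 about rent")
```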
5. Scrubbing Query Content (user control) • Two ways • Scrub everything the search engine deemed to be identifying • User defined
6. Deleting Infrequent Query • remove queries appearing infrequently • the vast majority of queries occur a small number of times [Beitzel et al. 2004; Spink et al. 2001] • Infrequent queries right now might not necessarily be infrequent for ever (new professional athletes or celebrities, new slang, new product names, etc.)
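A frequency threshold can be sketched as below. The `min_count` parameter and the normalization (trimming and lowercasing) are assumptions made for the example; choosing the threshold is exactly the difficulty noted on the slide, since a query that is rare today may become common later.

```python
from collections import Counter

def drop_infrequent(queries, min_count):
    """Keep only queries whose normalized form occurs at least min_count times."""
    counts = Counter(q.strip().lower() for q in queries)
    return [q for q in queries if counts[q.strip().lower()] >= min_count]

queries = [
    "lottery", "lottery", "Lottery",
    "ozark horse blankets",
    "gallstones", "gallstones",
]
kept = drop_infrequent(queries, min_count=2)
```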
6. Deleting Infrequent Query (user control) • Difficult • Hard to define infrequent query
7. Shortening session • shorten the length of time that any identifier is associated with an individual [Xiong, 2007]. • a user may be assigned a new identifier every month, day, or hour • Or when users close their browsers or when they navigate away from the search engine’s site.
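One way to sketch time-based identifier rotation is to derive the logged identifier from a stable one plus the current time window. Everything here is an assumption made for illustration: the keyed hash, the `session_identifier` name, and the fixed-length windows. Note that whoever holds the key and the stable identifier can still re-link windows, so a real deployment would also need to discard one of them.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical secret; in practice it would be random and stored separately.
SECRET_KEY = b"replace-with-a-random-secret"

def session_identifier(stable_id, when, window_seconds, key=SECRET_KEY):
    """Derive an identifier that changes every `window_seconds`."""
    bucket = int(when.timestamp()) // window_seconds
    message = f"{stable_id}|{bucket}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]

DAY, HOUR = 86400, 3600
t1 = datetime(2006, 3, 1, 11, 58, 51, tzinfo=timezone.utc)
t2 = datetime(2006, 3, 1, 14, 10, 38, tzinfo=timezone.utc)
t3 = datetime(2006, 3, 27, 14, 10, 38, tzinfo=timezone.utc)

same_day = session_identifier("cookie-217", t1, DAY) == session_identifier("cookie-217", t2, DAY)
same_hour = session_identifier("cookie-217", t1, HOUR) == session_identifier("cookie-217", t2, HOUR)
```

With daily windows the two queries made on 2006-03-01 stay linked; with hourly windows they do not. Shorter windows mean shorter linkable histories and less utility for personalization.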
7. Shortening session (user control) • easy • provide the option of clearing identifiers stored on the search engine’s side • managing which users have requested shorter sessions and when their sessions expire may be expensive
Conclusion • It is possible to collect/retain query logs while protecting user privacy • Technical challenges to strike the balance between retaining query log utility and protecting user privacy • Policy and user education challenges
Data privacy on the web – some case studies • A comparison of privacy practices of internet service companies • Query log retention and its privacy implications • Information revelation patterns and their privacy implications on social network sites
Motivation • Mass adoption • Number of online social networking sites has increased • Dramatic increase of online network participants each year • Information revelation behavior of participants • More open than offline social networks
Online Vs. Offline Networks • Offline social networks contain diverse relations. • Examples – Family, Friend, Co-Worker, Roommate, Acquaintance, Classmate, Teammate, Enemy, etc. • Online social networks simplify relations to simplistic binary relations such as “Friend or not”. • How does someone qualify as a “Friend or not”? What is the measurement? • Most users tend to list anyone (as a Friend) who they know and do not actively dislike.
Online Vs. Offline Networks • An offline social network may include up to a dozen intimate or significant ties and 1000 to 1700 “acquaintances” or “interactions”. • Online social networks can list hundreds of direct “friends” and include hundreds of thousands of additional “friends” within just three degrees of separation from a subject.
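How reach grows with degrees of separation can be sketched with a breadth-first search over a friendship graph. The toy graph and both function names are hypothetical; `upper_bound` assumes every member has the same number of friends and that friend lists never overlap, which overstates real networks.

```python
from collections import deque

def within_degrees(graph, start, max_degree):
    """Map each profile reachable within max_degree hops to its distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance

def upper_bound(friends_each, degrees):
    """Largest possible reach if no two friend lists overlap."""
    return sum(friends_each * (friends_each - 1) ** i for i in range(degrees))

# Toy graph: a chain a-b-c-d-e plus one extra friend f of a.
graph = {
    "a": ["b", "f"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d"], "f": ["a"],
}
reach = within_degrees(graph, "a", 3)
```

Under the no-overlap assumption, 100 friends per member gives a bound of 990,100 profiles within three degrees. Real friend lists overlap heavily, so actual counts are lower.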
Online Social Networks - Privacy Implications • The level of identifiability of the information • The possible recipients of the information • The possible uses of the information
Online Social Networks - Privacy Implications 1. Level of identifiability • Sites that don’t expose user identity may provide enough information to identify the profile’s owner • Examples: • Face re-identification through photos used across different sites • Demographic data • Category-based representations of interests that reveal unique or rare overlaps of hobbies or tastes • Information Revelation (two possibilities) • Identify “anonymous” profile through previous knowledge of profile owner’s characteristics or traits (identity disclosure) • Allowing a party to infer previously unknown characteristics or traits about an identified profile (attribute disclosure)
Online Social Networks - Privacy Implications 2. Possible Recipients – Who has access to the profile information? • Hosting site / Company • The site’s social network (in some cases site visitors) • Hackers • Government Agencies
Online Social Networks - Privacy Implications 3. Possible uses – how can social network profile information be used? • Dependent upon information provided (may be extensive and intimate in some cases) • Possible uses (risks) • Identity theft • Online/physical stalking • Embarrassment • Blackmail
Analysis - The Facebook.com • Gross and Acquisti, 2005 • In June 2005, the authors searched for all “female” and all “male” profiles for CMU Facebook members using Facebook’s advanced search feature and extracted their profile IDs. • Using the extracted IDs, they downloaded a total of 4540 profiles – virtually the entire CMU Facebook population at the time of the study.
The Facebook.com Demographics