Data Mining Technologies for Digital Libraries & Web Information Systems

Data Mining Technologies for Digital Libraries & Web Information Systems Ramakrishnan Srikant

Talk Outline • Taxonomy Integration (WWW 2001, with R. Agrawal) • Searching with Numbers • Privacy-Preserving Data Mining

Taxonomy Integration • B2B electronics portal: 2000 categories, 200K datasheets • [Figure: Master Catalog (ICs with subcategories DSP, Mem., Logic; products a, b, c, d, e, f) next to New Catalog (ICs with subcategories Cat1, Cat2; products x, y, z, w)]

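Taxonomy Integration (2) • After integration: [Figure: master catalog tree ICs with subcategories DSP, Mem., Logic, now also holding the new-catalog products: a, b, x, y, c, d, e, f, z, w]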

Goal • Use affinity information in new catalog. • Products in same category are similar. • Accuracy boost depends on match between two categorizations.

Problem Statement • Given • master categorization M: categories C1, C2, …, Cn • set of documents in each category • new categorization N: categories S1, S2, …, Sm • set of documents in each category • Find the category in M for each document in N • Standard Alg: Estimate Pr(Ci | d) • Enhanced Alg: Estimate Pr(Ci | d, S)

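Naïve Bayes Classifier • Estimate probability of document d belonging to class Ci: Pr(Ci | d) = Pr(Ci) Pr(d | Ci) / Pr(d) • Where Pr(d | Ci) is the product of Pr(t | Ci) over the words t in d (words are assumed independent given the class)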

Enhanced Naïve Bayes • Standard: Pr(Ci | d) ∝ Pr(Ci) Pr(d | Ci) • Enhanced: Pr(Ci | d, S) ∝ Pr(Ci | S) Pr(d | Ci) • How do we estimate Pr(Ci|S)? • Apply standard Naïve Bayes to get number of documents in S that are classified into Ci • Incorporate weight w reflecting match between two taxonomies. • Only affects classification of borderline documents. • For w = 0, defaults to standard classifier.

Enhanced Naïve Bayes (2) • Use tuning set to determine w.
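The two-pass idea on the preceding slides can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function names, the add-one smoothing, and the exact form of the category prior (size of Ci times the first-pass vote count raised to the power w, with a +1 to avoid a zero count) are assumptions chosen so that w = 0 reduces to the standard classifier.

```python
import math
from collections import Counter

def train(docs_by_class):
    """Build a model from {class name: list of documents (token lists)}."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        model[c] = {"size": len(docs), "counts": counts,
                    "total": sum(counts.values())}
    return model, vocab

def log_likelihood(model, vocab, c, doc):
    # log Pr(d | Ci) with add-one smoothing (an assumed smoothing choice)
    m = model[c]
    denom = m["total"] + len(vocab)
    return sum(math.log((m["counts"][t] + 1) / denom) for t in doc)

def classify_standard(model, vocab, doc):
    # Pr(Ci | d) is proportional to |Ci| * Pr(d | Ci)
    return max(model, key=lambda c: math.log(model[c]["size"])
               + log_likelihood(model, vocab, c, doc))

def classify_enhanced(model, vocab, source_docs, w):
    """Classify all documents of one source category S together."""
    # First pass: count how many documents of S standard NB puts in each Ci.
    votes = Counter(classify_standard(model, vocab, d) for d in source_docs)

    def log_prior_given_s(c):
        # Assumed form: Pr(Ci | S) proportional to |Ci| * (votes + 1)^w
        return math.log(model[c]["size"]) + w * math.log(votes[c] + 1)

    return [max(model, key=lambda c: log_prior_given_s(c)
                + log_likelihood(model, vocab, c, d)) for d in source_docs]

# Hypothetical toy data: two master categories and one source category.
master = {"DSP": [["dsp", "filter"], ["dsp", "audio"]],
          "Mem": [["sram", "memory"], ["dram", "memory"]]}
source = [["dsp", "filter"], ["dsp", "audio"], ["dsp"], ["memory"]]
model, vocab = train(master)
labels_w0 = classify_enhanced(model, vocab, source, w=0)
labels_w2 = classify_enhanced(model, vocab, source, w=2)
```

In this toy run the borderline document `["memory"]` is the only one whose label depends on w: with w = 0 it follows the word evidence, and with w = 2 it follows the three sibling documents from the same source category.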

Intuition behind Algorithm • [Figure: two panels, Standard Algorithm and Enhanced Algorithm]

Electronic Parts Dataset • 1150 categories; 37,000 documents

Yahoo & OpenDirectory • 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software • Typical match: 69%, 15%, 3%, 3%, 1%, …. • Merging Yahoo into OpenDirectory • 30% fewer errors (14.1% absolute difference in accuracy) • Merging OpenDirectory into Yahoo • 26% fewer errors (14.3% absolute difference)

Summary • New algorithm for taxonomy integration. • Exploits affinity information in the new (source) taxonomy categorizations. • Can do substantially better, and never does significantly worse than standard Naïve Bayes. • Open Problems: SVM, Decision Tree, ...

Talk Outline • Taxonomy Integration • Searching with Numbers (WWW 2002, with R. Agrawal) • Privacy-Preserving Data Mining

Motivation • A large fraction of the useful web consists of specification documents. • <attribute name, value> pairs embedded in text. • Examples: • Data sheets for electronic parts. • Classified ads. • Product catalogs.

Search Engines treat Numbers as Strings • Search for 6798.32 (lunar nutation cycle) • Returned 2 pages on Google • However, a search for 6798.320 yielded no pages on Google (or on any other search engine) • Current search technology is inadequate for retrieving specification documents.
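The difference between string matching and numeric matching can be shown in a few lines. This is an illustrative sketch only; the page text and the function names are made up for the example.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical page text containing the number from the slide.
page_text = "The lunar nutation cycle lasts 6798.32 days"

def string_match(query, text):
    """Keyword-style match: the query must equal a token character by character."""
    return query in text.split()

def numeric_match(query, text):
    """Numeric match: compare the values of tokens that parse as numbers."""
    target = Decimal(query)
    for token in text.split():
        try:
            if Decimal(token) == target:
                return True
        except InvalidOperation:
            continue  # token is not a number
    return False

found_as_string = string_match("6798.320", page_text)    # False
found_as_number = numeric_match("6798.320", page_text)   # True
```

Treated as a string, the trailing zero makes the query a different token; treated as a number, both spellings denote the same value.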

Data Extraction is hard • Synonyms for attribute names and units. • "lb" and "pounds", but no "lbs" or "pound". • Attribute names are often missing. • No "Speed", just "MHz Pentium III" • No "Memory", just "MB SDRAM" • Example listing: 850 MHz Intel Pentium III; 192 MB RAM; 15 GB Hard Disk; DVD Recorder: Included; Windows Me; 14.1 inch display; 8.0 pounds

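Searching with Numbers • [Figure: text documents ("IBM ThinkPad 750 MHz Pentium 3, 196 MB DRAM, …"; "Dell Computer 700 MHz Celeron, 256 MB SDRAM, …") and database records (IBM ThinkPad (750 MHz, 196 MB) …; Dell (700 MHz, 256 MB)) are searched with number-only queries such as "800 200" and "800 200 3 lb"]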

Reflectivity • If we get a close match on numbers, how likely is it that we have correctly matched attribute names? • Likelihood ≈ Non-reflectivity (of data) • Non-overlapping attributes ⇒ Non-reflective. • Memory: 64 - 512 MB, Disk: 10 - 40 GB • Correlations or Clustering ⇒ Low reflectivity. • Memory: 64 - 512 MB, Disk: 10 - 100 GB

Reflectivity: Examples

Reflectivity: Definition • Let • D: dataset • ni: co-ordinates of point xi • reflections(xi): permutations of ni • η(ni): # of points within distance r of ni • ρ(ni): # of reflections within distance r of ni
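The slide lists the ingredients but the combining formula did not survive in the transcript. The sketch below therefore rests on two stated assumptions: the second count is taken as the number of points that have at least one permutation of their co-ordinates within distance r, and reflectivity is taken as 1 minus the average ratio of the first count to the second. The names `eta`, `rho`, and `reflectivity` are illustrative.

```python
import math
from itertools import permutations

def reflectivity(points, r):
    """Assumed measure: 1 - average(eta / rho) over all points.

    eta: number of points within distance r of the point (itself included).
    rho: number of points with at least one co-ordinate permutation
         (the identity included) within distance r of the point.
    """
    total_ratio = 0.0
    for p in points:
        eta = sum(1 for q in points if math.dist(p, q) <= r)
        rho = sum(1 for q in points
                  if any(math.dist(p, perm) <= r for perm in permutations(q)))
        total_ratio += eta / rho
    return 1.0 - total_ratio / len(points)

# Non-overlapping attribute ranges: swapping co-ordinates never lands near a point.
non_overlapping = [(64, 10000), (128, 20000)]
# Overlapping ranges: each point is a reflection of the other.
overlapping = [(10, 20), (20, 10)]

low = reflectivity(non_overlapping, r=10)   # 0.0
high = reflectivity(overlapping, r=1)       # 0.5
```

Under these assumptions, data whose attributes occupy disjoint ranges scores 0, so a close match on numbers implies the attributes were matched correctly.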

Algorithm • How to compute match score (rank) of a document for a given query? • How to limit the number of documents for which the match score is computed?

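Match Score of a Document • Select k numbers from D yielding minimum distance between Q and D. • Relative distance for each term: f(qi, nj) = |qi - nj| / |qi| • Lp norm (Euclidean distance for p = 2) to combine term distances: F(Q, D) = (Σi f(qi, nji)^p)^(1/p), where nji is the document number matched to qi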

Bipartite Graph Matching • Map problem to Bipartite Graph Matching • k source nodes: corr. to query numbers • m target nodes: corr. to document numbers • An edge from each source to k nearest targets. Assign weight f(qi, nj)^p to the edge (qi, nj). • Example: Query: 20, 60; Doc: 10, 25, 75; relative distances on the edges: f(20, 10) = .5, f(20, 25) = .25, f(60, 25) = .58, f(60, 75) = .25
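The match score for the slide's example can be checked with a short sketch. Two assumptions are made: the relative distance is |q - n| / |q|, which reproduces the edge values .25, .58, and .5 in the example, and the term distances are combined with an Lp norm. For brevity the sketch tries every assignment of distinct document numbers by brute force instead of running a bipartite matching algorithm; that is fine for a handful of numbers but not a substitute for the real thing.

```python
from itertools import permutations

def relative_distance(q, n):
    # Assumed form; q must be non-zero.
    return abs(q - n) / abs(q)

def match_score(query, doc_numbers, p=2):
    """Smallest Lp-combined distance over all ways of assigning a distinct
    document number to each query number (brute force, for illustration)."""
    best = float("inf")
    for chosen in permutations(doc_numbers, len(query)):
        total = sum(relative_distance(q, n) ** p
                    for q, n in zip(query, chosen))
        best = min(best, total ** (1 / p))
    return best

# The example from the slide: query numbers 20 and 60, document numbers 10, 25, 75.
score_l1 = match_score([20, 60], [10, 25, 75], p=1)   # 0.25 + 0.25 = 0.5
score_l2 = match_score([20, 60], [10, 25, 75], p=2)   # sqrt(0.125)
```

The best assignment pairs 20 with 25 and 60 with 75, the two edges of weight .25 in the example graph.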

Limiting the Set of Documents • Similar to the score aggregation problem [Fagin, PODS 96] • Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]

Limiting the set of documents • k conceptual sorted lists, one for each query term • Do round robin access to the lists. For each document found, compute its distance F(D,Q) • Let ni := number last looked at for query term qi • Let τ := (Σi f(qi, ni)^p)^(1/p) • Halt when t documents found whose distance <= τ • τ is lower bound on distance of unseen documents
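The halting rule above can be sketched as a small threshold-style loop. This is a simplified illustration under stated assumptions, not the algorithm from the cited papers: the sorted lists are materialized in memory, the relative distance |q - n| / |q| and the Lp combination are assumed, the document score is computed by brute force, and all names (`rel`, `score`, `top_t`, `tau`) are hypothetical.

```python
from itertools import permutations

def rel(q, n):
    return abs(q - n) / abs(q)  # assumed relative distance, q non-zero

def score(query, numbers, p):
    """Distance F(D, Q): best Lp-combined assignment, by brute force."""
    return min((sum(rel(q, n) ** p for q, n in zip(query, chosen)) ** (1 / p)
                for chosen in permutations(numbers, len(query))),
               default=float("inf"))

def top_t(query, docs, t, p=2):
    """docs maps a document id to the list of numbers found in it."""
    # One sorted list per query term: (distance to that term, document id).
    lists = [sorted((rel(q, n), doc_id)
                    for doc_id, nums in docs.items() for n in nums)
             for q in query]
    scored = {}
    for depth in range(len(lists[0]) if lists else 0):
        last_seen = []
        for lst in lists:                      # round-robin access
            dist, doc_id = lst[depth]
            last_seen.append(dist)
            if doc_id not in scored:
                scored[doc_id] = score(query, docs[doc_id], p)
        # Threshold: no unseen document can have a distance below tau.
        tau = sum(d ** p for d in last_seen) ** (1 / p)
        if sum(1 for s in scored.values() if s <= tau) >= t:
            break
    return sorted(scored.items(), key=lambda item: item[1])[:t]

# Hypothetical documents, loosely following the earlier laptop example.
docs = {"thinkpad": [750, 196], "dell": [700, 256], "other": [5, 3]}
best = top_t([800, 200], docs, t=1)
```

Every number of an unseen document lies at or beyond the current position in each list, so its combined distance cannot fall below the threshold, which is why the loop may stop early.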

Empirical Results

Empirical Results (2) • Screen Shot

Incorporating Hints • Use simple data extraction techniques to get hints. • Names/Units in query matched against Hints. • Example: "256 MB SDRAM memory" gives Unit Hint: MB and Attribute Hint: SDRAM, memory

Summary • Allows querying using only numbers or numbers + hints. • Data can come from raw text (e.g. product descriptions) or databases. • End run around data extraction. • Use simple extractor to generate hints. • Open Problems: integration with keyword search.

Talk Outline • Taxonomy Integration • Searching with Numbers • Privacy-Preserving Data Mining • Motivation • Classification • Associations

Growing Privacy Concerns • Popular Press: • Economist: The End of Privacy (May 99) • Time: The Death of Privacy (Aug 97) • Govt. legislation: • European directive on privacy protection (Oct 98) • Canadian Personal Information Protection Act (Jan 2001) • Special issue on internet privacy, CACM, Feb 99 • S. Garfinkel, "Database Nation: The Death of Privacy in the 21st Century", O'Reilly, Jan 2000

Privacy Concerns (2) • Surveys of web users • 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99) • 82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)

Technical Question • Fear: • "Join" (record overlay) was the original sin. • Data mining: new, powerful adversary? • The primary task in data mining: development of models about aggregated data. • Can we develop accurate models without access to precise information in individual data records?

Talk Outline • Taxonomy Integration • Searching with Numbers • Privacy-Preserving Data Mining • Motivation • Private Information Retrieval • Classification (SIGMOD 2000, with R. Agrawal) • Associations

Web Demographics • Volvo S40 website targets people in 20s • Are visitors in their 20s or 40s? • Which demographic groups like/dislike the website?

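Solution Overview • [Figure: each user's record (e.g. 30 | 70K | ..., 50 | 40K | ...) passes through a Randomizer, producing randomized records (e.g. 65 | 20K | ..., 25 | 60K | ...); the distributions of Age, Salary, ... are reconstructed from the randomized data and passed to Data Mining Algorithms, which build the Model]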

Reconstruction Problem • Original values x1, x2, ..., xn • from probability distribution X (unknown) • To hide these values, we use y1, y2, ..., yn • from probability distribution Y • Given • x1+y1, x2+y2, ..., xn+yn • the probability distribution of Y • Estimate the probability distribution of X.

Intuition (Reconstruct single point) • Use Bayes' rule for density functions

Reconstructing the Distribution • Combine estimates of where point came from for all the points: • Gives estimate of original distribution.

Reconstruction: Bootstrapping • fX0 := Uniform distribution • j := 0 // Iteration number • repeat • (Bayes' rule) • j := j+1 • until (stopping criterion met) • Converges to maximum likelihood estimate. • D. Agrawal & C.C. Aggarwal, PODS 2001.
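The iteration can be sketched on a discretized value range. This is an illustrative sketch under assumptions: X is restricted to a fixed set of bin centers, the Bayes update averages, over all observations w = x + y, the posterior probability that the original value fell in each bin, and a fixed iteration count stands in for the stopping criterion. The function and parameter names are hypothetical.

```python
def reconstruct_distribution(observed, noise_density, bin_centers, iterations=20):
    """Estimate the distribution of X from observed values w_i = x_i + y_i.

    noise_density(y) is the known density of the noise Y.
    Returns one probability per bin center.
    """
    k = len(bin_centers)
    f_x = [1.0 / k] * k                      # start from the uniform distribution
    for _ in range(iterations):              # fixed count instead of a stopping test
        updated = [0.0] * k
        for w in observed:
            # Bayes' rule: posterior over bins for the point that produced w.
            joint = [noise_density(w - a) * f_x[j]
                     for j, a in enumerate(bin_centers)]
            norm = sum(joint)
            if norm == 0.0:
                continue                     # w is impossible under the current estimate
            for j in range(k):
                updated[j] += joint[j] / norm
        total = sum(updated)
        if total == 0.0:
            break
        f_x = [u / total for u in updated]   # average of the per-point posteriors
    return f_x

def uniform_noise(y, half_width=1.5):
    """Density of noise drawn uniformly from [-half_width, +half_width]."""
    return 1.0 / (2 * half_width) if abs(y) <= half_width else 0.0

# Tiny deterministic example: every observation equals 5.0.
estimate = reconstruct_distribution([5.0] * 10, uniform_noise, list(range(11)))
```

With a single repeated observation the estimate can only say that the original values lie within the noise width of 5, which matches the recap that distributions, not individual values, are recovered.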

Seems to work well!

Recap: Why is privacy preserved? • Cannot reconstruct individual values accurately. • Can only reconstruct distributions.

Talk Outline • Taxonomy Integration • Searching with Numbers • Privacy-Preserving Data Mining • Motivation • Private Information Retrieval • Classification • Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)

Association Rules • Given: • a set of transactions • each transaction is a set of items • Association Rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain these items. • 30% : confidence of the rule. • 5% : support of the rule. • Find all association rules that satisfy user-specified minimum support and minimum confidence constraints. • Can be used to generate recommendations.
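Support and confidence can be computed directly from their definitions. The sketch below uses made-up transactions sized so that the rule from the slide comes out at 5% support and 30% confidence; the function names are illustrative.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    return count_containing(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)

# Hypothetical data: 60 transactions, 10 with Book1 and Book5, 3 of them with Book20.
transactions = ([{"Book1", "Book5", "Book20"}] * 3
                + [{"Book1", "Book5"}] * 7
                + [{"Book2"}] * 50)

rule_support = support(transactions, {"Book1", "Book5", "Book20"})      # 3/60
rule_confidence = confidence(transactions, {"Book1", "Book5"}, {"Book20"})  # 3/10
```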

Recommendations Overview • [Figure: Alice (Book 1, Book 7, Book 21 / Book 1, Book 11, Book 21) and Bob (Book 3, Book 25 / Book 5, Book 25) send randomized transactions to the Recommendation Service, which performs Support Recovery, finds Associations, and returns Recommendations]

Private Information Retrieval • Retrieve 1 of n documents from a digital library without the library knowing which document was retrieved. • Trivial solution: Download entire library. • Can you do better? • Yes, with multiple servers. • Yes, with single server & computational privacy. • Problem introduced in [Chor et al, FOCS 95]
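The multi-server answer can be illustrated with the simplest textbook two-server scheme, shown here for a database of bits. It is a sketch of the privacy mechanism only, assuming the two servers do not collude: each query is as long as the database, so it does not beat the trivial solution on communication, which is what the more refined constructions address. All function names are illustrative.

```python
import secrets

def make_queries(n, wanted):
    """Return two index sets that differ only in the wanted position.

    Each set on its own is a uniformly random subset of the indices,
    so a single server learns nothing about `wanted`.
    """
    first = {i for i in range(n) if secrets.randbits(1)}
    second = first ^ {wanted}          # symmetric difference flips one index
    return first, second

def server_answer(database_bits, subset):
    """Each server returns the XOR of the bits it was asked about."""
    answer = 0
    for index in subset:
        answer ^= database_bits[index]
    return answer

def retrieve(database_bits, wanted):
    """Client side: XOR of the two answers is the wanted bit."""
    first, second = make_queries(len(database_bits), wanted)
    return (server_answer(database_bits, first)
            ^ server_answer(database_bits, second))

database = [1, 0, 1, 1, 0, 0, 1, 0]    # both servers hold a copy
bit = retrieve(database, wanted=3)
```

Every index except the wanted one appears in both sets or in neither, so its bit cancels in the final XOR and only the wanted bit remains.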

Uniform Randomization • Given a transaction, • keep item with 20% probability, • replace with a new random item with 80% probability. • Appears to give around 80% privacy… • 80% chance that an item in the randomized transaction was not in the original transaction.
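The operator on this slide can be sketched as follows. The sketch assumes items are integers in the range 0 to num_items - 1 and makes no attempt to avoid drawing a replacement that equals the original item or duplicates another item; the function and parameter names are illustrative.

```python
import random

def randomize(transaction, num_items, keep_prob=0.2, rng=random):
    """Keep each item with probability keep_prob, otherwise replace it
    with an item drawn uniformly at random from the item universe."""
    randomized = []
    for item in transaction:
        if rng.random() < keep_prob:
            randomized.append(item)
        else:
            randomized.append(rng.randrange(num_items))
    return randomized

rng = random.Random(0)                  # seeded only to make the run repeatable
randomized = randomize([1, 5, 20], num_items=1000, rng=rng)
```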

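Privacy Breach Example • 10 M transactions of size 3 with 1000 items: • 100,000 (1%) have {x, y, z}: all three items are kept with probability 0.2^3 = .008, giving 800 randomized transactions that still contain {x, y, z} • 9,900,000 (99%) have zero items from {x, y, z}: randomization produces {x, y, z} with probability 6 * (0.8/1000)^3 = 3 * 10^-9, giving .03 transactions (<< 1) • So of the randomized transactions containing {x, y, z}, 99.99% come from originals that contained it and 0.01% do not • 80% privacy "on average," but not for all items!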
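The arithmetic in this example can be reproduced in a few lines. The variable names are illustrative; the figures are the ones stated on the slide.

```python
from math import factorial

with_xyz = 100_000          # transactions that are exactly {x, y, z}
without_xyz = 9_900_000     # transactions with no item from {x, y, z}
num_items = 1000
keep_prob = 0.2

# All three original items survive randomization.
p_survive = keep_prob ** 3                                             # 0.008
# All three items are replaced and land on x, y, z in any of the 3! orders.
p_created = factorial(3) * ((1 - keep_prob) / num_items) ** 3          # about 3e-9

true_hits = with_xyz * p_survive        # about 800 transactions
false_hits = without_xyz * p_created    # about 0.03 transactions

# Chance that a randomized {x, y, z} came from an original {x, y, z}.
breach_probability = true_hits / (true_hits + false_hits)              # about 0.9999
```

Seeing {x, y, z} in a randomized transaction therefore reveals the original almost with certainty, even though an average item enjoys the nominal 80% privacy.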
