Data Mining Algorithms for Recommendation Systems
Zhenglu Yang, University of Tokyo
Some slides are from online materials.
Applications • Corporate intranets
System Inputs • Interaction data (users × items) • Explicit feedback – ratings, comments • Implicit feedback – purchases, browsing • User/item individual data • User side: • Structured attribute information • Personal description • Social network • Item side: • Structured attribute information • Textual description/content information • Taxonomy of item (category)
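As a concrete illustration of these inputs, the sketch below (all data and names are hypothetical) collects explicit-feedback triples into the sparse user-item structure that the algorithms later in the deck consume:

```python
# Hypothetical explicit-feedback log: (user, item, rating) triples
explicit_feedback = [
    ("alice", "item1", 5), ("alice", "item2", 3),
    ("bob",   "item1", 4), ("bob",   "item3", 2),
]

# Nested dict as a sparse user-item matrix: ratings[user][item] = rating
ratings = {}
for user, item, r in explicit_feedback:
    ratings.setdefault(user, {})[item] = r

print(ratings["alice"])   # {'item1': 5, 'item2': 3}
```

Implicit feedback (purchases, browsing) can be folded into the same structure, e.g., by recording a 1 for every observed interaction.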
Interaction between Users and Items • Observed preferences (purchases, ratings, page views, bookmarks, etc.)
Profiles of Users and Items • User profile: (1) Attribute – nationality, sex, age, hobby, etc. (2) Text – personal description (3) Link – social network • Item profile: (1) Attribute – price, weight, color, brand, etc. (2) Text – product description (3) Link – taxonomy of item (category)
All Information about Users and Items • User profile: (1) Attribute – nationality, sex, age, hobby, etc. (2) Text – personal description (3) Link – social network • Item profile: (1) Attribute – price, weight, color, brand, etc. (2) Text – product description (3) Link – taxonomy of item (category) • Observed preferences (purchases, ratings, page views, bookmarks, etc.)
KDD and Data Mining • Data mining is a multi-disciplinary field at the intersection of machine learning, artificial intelligence, databases, statistics, and natural language processing.
Recommendation Approaches • Collaborative filtering • Uses interaction data (the user-item matrix) • Process: identify similar users, then extrapolate from their ratings • Content-based strategies • Uses profiles of users/items (features) • Process: generate rules/classifiers that are used to classify new items • Hybrid approaches
A Brief Introduction • Collaborative filtering • Nearest neighbor based • Model based
Recommendation Approaches • Collaborative filtering • Nearest neighbor based • User based • Item based • Model based
User-based Collaborative Filtering • Idea: people who agreed in the past are likely to agree again • To predict a user’s opinion for an item, use the opinion of similar users • Similarity between users is decided by looking at their overlap in opinions for other items
User-based CF (Ratings) • [Figure: user-item rating matrix, with ratings on a scale from 1 (bad) to 10 (good)]
Similarity between Users • Only consider items both users have rated • Common similarity measures: • Cosine similarity • Pearson correlation
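A minimal sketch of both measures, restricted to the co-rated items as stated above; the arguments r_u and r_v are assumed to be dicts mapping item to rating for each user (an illustrative convention, not a fixed API):

```python
import math

def cosine_sim(r_u, r_v):
    # Cosine similarity over the items both users have rated
    common = set(r_u) & set(r_v)
    if not common:
        return 0.0
    num = sum(r_u[i] * r_v[i] for i in common)
    den = (math.sqrt(sum(r_u[i] ** 2 for i in common)) *
           math.sqrt(sum(r_v[i] ** 2 for i in common)))
    return num / den if den else 0.0

def pearson_sim(r_u, r_v):
    # Pearson correlation: like cosine, but centered on each user's
    # mean rating over the common items
    common = set(r_u) & set(r_v)
    if len(common) < 2:
        return 0.0
    mu_u = sum(r_u[i] for i in common) / len(common)
    mu_v = sum(r_v[i] for i in common) / len(common)
    num = sum((r_u[i] - mu_u) * (r_v[i] - mu_v) for i in common)
    den = (math.sqrt(sum((r_u[i] - mu_u) ** 2 for i in common)) *
           math.sqrt(sum((r_v[i] - mu_v) ** 2 for i in common)))
    return num / den if den else 0.0
```

Centering on the user mean is what lets Pearson compensate for users who rate systematically high or low.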
Recommendation Approaches • Collaborative filtering • Nearest neighbor based • User based • Item based • Model based • Content-based strategies • Hybrid approaches
Item-based Collaborative Filtering • Idea: a user is likely to have the same opinion for similar items • Similarity between items is decided by looking at how other users have rated them
Similarity between Items • Only consider users who have rated both items • Common similarity measures: • Cosine similarity • Pearson correlation
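Under the same illustrative conventions, a small sketch of the item-based route: item_cols maps each item to a dict of user → rating, and a user's predicted score for a target item is a similarity-weighted average of their own ratings on other items (all names are hypothetical):

```python
import math

def item_cosine(col_i, col_j):
    # Cosine similarity between two items over users who rated both
    raters = set(col_i) & set(col_j)
    if not raters:
        return 0.0
    num = sum(col_i[u] * col_j[u] for u in raters)
    den = (math.sqrt(sum(col_i[u] ** 2 for u in raters)) *
           math.sqrt(sum(col_j[u] ** 2 for u in raters)))
    return num / den if den else 0.0

def predict(user_items, item_cols, target):
    # Weighted average of the user's own ratings, weighted by each
    # rated item's similarity to the target item
    sims = [(item_cosine(item_cols[target], item_cols[i]), r)
            for i, r in user_items.items() if i != target]
    norm = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / norm if norm else 0.0
```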
Recommendation Approaches • Collaborative filtering • Nearest neighbor based • Model based • Matrix factorization (e.g., SVD) • Content-based strategies • Hybrid approaches
Singular Value Decomposition (SVD) • A general mathematical method applicable to many problems • Given any m×n matrix R, find matrices U, Σ, and V such that R = UΣVᵀ • U is m×r and orthonormal • Σ is r×r and diagonal, with the singular values σ₁ ≥ … ≥ σᵣ on its diagonal • V is n×r and orthonormal • Discard the smallest singular values to get R_k with k ≪ r • R_k is the best rank-k approximation of R, based on the k most important latent features • Recommendations are given based on R_k
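A minimal NumPy sketch of the rank-k truncation described above, on a toy matrix; treating unobserved entries as 0 is a simplification for illustration only:

```python
import numpy as np

R = np.array([[5, 3, 0, 1],    # toy user-item rating matrix
              [4, 0, 0, 1],    # (0 marks an unobserved entry,
              [1, 1, 0, 5],    #  treated as 0 here for simplicity)
              [0, 1, 5, 4]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                          # keep the k largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_k is the best rank-k approximation of R in the least-squares
# sense; reconstructed entries serve as predicted preference scores.
print(np.round(R_k, 2))
```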
Problems with Collaborative Filtering • Cold Start: There needs to be enough other users already in the system to find a match. • Sparsity: If there are many items to be recommended, even if there are many users, the user/ratings matrix is sparse, and it is hard to find users that have rated the same items. • First Rater: Cannot recommend an item that has not been previously rated. • New items • Esoteric items • Popularity Bias: Cannot recommend items to someone with unique tastes. • Tends to recommend popular items.
Recommendation Approaches • Collaborative filtering • Content-based strategies • Hybrid approaches
Profiles of Users and Items • User profile: (1) Attribute – nationality, sex, age, hobby, etc. (2) Text – personal description (3) Link – social network • Item profile: (1) Attribute – price, weight, color, brand, etc. (2) Text – product description (3) Link – taxonomy of item (category)
Advantages of Content-Based Approach • No need for data on other users. • No cold-start or sparsity problems. • Able to recommend to users with unique tastes. • Able to recommend new and unpopular items • No first-rater problem. • Can provide explanations of recommended items by listing content-features that caused an item to be recommended.
Recommendation Approaches • Collaborative filtering • Content-based strategies • Association rule mining • Text similarity based • Clustering • Classification • Hybrid approaches
Traditional Data Mining Techniques • Association Rule Mining (ARM) • Sequential Pattern Mining (SPM)
ARM Example: Market Basket Data • Items frequently purchased together, e.g., beer and diapers • Uses: • Recommendation • Placement • Sales • Coupons • Objective: increase sales and reduce costs
Association Rule Mining • Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction • Market-basket transactions • Example association rules: {Beer} → {Diaper}, {Coke} → {Eggs} • Implication means co-occurrence, not causality!
Some Definitions • An itemset is supported by a transaction if it is included in the transaction • Market-basket transactions • <Beer, Diaper> is supported by transactions 1 and 3, so its support is 2/4 = 50%
Some Definitions • If the support of an itemset exceeds a user-specified min_support (threshold), the itemset is called a frequent itemset (pattern) • Market-basket transactions • min_support = 50% • <Beer, Diaper> is a frequent itemset • <Beer, Milk> is not a frequent itemset
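A short sketch of this support computation, on a hypothetical four-transaction database chosen to reproduce the 2/4 = 50% figure above:

```python
transactions = [                 # hypothetical market-basket data
    {"Beer", "Diaper", "Milk"},
    {"Beer", "Coke", "Eggs"},
    {"Beer", "Diaper", "Eggs", "Coke"},
    {"Milk", "Diaper"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({"Beer", "Diaper"}))  # 0.5: frequent at min_support = 50%
print(support({"Beer", "Milk"}))    # 0.25: not frequent
```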
Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining
Apriori Algorithm • Proposed by Agrawal et al. [VLDB’94] • One of the first algorithms for association rule mining • Candidate generation-and-test approach • Introduced the anti-monotone property
Apriori Algorithm • [Figure: the running market-basket transaction table]
A Naive Algorithm • [Figure: enumerate every candidate itemset and count its support in one pass; min_sup = 2]
Apriori Algorithm • Anti-monotone property: if an itemset is not frequent, then none of its supersets is frequent • [Figure: candidate lattice pruned by this property; min_sup = 2]
Apriori Algorithm • [Figure: level-wise flow with min_sup = 2: candidates C1 are counted in the 1st scan to yield frequent itemsets L1; C2 is generated from L1 and counted in the 2nd scan to yield L2; C3 is generated from L2 and counted in the 3rd scan to yield L3]
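A compact sketch of the whole generate-and-test loop, assuming transactions are given as Python sets; this illustrates the principle rather than an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # L1: one scan to find frequent single items
    items = {i for t in transactions for i in t}
    levels = [{frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_sup}]
    k = 2
    while levels[-1]:
        # Candidate generation: join pairs of frequent (k-1)-itemsets
        cands = {a | b for a in levels[-1] for b in levels[-1]
                 if len(a | b) == k}
        # Apriori pruning: drop candidates with an infrequent (k-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in levels[-1]
                        for s in combinations(c, k - 1))}
        # Test: one database scan per level to count supports
        levels.append({c for c in cands
                       if sum(c <= t for t in transactions) >= min_sup})
        k += 1
    return [s for level in levels for s in level]

print(apriori([{"Beer", "Diaper"}, {"Beer", "Coke"},
               {"Beer", "Diaper", "Milk"}, {"Milk", "Coke"}], 2))
```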
Drawbacks of Apriori • Multiple scans of the transaction database • Database scans are costly • Huge number of candidates • To find the frequent itemset i1i2…i100: • number of scans: 100 • number of candidates: 2^100 − 1 ≈ 1.27 × 10^30
Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining
FP-Growth • Proposed by Han et al. [SIGMOD’00] • Uses the Apriori pruning principle • Scans the DB only twice: • once to find the frequent 1-itemsets (single-item patterns) • once to construct the FP-tree (a prefix tree, or trie), the data structure FP-growth operates on
FP-Growth • First scan: count items, then order each transaction's frequent items by descending support
TID | Items bought | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
Header table: f:4, c:4, a:3, b:3, m:3, p:3
• After inserting TID 100, the tree is the single path {} → f:1 → c:1 → a:1 → m:1 → p:1
FP-Growth • Second scan: insert the remaining transactions; the complete FP-tree contains the paths
{} → f:4 → c:3 → a:3 → m:2 → p:2
{} → f:4 → c:3 → a:3 → b:1 → m:1
{} → f:4 → b:1
{} → c:1 → b:1 → p:1
Header table: f:4, c:4, a:3, b:3, m:3, p:3
FP-Growth • The same tree with header-table links: each header entry (item, support, head) points to a linked list of that item's nodes in the tree, so every occurrence of an item can be reached quickly during mining
FP-Growth • Conditional pattern bases, read off by following the header links upward from each item
Item | Conditional pattern base | Frequent itemsets
p | fcam:2, cb:1 | fp, cp, ap, mp, fcp, fap, fmp, cap, cmp, amp, facp, fcmp, famp, fcamp
m | fca:2, fcab:1 | fm, cm, am, fcm, fam, cam, fcam
b | fca:1, f:1, c:1 | …
a | fc:3 | …
c | f:3 | …
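A minimal sketch of the two-scan tree construction walked through above; the recursive mining of conditional pattern bases is omitted for brevity, and ties in the global item order are broken alphabetically, so the exact tree shape may differ from the figure while remaining a valid FP-tree:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    counts = Counter(i for t in transactions for i in t)   # scan 1
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    for t in transactions:                                 # scan 2
        # keep only frequent items, ordered by descending global count
        path = sorted((i for i in t if i in freq),
                      key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

# The slide's example database, with min_sup = 3
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
      set("bcksp"), set("afcelpmn")]
tree = build_fp_tree(db, 3)
print({item: node.count for item, node in tree.children.items()})
```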
Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining • GSP • SPADE, SPAM • PrefixSpan
Applications • Customer shopping sequences, e.g., a customer who buys a notebook computer buys memory and a CD-ROM within 3 days (80% of the time) • Query/re-query sequences in search logs (e.g., Q1 → P1, Q2 → P2)
Some Definitions • If the support of a sequence exceeds a user-specified min_support, the sequence is called a sequential pattern • min_support = 50% • <bdb> is a sequential pattern • <adc> is not a sequential pattern
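The containment test behind this definition, sketched in Python: a pattern (a list of itemsets) is contained in a sequence if its elements match distinct elements of the sequence, in order, by set inclusion (the data here is hypothetical, since the slide's example database is not reproduced):

```python
def contains(sequence, pattern):
    # Greedy left-to-right match: each pattern itemset must be a
    # subset of a distinct sequence element, in order
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

seq = [{"a"}, {"b", "d"}, {"c"}, {"b"}]   # hypothetical sequence
print(contains(seq, [{"b"}, {"c"}]))      # True: b occurs before c
print(contains(seq, [{"c"}, {"a"}]))      # False: no a after c
```

The support of a sequence is then the fraction of database sequences that contain it.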
Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining • GSP • SPADE, SPAM • PrefixSpan
GSP (Generalized Sequential Patterns) • Proposed by Srikant et al. [EDBT’96] • Uses the Apriori pruning principle • Outline of the method: • Initially, every item in the DB is a candidate of length 1 • For each level (i.e., sequences of length k): • Scan the database to collect the support count of each candidate sequence • Generate candidate length-(k+1) sequences from the frequent length-k sequences, Apriori-style • Repeat until no frequent sequence or no candidate can be found
Finding Length-1 Sequential Patterns • min_sup = 2
Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>
• Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h> • Scan the database once and count the support of each candidate
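A sketch of this first step on the database above: a single scan counts, for each item, the number of sequences that contain it at least once:

```python
db = {   # the sequence database above; each sequence is a list of itemsets
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}
min_sup = 2

items = {i for s in db.values() for element in s for i in element}
support = {i: sum(any(i in element for element in s) for s in db.values())
           for i in items}
print(sorted(i for i, c in support.items() if c >= min_sup))
# ['a', 'b', 'c', 'd', 'e', 'f']: only <g> and <h> fall below min_sup
```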