Principled Regularization for Probabilistic Matrix Factorization
Robert Bell, Suhrid Balakrishnan
AT&T Labs-Research
Duke Workshop on Sensing and Analysis of High-Dimensional Data, July 26-28, 2011
Probabilistic Matrix Factorization (PMF) • Approximate a large n-by-m matrix R by • M = PᵀQ • P and Q each have k rows, k << n, m • m_ui = p_uᵀq_i • R may be sparsely populated • Prime tool in Netflix Prize • 99% of ratings were missing
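[A minimal NumPy sketch of fitting a PMF model to observed cells by stochastic gradient descent. This is not the authors' code (the talk's own fitting method is coordinate-wise Newton, sketched later); the function name and defaults are illustrative. Factors are stored one row per user/item, the transpose of the slides' k-by-n convention.]

```python
import numpy as np

def fit_pmf(users, items, ratings, n, m, k=2, lr=0.01, lam=0.1, epochs=50, seed=0):
    """Fit m_ui = p_u . q_i to the observed cells of R by SGD (illustrative)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n, k))   # user factors, one row per user
    Q = 0.1 * rng.standard_normal((m, k))   # item factors, one row per item
    for _ in range(epochs):
        for u, i, r in zip(users, items, ratings):
            err = r - P[u] @ Q[i]
            pu = P[u].copy()                # keep pre-update value for Q's step
            P[u] += lr * (err * Q[i] - lam * pu)
            Q[i] += lr * (err * pu - lam * Q[i])
    return P, Q
```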
Regularization for PMF • Needed to avoid overfitting • Even after limiting rank of M • Critical for sparse, imbalanced data • Penalized least squares • Minimize Σ_{(u,i) observed} (r_ui − p_uᵀq_i)² + λ(‖P‖² + ‖Q‖²) • or Σ_{(u,i) observed} (r_ui − p_uᵀq_i)² + λ_P‖P‖² + λ_Q‖Q‖² • λ's selected by cross validation
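[The slide's two penalized least-squares criteria, written out as a small helper; a sketch under the same row-per-user/item layout as above, with names of my choosing.]

```python
import numpy as np

def pmf_objective(P, Q, users, items, ratings, lam_P, lam_Q):
    """Squared error over observed cells plus lam_P*||P||^2 + lam_Q*||Q||^2.
    Setting lam_P == lam_Q recovers the single-lambda variant."""
    preds = np.sum(P[users] * Q[items], axis=1)   # p_u . q_i per observation
    sse = np.sum((ratings - preds) ** 2)
    return sse + lam_P * np.sum(P ** 2) + lam_Q * np.sum(Q ** 2)
```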
Research Questions • Should we use separate λ_P and λ_Q? • Should we use k separate λ's for each dimension of P and Q?
Matrix Completion with Noise (Candès and Plan, Proc IEEE, 2010) • Rank reduction without explicit factors • No pre-specification of k, rank(M) • Regularization applied directly to M • Trace norm, aka nuclear norm • Sum of the singular values of M • Minimize ‖M‖_* subject to Σ_{(u,i) observed} (r_ui − m_ui)² ≤ δ • "Equivalent" to L2 regularization for P, Q
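[The "equivalence" rests on a standard identity: ‖M‖_* = min over factorizations M = PᵀQ of ½(‖P‖² + ‖Q‖²), with the minimum attained by the balanced SVD factorization. A short numeric check of that identity (my own sketch, not from the talk):]

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # low-rank matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
trace_norm = s.sum()                       # sum of singular values of M

# Balanced factorization M = P^T Q built from the SVD
P = (U * np.sqrt(s)).T                     # k-by-n, as on the slides
Q = np.sqrt(s)[:, None] * Vt               # k-by-m
half_l2 = 0.5 * (np.sum(P**2) + np.sum(Q**2))

print(np.allclose(P.T @ Q, M))             # factorization reproduces M
print(np.allclose(trace_norm, half_l2))    # ||M||_* = min 0.5(||P||^2 + ||Q||^2)
```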
Research Questions • Should we use separate λ_P and λ_Q? • Should we use k separate λ's for each dimension of P and Q? • Should we use the trace norm for regularization?
Bayesian Matrix Factorization (BPMF) (Salakhutdinov and Mnih, ICML 2008) • Let r_ui ~ N(p_uᵀq_i, σ²) • No PMF-type regularization • p_u ~ N(μ_P, Λ_P⁻¹) and q_i ~ N(μ_Q, Λ_Q⁻¹) • Priors for σ², μ_P, μ_Q, Λ_P, Λ_Q • Fit by Gibbs sampling • Substantial reduction in prediction error relative to PMF with L2 regularization
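[One ingredient of that Gibbs sampler, as a sketch: the conditional for p_u given everything else is Gaussian, so it can be drawn exactly. Names and argument layout are my own; the full BPMF sampler alternates this with the symmetric item update and with draws of (μ_P, Λ_P), (μ_Q, Λ_Q) from their Gaussian-Wishart conditionals.]

```python
import numpy as np

def sample_user_factor(r_u, Q_u, mu_P, Lam_P, sigma2, rng):
    """One Gibbs draw of p_u from its Gaussian conditional.
    r_u: user u's ratings; Q_u: matching item factors (one row per rating)."""
    prec = Lam_P + (Q_u.T @ Q_u) / sigma2              # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (Lam_P @ mu_P + (Q_u.T @ r_u) / sigma2)
    return rng.multivariate_normal(mean, cov)
```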
Research Questions • Should we use separate λ_P and λ_Q? • Should we use k separate reg. parameters for each dimension of P and Q? • Should we use the trace norm for regularization? • Does BPMF "regularize" appropriately?
Matrix Factorization with Biases • Let m_ui = μ + a_u + b_i + p_uᵀq_i • Regularization similar to before • Minimize Σ_{(u,i) observed} (r_ui − m_ui)² + λ(‖a‖² + ‖b‖² + ‖P‖² + ‖Q‖²) • or Σ_{(u,i) observed} (r_ui − m_ui)² + λ_a‖a‖² + λ_b‖b‖² + λ_P‖P‖² + λ_Q‖Q‖²
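[A sketch of one SGD update for the biased model with the separate-λ penalty; illustrative only (names lam_a, lam_b, lam_PQ are mine, and μ is held fixed, e.g. at the global training mean).]

```python
import numpy as np

def sgd_step_with_biases(u, i, r, mu, a, b, P, Q, lr=0.01,
                         lam_a=0.1, lam_b=0.1, lam_PQ=0.1):
    """One SGD update for m_ui = mu + a_u + b_i + p_u . q_i with separate
    penalties on user biases, item biases, and factors."""
    err = r - (mu + a[u] + b[i] + P[u] @ Q[i])
    a[u] += lr * (err - lam_a * a[u])
    b[i] += lr * (err - lam_b * b[i])
    pu = P[u].copy()                       # pre-update copy for Q's step
    P[u] += lr * (err * Q[i] - lam_PQ * pu)
    Q[i] += lr * (err * pu - lam_PQ * Q[i])
```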
Research Questions • Should we use separate λ_P and λ_Q? • Should we use k separate reg. parameters for each dimension of P and Q? • Should we use the trace norm for regularization? • Does BPMF "regularize" appropriately? • Should we use separate λ's for the biases?
Some Things this Talk Will Not Cover • Various extensions of PMF • Combining explicit and implicit feedback • Time varying factors • Non-negative matrix factorization • L1 regularization • λ's depending on user or item sample sizes • Efficiency of optimization algorithms • Use Newton's method, each coordinate separately • Iterate to convergence
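[One plausible reading of "Newton's method, each coordinate separately", sketched below: with squared loss and an L2 penalty, the objective is quadratic in any single coordinate, so one Newton step lands on that coordinate's exact minimizer, and "iterate to convergence" means cycling over coordinates. This is my reconstruction, not the authors' code.]

```python
import numpy as np

def newton_coord_update(p_u, f, r_u, Q_u, lam):
    """Exact update of coordinate p_uf for squared loss + L2 penalty.
    Q_u: factors of the items user u rated (rows); r_u: those ratings."""
    resid = r_u - Q_u @ p_u + Q_u[:, f] * p_u[f]  # residual excluding coord f
    p_u[f] = (Q_u[:, f] @ resid) / (Q_u[:, f] @ Q_u[:, f] + lam)
```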
No Need for Separate λ_P and λ_Q • M = (cP)ᵀ(c⁻¹Q) is invariant for c ≠ 0 • For initial P and Q • Solve for c to minimize λ_P‖cP‖² + λ_Q‖c⁻¹Q‖² • c = (λ_Q‖Q‖² / λ_P‖P‖²)^(1/4) • Gives minimized penalty 2(λ_Pλ_Q)^(1/2)‖P‖‖Q‖, which depends only on the product λ_Pλ_Q • Sufficient to let λ_P = λ_Q = λ_PQ
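[A quick numeric check of the slide's rescaling argument (my own sketch): predictions are unchanged by c, and at the optimal c the penalty collapses to a function of λ_Pλ_Q alone.]

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.standard_normal((3, 10))   # k-by-n, as on the slides
Q = rng.standard_normal((3, 8))    # k-by-m
lam_P, lam_Q = 0.5, 2.0

# Optimal rescaling from minimizing lam_P*||cP||^2 + lam_Q*||Q/c||^2 over c
c = (lam_Q * np.sum(Q**2) / (lam_P * np.sum(P**2))) ** 0.25

print(np.allclose(P.T @ Q, (c * P).T @ (Q / c)))   # M invariant under c

penalty = lam_P * np.sum((c * P)**2) + lam_Q * np.sum((Q / c)**2)
target = 2 * np.sqrt(lam_P * lam_Q) * np.sqrt(np.sum(P**2) * np.sum(Q**2))
print(np.isclose(penalty, target))  # minimized penalty: 2*sqrt(lam_P*lam_Q)*||P||*||Q||
```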
Bayesian Motivation for L2 Regularization • Simplest case: only one item • R is n-by-1 • r_u1 = a_1 + ε_u, a_1 ~ N(0, σ_a²), ε_u ~ N(0, σ²) • Posterior mean (or MAP) of a_1 satisfies the penalized least squares criterion with • λ_a = σ²/σ_a² • Best λ is inversely proportional to σ_a²
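[The resulting MAP estimate in closed form: the sample mean shrunk by λ_a = σ²/σ_a². A tiny sketch; the function name is mine.]

```python
import numpy as np

def map_bias(ratings, sigma2, sigma2_a):
    """MAP estimate of a_1 from n noisy ratings of a single item:
    sum(r) / (n + lam), with lam = sigma2 / sigma2_a."""
    lam = sigma2 / sigma2_a
    return np.sum(ratings) / (len(ratings) + lam)

# Larger prior variance sigma2_a -> smaller lam -> less shrinkage toward 0
print(map_bias(np.array([1.0, 0.6, 0.8]), sigma2=1.0, sigma2_a=0.09))
```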
Implications for Regularization of PMF • Allow λ_a ≠ λ_b • If σ_a² ≠ σ_b² • Allow λ_a ≠ λ_b ≠ λ_PQ • Allow λ_PQ1 ≠ λ_PQ2 ≠ … ≠ λ_PQk? • Trace norm does not • BPMF appears to
Simulation Experiment Structure • n = 2,500 users, m = 400 items • 250,000 observed ratings • 150,000 in Training (to estimate a, b, P, Q) • 50,000 in Validation (to tune λ's) • 50,000 in Test (to estimate MSE) • Substantial imbalance in ratings • 8 to 134 ratings per user in Training data • 33 to 988 ratings per item in Training data
Simulation Model • r_ui = a_u + b_i + p_u1 q_i1 + p_u2 q_i2 + ε_ui • Elements of a, b, P, Q, and ε • Independent normals with mean 0 • Var(a_u) = 0.09 • Var(b_i) = 0.16 • Var(p_u1 q_i1) = 0.04 • Var(p_u2 q_i2) = 0.01 • Var(ε_ui) = 1.00
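[A generator for data from this model, as a sketch. For a product of independent zero-mean normals, Var(pq) = Var(p)Var(q), so giving each factor standard deviation (target variance)^(1/4) hits the slide's factor-product variances. Two deliberate simplifications flagged in comments: (u, i) pairs are sampled uniformly (the talk's design was imbalanced), and splitting the draws 150k/50k/50k per the previous slide is left to the caller.]

```python
import numpy as np

def simulate_ratings(n=2500, m=400, n_obs=250_000, seed=0):
    """Draw (u, i, r_ui, m_ui) from the slide's model (illustrative)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0, np.sqrt(0.09), n)              # user biases
    b = rng.normal(0, np.sqrt(0.16), m)              # item biases
    sd = np.array([0.04, 0.01]) ** 0.25              # per-factor std devs
    P = rng.normal(0, sd, (n, 2))
    Q = rng.normal(0, sd, (m, 2))
    u = rng.integers(0, n, n_obs)                    # uniform, unlike the talk's
    i = rng.integers(0, m, n_obs)                    # imbalanced design
    mean = a[u] + b[i] + np.sum(P[u] * Q[i], axis=1)  # m_ui = E(r_ui)
    r = mean + rng.normal(0, 1.0, n_obs)             # Var(eps) = 1.00
    return u, i, r, mean
```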
Evaluation • Test MSE for estimation of m_ui = E(r_ui) • MSE = (1/|Test|) Σ_{(u,i) in Test} (m̂_ui − m_ui)² • Limitations • Not real data • Only one replication • No standard errors
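[The criterion above in one line; note it scores estimates against the noise-free means m_ui rather than the noisy ratings, which is only possible because the data are simulated.]

```python
import numpy as np

def test_mse(m_hat, m_true):
    """Mean squared error of estimated means against the true m_ui."""
    return float(np.mean((m_hat - m_true) ** 2))
```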
Results for Matrix Completion • Performs poorly on raw ratings • MSE = .0693 • Not designed to estimate biases • Fit to residuals from PMF with k = 0 • MSE = .0477 • “Recovered” rank was 1 • Worse than MSE’s from PMF: .0428 to .0439
Results for BPMF • Raw ratings • MSE = .0498, using k = 3 • Early stopping • Not designed to estimate biases • Fit to residuals from PMF with k = 0 • MSE = .0433, using k = 2 • Near .0428, for best PMF w/ biases
Summary • No need for separate λ_P and λ_Q • Theory suggests using separate λ's for distinct sets of exchangeable parameters • Biases vs. factors • For individual factors • Tentative simulation results support need for separate λ's across factors • BPMF does so automatically • PMF requires a way to do efficient tuning