Latent Factor Models • Geoff Gordon • Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy
Motivation • A key component of a cognitive tutor: the student cognitive model • Tracks what skills the student currently knows (the latent factors) • Diagram nodes: circle-area, rectangle-area, decompose-area, right-answer
Motivation • Student models are a key bottleneck in cognitive tutor authoring and performance • rough estimate: 20-80 hrs to hand-code a model for 1 hr of content • result may be too simple, not rigorously verified • But, demonstrated improvements in learning from better models • E.g., Cen et al [2007]: 12% less time to learn 6 geometry units (same retention) using a tutor w/ a more accurate model • This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis
Simple case: snapshot, no side information • Data: a students × items matrix whose (i, j) entry is the score of student i on item j
Missing data • Many entries of the students × items matrix are unobserved
Data matrix X • One vector xi per student (i = 1 … n), giving that student's scores over the items
Simple case: model • U: student latent factors (n students × k latent factors) • V: item latent factors (m items × k latent factors) • X: observed performance (n students × m items) • X is observed; U and V are unobserved
Linear-Gaussian version • U: Gaussian (0 mean, fixed var) • V: Gaussian (0 mean, fixed var) • X: Gaussian (fixed var, mean Ui ⋅ Vj, the product of student factor and item factor)
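To make the generative story concrete, here is a minimal sketch of sampling from this linear-Gaussian model in Python; the sizes, noise variance, and missingness rate are illustrative choices, not values from the talk (n and m echo the Geometry data used later, k is arbitrary).

    # Sketch: sampling from the linear-Gaussian latent factor model described above.
    # All sizes and variances are illustrative, not from the talk.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 59, 139, 5                        # students, items, latent factors
    U = rng.normal(0.0, 1.0, (n, k))            # student latent factors, zero-mean Gaussian
    V = rng.normal(0.0, 1.0, (m, k))            # item latent factors, zero-mean Gaussian
    X = U @ V.T + rng.normal(0.0, 0.5, (n, m))  # observed scores: mean Ui . Vj plus noise

    # Missing data: mask out entries the students never attempted.
    observed = rng.random((n, m)) < 0.4
    X_obs = np.where(observed, X, np.nan)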
Matrix form: Principal Components Analysis • DATA MATRIX X ≈ COMPRESSED MATRIX U × BASIS MATRIX Vᵀ (columns x1 … xn of X, weight vectors u1 … un, basis vectors v1 … vk)
PCA: matrix form • X ≈ U Vᵀ; the columns of V span the low-rank space
Interpretation of factors • V holds the basis vectors over items; U holds the basis weights for the students • Basis vectors are candidate "skills" or "knowledge components" • Weights are students' knowledge levels
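A minimal sketch of the X ≈ U Vᵀ factorization via a truncated SVD, assuming a fully observed X; the function name and the column-centering choice are mine, not from the talk.

    # Sketch: PCA factorization X ~= U @ V.T via truncated SVD (fully observed X).
    # k is the number of candidate "skills"; naming follows the slides.
    import numpy as np

    def pca_factor(X, k):
        Xc = X - X.mean(axis=0)                         # center each item (column)
        Uf, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        U = Uf[:, :k] * s[:k]                           # students x k: per-student skill weights
        V = Vt[:k, :].T                                 # items x k: basis vectors ("skills")
        return U, V                                     # Xc ~= U @ V.T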
PCA is a widely successful model • Face images from Groundhog Day, extracted by the Cambridge face DB project
Data matrix: face images • One vector xi per image (i = 1 … n), over the pixels
Result of factoring • Basis vectors over pixels, basis weights per image • Basis vectors are often called "eigenfaces"
Eigenfaces (image credit: AT&T Labs Cambridge)
PCA: the good • Unsupervised: need no human labels of latent state! • No worry about “expert blind spot” • Of course, labels helpful if available • Post-hoc human interpretation of latents is nice too—e.g., intervention design
PCA: the bad • Linear, Gaussian • PCA assumes E(X) is linear in UV • PCA assumes (X–E(X)) is i.i.d. Gaussian
Nonlinearity: conjunctive skills (surface plot of P(correct) vs. skill 1 and skill 2)
Nonlinearity: disjunctive skills (surface plot of P(correct) vs. skill 1 and skill 2)
Nonlinearity: "other" (surface plot of P(correct) vs. skill 1 and skill 2)
Non-Gaussianity • Typical hand-developed skill-by-item matrix (skills × items) has 0/1 entries, far from Gaussian
Result of Gaussian assumption • Plot comparing rows of the true and recovered V matrices
The ugly: MLE only • PCA yields the maximum-likelihood estimate • Good, right? • sadly, the usual reasons to want the MLE don't apply here • e.g., consistency: variance and bias of estimates of U and V do not approach 0 (unless #items/student and #students/item → ∞) • Result: MLE is typically far too confident of itself
Too certain: example • Plots of learned coefficients (e.g., a row of U) and the resulting predictions
Result: “fold-in problem” • Nonsensical results when trying to apply learned model to a new student or item • Similar to overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples • Unlike overfitting, fold-in problem doesn’t necessarily go away with more data
Summary: 3 problems w/ PCA • Can’t handle nonlinearity • Can’t handle non-Gaussian distributions • Uses MLE only (==> fold-in problem) • Let’s look at each problem in turn
Nonlinearity • In PCA, had Xij ≈ Ui ⋅ Vj • What if • Xij ≈ exp(Ui ⋅ Vj) • Xij ≈ logistic(Ui ⋅ Vj) • …
Non-Gaussianity • In PCA, had Xij ~ Normal(μ), μ = Ui ⋅ Vj • What if • Xij ~ Poisson(μ) • Xij ~ Binomial(p) • …
Exponential family review • Exponential family of distributions: • P(X | θ) = P0(X) exp(X⋅θ – G(θ)) • G(θ) is always strictly convex, differentiable on interior of domain • means G’ is strictly monotone (strictly generalized monotone in 2D or higher)
Exponential family review • Exponential family PDF: • P(X | θ) = P0(X) exp(X⋅θ – G(θ)) • Surprising result: G′(θ) = g(θ) = E(X | θ) • g and its inverse g⁻¹ give the "link function" • θ = "natural parameter" • E(X | θ) = "expectation parameter"
Examples • Normal(mean) • g = identity • Poisson(log rate) • g = exp • Binomial(log odds) • g = sigmoid
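As a worked check of the G′(θ) = E(X | θ) identity above (an added example, not from the slides): for the Poisson with rate λ and natural parameter θ = log λ,

    P(x | θ) = (1/x!) exp(xθ – e^θ),  so  G(θ) = e^θ  and  g(θ) = G′(θ) = e^θ = λ = E(X | θ)

i.e., the mean map g is exp, matching the Poisson entry in the table above.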
Nonlinear & non-Gaussian • Let P(X | θ) be an exponential family with natural parameter θ • Predict Xij ~ P(X | θij), where θij = Ui ⋅ Vj • e.g., for Poisson, E(Xij) = exp(θij) • e.g., for Binomial, E(Xij) = logistic(θij)
Optimization problem • max over U, V of ∑ log P(Xij | θij) + log P(U) + log P(V), s.t. θij = Ui ⋅ Vj • "Generalized linear" or "exponential family" PCA • all P(…) terms are exponential families • analogy to GLMs [Collins et al, 2001] [Gordon, 2002] [Roy & Gordon, 2005]
Special cases • PCA, probabilistic PCA • Poisson PCA • k-means clustering • Max-margin matrix factorization (MMMF) • Almost: pLSI, pHITS, NMF
Comparison to AFM • p = probability correct • θ = student overall performance • β = skill difficulty • Q = item × skill matrix • γ = skill practice slope • T = number of practice opportunities • AFM: logit(pij) = θi + ∑k Qjk (βk + γk Tik)
Theorem • In GL PCA, finding U which maximizes likelihood (holding V fixed) is a convex optimization problem • And, finding best V (holding U fixed) is a convex problem • Further, Hessian is block diagonal • So, an efficient and effective optimization algorithm: alternately improve U and V
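As a rough illustration of that alternating scheme (not the talk's exact algorithm), here is a sketch for the Bernoulli/logistic case using plain gradient steps for each convex subproblem; the step size, iteration counts, and the omission of the log P(U) and log P(V) prior terms are simplifications of mine.

    # Sketch: alternating updates for Bernoulli ("logistic") exponential-family PCA.
    # Improve U with V fixed, then V with U fixed; each subproblem is convex.
    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def eflpca_bernoulli(X, k, iters=50, lr=0.05, seed=0):
        # X: n x m matrix of 0/1 responses, with np.nan for unobserved entries
        rng = np.random.default_rng(seed)
        n, m = X.shape
        U = 0.01 * rng.standard_normal((n, k))
        V = 0.01 * rng.standard_normal((m, k))
        W = ~np.isnan(X)                          # mask of observed entries
        Xf = np.nan_to_num(X)
        for _ in range(iters):
            for _ in range(5):                    # improve U, holding V fixed
                R = W * (Xf - sigmoid(U @ V.T))   # gradient of log-lik in natural params
                U += lr * (R @ V)
            for _ in range(5):                    # improve V, holding U fixed
                R = W * (Xf - sigmoid(U @ V.T))
                V += lr * (R.T @ U)
        return U, V                               # natural parameters: theta_ij = U_i . V_j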
Example: compressing histograms w/ Poisson PCA • Points: observed frequencies in ℝ³ (counts of outcomes A, B, C) • Hidden manifold: a 1-parameter family of multinomials
Example • Snapshots of the Poisson PCA fit at iterations 1, 2, 3, 4, 5, and 9
Remaining problem: MLE • Well-known rule of thumb: if MLE gets you in trouble due to overfitting, move to fully-Bayesian inference • Typical problem: computation • In our case, the computation is just fine if we’re a little clever • Additional wrinkle: switch to hierarchical model
Bayesian hierarchical exponential-family PCA • U: student latent factors (n students × k) • V: item latent factors (m items × k) • X: observed performance • R: shared prior for student latents • S: shared prior for item latents • Only X is observed; U, V, R, S are unobserved
A little clever: MCMC
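To illustrate the "fully Bayesian instead of MLE" idea on the smallest possible piece of the model (this is not the talk's sampler, which is hierarchical with the shared priors R and S), here is a random-walk Metropolis sketch for the posterior over one student's factors Ui given a fixed V, under the Bernoulli model with a Gaussian prior; all names and settings are illustrative.

    # Sketch: random-walk Metropolis over one student's latent factors u, given V.
    # Illustrates predicting by averaging over posterior samples instead of an MLE point.
    import numpy as np

    def log_post(u, x, V, prior_var=1.0):
        # x: one student's 0/1 responses, np.nan where unobserved
        theta = V @ u                                        # natural parameters for this student
        ll = np.nansum(x * theta - np.log1p(np.exp(theta)))  # Bernoulli log-likelihood
        return ll - u @ u / (2.0 * prior_var)                # plus Gaussian prior on u

    def sample_student(x, V, n_samples=2000, step=0.1, seed=0):
        rng = np.random.default_rng(seed)
        u = np.zeros(V.shape[1])
        samples = []
        for _ in range(n_samples):
            prop = u + step * rng.standard_normal(u.shape)
            if np.log(rng.random()) < log_post(prop, x, V) - log_post(u, x, V):
                u = prop
            samples.append(u.copy())
        return np.array(samples)

Predictions for that student come from averaging sigmoid(V @ u) over the samples rather than plugging in a single point estimate, which is what mitigates the fold-in problem described earlier.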
Experimental comparison: Geometry Area 1996-1997 data • Geometry tutor: 139 items presented to 59 students • On average, each student tested on 60 items
Results: hold-out error (credit: Ajit Singh) • Embedding dimension for *EPCA is K = 15