Latent Factor Models • Geoff Gordon • Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy
Motivation • A key component of a cognitive tutor: the student cognitive model • Tracks what skills the student currently knows (the latent factors) • Diagram nodes: circle-area, rectangle-area, decompose-area, right-answer
Motivation • Student models are a key bottleneck in cognitive tutor authoring and performance • rough estimate: 20-80 hrs to hand-code a model for 1 hr of content • result may be too simple, not rigorously verified • But, demonstrated improvements in learning from better models • E.g., Cen et al [2007]: 12% less time to learn 6 geometry units (same retention) using a tutor w/ a more accurate model • This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis
Simple case: snapshot, no side information • Data: a students × items matrix whose (i, j) entry is the score of student i on item j
Missing data • Many entries of the students × items matrix are unobserved
Data matrix X • One vector xi per student (i = 1 … n), giving that student's scores over the items
Simple case: model • U: student latent factors (n students × k latent factors) • V: item latent factors (m items × k latent factors) • X: observed performance (n students × m items) • X is observed; U and V are unobserved
Linear-Gaussian version • U: Gaussian (0 mean, fixed var) • V: Gaussian (0 mean, fixed var) • X: Gaussian (fixed var, mean Ui ⋅ Vj, the product of student factor and item factor)
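To make the generative story concrete, here is a minimal sketch of sampling from this linear-Gaussian model in Python; the sizes, noise variance, and missingness rate are illustrative choices, not values from the talk (n and m echo the Geometry data used later, k is arbitrary).

    # Sketch: sampling from the linear-Gaussian latent factor model described above.
    # All sizes and variances are illustrative, not from the talk.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 59, 139, 5                        # students, items, latent factors
    U = rng.normal(0.0, 1.0, (n, k))            # student latent factors, zero-mean Gaussian
    V = rng.normal(0.0, 1.0, (m, k))            # item latent factors, zero-mean Gaussian
    X = U @ V.T + rng.normal(0.0, 0.5, (n, m))  # observed scores: mean Ui . Vj plus noise

    # Missing data: mask out entries the students never attempted.
    observed = rng.random((n, m)) < 0.4
    X_obs = np.where(observed, X, np.nan)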
Matrix form: Principal Components Analysis • DATA MATRIX X ≈ COMPRESSED MATRIX U × BASIS MATRIX Vᵀ (columns x1 … xn of X, weight vectors u1 … un, basis vectors v1 … vk)
PCA: matrix form • X ≈ U Vᵀ; the columns of V span the low-rank space
Interpretation of factors • V holds the basis vectors over items; U holds the basis weights for the students • Basis vectors are candidate "skills" or "knowledge components" • Weights are students' knowledge levels
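A minimal sketch of the X ≈ U Vᵀ factorization via a truncated SVD, assuming a fully observed X; the function name and the column-centering choice are mine, not from the talk.

    # Sketch: PCA factorization X ~= U @ V.T via truncated SVD (fully observed X).
    # k is the number of candidate "skills"; naming follows the slides.
    import numpy as np

    def pca_factor(X, k):
        Xc = X - X.mean(axis=0)                         # center each item (column)
        Uf, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        U = Uf[:, :k] * s[:k]                           # students x k: per-student skill weights
        V = Vt[:k, :].T                                 # items x k: basis vectors ("skills")
        return U, V                                     # Xc ~= U @ V.T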
PCA is a widely successful model • Face images from Groundhog Day, extracted by the Cambridge face DB project
Data matrix: face images • One vector xi per image (i = 1 … n), over the pixels
Result of factoring • Basis vectors over pixels, basis weights per image • Basis vectors are often called "eigenfaces"
Eigenfaces (image credit: AT&T Labs Cambridge)
PCA: the good • Unsupervised: need no human labels of latent state! • No worry about “expert blind spot” • Of course, labels helpful if available • Post-hoc human interpretation of latents is nice too—e.g., intervention design
PCA: the bad • Linear, Gaussian • PCA assumes E(X) is linear in UV • PCA assumes (X–E(X)) is i.i.d. Gaussian
Nonlinearity: conjunctive skills (surface plot of P(correct) vs. skill 1 and skill 2)
Nonlinearity: disjunctive skills (surface plot of P(correct) vs. skill 1 and skill 2)
Nonlinearity: "other" (surface plot of P(correct) vs. skill 1 and skill 2)
Non-Gaussianity • Typical hand-developed skill-by-item matrix (skills × items) has 0/1 entries, far from Gaussian
Result of Gaussian assumption • Plot comparing rows of the true and recovered V matrices
The ugly: MLE only • PCA yields the maximum-likelihood estimate • Good, right? • sadly, the usual reasons to want the MLE don't apply here • e.g., consistency: variance and bias of estimates of U and V do not approach 0 (unless #items/student and #students/item → ∞) • Result: MLE is typically far too confident of itself
Too certain: example • Plots of learned coefficients (e.g., a row of U) and the resulting predictions
Result: “fold-in problem” • Nonsensical results when trying to apply learned model to a new student or item • Similar to overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples • Unlike overfitting, fold-in problem doesn’t necessarily go away with more data
Summary: 3 problems w/ PCA • Can’t handle nonlinearity • Can’t handle non-Gaussian distributions • Uses MLE only (==> fold-in problem) • Let’s look at each problem in turn
Nonlinearity • In PCA, had Xij ≈ Ui ⋅ Vj • What if • Xij ≈ exp(Ui ⋅ Vj) • Xij ≈ logistic(Ui ⋅ Vj) • …
Non-Gaussianity • In PCA, had Xij ~ Normal(μ), μ = Ui ⋅ Vj • What if • Xij ~ Poisson(μ) • Xij ~ Binomial(p) • …
Exponential family review • Exponential family of distributions: • P(X | θ) = P0(X) exp(X⋅θ – G(θ)) • G(θ) is always strictly convex, differentiable on interior of domain • means G’ is strictly monotone (strictly generalized monotone in 2D or higher)
Exponential family review • Exponential family PDF: • P(X | θ) = P0(X) exp(X⋅θ – G(θ)) • Surprising result: G′(θ) = g(θ) = E(X | θ) • g and its inverse g⁻¹ give the "link function" • θ = "natural parameter" • E(X | θ) = "expectation parameter"
Examples • Normal(mean) • g = identity • Poisson(log rate) • g = exp • Binomial(log odds) • g = sigmoid
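As a worked check of the G′(θ) = E(X | θ) identity above (an added example, not from the slides): for the Poisson with rate λ and natural parameter θ = log λ,

    P(x | θ) = (1/x!) exp(xθ – e^θ),  so  G(θ) = e^θ  and  g(θ) = G′(θ) = e^θ = λ = E(X | θ)

i.e., the mean map g is exp, matching the Poisson entry in the table above.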
Nonlinear & non-Gaussian • Let P(X | θ) be an exponential family with natural parameter θ • Predict Xij ~ P(X | θij), where θij = Ui ⋅ Vj • e.g., for Poisson, E(Xij) = exp(θij) • e.g., for Binomial, E(Xij) = logistic(θij)
Optimization problem • max over U, V of ∑ log P(Xij | θij) + log P(U) + log P(V), s.t. θij = Ui ⋅ Vj • "Generalized linear" or "exponential family" PCA • all P(…) terms are exponential families • analogy to GLMs [Collins et al, 2001] [Gordon, 2002] [Roy & Gordon, 2005]
Special cases • PCA, probabilistic PCA • Poisson PCA • k-means clustering • Max-margin matrix factorization (MMMF) • Almost: pLSI, pHITS, NMF
Comparison to AFM • p = probability correct • θ = student overall performance • β = skill difficulty • Q = item × skill matrix • γ = skill practice slope • T = number of practice opportunities • AFM: logit(pij) = θi + ∑k Qjk (βk + γk Tik)
Theorem • In GL PCA, finding U which maximizes likelihood (holding V fixed) is a convex optimization problem • And, finding best V (holding U fixed) is a convex problem • Further, Hessian is block diagonal • So, an efficient and effective optimization algorithm: alternately improve U and V
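As a rough illustration of that alternating scheme (not the talk's exact algorithm), here is a sketch for the Bernoulli/logistic case using plain gradient steps for each convex subproblem; the step size, iteration counts, and the omission of the log P(U) and log P(V) prior terms are simplifications of mine.

    # Sketch: alternating updates for Bernoulli ("logistic") exponential-family PCA.
    # Improve U with V fixed, then V with U fixed; each subproblem is convex.
    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def eflpca_bernoulli(X, k, iters=50, lr=0.05, seed=0):
        # X: n x m matrix of 0/1 responses, with np.nan for unobserved entries
        rng = np.random.default_rng(seed)
        n, m = X.shape
        U = 0.01 * rng.standard_normal((n, k))
        V = 0.01 * rng.standard_normal((m, k))
        W = ~np.isnan(X)                          # mask of observed entries
        Xf = np.nan_to_num(X)
        for _ in range(iters):
            for _ in range(5):                    # improve U, holding V fixed
                R = W * (Xf - sigmoid(U @ V.T))   # gradient of log-lik in natural params
                U += lr * (R @ V)
            for _ in range(5):                    # improve V, holding U fixed
                R = W * (Xf - sigmoid(U @ V.T))
                V += lr * (R.T @ U)
        return U, V                               # natural parameters: theta_ij = U_i . V_j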
Example: compressing histograms w/ Poisson PCA • Points: observed frequencies in ℝ³ (counts of outcomes A, B, C) • Hidden manifold: a 1-parameter family of multinomials
Example • Snapshots of the Poisson PCA fit at iterations 1, 2, 3, 4, 5, and 9
Remaining problem: MLE • Well-known rule of thumb: if MLE gets you in trouble due to overfitting, move to fully-Bayesian inference • Typical problem: computation • In our case, the computation is just fine if we’re a little clever • Additional wrinkle: switch to hierarchical model
Bayesian hierarchical exponential-family PCA • U: student latent factors (n students × k) • V: item latent factors (m items × k) • X: observed performance • R: shared prior for student latents • S: shared prior for item latents • Only X is observed; U, V, R, S are unobserved
A little clever: MCMC
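To illustrate the "fully Bayesian instead of MLE" idea on the smallest possible piece of the model (this is not the talk's sampler, which is hierarchical with the shared priors R and S), here is a random-walk Metropolis sketch for the posterior over one student's factors Ui given a fixed V, under the Bernoulli model with a Gaussian prior; all names and settings are illustrative.

    # Sketch: random-walk Metropolis over one student's latent factors u, given V.
    # Illustrates predicting by averaging over posterior samples instead of an MLE point.
    import numpy as np

    def log_post(u, x, V, prior_var=1.0):
        # x: one student's 0/1 responses, np.nan where unobserved
        theta = V @ u                                        # natural parameters for this student
        ll = np.nansum(x * theta - np.log1p(np.exp(theta)))  # Bernoulli log-likelihood
        return ll - u @ u / (2.0 * prior_var)                # plus Gaussian prior on u

    def sample_student(x, V, n_samples=2000, step=0.1, seed=0):
        rng = np.random.default_rng(seed)
        u = np.zeros(V.shape[1])
        samples = []
        for _ in range(n_samples):
            prop = u + step * rng.standard_normal(u.shape)
            if np.log(rng.random()) < log_post(prop, x, V) - log_post(u, x, V):
                u = prop
            samples.append(u.copy())
        return np.array(samples)

Predictions for that student come from averaging sigmoid(V @ u) over the samples rather than plugging in a single point estimate, which is what mitigates the fold-in problem described earlier.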
Experimental comparison: Geometry Area 1996-1997 data • Geometry tutor: 139 items presented to 59 students • On average, each student tested on 60 items
Results: hold-out error (credit: Ajit Singh) • Embedding dimension for *EPCA is K = 15