Latent Factor Models

Presentation Transcript


  1. Latent Factor Models • Geoff Gordon • Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy

  2. Motivation • A key component of a cognitive tutor: student cognitive model • Tracks what skills student currently knows — latent factors [Diagram: latent skill nodes circle-area, rectangle-area, decompose-area and observed node right-answer]

  3. Motivation • Student models are a key bottleneck in cognitive tutor authoring and performance • rough estimate: 20-80 hrs to hand-code model for 1 hr of content • result may be too simple, not rigorously verified • But, demonstrated improvements in learning from better models • E.g., Cen et al [2007]: 12% less time to learn 6 geometry units (same retention) using tutor w/ more accurate model • This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis

  4. Simple case: snapshot, no side information [Figure: students × items matrix; entry (i, j) = score of student i on item j]

  5. Missing data [Figure: students × items matrix with missing entries]

  6. Data matrix X [Figure: the students × items data matrix X, with vectors x1, x2, x3, …, xn]

  7. Simple case: model [Graphical model: unobserved U (n students × k latent factors) and unobserved V (m items × k latent factors) generate observed X] • U: student latent factors • V: item latent factors • X: observed performance

  8. Linear-Gaussian version • U: Gaussian (0 mean, fixed var) • V: Gaussian (0 mean, fixed var) • X: Gaussian (fixed var, mean = student factor ⋅ item factor, i.e. Ui ⋅ Vj) [Same graphical model: U (n students × k latent factors), V (m items × k latent factors), observed X]
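To make the generative story concrete, here is a minimal sketch (not from the talk) that samples a performance matrix from this linear-Gaussian model; the matrix sizes, variances, and random seed are illustrative assumptions.

```python
# Minimal sketch: sampling from the linear-Gaussian latent factor model.
# Sizes and variances below are illustrative assumptions, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 59, 139, 5                 # students, items, latent factors (assumed)
sigma_u, sigma_v, sigma_x = 1.0, 1.0, 0.5

U = rng.normal(0.0, sigma_u, size=(n, k))   # student latent factors
V = rng.normal(0.0, sigma_v, size=(m, k))   # item latent factors
X = rng.normal(U @ V.T, sigma_x)            # observed performance: mean U_i . V_j
```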

  9. Matrix form: Principal Components Analysis • X ≈ U Vᵀ • DATA MATRIX X: rows x1, x2, x3, …, xn • COMPRESSED MATRIX U: rows u1, u2, u3, …, un • BASIS MATRIX Vᵀ: rows v1, …, vk
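For reference, a minimal sketch of computing the factorization X ≈ U Vᵀ with a truncated SVD, one standard way to obtain the PCA solution; mean-centering is omitted for brevity, and the function name and rank k = 5 are illustrative.

```python
# Minimal sketch: rank-k factorization X ≈ U @ V.T via truncated SVD
# (classical PCA up to the omitted mean-centering step).
import numpy as np

def pca_factorize(X, k):
    """Return U (n x k, compressed matrix) and V (m x k, basis) with X ≈ U @ V.T."""
    W, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = W[:, :k] * s[:k]     # basis weights for each row of X
    V = Vt[:k, :].T          # columns of V span the low-rank space
    return U, V

U_hat, V_hat = pca_factorize(X, k=5)            # X from the sketch above
err = np.linalg.norm(X - U_hat @ V_hat.T)       # reconstruction error
```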

  10. PCA: the picture

  11. PCA: matrix form • X ≈ U Vᵀ (data matrix X, rows x1 … xn; compressed matrix U, rows u1 … un; basis matrix Vᵀ, rows v1 … vk) • Columns of V span the low-rank space

  12. Interpretation of factors • U: basis weights (students × basis vectors), rows u1 … un • Vᵀ: basis vectors (basis vectors × items), v1 … vk • Basis vectors are candidate “skills” or “knowledge components” • Weights are students’ knowledge levels

  13. PCA is a widely successful model [Figure: face images from Groundhog Day, extracted by the Cambridge face DB project]

  14. Data matrix: face images [Figure: images × pixels matrix with rows x1, x2, x3, …, xn]

  15. Result of factoring • U: basis weights (images × basis vectors), rows u1 … un • Vᵀ: basis vectors (basis vectors × pixels), v1 … vk • Basis vectors are often called “eigenfaces”

  16. Eigenfaces [Image credit: AT&T Labs Cambridge]

  17. PCA: the good • Unsupervised: need no human labels of latent state! • No worry about “expert blind spot” • Of course, labels helpful if available • Post-hoc human interpretation of latents is nice too—e.g., intervention design

  18. PCA: the bad • Linear, Gaussian • PCA assumes E(X) is linear in UV • PCA assumes (X–E(X)) is i.i.d. Gaussian

  19. Nonlinearity: conjunctive skills [Surface plot: P(correct) vs. skill 1 and skill 2]

  20. Nonlinearity: disjunctive skills [Surface plot: P(correct) vs. skill 1 and skill 2]

  21. Nonlinearity: “other” [Surface plot: P(correct) vs. skill 1 and skill 2]

  22. Non-Gaussianity • Typical hand-developed skill-by-item matrix [Figure: binary skills × items matrix]

  23. Result of Gaussian assumption [Figure: rows of the true and recovered V matrices]

  24. Result of Gaussian assumption [Figure: rows of the true and recovered V matrices]

  25. The ugly: MLE only • PCA yields maximum-likelihood estimate • Good, right? • sadly, the usual reasons to want the MLE don’t apply here • e.g., consistency: variance and bias of estimates of U and V do not approach 0 (unless #items/student and #students/item → ∞) • Result: MLE is typically far too confident of itself

  26. Too certain: example [Figure: learned coefficients (e.g., a row of U) and the resulting predictions]

  27. Result: “fold-in problem” • Nonsensical results when trying to apply learned model to a new student or item • Similar to overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples • Unlike overfitting, fold-in problem doesn’t necessarily go away with more data

  28. Summary: 3 problems w/ PCA • Can’t handle nonlinearity • Can’t handle non-Gaussian distributions • Uses MLE only (==> fold-in problem) • Let’s look at each problem in turn

  29. Nonlinearity • In PCA, had Xij ≈ Ui ⋅ Vj • What if • Xij ≈ exp(Ui ⋅ Vj) • Xij ≈ sigmoid(Ui ⋅ Vj) • …

  30. Non-Gaussianity • In PCA, had Xij ~ Normal(μ), μ = Ui ⋅ Vj • What if • Xij ~ Poisson(μ) • Xij ~ Binomial(p) • …

  31. Exponential family review • Exponential family of distributions: • P(X | θ) = P0(X) exp(X⋅θ – G(θ)) • G(θ) is always strictly convex, differentiable on interior of domain • means G’ is strictly monotone (strictly generalized monotone in 2D or higher)

  32. Exponential family review • Exponential family PDF: • P(X | θ) = P0(X) exp(X⋅θ – G(θ)) • Surprising result: G'(θ) = g(θ) = E(X | θ) • g & g⁻¹ = "link function" • θ = "natural parameter" • E(X | θ) = "expectation parameter"
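The "surprising result" above follows from normalization of the density; a one-line derivation (continuous case, shown for concreteness):

```latex
% Differentiate the normalization condition with respect to \theta:
\int P_0(x)\, e^{x\theta - G(\theta)}\, dx = 1
\;\Longrightarrow\;
\int P_0(x)\,\bigl(x - G'(\theta)\bigr)\, e^{x\theta - G(\theta)}\, dx = 0
\;\Longrightarrow\;
\mathbb{E}(X \mid \theta) = G'(\theta) = g(\theta).
```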

  33. Examples • Normal(mean) • g = identity • Poisson(log rate) • g = exp • Binomial(log odds) • g = sigmoid
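A minimal sketch of these three mean maps g(θ) = E(X | θ), purely for illustration (the dictionary layout and labels are not from the talk):

```python
# Minimal sketch: mean map g(theta) = E(X | theta) for the three families above.
import numpy as np

links = {
    "Normal (mean)":       lambda theta: theta,                         # g = identity
    "Poisson (log rate)":  lambda theta: np.exp(theta),                 # g = exp
    "Binomial (log odds)": lambda theta: 1.0 / (1.0 + np.exp(-theta)),  # g = sigmoid
}

theta = 0.3
for family, g in links.items():
    print(f"{family}: E(X | theta={theta}) = {g(theta):.4f}")
```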

  34. Nonlinear & non-Gaussian • Let P(X | θ) be an exponential family with natural parameter θ • Predict Xij ~ P(X | θij), where θij = Ui ⋅ Vj • e.g., in Poisson, E(Xij) = exp(θij) • e.g., in Binomial, E(Xij) = sigmoid(θij)

  35. Optimization problem • max over U, V of ∑ log P(Xij | θij) + log P(U) + log P(V), s.t. θij = Ui ⋅ Vj • “Generalized linear” or “exponential family” PCA • all P(…) terms are exponential families • analogy to GLMs [Collins et al, 2001] [Gordon, 2002] [Roy & Gordon, 2005]
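As one concrete instance, here is a sketch of this objective for the Bernoulli/logistic case, with Gaussian log-priors on U and V and a 0/1 mask W over observed entries; the function name, mask convention, and prior weight lam are assumptions, not the authors' code.

```python
# Minimal sketch: exponential-family PCA objective, Bernoulli (logistic) case.
# Maximizing sum_ij log P(X_ij | theta_ij) + log P(U) + log P(V) with
# theta_ij = U_i . V_j is written here as minimizing its negation.
import numpy as np

def neg_log_posterior(U, V, X, W, lam=1.0):
    """W is a 0/1 mask of observed entries; lam weights the Gaussian priors."""
    Theta = U @ V.T                                        # natural parameters
    # Bernoulli log-likelihood: X*theta - log(1 + exp(theta)), observed cells only
    loglik = np.sum(W * (X * Theta - np.logaddexp(0.0, Theta)))
    logprior = -0.5 * lam * (np.sum(U**2) + np.sum(V**2))  # log P(U) + log P(V)
    return -(loglik + logprior)
```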

  36. Special cases • PCA, probabilistic PCA • Poisson PCA • k-means clustering • Max-margin matrix factorization (MMMF) • Almost: pLSI, pHITS, NMF

  37. Comparison to AFM • p = probability correct • θ = student overall performance • β = skill difficulty • Q = item x skill matrix • γ = skill practice slope • T = number of practice opportunities • AFM: logit(pij) = θi + Σk Qjk βk + Σk Qjk γk Tik

  38. Theorem • In GL PCA, finding U which maximizes likelihood (holding V fixed) is a convex optimization problem • And, finding best V (holding U fixed) is a convex problem • Further, Hessian is block diagonal • So, an efficient and effective optimization algorithm: alternately improve U and V
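A sketch of the alternating scheme this theorem motivates, specialized to the logistic objective above; plain gradient half-steps are used for brevity (each convex subproblem could instead be solved more exactly, e.g. with Newton steps), and the step size and iteration count are illustrative.

```python
# Minimal sketch: alternately improve U (V fixed) and V (U fixed) for logistic EPCA.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_epca(X, W, k, lam=1.0, lr=0.05, n_outer=200, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(n_outer):
        R = W * (sigmoid(U @ V.T) - X)      # gradient of -loglik w.r.t. Theta
        U -= lr * (R @ V + lam * U)         # improve U with V held fixed (convex in U)
        R = W * (sigmoid(U @ V.T) - X)
        V -= lr * (R.T @ U + lam * V)       # improve V with U held fixed (convex in V)
    return U, V
```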

  39. Example: compressing histograms w/ Poisson PCA • Points: observed frequencies in ℝ³ (simplex with vertices A, B, C) • Hidden manifold: a 1-parameter family of multinomials
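A sketch of the kind of data this example describes: histogram (count) vectors in ℝ³ drawn from a one-parameter family of multinomials; the particular curve through the simplex and the sample sizes are illustrative assumptions.

```python
# Minimal sketch: observed frequencies from a hidden 1-parameter family of multinomials.
import numpy as np

rng = np.random.default_rng(0)

def p(t):
    """Map a hidden scalar t in [0, 1] to a point on the simplex over {A, B, C}."""
    w = np.array([np.cos(np.pi * t / 2) ** 2,
                  np.sin(np.pi * t) ** 2,
                  np.sin(np.pi * t / 2) ** 2])
    return w / w.sum()

ts = rng.uniform(0, 1, size=200)                        # hidden parameters
X = np.array([rng.multinomial(50, p(t)) for t in ts])   # observed frequencies in R^3
```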

  40. Example ITERATION 1

  41. Example ITERATION 2

  42. Example ITERATION 3

  43. Example ITERATION 4

  44. Example ITERATION 5

  45. Example ITERATION 9

  46. Remaining problem: MLE • Well-known rule of thumb: if MLE gets you in trouble due to overfitting, move to fully-Bayesian inference • Typical problem: computation • In our case, the computation is just fine if we’re a little clever • Additional wrinkle: switch to hierarchical model

  47. Bayesian hierarchical exponential-family PCA [Graphical model: unobserved shared priors R and S over unobserved U (n students × k latent factors) and V (m items × k latent factors), which generate observed X] • U: student latent factors • V: item latent factors • X: observed performance • R: shared prior for student latents • S: shared prior for item latents

  48. A little clever: MCMC [Figure: Z, P(X)]
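To illustrate what sampling here might look like, a generic random-walk Metropolis sketch for one student's latent vector given V, under the Bernoulli likelihood and a simple Gaussian prior; this is only an illustration, not the authors' actual sampler or their hierarchical priors R and S.

```python
# Minimal sketch: random-walk Metropolis over one student's latent vector u,
# given item factors V, that student's responses x (0/1) and observation mask w.
import numpy as np

def log_post(u, V, x, w, lam=1.0):
    theta = V @ u                                            # natural parameters for items
    loglik = np.sum(w * (x * theta - np.logaddexp(0.0, theta)))
    return loglik - 0.5 * lam * np.dot(u, u)                 # Gaussian prior on u

def sample_student(V, x, w, n_samples=2000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    u = np.zeros(V.shape[1])
    samples = []
    for _ in range(n_samples):
        u_prop = u + step * rng.standard_normal(u.shape)     # random-walk proposal
        if np.log(rng.uniform()) < log_post(u_prop, V, x, w) - log_post(u, V, x, w):
            u = u_prop                                       # accept
        samples.append(u.copy())
    return np.array(samples)    # posterior samples, rather than a single MLE point
```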

  49. Experimental comparison: Geometry Area 1996-1997 data • Geometry tutor: 139 items presented to 59 students • On average, each student tested on 60 items

  50. Results: hold-out error [Figure credit: Ajit Singh] • Embedding dimension for *EPCA is K = 15
