Introduction to Machine Learning for Information Retrieval Xiaolong Wang
What is Machine Learning • In short, tricks of maths • Two major tasks: • Supervised Learning: • a.k.a. Regression, Classification… • Unsupervised Learning: • a.k.a. data manipulation, clustering…
Supervised Learning • Label: usually manually labeled • Data: data representation, usually as a vector • Prediction Function: selecting, from a predefined family of functions, the one that gives the best prediction (illustrated in the slide for both classification and regression)
Supervised Learning • Two formulations: • F1: Given a set of (Xi, Yi), learn a function f with f(Xi) ≈ Yi • Yi • Binary: spam vs. non-spam • Numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1) • Xi • Number of words, occurrence of each word, … • f • usually a linear function
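A minimal sketch of formulation F1 in Python (the features, labels, and weights below are illustrative, not from the slides): each document becomes a feature vector Xi, the label Yi is +1 (spam) or −1 (non-spam), and prediction is the sign of the linear score wTXi.

```python
import numpy as np

# Toy features per document: [number of words, count of "free", count of "meeting"]
X = np.array([
    [120, 9, 0],   # a spammy-looking document
    [ 80, 0, 3],   # a legitimate-looking document
])
Y = np.array([+1, -1])            # true labels: +1 = spam, -1 = non-spam

w = np.array([0.0, 1.0, -1.0])    # hypothetical learned weight vector

scores = X @ w                    # wT Xi for every example
predictions = np.sign(scores)     # thresholding the linear score
print(predictions)                # [ 1. -1.] -- matches Y here
```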
Supervised Learning • Two formulations: • F2: Given a set of (Xi, Yi), learn a scoring function F(X, Y) such that Yi = argmaxY F(Xi, Y) • Yi: more complex label than binary or numeric • Multiclass learning: entertainment vs. sports vs. politics… • Structural learning: syntactic parsing • F2 is more general than F1, since it scores (X, Y) pairs jointly instead of mapping X directly to Y
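A minimal sketch of formulation F2 for the multiclass case (class names and weights are illustrative): keep one weight vector per class and predict the class whose joint score F(X, Y) = wYTX is largest.

```python
import numpy as np

classes = ["entertainment", "sports", "politics"]

# Hypothetical per-class weight vectors, one row per class.
W = np.array([
    [ 0.9, -0.2, 0.1],
    [-0.3,  1.1, 0.0],
    [ 0.0,  0.2, 1.3],
])

x = np.array([0.5, 2.0, 0.1])            # feature vector of one document

scores = W @ x                           # F(x, y) for every class y
print(classes[int(np.argmax(scores))])   # "sports" for this toy input
```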
Supervised Learning • Training • Optimization: • Loss: difference b/w the true label Yi and the predicted score wTXi • Squared Loss (regression): (Yi − wTXi)2 • Hinge Loss (classification): max(0, 1 − Yi · wTXi) • Logistic Loss (classification): log(1 + exp(−Yi · wTXi))
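The three losses, written out as a minimal sketch (y is the true label, +1/−1 for classification or a real value for regression; s = wTx is the linear score):

```python
import numpy as np

def squared_loss(y, s):
    """Regression: (y - s)^2."""
    return (y - s) ** 2

def hinge_loss(y, s):
    """Classification: max(0, 1 - y*s); zero once the margin y*s exceeds 1."""
    return np.maximum(0.0, 1.0 - y * s)

def logistic_loss(y, s):
    """Classification: log(1 + exp(-y*s)); a smooth surrogate for 0/1 error."""
    return np.log1p(np.exp(-y * s))

# A confidently correct prediction pays ~0, a wrong one pays a lot:
for y, s in [(+1, 2.5), (+1, -1.0)]:
    print(hinge_loss(y, s), logistic_loss(y, s))
```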
Supervised Learning • Training • Optimization: • Regularization: penalize model complexity, e.g. minimize the total loss plus λ||w||2 • Without regularization: overfitting, i.e. the model fits noise in the training data and generalizes poorly
Supervised Learning • Training • Optimization: • Regularization: a large margin corresponds to a small ||w||, so penalizing ||w|| encourages a large-margin classifier
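A minimal sketch of the regularized training objective (data and lambda are illustrative): the empirical loss plus an L2 penalty that keeps ||w|| small.

```python
import numpy as np

def objective(w, X, Y, lam=0.1):
    scores = X @ w
    hinge = np.maximum(0.0, 1.0 - Y * scores)   # per-example hinge loss
    return hinge.sum() + lam * np.dot(w, w)     # loss + lambda * ||w||^2

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
Y = np.array([+1, +1, -1])
print(objective(np.array([0.5, 0.5]), X, Y))    # 0.05 for this toy w
```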
Supervised Learning • Optimization: • The art of minimization • Unconstrained: • First order: gradient descent • Second order: Newton's method • Stochastic: stochastic gradient descent (SGD) • Constrained: • Active set method • Interior point method • Alternating Direction Method of Multipliers (ADMM)
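A minimal SGD sketch for the L2-regularized logistic loss (step size, lambda, epochs, and data are illustrative): each update uses the gradient of a single example rather than the full sum.

```python
import numpy as np

def sgd(X, Y, lam=0.01, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Y)):
            s = X[i] @ w
            # gradient of log(1 + exp(-y*s)) w.r.t. w is -y*x / (1 + exp(y*s))
            grad = -Y[i] * X[i] / (1.0 + np.exp(Y[i] * s))
            grad += 2 * lam * w               # gradient of lambda * ||w||^2
            w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
Y = np.array([+1, +1, -1, -1])
print(sgd(X, Y))    # a w that separates the two toy classes
```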
Unsupervised Learning • Clustering: e.g. k-means • Dimensionality reduction: e.g. PCA
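A minimal sketch of both tools using scikit-learn (the random two-blob data is illustrative); note that no labels are used anywhere:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)),    # blob around the origin
               rng.normal(4, 1, (50, 5))])   # blob shifted away

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)  # project 5 dims down to 2

print(clusters[:5], X_2d.shape)              # cluster ids, (100, 2)
```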
Machine Learning for Information Retrieval • Learning to Rank • Topic Modeling
Learning to Rank http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf
Learning to Rank • X = (q, d) • Features: e.g. degree of matching between query q and document d
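A minimal sketch of hand-crafted matching features for a (q, d) pair (these particular features are illustrative, not from the tutorial):

```python
def extract_features(query: str, doc: str):
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms)
    return [
        overlap,                           # raw count of query-term hits
        overlap / max(len(d_terms), 1),    # hit density in the document
        len(q_terms & set(d_terms)),       # distinct query terms covered
    ]

print(extract_features("machine learning",
                       "learning to rank with machine learning"))
# [3, 0.5, 2]
```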
Learning to Rank • Labels: • Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1 • Pairwise: doc A > doc B, doc C > doc D • Listwise: permutation • Acquisition: • Expert annotation • Clickthrough: a clicked document is preferred to documents skipped above it
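A minimal sketch of the "clicked > skipped above" heuristic for mining pairwise labels from a click log (the ranking and clicks are illustrative):

```python
def pairs_from_clicks(ranking, clicked):
    """Prefer each clicked document over every unclicked document
    that was ranked above it."""
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            prefs += [(doc, skipped) for skipped in ranking[:i]
                      if skipped not in clicked]
    return prefs

print(pairs_from_clicks(["d1", "d2", "d3", "d4"], clicked={"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]
```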
Learning to Rank • Prediction function: • Extract Xq,d from (q, d) • Rank documents by sorting on wTXq,d • Loss function: • Pointwise • Pairwise • Listwise
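A minimal sketch of prediction time (w and the feature vectors are illustrative): score every candidate document and sort descending on wTXq,d.

```python
import numpy as np

w = np.array([0.7, 0.2, 0.1])     # hypothetical learned weights
candidates = {                    # Xq,d for each candidate document
    "d1": np.array([3.0, 0.5, 1.0]),
    "d2": np.array([1.0, 0.9, 0.0]),
    "d3": np.array([2.0, 0.1, 2.0]),
}

ranked = sorted(candidates, key=lambda d: candidates[d] @ w, reverse=True)
print(ranked)                     # ['d1', 'd3', 'd2']
```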
Learning to Rank • Pointwise: • Regression: squared loss on the relevance grade • Pairwise: • Classification: (q, d1) > (q, d2) => positive example Xq,d1 − Xq,d2 • Listwise: • Optimization of a ranking measure, e.g. NDCG@k = (1/Zk) Σi=1..k (2^ri − 1) / log2(i + 1), where ri is the relevance of the document at rank i, 2^ri − 1 is its gain, 1/log2(i + 1) is the discount of rank i, and the normalizer Zk is the DCG of the ideal ranking, so NDCG (Normalized Discounted Cumulative Gain) lies in [0, 1]
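A minimal sketch of NDCG@k with the gain/discount/normalization pieces spelled out (the relevance grades are illustrative; with 0/1 relevance the gain 2^r − 1 reduces to r itself):

```python
import numpy as np

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    gains = 2.0 ** rels - 1.0                         # gain of each document
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i + 1), i = 1..k
    return float(np.sum(gains / discounts))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)   # normalizer Z_k
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))   # ~0.99: close to the ideal order
```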
Topic Modeling • Topic Modeling • Factorization of the words × documents matrix • Clustering of documents • Projects documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics) • What is a Topic? • A linear combination of words • Nonnegative weights that sum to 1 => a probability distribution over words
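A minimal sketch of the factorization view using NMF (the tiny count matrix is illustrative); NMF keeps both factors nonnegative, matching the view of a topic as nonnegative weights over words:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.array([            # rows = words, columns = documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(A)    # words x topics
H = model.components_         # topics x documents
print(np.round(W @ H, 1))     # approximately reconstructs A
```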
Topic Modeling • Generative models: story-telling • Latent Semantic Analysis, LSA • Probabilistic Latent Semantic Analysis, PLSA • Latent Dirichlet Allocation, LDA
Topic Modeling • Latent Semantic Analysis (LSA): • Deerwester et al. (1990) • Singular Value Decomposition (SVD) applied to the words × documents matrix • How to interpret negative values?
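A minimal LSA sketch via truncated SVD on the same kind of count matrix (the data is illustrative); note that the factors can be negative, which is exactly the interpretability issue raised above:

```python
import numpy as np

A = np.array([            # rows = words, columns = documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation
print(np.round(U[:, :k], 2))                  # word-topic weights, some < 0
```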
Topic Modeling • Probabilistic Latent Semantic Analysis (PLSA): • Thomas Hofmann (1999) • How words/documents are generated (as described by probability): the words × documents matrix factors into (words × topics) × (topics × documents), i.e. P(w, d) = P(d) Σz P(w | z) P(z | d) • Maximum Likelihood: maximize Σ(w, d) n(w, d) log Σz P(w | z) P(z | d) over observed (document, word) pairs such as (d1, voyage), (d2, sky), (d1, fish), (d3, trip), (d1, boat), (d2, voyage), …
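A minimal EM sketch for PLSA (count matrix, topic count, and iteration count are illustrative). The E-step computes the posterior over topics for each (word, document) cell; the M-step re-estimates P(w|z) and P(z|d) from the expected counts:

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.array([            # n(w, d): rows = words, columns = documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)
W, D, Z = N.shape[0], N.shape[1], 2

p_w_z = rng.random((W, Z)); p_w_z /= p_w_z.sum(0)   # P(w|z), columns sum to 1
p_z_d = rng.random((Z, D)); p_z_d /= p_z_d.sum(0)   # P(z|d), columns sum to 1

for _ in range(50):
    # E-step: q[w, d, z] = P(z | w, d), proportional to P(w|z) P(z|d)
    q = p_w_z[:, None, :] * p_z_d.T[None, :, :]      # shape W x D x Z
    q /= q.sum(axis=2, keepdims=True) + 1e-12
    # M-step: redistribute the observed counts by q and renormalize
    c = N[:, :, None] * q
    p_w_z = c.sum(axis=1); p_w_z /= p_w_z.sum(0)
    p_z_d = c.sum(axis=0).T; p_z_d /= p_z_d.sum(0)

print(np.round(p_w_z, 2))    # learned topic-word distributions
```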
Topic Modeling • Latent Dirichlet Allocation (LDA) • David Blei et al. (2003) • PLSA with a Dirichlet prior • What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian • Tossing a coin with head probability r (the parameter to be estimated): after observing h heads and t tails, the posterior is p(r | data) ∝ likelihood × prior = r^h (1 − r)^t g(r) • Canonical Maximum Likelihood (frequentist) is a special case of Bayesian Maximum a Posteriori (MAP) when the prior g(r) is uniform • Bayesian as an inference method: • Estimate r: posterior mean, or MAP • Estimate the probability that a new toss is heads: the posterior mean ∫ r p(r | data) dr
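A minimal sketch of the coin example with a conjugate Beta prior (the prior and counts are illustrative); conjugacy means the posterior is again a Beta, so the estimates are closed-form:

```python
a, b = 2.0, 2.0      # prior g(r) = Beta(2, 2); the uniform prior is Beta(1, 1)
h, t = 7, 3          # observed heads and tails

post_a, post_b = a + h, b + t                     # posterior is Beta(a+h, b+t)

mle = h / (h + t)                                 # frequentist ML estimate
map_est = (post_a - 1) / (post_a + post_b - 2)    # posterior mode (MAP)
post_mean = post_a / (post_a + post_b)            # P(next toss is heads)

print(mle, map_est, post_mean)                    # 0.7  0.667  0.643
```

With the uniform prior Beta(1, 1), the MAP estimate reduces to the ML estimate h / (h + t), which is the special case noted above.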
Topic Modeling • Latent Dirichlet Allocation (LDA) • David Blei et al. (2003) • PLSA with a Dirichlet prior • What additional info do we know about the factors P(w | z) and P(z | d)? • Sparsity: • each topic has nonzero probability on few words; • each document has nonzero probability on few topics • The parameters of a multinomial are nonnegative and sum to 1, i.e. they live on a simplex • The Dirichlet distribution defines a probability distribution on the simplex and can encourage sparsity
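A minimal sketch of how the Dirichlet concentration parameter controls sparsity on the simplex (the alpha values are illustrative): every sample is nonnegative and sums to 1, and a small alpha pushes most of the mass onto a few coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

dense = rng.dirichlet([10.0] * 5)    # large alpha: weights spread out
sparse = rng.dirichlet([0.1] * 5)    # small alpha: a few entries dominate

print(np.round(dense, 2), dense.sum())    # roughly uniform, sums to 1
print(np.round(sparse, 2), sparse.sum())  # sparse-looking, sums to 1
```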