Introduction to Machine Learning for Information Retrieval Xiaolong Wang
What is Machine Learning • In short, tricks of maths • Two major tasks: • Supervised Learning: • a.k.a. Regression, Classification… • Unsupervised Learning: • a.k.a. data manipulation, clustering…
Supervised Learning • Label: usually manually labeled • Data: data representation, usually as a vector • Prediction Function: selecting, from a predefined family of functions, the one that gives the best prediction (illustrated in the slide for both classification and regression)
Supervised Learning • Two formulations: • F1: Given a set of (Xi, Yi), learn a function f with f(Xi) ≈ Yi • Yi • Binary: spam vs. non-spam • Numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1) • Xi • Number of words, occurrence of each word, … • f • usually a linear function
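A minimal sketch of formulation F1 in Python (the features, labels, and weights below are illustrative, not from the slides): each document becomes a feature vector Xi, the label Yi is +1 (spam) or −1 (non-spam), and prediction is the sign of the linear score wTXi.

```python
import numpy as np

# Toy features per document: [number of words, count of "free", count of "meeting"]
X = np.array([
    [120, 9, 0],   # a spammy-looking document
    [ 80, 0, 3],   # a legitimate-looking document
])
Y = np.array([+1, -1])            # true labels: +1 = spam, -1 = non-spam

w = np.array([0.0, 1.0, -1.0])    # hypothetical learned weight vector

scores = X @ w                    # wT Xi for every example
predictions = np.sign(scores)     # thresholding the linear score
print(predictions)                # [ 1. -1.] -- matches Y here
```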
Supervised Learning • Two formulations: • F2: Given a set of (Xi, Yi), learn a scoring function F(X, Y) such that Yi = argmaxY F(Xi, Y) • Yi: more complex label than binary or numeric • Multiclass learning: entertainment vs. sports vs. politics… • Structural learning: syntactic parsing • F2 is more general than F1, since it scores (X, Y) pairs jointly instead of mapping X directly to Y
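A minimal sketch of formulation F2 for the multiclass case (class names and weights are illustrative): keep one weight vector per class and predict the class whose joint score F(X, Y) = wYTX is largest.

```python
import numpy as np

classes = ["entertainment", "sports", "politics"]

# Hypothetical per-class weight vectors, one row per class.
W = np.array([
    [ 0.9, -0.2, 0.1],
    [-0.3,  1.1, 0.0],
    [ 0.0,  0.2, 1.3],
])

x = np.array([0.5, 2.0, 0.1])            # feature vector of one document

scores = W @ x                           # F(x, y) for every class y
print(classes[int(np.argmax(scores))])   # "sports" for this toy input
```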
Supervised Learning • Training • Optimization: • Loss: difference b/w the true label Yi and the predicted score wTXi • Squared Loss (regression): (Yi − wTXi)2 • Hinge Loss (classification): max(0, 1 − Yi · wTXi) • Logistic Loss (classification): log(1 + exp(−Yi · wTXi))
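The three losses, written out as a minimal sketch (y is the true label, +1/−1 for classification or a real value for regression; s = wTx is the linear score):

```python
import numpy as np

def squared_loss(y, s):
    """Regression: (y - s)^2."""
    return (y - s) ** 2

def hinge_loss(y, s):
    """Classification: max(0, 1 - y*s); zero once the margin y*s exceeds 1."""
    return np.maximum(0.0, 1.0 - y * s)

def logistic_loss(y, s):
    """Classification: log(1 + exp(-y*s)); a smooth surrogate for 0/1 error."""
    return np.log1p(np.exp(-y * s))

# A confidently correct prediction pays ~0, a wrong one pays a lot:
for y, s in [(+1, 2.5), (+1, -1.0)]:
    print(hinge_loss(y, s), logistic_loss(y, s))
```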
Supervised Learning • Training • Optimization: • Regularization: penalize model complexity, e.g. minimize the total loss plus λ||w||2 • Without regularization: overfitting, i.e. the model fits noise in the training data and generalizes poorly
Supervised Learning • Training • Optimization: • Regularization: a large margin corresponds to a small ||w||, so penalizing ||w|| encourages a large-margin classifier
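A minimal sketch of the regularized training objective (data and lambda are illustrative): the empirical loss plus an L2 penalty that keeps ||w|| small.

```python
import numpy as np

def objective(w, X, Y, lam=0.1):
    scores = X @ w
    hinge = np.maximum(0.0, 1.0 - Y * scores)   # per-example hinge loss
    return hinge.sum() + lam * np.dot(w, w)     # loss + lambda * ||w||^2

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
Y = np.array([+1, +1, -1])
print(objective(np.array([0.5, 0.5]), X, Y))    # 0.05 for this toy w
```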
Supervised Learning • Optimization: • The art of minimization • Unconstrained: • First order: gradient descent • Second order: Newton's method • Stochastic: stochastic gradient descent (SGD) • Constrained: • Active set method • Interior point method • Alternating Direction Method of Multipliers (ADMM)
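A minimal SGD sketch for the L2-regularized logistic loss (step size, lambda, epochs, and data are illustrative): each update uses the gradient of a single example rather than the full sum.

```python
import numpy as np

def sgd(X, Y, lam=0.01, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Y)):
            s = X[i] @ w
            # gradient of log(1 + exp(-y*s)) w.r.t. w is -y*x / (1 + exp(y*s))
            grad = -Y[i] * X[i] / (1.0 + np.exp(Y[i] * s))
            grad += 2 * lam * w               # gradient of lambda * ||w||^2
            w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
Y = np.array([+1, +1, -1, -1])
print(sgd(X, Y))    # a w that separates the two toy classes
```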
Unsupervised Learning • Clustering: e.g. k-means • Dimensionality reduction: e.g. PCA
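A minimal sketch of both tools using scikit-learn (the random two-blob data is illustrative); note that no labels are used anywhere:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)),    # blob around the origin
               rng.normal(4, 1, (50, 5))])   # blob shifted away

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)  # project 5 dims down to 2

print(clusters[:5], X_2d.shape)              # cluster ids, (100, 2)
```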
Machine Learning for Information Retrieval • Learning to Rank • Topic Modeling
Learning to Rank http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf
Learning to Rank • X = (q, d) • Features: e.g. degree of matching between query q and document d
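A minimal sketch of hand-crafted matching features for a (q, d) pair (these particular features are illustrative, not from the tutorial):

```python
def extract_features(query: str, doc: str):
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms)
    return [
        overlap,                           # raw count of query-term hits
        overlap / max(len(d_terms), 1),    # hit density in the document
        len(q_terms & set(d_terms)),       # distinct query terms covered
    ]

print(extract_features("machine learning",
                       "learning to rank with machine learning"))
# [3, 0.5, 2]
```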
Learning to Rank • Labels: • Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1 • Pairwise: doc A > doc B, doc C > doc D • Listwise: permutation • Acquisition: • Expert annotation • Clickthrough: a clicked document is preferred to documents skipped above it
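A minimal sketch of the "clicked > skipped above" heuristic for mining pairwise labels from a click log (the ranking and clicks are illustrative):

```python
def pairs_from_clicks(ranking, clicked):
    """Prefer each clicked document over every unclicked document
    that was ranked above it."""
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            prefs += [(doc, skipped) for skipped in ranking[:i]
                      if skipped not in clicked]
    return prefs

print(pairs_from_clicks(["d1", "d2", "d3", "d4"], clicked={"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]
```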
Learning to Rank • Prediction function: • Extract Xq,d from (q, d) • Rank documents by sorting on wTXq,d • Loss function: • Pointwise • Pairwise • Listwise
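A minimal sketch of prediction time (w and the feature vectors are illustrative): score every candidate document and sort descending on wTXq,d.

```python
import numpy as np

w = np.array([0.7, 0.2, 0.1])     # hypothetical learned weights
candidates = {                    # Xq,d for each candidate document
    "d1": np.array([3.0, 0.5, 1.0]),
    "d2": np.array([1.0, 0.9, 0.0]),
    "d3": np.array([2.0, 0.1, 2.0]),
}

ranked = sorted(candidates, key=lambda d: candidates[d] @ w, reverse=True)
print(ranked)                     # ['d1', 'd3', 'd2']
```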
Learning to Rank • Pointwise: • Regression: squared loss on the relevance grade • Pairwise: • Classification: (q, d1) > (q, d2) => positive example Xq,d1 − Xq,d2 • Listwise: • Optimization of a ranking measure, e.g. NDCG@k = (1/Zk) Σi=1..k (2^ri − 1) / log2(i + 1), where ri is the relevance of the document at rank i, 2^ri − 1 is its gain, 1/log2(i + 1) is the discount of rank i, and the normalizer Zk is the DCG of the ideal ranking, so NDCG (Normalized Discounted Cumulative Gain) lies in [0, 1]
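A minimal sketch of NDCG@k with the gain/discount/normalization pieces spelled out (the relevance grades are illustrative; with 0/1 relevance the gain 2^r − 1 reduces to r itself):

```python
import numpy as np

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    gains = 2.0 ** rels - 1.0                         # gain of each document
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i + 1), i = 1..k
    return float(np.sum(gains / discounts))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)   # normalizer Z_k
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))   # ~0.99: close to the ideal order
```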
Topic Modeling • Topic Modeling • Factorization of the words × documents matrix • Clustering of documents • Projects documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics) • What is a Topic? • A linear combination of words • Nonnegative weights that sum to 1 => a probability distribution over words
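A minimal sketch of the factorization view using NMF (the tiny count matrix is illustrative); NMF keeps both factors nonnegative, matching the view of a topic as nonnegative weights over words:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.array([            # rows = words, columns = documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(A)    # words x topics
H = model.components_         # topics x documents
print(np.round(W @ H, 1))     # approximately reconstructs A
```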
Topic Modeling • Generative models: story-telling • Latent Semantic Analysis, LSA • Probabilistic Latent Semantic Analysis, PLSA • Latent Dirichlet Allocation, LDA
Topic Modeling • Latent Semantic Analysis (LSA): • Deerwester et al. (1990) • Singular Value Decomposition (SVD) applied to the words × documents matrix • How to interpret negative values?
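A minimal LSA sketch via truncated SVD on the same kind of count matrix (the data is illustrative); note that the factors can be negative, which is exactly the interpretability issue raised above:

```python
import numpy as np

A = np.array([            # rows = words, columns = documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation
print(np.round(U[:, :k], 2))                  # word-topic weights, some < 0
```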
Topic Modeling • Probabilistic Latent Semantic Analysis (PLSA): • Thomas Hofmann (1999) • How words/documents are generated (as described by probability): the words × documents matrix factors into (words × topics) × (topics × documents), i.e. P(w, d) = P(d) Σz P(w | z) P(z | d) • Maximum Likelihood: maximize Σ(w, d) n(w, d) log Σz P(w | z) P(z | d) over observed (document, word) pairs such as (d1, voyage), (d2, sky), (d1, fish), (d3, trip), (d1, boat), (d2, voyage), …
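A minimal EM sketch for PLSA (count matrix, topic count, and iteration count are illustrative). The E-step computes the posterior over topics for each (word, document) cell; the M-step re-estimates P(w|z) and P(z|d) from the expected counts:

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.array([            # n(w, d): rows = words, columns = documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)
W, D, Z = N.shape[0], N.shape[1], 2

p_w_z = rng.random((W, Z)); p_w_z /= p_w_z.sum(0)   # P(w|z), columns sum to 1
p_z_d = rng.random((Z, D)); p_z_d /= p_z_d.sum(0)   # P(z|d), columns sum to 1

for _ in range(50):
    # E-step: q[w, d, z] = P(z | w, d), proportional to P(w|z) P(z|d)
    q = p_w_z[:, None, :] * p_z_d.T[None, :, :]      # shape W x D x Z
    q /= q.sum(axis=2, keepdims=True) + 1e-12
    # M-step: redistribute the observed counts by q and renormalize
    c = N[:, :, None] * q
    p_w_z = c.sum(axis=1); p_w_z /= p_w_z.sum(0)
    p_z_d = c.sum(axis=0).T; p_z_d /= p_z_d.sum(0)

print(np.round(p_w_z, 2))    # learned topic-word distributions
```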
Topic Modeling • Latent Dirichlet Allocation (LDA) • David Blei et al. (2003) • PLSA with a Dirichlet prior • What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian • Tossing a coin with head probability r (the parameter to be estimated): after observing h heads and t tails, the posterior is p(r | data) ∝ likelihood × prior = r^h (1 − r)^t g(r) • Canonical Maximum Likelihood (frequentist) is a special case of Bayesian Maximum a Posteriori (MAP) when the prior g(r) is uniform • Bayesian as an inference method: • Estimate r: posterior mean, or MAP • Estimate the probability that a new toss is heads: the posterior mean ∫ r p(r | data) dr
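A minimal sketch of the coin example with a conjugate Beta prior (the prior and counts are illustrative); conjugacy means the posterior is again a Beta, so the estimates are closed-form:

```python
a, b = 2.0, 2.0      # prior g(r) = Beta(2, 2); the uniform prior is Beta(1, 1)
h, t = 7, 3          # observed heads and tails

post_a, post_b = a + h, b + t                     # posterior is Beta(a+h, b+t)

mle = h / (h + t)                                 # frequentist ML estimate
map_est = (post_a - 1) / (post_a + post_b - 2)    # posterior mode (MAP)
post_mean = post_a / (post_a + post_b)            # P(next toss is heads)

print(mle, map_est, post_mean)                    # 0.7  0.667  0.643
```

With the uniform prior Beta(1, 1), the MAP estimate reduces to the ML estimate h / (h + t), which is the special case noted above.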
Topic Modeling • Latent Dirichlet Allocation (LDA) • David Blei et al. (2003) • PLSA with a Dirichlet prior • What additional info do we know about the factors P(w | z) and P(z | d)? • Sparsity: • each topic has nonzero probability on few words; • each document has nonzero probability on few topics • The parameters of a multinomial are nonnegative and sum to 1, i.e. they live on a simplex • The Dirichlet distribution defines a probability distribution on the simplex and can encourage sparsity
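A minimal sketch of how the Dirichlet concentration parameter controls sparsity on the simplex (the alpha values are illustrative): every sample is nonnegative and sums to 1, and a small alpha pushes most of the mass onto a few coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

dense = rng.dirichlet([10.0] * 5)    # large alpha: weights spread out
sparse = rng.dirichlet([0.1] * 5)    # small alpha: a few entries dominate

print(np.round(dense, 2), dense.sum())    # roughly uniform, sums to 1
print(np.round(sparse, 2), sparse.sum())  # sparse-looking, sums to 1
```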