New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani
The talk • Classification with non-iid data • A source of non-iidness: relational information • A new family of models, and what is new • Applications to classification of text documents
Standard setup • [Figure: plate diagram of the standard i.i.d. setup — N training pairs (X, Y) and a new test pair (Xnew, Ynew)]
Prediction with non-iid data • [Figure: graphical model where Ynew depends not only on Xnew but also on the other labels Y1, Y2 and their inputs X1, X2]
Where does the non-iid information come from? • Relations • Links between data points • Webpage A links to Webpage B • Movie A and Movie B are often rented together • Relations as data • “Linked webpages are likely to present similar content” • “Movies that are rented together often have correlated personal ratings”
The vanilla relational domain: time series • Relations: “Yi precedes Yi+k”, k > 0 • Dependencies: “Markov structure G” • [Figure: Markov chain Y1 → Y2 → Y3 → …]
A model for integrating link data • How to model the dependencies among class labels? • Movies that are often rented together might share all sorts of common, unmeasured factors • These hidden common causes affect the ratings
Example • [Figure: MovieFeatures(M1) and MovieFeatures(M2) feed into Rating(M1) and Rating(M2), which also share hidden common causes: same director? same genre? both released in the same year? target the same age groups?]
Integrating link data • Of course, many of these common causes will be measured • Many will not • Idea: • Postulate a hidden common cause structure, based on the relations • Define a model that is Markov with respect to this structure • Design an adequate inference algorithm
Example: Political Books database • A network of books about recent US politics sold by the online bookseller Amazon.com • Valdis Krebs, http://www.orgnet.com/ • Relations: frequent co-purchasing of books by the same buyers • Political inclination factors as the hidden common causes
Political Books database • Features: • I collected the Amazon.com front page for each of the books • Bag-of-words, tf-idf features, normalized to unity • Task: • Binary classification: “liberal” or “not-liberal” books • 43 liberal books out of 105
Contribution • We will • show a classical multiple linear regression model • build a relational variation • generalize it with a more complex set of independence constraints • generalize it using Gaussian processes
Seemingly unrelated regression (Zellner, 1962) • Y = (Y1, Y2), X = (X1, X2) • Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless • Analogously for Y2 ~ X1, X2 (X1 vanishes) • Now suppose you regress Y1 ~ X1, X2, Y2 • Every variable becomes a relevant predictor • [Figure: regression graphs for Y1 ~ X1, X2 and for Y1 ~ X1, X2, Y2]
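As an aside (not from the slides), here is a minimal simulation sketch of the SUR effect: when the errors of the two regressions are correlated, Y2 becomes a useful predictor of Y1 even though X2 alone is not. All coefficients and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two regressions with correlated errors (the SUR setting):
#   Y1 = X1 + e1,   Y2 = X2 + e2,   Corr(e1, e2) = 0.8
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
E = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)
Y1 = X1 + E[:, 0]
Y2 = X2 + E[:, 1]

def r2(y, design):
    """R^2 of an ordinary least-squares fit of y on the given columns."""
    A = np.column_stack([np.ones(len(y))] + design)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

print("Y1 ~ X1, X2      R^2 =", round(r2(Y1, [X1, X2]), 3))      # X2 adds ~nothing
print("Y1 ~ X1, X2, Y2  R^2 =", round(r2(Y1, [X1, X2, Y2]), 3))  # Y2 (and hence X2) helps
```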
Graphically, with latents • [Figure: X = Capital(GE), Capital(Westinghouse); Y = Stock price(GE), Stock price(Westinghouse); hidden Industry factor 1, Industry factor 2, …, Industry factor k as common causes of the Ys]
The Directed Mixed Graph (DMG) • [Figure: the same example with the latent industry factors summarized by a bi-directed edge between Stock price(GE) and Stock price(Westinghouse); X = Capital(GE), Capital(Westinghouse)] • Richardson (2003), Richardson and Spirtes (2002)
A new family of relational models • Inspired by SUR • Structure: DMGs • Edges postulated from the given relations • [Figure: Xi → Yi for each data point, with bi-directed edges among the related labels Y1, …, Y5]
Model for binary classification • Nonparametric probit regression • Zero-mean Gaussian process prior over f(.) • P(yi = 1 | xi) = P(y*(xi) > 0), where y*(xi) = f(xi) + εi, εi ~ N(0, 1)
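A minimal sketch (illustrative, not the authors' implementation) of the pieces above: with ε ~ N(0, 1), P(y = 1 | x) = Φ(f(x)). The RBF kernel choice and the toy inputs are assumptions.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel, one common (assumed) choice of GP prior for f."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                  # toy inputs
K = rbf_kernel(X, X) + 1e-8 * np.eye(6)      # GP prior covariance of f at the inputs
f = rng.multivariate_normal(np.zeros(6), K)  # one draw of the latent function values

# Probit link: y* = f(x) + eps, eps ~ N(0, 1), so P(y = 1 | x) = Phi(f(x)).
p_y1 = norm.cdf(f)
print(np.round(p_y1, 3))
```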
Relational dependency model • Make the error terms {εi} dependent: jointly multivariate Gaussian • For convenience, decouple each error into two terms: ε = ε* + ξ
Dependency model: the decomposition • ε = ε* + ξ: the ε* are independent from each other (marginally independent), while ξ is dependent according to the relations • Correspondingly, Cov(ε) = Σ* + Σ: Σ* is diagonal; Σ is not diagonal, with 0s only on unrelated pairs
Dependency model: the decomposition • y*(xi) = f(xi) + εi = f(xi) + ξi + ε*i = g(xi) + ε*i • If K was the original kernel matrix for f(.), the covariance of g(.) is simply K + Σ
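A small bookkeeping sketch of the decomposition (matrix values here are illustrative): the latent g(.) = f(.) + ξ has covariance K + Σ, and ε* only contributes a small extra diagonal term.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# K: kernel matrix of f at the training inputs (any valid kernel; a random PD matrix here).
A = rng.normal(size=(n, n))
K = A @ A.T + 1e-6 * np.eye(n)

# Sigma: relational error covariance with zeros on unrelated pairs; in this toy
# example only points 0-1 and 2-3 are related.
Sigma = np.eye(n)
Sigma[0, 1] = Sigma[1, 0] = 0.5
Sigma[2, 3] = Sigma[3, 2] = 0.5

Cov_g = K + Sigma                      # covariance of g(.) = f(.) + xi
Cov_ystar = Cov_g + 1e-4 * np.eye(n)   # plus the small independent eps* term
print(np.all(np.linalg.eigvalsh(Cov_ystar) > 0))  # still positive definite
```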
Approximation • Posterior for f(.), g(.) is a truncated Gaussian, hard to integrate • Approximate the posterior with a Gaussian • Expectation Propagation (Minka, 2001) • The reason for ε* becomes apparent in the EP approximation
Approximation • Likelihood does not factorize over f(.), but factorizes over g(.): p(g | X, y) ∝ p(g | X) ∏i p(yi | g(xi)) • Approximate each factor p(yi | g(xi)) with a Gaussian • If ε* were 0, yi would be a deterministic function of g(xi)
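A toy illustration (assumed notation, not from the slides) of the role of ε*: with noise standard deviation σ* > 0, each factor p(yi | g(xi)) = Φ(yi·g(xi)/σ*) is a smooth probit factor that a Gaussian can approximate, and it collapses to a hard 0/1 step as σ* → 0.

```python
import numpy as np
from scipy.stats import norm

def likelihood_factor(y_i, g_i, sigma_star):
    """p(y_i | g(x_i)) for y_i in {-1, +1}: Phi(y_i * g_i / sigma_star)."""
    return norm.cdf(y_i * g_i / sigma_star)

g_i = 0.3
for s in [1.0, 0.1, 1e-3]:
    print(s, likelihood_factor(+1, g_i, s), likelihood_factor(-1, g_i, s))
# As sigma_star shrinks, the factor approaches a step function: y_i becomes a
# deterministic function of g(x_i), which is what the small eps* term avoids.
```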
Generalizations • This can be generalized to any number of relations: ε = ε* + ξ1 + ξ2 + ξ3, one relational error term per relation type • [Figure: Y1, …, Y5 linked by several types of bi-directed edges]
But how to parameterize Σ? • Non-trivial • Desiderata: • Positive definite • Zeroes in the right places • Few parameters, but a broad family • Easy to compute
But how to parameterize Σ? • “Poking zeroes” into a positive definite matrix doesn’t work • [Figure: a 3×3 positive definite covariance over Y1, Y2, Y3 that becomes non-positive-definite after one off-diagonal entry is set to zero]
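A quick numerical check of this point (the 3×3 matrix below is my own example, not the one from the slide):

```python
import numpy as np

# A valid 3x3 covariance matrix (positive definite: all eigenvalues > 0).
S = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])
print(np.linalg.eigvalsh(S))        # all positive

# "Poke a zero" into the (Y1, Y3) entry to encode "Y1 and Y3 unrelated".
S_poked = S.copy()
S_poked[0, 2] = S_poked[2, 0] = 0.0
print(np.linalg.eigvalsh(S_poked))  # the smallest eigenvalue is now negative
```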
Approach #1 • Assume we can find all cliques of the bi-directed subgraph of relations • Create a “factor analysis model” where • for each clique Ci there is a latent variable Li • the members of each clique are the only children of Li • the set of latents {L} is a set of independent N(0, 1) variables • all coefficients in the model are equal to 1
Approach #1 • Example: Y1 = L1 + ε1, Y2 = L1 + L2 + ε2 • [Figure: cliques over {Y1, Y2, Y3, Y4} with latents L1, L2 as parents of their clique members]
Approach #1 • In practice, we set the variance of each εi to a small constant (10^-4) • The covariance between any two Ys is proportional to the number of cliques they belong to together • Their correlation is also inversely proportional to the number of cliques each belongs to individually
Approach #1 • Let U be the correlation matrix obtained from the proposed procedure • To define the error covariance, use a single hyperparameter γ ∈ [0, 1]: Σ = (I − γ Udiag) + γU = (1 − γ)I + γU
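A sketch of Approach #1 as reconstructed above; the clique list, the value of γ, and the helper names are illustrative, and the final mixing step assumes the form Σ = (1 − γ)I + γU.

```python
import numpy as np

def clique_covariance(n, cliques, noise_var=1e-4):
    """Cov(Y) when Y_i is the sum of the N(0,1) latents of the cliques containing i, plus noise."""
    C = noise_var * np.eye(n)
    for clique in cliques:
        for i in clique:
            for j in clique:
                C[i, j] += 1.0          # one shared unit-coefficient latent per clique
    return C

def correlation(C):
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# Toy bi-directed subgraph with cliques {0,1,2} and {1,2,3}; gamma in [0, 1].
n, cliques, gamma = 4, [(0, 1, 2), (1, 2, 3)], 0.7
U = correlation(clique_covariance(n, cliques))
Sigma = (1 - gamma) * np.eye(n) + gamma * U   # zeros stay on unrelated pairs (e.g. 0-3)
print(np.round(Sigma, 3))
print(np.all(np.linalg.eigvalsh(Sigma) > 0))  # positive definite
```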
Approach #1 • Notice: if everybody is connected, the model is exchangeable and simple • [Figure: a single clique {Y1, Y2, Y3, Y4} with one shared latent L1]
Approach #1 • Finding all cliques is “impossible” — what to do? • Triangulate and then extract cliques • Can be done in polynomial time • This is a relaxation of the problem, since constraints are thrown away • Can have bad side effects: the “blow-up” effect
Approach #2 • Don’t look for cliques: create a latent for each pair of related variables • Very fast to compute, zeroes respected • [Figure: each bi-directed edge, e.g. Y1–Y3, gets its own latent such as L13]
Approach #2 • Correlations, however, are given by Corr(εi, εj) = 1 / sqrt(#neigh(i) · #neigh(j)) • This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common • We call this the “pulverization” effect
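A sketch of Approach #2 (one N(0, 1) latent with unit coefficients per bi-directed edge); the star-graph example is mine, and it reproduces the 1/sqrt(#neigh(i)·#neigh(j)) correlation above.

```python
import numpy as np

def pairwise_latent_covariance(adj, noise_var=1e-4):
    """Cov(Y) when each edge (i, j) contributes its own N(0,1) latent to Y_i and Y_j."""
    adj = np.asarray(adj, dtype=float)
    degrees = adj.sum(axis=1)
    # Var(Y_i) = #neighbours(i) + noise; Cov(Y_i, Y_j) = 1 if related, else 0.
    return adj + np.diag(degrees + noise_var)

# Star graph: node 0 is related to nodes 1..4.
adj = np.zeros((5, 5))
adj[0, 1:] = adj[1:, 0] = 1.0
C = pairwise_latent_covariance(adj)
d = np.sqrt(np.diag(C))
corr = C / np.outer(d, d)
print(round(corr[0, 1], 3))          # ~ 1 / sqrt(4 * 1) = 0.5
print(round(1 / np.sqrt(4 * 1), 3))  # matches the closed form
```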
Comparison: undirected models • Generative stories • Conditional random fields (Lafferty, McCallum & Pereira, 2001) • Relational GPs (Chu et al., 2006); Richardson and Spirtes (2002) • [Figure: undirected model over Y1, Y2, Y3 with inputs X1, X2, X3]
Chu Wei’s model • Dependency family equivalent to a pairwise Markov random field • [Figure: inputs X1, X2, X3; latent Y1*, Y2*, Y3*; labels Y1, Y2, Y3; observed relation indicators R12 = 1, R23 = 1]
Properties of undirected models • MRFs propagate information among “test” points • [Figure: network over Y1–Y12 illustrating paths that run through test points]
Properties of DMG models • DMGs propagate information among “training” points • [Figure: the same network over Y1–Y12, with information flowing through training points]
Properties of DMG models • In a DMG, each “test” point will have a whole “training component” in its Markov blanket • [Figure: the same network over Y1–Y12, highlighting the training component in a test point’s Markov blanket]
Properties of DMG models • It seems acceptable that a typical relational domain will not have an “extrapolation” pattern like typical “structured output” problems (e.g., NLP domains) do • Ultimately, the choice of model comes down to the question: • “hidden common causes” or “relational indicators”?
Experiment #1 • A subset of the CORA database • 4,285 machine learning papers, 7 classes • Links: citations between papers • “Hidden common cause” interpretation: the particular ML subtopic being treated • Experiment: 7 binary classification problems (each class vs. the others, e.g., Class 5 vs. others) • Criterion: AUC
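For concreteness, the AUC criterion for one binary split can be computed as below; the labels and scores here are synthetic placeholders, not the CORA results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=100)            # e.g. "Class 5" vs. others
y_score = y_true * 0.5 + rng.normal(size=100)    # placeholder predictive scores
print(round(roc_auc_score(y_true, y_score), 3))  # AUC of the induced ranking
```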
Experiment #1 • Comparisons: • Regular GP • Regular GP + citation adjacency matrix • Chu Wei’s Relational GP (RGP) • Our method, the miXed graph GP (XGP) • This is a fairly easy task, so we analyze the low-sample regime • Uses 1% of the data (roughly 10 data points for training) • Not that useful for XGP, but more useful for RGP