New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani
The talk • Classification with non-iid data • A source of non-iidness: relational information • A new family of models, and what is new • Applications to classification of text documents
Standard setup • [Figure: plate diagram of the standard i.i.d. setup — N training pairs (X, Y) and a new test pair (Xnew, Ynew)]
Prediction with non-iid data • [Figure: graphical model where Ynew depends not only on Xnew but also on the other labels Y1, Y2 and their inputs X1, X2]
Where does the non-iid information come from? • Relations • Links between data points • Webpage A links to Webpage B • Movie A and Movie B are often rented together • Relations as data • “Linked webpages are likely to present similar content” • “Movies that are rented together often have correlated personal ratings”
The vanilla relational domain: time series • Relations: “Yi precedes Yi+k”, k > 0 • Dependencies: “Markov structure G” • [Figure: Markov chain Y1 → Y2 → Y3 → …]
A model for integrating link data • How to model the dependencies among class labels? • Movies that are often rented together might share all sorts of common, unmeasured factors • These hidden common causes affect the ratings
Example • [Figure: MovieFeatures(M1) and MovieFeatures(M2) feed into Rating(M1) and Rating(M2), which also share hidden common causes: same director? same genre? both released in the same year? target the same age groups?]
Integrating link data • Of course, many of these common causes will be measured • Many will not • Idea: • Postulate a hidden common cause structure, based on the relations • Define a model that is Markov with respect to this structure • Design an adequate inference algorithm
Example: Political Books database • A network of books about recent US politics sold by the online bookseller Amazon.com • Valdis Krebs, http://www.orgnet.com/ • Relations: frequent co-purchasing of books by the same buyers • Political inclination factors as the hidden common causes
Political Books database • Features: • I collected the Amazon.com front page for each of the books • Bag-of-words, tf-idf features, normalized to unity • Task: • Binary classification: “liberal” or “not-liberal” books • 43 liberal books out of 105
Contribution • We will • show a classical multiple linear regression model • build a relational variation • generalize it with a more complex set of independence constraints • generalize it using Gaussian processes
Seemingly unrelated regression (Zellner, 1962) • Y = (Y1, Y2), X = (X1, X2) • Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless • Analogously for Y2 ~ X1, X2 (X1 vanishes) • Now suppose you regress Y1 ~ X1, X2, Y2 • Every variable becomes a relevant predictor • [Figure: regression graphs for Y1 ~ X1, X2 and for Y1 ~ X1, X2, Y2]
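As an aside (not from the slides), here is a minimal simulation sketch of the SUR effect: when the errors of the two regressions are correlated, Y2 becomes a useful predictor of Y1 even though X2 alone is not. All coefficients and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two regressions with correlated errors (the SUR setting):
#   Y1 = X1 + e1,   Y2 = X2 + e2,   Corr(e1, e2) = 0.8
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
E = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)
Y1 = X1 + E[:, 0]
Y2 = X2 + E[:, 1]

def r2(y, design):
    """R^2 of an ordinary least-squares fit of y on the given columns."""
    A = np.column_stack([np.ones(len(y))] + design)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

print("Y1 ~ X1, X2      R^2 =", round(r2(Y1, [X1, X2]), 3))      # X2 adds ~nothing
print("Y1 ~ X1, X2, Y2  R^2 =", round(r2(Y1, [X1, X2, Y2]), 3))  # Y2 (and hence X2) helps
```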
Graphically, with latents • [Figure: X = Capital(GE), Capital(Westinghouse); Y = Stock price(GE), Stock price(Westinghouse); hidden Industry factor 1, Industry factor 2, …, Industry factor k as common causes of the Ys]
The Directed Mixed Graph (DMG) • [Figure: the same example with the latent industry factors summarized by a bi-directed edge between Stock price(GE) and Stock price(Westinghouse); X = Capital(GE), Capital(Westinghouse)] • Richardson (2003), Richardson and Spirtes (2002)
A new family of relational models • Inspired by SUR • Structure: DMGs • Edges postulated from the given relations • [Figure: Xi → Yi for each data point, with bi-directed edges among the related labels Y1, …, Y5]
Model for binary classification • Nonparametric probit regression • Zero-mean Gaussian process prior over f(.) • P(yi = 1 | xi) = P(y*(xi) > 0), where y*(xi) = f(xi) + εi, εi ~ N(0, 1)
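A minimal sketch (illustrative, not the authors' implementation) of the pieces above: with ε ~ N(0, 1), P(y = 1 | x) = Φ(f(x)). The RBF kernel choice and the toy inputs are assumptions.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel, one common (assumed) choice of GP prior for f."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                  # toy inputs
K = rbf_kernel(X, X) + 1e-8 * np.eye(6)      # GP prior covariance of f at the inputs
f = rng.multivariate_normal(np.zeros(6), K)  # one draw of the latent function values

# Probit link: y* = f(x) + eps, eps ~ N(0, 1), so P(y = 1 | x) = Phi(f(x)).
p_y1 = norm.cdf(f)
print(np.round(p_y1, 3))
```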
Relational dependency model • Make the error terms {εi} dependent: jointly multivariate Gaussian • For convenience, decouple each error into two terms: ε = ε* + ξ
Dependency model: the decomposition • ε = ε* + ξ: the ε* are independent from each other (marginally independent), while ξ is dependent according to the relations • Correspondingly, Cov(ε) = Σ* + Σ: Σ* is diagonal; Σ is not diagonal, with 0s only on unrelated pairs
Dependency model: the decomposition • y*(xi) = f(xi) + εi = f(xi) + ξi + ε*i = g(xi) + ε*i • If K was the original kernel matrix for f(.), the covariance of g(.) is simply K + Σ
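A small bookkeeping sketch of the decomposition (matrix values here are illustrative): the latent g(.) = f(.) + ξ has covariance K + Σ, and ε* only contributes a small extra diagonal term.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# K: kernel matrix of f at the training inputs (any valid kernel; a random PD matrix here).
A = rng.normal(size=(n, n))
K = A @ A.T + 1e-6 * np.eye(n)

# Sigma: relational error covariance with zeros on unrelated pairs; in this toy
# example only points 0-1 and 2-3 are related.
Sigma = np.eye(n)
Sigma[0, 1] = Sigma[1, 0] = 0.5
Sigma[2, 3] = Sigma[3, 2] = 0.5

Cov_g = K + Sigma                      # covariance of g(.) = f(.) + xi
Cov_ystar = Cov_g + 1e-4 * np.eye(n)   # plus the small independent eps* term
print(np.all(np.linalg.eigvalsh(Cov_ystar) > 0))  # still positive definite
```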
Approximation • Posterior for f(.), g(.) is a truncated Gaussian, hard to integrate • Approximate the posterior with a Gaussian • Expectation Propagation (Minka, 2001) • The reason for ε* becomes apparent in the EP approximation
Approximation • Likelihood does not factorize over f(.), but factorizes over g(.): p(g | X, y) ∝ p(g | X) ∏i p(yi | g(xi)) • Approximate each factor p(yi | g(xi)) with a Gaussian • If ε* were 0, yi would be a deterministic function of g(xi)
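A toy illustration (assumed notation, not from the slides) of the role of ε*: with noise standard deviation σ* > 0, each factor p(yi | g(xi)) = Φ(yi·g(xi)/σ*) is a smooth probit factor that a Gaussian can approximate, and it collapses to a hard 0/1 step as σ* → 0.

```python
import numpy as np
from scipy.stats import norm

def likelihood_factor(y_i, g_i, sigma_star):
    """p(y_i | g(x_i)) for y_i in {-1, +1}: Phi(y_i * g_i / sigma_star)."""
    return norm.cdf(y_i * g_i / sigma_star)

g_i = 0.3
for s in [1.0, 0.1, 1e-3]:
    print(s, likelihood_factor(+1, g_i, s), likelihood_factor(-1, g_i, s))
# As sigma_star shrinks, the factor approaches a step function: y_i becomes a
# deterministic function of g(x_i), which is what the small eps* term avoids.
```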
Generalizations • This can be generalized to any number of relations: ε = ε* + ξ1 + ξ2 + ξ3, one relational error term per relation type • [Figure: Y1, …, Y5 linked by several types of bi-directed edges]
But how to parameterize Σ? • Non-trivial • Desiderata: • Positive definite • Zeroes in the right places • Few parameters, but a broad family • Easy to compute
But how to parameterize Σ? • “Poking zeroes” into a positive definite matrix doesn’t work • [Figure: a 3×3 positive definite covariance over Y1, Y2, Y3 that becomes non-positive-definite after one off-diagonal entry is set to zero]
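A quick numerical check of this point (the 3×3 matrix below is my own example, not the one from the slide):

```python
import numpy as np

# A valid 3x3 covariance matrix (positive definite: all eigenvalues > 0).
S = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])
print(np.linalg.eigvalsh(S))        # all positive

# "Poke a zero" into the (Y1, Y3) entry to encode "Y1 and Y3 unrelated".
S_poked = S.copy()
S_poked[0, 2] = S_poked[2, 0] = 0.0
print(np.linalg.eigvalsh(S_poked))  # the smallest eigenvalue is now negative
```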
Approach #1 • Assume we can find all cliques of the bi-directed subgraph of relations • Create a “factor analysis model” where • for each clique Ci there is a latent variable Li • the members of each clique are the only children of Li • the set of latents {L} is a set of independent N(0, 1) variables • all coefficients in the model are equal to 1
Approach #1 • Example: Y1 = L1 + ε1, Y2 = L1 + L2 + ε2 • [Figure: cliques over {Y1, Y2, Y3, Y4} with latents L1, L2 as parents of their clique members]
Approach #1 • In practice, we set the variance of each εi to a small constant (10^-4) • The covariance between any two Ys is proportional to the number of cliques they belong to together • Their correlation is also inversely proportional to the number of cliques each belongs to individually
Approach #1 • Let U be the correlation matrix obtained from the proposed procedure • To define the error covariance, use a single hyperparameter γ ∈ [0, 1]: Σ = (I − γ Udiag) + γU = (1 − γ)I + γU
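A sketch of Approach #1 as reconstructed above; the clique list, the value of γ, and the helper names are illustrative, and the final mixing step assumes the form Σ = (1 − γ)I + γU.

```python
import numpy as np

def clique_covariance(n, cliques, noise_var=1e-4):
    """Cov(Y) when Y_i is the sum of the N(0,1) latents of the cliques containing i, plus noise."""
    C = noise_var * np.eye(n)
    for clique in cliques:
        for i in clique:
            for j in clique:
                C[i, j] += 1.0          # one shared unit-coefficient latent per clique
    return C

def correlation(C):
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# Toy bi-directed subgraph with cliques {0,1,2} and {1,2,3}; gamma in [0, 1].
n, cliques, gamma = 4, [(0, 1, 2), (1, 2, 3)], 0.7
U = correlation(clique_covariance(n, cliques))
Sigma = (1 - gamma) * np.eye(n) + gamma * U   # zeros stay on unrelated pairs (e.g. 0-3)
print(np.round(Sigma, 3))
print(np.all(np.linalg.eigvalsh(Sigma) > 0))  # positive definite
```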
Approach #1 • Notice: if everybody is connected, the model is exchangeable and simple • [Figure: a single clique {Y1, Y2, Y3, Y4} with one shared latent L1]
Approach #1 • Finding all cliques is “impossible” — what to do? • Triangulate and then extract cliques • Can be done in polynomial time • This is a relaxation of the problem, since constraints are thrown away • Can have bad side effects: the “blow-up” effect
Approach #2 • Don’t look for cliques: create a latent for each pair of related variables • Very fast to compute, zeroes respected • [Figure: each bi-directed edge, e.g. Y1–Y3, gets its own latent such as L13]
Approach #2 • Correlations, however, are given by Corr(εi, εj) = 1 / sqrt(#neigh(i) · #neigh(j)) • This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common • We call this the “pulverization” effect
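A sketch of Approach #2 (one N(0, 1) latent with unit coefficients per bi-directed edge); the star-graph example is mine, and it reproduces the 1/sqrt(#neigh(i)·#neigh(j)) correlation above.

```python
import numpy as np

def pairwise_latent_covariance(adj, noise_var=1e-4):
    """Cov(Y) when each edge (i, j) contributes its own N(0,1) latent to Y_i and Y_j."""
    adj = np.asarray(adj, dtype=float)
    degrees = adj.sum(axis=1)
    # Var(Y_i) = #neighbours(i) + noise; Cov(Y_i, Y_j) = 1 if related, else 0.
    return adj + np.diag(degrees + noise_var)

# Star graph: node 0 is related to nodes 1..4.
adj = np.zeros((5, 5))
adj[0, 1:] = adj[1:, 0] = 1.0
C = pairwise_latent_covariance(adj)
d = np.sqrt(np.diag(C))
corr = C / np.outer(d, d)
print(round(corr[0, 1], 3))          # ~ 1 / sqrt(4 * 1) = 0.5
print(round(1 / np.sqrt(4 * 1), 3))  # matches the closed form
```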
Comparison: undirected models • Generative stories • Conditional random fields (Lafferty, McCallum & Pereira, 2001) • Relational GPs (Chu et al., 2006); Richardson and Spirtes (2002) • [Figure: undirected model over Y1, Y2, Y3 with inputs X1, X2, X3]
Chu Wei’s model • Dependency family equivalent to a pairwise Markov random field • [Figure: inputs X1, X2, X3; latent Y1*, Y2*, Y3*; labels Y1, Y2, Y3; observed relation indicators R12 = 1, R23 = 1]
Properties of undirected models • MRFs propagate information among “test” points • [Figure: network over Y1–Y12 illustrating paths that run through test points]
Properties of DMG models • DMGs propagate information among “training” points • [Figure: the same network over Y1–Y12, with information flowing through training points]
Properties of DMG models • In a DMG, each “test” point will have a whole “training component” in its Markov blanket • [Figure: the same network over Y1–Y12, highlighting the training component in a test point’s Markov blanket]
Properties of DMG models • It seems acceptable that a typical relational domain will not have an “extrapolation” pattern like typical “structured output” problems (e.g., NLP domains) do • Ultimately, the choice of model comes down to the question: • “hidden common causes” or “relational indicators”?
Experiment #1 • A subset of the CORA database • 4,285 machine learning papers, 7 classes • Links: citations between papers • “Hidden common cause” interpretation: the particular ML subtopic being treated • Experiment: 7 binary classification problems (each class vs. the others, e.g., Class 5 vs. others) • Criterion: AUC
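For concreteness, the AUC criterion for one binary split can be computed as below; the labels and scores here are synthetic placeholders, not the CORA results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=100)            # e.g. "Class 5" vs. others
y_score = y_true * 0.5 + rng.normal(size=100)    # placeholder predictive scores
print(round(roc_auc_score(y_true, y_score), 3))  # AUC of the induced ranking
```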
Experiment #1 • Comparisons: • Regular GP • Regular GP + citation adjacency matrix • Chu Wei’s Relational GP (RGP) • Our method, the miXed graph GP (XGP) • This is a fairly easy task, so we analyze the low-sample regime • Uses 1% of the data (roughly 10 data points for training) • Not that useful for XGP, but more useful for RGP