
New Models for Relational Classification


Presentation Transcript


  1. New Models for Relational Classification Ricardo Silva (Statslab) Joint work with Wei Chu and Zoubin Ghahramani

  2. The talk • Classification with non-iid data • A source of non-iidness: relational information • A new family of models, and what is new • Applications to classification of text documents

  3. The prediction problem [diagram: X → Y]

  4. Standard setup [diagram: training pairs (X, Y) replicated in a plate of size N; Xnew → Ynew]

  5. Prediction with non-iid data [diagram: X1 → Y1, X2 → Y2, Xnew → Ynew, with the class labels Y1, Y2, Ynew dependent on each other]

  6. Where does the non-iid information come from? • Relations • Links between data points • Webpage A links to Webpage B • Movie A and Movie B are often rented together • Relations as data • “Linked webpages are likely to present similar content” • “Movies that are rented together often have correlated personal ratings”

  7. The vanilla relational domain: time-series • Relations: “Yi precedes Yi+k”, k > 0 • Dependencies: “Markov structure G” [diagram: chain Y1 → Y2 → Y3 → …]

  8. A model for integrating link data • How to model the dependencies among class labels? • Movies that are often rented together might share all sorts of common, unmeasured factors • These hidden common causes affect the ratings

  9. Example [diagram: MovieFeatures(M1) → Rating(M1), MovieFeatures(M2) → Rating(M2), with hidden common causes of the two ratings: Same director? Same genre? Both released in the same year? Target the same age groups?]

  10. Integrating link data • Of course, many of these common causes will be measured • Many will not • Idea: • Postulate a hidden common cause structure, based on relations • Define a model Markov with respect to this structure • Design an adequate inference algorithm

  11. Example: Political Books database • A network of books about recent US politics sold by the online bookseller Amazon.com • Valdis Krebs, http://www.orgnet.com/ • Relations: frequent co-purchasing of books by the same buyers • Political inclination factors as the hidden common causes

  12. Political Books relations

  13. Political Books database • Features: • I collected the Amazon.com front page for each of the books • Bag-of-words, tf-idf features, normalized to unity • Task: • Binary classification: “liberal” or “not-liberal” books • 43 liberal books out of 105
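As a rough illustration of the feature construction above, here is a minimal sketch assuming scikit-learn’s TfidfVectorizer; the documents and preprocessing details are placeholders, not the authors’ actual pipeline.

```python
# A minimal sketch of bag-of-words tf-idf features normalized to unit length.
# Not the authors' pipeline: scikit-learn and the placeholder documents are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "front page text collected for book 1 ...",   # placeholder documents
    "front page text collected for book 2 ...",
]

vectorizer = TfidfVectorizer(norm="l2")            # l2 norm rescales each row to unit length
X = vectorizer.fit_transform(pages)                # one sparse tf-idf row per book
```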

  14. Contribution • We will • show a classical multiple linear regression model • build a relational variation • generalize it with a more complex set of independence constraints • generalize it using Gaussian processes

  15. Seemingly unrelated regression (Zellner, 1962) • Y = (Y1, Y2), X = (X1, X2) • Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless • Analogously for Y2 ~ X1, X2 (X1 vanishes) • Suppose you regress Y1 ~ X1, X2, Y2 • And now every variable is a relevant predictor [diagram: the two separate regressions X1 → Y1 and X2 → Y2, then Y1 regressed on X1, X2 and Y2]
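A toy simulation (my construction, not from the talk) makes the phenomenon concrete: Y1 is generated from X1 only and Y2 from X2 only, but their errors share a hidden common cause, so X2 and Y2 both become informative about Y1 once Y2 enters the regression.

```python
# Toy illustration of the SUR phenomenon (simulated data; coefficients and names are mine).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X1, X2 = rng.normal(size=n), rng.normal(size=n)
L = rng.normal(size=n)                         # hidden common cause of the two errors
Y2 = X2 + L + 0.3 * rng.normal(size=n)
Y1 = X1 + L + 0.3 * rng.normal(size=n)

def ols(y, *cols):
    Z = np.column_stack(cols)
    return np.linalg.lstsq(Z, y, rcond=None)[0]

print(ols(Y1, X1, X2))        # X2 coefficient is ~0: useless for predicting Y1
print(ols(Y1, X1, X2, Y2))    # now both X2 and Y2 get sizeable coefficients
```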

  16. Graphically, with latents [diagram: X = Capital(GE), Capital(Westinghouse); Y = Stock price(GE), Stock price(Westinghouse); hidden Industry factors 1, 2, …, k as common causes of the Ys]

  17. The Directed Mixed Graph (DMG) [diagram: the same X and Y variables, with the hidden industry factors summarized by a bi-directed edge between the Ys] Richardson (2003), Richardson and Spirtes (2002)

  18. A new family of relational models • Inspired by SUR • Structure: DMG graphs • Edges postulated from given relations [diagram: X1, …, X5 each pointing to its Yi, with bi-directed edges among related Ys]

  19. Model for binary classification • Nonparametric Probit regression • Zero-mean Gaussian process prior over f( . ) • P(yi = 1 | xi) = P(y*(xi) > 0), where y*(xi) = f(xi) + εi, εi ~ N(0, 1)
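A numerical sketch of this likelihood (the RBF kernel and toy inputs are my choices for illustration): since the nuisance term is standard normal, P(yi = 1 | xi) = Φ(f(xi)).

```python
# A sketch of the probit GP model above (not the authors' code): with standard normal
# noise, P(y_i = 1 | x_i) = P(f(x_i) + eps_i > 0) = Phi(f(x_i)).
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, lengthscale=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

X = np.random.randn(5, 2)                              # 5 toy inputs with 2 features
K = rbf_kernel(X) + 1e-8 * np.eye(5)                   # zero-mean GP prior covariance for f
f = np.random.multivariate_normal(np.zeros(5), K)      # one draw of f at the inputs
print(norm.cdf(f))                                     # P(y_i = 1 | x_i) under that draw
```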

  20. Relational dependency model • Make {εi} a dependent multivariate Gaussian • For convenience, decouple it into two error terms: ε = ε* + ξ

  21. Dependency model: the decomposition • ε = ε* + ξ, where the ε* are independent from each other (marginally independent) and the ξ are dependent according to the relations • Σε = Σε* + Σξ, where Σε* is diagonal and Σξ is not diagonal, with 0s only on unrelated pairs

  22. Dependency model: the decomposition • y*(xi) = f(xi) + εi = f(xi) + ξi + ε*i = g(xi) + ε*i • If K was the original kernel matrix for f( . ), the covariance of g( . ) is simply K + Σξ
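A small sketch of this construction (the RBF kernel, the toy relations and the variable names are my choices, not the talk’s code): the covariance used for g( . ) is just the kernel matrix of f plus the relational error covariance.

```python
# Toy version of the covariance of g(.): K (kernel of f) plus the relational
# error covariance Sigma_xi, which has zeros on unrelated pairs.
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

X = np.random.randn(5, 2)                      # 5 toy inputs
K = rbf_kernel(X)                              # covariance of f at the inputs

# Relational covariance: nonzero off-diagonal entries only for related pairs (0,1), (1,2).
Sigma_xi = np.eye(5)
for i, j in [(0, 1), (1, 2)]:
    Sigma_xi[i, j] = Sigma_xi[j, i] = 0.5

K_g = K + Sigma_xi                             # covariance of g(.) = f(.) + xi
```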

  23. Approximation • Posterior for f(.), g(.) is a truncated Gaussian, hard to integrate • Approximate posterior with a Gaussian • Expectation-Propagation (Minka, 2001) • The reason for ε* becomes apparent in the EP approximation

  24. Approximation • Likelihood does not factorize over f( . ), but factorizes over g( . ): p(g | x, y) ∝ p(g | x) ∏i p(yi | g(xi)) • Approximate each factor p(yi | g(xi)) with a Gaussian • If ε* were 0, yi would be a deterministic function of g(xi)
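To make the factorization concrete, here is a sketch (my notation, not the authors’ implementation) of the unnormalized log-posterior that EP approximates; with labels coded as ±1, each factor is a probit term p(yi | g(xi)) = Φ(yi g(xi) / σ*), which collapses to a hard step as σ* → 0.

```python
# Sketch of the quantity EP approximates:
# log p(g | x, y) = log p(g | x) + sum_i log p(y_i | g(x_i)) + const.
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_unnormalized_posterior(g, y, K_g, sigma_star=0.1):
    """g: latent values at the inputs; y: labels in {-1, +1}; K_g = K + Sigma_xi."""
    log_prior = multivariate_normal.logpdf(g, mean=np.zeros(len(g)), cov=K_g)
    log_lik = norm.logcdf(y * g / sigma_star).sum()   # per-point probit factors
    return log_prior + log_lik
```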

  25. Generalizations • This can be generalized for any number of relations: ε = ε* + ξ1 + ξ2 + ξ3 [diagram: Y1, …, Y5 with three types of bi-directed edges]

  26. But how to parameterize Σξ? • Non-trivial • Desiderata: • Positive definite • Zeroes in the right places • Few parameters, but broad family • Easy to compute

  27. But how to parameterize Σξ? • “Poking zeroes” into a positive definite matrix doesn’t work [example: a 3×3 positive definite matrix over Y1, Y2, Y3 stops being positive definite after one off-diagonal entry is zeroed]
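A two-line check (the numbers are my own toy example) shows why: start from a valid correlation matrix over Y1, Y2, Y3, set one entry to zero, and the smallest eigenvalue goes negative.

```python
# Toy check: zeroing one entry of a positive definite matrix can destroy positive definiteness.
import numpy as np

A = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])           # positive definite: eigenvalues 2.8, 0.1, 0.1
B = A.copy()
B[0, 2] = B[2, 0] = 0.0                   # "poke a zero" on the (Y1, Y3) entry

print(np.linalg.eigvalsh(A).min())        # > 0
print(np.linalg.eigvalsh(B).min())        # < 0: no longer positive definite
```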

  28. Approach #1 • Assume we can find all cliques for the bi-directed subgraph of relations • Create a “factor analysis model”, where • for each clique Ci there is a latent variable Li • members of each clique are the only children of Li • Set of latents {L} is a set of N(0, 1) variables • coefficients in the model are equal to 1

  29. Approach #1 • Y1 = L1 + ε1 • Y2 = L1 + L2 + ε2 [diagram: clique latents L1, L2 over the related variables Y1, Y2, Y3, Y4, and the implied covariance matrix]

  30. Approach #1 • In practice, we set the variance of each ε to a small constant (10^-4) • The resulting correlation between any two Ys is • proportional to the number of cliques they belong to together • inversely proportional to the number of cliques they belong to individually

  31. Approach #1 • Let U be the correlation matrix obtained from the proposed procedure • To define the error covariance, use a single hyperparameter γ ∈ [0, 1]: Σξ = γ U + (1 − γ) I
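A compact sketch of this construction (slides 28–31). The function names, the toy cliques, and the exact mixing formula Σξ = γU + (1 − γ)I are my reading of the slides, not code from the talk.

```python
# Sketch of Approach #1: one N(0,1) latent per clique, unit coefficients, tiny noise,
# then a single hyperparameter mixing the resulting correlation matrix with the identity.
import numpy as np

def relational_correlation(cliques, n, noise_var=1e-4):
    """Build the correlation matrix U implied by the clique-latent construction."""
    cov = np.zeros((n, n))
    for C in cliques:
        for i in C:
            for j in C:
                cov[i, j] += 1.0              # shared cliques add to the covariance
    cov += noise_var * np.eye(n)
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)               # normalize to a correlation matrix U

def error_covariance(U, gamma):
    """Single hyperparameter gamma in [0, 1]; zeros on unrelated pairs are preserved."""
    return gamma * U + (1.0 - gamma) * np.eye(len(U))

U = relational_correlation([{0, 1, 2}, {1, 2, 3}], n=4)   # toy bi-directed cliques
Sigma_xi = error_covariance(U, gamma=0.5)
```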

  32. Approach #1 • Notice: if everybody is connected, the model is exchangeable and simple [diagram: a single latent L1 shared by Y1, Y2, Y3, Y4, giving a uniform correlation matrix]

  33. Approach #1 • Finding all cliques is “impossible”, what to do? • Triangulate and then extract cliques • Can be done in polynomial time • This is a relaxation of the problem, since constraints are thrown away • Can have bad side effects: the “Blow-Up” effect
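One way to implement the relaxation, assuming NetworkX’s chordal-graph helpers (complete_to_chordal_graph, chordal_graph_cliques) are available; this is a plausible sketch, not necessarily the procedure used in the talk.

```python
# A possible polynomial-time relaxation: triangulate the relation graph, then read
# off the maximal cliques of the chordal result.
import networkx as nx

G = nx.cycle_graph(5)                              # toy relation graph with a chordless cycle
H, _ = nx.complete_to_chordal_graph(G)             # add fill-in edges so H is chordal
cliques = [set(c) for c in nx.chordal_graph_cliques(H)]
print(cliques)                                     # maximal cliques of the triangulated graph
# The fill-in edges are spurious dependencies: the "Blow-Up" effect on the next slides.
```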

  34. Political Books dataset

  35. Political Books dataset: the “Blow-up” effect

  36. Approach #2 • Don’t look for cliques: create a latent variable for each related pair of variables • Very fast to compute, zeroes respected [diagram: one latent Lij per linked pair (Yi, Yj), e.g., L13 for the pair (Y1, Y3)]

  37. Approach #2 • Correlations, however, are given by Corr(εi, εj) = 1 / Sqrt(#neigh(i) · #neigh(j)) • Penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common • We call this the “pulverization” effect
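A sketch of the pair-latent construction and the correlation it implies (the names and toy edges are mine): one N(0, 1) latent per related pair with unit coefficients gives unit covariance on related pairs and variance roughly equal to the number of neighbours, hence the penalty above.

```python
# Sketch of Approach #2: one N(0,1) latent per related pair, unit coefficients.
import numpy as np

def pairwise_latent_correlation(edges, n, noise_var=1e-4):
    cov = noise_var * np.eye(n)
    for i, j in edges:
        cov[i, i] += 1.0
        cov[j, j] += 1.0                  # each pair latent adds variance to both endpoints
        cov[i, j] += 1.0
        cov[j, i] += 1.0                  # and unit covariance between them
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

edges = [(0, 1), (1, 2), (1, 3)]          # node 1 has three neighbours
print(pairwise_latent_correlation(edges, n=4))   # Corr(0,1) ~ 1/sqrt(3): the penalty above
```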

  38. Political Books dataset

  39. Political Books dataset: the “pulverization” effect

  40. WebKB dataset: links between pages at the University of Washington

  41. Approach #1

  42. Approach #2

  43. Comparison: undirected models • Generative stories • Conditional random fields (Lafferty, McCallum, Pereira, 2001) • Wei et al., 2006; Richardson and Spirtes, 2002 [diagram: X1, X2, X3 with an undirected dependency structure over Y1, Y2, Y3]

  44. Chu Wei’s model X1 X3 X2 • Dependency family equivalent to a pairwise Markov random field Y1* Y3* Y2* Y1 Y2 Y3 R12 = 1 R23 = 1 Y1 Y2 Y3

  45. Properties of undirected models • MRFs propagate information among “test” points [diagram: a linked network over points Y1, …, Y12]

  46. Properties of DMG models • DMGs propagate information among “training” points [diagram: the same network over Y1, …, Y12]

  47. Properties of DMG models • In a DMG, each “test” point will have a whole “training component” in its Markov blanket [diagram: the same network over Y1, …, Y12]

  48. Properties of DMG models • It seems acceptable that a typical relational domain will not have an “extrapolation” pattern • Like typical “structured output” problems, e.g., NLP domains • Ultimately, the choice of model concerns the question: • “Hidden common causes” or “relational indicators”?

  49. Experiment #1 • A subset of the CORA database • 4,285 machine learning papers, 7 classes • Links: citations between papers • “hidden common cause” interpretation: particular ML subtopic being treated • Experiment: 7 binary classification problems, Class 5 vs. others • Criterion: AUC

  50. Experiment #1 • Comparisons: • Regular GP • Regular GP + citation adjacency matrix • Chu Wei’s Relational GP (RGP) • Our method, miXed graph GP (XGP) • Fairly easy task • Analysis of low-sample tasks • Uses 1% of the data (roughly 10 data points for training) • Not that useful for XGP, but more useful for RGP
