Graph-based Text Classification: Learn from Your Neighbors. Ralitsa Angelova, Gerhard Weikum: Max Planck Institute for Informatics, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany. Presented by Chia-Hao Lee
Outline • Introduction • Graph-based Classification • Incorporating Metric Label Distances • Experiments • Conclusion
Introduction • Automatic classification is a supervised learning technique for assigning thematic categories to data items such as customer records, gene-expression data records, Web pages, or text documents. • The standard approach is to represent each data item by a feature vector and learn the parameters of a mathematical decision model. • Context-free: the decision is based only on the feature vector of a given data item, disregarding the other data items in the test set.
Introduction • In many settings, this “context-free” approach does not exploit the available information about relationships between data items. • Using the relationship information, we can construct a graph G in which each data item is a node and each relationship instance forms an edge between the corresponding nodes. • In the following we will mostly focus on text documents with links to and from other documents.
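As a minimal sketch of the graph construction described above, the following Python builds an adjacency map from document ids and link pairs. The function name `build_graph`, the `(source, target)` link format, and the choice to treat links as undirected are illustrative assumptions, not details given in the slides.

```python
def build_graph(doc_ids, links):
    """Return an undirected adjacency map: document id -> set of neighbor ids.

    `links` is an iterable of (source, target) pairs. Links that point to
    unknown documents, and self-links, are ignored (an assumption made here
    to keep the sketch small).
    """
    adjacency = {doc: set() for doc in doc_ids}
    for source, target in links:
        if source in adjacency and target in adjacency and source != target:
            adjacency[source].add(target)
            adjacency[target].add(source)
    return adjacency
```

Storing neighbors as sets keeps duplicate links from being counted twice when the neighborhood is later used for classification.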
Introduction • A straightforward approach to capturing a document’s neighbors would be to incorporate the features and feature weights of the neighbors into the feature vector of the given document itself. • A more advanced approach is to model the mutual influence between neighboring documents, aiming to estimate the class labels of all test documents simultaneously.
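The straightforward approach in the first bullet can be sketched as follows. This is only an illustration: the name `augment_with_neighbors`, the sparse `{term: weight}` vector format, and the damping parameter `neighbor_weight` are assumptions that do not come from the slides.

```python
from collections import Counter


def augment_with_neighbors(features, adjacency, neighbor_weight=0.5):
    """Add the (down-weighted) features of each neighbor to a document's vector.

    features:  document id -> {term: weight}
    adjacency: document id -> iterable of neighbor ids
    """
    augmented = {}
    for doc, vector in features.items():
        combined = Counter(vector)  # start from the document's own features
        for neighbor in adjacency.get(doc, ()):
            for term, weight in features.get(neighbor, {}).items():
                combined[term] += neighbor_weight * weight
        augmented[doc] = dict(combined)
    return augmented
```

The augmented vectors can then be passed to any context-free classifier, which is what separates this approach from the joint labeling in the second bullet.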
Introduction • A simple example of relaxation labeling (RL) is shown in Figure 1. • We are given a set of classes. • We wish to assign to every document marked “?” its most probable label. • Let the contingency matrix in Figure 1(b) be estimated from the training data.
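One plausible reading of the contingency matrix is a table of how often a node of one class is linked to a node of another class, estimated from labeled training edges. The sketch below follows that reading; the name `estimate_compatibility`, the symmetric counting of each edge, and the row normalization are assumptions, since the figure itself is not preserved in this transcript.

```python
def estimate_compatibility(labels, edges):
    """Estimate Pr[neighbor has class c2 | node has class c] from labeled edges.

    labels: node id -> class label (training nodes only)
    edges:  iterable of (u, v) pairs; edges with an unlabeled end are skipped
    """
    classes = sorted(set(labels.values()))
    counts = {c: {c2: 0 for c2 in classes} for c in classes}
    for u, v in edges:
        if u in labels and v in labels:
            # Count the edge in both directions (undirected graph assumed).
            counts[labels[u]][labels[v]] += 1
            counts[labels[v]][labels[u]] += 1
    compat = {}
    for c in classes:
        total = sum(counts[c].values())
        compat[c] = {c2: (counts[c][c2] / total if total else 0.0) for c2 in classes}
    return compat
```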
Introduction • The theory paper by Kleinberg and Tardos views the classification problem for nodes in an undirected graph as a metric labeling problem where we aim to optimize a combinatorial function consisting of assignment costs and separation costs.
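To make the two cost terms concrete, here is a small sketch that evaluates the metric labeling objective for one candidate assignment: the sum of per-node assignment costs plus, for every edge, the edge weight times the distance between the labels at its two ends. The function name and the dictionary-based inputs are illustrative assumptions.

```python
def metric_labeling_cost(assignment, assignment_cost, weighted_edges, distance):
    """Total cost of a labeling: assignment costs plus separation costs.

    assignment:      node -> chosen label
    assignment_cost: node -> {label: cost of giving that label to the node}
    weighted_edges:  iterable of (u, v, weight)
    distance:        label -> {label: metric distance between the two labels}
    """
    cost = sum(assignment_cost[node][label] for node, label in assignment.items())
    for u, v, weight in weighted_edges:
        cost += weight * distance[assignment[u]][assignment[v]]
    return cost
```

The optimization problem is to find the assignment that minimizes this value; the sketch only scores a given assignment and does not search for the best one.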
Graph-Based Classification • Our approach is based on the probabilistic formulation of the classification problem and uses a relaxation labeling technique to derive two major approaches for finding the most likely labeling λ of the given test graph: hard and soft labeling. • D: a set of documents. • G: a graph whose vertices correspond to documents and whose edges represent the link structure of D. • λ(u): the label of node u. • Each document d also has a feature vector that locally captures its content.
Graph-Based Classification • Taking into account the underlying link structure and document d’s content-based feature vector, we consider the probability that a given label is assigned to d. • In the spirit of the introduction’s discussion, which emphasized the influence of each document’s immediate neighbors, we condition this probability only on the immediate neighborhood of d. • That is, the label of d is assumed to be independent of the labels of all other nodes in the graph, given the labels of its immediate neighbors. We use a shorthand for this neighborhood-conditioned probability.
Graph-Based Classification • We also use a shorthand for the graph-unaware probability, which is based only on d’s local content. • Under the additional independence assumption that there is no direct coupling between the content of a document and the labels of its neighbors, a central equation holds for the total probability, which sums the posterior probabilities over all possible labelings of the neighborhood.
Graph-Based Classification • In the same vein, if we further assume independence among all neighbor labels of the same node, we reach a factorized formulation of our neighborhood-conscious classification problem. • This formulation can be computed iteratively.
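The equations themselves are not preserved in this transcript, so the following is only a sketch of one common way to realize such an iteration, not necessarily the authors' exact update. It assumes that the score of class c for document d is the content-only probability of c multiplied, for each neighbor, by the expected compatibility of c with that neighbor's current label distribution. The names `soft_relaxation_labeling`, `local`, `compat`, and `fixed` are hypothetical.

```python
def soft_relaxation_labeling(local, adjacency, compat, fixed=(), rounds=10):
    """Iteratively refine per-node class distributions (soft labeling sketch).

    local:     node -> {class: content-only probability}
    adjacency: node -> iterable of neighbor ids
    compat:    class c -> {class c2: Pr[neighbor has c2 | node has c]}
    fixed:     nodes whose distribution is kept as given (e.g. training nodes)
    """
    classes = list(compat)
    probs = {node: dict(dist) for node, dist in local.items()}
    for _ in range(rounds):
        updated = {}
        for node in probs:
            if node in fixed:
                updated[node] = probs[node]
                continue
            scores = {}
            for c in classes:
                score = local[node][c]
                for neighbor in adjacency.get(node, ()):
                    # Sum over all labels the neighbor might have.
                    score *= sum(compat[c][c2] * probs[neighbor][c2] for c2 in classes)
                scores[c] = score
            total = sum(scores.values())
            # Keep the previous distribution if every score is zero.
            updated[node] = (
                {c: s / total for c, s in scores.items()} if total > 0 else probs[node]
            )
        probs = updated
    return probs
```

Because neighbor labels are assumed independent, the sum over all joint labelings of the neighborhood factorizes into the per-neighbor sums used in the inner loop, which keeps each round linear in the number of edges.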
Graph-Based Classification • Hard labeling: In contrast to the soft labeling approach, we also consider a method that treats only the most probable label assignments in the test document’s neighborhood as significant for the computation. For each neighbor, only its most probable label is used.
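A hedged sketch of the hard variant: each node keeps a single current label, and a node's score for class c multiplies its content-only probability by the compatibility of c with each neighbor's current label. As with any reconstruction from these slides, the exact update is an assumption, and the names `hard_relaxation_labeling`, `local`, `compat`, and `fixed` are hypothetical.

```python
def hard_relaxation_labeling(local, adjacency, compat, fixed=(), rounds=10):
    """Iteratively reassign one label per node (hard labeling sketch).

    local:     node -> {class: content-only probability}
    adjacency: node -> iterable of neighbor ids
    compat:    class c -> {class c2: Pr[neighbor has c2 | node has c]}
    fixed:     nodes whose label is never changed (e.g. training nodes)
    """
    classes = list(compat)
    # Start from the most probable label according to content alone.
    labels = {node: max(classes, key=lambda c: local[node][c]) for node in local}
    for _ in range(rounds):
        updated = {}
        for node in labels:
            if node in fixed:
                updated[node] = labels[node]
                continue

            def score(c):
                value = local[node][c]
                for neighbor in adjacency.get(node, ()):
                    value *= compat[c][labels[neighbor]]
                return value

            updated[node] = max(classes, key=score)
        if updated == labels:  # no label changed, so further rounds are no-ops
            break
        labels = updated
    return labels
```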
Graph-Based Classification • Soft Labeling : The soft labeling approach aims to achieve better accuracy of the classification by avoiding the overly eager “rounding” that the hard labeling approach does.
Incorporating Metric Label Distance • Intuitively, neighboring documents should receive similar class labels. • For example, suppose we have a set of classes {C, E, S} and we wish to find the most probable label for a test document d. • A document discussing scientific problems (S) would be much farther away from both C and E than C and E are from each other. • So, a similarity metric imposed on the set of labels would have high values for the pair (C,E) and small values for the class pairs (C,S) and (E,S).
Incorporating Metric Label Distance • This is why introducing a metric should help improve the classification result. In this metric, similar classes are separated by a shorter distance and impose a smaller separation cost on an edge labeling. • Our approach, on the other hand, is general, and we construct the metric Γ automatically from the training data. • We incorporate the label metric into the iterations for computing the probability of an edge labeling by treating Γ as a scaling factor.
Incorporating Metric Label Distance • This way, we magnify the impact of edges between nodes with similar labels and scale down the impact of edges between dissimilar ones.
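The scaled formula is not preserved in this transcript. As a sketch only, assume a hypothetical table `gamma[c][c2]` that holds a weight in [0, 1], larger for similar labels, and that each neighbor's contribution is multiplied by this weight before the sum over the neighbor's possible labels. How the metric is actually turned into such a weight is not specified in the slides.

```python
def scaled_neighbor_support(c, neighbor_probs, compat, gamma):
    """Support that one neighbor gives to class c, scaled by label similarity.

    neighbor_probs: {class: current probability that the neighbor has it}
    compat:         class c -> {class c2: Pr[neighbor has c2 | node has c]}
    gamma:          class c -> {class c2: similarity weight (assumed form)}
    """
    return sum(
        gamma[c][c2] * compat[c][c2] * prob for c2, prob in neighbor_probs.items()
    )
```

With all weights equal to 1 the function reduces to the unscaled per-neighbor sum, so the weights only change how much each label pairing contributes.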
Experiments • We tested our graph-based classifier on three different datasets. • The first one includes approximately 16,000 scientific publications chosen from the DBLP database. • The second dataset was selected from the Internet Movie Database (IMDb). • The third dataset used in the experiments was taken from the online encyclopedia Wikipedia.
Conclusion • The presented GC method for graph-based classification is a way of exploiting context relationships among data items. • Incorporating metric distances among different labels contributed to the very good performance of the GC method. • This is a new form of exploiting knowledge about the relationships among category labels and thus the structure of the classifier’s target space.