Local one-class optimization Gal Chechik, Stanford. Joint work with Koby Crammer, Hebrew University of Jerusalem.
The one-class problem: Find a subset of similar/typical samples. Formally: find a ball of a given radius (under some metric) that covers as many data points as possible (related to the set-covering problem).
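As a minimal illustration of this objective, the sketch below counts how many points fall inside an L2 ball of a given radius; the data and function names are illustrative, not part of the original formulation.

```python
import numpy as np

def ball_coverage(X, center, radius):
    """Count how many rows of X fall inside the L2 ball of the given radius."""
    dists = np.linalg.norm(X - center, axis=1)
    return int(np.sum(dists <= radius))

# Toy usage: a center placed on a tight cluster covers more points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0]])
print(ball_coverage(X, np.array([0.1, 0.1]), radius=0.5))  # 3
print(ball_coverage(X, np.array([2.5, 2.5]), radius=0.5))  # 0
```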
Motivation I Unsupervised setting: Sometimes we wish to model a small part of the data and ignore the rest. This happens when many data points are irrelevant. Examples: • Finding sets of co-expressed genes in a genome-wide experiment: identify the relevant genes among thousands of irrelevant ones. • Finding a set of documents on the same topic in a heterogeneous corpus.
Motivation II Supervised setting: learning from positive samples only. Examples: • Protein interactions • Intrusion detection These applications care about a low false-positive rate.
Current approaches Often treat the problem as outlier/novelty detection: most samples are relevant. Current approaches use: • A convex cost function (Schölkopf 95, Tax and Duin 99, Ben-Hur et al. 2001). • A parameter that affects the size or weight of the ball. • Bias towards the center of mass: when searching for a small ball, the center of the optimal ball lies at the global center of mass, w* = argmin_w Σ_x ||x − w||², missing the interesting structures.
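A quick numerical check of this bias (a sketch, not the authors' experiment): the minimizer of Σ_x ||x − w||² is the sample mean, which lands in the background rather than on the small cluster. The data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(30, 2))   # small "interesting" cluster
background = rng.uniform(low=-10, high=10, size=(300, 2))       # many irrelevant points
X = np.vstack([cluster, background])

# w* = argmin_w sum_x ||x - w||^2 is the global mean:
w_star = X.mean(axis=0)
print(w_star)   # roughly near the origin, far from the cluster at (5, 5)
```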
Current approaches Example with synthetic data: 2 Gaussians + a uniform background. (Figure: convex one-class (OSU-SVM) vs. local one-class.)
How do we do it: • A cost function designed for small sets • A probabilistic approach: allow soft assignment to the set • Regularized optimization
1. A cost function for small sets • The case where only a few samples are relevant • Use a cost function that is flat for samples not in the set • Two parameters: a divergence measure D_BF and a flat cost K • Indifferent to the position of “irrelevant” samples • Solutions converge to the center of mass when the ball is large.
2. A probabilistic formulation • We are given m samples in a d-dimensional space or simplex, indexed by x. • p(x) is the prior distribution over samples. • c = {TRUE, FALSE} is a random variable that characterizes assignment to the interesting set (the “ball”). • p(c|x) reflects our belief that the sample x is “interesting”. • The cost function is D = p(c|x)·D_BF(w||v_x) + (1 − p(c|x))·K, where D_BF is a divergence measure, to be discussed later.
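A minimal sketch of this per-sample cost, using the squared-L2 divergence as D_BF (the general Bregman case is discussed later); the function names are illustrative.

```python
import numpy as np

def l2_div(w, v):
    """Squared-L2 Bregman divergence, generated by F(v) = ||v||^2 / 2."""
    return 0.5 * np.sum((w - v) ** 2)

def sample_cost(p_c_given_x, w, v_x, K):
    """D = p(c|x) * D_BF(w || v_x) + (1 - p(c|x)) * K."""
    return p_c_given_x * l2_div(w, v_x) + (1.0 - p_c_given_x) * K
```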
3. Regularized optimization The goal: minimize the mean cost plus a regularization term: min over {p(c|x), w} of β·<D_BF,K(c, w; v_x)>_p(c,x) + I(C;X) • The first term measures the mean distortion: <D_BF,K(p(c|x), w; v_x)> = Σ_x p(x) [p(c|x)·D_BF(w||v_x) + (1 − p(c|x))·K] • The second term regularizes the compression of the data (it removes information about X): I(C;X) = H(X) − H(X|C). It pushes for putting many points in the set. • This target function is not convex.
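Putting the two terms together, here is a sketch of the objective under a uniform prior p(x) = 1/m and a binary C; the array shapes and names are assumptions made for illustration.

```python
import numpy as np

def one_class_objective(p_c_x, divs, K, beta):
    """
    p_c_x: shape (m,), p(c=TRUE | x) for each sample.
    divs:  shape (m,), D_BF(w || v_x) for each sample.
    Returns beta * <mean cost> + I(C;X), with p(x) uniform.
    """
    m = len(p_c_x)
    p_x = np.full(m, 1.0 / m)
    mean_cost = np.sum(p_x * (p_c_x * divs + (1.0 - p_c_x) * K))

    p_c = np.sum(p_x * p_c_x)          # marginal p(c=TRUE)
    eps = 1e-12                        # avoid log(0)
    # I(C;X) = sum_x p(x) sum_c p(c|x) log(p(c|x) / p(c))
    i_cx = np.sum(p_x * (p_c_x * np.log((p_c_x + eps) / (p_c + eps))
                         + (1.0 - p_c_x) * np.log((1.0 - p_c_x + eps) / (1.0 - p_c + eps))))
    return beta * mean_cost + i_cx
```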
To solve the problem • It turns out that for a family of divergence functions, called Bregman divergences, we can analytically describe properties of the optimal solution. • The proof follows the analysis of the Information Bottleneck method (Tishby, Pereira & Bialek, 1999).
Bregman divergences • A Bregman divergence is defined by a convex function F (in our case F(v) = Σ_i f(v_i)) • Common examples: L2 norm, f(x) = ½x²; Itakura-Saito, f(x) = −log(x); D_KL, f(x) = x·log(x); unnormalized relative entropy, f(x) = x·log(x) − x • Lemma (convexity of the Bregman ball): the set of points {v s.t. B_F(v||w) < R} is convex.
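For concreteness, the listed divergences can be written from the generic Bregman form B_F(v||w) = F(v) − F(w) − ∇F(w)·(v − w). The sketch below is illustrative and assumes strictly positive inputs where logarithms are taken.

```python
import numpy as np

def bregman(f, grad_f, v, w):
    """Generic Bregman divergence B_F(v || w) with F(v) = sum_i f(v_i)."""
    return np.sum(f(v) - f(w) - grad_f(w) * (v - w))

# f(x) = x^2 / 2      -> squared L2 distance
l2 = lambda v, w: bregman(lambda x: 0.5 * x**2, lambda x: x, v, w)

# f(x) = x log x      -> KL divergence (for points on the simplex)
kl = lambda v, w: bregman(lambda x: x * np.log(x), lambda x: np.log(x) + 1.0, v, w)

# f(x) = -log x       -> Itakura-Saito divergence
itakura_saito = lambda v, w: bregman(lambda x: -np.log(x), lambda x: -1.0 / x, v, w)

# f(x) = x log x - x  -> unnormalized relative entropy
unnorm_re = lambda v, w: bregman(lambda x: x * np.log(x) - x, np.log, v, w)
```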
Properties of the solution One-class solutions obey three fixed-point equations (omitted here). When β→∞, assignments become hard: the best assignment for x is the one that minimizes its cost, i.e., x is placed in the set iff D_BF(w||v_x) < K.
The effect of K • K controls the nature of the solution. • K is the cost of leaving a point out of the ball. • Large K ⇒ large radius and many points in the set. • For the L2 norm, K is formally related to the prior of a single Gaussian fit to the subset. • A full description of the data may require solving for the complete spectrum of K values.
Algorithm: One-Class IB Adapting the sequential-IB algorithm. One-Class IB: Input: a set of m points v_x, a divergence B_F, a cost K. Output: a centroid w and assignments p(c|x). Optimization method: • Iterate sample-by-sample, trying to modify the status of a single sample. • One-step look-ahead: re-fit the model and decide whether to change the assignment of the sample. • This uses a simple formula because of the nice properties of Bregman divergences. • Search in the dual space of samples, rather than in the space of parameters w.
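Below is a hedged sketch of this sequential scheme for the hard-assignment (β→∞) case with the squared-L2 divergence, written from the description above rather than from the authors' code; all names and the sweep logic are illustrative.

```python
import numpy as np

def one_class_ib_l2(V, K, n_sweeps=20, seed=0):
    """
    V: (m, d) array of samples v_x.  K: flat cost of leaving a point out.
    Returns the centroid w and a boolean in-set indicator (hard assignments).
    """
    rng = np.random.default_rng(seed)
    m = V.shape[0]
    in_set = rng.random(m) < 0.5                  # random initial assignment

    def loss(mask):
        if not mask.any():
            return K * m
        w = V[mask].mean(axis=0)                  # L2 centroid = mean of the members
        d = 0.5 * np.sum((V - w) ** 2, axis=1)
        return np.sum(np.where(mask, d, K))

    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(m):              # sample-by-sample
            flipped = in_set.copy()
            flipped[i] = ~flipped[i]              # one-step look-ahead: try flipping i
            if loss(flipped) < loss(in_set):      # re-fit and keep the flip if it helps
                in_set, changed = flipped, True
        if not changed:
            break
    w = V[in_set].mean(axis=0) if in_set.any() else V.mean(axis=0)
    return w, in_set
```

Note that this naive look-ahead re-fits the centroid from scratch on every flip; the Bregman structure allows much cheaper incremental updates, which are omitted here.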
Experiments 1: information retrieval The five most frequent categories of Reuters-21578. Each document is represented as a multinomial distribution over 2000 terms. The experimental setup, for each category: • train with half of the positive documents, • test with all remaining documents. Compared one-class IB with one-class Convex, which uses a convex loss function (Crammer & Singer, 2003), controlled by a single parameter η that determines the weight of the class.
Experiments 1: information retrieval Compare precision-recall performance for a range of K/μ values. (Figure: precision-recall curves.)
Experiments 1: information retrieval Centroids of clusters, and their distances from the center of mass
Experiments 2: gene expression A typical application: searching for small but interesting sets of genes. Genes are represented by their expression profiles across tissues from different patients. The Alizadeh (2000) dataset (B-cell lymphoma tissues) includes mortality data, which can be used as an objective way to validate the quality of the selected genes.
Experiments 2: gene expression One-class IB compared with one-class SVM (L2). For a series of K values, the gene set with the lowest loss was found (10 restarts). The set of genes was then used for regression against the mortality data. (Figure: significance of the regression prediction (p-value), from good to bad.)
Future work: finding ALL relevant subsets • Complete characterization of all interesting subsets in the data. • Assume we have a function that assigns an interest value to each subset; we search the space of subsets for all local maxima. • This requires defining locality. A natural measure of locality in subset space is the Hamming distance (see the sketch after this list). • A complete characterization of the data requires a description over a range of local neighborhoods.
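A tiny sketch of the Hamming-distance notion of locality between subsets, represented as boolean indicator vectors (illustrative only).

```python
import numpy as np

def hamming(subset_a, subset_b):
    """Hamming distance between two subsets given as boolean indicator vectors."""
    return int(np.sum(subset_a != subset_b))

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
print(hamming(a, b))   # 2: the two subsets disagree on two samples
```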
Future work: multiple one-class • Synthetic example: two overlapping Gaussians and uniform background noise.
Conclusions • We focus on one-class learning for cases where a small ball is sought. • We formalize the problem using the IB framework and derive its formal solutions. • One-class IB performs well in the regime of small subsets.