Multiple-Instance Learning
Paper 1: A Framework for Multiple-Instance Learning [Maron and Lozano-Perez, 1998]
Paper 2: EM-DD: An Improved Multiple-Instance Learning Technique [Zhang and Goldman, 2001]
Multiple-Instance Learning (MIL)
• A variation on supervised learning.
• Supervised learning: each training instance carries its own label.
• MIL: each training example is a set (or bag) of instances along with a single label equal to the maximum label among all instances in the bag.
• Goal: learn to accurately predict the labels of previously unseen bags.
MIL Setup
• Training data: D = {&lt;B_1, l_1&gt;, …, &lt;B_m, l_m&gt;}, a set of m bags where bag B_i has label l_i.
• Boolean labels: positive bags B_i+ and negative bags B_i-. If B_i+ = {B_i1+, …, B_ij+, …, B_in+}, then B_ij+ is the jth instance in B_i+, and B_ijk+ is the value of the kth feature of instance B_ij+.
• Real-valued labels: l_i = max(l_i1, l_i2, …, l_in).
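To make the setup concrete, here is a minimal sketch of how a MIL training set might be represented in Python. The Bag class, field names, and the toy feature values are my own illustrative choices, not part of either paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Bag:
    instances: np.ndarray  # shape (n_instances, n_features); row j is instance B_ij
    label: int             # Boolean bag label l_i (1 = positive, 0 = negative)

# Toy training set D = {<B_1, l_1>, ..., <B_m, l_m>} with 2-D features
train = [
    Bag(np.array([[0.9, 1.1], [4.0, 4.2]]), 1),  # positive: at least one instance near the target
    Bag(np.array([[1.0, 0.8], [7.5, 2.0]]), 1),
    Bag(np.array([[5.0, 5.1], [6.8, 1.9]]), 0),  # negative: no instance near the target
]
```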
Diverse Density Algorithm [Maron and Lozano-Perez, 1998]
• Main idea: find a point in feature space that has a high Diverse Density:
• High density of positive instances ("close" to at least one instance from each positive bag).
• Low density of negative instances ("far" from every instance in every negative bag).
• Higher diverse density = higher probability of being the target concept.
A Motivating Example for DD
• Goal: find an area with both a high density of positive points and a low density of negative points.
• The difficulty with using regular density, which adds up the contributions of the positive bags and subtracts those of the negative bags, is illustrated in section B of panel (b).
Diverse Density
• Assuming that the target concept is a single point t and x is some point in feature space,
Pr(x = t | B_1+, …, B_n+, B_1-, …, B_n-)    (1)
represents the probability that x is the target concept given the training examples.
• We can find t by maximizing this probability over all points x.
Probabilistic Measure of Diverse Density
• Using Bayes' rule, maximizing (1) is equivalent to maximizing
Pr(B_1+, …, B_n+, B_1-, …, B_n- | x = t)    (2)
• Further assuming that the bags are conditionally independent given t, the best hypothesis is
argmax_x ∏_i Pr(B_i+ | x = t) ∏_i Pr(B_i- | x = t)    (3)
General Definition of DD
• Again using Bayes' rule (and assuming a uniform prior over concept locations), (3) is equivalent to
argmax_x ∏_i Pr(x = t | B_i+) ∏_i Pr(x = t | B_i-)    (4)
• x has high Diverse Density if every positive bag has an instance close to x and no negative bag has an instance close to x.
Noisy-or Model
• The causal probability that instance j of bag B_i is the target:
Pr(x = t | B_ij) = exp(-||B_ij - x||^2)
• A positive bag's contribution:
Pr(x = t | B_i+) = Pr(x = t | B_i1+, B_i2+, …) = 1 - ∏_j (1 - Pr(x = t | B_ij+))
• A negative bag's contribution:
Pr(x = t | B_i-) = Pr(x = t | B_i1-, B_i2-, …) = ∏_j (1 - Pr(x = t | B_ij-))
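A small sketch of the noisy-or computation under these definitions, building on the Bag sketch above. It uses an unweighted Euclidean distance; the function names are illustrative, not taken from the paper.

```python
import numpy as np

def instance_prob(x, instances):
    """Pr(x = t | B_ij) = exp(-||B_ij - x||^2) for every instance in a bag."""
    return np.exp(-np.sum((instances - x) ** 2, axis=1))

def bag_prob(x, bag):
    """Noisy-or contribution of one bag to the Diverse Density at x."""
    p = instance_prob(x, bag.instances)
    pos = 1.0 - np.prod(1.0 - p)                  # 1 - prod_j (1 - Pr(x=t|B_ij+))
    return pos if bag.label == 1 else 1.0 - pos   # negative bag: prod_j (1 - Pr(x=t|B_ij-))

def diverse_density(x, bags):
    """DD(x) = prod_i Pr(x = t | B_i) over all bags, as in equation (4)."""
    return np.prod([bag_prob(x, b) for b in bags])
```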
Feature Relevance
• "Closeness" depends on the features used.
• Problem: some features may be irrelevant, and others may be more important than the rest.
• Solution: weight the features according to their relevance, using the weighted distance
||B_ij - x||^2 = ∑_k w_k (B_ijk - x_k)^2
and find the best weighting by choosing the weights that maximize Diverse Density.
Label Prediction
• Predict the label of an unknown bag B_i under hypothesis t:
Label(B_i | t) = max_j exp[-∑_k (w_k (B_ijk - t_k))^2]
where w_k is a scale factor indicating the importance of feature dimension k.
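A sketch of this prediction rule, assuming the hypothesis point t and feature weights w have already been found by maximizing DD; the function name and variable names are illustrative.

```python
import numpy as np

def predict_label(bag, t, w):
    """Label(B_i | t) = max_j exp(-sum_k (w_k * (B_ijk - t_k))^2)."""
    scaled_sq_dist = np.sum((w * (bag.instances - t)) ** 2, axis=1)
    return np.max(np.exp(-scaled_sq_dist))
```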
Finding the Maximum DD
• Use gradient ascent with multiple starting points.
• The maximum DD peak is made up of contributions from some set of positive instances.
• Start an ascent from every positive instance: one of them is likely to be closest to the maximum, contribute the most to it, and climb directly onto it.
• While this heuristic is sensible for maximizing with respect to location, maximizing with respect to the feature-weight scaling may still lead to local maxima.
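A minimal sketch of this multi-restart search, reusing diverse_density from the earlier sketch. It substitutes SciPy's L-BFGS-B minimizer on the negative log of DD for the paper's own optimizer, and optimizes over location only (no feature weights); both choices are my simplifications.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_dd(x, bags):
    # Maximizing DD(x) is equivalent to minimizing -log DD(x).
    return -np.log(diverse_density(x, bags) + 1e-300)

def find_max_dd(bags):
    best_x, best_val = None, np.inf
    starts = [inst for b in bags if b.label == 1 for inst in b.instances]
    for x0 in starts:  # one ascent per positive instance
        res = minimize(neg_log_dd, x0, args=(bags,), method="L-BFGS-B")
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x
```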
Experiments
• Figure 3(a) shows the regular density surface for the data set in Figure 2; it is clear that finding the peak is difficult. Figure 3(b) plots the DD surface, where it is easy to pick out the global maximum, which is the desired concept.
Performance Evaluation
• The table below lists the average accuracy over twenty runs, compared with the performance of the two principal algorithms reported in [Dietterich et al., 1997] (iterated-discrim APR and GFS elim-kde APR), as well as the MULTINST algorithm from [Auer, 1997].
EM-DD [Zhang and Goldman, 2001]
• In the MIL setting, the label of a bag is determined by the "most positive" instance in the bag, i.e., the one with the highest probability of being positive among all instances in that bag. The difficulty of MIL comes from the ambiguity of not knowing which instance that is.
• In [Zhang and Goldman, 2001], the knowledge of which instance determines the bag's label is modeled with a set of hidden variables, which are estimated using an Expectation Maximization (EM) style approach. The resulting algorithm, EM-DD, combines this EM-style approach with the DD algorithm.
EM-DD Algorithm
• An Expectation Maximization algorithm [Dempster, Laird and Rubin, 1977].
• Start with an initial guess h (which can be obtained using the original DD algorithm), set to some appropriate instance from a positive bag.
• E-step: use h to pick the one instance from each bag that is most likely (under the generative model) to be responsible for its label.
• M-step: run the two-step gradient ascent search (quasi-Newton search) of the standard DD algorithm to find a new h' that maximizes DD(h').
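A rough sketch of the EM-DD loop as described above, again building on the Bag sketch. The M-step here simply optimizes the single-instance DD objective of the instances picked in the E-step with SciPy's L-BFGS-B, and the stopping test and iteration limit are my assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def emdd(bags, h0, max_iter=20, tol=1e-6):
    """EM-DD sketch: alternate picking the most likely instance per bag (E-step)
    with a gradient-based search for a new hypothesis (M-step)."""
    h = np.asarray(h0, dtype=float)
    prev = np.inf
    for _ in range(max_iter):
        # E-step: from each bag, pick the instance closest to h
        # (the one most likely to be responsible for the bag's label).
        picked = np.array([
            b.instances[np.argmin(np.sum((b.instances - h) ** 2, axis=1))]
            for b in bags
        ])
        labels = np.array([b.label for b in bags])

        # M-step: maximize the single-instance DD of the picked instances.
        def neg_log_dd(x):
            p = np.exp(-np.sum((picked - x) ** 2, axis=1))
            p = np.where(labels == 1, p, 1.0 - p)
            return -np.sum(np.log(p + 1e-300))

        res = minimize(neg_log_dd, h, method="L-BFGS-B")
        h = res.x
        if abs(prev - res.fun) < tol:  # stop when the objective stops improving
            break
        prev = res.fun
    return h
```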