Recall Systems: Efficient Learning and Use of Category Indices Omid Madani With Wiley Greiner, David Kempe, and Mohammad Salavatipour
Overview • Problems and motivation • Proposal: recall systems • Experiments • Related work and conclusions
Massive Learning • Lots of ... • Instances (millions, unbounded..) • Dimensions (1000s and beyond) • Categories (1000s and beyond) • Two questions: • How to quickly categorize? • How to efficiently learn to categorize efficiently?
Yahoo! Page Topics (Y! Directory) [Diagram: fragment of the Yahoo! directory taxonomy, with categories such as Arts&Humanities, Business&Economy, Recreation&Sports, Sports, Photography, History, Contests, Amateur, Magazines, Education, College, Basketball] • Over 100,000 categories in the Yahoo! directory • Given a page, quickly categorize… • Larger for vision, text prediction, ... (millions and beyond)
Efficiency • Two phases (unless truly online): • Learning • Classification time/deployment • Resource requirements: • Memory • Time • Sample efficiency
Idea • Cues in the input may quickly narrow down the possibilities => “index” categories • Like a search engine, but learn a good index • Goal here: the index reduces the possible classes; classifiers are then applied for precise classification
Summary Findings • Very fast: • Train time: learned in minutes on thousands of instances/categories • 10s of online classifiers trained on each instance (not 1000s) • Index doesn’t hurt classifier accuracy!
[Diagram: the recognition system — instance x → recall system → reduced set of candidate categories → classifier application → categories for x]
The Problem: Tripartite Graph [Diagram: tripartite graph connecting features (f1–f5), instances (x1–x7), and categories (c1–c4)]
Output: An Index [Diagram: bipartite graph from features (f1–f5) to concepts (c1–c5); the set of edges E is the “cover”]
Using the Index • Given instance x, retrieve the candidate set of concepts: C(x) = { c : (f, c) ∈ E for some feature f of x } • A concept is retrieved when a disjunction of its features is satisfied (sketched below)
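A minimal sketch of this retrieval step, assuming the cover is stored as an inverted map from features to concepts (all names here are hypothetical, not the authors' code):

```python
from collections import defaultdict

# Hypothetical inverted index: feature -> set of concepts with an edge in the cover E.
index = defaultdict(set)

def retrieve(x):
    """Return the candidate concept set for instance x (a set of active features).

    A concept enters the candidate set as soon as any one of x's features
    has an edge to it in the cover, i.e. the disjunction is satisfied.
    """
    candidates = set()
    for f in x:
        candidates |= index[f]
    return candidates
```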
Terminology • False positive: The retrieved concept shouldn’t have been retrieved (irrelevant) • False negative: The concept should have been retrieved, but was not (missed)
Learning to Index • Let’s learn the cover (the edges) • Online and mistake-driven • A mistake means: • A false negative concept, or • Too many false positives
The Indexer Algorithm • For each concept c keep a sparse vector Vc, initially 0 • Begin with an empty cover • On each instance x: • Retrieve candidate concepts • Update Vc for each false negative c (promotion) • If fp-count > tolerance, update Vc for each false positive c (demotion) • Update the index accordingly • Update classifiers • (Loop sketched below)
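A sketch of this online loop, reusing `retrieve` from above; `promote` and `demote` (defined after the update slide below) and the stream format are assumptions, not the authors' exact code:

```python
def train_indexer(stream, tolerance):
    """One mistake-driven pass over (feature set, true concept set) pairs."""
    for x, true_concepts in stream:
        candidates = retrieve(x)
        false_negatives = true_concepts - candidates   # missed concepts
        false_positives = candidates - true_concepts   # irrelevant retrievals
        for c in false_negatives:
            promote(c, x)                              # strengthen c's features in x
        if len(false_positives) > tolerance:           # fp-count exceeds tolerance
            for c in false_positives:
                demote(c, x)                           # weaken c's features in x
        # classifier updates for the retrieved candidates would go here
```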
Use Feature Weights • For each concept c keep a sparse vector Vc, initially 0 • The (fi, cj)-edge exists in the cover iff Vcj(fi) ≥ θ, where θ is the inclusion threshold
Updating the Vectors • Increase/decrease the weights in Vc of features that appear in x by the learning rate • In promotion, if a feature is not present in Vc: initialize it to 1 or 1/df • In demotion: ignore zero-weight features • Max-normalize the weights (optional) • Update the index • Takes O(|x| + |Vc|) per instance • (One possible reading sketched below)
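Continuing the sketch, one plausible reading of these updates (the learning rate of 1.2 from the experiments slide suggests multiplicative, Winnow-style updates; the exact rule is an assumption):

```python
RATE = 1.2    # learning rate (value from the experiments slide)
THETA = 0.1   # inclusion threshold: the (f, c)-edge is in the cover iff V[c][f] >= THETA

V = defaultdict(dict)  # concept -> sparse weight vector Vc

def promote(c, x):
    """False negative: strengthen the weights of x's features for concept c."""
    for f in x:
        if f in V[c]:
            V[c][f] *= RATE
        else:
            V[c][f] = 1.0          # initialize to 1 (or 1/df, per the slide)
    _normalize_and_reindex(c)

def demote(c, x):
    """False positive: weaken the weights of x's features for concept c."""
    for f in x:
        if f in V[c]:              # ignore zero-weight features
            V[c][f] /= RATE
    _normalize_and_reindex(c)

def _normalize_and_reindex(c):
    """Max-normalize Vc, then refresh c's edges in the cover: O(|Vc|)."""
    if not V[c]:
        return
    m = max(V[c].values())
    for f in list(V[c]):
        V[c][f] /= m
        if V[c][f] >= THETA:       # edge enters (or stays in) the cover
            index[f].add(c)
        else:                      # edge leaves the cover
            index[f].discard(c)
```

Together with retrieval, this matches the O(|x| + |Vc|) per-instance cost claimed on the slide.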
Analysis • Under a distribution X on instances, a given cover E induces: • a false-positive rate: fp-rate(E) = E_{x~X}[fp-count on x] • a false-negative rate, fn-rate(E), defined analogously
Analysis • If fp-rate(E) ≤ fp and fn-rate(E) ≤ fn, we say the cover is an (fp, fn)-cover • Is there an algorithm that converges efficiently to an (fp, fn)-cover? • We can show this for the max-norm algorithm, given the existence of a (0,0)-cover and with the tolerance set to 0
Convergence of max-norm • The max-norm algorithm converges to a (0,0)-cover, given one exists and the tolerance is set to 0 • The max-norm algorithm makes O(KL) mistakes for a concept with K pure features and average instance length L
Pure Features • A feature f is pure for concept c if, whenever f occurs, the instance belongs to c • A “pure” feature never gets “punished” (demoted) for its concept • It takes O(L) mistakes to drive the other, irrelevant features out of the index
Complexity Results • Deciding the existence of an (fp, fn)-cover is NP-hard (when fp > 0; fn can remain 0) • Approximation is also NP-hard! • Why, then, is it so successful in practice?!
Variations • Some alternatives: • Use of weights for ranking • Other update policies • Additive updates • Use of other norms, or no norm • Batch versus online • …
[Diagram repeated from above: the recognition system — instance x → recall system → reduced set of candidate categories → classifier application → categories for x]
The Classifiers • (Possibly) Binary classifiers: • One for each concept • For learning the classifiers: • Online learning algorithms
Learners Used • Need online algorithms • Experimented with: • Perceptron • Winnow • Committees of these (voted perceptrons, etc.)
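As one concrete possibility, a minimal sparse perceptron for a single concept (a sketch under the slide's setup, not the exact learner used):

```python
class SparsePerceptron:
    """Binary online perceptron over sparse feature sets, one per concept."""

    def __init__(self):
        self.w = {}     # sparse weight vector
        self.b = 0.0    # bias

    def predict(self, x):
        """Score the active features of x and threshold at zero."""
        return self.b + sum(self.w.get(f, 0.0) for f in x) > 0

    def update(self, x, label):
        """Mistake-driven: adjust weights only when the prediction is wrong."""
        if self.predict(x) != label:
            delta = 1.0 if label else -1.0
            for f in x:
                self.w[f] = self.w.get(f, 0.0) + delta
            self.b += delta
```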
Questions • Small tolerance (10s, 100s) enough? • Convergence? Overhead (speed & memory)? • Overall performance? (together with classifier training and testing)
Size Statistics • 3 large text categorization corpora: • The big new Reuters corpus (Rose et al.) • An ads dataset (internal) • ODP = Open Directory Project (web pages and their categories)
Experimental Setup • Split data into 70% train and 30% test • Same split used for all experiments • Algorithm parameters: • Tolerance = 100 • Learning rate = 1.2 • Inclusion threshold = 0.1 • 2.4 GHz machine with 64 GB RAM
[Results table: performance with classifiers on Reuters; all three domains, but a subset of classes]
Indexer’s Performance [Plots: fp-rate and fn-rate at pass i, for Reuters, Ads, and ODP]
Indexer’s Timings [Table of training times; m = minutes, h = hours]
Performance With Classifiers I [Table: Reuters; “No” = index NOT used, “Yes” = index used]
With Classifiers II [Plots: F1 score (harmonic mean of precision and recall) at pass i — Reuters, 50 sample categories; Ads, 76 sample categories; ODP, 108 sample categories]
Error Plot [Plot: total, false-negative, and false-positive errors]
W and fp-rate Convergence [Plot: convergence as a function of the number of instances]
Fn-rate vs. Tolerance [Plot: fn-rate as a function of tolerance]
Fp-rate vs. Tolerance [Plot: fp-rate as a function of tolerance]
Index Size Statistics [Table: index size statistics after 20 passes]
High Out-degree Features • In Reuters: • “woodmark” (out-degree 10): Wooden Furniture, Measuring & Precision Instruments, Electronic Active Components, … • “prft” (64) • “shr” (59)
Related Work • Fast classification candidates: • hierarchical learning, trees (kd, metric, ball, vp, cover, ..), • inverted indices (search engines!) • Fast learning candidates: • Nearest neighbors • Naïve Bayes • Generative models • Hierarchical learning • Feature selection/reduction
Related • Fast visual categorization in biological systems (e.g. Thorpe et al) • Psychology of concepts (e.g. Murphy’02) • Associative memory, speed up learning, blackboard systems, models of aspects of mind/brain
Summary • Problem: efficiently learn and classify when categories abound • Proposed the recall system: an index that serves as a filter • Efficiently learned the filter: quickly learned a quick system!
Current/Future • Evaluation on other domains • Language modeling, prediction • Vision ... • Extend techniques • Ranking (easier than labeling: got very promising results) • Learn “staged” versions • Concept discovery • Understand better: • Why do such efficient algorithms work? • Why should good covers exist? What tolerance? • Strengthen the convergence analysis
Acknowledgements • Thanks to Thomas Pierce for helping us with the Nutch engine • The Y!R ML group (DeCoste and Keerthi) for discussions