
Learning Data Representations with “Partial Supervision”


Presentation Transcript


  1. Learning Data Representations with “Partial Supervision” Ariadna Quattoni

  2. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  3. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  4. Semi-Supervised Learning. Core task: learn a function from the “raw” feature space X to the output space Y. Classical setting: a small labeled dataset plus a large unlabeled dataset. Partial supervision setting: a small labeled dataset plus a large partially labeled dataset.

  5. Semi-Supervised Learning, Classical Setting. Use the unlabeled dataset to learn a representation (dimensionality reduction), then train a classifier on the labeled dataset.

  6. Semi-Supervised Learning, Partial Supervision Setting. Use the unlabeled dataset plus the partial supervision to learn a representation (dimensionality reduction), then train a classifier on the labeled dataset.

  7. Why is “learning representations” useful? • Infer the intrinsic dimensionality of the data. • Learn the “relevant” dimensions. • Infer the hidden structure.

  8. Example: Hidden Structure. 20 symbols, 4 topics; each data point uses a subset of 3 symbols. To generate a data point: • Choose a topic T. • Sample 3 symbols from T. (Figure: the resulting data covariance.)
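A minimal sketch (not from the slides) of this generative process in NumPy; splitting the 20 symbols into 4 disjoint topics of 5 symbols each is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols, n_topics, symbols_per_point = 20, 4, 3
# Assumption: each topic owns a disjoint subset of 5 of the 20 symbols.
topic_symbols = np.array_split(rng.permutation(n_symbols), n_topics)

def sample_point():
    """Choose a topic T, then sample 3 symbols from T; return a bag-of-symbols vector."""
    t = rng.integers(n_topics)
    chosen = rng.choice(topic_symbols[t], size=symbols_per_point, replace=False)
    x = np.zeros(n_symbols)
    x[chosen] = 1.0
    return x, t

X, topics = zip(*(sample_point() for _ in range(1000)))
X = np.vstack(X)
cov = np.cov(X, rowvar=False)  # block structure in this covariance reflects the 4 topics
```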

  9. Example: Hidden Structure. • Number of latent dimensions = 4. • Map each x to the topic that generated it. • Function: the latent representation (a topic vector) is obtained by applying a projection matrix to the data point, z = Θx.

  10. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  11. Classical Setting: Principal Component Analysis. • Rows of Θ as a “basis”: x ≈ Θᵀz. • Example generated by the topic basis rows T1, T2, T3, T4. • Low reconstruction error: ||x − ΘᵀΘx||² is small.

  12. Minimum Error Formulation. Approximate the high-dimensional x with a low-dimensional x′ expressed in an orthonormal basis. Error: J = (1/N) Σ_n ||x_n − x′_n||². Solution: the basis is given by the top eigenvectors of the data covariance, and the distortion equals the sum of the discarded eigenvalues.
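A hedged sketch (not from the slides) of this minimum-error formulation in NumPy, taking the basis from the eigenvectors of the data covariance:

```python
import numpy as np

def pca_min_error(X, h):
    """Project centered data onto the top-h eigenvectors of the data covariance.

    Returns the low-dimensional representation, the orthonormal basis, and the
    distortion, which equals the sum of the discarded eigenvalues.
    """
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)            # data covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    U = eigvecs[:, order[:h]]               # orthonormal basis (d x h)
    Z = Xc @ U                              # low-dimensional representation (n x h)
    distortion = eigvals[order[h:]].sum()
    return Z, U, distortion
```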

  13. Principal Component Analysis, 2D Example. • The projection error is the distance from each point to its projection on the principal direction. • PCA produces uncorrelated variables and cuts dimensions according to their variance. • For this to help, the original variables must be correlated.

  14. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  15. Partial Supervision Setting [Ando & Zhang, JMLR 2005]. From the unlabeled dataset plus partial supervision, create auxiliary tasks and use them for structural learning.

  16. Partial Supervision Setting. • Unlabeled data + partial supervision: • Images with associated natural-language captions. • Video sequences with associated speech. • Documents with keywords. • How could the partial supervision help? • It is a hint for discovering important features. • Use the partial supervision to define “auxiliary tasks”. • Discover feature groupings that are useful for these tasks. Sometimes auxiliary tasks are defined from unlabeled data alone, e.g. an auxiliary task for word tagging is predicting substructures of the input.

  17. Auxiliary Tasks. Core task: is this a computer vision or a machine learning article? Given computer vision papers (keywords: object recognition, shape matching, stereo) and machine learning papers (keywords: machine learning, dimensionality reduction, linear embedding, spectral methods, distance learning), mask the occurrences of the keywords and define auxiliary tasks such as: predict “object recognition” from the document content (sketched below).
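A hypothetical sketch of this construction; the function name is illustrative, and single-token keywords are assumed (multi-word keywords such as “object recognition” would need phrase matching):

```python
def make_auxiliary_task(documents, keyword):
    """Build one auxiliary task: label = does the keyword occur in the document?
    Occurrences of the keyword itself are masked out of the input."""
    inputs, labels = [], []
    for doc in documents:                                     # doc: list of tokens
        labels.append(int(keyword in doc))
        inputs.append([tok for tok in doc if tok != keyword])  # hide the keyword
    return inputs, labels
```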

  18. Auxiliary Tasks

  19. Learning from auxiliary tasks. (Figure: hypotheses learned for related tasks feed into structural learning; learning with this prior knowledge brings the hypothesis learned from examples closer to the best hypothesis than learning with no prior knowledge.)

  20. Learning Good Hypothesis Spaces. • Class of linear predictors: f_ℓ(x) = w_ℓᵀx + v_ℓᵀΘx, where Θ holds the shared parameters and w_ℓ, v_ℓ the problem-specific parameters. • Θ is an h-by-d matrix of structural parameters. • Goal: find the problem-specific parameters and the shared Θ that minimize the joint loss on the training sets of all the problems.

  21. Algorithm Step 1: Train classifiers for the auxiliary tasks.

  22. Algorithm Step 2: PCA on the classifier coefficients. Take the first h eigenvectors of the covariance matrix of the auxiliary classifiers’ coefficient vectors. This yields a linear subspace of dimension h: a good low-dimensional approximation to the space of coefficients.
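A sketch of Step 2 under assumed notation: stack the auxiliary weight vectors as the rows of a matrix W and take the top-h principal directions, here via an SVD, which gives the same directions as eigendecomposing WᵀW:

```python
import numpy as np

def shared_structure(W, h):
    """W: (m, d) matrix whose rows are the m auxiliary classifiers' weight vectors.

    Returns Theta, an (h, d) projection matrix whose rows are the top-h right
    singular vectors of W (equivalently, top eigenvectors of W^T W)."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:h]
```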

  23. Algorithm Step 3: Training on the core task. Project the data, z = Θx, and train on the labeled set. This is equivalent to training the core task in the original d-dimensional space with the constraint that the weight vector lies in the span of the rows of Θ.
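A hedged sketch of Step 3, using scikit-learn logistic regression as an assumed base learner (the slides do not fix this choice); the labeled data are augmented with the projected features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumed base learner

def train_core_task(X, y, Theta):
    """Augment each labeled example with its projection z = Theta x and train
    the core-task classifier on the combined features [x, z]."""
    Z = X @ Theta.T                 # (n, h) projected features
    X_aug = np.hstack([X, Z])       # original + shared-structure features
    return LogisticRegression(max_iter=1000).fit(X_aug, y)
```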

  24. Example Object = { letter, letter, letter } • An object abC

  25. Example • The same object seen in a different font Abc

  26. Example • The same object seen in a different font ABc

  27. Example • The same object seen in a different font abC

  28. Example. 6 letters (topics), 5 fonts per letter (symbols), so 30 symbols → 30 features; 20 words in total. Example words: the “ABC” object, the “ADE” object, the “BCF” and “ABD” words. Auxiliary task: recognize an object (e.g. “acE”) from its features.

  29. PCA on the data cannot recover the latent structure. (Figure: covariance of the raw data.)

  30. PCA on the coefficients can recover the latent structure. (Figure: the matrix W of auxiliary-task parameters over the features, i.e. fonts; the topics correspond to letters; highlighted are the parameters for the object “BCD”.)

  31. PCA on the coefficients can recover the latent structure. (Figure: the covariance of W over the features, i.e. fonts; each block of correlated variables corresponds to a latent topic.)

  32. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  33. News domain. Dataset: news images from the Reuters web site. Problem: predicting news topics (e.g. golden globes, ice hockey, figure skating, grammys) from images.

  34. Learning visual representations using images with captions. Auxiliary task: predict “team” from image content. Example captions: • Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine. • Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad. • The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics. • Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006. • U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival. • Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet, …

  35. Learning visual topics. The word “games” might contain some visual topics and the word “demonstrations” others (e.g. medals, pavement, people). Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.

  36. Experiments: Results.

  37. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  38. Chunking. • Named entity chunking: “Jane lives in New York and works for Bank of New York.” (PER, LOC, ORG). • Syntactic chunking: “But economists in Europe failed to predict that …” (NP PP NP VP SBAR). Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside (illustrated below).
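A small illustration (not from the slides) of the begin/inside/outside label scheme on the sentence above, in the common B-/I-/O shorthand:

```python
tokens = ["Jane", "lives", "in", "New", "York", "and", "works", "for",
          "Bank", "of", "New", "York", "."]
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O", "O",
          "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
```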

  39. Example input vector representation. For the occurrence “… lives in New York …”, the input vector X for the word “New” has binary indicator entries such as curr-“New”, left-“in”, right-“York” (and the previous position has curr-“in”, left-“lives”, right-“New”). • The input vectors are high-dimensional and most entries are 0. A sketch of this feature construction follows below.
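A minimal sketch, with illustrative names, of how such indicator features could be built for one word position:

```python
def token_features(tokens, i):
    """Binary indicator features for one word occurrence: the current word
    plus its left and right neighbours, as in the example above."""
    feats = {f"curr-{tokens[i]}": 1}
    if i > 0:
        feats[f"left-{tokens[i-1]}"] = 1
    if i + 1 < len(tokens):
        feats[f"right-{tokens[i+1]}"] = 1
    return feats

# token_features(["lives", "in", "New", "York"], 2)
# -> {"curr-New": 1, "left-in": 1, "right-York": 1}
```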

  40. Algorithmic Procedure. • Create m auxiliary problems. • Assign auxiliary labels to the unlabeled data. • Compute Θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems. • Fix Θ, and minimize the empirical risk on the labeled data for the target task. Predictor: f(x) = wᵀx + vᵀΘx, where Θx supplies the additional features.

  41. Example auxiliary problems. Auxiliary problems of the form “Is the current word ‘New’?”, “Is the current word ‘day’?”, “Is the current word ‘IBM’?”, “Is the current word ‘computer’?”, … Each predicts feature block 1 (the current word) from feature block 2 (the left and right words). Compute the shared Θ from these predictors and add Θ·f2 as new features (a sketch follows below).
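A hedged sketch of generating such auxiliary examples from unlabeled sentences; the word list and helper name are illustrative:

```python
# Auxiliary problems "is the current word w?", to be predicted from the
# context features only (feature block 2), using unlabeled text.
AUX_WORDS = ["New", "day", "IBM", "computer"]   # in practice, e.g. frequent words

def auxiliary_examples(sentences):
    for tokens in sentences:
        for i in range(len(tokens)):
            context = {}
            if i > 0:
                context[f"left-{tokens[i-1]}"] = 1
            if i + 1 < len(tokens):
                context[f"right-{tokens[i+1]}"] = 1
            aux_labels = {w: int(tokens[i] == w) for w in AUX_WORDS}
            yield context, aux_labels
```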

  42. Experiments (CoNLL-03 named entity). • 4 classes: LOC, ORG, PER, MISC. • Labeled data: news documents; 204K words (English), 206K words (German). • Unlabeled data: 27M words (English), 35M words (German). • Features: a slight modification of ZJ03. Words, POS, character types, and 4 characters at the beginning/end of words in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bigram of the current word and left label; labels assigned to previous occurrences of the current word. No gazetteer. No hand-crafted resources.

  43. Auxiliary problems: 300 auxiliary problems were used.

  44. Syntactic chunking results (CoNLL-00): an improvement of +0.79%, exceeding previous best systems.

  45. Other experiments. Confirmed effectiveness on: • POS tagging • Text categorization (2 standard corpora).

  46. Outline • Motivation: Low dimensional representations. • Principal Component Analysis. • Structural Learning. • Vision Applications. • NLP Applications. • Joint Sparsity. • Vision Applications.

  47. Joint Sparse Approximation: Notation. A collection of m related classification tasks, each with its own training set; one linear classifier is learned per task, and their coefficient vectors are collected in a matrix W.

  48. Single Task Sparse Approximation. • Consider learning a single sparse linear classifier of the form f(x) = wᵀx. • We want few features with non-zero coefficients. • Recent work suggests using L1 regularization: minimize the classification error on the training set plus λ||w||₁, where the L1 term penalizes non-sparse solutions. • Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
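A hedged sketch of the single-task case using scikit-learn's L1-regularized logistic regression (an assumed stand-in for the generic classification loss on the slide); X_train and y_train are assumed to be given:

```python
from sklearn.linear_model import LogisticRegression

# Smaller C means a stronger L1 penalty and hence a sparser weight vector.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)             # X_train, y_train assumed given
n_used = (clf.coef_ != 0).sum()       # number of features with non-zero weight
```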

  49. Joint Sparse Approximation. • Setting: minimize the average loss on the training set of each task k, summed over tasks, plus a joint regularization term that penalizes solutions that utilize too many features across tasks.

  50. Joint Regularization Penalty. • How do we penalize solutions that use too many features? In the coefficient matrix W, each row holds the coefficients for one feature and each column the coefficients for one classifier; a natural penalty counts the features (rows) that are non-zero for at least one classifier (a small sketch of this count follows below). • Optimizing such a count directly would lead to a hard combinatorial problem.
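A minimal sketch, under the W layout assumed above (features as rows, classifiers as columns), of the combinatorial count that motivates convex relaxations:

```python
import numpy as np

def shared_feature_count(W, tol=1e-8):
    """Number of features (rows of the d x m matrix W) that are non-zero for at
    least one of the m classifiers. Using this directly as a penalty is a hard
    combinatorial problem."""
    return int((np.abs(W) > tol).any(axis=1).sum())
```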
