
WHY MEANINGFUL AUTOMATIC TAGGING OF IMAGES IS VERY HARD

This presentation explores the complexities of automatic tagging of images, highlighting the challenges faced due to the vast array of images, semantic gaps, and human interpretation needed for effective labeling. The talk delves into the difficulties in scaling pixel-based methods and the limitations of relying on non-pixel information for tagging. It also discusses the obstacles posed by the sheer volume of images and the difficulty in establishing computational measures for similarity. Moreover, the presentation touches on security systems leveraging Human Interaction Proof (HIP) methods, emphasizing the need for human labor in image labeling processes. Despite advancements in technology, achieving accurate and meaningful image tagging remains a demanding task.


Presentation Transcript


  1. WHY MEANINGFUL AUTOMATIC TAGGING OF IMAGES IS VERY HARD Theo Pavlidis Stony Brook University t.pavlidis@ieee.org ICME2009 talk

  2. We expect dealing with images to be much harder than dealing with text. • The human visual system has evolved from animal visual systems over a period of more than 200 million years. • Speech is barely over 100 thousand years old. • Written text is about 5 thousand years old. In humans the visual system occupies 1/3 of the brain, a much larger portion than the auditory system. 85% of human sensory information is the result of visual inputs.

  3. Three Specific Pieces of Evidence why Auto-Tagging is hard • Failure of past pixel-based techniques to scale to real world data. • Efforts to base tagging on non-pixel information and their limits. • Security systems based on the assumption that automatic tagging is impossible.

  4. Pixel-based methods do not scale • Methods work well in published examples, but fail at large scale because of: • Huge cardinality of the set of all possible images: the number of different discernible images is at least 10^25 (over a trillion squared). • Semantic gap (actually semantic abyss)

  5. A pair from a set of 5^36 (>10^25) images
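The size figures quoted on slides 4 and 5 can be checked with exact integer arithmetic. The short sketch below only compares the three numbers mentioned in the talk; the variable names are illustrative and not from the original.

```python
# Compare the quantities quoted on the slides using Python's exact integers.
set_size = 5 ** 36                   # size of the image set on slide 5
lower_bound = 10 ** 25               # "at least 10^25" discernible images (slide 4)
trillion_squared = (10 ** 12) ** 2   # "a trillion squared"

print(set_size)                        # exact value of 5^36
print(set_size > lower_bound)          # is 5^36 really above 10^25?
print(lower_bound > trillion_squared)  # is 10^25 really over a trillion squared?
```

Both comparisons hold, which is consistent with the slide's claim that even a narrowly defined set of images is far too large to sample exhaustively.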

  6. Cardinality Problems • Because the number of images is so large, it is very hard to find a representative sample. • Even if many of the different images may have the “same” meaning for a human viewer, their pixel values may differ a lot. Hence the semantic and other gaps. • Aside: The cardinality problem can be dealt with by limiting the class of images and the matching rules (examples are applications in biometrics). Using synthetic data (if we know the rules) also helps.

  7. The Semantic Abyss • Perceptually close (agreement amongst observers) • Computationally close (similar pixel statistics)
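The distinction between "computationally close" and "perceptually close" can be illustrated with a toy case. The sketch below assumes that "pixel statistics" means a gray-level histogram, which is only one possible reading; the two 4x4 images and the function name are hypothetical and do not come from the talk.

```python
from collections import Counter

def histogram(image):
    """Count how often each gray level occurs, ignoring pixel positions."""
    return Counter(pixel for row in image for pixel in row)

# Two hypothetical 4x4 binary images (0 = black, 1 = white).
half_and_half = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
checkerboard = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

# Identical statistics (8 black, 8 white pixels each) ...
same_statistics = histogram(half_and_half) == histogram(checkerboard)
# ... yet the spatial layouts, which a viewer actually sees, differ.
same_pixels = half_and_half == checkerboard

print(same_statistics, same_pixels)
```

A measure based only on such statistics would rate the two images as identical, although a human observer would not describe them the same way.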

  8. The Conceptual Abyss • Conceptually close (but not for all observers) • Computationally close (large areas with similar local pixel statistics)

  9. A Major Obstacle • Human observers tend to agree on images that are quite similar or quite dissimilar (slide on “semantic abyss”) but not on those in between (slide on “conceptual abyss”). • If there is no agreement on similarity amongst human observers, how can we establish computational measures for similarity?

  10. Tagging (Labeling) is much harder than matching because it requires interpretation • ΠΑΝΚΟΣΜΙΟΣ ΠΟΛΕΜΟΣ (“World War”) • ΠΟΛΕΜΟΣ ΠΑΤΗΡ ΠΑΝΤΩΝ (“War is the father of all”) • Not surprisingly, results of online systems are poor.

  11. Results from ALIPR • building, landmark, rock, historical, ruin, texture, man-made, landscape, natural, sky, ocean, castle, car, beach, grass • indoor, rock, flower, food, pattern, yellow, texture, agate, vegetable, natural, fruit, barbecue, cuisine, dessert, tree

  12. Result No. 1 from a new site • Mammals, show, Business Woman, animals, black, business, attitude, full, office workers, business, computers, office, smiles, close-up, businessman, adults, parents

  13. Result No. 2 from a new site • Rest, chairs, architecture, animals, Europe, church, boats, livestock, ports, city, Italy, the sea, building, boat, beach, housing, harbor, holiday

  14. Three Specific Pieces of Evidence why Auto-Tagging is hard • Failure of past pixel-based techniques to scale to real world data. • Efforts to base tagging on non-pixel information and their limits. • Security systems based on the assumption that automatic tagging is impossible.

  15. Efforts to base tagging on non-pixel information and their limits • If text is available with an image, then several authors (starting in 1995) have described methods for assigning tags (coupled with image analysis). • Linguistic ambiguity presents challenges to the labeling process.

  16. Efforts to base tagging on non-pixel information and their limits • For images obtained with digital cameras, the EXIF record in combination with some pixel information can be used to assign tags, e.g. “Sunset in New York City Harbor”. (See Wong and Leung [15].) • But the EXIF record is not always available and it may not be preserved by image processing programs.
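How an EXIF record and a little pixel information might be combined into a tag can be sketched as follows. This is a toy illustration under stated assumptions, not the method of Wong and Leung: the function `suggest_tag`, the evening-hour range, the "warm colour" test, and the idea that a place name has already been derived from the camera's location data are all hypothetical. The only fact relied on is that EXIF stores capture time in the form `YYYY:MM:DD HH:MM:SS`.

```python
from datetime import datetime

def suggest_tag(exif, mean_rgb, place_name):
    """Combine capture time (from EXIF) with a crude colour cue (from pixels).

    `exif` is a plain dict of EXIF fields; `place_name` is assumed to have
    been looked up elsewhere. Returns None when the EXIF record is missing.
    """
    timestamp = exif.get("DateTimeOriginal")
    if timestamp is None:
        # The slide's caveat: the EXIF record is not always available.
        return None

    taken = datetime.strptime(timestamp, "%Y:%m:%d %H:%M:%S")
    red, green, blue = mean_rgb

    evening = 17 <= taken.hour <= 21   # assumed range for "around sunset"
    warm = red > green > blue          # assumed "sunset-like" colour cue

    if evening and warm:
        return f"Sunset in {place_name}"
    return place_name

print(suggest_tag({"DateTimeOriginal": "2009:06:28 20:05:00"},
                  (200, 120, 60), "New York City Harbor"))
print(suggest_tag({}, (200, 120, 60), "New York City Harbor"))
```

The second call shows the limit named on the slide: once an image processing program has dropped the EXIF record, this kind of tagging has nothing to work with.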

  17. Three Specific Pieces of Evidence why Auto-Tagging is hard • Failure of past pixel-based techniques to scale to real world data. • Efforts to base tagging on non-pixel information and their limits. • Security systems based on the assumption that automatic tagging is impossible.

  18. Security systems based on Human Interaction Proof (HIP) • HIP (and CAPTCHA) are methods that try to distinguish human users from web-bots. • Currently they rely on distorted text. • A more secure system for the future is to ask what is in an image. (Assuming that web-bots cannot do that.) • But then we need enormous human labor to label images for checking the answers.

  19. Harnessing Human Labor • Luis von Ahn (a co-inventor of CAPTCHA) observed that people spend a lot of time playing computer games, so he created the ESP game where people end up labeling images. • Google licensed the ESP method and created the Google Image Labeler. • Results of human labeling are “cleaned-up” by statistical analysis.
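The idea of collecting labels through a game and then cleaning them up statistically can be sketched in a few lines. The talk does not say which statistical analysis is used, so the rule below (keep a label only if two players agreed on it in at least `min_rounds` separate rounds) is an assumption chosen for illustration, as are the function names and the sample data.

```python
from collections import Counter

def agreed_labels(player_a, player_b):
    """Labels typed independently by both players in one round (case-insensitive)."""
    return {w.lower() for w in player_a} & {w.lower() for w in player_b}

def clean_labels(rounds, min_rounds=2):
    """Keep labels that pairs of players agreed on in at least `min_rounds` rounds."""
    counts = Counter()
    for player_a, player_b in rounds:
        counts.update(agreed_labels(player_a, player_b))
    return sorted(label for label, n in counts.items() if n >= min_rounds)

# Hypothetical rounds played on the same image by three different pairs.
rounds = [
    (["harbor", "boat", "sea"], ["boat", "harbor", "Italy"]),
    (["boat", "church"], ["Boat", "city"]),
    (["harbor", "holiday"], ["port", "harbor"]),
]

print(clean_labels(rounds))
```

Raising `min_rounds` trades coverage for reliability: fewer labels survive, but each surviving label has been confirmed by more independent pairs of players.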

  20. Conclusions • Automating tagging by image processing techniques seems impossible in the foreseeable future. • There is a need for more research on methods for direct or indirect human tagging.
