WHY MEANINGFUL AUTOMATIC TAGGING OF IMAGES IS VERY HARD

Theo Pavlidis, Stony Brook University, t.pavlidis@ieee.org. ICME2009 talk

We expect dealing with images to be much harder than dealing with text.
• The human visual system has evolved from animal visual systems over a period of more than 200 million years.
• Speech is barely over 100 thousand years old.
• Written text is about 5 thousand years old.
In humans the visual system occupies 1/3 of the brain, a much larger portion than the auditory system. 85% of human sensorial information is the result of visual inputs.

Three Specific Pieces of Evidence Why Auto-Tagging Is Hard
• Failure of past pixel-based techniques to scale to real-world data.
• Efforts to base tagging on non-pixel information and their limits.
• Security systems based on the assumption that automatic tagging is impossible.

Pixel-based methods do not scale
• Methods work well in published examples but fail at scale because of:
  • Huge cardinality of the set of all possible images: the number of different discernible images is at least 10^25 (over a trillion squared).
  • Semantic gap (actually semantic abyss).

A pair from a set of 5^36 (> 10^25) images
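The size quoted for this set, read as 5^36 against a bound of 10^25, can be checked with exact integer arithmetic. The sketch below assumes the set is built from 36 independent cells that each take one of 5 values; the slide gives only the total, so that construction is an assumption made for illustration.

```python
# Assumption: 36 independent cells, 5 possible values each (5^36 images in total).
CELLS = 36
VALUES_PER_CELL = 5

set_size = VALUES_PER_CELL ** CELLS   # exact: Python integers do not overflow
bound = 10 ** 25                      # lower bound quoted in the talk
trillion_squared = (10 ** 12) ** 2    # a trillion squared is 10^24

print(set_size)                       # size of the toy image set
print(set_size > bound)               # is the set larger than 10^25?
print(bound > trillion_squared)       # is 10^25 "over a trillion squared"?
```

Even this toy set of tiny, coarsely quantized images already exceeds the bound, which suggests why a representative sample of real photographs is so hard to collect.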

Cardinality Problems
• Because the number of images is so large it is very hard to find a representative sample.
• Even if many of the different images may have the “same” meaning for a human viewer, their pixel values may differ a lot. Hence the semantic and other gaps.
• Aside: The cardinality problem can be dealt with by limiting the class of images and the matching rules (examples are applications in biometrics). Using synthetic data (if we know the rules) also helps.

The Semantic Abyss
Perceptually close (agreement amongst observers)
Computationally close (similar pixel statistics)
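To make "computationally close" concrete, the sketch below uses the gray-level histogram as a stand-in for pixel statistics. That choice is an assumption, since the slide does not name a specific statistic. Two tiny hypothetical images contain the same pixels in different arrangements, so their histograms are identical even though one shows two solid bands and the other a checkerboard.

```python
from collections import Counter

def histogram(image):
    """Count how often each gray level occurs, ignoring pixel positions."""
    return Counter(pixel for row in image for pixel in row)

def differing_pixels(a, b):
    """Number of positions at which two same-sized images disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

# Hypothetical 4x4 binary images: two horizontal bands vs. a checkerboard.
bands = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]

same_statistics = histogram(bands) == histogram(checker)
changed = differing_pixels(bands, checker)
print(same_statistics, changed)
```

A measure based only on such statistics would report the two images as identical, while half of their pixels differ and a human viewer would not call them similar.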

The Conceptual Abyss
Conceptually close (but not for all observers)
Computationally close (large areas with similar local pixel statistics)

A Major Obstacle
• Human observers tend to agree on images that are quite similar or quite dissimilar (slide on “semantic abyss”) but not on those in between (slide on “conceptual abyss”).
• If there is no agreement on similarity amongst human observers, how can we establish computational measures for similarity?

Tagging (Labeling) is much harder than matching because it requires interpretation
ΠΑΝΚΟΣΜΙΟΣ ΠΟΛΕΜΟΣ (world war)
ΠΟΛΕΜΟΣ ΠΑΤΗΡ ΠΑΝΤΩΝ (war is the father of all)
Not surprisingly, results of online systems are poor.

Results from ALIPR
building, landmark, rock, historical, ruin, texture, man-made, landscape, natural, sky, ocean, castle, car, beach, grass
indoor, rock, flower, food, pattern, yellow, texture, agate, vegetable, natural, fruit, barbecue, cuisine, dessert, tree

Result No. 1 from a new site
Mammals, show, Business Woman, animals, black, business, attitude, full, office workers, business, computers, office, smiles, close-up, businessman, adults, parents

Result No. 2 from a new site
Rest, chairs, architecture, animals, Europe, church, boats, livestock, ports, city, Italy, the sea, building, boat, beach, housing, harbor, holiday

Efforts to base tagging on non-pixel information and their limits
• If text is available with an image, then several authors (starting in 1995) have described methods for assigning tags (coupled with image analysis).
• Linguistic ambiguity presents challenges to the labeling process.

Efforts to base tagging on non-pixel information and their limits
• For images obtained with digital cameras, the EXIF record in combination with some pixel information can be used to assign tags, e.g. “Sunset in New York City Harbor”. (See Wong and Leung [15].)
• But the EXIF record is not always available and it may not be preserved by image processing programs.
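As a rough illustration of combining camera metadata with pixel information, the sketch below composes a tag from a capture time, GPS coordinates, and an average colour. It is not the method of Wong and Leung [15]. The field names, the one-entry gazetteer, its bounding box, and the thresholds are all hypothetical, and a real system would first read these fields from the EXIF record with an image library.

```python
from datetime import datetime

# Hypothetical gazetteer: place name -> (lat_min, lat_max, lon_min, lon_max).
# The bounding box is illustrative only, not a surveyed boundary.
PLACES = {"New York City Harbor": (40.55, 40.72, -74.10, -73.98)}

def place_for(lat, lon):
    """Return the first gazetteer entry whose box contains the point, else None."""
    for name, (lat0, lat1, lon0, lon1) in PLACES.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return None

def looks_like_sunset(taken_at, mean_rgb):
    """Crude rule with assumed thresholds: evening capture, red-dominated image."""
    r, g, b = mean_rgb
    return 17 <= taken_at.hour <= 20 and r > g and r > 1.5 * b

def tag(metadata, mean_rgb):
    """Build a tag from already-parsed metadata; None if no place is recognised."""
    if metadata is None:  # the EXIF record may be missing or stripped
        return None
    place = place_for(metadata["lat"], metadata["lon"])
    if place and looks_like_sunset(metadata["taken_at"], mean_rgb):
        return f"Sunset in {place}"
    return place

# Hypothetical parsed metadata and mean colour for one photograph.
example = {"taken_at": datetime(2009, 6, 30, 19, 45), "lat": 40.66, "lon": -74.04}
print(tag(example, mean_rgb=(200, 120, 60)))
print(tag(None, mean_rgb=(200, 120, 60)))
```

The second call shows the limit noted on the slide: once the metadata is gone, the pixel information alone produces no tag in this sketch.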

Security systems based on Human Interaction Proof (HIP)
• HIP (and CAPTCHA) are methods that try to distinguish human users from web-bots.
• Currently they rely on distorted text.
• A more secure system for the future is to ask what is in an image. (Assuming that web-bots cannot do that.)
• But then we need enormous human labor to label images for checking the answers.

Harnessing Human Labor
• Luis von Ahn (a co-inventor of CAPTCHA) observed that people spend a lot of time playing computer games, so he created the ESP game, where people end up labeling images.
• Google licensed the ESP method and created the Google Image Labeler.
• Results of human labeling are “cleaned up” by statistical analysis.
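The slide does not say which statistical analysis is used to clean the labels. As one simple possibility, the sketch below keeps a label for an image only when at least `min_votes` independent players proposed it. The function name, the threshold, and the sample data are hypothetical.

```python
from collections import Counter

def clean_labels(player_labels, min_votes=2):
    """Keep labels proposed by at least `min_votes` distinct players.

    player_labels: one collection of labels per player, all for the same image.
    Returns surviving labels, most frequent first (ties broken alphabetically).
    """
    votes = Counter()
    for labels in player_labels:
        # Normalise case and whitespace; the set gives each player one vote per label.
        votes.update({label.strip().lower() for label in labels})
    kept = [(label, n) for label, n in votes.items() if n >= min_votes]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [label for label, _ in kept]

# Hypothetical labels typed by four players for one harbour photograph.
sessions = [
    ["boat", "sea", "Harbor"],
    ["harbor", "boat", "holiday"],
    ["boat", "church"],
    ["sea", "boat "],
]
cleaned = clean_labels(sessions)
print(cleaned)
```

Labels entered by a single player ("holiday", "church") are dropped, while labels that several players agree on survive. Raising `min_votes` trades coverage for reliability.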

Conclusions
• Automating tagging by image processing techniques seems impossible in the foreseeable future.
• There is a need for more research on methods for direct or indirect human tagging.
