CS 764 Seminar in Computer Vision
Ramin Zabih
Fall 1998
Course mechanics
• Meeting time will be Tue/Thu 11-12, here
• Starting a week from today
• Home page is now up: www/CS764
• Assignment: present one paper
  • You’ll have a lot of freedom, but you need to talk to me in advance
  • Some possible papers will be posted shortly
Topic of this seminar
• The use of “knowledge” in the analysis of visual data
  • Sometimes called “context”
• Clearly this is vital
  • On both psychological and technical grounds
• But how? No one has much of an idea…
• What is the interface between reasoning and perception? (Or, mind and body?)
What is the visual system’s “contract”?
• Two standard (bad) answers
• Answer 1: describe the scene in terms of surfaces [low-level vision]
  • There is a green patch 2” wide, 1’ away
• Answer 2: describe the scene in terms of objects [model-based recognition]
  • Start with a set of 3D models (a modelbase)
  • Determine position and pose
Why are these answers wrong?
• They are almost purely data-driven
  • Bottom-up (from the data) versus top-down (from somewhere else)
• They report “objective fact”, with no room for the task at hand
  • For a given image, there is only one right answer
• Other problems as well
  • Not very useful, etc.
Technical and psychological arguments
• There are technical arguments against this
  • Vision is an inverse problem
  • Many 3D scenes could explain a single 2D image (see the sketch below)
  • On engineering grounds, this makes no sense
  • Ultimately, perception is used for some task
• The human perceptual system has both top-down and bottom-up elements
  • Various optical illusions
  • Two people can look at the same picture and see something completely different
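To make the inverse-problem bullet concrete, here is a minimal Python sketch (an editorial illustration, not from the slides): under pinhole projection, sliding a point away from the camera along its viewing ray leaves the image measurement unchanged, so infinitely many 3D scenes are consistent with one 2D image. The function name `project`, the focal length, and the specific points are all assumed for illustration.

```python
import numpy as np

def project(point_3d, focal_length=1.0):
    """Pinhole projection: (X, Y, Z) -> (f*X/Z, f*Y/Z).
    Depth is divided out, so it cannot be recovered from
    the 2D measurement alone."""
    x, y, z = point_3d
    return np.array([focal_length * x / z, focal_length * y / z])

# Two different scenes: the same point, near and three times farther away.
near_point = np.array([1.0, 2.0, 4.0])
far_point = 3.0 * near_point  # same viewing ray, different depth

# Both produce exactly the same image coordinates, [0.25, 0.5]
print(project(near_point))
print(project(far_point))
```

Recovering depth therefore requires assumptions beyond the image itself, which is exactly where the next slide picks up.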
Low-level vision has its solution
• Inverse problems require assumptions
• The assumptions for low-level vision are extremely general (i.e., weak)
  • They reflect the physics of the visible world
• For example, motion or depth or intensity tend to be “coherent” (see the sketch below)
  • Saying that every pixel is moving differently from its neighbors is a very unlikely answer
  • The world we live in tends not to do that
• Helmholtz’s “unconscious inference”
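To illustrate the coherence assumption at work, here is a minimal sketch (again an editorial illustration, not from the slides) of a smoothness-regularized estimate: the data term alone would simply reproduce the noisy input, while a weak prior that neighboring values are usually similar selects a more plausible answer. The function name `smooth_estimate` and the weight `lam` are hypothetical choices.

```python
import numpy as np

def smooth_estimate(data, lam=5.0):
    """Recover u from noisy data d by minimizing
        sum_i (u_i - d_i)^2 + lam * sum_i (u_{i+1} - u_i)^2.
    The second term is the coherence prior: neighbors should
    usually agree. With lam = 0 it just returns the data."""
    n = len(data)
    D = np.diff(np.eye(n), axis=0)   # (n-1) x n first-difference operator
    A = np.eye(n) + lam * D.T @ D    # normal equations of the quadratic
    return np.linalg.solve(A, data)

# A coherent step signal corrupted by noise:
rng = np.random.default_rng(0)
true_signal = np.concatenate([np.zeros(20), np.ones(20)])
noisy = true_signal + 0.3 * rng.standard_normal(40)
print(np.round(smooth_estimate(noisy), 2))  # near 0 on the left, near 1 on the right
```

This is the flavor of “general but weak” assumption the slide describes: nothing about telephones or berries, just the physics-level expectation that the world varies smoothly most of the time.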
We’ll need high-level vision
• Most of the field is low-level vision or model-based recognition
  • Partly to avoid the confusion CS764 is about
• Key question: how to avoid brittleness?
  • We can make the visual system compute just what we need for our task (e.g., berries)
  • But how do we handle the unexpected (e.g., lions)?
A short historical perspective
• 1960s: vision was completely task-specific
  • A black blob in the center of the image is a telephone
  • These efforts are now considered “hacks”
• 1970s: vision became completely general
  • Marr pushed the field towards precise technical questions
  • Low-level vision and recognition became dominant
Tasks strike back
• In the mid-1980s, several attempts were made to re-introduce a notion of task
  • Active/animate/purposive vision
• These attempts are widely viewed as failures, for good reasons
  • We’ll look at them a bit next week
• It’s not enough to have good intuitions
  • There needs to be technical merit as well
Desiderata
• Technical solutions (algorithms) that are very roughly consistent with human data
  • The goal is not AI, psychology, or philosophy
• Provide visual summaries useful for tasks, but degrade gracefully
• Handle open/unstructured environments
• Deal with expectations and breakdown
Our path for 764
• No good computational work to read
  • Perhaps Vera will fix this?
• We will examine papers along these lines:
  • Computational approaches that failed
  • Psychological data that is highly suggestive
  • Neurologically inspired architectures
  • Cognitive scientists and philosophers
    • Their goal is argument, not algorithm!
    • They’ve thought the most about these issues