Macquarie University July 25, 2002 Visual Indexing and demonstrative reference Zenon Pylyshyn Rutgers Center for Cognitive Science • And maybe a note or two about objecthood and individuality
Plan of talk: Visual Indexes • Theoretical motivations behind the FINST theory • The need for a primitive mechanism of individuation • Individuation must be of distal objects, which leads to the Correspondence Problem: When do two proximal tokens correspond to the same distal object? • A special case: incremental construction of visual representations • Empirical studies of individuation and indexing • Object-specific effects (static & moving objects) • Multiple Object Tracking technique. • What, if any, encoded properties are used to individuate, index, and track objects?
Visual Indexes (FINSTs) and what they mean for vision science and cognitive science • The need for a mechanism that individuates objects • Examples of solving the correspondence problem • Individuating distal objects requires solving the correspondence problem • Object-based allocation of visual attention • A special case of the correspondence problem occurs when visual representations are constructed over time • Multiple Object Tracking and Visual Indexes: what it means for connecting vision and the world
An important function of early vision is to individuate and select token elements (let’s call them “objects” for now) • The most basic perceptual operation is the individuation and selection that precedes the formulation of perceptual judgments. • Individuating is different from discriminating. • Making visual judgments presupposes that the things (objects) that judgments are about have been individuated and selected (or indexed). • Another way to put this is that the arguments of perceptual predicates P(x,y,z,…) must be bound to things in the world in order for the judgment to have perceptual content.
Several objects must be picked out at once in relational judgments • For example, when we judge that certain objects are collinear, we must select (and the visual system must be able to refer to) the relevant individual objects.
Several objects must be picked out at once in relational judgments • The same is true for other relational predicates, like inside or on-the-same-contour… etc. We must pick out the relevant individual objects first.
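The point about relational predicates can be made concrete with a small sketch. It is only an illustration, assuming hypothetical names (`DistalObject`, `collinear`, `bindings`) that are not part of the talk: the predicate cannot be evaluated until each of its arguments is bound to a particular individual in the scene.

```python
from dataclasses import dataclass


@dataclass
class DistalObject:
    """Hypothetical stand-in for an individual thing in the scene."""
    x: float
    y: float


def collinear(a, b, c, tol=1e-9):
    """Relational predicate Collinear(a, b, c) over three selected individuals.

    Uses the cross product of (b - a) and (c - a); it is zero when the
    three points lie on one line.
    """
    cross = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
    return abs(cross) <= tol


scene = [DistalObject(0, 0), DistalObject(1, 1),
         DistalObject(2, 2), DistalObject(3, 0)]

# Binding step: each argument of the predicate is tied to one individual.
# The binding refers to the object itself, not to a description of it.
bindings = {"x": scene[0], "y": scene[1], "z": scene[2]}

on_one_line = collinear(bindings["x"], bindings["y"], bindings["z"])
off_the_line = collinear(scene[0], scene[1], scene[3])
```

The judgment is only as good as the binding: swapping which individual `z` refers to changes the truth value even though the predicate itself is unchanged.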
How do we individuate and select objects in our field of view? • The principal way we select individual objects is by foveating them – by looking directly at them (Notice that this results in a deictic reference). • We can also select with focal attention, which is independent of direction of gaze. • Focal attention appears to be unitary, yet we can select more than one thing at a time (e.g., in making a relational judgment). So it seems that we need to distinguish attending from selecting: That’s where Visual Indexes or FINSTs come in. • A question for later: In virtue of what properties are primitive objects individuated and indexed?
Individuating must pick out objects in the world. This leads to the ubiquitous correspondence problem in vision • Apparent motion, stereo vision, tracking, and very many visual computations face the problem of identifying which proximal features of an image correspond to the same individual distal object. • Less well known is the correspondence problem faced when a single visual representation is constructed incrementally over time. • The way the correspondence problem is solved determines what the vision module counts as an individual “object”. These primitive objects are thus mind-dependent. “FINSTs define FINGs”
Example of the correspondence problem for apparent motion The gray disks correspond to the first flash and the black ones to the second flash. Which of the 24 possible matches will the visual system select as the solution to this correspondence problem? What principle does it use? (Dawson & Pylyshyn, 1988) [Figure panels: Curved matches; Linear matches]
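To see where a count of 24 could come from, the sketch below enumerates every one-to-one matching between two flashes. It assumes four disks per flash (4! = 24), which is an inference from the number on the slide, and the coordinates are invented. Ranking matchings by total displacement is just one illustrative candidate principle, not the constraint set argued for by Dawson & Pylyshyn.

```python
from itertools import permutations
import math

# Hypothetical disk positions: the second flash is the first shifted right by 1.
first_flash = [(0, 0), (0, 2), (2, 0), (2, 2)]
second_flash = [(1, 0), (1, 2), (3, 0), (3, 2)]


def total_displacement(first, second, matching):
    """Sum of distances when first[i] is matched to second[matching[i]]."""
    return sum(math.dist(first[i], second[j]) for i, j in enumerate(matching))


# Every one-to-one assignment of first-flash disks to second-flash disks.
matchings = list(permutations(range(len(second_flash))))

# One candidate selection rule: prefer the matching with the least total motion.
best = min(matchings,
           key=lambda m: total_displacement(first_flash, second_flash, m))
```

With these made-up coordinates the rule picks the matching in which every disk moves one unit to the right; other displays can make a minimal-displacement rule and a smooth-motion rule disagree, which is what makes the choice of principle an empirical question.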
One of the most troubling forms of the correspondence problem occurs because visual representations are constructed incrementally over time It is clear that when vision requires eye movements, a visual representation is constructed incrementally. But there is also evidence that percepts are built up over time even for the automatic perception of simple forms. So this type of correspondence problem is routine in vision. Why does it constitute a special problem?
The correspondence problem for incremental construction of a visual representation • When a property F of some particular individual (token) object O is noticed or encoded, the visual system must check whether object O is already represented. If it is, the new property must be associated with the existing representation of O. • If the only way to identify a particular individual object O is by its description, then the way to solve this correspondence problem is to find an object in memory that bears a particular description. Which description? If objects can change their properties, we don’t know under what description the object was last stored. Perhaps we look for an object with a description that overlaps the present one, or which shares some essential property with the present one and assume that it is the same object as the current one.
The correspondence problem for incremental construction of a visual representation • Even if it were feasible to use this method, it would in general be computationally intractable, and it is not what our visual system does. We do not find it more difficult to construct a representation of a scene that has many identical parts, as would be predicted from this technique (since it would then be more difficult to find a unique descriptor for each object and the correspondence problem would quickly expand).
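The difficulty with description-based lookup can be shown in a toy sketch. All names here (`find_by_description`, `object_files`, the index tokens) are hypothetical: lookup by description becomes ambiguous when parts are identical and fails when a property has changed, whereas a reference held independently of any description does not have either problem.

```python
def find_by_description(memory, description):
    """Return every stored record that matches all the given properties."""
    return [rec for rec in memory
            if all(rec.get(key) == value for key, value in description.items())]


# Case 1: a scene with identical parts. The description picks out both records.
memory = [{"color": "red", "shape": "square"},
          {"color": "red", "shape": "square"}]
ambiguous = find_by_description(memory, {"color": "red", "shape": "square"})

# Case 2: the object changed color since it was stored. Nothing matches.
stale = find_by_description([{"color": "red", "shape": "square"}],
                            {"color": "green", "shape": "square"})

# Alternative: each record is reached through an index token, so a newly
# noticed property is attached without consulting any stored description.
object_files = {"index_1": {"color": "red"}, "index_2": {"color": "red"}}


def attach_by_index(files, index, prop, value):
    """Add a newly encoded property to the record the index refers to."""
    files[index][prop] = value


attach_by_index(object_files, "index_2", "shape", "square")
```

Note that the index-based version does no search at all, so its cost does not grow with the number of identical items in the scene.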
In virtue of what properties of objects are they individuated? • The most plausible property used in selecting and accessing an object is its location (and when objects are visually identical this may be the only unique property available). • It is widely believed that we access an object’s properties by first retrieving its location. • This assumption is made by every theory of pattern detection and visual search. And there is ample evidence for the priority of location information. • But there is also evidence that we can access an individual object solely by virtue of its spatio-temporal continuity or persistence qua individual. This is referred to as object-based attention.
But…. • Although there is a great deal of evidence for the priority of encoding location, this does not show that properties must be accessed by their location. • In studies in which objects remain stationary, location is confounded with individuality: being at a particular location is then coextensive with being a particular individual. • There are at least two possible ways to unconfound location and individuality: • use moving objects • use objects whose identity and/or ‘motion’ is independent of their spatial location.
Distinguishing access-by-location and access-by-individual (also known as object-based attention) 1. Moving objects • Object-specific priming (Object Files) • Object-specific Inhibition of Return * • Simultanagnosia & Visual Neglect * • Multiple Object Tracking (MOT) 2. Spatially coincident objects • Single-object advantage * • Tracking in “feature space” * (* Some of these may be omitted for lack of time)
Moving object studies… Object-specific Priming (aka object-file theory: Kahneman et al. 1992) Sequence of displays in a simple Object-Priming experiment
Demonstration of the Object File display: (Kahneman, Treisman & Gibbs, 1992) Positive Example Negative Example
Multiple Object Tracking Experiments How do we do it? What properties of individual objects do we use?
People can track 5 or more objects under a wide variety of conditions Objects don’t even have to avoid collisions!
Objects can even disappear from view, as long as they do it in the right way There must be local evidence of an occluding surface.
A possible location-updating tracking algorithm 1. While the targets are visually distinct, scan attention to each target and encode its location on a list. When targets begin to move: 2. For n = 1 to 4: check the n’th position in the list and retrieve the location Loc(n) listed there. 3. Go to location Loc(n). Find the closest element to Loc(n). 4. Update the n’th position on the list with the actual location of the element found in #3. This becomes the new value of Loc(n). 5. Move attention to the location encoded in the next list position, Loc(n+1). 6. Repeat from #2 until elements stop moving. 7. Go to each Loc(n) in turn and report elements located there. • Testing of the above algorithm assumes (1) focal attention is required to encode locations (i.e., encoding is not parallel), (2) focal attention is unitary and has to be scanned from location to location. But it assumes no encoding (or dwell) time at each element.
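The algorithm above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the model actually tested: the function names are hypothetical, attention is assumed to visit exactly one list position per display frame (a crude stand-in for finite scanning speed), and motion is linear.

```python
import math


def nearest(point, elements):
    """Index of the element closest to `point`."""
    return min(range(len(elements)),
               key=lambda i: math.dist(point, elements[i]))


def linear_frames(starts, velocities, n_frames):
    """Hypothetical helper: positions of every element on each frame."""
    return [[(x + vx * t, y + vy * t)
             for (x, y), (vx, vy) in zip(starts, velocities)]
            for t in range(n_frames)]


def track_by_location(frames, target_ids):
    """Serial location-updating tracker.

    frames[t][i] is the position of element i on frame t. Element identity
    is known to the simulation only; the tracker sees bare positions.
    """
    # Encode each target's initial location on a list.
    loc = [frames[0][i] for i in target_ids]
    n = 0
    for frame in frames[1:]:
        found = nearest(loc[n], frame)   # go to Loc(n), find closest element
        loc[n] = frame[found]            # store that element's actual location
        n = (n + 1) % len(loc)           # move on to the next list position
    # Report whichever elements now sit nearest the stored locations.
    return [nearest(p, frames[-1]) for p in loc]


# Slow motion: both targets (elements 0 and 1) are still being followed.
slow = linear_frames([(0, 0), (10, 0), (5, 5)],
                     [(0.1, 0), (0.1, 0), (0, 0)], 10)
slow_report = track_by_location(slow, [0, 1])

# Fast motion: nontarget 2 reaches target 1's old location before attention
# returns to it, so the stored location is handed to the wrong element.
fast = linear_frames([(0, 20), (0, 0), (6, 0.5)],
                     [(0, 0), (3, 0), (-3, 0)], 3)
fast_report = track_by_location(fast, [0, 1])
```

The fast case shows why predicted performance for this scheme depends on scanning speed: the longer attention takes to return to a list position, the more likely a nontarget is the closest element when it does.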
Predicted performance of the location updating algorithm as a function of attention scanning speed
What properties are used in (a) selecting objects, and (b) tracking objects? Notice that these are different operations and need not involve the same properties
Role of object properties What properties can be used to select (index) an object in MOT? • We have evidence that selecting objects can be done either automatically or voluntarily, but only under certain conditions: • Automatic selection requires “popout” features (sudden appearance, motion, stereo depth, etc) • Voluntary selection can use any discriminable property, but the objects must be attended serially and the property must be available long enough for this to occur (Annan study)
Role of object properties (continued) What properties can be used to track indexed objects? • We have (suggestive) evidence that observers do not encode or use intrinsic object properties (e.g., color, shape) during tracking: • When we stop and ask, observers cannot tell us what properties objects had and they do not notice when properties like color/shape change; • There is some evidence that tracking occurs (at least of small numbers of objects) even if it is not task-relevant (e.g., Kahneman & Treisman’s Object Files); • We have some evidence that when objects differ in non-identifying (asynchronously changing) properties, they cannot be tracked any better than if they do not differ in these properties.
Role of object properties (continued) Do observers use some version of object locations for tracking? • Perhaps instead of using the location-updating method to track, observers use objects’ “spatiotemporal trace” property (e.g., “space-time worms”). • But the notion of spatiotemporal trace presupposes that it is the trace of a single individual object, and not a sequence of time-slices of different objects. Therefore it assumes that the individual object has been selected and tracked. So responding to a spatiotemporal trace may be the same as tracking an object’s identity.
Observers can track non-spatial ‘virtual objects’ that move through ‘property space’: Tracking superimposed surfaces Two superimposed Gabor patches that vary in spatial frequency, color and angle Blaser, Pylyshyn & Holcombe (2000)
Snapshots taken every 250 ms Such generalized ‘objects’ can be tracked individually, and they also show single-object superiority for change detection.
Finally: Some speculations about what vision needs and what the vision module may provide (1) 1. We need a mechanism that puts us in causal contact with distal objects in a visual scene – a contact that does not depend on the object satisfying a certain description, but on a brute causal connection. • We have seen many reasons for needing this, but I have not mentioned that such a mechanism is essential for connecting vision and action!
Speculations on what vision needs and what the visual module may provide (2) 2. We need a mechanism that keeps track of the identity of distal objects without using their encoded properties. Such a mechanism realizes a rudimentary identity-tracker, with its own ‘rules’. 3. This is not a general identity-maintenance process; it will not allow you to recognize the identity of a person in a picture and a person on the street. But it may provide a way to maintain same-objecthood within the modular early vision system. There is also this tantalizing fact … • There is evidence for such a mechanism in babies as young as 4 months (Leslie, Spelke)!
Relation to work on infants’ sensitivity to the cardinality of sets of objects Alan Leslie’s “Object Indexes” Infants as young as 4 months of age show surprise (longer looking time) when they watch two things being placed behind a screen and when the screen is lifted it reveals only one thing. Below 10 months of age they are in general not surprised when the screen is lifted to reveal two things that are different from the ones they saw being placed behind the screen, so long as their numerosity is correct. In some cases, infants (age 12 months) use the difference in color of the objects they are shown one-at-a-time to infer their numerosity, but they do not record the colors and use them to identify the objects that are revealed when the screen is lifted.
Leslie & Tremoulet: Infants aged 10 and 12 months are shown a red and then a green object that are then hidden behind a screen. The 10 month old is surprised if raising the screen reveals the wrong number of objects, not if it reveals the wrong color of objects. Color is used to individuate objects, but not to keep track of them! At 12 months children can use color to keep track of what went behind the screen.
Forms of representation for a robot: using indexicals Pylyshyn, Z. W. (2000). Situating vision in the world. Trends in Cognitive Sciences, 4(5), 197-207
Indexes play a role very similar to that of demonstratives. Are demonstratives essential for characterizing beliefs and for explaining the connection between beliefs and actions? Here is an example due to John Perry*: “The author of the book Hiker’s Guide to the Desolation Wilderness stands in the wilderness beside Gilmore Lake, looking at the Mt. Tallac trail as it leaves the lake and climbs the mountain. He desires to leave the wilderness. He believes that the best way out from Gilmore Lake is to follow the Mt. Tallac trail up the mountain … But he doesn’t move. He is lost. He is not sure whether he is standing beside Gilmore Lake, looking at Mt. Tallac, or beside Clyde Lake, looking at the Maggie peaks. Then he begins to move along the Mt. Tallac trail. If asked, he would have to explain the crucial change in his beliefs in this way: ‘I came to believe that this is the Mt. Tallac trail and that is Gilmore Lake’.” * Perry, J. The problem of the essential indexical. In Themes from Kaplan (eds. Almog, J., Perry, J. & Wettstein, H.) (Oxford University Press, New York, 1989).
Perry’s example is intended to show that in order to understand and explain the action of the lost author it is essential to use demonstratives such as this and that in expressing the author’s beliefs. A unique description of the Mt. Tallac trail might help bring the person to the right belief, but the problem of connecting the belief to an action would remain unsolved until the person had a deictic or demonstrative thought such as: “That is the Mt. Tallac trail.” or perhaps, “The trail I am now looking at is the Mt. Tallac trail”
Selected references related to this talk • Annan, V., & Pylyshyn, Z. W. (2002). Can indexes be voluntarily assigned in multiple object tracking? Paper presented at Vision Sciences 2002, Sarasota, FL. • Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20(4), 723-767. • Blaser, E., Pylyshyn, Z. W., & Holcombe, A. O. (2000). Tracking an object through feature-space. Nature, 408(6809), 196-199. • Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258. • Dawson, M., & Pylyshyn, Z. W. (1988). Natural constraints in apparent motion. In Z. W. Pylyshyn (Ed.), Computational Processes in Human Vision: An interdisciplinary perspective (pp. 99-120). Stamford, CT: Ablex Publishing. • Intriligator, J., & Cavanagh, P. (2001). The spatial resolution of attention. Cognitive Psychology, 43(3), 171-216. • Leslie, A. M., Xu, F., Tremoulet, P. D., & Scholl, B. J. (1998). Indexing and the object concept: Developing `what' and `where' systems. Trends in Cognitive Sciences, 2(1), 10-18. • Nissen, M. J. (1985). Accessing features and objects: Is location special? In M. I. Posner & O. S. Marin (Eds.), Attention and performance XI (pp. 205-219). Hillsdale, NJ: Lawrence Erlbaum. • Pylyshyn, Z. W. (1989). The role of location indexes in spatial perception: A sketch of the FINST spatial-index model. Cognition, 32, 65-97. • Pylyshyn, Z. W. (1994). Some primitive mechanisms of spatial attention. Cognition, 50, 363-384. • Pylyshyn, Z. W. (2000). Situating vision in the world. Trends in Cognitive Sciences, 4(5), 197-207. • Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and situated vision. Cognition, 80(1/2), 127-158. • Pylyshyn, Z. W. (submitted). Tracking without keeping track: some puzzling findings concerning multiple object tracking. • Pylyshyn, Z. W., Burkell, J., Fisher, B., Sears, C., Schmidt, W., & Trick, L. (1994). Multiple parallel access in visual attention. Canadian Journal of Experimental Psychology, 48(2), 260-283. • Pylyshyn, Z. W., & Storm, R. W. (1988). Tracking multiple independent targets: evidence for a parallel tracking mechanism. Spatial Vision, 3(3), 179-197. • Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290. • Scholl, B. J., Pylyshyn, Z. W., & Feldman, J. (2001). What is a visual object: Evidence from target-merging in multiple-object tracking. Cognition, 80, 159-177. • Scholl, B. J., Pylyshyn, Z. W., & Franconeri, S. L. (submitted). The relationship between property-encoding and object-based attention: Evidence from multiple-object tracking. • Sears, C. R., & Pylyshyn, Z. W. (2000). Multiple object tracking and attentional processes. Canadian Journal of Experimental Psychology, 54(1), 1-14. • Tipper, S., Driver, J., & Weaver, B. (1991). Object-centered inhibition of return of visual attention. Quarterly Journal of Experimental Psychology, 43A, 289-298.
The whole truth about multiple object tracking And many more demos ….
Some other findings concerning object tracking (1) • Detection of events on targets is better than on nontargets, but this does not generalize to locations between targets; • Objects can continue to be tracked when they disappear completely behind occluders, as long as the mode of disappearance is compatible with there being an occluding surface; • Objects can all disappear from view for as long as 330 ms without impairing tracking; • When objects disappear behind an occluder and come out a different color or shape, the change is unnoticed;
Some other findings concerning object tracking (2) • Not all distinct feature clusters can be tracked; some, like the endpoints of a line, cannot; • People can track items that automatically attract attention, or they can decide which items to track; but in the latter case it appears that they may have to visit each object serially; • Successful tracking of an object entails keeping track of it as a particular individual, yet people are poor at keeping track of which successfully tracked (initially numbered) item is which. This may be because: • When observers make errors, they are more likely to switch the identity of a target with that of another target than the identity of a target with that of a nontarget.
How do we do it? What properties of individual objects do we use? • MOT with occlusion • MOT with Virtual Occluders • MOT with implosion/explosion • MOT of the endpoints of a line • MOT squares with rubber band connections • MOT with IDs (which is which?) • Track non-flashed (3 blinks) • Track non-flashed (one flash)