Robust Object Recognition with Cortex-Like Mechanisms. Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio, Member, IEEE. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 29, NO. 3, MARCH 2007.
Thomas Serre received his PhD degree in neuroscience from MIT in 2005. His main research focuses on object recognition in both brains and machines. Tomaso Poggio is the Eugene McDermott Professor in the Brain Sciences and Human Behavior.
Outline • Introduction • Related work • The Standard Model of Visual Cortex • Feature selection • Detailed implementation • Empirical evaluation • Object Recognition in Clutter • Object Recognition without Clutter • Object Recognition of Texture-Based Objects • Toward a Full System for Scene Understanding
Introduction • We present a system that is based on a quantitative theory of the ventral stream of visual cortex. • A key element in the approach is a new set of scale- and position-tolerant feature detectors, which agree quantitatively with the tuning properties of cells along the ventral stream of visual cortex.
Related work: The Standard Model of Visual Cortex • Object recognition in cortex is thought to be mediated by the ventral visual pathway. • Neural pathway: retina => lateral geniculate nucleus (LGN) of the thalamus => primary visual cortex (V1) => extrastriate visual areas V2 => V4 => IT => prefrontal cortex (PFC), which links perception to memory and action.
Related work: The Standard Model of Visual Cortex • Our system follows a recent theory of the feedforward path of object recognition in cortex that accounts for the first 100-200 milliseconds of processing.
Related work: The Standard Model of Visual Cortex • The theory rests on a core of well-accepted facts about the ventral stream in the visual cortex: • Visual processing is hierarchical, aiming to build invariance to position and scale first and then to viewpoint and other transformations. • Along the hierarchy, the size of the neurons' receptive fields (i.e., the part of the visual field that could potentially elicit a response from the neuron) as well as the complexity of their optimal stimuli (i.e., the set of stimuli that elicit a response from the neuron) increase.
Related work: The Standard Model of Visual Cortex • The initial processing of information is feedforward (for immediate recognition tasks, i.e., when the image presentation is rapid and there is no time for eye movements or shifts of attention). • Plasticity and learning probably occur at all stages, and certainly at the level of inferotemporal (IT) cortex and prefrontal cortex (PFC), the top-most layers of the hierarchy.
Related work: Feature selection • Features face a trade-off between selectivity and invariance. • Appearance-based patches of an image: • are very selective for a target shape, • but lack invariance with respect to object transformations.
Related work: Feature selection • Histogram-based descriptors: • are very robust with respect to object transformations. • The most popular such features are SIFT features. • SIFT excels in the redetection of a previously seen object under new image transformations. • However, it is very unlikely that these features could perform well on a generic object recognition task. • The new appearance-based feature descriptors described here exhibit a balanced trade-off between invariance and selectivity.
Detailed implementation • Along the hierarchy, from V1 to IT, two functional stages are interleaved: • Simple (S) units build an increasingly complex and specific representation by combining the responses of several subunits with different selectivities using a TUNING operation. • Complex (C) units build an increasingly invariant representation (to position and scale) by combining the responses of several subunits with the same selectivity but at slightly different positions and scales using a MAX-like operation.
Detailed implementation • By interleaving these two operations, an increasingly complex and invariant representation is built. • Two routes: • Main route • follows the hierarchy of cortical stages strictly. • Bypass route • skips some of the stages. • Bypass routes may help provide a richer vocabulary of shape-tuned units with different levels of complexity and invariance.
Detailed implementation • S1 units: • Correspond to the classical simple cells of Hubel and Wiesel found in the primary visual cortex (V1). • S1 units take the form of Gabor functions. • Filter parameters: the aspect ratio (γ), the orientation (θ), the effective width (σ), and the wavelength (λ).
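The Gabor form of the S1 units can be illustrated with a small sketch. It assumes the standard Gabor function, G(x, y) = exp(-(x0^2 + γ^2 y0^2) / (2σ^2)) · cos(2π x0 / λ), with rotated coordinates x0 = x cos θ + y sin θ and y0 = -x sin θ + y cos θ. The helper name `gabor_filter` and the example parameter values are assumptions made for illustration; they are not taken from the slides. Plain Python lists are used to keep the sketch dependency-free.

```python
import math

def gabor_filter(size, wavelength, sigma, theta, gamma):
    """Return a size x size Gabor filter as a list of rows (hypothetical helper).

    size is assumed to be odd so that the filter has a central pixel.
    """
    half = size // 2
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the preferred orientation theta.
            x0 = x * math.cos(theta) + y * math.sin(theta)
            y0 = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope, elongated according to the aspect ratio gamma.
            envelope = math.exp(-(x0 ** 2 + (gamma * y0) ** 2) / (2 * sigma ** 2))
            # Cosine carrier with the given wavelength.
            carrier = math.cos(2 * math.pi * x0 / wavelength)
            row.append(envelope * carrier)
        rows.append(row)
    return rows

# Illustrative values only (assumed, not taken from the slides).
g = gabor_filter(size=7, wavelength=3.5, sigma=2.8, theta=0.0, gamma=0.3)
```

A bank of S1 filters would then be obtained by calling the helper once per orientation and filter size.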
Detailed implementation • 136 different types of S1 units: 2 phases x 4 orientations x 17 sizes (17 spatial frequencies, i.e., scales). • Each portion of the visual field is analyzed by a full set of unit types.
Detailed implementation • S1 units perform a TUNING operation between the incoming pattern of input x and their weight vector w. • The response of an S1 unit is maximal when x matches w exactly.
Each portion of the visual field is analyzed by a macro-column, which contains all types of mini-columns. Each mini-column contains a set of units that all share the same selectivity.
Detailed implementation • C1 units: • Correspond to cortical complex cells, which show some tolerance to shift and size. • Each complex C1 unit receives the outputs of a group of simple S1 units from the first layer with the same preferred orientation but at slightly different positions and sizes. • The operation by which the S1 unit responses are combined at the C1 level is a nonlinear MAX-like operation.
Detailed implementation • This process is done for each of the four orientations and each scale band independently.
Detailed implementation • For instance, the first band (S = 1) contains two S1 maps: those obtained using filters of size 7x7 and 9x9. • For each orientation, the C1 unit responses are computed by subsampling these maps using a grid cell of size Ns x Ns = 8x8. • From each grid cell, one single measurement is obtained by taking the maximum of all 64 elements. • As a last stage, we take a max over the two scales from within the same spatial neighborhood.
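The two pooling steps for one band and one orientation can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the two S1 maps have the same dimensions and uses non-overlapping 8x8 grid cells. The names `local_max_pool` and `c1_band` are hypothetical.

```python
def local_max_pool(response_map, cell):
    """Take the max over non-overlapping cell x cell blocks of a 2D list."""
    height, width = len(response_map), len(response_map[0])
    pooled = []
    for top in range(0, height - cell + 1, cell):
        row = []
        for left in range(0, width - cell + 1, cell):
            block = [response_map[r][c]
                     for r in range(top, top + cell)
                     for c in range(left, left + cell)]
            row.append(max(block))  # one measurement per grid cell
        pooled.append(row)
    return pooled

def c1_band(map_a, map_b, cell=8):
    """Combine two same-orientation S1 maps of one band into a C1 map."""
    pooled_a = local_max_pool(map_a, cell)
    pooled_b = local_max_pool(map_b, cell)
    # Final step: max over the two scales within the same spatial neighborhood.
    return [[max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(pooled_a, pooled_b)]

# Toy 16x16 maps standing in for the 7x7 and 9x9 filter responses.
s1_small = [[r * 16 + c for c in range(16)] for r in range(16)]
s1_large = [[0] * 16 for _ in range(16)]
s1_large[0][0] = 1000
c1 = c1_band(s1_small, s1_large)
```

With these toy inputs each 16x16 map is reduced to a 2x2 C1 map, and the single large value in the second map wins the top-left cell.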
Detailed implementation • S2 units: • A TUNING operation is taken over C1 units at different preferred orientations to increase the complexity of the optimal stimulus. • S2-level units become selective to more complex patterns, such as the combination of oriented bars to form contours or boundary conformations.
Detailed implementation • Each S2 unit's response depends in a Gaussian-like way on the Euclidean distance between a new input and a stored prototype: r = exp(-β ||X - Pi||^2), where β sets the sharpness of the tuning. • X is a patch from the previous C1 layer at a particular scale S. • Pi is one of the N features learned during training.
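A minimal sketch of this Gaussian tuning, assuming that the C1 patch and the prototype have been flattened into equal-length vectors and that the response takes the radial-basis form exp(-β · squared distance). The function name `s2_response` and the default `beta=1.0` are assumptions for illustration only.

```python
import math

def s2_response(patch, prototype, beta=1.0):
    """Gaussian-like tuning of a flattened C1 patch to a stored prototype."""
    if len(patch) != len(prototype):
        raise ValueError("patch and prototype must have the same length")
    # Squared Euclidean distance between the input patch and the prototype.
    sq_dist = sum((x - p) ** 2 for x, p in zip(patch, prototype))
    # The response is 1.0 for a perfect match and decays with distance.
    return math.exp(-beta * sq_dist)

perfect = s2_response([0.2, 0.5, 0.1], [0.2, 0.5, 0.1])
weaker = s2_response([0.0, 0.0], [1.0, 0.0])
```

A larger `beta` makes the unit more sharply tuned, so the response falls off faster as the input departs from the prototype.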
Detailed implementation • C2 units: • Our final set of shift- and scale-invariant C2 responses is computed by taking a global maximum over all scales and positions for each S2 type over the entire S2 lattice. • That is, each C2 unit pools over S2 units that are tuned to the same preferred stimulus but at different positions and scales.
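The global maximum can be sketched as below. The input layout is an assumption made for illustration: for each prototype, a list of 2D S2 maps (one per scale band). The name `c2_features` is hypothetical.

```python
def c2_features(s2_maps_by_prototype):
    """Return one C2 value per prototype: the best S2 match anywhere.

    s2_maps_by_prototype[i] is assumed to hold the S2 maps (2D lists)
    for prototype i, one map per scale band.
    """
    features = []
    for maps_over_scales in s2_maps_by_prototype:
        best = max(value
                   for s2_map in maps_over_scales
                   for row in s2_map
                   for value in row)
        features.append(best)
    return features

# Two prototypes, each with S2 maps of different sizes at two scales.
s2 = [
    [[[0.1, 0.4], [0.3, 0.2]], [[0.9]]],
    [[[0.5, 0.6], [0.7, 0.1]], [[0.2]]],
]
c2 = c2_features(s2)
```

The output has one entry per prototype regardless of how large the S2 maps are, which is why the feature vector length does not depend on the input image size.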
Detailed implementation • The learning stage • Corresponds to selecting a set of N prototypes Pi for the S2 units. • The classification stage • The C1 and C2 standard model features (SMFs) are then extracted and passed on to a simple linear classifier.
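One way the learning stage could be sketched is shown below. It assumes that prototypes are drawn as random patches from C1 maps of training images. This is an assumption: the slides do not spell out the selection procedure. The sketch also handles a single orientation only, and the name `sample_prototypes` is hypothetical.

```python
import random

def sample_prototypes(c1_maps, n_prototypes, patch_size, seed=0):
    """Draw flattened patch_size x patch_size patches at random positions."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    prototypes = []
    for _ in range(n_prototypes):
        c1 = rng.choice(c1_maps)  # pick a random training C1 map
        height, width = len(c1), len(c1[0])
        top = rng.randrange(height - patch_size + 1)
        left = rng.randrange(width - patch_size + 1)
        patch = [c1[r][c]
                 for r in range(top, top + patch_size)
                 for c in range(left, left + patch_size)]
        prototypes.append(patch)
    return prototypes

# Toy 5x5 C1 map whose values encode their own position.
toy_c1 = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
protos = sample_prototypes([toy_c1], n_prototypes=3, patch_size=2)
```

Each returned patch can then serve as a stored prototype Pi for an S2 unit.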
Empirical evaluation • Object Recognition in Clutter • Object Recognition without Clutter • Object Recognition of Texture-Based Objects • Toward a Full System for Scene Understanding
Empirical evaluation: Object Recognition in Clutter • “In clutter” refers to the weakly supervised setting: the target object in both the training and test sets appears at variable scales and positions within an unsegmented image. • The task is a simple object present/absent recognition task. • The number of C2 features depends only on the number of patches extracted during training and is independent of the size of the input image.
Empirical evaluation: Object Recognition without Clutter • Windowing approach: • Classify the target object in each fixed-size image window extracted from an input image at various scales and positions. • Within a window, the object shows limited variability in scale and position.
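The windowing approach can be sketched as a generator of window positions over several image scales. The stride, the scale factors, and the function names are assumptions chosen for illustration. Rescaling the image itself and classifying each window are left out.

```python
def sliding_windows(height, width, window, stride):
    """Yield (top, left) corners of fixed-size square windows."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left

def windows_over_scales(height, width, window, stride, scales):
    """Yield (scale, top, left) for windows on each rescaled image size."""
    for scale in scales:
        scaled_h, scaled_w = int(height * scale), int(width * scale)
        for top, left in sliding_windows(scaled_h, scaled_w, window, stride):
            yield scale, top, left

# Toy 4x4 image, 2x2 windows, stride 2, at full and half resolution.
candidates = list(windows_over_scales(4, 4, window=2, stride=2,
                                      scales=[1.0, 0.5]))
```

Each candidate would be cropped from the rescaled image and passed to the classifier, so the detected object's location and scale are known from the window that fired.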
Empirical evaluation: Object Recognition without Clutter • Top row: sample StreetScenes examples. • Middle row: true hand-labeling. • Bottom row: results obtained with a system trained on examples like those in the second row.
Empirical evaluation: Object Recognition without Clutter • Training the SMFs-based systems • We trained on the classes car, pedestrian, and bicycle. • Examples are resized to 128x128 pixels and converted to gray level.
Empirical evaluation: Object Recognition of Texture-Based Objects • Performance is measured by considering each pixel, rather than each instance of an object. • We consider four texture-based objects: buildings, trees, roads, and skies.
Empirical evaluation: Object Recognition of Texture-Based Objects • Training the SMFs-based systems • We avoid errors due to overlap and loose polygonal labeling in the StreetScenes database by removing pixels with either multiple labels or no label. • Training samples were never drawn from within 15 pixels of any object’s border.
Empirical evaluation: Toward a Full System for Scene Understanding • The objects to be detected are divided into two distinct categories: texture-based objects and shape-based objects.
Empirical evaluation: Toward a Full System for Scene Understanding • Shape-Based Object Detection in StreetScenes • Shape-based objects are those objects for which there exists a strong part-to-part correspondence between examples. • A standard windowing technique is used in order to keep track of the locations of objects.
Empirical evaluation: Toward a Full System for Scene Understanding • Pixel-Wise Detection of Texture-Based Objects • These objects (buildings, roads, trees, and skies) are better described by their texture than by the geometric structure of reliably detectable parts. • Applying the texture classifiers to each pixel within the image, one obtains a detection confidence map of the same size as the original image.