Article summary of Hierarchical models of object recognition in cortex by Riesenhuber & Poggio - Chapter


Recognition of visual objects

The recognition of visual objects is fundamental. Research often takes place with a repeated cognitive task with two essential requirements: invariance and specificity. Cells from the inferotemporal cortex (IT, the highest visual area in the ventral visual pathway) appear to play a key role in object recognition. The cells respond to what one sees with complex objects such as faces. Certain neurons respond specifically to certain faces and not to other faces. The question remains: how can they respond to different faces while the stimulus offer is practically the same in the retina?

This is also reflected in the striate cortex in cats. Both simple and complex cells respond to a presented bar. For example, it appears that the small simple cells have narrow receptive fields that are strongly position-dependent and that the complex cells have large receptive fields and are not position-dependent. Hubel and Wiesel have made a model in which the simple cells respond as if they are neighbor cells. Where cells that sit next to each other also see the world next to each other. So you often see a group of cells firing together. A direct follow-up of this model leads to a higher-order-complex cells scheme.

Cells in the V4 can control their attention and they can respond to an adaptation in their receptive field. There is little evidence that this mechanism is used to translate invariant object recognition. Invariance of each transformation can be built up by converting afferent cells with different variations of the same stimulus. Evidence has now been found that groups of cells that respond to whole or partial vision are learned through a learning process. The vision invariance problem can then be presented by a small number of neurons. This idea gives us two problems.

Problem 1

In monkeys, it is that learning (for them) unknown stimuli (such as faces) is possible because they learn a part of the invariant via just one view of the object. If this object is presented with a lot of distractor objects around it, it can be learned in combination with these objects. The cells thus become invariant at other positions.

Problem 2

The model does indicate how view tuned units (VTU, groups that fire at a specific object) are built, but not how they arise.

Results

The model is based on a simple hierarchical feedforward architecture. It is assumed that the structure reflects the invariance and that characteristic specificity must be built up from different mechanisms. The pooling mechanism should provide robust feature detectors. This means that it must allow detection on specific characteristics without getting confused by clutter or context in the receptive field.

There are two alternatives to a pooling mechanism.

Linear addition = SUM.

Equal weights are hereby weighed. Responses to a complex cell are invariant as long as the stimulus remains in the receptive field of the cell. However, there is no response as to whether there actually is a bar in the receptive field. The output signal is the sum of the afferent cells and so there is no characteristic specificity.

Non-linear maximum operation = MAX.

The strongest afferent cell determines the postsynaptic response. With MAX, the response is determined by determining the most active afferent cell and this signal is seen as the best match for a portion of the stimulus. This makes MAX respond better.

In both cases, the response of a complex cell is invariant to the bar on the receptive field. A non-linear MAX function is a good way that correctly describes the pool when invariant. This includes implicit scanning of afferent cells of the same type. The strongest is then selected from the cells that respond and this is the most consistent with the invariance. Pooling combinations of afferent cells provides a mixed signal caused by different stimuli.

MAX systems are comparable in some respects to neurophysiological data. For example, if two stimuli are offered in the receptive field of an IT neuron, then the neuron's response is dominated by the stimulus that receives the most responses separately. This corresponds to how the MAX model predicts when it comes to afferent neurons. A number of studies provide support for the MAX model. These studies often find a high non-linear tuning of IT cells. This corresponds to the MAX response function. A linear model cannot make such strong changes with a small change in input.

In some cases, clutter can cause the value to change from the MAX function. The quality of the match in the final phase has then changed, so that the power of the VTU response is also different. A solution for this is to add more specific characteristics. Simulations have shown that this model is able to recognize objects in a context.

The MAX model can be used well to describe brain processes. MAX responses are probably from cortical microcircuits in lateral inhibition between neurons in the cortical layer. In addition, the MAX response is important for object recognition.

Join World Supporter
Join World Supporter
Log in or create your free account

Why create an account?

  • Your WorldSupporter account gives you access to all functionalities of the platform
  • Once you are logged in, you can:
    • Save pages to your favorites
    • Give feedback or share contributions
    • participate in discussions
    • share your own contributions through the 7 WorldSupporter tools
Follow the author: Vintage Supporter
Comments, Compliments & Kudos

Add new contribution

CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Image CAPTCHA
Enter the characters shown in the image.