Testing the limits of artificial visual recognition systems


James DiCarlo and his team help define goals for primate-like object recognition by artificial neural networks.

While it can sometimes be hard to see the forest for the trees, pat yourself on the back: as a human you are actually pretty good at object recognition. If you see a tree or a bush from almost any angle, in any degree of shading (or even rendered in pastels and pixels in a Monet), you would recognize it as a tree or a bush. A major goal for artificial visual recognition systems is to distinguish objects the way humans do, but such recognition has traditionally been a challenge for them. Researchers at MIT’s McGovern Institute for Brain Research and Department of Brain and Cognitive Sciences (BCS) have now examined this directly, showing that artificial object recognition is quickly becoming more primate-like, but still lags behind when scrutinized at higher resolution.

In recent years, dramatic advances in “deep learning” have produced artificial neural network models that appear remarkably similar to aspects of primate brains. James DiCarlo, Peter de Florez Professor and Department Head of BCS, set out to carefully quantify how well the current leading artificial visual recognition systems match humans and other higher primates at image categorization, putting the latest models through their paces.

Rishi Rajalingham, a graduate student in DiCarlo’s lab, conducted the study as part of his thesis work at the McGovern Institute. As Rajalingham puts it, “One might imagine that artificial vision systems should behave like humans in order to be seamlessly integrated into human society, so this tests to what extent that is true.”

The team focused on testing so-called “deep, convolutional neural networks” (DCNNs), specifically those trained on ImageNet, a large-scale collection of category-labeled images that has recently been used as a library for training neural networks (the resulting models are called DCNNIC models). These models have thus essentially been through an intense image-recognition bootcamp. They were then pitted against monkeys and humans and asked to differentiate objects in synthetically constructed images, which placed the object being categorized against unusual backgrounds and in unusual orientations. The resulting images (such as the floating camel shown above) leveled the playing field: humans would ordinarily have a leg up from contextual cues, so context was deliberately removed as a confounder to allow a pure comparison of object categorization.
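For readers who want a concrete sense of what querying a DCNNIC model looks like, here is a minimal sketch, not the study’s actual code: it assumes an off-the-shelf ImageNet-pretrained network from torchvision (ResNet-50 is an arbitrary stand-in for the models tested) and a hypothetical synthetic test image file.

```python
# Minimal sketch (not the study's code): ask an ImageNet-pretrained DCNN
# which category it sees in a synthetic test image.
import torch
from torchvision import models, transforms
from PIL import Image

# Load a network pre-trained on ImageNet -- the "image recognition bootcamp"
# described above. ResNet-50 is an assumption; the study tested several DCNNs.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# Standard ImageNet preprocessing: resize, crop, convert, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "synthetic_camel.png" is a hypothetical file standing in for one of the
# study's synthetic images (an object rendered on an unrelated background).
img = preprocess(Image.open("synthetic_camel.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)  # scores over 1000 ImageNet classes
print("top-5 class indices:", probs.topk(5).indices.tolist())
```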

DiCarlo and his team found that humans, monkeys and DCNNIC models all appeared to perform similarly when examined at a relatively coarse level. Each group was shown 100 images of each of 24 different objects. When performance was averaged across the 100 photos of a given object, all three groups could distinguish, for example, camels pretty well overall. The researchers then zoomed in and examined the behavioral data at a much finer resolution (i.e. for each single photo of a camel), deriving more detailed “behavioral fingerprints” of primates and machines. These image-by-image analyses revealed strong differences: monkeys still behaved very consistently like their human primate cousins, but the artificial neural networks could no longer keep up.
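To illustrate the difference between the coarse and fine-grained comparisons described above, here is a small sketch with invented numbers (not data from the study): the same trial outcomes can be summarized per object, or image by image to form a “behavioral fingerprint.”

```python
# Sketch of coarse (per-object) vs. fine (per-image) behavioral comparison.
# All values below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_objects, n_images_per_object = 24, 100

# Hypothetical per-image accuracies (fraction of trials correct) for a
# primate and a model; in the study these come from real behavioral trials.
primate = rng.uniform(0.5, 1.0, size=(n_objects, n_images_per_object))
model = np.clip(primate + rng.normal(0, 0.15, size=primate.shape), 0, 1)

# Coarse level: average over the 100 images of each object -> 24 numbers each.
coarse_similarity = np.corrcoef(primate.mean(axis=1), model.mean(axis=1))[0, 1]

# Fine level: compare the full 24 x 100 image-by-image pattern.
fine_similarity = np.corrcoef(primate.ravel(), model.ravel())[0, 1]

print(f"object-level consistency: {coarse_similarity:.2f}")
print(f"image-level consistency:  {fine_similarity:.2f}")
```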

“I thought it was quite surprising that monkeys and humans are remarkably similar in their recognition behaviors, especially given that these objects (e.g. trucks, tanks, camels, etc.) don’t ‘mean’ anything to monkeys,” says Rajalingham. “It’s indicative of how closely related these two species are, at least in terms of these visual abilities.”

DiCarlo’s team then gave the neural networks some remedial homework, training the models on images that more closely resembled the synthetic images used in the study, to see whether the extra practice would help them catch up. Even with this additional training (which the humans and monkeys did not receive), the models could not match a primate’s ability to discern what was in each individual image.
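As a rough illustration of what such remedial training might look like in practice, here is a hedged sketch of fine-tuning a pretrained network on a hypothetical folder of synthetic-style images; the model choice, folder layout and hyperparameters are assumptions for illustration, not details from the study.

```python
# Sketch of "remedial" fine-tuning: continue training a pretrained DCNN on
# synthetic-style images. Paths and settings below are hypothetical.
import torch
from torch import nn, optim
from torchvision import models, datasets, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 24)   # re-head for the 24 study categories

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
])
# Hypothetical folder of synthetic training images, one subfolder per category.
loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("synthetic_train/", transform=preprocess),
    batch_size=32, shuffle=True)

opt = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:                     # one epoch shown for brevity
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```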

DiCarlo conveys that this is a glass half-empty, half-full story. “The half-full part is that today’s deep artificial neural networks, developed based on just some aspects of brain function, are far better and far more human-like in their object recognition behavior than artificial systems just a few years ago,” explains DiCarlo. “However, careful and systematic behavioral testing reveals that even for visual object recognition, the brain’s neural network still has some tricks up its sleeve that these artificial neural networks do not yet have.”

DiCarlo’s study begins to define more precisely where the leading artificial neural networks start to “trip up,” and highlights a fundamental aspect of their architecture that struggles with the categorization of single images, a shortcoming that does not appear to be fixable by further brute-force training. The work also provides an unprecedented and rich dataset of human (1476 anonymous humans, to be exact) and monkey behavior that will serve as a quantitative benchmark for improving artificial neural networks.

 

Image: Example of a synthetic image used in the study. For the category “camel,” 100 distinct synthetic camel images were shown to DCNNIC models, humans, and rhesus monkeys; 24 different categories were tested altogether.