Is There 20/20 Vision in Computer Vision?
Computer vision (CV) has made incredible strides in accuracy and applicability. Object classification, in many applications, can achieve 99% accuracy; this is up from 50% a decade ago. CV can also extract complex features from images to identify signs of disease in patients.
This trend towards growing CV capabilities raises questions such as:
- Is CV better than, worse than, or simply different from human vision (HV)?
- How can man and machine work together in the image annotation process to make CV training more robust?
Modern CV benefits from the availability of high-resolution image capture, fast data processing, and large training sets. This is what makes it excellent for tasks like medical imagery interpretation. Most CV models are built in a feed-forward neural network arrangement, in which pixel saturation/color values are first analyzed, then edges are identified, and so on until ultimately objects are classified. This makes them easy to engineer.
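As a rough illustration of the staged, one-directional flow described above (pixel values, then edges, then a label), here is a minimal Python sketch. It is a toy under stated assumptions, not how any production model works: the Sobel kernels are a standard edge filter, but the helper names (`convolve3x3`, `edge_energy`, `classify`) and the three labels are invented for this example, and real networks learn their filters rather than hard-coding them.

```python
# Toy feed-forward pipeline: pixel values -> edge responses -> pooled feature -> label.
# Helper names and labels are illustrative assumptions, not a real library API.

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to horizontal edges


def convolve3x3(image, kernel):
    """Slide a 3x3 kernel over a 2D list of numbers (no padding)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        row = []
        for c in range(cols - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def edge_energy(image):
    """Stage 2: total absolute response of each edge filter."""
    gx = convolve3x3(image, SOBEL_X)
    gy = convolve3x3(image, SOBEL_Y)
    ex = sum(abs(v) for row in gx for v in row)
    ey = sum(abs(v) for row in gy for v in row)
    return ex, ey


def classify(image):
    """Stage 3: map the pooled edge features to a label."""
    ex, ey = edge_energy(image)
    if ex == 0 and ey == 0:
        return "flat"
    return "vertical edge" if ex > ey else "horizontal edge"


# Stage 1 input: a 5x5 grayscale patch, dark on the left and bright on the right.
vertical = [[0, 0, 1, 1, 1] for _ in range(5)]
# The same patch rotated by 90 degrees.
horizontal = [list(col) for col in zip(*vertical)]

print(classify(vertical))
print(classify(horizontal))
```

Information only moves forward through the stages; nothing in `classify` can go back and re-examine the pixels. Note also that the same patch gets a different label once it is rotated, which is a small-scale version of the orientation sensitivity discussed later in this post.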
In contrast, the human vision system possesses limited resolution at the retina and modest bandwidth along the optic nerve to the brain’s visual cortex. We’re not talented at picking out a hairline fracture in a manufacturing tool. However, people can build cognitive models of their environment and of new objectives with relatively few examples. In fact, scientists believe that humans reconstruct images in the brain as much as they actually “see” them. Further, our neural networks contain feedback loops, allowing people to “refocus the lens.”
The cognitive capacities of the brain simply don’t exist in today’s CV AI models. This means that CV conclusions can turn out to be very wrong. Experiments have shown that, if one takes an image of a sloth and slightly adjusts the orientation of image elements, a model could mistakenly interpret a sloth in a tree as a race car on a track. Similarly, a bunch of black wavy lines resembling a pear’s outline could be interpreted as a group of penguins. Consider the implications for a “friend or foe” identification by a military drone.
For CV to continue its growth, it will need to apply multi-modal AI, incorporating cognitive computing/semantic networks. Ultimately, there will be bots that engage with the world and combine vision, touch and a generalized representation of action-and-reaction, to interpret vision more like people do.
In the near-term, the above insights lead to the implications below for annotating images used in CV model training.
- Data handled by annotators will expand from mainly 2-dimensional images to include 3-dimensional object representations. This will help overcome a current weakness, in which reorienting objects in an image can confuse the object classification model. For example, rotating a red octagonal STOP sign can cause it to be labeled as a barbell in a gym or as a tennis racket executing a slice shot. Annotation of 3-dimensional images will be combined with an ML technique known as “capsule networks,” which capture spatial relationships between parts of an object to improve and accelerate object classification.
Figure 2: Example of object orientation. The Google Inception-v3 classifier correctly labels the canonical poses of objects (a), but fails to recognize out-of-distribution images of objects in unusual poses (b–d), including real photographs retrieved from the Internet (d). The left 3 × 3 images (a–c) are found by our framework and rendered via a 3D renderer. Below each image are its top-1 predicted label and confidence score.
- Annotators will also become more involved in labeling meaning in videos. Adding motion information to imagery extends CV interpretation skills from objects to their actions, relationships and even cause-and-effect (“a dog is licking its owner’s knee, causing a ticklish reaction”).
- Annotation teams could become involved in prioritizing the importance of particular features in defining an image. This addresses the “brittleness” of current models, where small changes to secondary features undesirably shift the model to a different classification space.
- Finally, annotators should become familiar with “adversarial” ML models, which explicitly learn to manipulate an image so that another model misclassifies it. While many adversarially created images will only be detectable by other algorithms, there will be cases in which human observation and insight will help defeat malicious use of AI.
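To make the brittleness and adversarial points above more concrete, here is a small Python sketch of how a barely visible pixel change can flip a classifier's decision. It uses a hand-written linear classifier and a gradient-sign style perturbation, which is a much simpler stand-in for the learned adversarial models mentioned above. The weights, the six-pixel "image," the labels, and the function names are all made up for illustration.

```python
# Toy adversarial perturbation against a linear classifier.
# WEIGHTS, BIAS, IMAGE, and the labels are invented values for illustration only.

WEIGHTS = [0.9, -0.4, 0.7, -0.8, 0.5, -0.6]
BIAS = -0.2
IMAGE = [0.6, 0.5, 0.5, 0.4, 0.5, 0.5]  # six pixel intensities in [0, 1]


def score(pixels):
    """Linear decision score: positive means 'stop sign'."""
    return sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS


def predict(pixels):
    return "stop sign" if score(pixels) > 0 else "other"


def sign(value):
    return (value > 0) - (value < 0)


def perturb(pixels, epsilon):
    """Move every pixel by at most epsilon in the direction that pushes the
    score toward the opposite class. For a linear model, the gradient of the
    score with respect to each pixel is simply that pixel's weight."""
    direction = -1 if score(pixels) > 0 else 1
    return [
        min(1.0, max(0.0, p + direction * epsilon * sign(w)))
        for w, p in zip(WEIGHTS, pixels)
    ]


adversarial = perturb(IMAGE, epsilon=0.05)
largest_change = max(abs(a - b) for a, b in zip(adversarial, IMAGE))

print(predict(IMAGE))
print(predict(adversarial))
print(largest_change)
```

In this toy setup, no pixel moves by more than about 0.05 on a 0-to-1 scale, yet the predicted label changes. A perturbation this small would be hard for a person to spot by eye, which is why the text expects many adversarial images to be detectable only by other algorithms.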
Going back to the question, “Is CV better than, worse than, or simply different from human vision (HV)?” From the information gathered, it is clear that CV is making breakthroughs in accuracy. However, an overarching challenge remains between humans and AI: the strong internal bias in human interpretation. Analysis tools and extensive cross-checks help to rationalize the interpretation of data and put human bias into perspective. Read our blog post on how to reduce model bias.