Researchers at Brown and Johns Hopkins universities have developed a ‘visual’ Turing test that gauges how well a computer can understand movements, relationships and implied intent in images.
The team described its new system as a ‘visual Turing test’, after the test proposed by the computer scientist Alan Turing, which measures the extent to which computers display human-like intelligence.
“There have been some impressive advances in computer vision in recent years,” said Stuart Geman, the James Manning Professor of Applied Mathematics at Brown. “We felt that it might be time to raise the bar in terms of how these systems are evaluated and benchmarked.”
Traditional computer vision benchmarks tend to measure an algorithm’s performance in detecting objects within an image (the image has a tree, a car or a person), or how well a system identifies an image’s global attributes (the scene is outdoors, or it is nighttime).
The system Geman and his colleagues developed, described in the journal Proceedings of the National Academy of Sciences, is designed to test a contextual understanding of photos.
It works by generating a string of yes-or-no questions about an image, which are posed sequentially to the system being tested. Each question is progressively more in-depth and is based on the responses to the questions that have come before.
For example, an initial question might ask a computer if there’s a person in a given region of a photo. If the computer says yes, then the test might ask if there’s anything else in that region – perhaps another person.
If there are two people, the test might ask: “Are person1 and person2 talking?”
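The sequence of questions in this example can be sketched in a few lines of Python. This is only an illustration of the idea of follow-up questions that depend on earlier answers, not the researchers' actual query generator; the function name `run_query_sequence`, the `Answerer` type and the region label are assumptions made for the sketch.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: the system under test receives a yes/no question
# and returns True for "yes" or False for "no".
Answerer = Callable[[str], bool]


def run_query_sequence(answer: Answerer, region: str = "region1") -> List[Tuple[str, bool]]:
    """Pose follow-up questions only when earlier answers make them meaningful."""
    history: List[Tuple[str, bool]] = []

    def ask(question: str) -> bool:
        # Record every question together with the reply it received.
        reply = answer(question)
        history.append((question, reply))
        return reply

    # Each deeper question is asked only if the previous answer was "yes".
    if ask(f"Is there a person in {region}?"):
        if ask(f"Is there another person in {region}?"):
            ask("Are person1 and person2 talking?")
    return history
```

A system that answers "no" to the first question is never asked about a second person, which mirrors how each question in the test builds on the ones before it.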
The questions are geared toward gauging the computer’s understanding of the contextual “storyline” of the photo.
Because the questions are computer-generated, the system is more objective than having a human simply query a computer about an image, the researchers said.
There is a role for a human operator, however. The human’s role is to tell the test system when a question is unanswerable because of the ambiguities of the photo.
For instance, asking the computer if a person in a photo is carrying something is unanswerable if most of the person’s body is hidden by another object. The human operator would flag that question as ambiguous.
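The operator's role can be pictured as a filter on candidate questions. The Python sketch below is a simplified illustration under stated assumptions, not the published procedure: `split_by_ambiguity` and the `is_ambiguous` callback, which stands in for the human operator's judgment, are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_ambiguity(
    questions: Iterable[str],
    is_ambiguous: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Separate questions the operator flags as ambiguous from answerable ones."""
    answerable: List[str] = []
    flagged: List[str] = []
    for question in questions:
        # is_ambiguous stands in for the human operator's decision (assumption).
        if is_ambiguous(question):
            flagged.append(question)
        else:
            answerable.append(question)
    return answerable, flagged
```

Only the questions in the first list would then be posed to the system being tested.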