Key Points:
- Researchers at the Massachusetts Institute of Technology (MIT) developed a new metric, “Minimum Viewing Time” (MVT), to quantify how difficult images are for recognition. MVT measures the minimum time a person needs to identify an image accurately.
- The research found that larger AI models make substantial progress on simpler images but struggle with more challenging ones. The researchers argue that the AI industry has largely overlooked the absolute difficulty of images and datasets, skewing evaluation standards.
The study, led by David Mayo, an MIT PhD student in electrical engineering and computer science, explored why some images take longer to recognize than others, a question Mayo considers essential for understanding and improving machine vision models. Current AI models perform well on specific datasets, yet humans still outperform them in real-world applications. One reason for this disparity is that the absolute difficulty of an image or dataset is rarely known.
The research team developed the “Minimum Viewing Time” (MVT) metric to quantify an image’s difficulty level, based on how long a person needs to identify it accurately. The team evaluated models on subsets of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test the robustness of object recognition. They found that current test sets, including ObjectNet, are skewed toward easier images with shorter MVTs. Larger models showed substantial improvement on simpler images but less progress on tougher ones, while CLIP models, which combine language and vision, exhibited more human-like recognition.
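To make the idea concrete, here is a minimal sketch of how an MVT-style metric could be computed and used to stratify a test set by difficulty. This is an illustration only, not the authors’ released tooling: the function names, the exposure durations, and the per-image trial data are all hypothetical assumptions.

```python
def minimum_viewing_time(trials):
    """Return the shortest exposure (in ms) at which the image was
    identified correctly, or None if it never was.

    `trials` is a list of (exposure_ms, was_correct) pairs from
    human viewing experiments (hypothetical data format)."""
    correct = [ms for ms, ok in trials if ok]
    return min(correct) if correct else None


def stratify_by_mvt(mvts, bins=(17, 50, 150, 10000)):
    """Group image ids into difficulty buckets by their MVT.
    Each image goes into the first bin whose upper bound covers it;
    the bin edges here are illustrative, not from the paper."""
    buckets = {b: [] for b in bins}
    for image_id, mvt in mvts.items():
        if mvt is None:
            continue  # never identified: outside all buckets
        for b in bins:
            if mvt <= b:
                buckets[b].append(image_id)
                break
    return buckets


# Toy trial data: (exposure in ms, was the response correct?)
trials = {
    "img_easy": [(17, True), (50, True)],
    "img_hard": [(17, False), (50, False), (150, True)],
}
mvts = {k: minimum_viewing_time(v) for k, v in trials.items()}
buckets = stratify_by_mvt(mvts)
```

Once images are bucketed this way, a model’s accuracy can be reported per difficulty bucket rather than as a single aggregate number, which is the kind of difficulty-aware evaluation the study advocates.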
The researchers released the image sets categorized by difficulty, along with tools to compute MVT. These tools can be added to existing benchmarks, making it possible to measure test set difficulty before deploying real-world systems, discover neural correlates of image difficulty and advance object recognition methods.
Despite these advancements, the team acknowledged the study’s limitations: the methodology focused only on object recognition, and cluttered images introduce complexities the current approach does not cover. However, the team is optimistic about the potential for more robust, human-like performance in object recognition as models are evaluated against more consistent measures of real-world visual understanding.
Their work paves the way toward a deeper understanding of visual complexity, which is particularly important in health care, where the ability to interpret medical images depends on the diversity and difficulty distribution of those images. Mayo and Cummings are also exploring the neurological underpinnings of visual recognition, investigating whether the brain exhibits different activity when processing easy versus challenging images.