IBM Research Creates New Benchmark For Measuring Common Sense In AI
As children grow, so does their ability to tell things apart, generalise to novel agents and situations, and react to external stimuli. To engage with humans in the real world, AI agents must likewise understand us and infer our mental states from observable actions. Humans find this easy: we distinguish agents from objects, and we expect agents to respect physical constraints and to act efficiently toward their goals within those constraints.
To that end, IBM has created a new benchmark – AGENT (Action, Goal, Efficiency, coNstraint, uTility) – to evaluate an AI model’s core psychological reasoning ability. The aim is to progress, step by step, towards AI agents capable of inferring mental states, predicting future actions, and working with humans as partners.
The paper was presented at ICML 2021 and was produced by researchers at the MIT-IBM Watson AI Lab as part of their work with DARPA.
AGENT consists of a large-scale dataset of 3D animations of an agent moving under physical constraints and interacting with objects. The animations are organised into four categories of trials, designed to probe a machine learning model’s understanding of the key situations that developmental psychologists have used to reveal infants’ intuitive psychology:
- goal preferences,
- action efficiency,
- unobserved constraints, and
- cost-reward trade-offs.
[Figure: Overview of the four trial types. Image: IBM Research]
Each trial is divided into two phases:
- a familiarisation phase, showing one or more videos of the typical behaviour of a particular agent, and
- a test phase, showing a single video of the same agent either in a new physical situation (the Goal Preference, Action Efficiency, and Cost-Reward Trade-offs scenarios) or in the same situation as familiarisation but with a previously occluded portion of the scene revealed (the Unobserved Constraints scenario). The model’s task is to judge how surprising the agent’s behaviour in the test video is, given the familiarisation, as sketched below.
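To make the trial structure concrete, the sketch below shows one way to represent a trial and a scoring loop in Python. It is a minimal illustration under stated assumptions: the `Trial` record, the `expected`/`surprising` labels, and the model’s `rate_surprise` method are hypothetical stand-ins, not the benchmark’s actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    """Hypothetical record for one AGENT trial (illustrative, not the real API)."""
    scenario: str               # "goal_preference", "action_efficiency",
                                # "unobserved_constraints", or "cost_reward_tradeoffs"
    familiarisation: List[str]  # paths to one or more familiarisation videos
    test_video: str             # path to the single test video
    label: str                  # "expected" or "surprising"

def evaluate(model, trials: List[Trial]) -> float:
    """Fraction of trials where the model's surprise judgement matches the label."""
    correct = 0
    for trial in trials:
        # The model forms an impression of the agent from the familiarisation
        # videos, then rates how surprising the test video is given that impression.
        rating = model.rate_surprise(trial.familiarisation, trial.test_video)
        predicted = "surprising" if rating > 0.5 else "expected"
        correct += predicted == trial.label
    return correct / len(trials)
```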
Setting a baseline for AGENT
The researchers built two baselines: BIPaCK (Bayesian Inverse Planning and Core Knowledge) and ToMnet-G (a Theory of Mind neural network extended with a graph neural network).
The first, BIPaCK, is a generative model that combines a computational framework for understanding action through Bayesian inference with core knowledge of physics, powered by simulation. From a scene, the researchers extract the entities (the agent, objects, and obstacles) and their rough state information (3D bounding boxes and colour codes), based either on the ground truth provided in AGENT or on the output of a perception model. They then recreate an approximate physical scene in a physics engine that differs from the environment used to render the videos.

The second, ToMnet-G, encodes the familiarisation videos to obtain a character embedding for a specific agent. To encode states, it uses a graph neural network in which all entities (including barriers) are represented as nodes.
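At its heart, BIPaCK performs Bayesian inverse planning: it hypothesises goals, simulates the plan an efficient agent would follow toward each goal in the physics engine, and weights each hypothesis by how well the simulated trajectory matches the observed one. The sketch below illustrates that idea schematically; `simulate_plan` and `trajectory_likelihood` are hypothetical stand-ins for the paper’s planner and physics simulation, not its actual implementation.

```python
import numpy as np

def goal_posterior(observed_traj, goals, simulate_plan, trajectory_likelihood):
    """Bayesian inverse planning (schematic): P(goal | traj) ∝ P(traj | goal) P(goal).

    Each goal's likelihood is how well the trajectory an efficient agent would
    plan toward that goal matches the observed familiarisation trajectory.
    """
    prior = np.full(len(goals), 1.0 / len(goals))   # uniform prior over goals
    likelihood = np.array([
        trajectory_likelihood(observed_traj, simulate_plan(g))
        for g in goals
    ])
    posterior = prior * likelihood
    return posterior / posterior.sum()

def surprise_rating(test_traj, goals, posterior, simulate_plan, trajectory_likelihood):
    """Rate a test video as surprising when it is unlikely under every inferred goal."""
    expected_likelihood = sum(
        p * trajectory_likelihood(test_traj, simulate_plan(g))
        for p, g in zip(posterior, goals)
    )
    return -np.log(expected_likelihood + 1e-12)     # high value = surprising
```

ToMnet-G, by contrast, learns its predictions end to end from the character embedding and graph-encoded states, rather than inverting an explicit planner.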
Both baselines were then compared against human performance on the benchmark’s tasks. Overall, BIPaCK outperformed ToMnet-G, particularly in tests of strong generalisation. The findings suggest that, to demonstrate core psychological reasoning, an AI model must acquire or have built in a representation of how agents plan, combining cost-reward calculations with core knowledge of objects and physics.
Future prospects
Many areas remain open for improvement. AGENT can thus be considered a well-structured diagnostic tool for developing and evaluating common sense in AI models. Moreover:
- It validates the use of standard methods from developmental psychology for evaluating AI models, with stimuli comparable to those used to study human infants.
- It may become feasible to develop AI models that can learn and reason, explain their decisions and the relationships between objects and ideas, and even comprehend psychology and physics the way people do.
- An AI system might eventually engage in social interactions, make common-sense decisions in social situations involving multiple agents, and use tools to accomplish a given task, such as using a key to open a car’s door or crossing a street.
To conclude, developing AI systems with common sense may take years; AGENT, however, is a starting point and a tool for working towards that goal.