How jump-start deals with exploration challenges in reinforcement learning
In reinforcement learning, an agent improves through trial and error: it acts repeatedly in an environment and adjusts its behaviour based on the feedback it receives. The earliest stage of this process is the hardest, because the policy has no prior data to rely on and must stumble on rewards at random before it can improve. This stage is known as exploration. One way around the difficulty of exploration is to equip the policy with some prior knowledge, so that the agent can collect data with non-zero rewards from the start. On its own, however, this does not yield optimal results.
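As a rough illustration (not from the original post), consider a purely random policy in a toy sparse-reward setting; the horizon, action count and reward rule below are made up for the example, but they show why random exploration almost never encounters a non-zero reward.

```python
import random

# Toy sparse-reward setting: the agent is rewarded only if it takes the
# "correct" action at every step of the episode.
HORIZON = 20
CORRECT_ACTION = 3
NUM_ACTIONS = 10

def random_episode():
    """Roll out a purely random policy; return 1.0 only if every action was correct."""
    for _ in range(HORIZON):
        if random.randrange(NUM_ACTIONS) != CORRECT_ACTION:
            return 0.0   # one wrong action and the sparse reward is never reached
    return 1.0

episodes = 100_000
successes = sum(random_episode() for _ in range(episodes))
# With 10 actions and a 20-step horizon, the success probability is 10**-20,
# so random exploration essentially never sees a reward it can learn from.
print(f"successful episodes: {successes} / {episodes}")
```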
The exploration problem
Such prior policies can be acquired from demonstration data (for example, through behavioural cloning), from sub-optimal prior data (through offline reinforcement learning), or even through manual engineering. If the prior policy is itself parameterised as a function approximator, it can be used directly to initialise a policy gradient method. Sample-efficient, value-based algorithms are much harder to bootstrap this way: value functions need both good and bad data to initialise effectively, so the mere availability of a reasonable starting policy does not by itself translate into improved performance. The question, then, is how to bootstrap a value-based reinforcement learning algorithm using a prior policy that already attains reasonable, but sub-optimal, performance.
Source: Google AI blog
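One common way to obtain such a prior policy, as mentioned above, is behavioural cloning on demonstration data. The sketch below is a minimal, generic version of that step; the network shape, dimensions and the randomly generated "demonstrations" are placeholders, not taken from the study.

```python
import torch
import torch.nn as nn

# Minimal behavioural cloning: fit a policy network to (observation, action)
# pairs from demonstrations. Dimensions and data here are placeholders.
obs_dim, act_dim = 17, 6
demo_obs = torch.randn(1024, obs_dim)   # stand-in for demonstrator observations
demo_act = torch.randn(1024, act_dim)   # stand-in for demonstrator actions

guide_policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(guide_policy.parameters(), lr=3e-4)

for epoch in range(100):
    pred_act = guide_policy(demo_obs)
    loss = nn.functional.mse_loss(pred_act, demo_act)  # imitate demonstrated actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained network can then serve as the prior ("guide") policy discussed below.
```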
Rolling in prior policy
Google AI has conducted a study that addresses this initialisation problem with a method called jump-start reinforcement learning (JSRL). The idea is that any reinforcement learning algorithm can be bootstrapped by gradually “rolling in” the prior policy, referred to as the guide policy. The guide policy largely resolves the exploration problem and allows fast learning. As the exploration policy improves, the influence of the guide policy is scaled back until only the reinforcement learning policy remains, at which point the model continues to improve on its own.
Source: Google AI blog
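In outline, the “roll-in” works like this: in each episode the guide policy acts for the first few steps, the learned exploration policy takes over for the remainder, and the hand-over point is moved earlier as the learned policy improves. The sketch below is a paraphrase of that idea, not the authors' code; the environment interface (step returning observation, reward, done) and the agent's methods are assumed placeholders.

```python
def jsrl_rollout(env, guide_policy, exploration_policy, rollin_steps):
    """Collect one episode: the guide acts for the first `rollin_steps` steps,
    then the learned exploration policy takes over."""
    transitions = []
    obs, done, t = env.reset(), False, 0
    while not done:
        acting_policy = guide_policy if t < rollin_steps else exploration_policy
        action = acting_policy.act(obs)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return transitions


def jsrl_train(env, guide_policy, agent, horizon, episodes_per_stage=100):
    """Curriculum over roll-in lengths: the guide starts by controlling most of
    the episode and hands over to the learning agent earlier and earlier."""
    step = max(1, horizon // 10)
    for rollin_steps in range(horizon, -1, -step):
        for _ in range(episodes_per_stage):
            for transition in jsrl_rollout(env, guide_policy, agent, rollin_steps):
                agent.replay_buffer.add(transition)   # any off-policy agent works here
            agent.update()                            # e.g. a Q-learning update
        # In the paper the hand-over schedule is tied to evaluation performance;
        # here each stage simply lasts a fixed number of episodes for brevity.
```

Once `rollin_steps` reaches zero, the guide policy is no longer used at all and training proceeds as ordinary reinforcement learning.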
While the research focused on value-based methods, the approach is generic and can be applied to other reinforcement learning methods as well. The only requirements on the guide policy are that it takes actions based on its observations of the environment and maintains a reasonable level of performance. Because such a policy is better than a random one, it speeds up the early stages of reinforcement learning, which is why the method is called jump-start reinforcement learning.
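Since the only requirement is a mapping from observations to actions, almost anything can play the role of the guide: a behaviour-cloned network, a planner, or a hand-written controller. A minimal interface might look like the following (the class names here are hypothetical).

```python
from typing import Any, Protocol

class GuidePolicy(Protocol):
    """Anything that maps an observation to an action can serve as a guide."""
    def act(self, observation: Any) -> Any: ...

class WrappedController:
    """Example: wrap any existing controller (a behaviour-cloned network, a
    planner, or a hand-written script) behind the one-method interface."""
    def __init__(self, policy_fn):
        self.policy_fn = policy_fn

    def act(self, observation):
        return self.policy_fn(observation)
```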
Jump-start reinforcement learning can use any kind of prior policy to speed up learning and combines easily with existing offline and online reinforcement learning methods. The study also provides an upper bound on JSRL's sample complexity, indicating that it can be markedly more sample-efficient than classic exploration alternatives.
The natural baselines for JSRL are imitation-and-reinforcement-learning (IL+RL) methods, since these also use a prior policy to initialise reinforcement learning. The comparison was carried out on a set of D4RL benchmark tasks, which include simulated robotic control environments together with offline datasets collected from human demonstrators, planners and other learned policies.
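For reference, D4RL exposes these offline datasets through a Gym-style API. Assuming the open-source d4rl package is installed, loading one of its tasks typically looks like the snippet below; the task name is chosen only as an example.

```python
import gym
import d4rl  # importing d4rl registers its environments with gym

# Example task: an ant-maze navigation dataset collected by a planner.
env = gym.make("antmaze-medium-diverse-v0")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, ...

print(dataset["observations"].shape, dataset["actions"].shape)
```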
For complicated tasks such as vision-based robotic manipulation, offline data alone can leave a knowledge gap because of the dimensionality involved: both the continuous-control action space and the pixel-based state space are high-dimensional, which creates scaling challenges for IL+RL methods in terms of the amount of data they require. To evaluate JSRL in this setting, the study considered two simulated robotic manipulation tasks: indiscriminate grasping (lifting any object) and instance grasping (lifting a specific target object).
Conclusion
JSRL improves exploration when initialising RL tasks by leveraging a prior policy: the algorithm rolls in the pre-existing guide policy and then hands control over to the self-improving exploration policy. According to the study's findings, JSRL is more sample-efficient than competing IL+RL approaches. By taking much of the burden off the exploration policy, it makes the initialisation process considerably more efficient.


