Guide To TensorForce: A TensorFlow-based Reinforcement Learning Framework
TensorForce is an open-source reinforcement learning library built on top of TensorFlow. Python 3 is required to use this deep RL framework. It is currently maintained by Alexander Kuhnle, while its 0.4.2 and earlier versions were jointly developed by Alexander Kuhnle, Michael Schaarschmidt and Kai Fricke.
A brief introduction to Tensorforce and several such RL frameworks can be found in this article.
Highlighting Features of TensorForce
- It supports TensorBoard.
- It supports a wide range of neural network layers such as 1D and 2D convolutions, fully connected (dense) layers, pooling, embeddings and so on.
- It enables usage of various optimization algorithms such as Adam, RMSProp, AdaDelta and natural-gradient-based optimizers (a sample layer and optimizer specification is sketched after this list).
- It also supports L2 and entropy regularization.
- It allows parallel execution of multiple RL environments.
- It supports random replay memory and batch buffer memory.
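To make the layer and optimizer support above concrete, such components are typically described in Tensorforce as plain specification dictionaries. The sketch below is only an illustrative, made-up example and is not used in the tutorial that follows; the layer types and arguments follow Tensorforce's layer specifications.

# Hypothetical network specification: a small convolutional network
# followed by a dense layer, expressed as a list of layer dicts
network_spec = [
    dict(type='conv2d', size=32, window=3, activation='relu'),
    dict(type='flatten'),
    dict(type='dense', size=64, activation='tanh'),
]

# Hypothetical optimizer specification (Adam with a fixed learning rate)
optimizer_spec = dict(optimizer='adam', learning_rate=1e-3)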
What distinguishes TensorForce from similar RL libraries?
- The whole RL logic of TensorForce is implemented in TensorFlow, which enables deployment of TensorFlow-based models and the use of portable computation graphs independent of the application programming language.
- The library's modular design is meant to be as straightforward as possible to apply and configure for general applications.
- RL algorithms applied using the library are independent of the virtual agent’s interaction with the environment as well as the nature of input states and output actions.
Practical implementation
Here’s a demonstration of creating an RL environment and agent for a temperature controller using TensorForce. The thermostat environment comprises a room with a heater. When the heater is switched on, the room temperature approaches 1.0; when it is turned off, the temperature drops towards 0.0. The exponential heat-decay constant ‘tau’ determines how fast the room temperature approaches 1.0 or 0.0. The change in temperature is computed as:
temp[i + 1] = h[i] + (temp[i] - h[i]) * exp(-1 / tau)    ...(i)
where,
temp[i] denotes the temperature (between 0 and 1) at the i-th timestep, and
h[i] represents the applied heater state (0 or 1).
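As a quick worked example of equation (i): if the room starts cold (temp[i] = 0.0), the heater is switched on (h[i] = 1) and tau = 2.0, then temp[i + 1] = 1 + (0.0 - 1) * exp(-1/2) ≈ 1 - 0.607 ≈ 0.39, i.e. the room covers roughly 39% of the gap between its current temperature and the heater state in a single timestep.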
The code has been implemented on Google Colab with Python 3.7.10 and Tensorforce 0.6.3. A step-wise explanation of the code follows:
- Install tensorforce
!pip install tensorforce
- Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
from tensorforce.environments import Environment
from tensorforce.agents import Agent
- Calculate response for current temperature and given action
def respond(ac, curr_temp, tau):
    return ac + (curr_temp - ac) * math.exp(-1.0 / tau)
- Define a series of actions (1:on, 0:off)
act = pd.Series(np.array([1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0]))
- Initialize array for responses with zeros
resp = np.zeros(act.size)
- Update this array with the response to each action
for i in range(act.size):
    # for the 1st action, the last response will be 0 ('off')
    if i == 0:
        lastResp = 0
    # for subsequent steps, record the previous response
    else:
        lastResp = resp[i - 1]
    # update the latest response by calling respond() with the current
    # action, the last response and the tau value as parameters
    resp[i] = respond(act[i], lastResp, 2.0)
- Create dataframe of actions and corresponding responses
df = pd.DataFrame(list(zip(act, resp)), columns=['Action', 'Response'])
Sample condensed data frame:
- Plot the actions and responses.
df.plot()
- Create a reward function with which the agent tries to keep the temperature in the [0.4, 0.6] range.
def reward(temperature):
    delta = abs(temperature - 0.5)
    # if the temperature is in the [0.4, 0.6] range, set the reward to 0
    if delta < 0.1:
        return 0.0
    # Otherwise the reward is the negative distance of the temperature from
    # the nearest end of the range: e.g. if the temperature is 0.7, it is
    # nearer to 0.6 than 0.4; the difference between 0.7 and 0.6 is 0.1, so
    # the reward is -0.1. If the temperature is 0.35, it is nearer to 0.4;
    # the difference between 0.4 and 0.35 is 0.05, so the reward is -0.05.
    else:
        return -delta + 0.1
- Create a list of temperatures from 0.0 to 1.0 (in steps of 0.01)
tmp = [t * 0.01 for t in range(100)]
- Compute the reward for each temperature value
rew = [reward(t) for t in tmp]
- Plot the temperature vs. reward graph
fig = plt.figure(figsize=(12, 4))
plt.scatter(tmp, rew)
plt.xlabel('Temp')
plt.ylabel('Reward')
plt.title('Reward vs. Temp')
Output:
- Create a class defining thermostat environment
class TSEnv(Environment):
    def __init__(self):
        # Initialize tau and the current temperature
        self.tau = 3.0
        self.curr_temp = np.random.random(size=(1,))
        super().__init__()

    # State of the heater: a single float temperature with minimum and
    # maximum values specified as 0.0 and 1.0 respectively
    def states(self):
        return dict(type='float', shape=(1,), min_value=0.0, max_value=1.0)

    # Action specification (0: off, 1: on)
    def actions(self):
        return dict(type='int', num_values=2)

    # Reset the environment to a random initial temperature
    def reset(self):
        self.timestep = 0
        self.curr_temp = np.random.random(size=(1,))
        return self.curr_temp

    # Environment's response to the action, following equation (i)
    def response(self, action):
        return action + (self.curr_temp - action) * math.exp(-1.0 / self.tau)

    # Compute the reward using the same logic as the reward() function above
    def reward_compute(self):
        delta = abs(self.curr_temp - 0.5)
        if delta < 0.1:
            return 0.0
        else:
            return -delta[0] + 0.1

    # Execute the given action
    def execute(self, act):
        # Check the action (the heater is either off or on)
        assert act == 0 or act == 1
        # Advance the environment by one step
        self.timestep += 1
        # Update the current temperature according to the response
        self.curr_temp = self.response(act)
        # Calculate the reward
        reward = self.reward_compute()
        terminal = False  # the episode is not over
        # Return the current temperature, terminal flag and reward
        return self.curr_temp, terminal, reward
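As an optional check (not part of the original notebook), the raw environment class can be instantiated directly and stepped once to verify that reset() and execute() behave as expected; the printed values are only indicative, since the initial temperature is random.

# Hypothetical sanity check of the environment defined above
env_check = TSEnv()
print(env_check.reset())         # random initial temperature, e.g. [0.42]
print(env_check.execute(act=1))  # (new temperature, terminal=False, reward)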
- Create the environment by specifying the thermostat environment class defined above and the maximum number of timesteps per episode
environment = Environment.create(environment=TSEnv, max_episode_timesteps=150)
- Configure an agent to learn responding in the thermostat environment
agent = Agent.create(
    agent='tensorforce', environment=environment, update=64,
    optimizer=dict(optimizer='adam', learning_rate=1e-3),
    objective='policy_gradient', reward_estimation=dict(horizon=1)
)
- Train the agent for 150 episodes
for _ in range(150):
    # Reset the environment first
    states = environment.reset()
    terminal = False
    # While the episode is not over
    while not terminal:
        # Record the agent's action on the heater's current state
        act = agent.act(states=states)
        # Execute the agent's action
        states, terminal, rew = environment.execute(actions=act)
        # act() should be followed by observe(), which observes the computed
        # reward and whether the temperature has reached a terminal state
        agent.observe(terminal=terminal, reward=rew)
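As an aside, Tensorforce also provides a Runner utility that wraps this act/execute/observe loop. A minimal sketch of an equivalent training run, assuming the agent and environment objects created above, would look roughly like this:

from tensorforce.execution import Runner

# Run 150 training episodes using the built-in runner instead of the manual loop
runner = Runner(agent=agent, environment=environment)
runner.run(num_episodes=150)
runner.close()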
- Check the trained agent’s performance
# Reset the environment
environment.reset()
# Initialize the current temperature, state and terminal flag
environment.curr_temp = np.array([1.0])
states = environment.curr_temp
intr = agent.initial_internals()
terminal = False

# Run one episode
temperature = [environment.curr_temp[0]]
# Until the episode is over
while not terminal:
    # Let the agent act on the current state (independent evaluation mode)
    ac, intr = agent.act(states=states, internals=intr, independent=True)
    # Execute the agent's action and record the resulting temperature
    states, terminal, reward = environment.execute(actions=ac)
    temperature += [states[0]]

# Plot the agent's response
plt.figure(figsize=(12, 4))
ax = plt.subplot()
# Limits of the temperature axis
ax.set_ylim([0.0, 1.0])
# Plot the temperature
plt.plot(range(len(temperature)), temperature)
# Draw red lines at temperatures 0.4 and 0.6 to see whether the temperature
# remains in the [0.4, 0.6] range
plt.hlines(y=0.4, xmin=0, xmax=149, color='r')
plt.hlines(y=0.6, xmin=0, xmax=149, color='r')
plt.xlabel('Timestep')       # X-axis label
plt.ylabel('Temperature')    # Y-axis label
plt.title('Temperature vs. Timestep')
plt.show()                   # Display the plot
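Once the performance looks acceptable, the trained agent can be persisted and the resources released. A minimal sketch, assuming an arbitrary local directory name 'thermostat-agent' chosen here for illustration:

# Save the trained agent to disk (the directory name is an arbitrary example)
agent.save(directory='thermostat-agent')

# Release the TensorFlow resources held by the agent and environment
agent.close()
environment.close()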
Output:
The output plot shows that the agent keeps the temperature in the [0.4,0.6] range (shown in blue).
- Code source
- Google Colab notebook of the above implementation