How difficult is reinforcement learning?
The aim of this applet is to investigate how difficult it is to find the optimal policy for a very small, fully observable game with six states and four actions in each state. All you know is that there are six states and four actions; the names of the actions carry no meaning.
The Game
There are six states, arranged in a 2×3 grid. The agent is in the state marked with the yellow dot.
There are four actions available to the agent, called up, down, left, and right (the names of the actions are arbitrary; they say nothing about their effects). You can control the agent with the buttons at the top right. Note that the effect of an action in a state is stochastic.
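For concreteness, here is a minimal Java sketch of such an environment. The class name, method names, transition probabilities, and rewards below are all illustrative assumptions; the applet's actual dynamics are in TGameEnv.java.

```java
import java.util.Random;

/** A minimal sketch of a 6-state, 4-action stochastic environment.
 *  All names and dynamics here are illustrative assumptions;
 *  see TGameEnv.java for the real code. */
public class TinyEnv {
    public static final int NUM_STATES = 6;   // 2x3 grid
    public static final int NUM_ACTIONS = 4;  // "up", "down", "left", "right" (labels are arbitrary)

    private final Random rng = new Random();
    private int state = 0;                    // current state (the yellow dot)

    /** Perform an action; the next state and reward are stochastic. */
    public double step(int action) {
        // Hypothetical dynamics: the action usually has a fixed effect,
        // but with some probability the agent ends up in a random state.
        int intended = (state + action + 1) % NUM_STATES;
        state = (rng.nextDouble() < 0.8) ? intended : rng.nextInt(NUM_STATES);
        // Made-up reward: one state is good, every step costs otherwise.
        return (state == NUM_STATES - 1) ? 10.0 : -1.0;
    }

    public int getState() { return state; }
}
```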
The Policy
You can control the agent yourself (using the up, down, left, and right buttons). At each step, the most recent reward is shown below the action buttons.
The aim is to find, for each state, the action that maximizes the total reward received.
The applet reports the number of steps taken and the total reward received. It also reports the minimum accumulated reward (which indicates when the agent started to improve) and the step at which the accumulated reward changed from negative to positive. Reset initializes these to zero (after one step). Trace on console lists the steps and rewards on the console, in case you want to plot them.
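The bookkeeping behind these reports can be sketched as follows; the class and field names are assumptions, not the applet's actual code.

```java
/** A sketch of the statistics the applet reports.
 *  Names are illustrative assumptions, not the applet's actual fields. */
public class RewardStats {
    private int steps = 0;
    private double totalReward = 0.0;
    private double minTotal = 0.0;    // minimum accumulated reward seen so far
    private int zeroCrossStep = -1;   // first step at which the total becomes positive

    /** Call once per step with the reward just received. */
    public void record(double reward) {
        steps++;
        totalReward += reward;
        if (totalReward < minTotal) minTotal = totalReward;
        if (zeroCrossStep < 0 && totalReward > 0) zeroCrossStep = steps;
    }

    public int getSteps() { return steps; }
    public double getTotalReward() { return totalReward; }
    public double getMinTotal() { return minTotal; }
    public int getZeroCrossStep() { return zeroCrossStep; }
}
```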
You can change the font size and the grid size.
Other applets for a related, larger game:
- hand-controller (a rather simplistic rule-based controller)
- Q-learning controller (see the Q-learning sketch after this list)
- linear function controller
- Adversary controller (Q-learning, but where an adversary chooses the prize location).
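For a rough idea of what the Q-learning controller does, here is a generic tabular Q-learning sketch. It is not the code in TGameQController.java, and the learning rate, discount factor, and exploration rate are assumed values.

```java
import java.util.Random;

/** Generic tabular Q-learning sketch for a 6-state, 4-action game.
 *  This is a textbook sketch, not the applet's TGameQController.java. */
public class QLearnerSketch {
    private final double[][] q = new double[6][4];  // Q[state][action]
    private final double alpha = 0.1;    // learning rate (assumed value)
    private final double gamma = 0.9;    // discount factor (assumed value)
    private final double epsilon = 0.1;  // exploration probability (assumed value)
    private final Random rng = new Random();

    /** Choose an action epsilon-greedily for the given state. */
    public int selectAction(int state) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(4);
        return argmax(q[state]);
    }

    /** Standard Q-learning update for (state, action, reward, nextState). */
    public void update(int state, int action, double reward, int nextState) {
        double target = reward + gamma * q[nextState][argmax(q[nextState])];
        q[state][action] += alpha * (target - q[state][action]);
    }

    private int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++)
            if (values[i] > values[best]) best = i;
        return best;
    }
}
```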
You can get the code: TGameGUI.java is the GUI, TGameEnv.java is the environment, and TGameController.java and TGameQController.java are the controllers. You can also get the javadoc for a number of my applets. This applet comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions; see the code for more details. Copyright © David Poole, 2010.