How to Write a Customized Environment in ReinforcementLearning.jl?
Last Update: 2021-01-30T21:08:01.778
Julia Version: v"1.5.3"
ReinforcementLearning.jl Version: v"0.8.0"
The first step to applying algorithms in ReinforcementLearning.jl is to define the problem you want to solve in a way the package can recognize. Here we'll demonstrate how to write many different kinds of environments based on the interfaces defined in ReinforcementLearningBase.jl.
The most commonly used interface for describing reinforcement learning tasks is OpenAI Gym. Inspired by it, we expand those interfaces a little to take advantage of multiple dispatch in Julia and to cover multi-agent environments.
The Minimal Interfaces to Implement
Many interfaces in ReinforcementLearningBase.jl have a default implementation. So in most cases, you only need to implement the following functions to define a customized environment:
```julia
action_space(env::YourEnv)
state(env::YourEnv)
state_space(env::YourEnv)
reward(env::YourEnv)
is_terminated(env::YourEnv)
reset!(env::YourEnv)
(env::YourEnv)(action)
```
An Example: The LotteryEnv
Here we use an example introduced in Monte Carlo Tree Search: A Tutorial to demonstrate how to write a simple environment.
The game is defined like this: assume you have $10 in your pocket, and you are faced with the following three choices:
Buy a PowerRich lottery ticket (win $100M w.p. 0.01; nothing otherwise);
Buy a MegaHaul lottery ticket (win $1M w.p. 0.05; nothing otherwise);
Do not buy a lottery ticket.
This is a one-shot game: it terminates immediately after an action is taken and a reward is received. First we define a concrete subtype of `AbstractEnv` named `LotteryEnv`:
```julia
using ReinforcementLearning

Base.@kwdef mutable struct LotteryEnv <: AbstractEnv
    reward::Union{Nothing, Int} = nothing
end
```
`LotteryEnv` has only one field named `reward`, which is initialized with `nothing` by default. Now let's implement the necessary interfaces:
```julia
RLBase.action_space(env::LotteryEnv) = (:PowerRich, :MegaHaul, nothing)
```
Here `RLBase` is just an alias for `ReinforcementLearningBase`.
```julia
begin
    RLBase.reward(env::LotteryEnv) = env.reward
    RLBase.state(env::LotteryEnv) = !isnothing(env.reward)
    RLBase.state_space(env::LotteryEnv) = [false, true]
    RLBase.is_terminated(env::LotteryEnv) = !isnothing(env.reward)
    RLBase.reset!(env::LotteryEnv) = env.reward = nothing
end
```
Because the lottery game is just a simple one-shot game, we can encode its state with the reward: if the `reward` is `nothing`, the game has not started yet and we say it is in state `false`; otherwise the game is terminated and the state is `true`. So the result of `state_space(env)` describes the two possible states of this environment. To `reset!` the game, we simply set the reward back to `nothing`, meaning that it's in the initial state again.
The only thing left is to implement the game logic:
```julia
function (x::LotteryEnv)(action)
    if action == :PowerRich
        x.reward = rand() < 0.01 ? 100_000_000 : -10
    elseif action == :MegaHaul
        x.reward = rand() < 0.05 ? 1_000_000 : -10
    elseif isnothing(action)
        x.reward = 0
    else
        @error "unknown action of $action"
    end
end
```
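Before wiring the environment into any algorithm, it helps to step through one round by hand. The following is a quick hedged sketch using only the interfaces defined above (the exact reward depends on the random draw):

```julia
# Manual rollout of one episode (illustrative; results vary with the random draw)
demo_env = LotteryEnv()
is_terminated(demo_env)                    # false
demo_env(:PowerRich)                       # take an action
reward(demo_env), is_terminated(demo_env)  # e.g. (-10, true)
reset!(demo_env)                           # back to the initial state
```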
Test Your Environment
A method named `RLBase.test_runnable!` is provided to roll out several simulations and check whether the environment we defined is functional.
```julia
env = LotteryEnv()
```

# LotteryEnv

## Traits

| Trait Type        | Value                                            |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     | ReinforcementLearningBase.SingleAgent()          |
| DynamicStyle      | ReinforcementLearningBase.Sequential()           |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       | ReinforcementLearningBase.Stochastic()           |
| RewardStyle       | ReinforcementLearningBase.StepReward()           |
| UtilityStyle      | ReinforcementLearningBase.GeneralSum()           |
| ActionStyle       | ReinforcementLearningBase.MinimalActionSet()     |
| StateStyle        | ReinforcementLearningBase.Observation{Any}()     |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Any}()     |

## Is Environment Terminated?

No

## State Space

`Bool[0, 1]`

## Action Space

`(:PowerRich, :MegaHaul, nothing)`

## Current State

```
false
```

```julia
RLBase.test_runnable!(env)
```
It is a simple smoke test which works roughly like this:

```julia
for _ in 1:n_episode
    reset!(env)
    while !is_terminated(env)
        env |> action_space |> rand |> env
    end
end
```
One step further is to test that other components in ReinforcementLearning.jl also work. Similar to the test above, let's try the `RandomPolicy` first:
```julia
using Random

run(RandomPolicy(action_space(env)), env, StopAfterEpisode(1_000))
```
If no error shows up, then it means our environment at least works with the `RandomPolicy` 🎉🎉🎉. Next, we can add a hook to collect the reward in each episode to see the performance of the `RandomPolicy`.
```julia
using Plots

begin
    hook = TotalRewardPerEpisode()
    run(RandomPolicy(action_space(env)), env, StopAfterEpisode(1_000), hook)
    plot(hook.rewards)
end
```
A random policy is usually not very meaningful. Here we'll use a tabular Monte Carlo method to estimate the state-action values. (You may choose appropriate algorithms based on the problem you're dealing with.)
```julia
using Flux: InvDecay

p = QBasedPolicy(
    learner = MonteCarloLearner(;
        approximator = TabularQApproximator(;
            n_state = length(state_space(env)),
            n_action = length(action_space(env)),
            opt = InvDecay(1.0)
        )
    ),
    explorer = EpsilonGreedyExplorer(0.1)
)
```

```
QBasedPolicy
├─ learner => MonteCarloLearner
│  ├─ approximator => TabularApproximator
│  │  ├─ table => 3×2 Array{Float64,2}
│  │  └─ optimizer => InvDecay
│  │     ├─ gamma => 1.0
│  │     └─ state => IdDict
│  ├─ γ => 1.0
│  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  └─ sampling => ReinforcementLearningZoo.NoSampling
└─ explorer => EpsilonGreedyExplorer
   ├─ ϵ_stable => 0.1
   ├─ ϵ_init => 1.0
   ├─ warmup_steps => 0
   ├─ decay_steps => 0
   ├─ step => 1
   ├─ rng => Random._GLOBAL_RNG
   └─ is_training => true
```
```julia
p(env)
```

```
MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay})(::Bool)
Closest candidates are:
  Any(!Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:30
  Any(!Matched::Int64, !Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:31
- (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::Bool)@monte_carlo_learner.jl:45
- (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::Main.workspace3.LotteryEnv)@monte_carlo_learner.jl:44
- (::ReinforcementLearningCore.QBasedPolicy{ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling},ReinforcementLearningCore.EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}})(::Main.workspace3.LotteryEnv, ::ReinforcementLearningBase.MinimalActionSet, ::Tuple{Symbol,Symbol,Nothing})@q_based_policy.jl:27
- (::ReinforcementLearningCore.QBasedPolicy{ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling},ReinforcementLearningCore.EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}})(::Main.workspace3.LotteryEnv)@q_based_policy.jl:21
- top-level scope@Local: 1[inlined]
```
Oops, we get an error here. So what does it mean?
Before answering this question, let's spend some time understanding the policy we defined above. A `QBasedPolicy` contains two parts: a `learner` and an `explorer`. The `learner` learns the state-action value function (aka the Q function) during interactions with the `env`. The `explorer` is used to select an action based on the Q values returned by the `learner`. Here the `EpsilonGreedyExplorer(0.1)` will select the action with the largest value with probability 0.9 and a random one with probability 0.1. Inside the `MonteCarloLearner`, a `TabularQApproximator` is used to estimate the Q values.
That's the problem! A `TabularQApproximator` only accepts states of type `Int`.
```julia
p.learner.approximator(1, 1) # Q(s, a) => 0.0
p.learner.approximator(1)    # [Q(s, a) for a in action_space(env)] => [0.0, 0.0, 0.0]
```

```julia
p.learner.approximator(false)
```

```
MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay})(::Bool)
Closest candidates are:
  Any(!Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:30
  Any(!Matched::Int64, !Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:31
- top-level scope@Local: 1[inlined]
```
OK, now we know where the problem is. But how do we fix it?
An initial idea is to rewrite the `RLBase.state(env::LotteryEnv)` function to force it to return an `Int`. That's workable. But in some cases, we may be using environments written by others, and it's not always easy to modify the code directly. Fortunately, some built-in wrappers are provided to help us transform the environment.
```julia
wrapped_env = ActionTransformedEnv(
    StateOverriddenEnv(
        env,
        s -> s ? 1 : 2
    ),
    action_space_mapping = _ -> Base.OneTo(3),
    action_mapping = i -> action_space(env)[i]
)
```

# LotteryEnv |> StateOverriddenEnv |> ActionTransformedEnv

## Traits

| Trait Type        | Value                                            |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle     | ReinforcementLearningBase.SingleAgent()          |
| DynamicStyle      | ReinforcementLearningBase.Sequential()           |
| InformationStyle  | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle       | ReinforcementLearningBase.Stochastic()           |
| RewardStyle       | ReinforcementLearningBase.StepReward()           |
| UtilityStyle      | ReinforcementLearningBase.GeneralSum()           |
| ActionStyle       | ReinforcementLearningBase.MinimalActionSet()     |
| StateStyle        | ReinforcementLearningBase.Observation{Any}()     |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Any}()     |

## Is Environment Terminated?

Yes

## State Space

`Bool[0, 1]`

## Action Space

`Base.OneTo(3)`

## Current State

```
1
```

```julia
p(wrapped_env)  # => 1
```
Nice job! Now we are ready to run the experiment:
```julia
begin
    h = TotalRewardPerEpisode()
    run(p, wrapped_env, StopAfterEpisode(1_000), h)
    plot(h.rewards)
end
```
If you are observant enough, you'll find that our policy is not updating at all!!!
```julia
p.learner.approximator.table
```

```
3×2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0
 0.0  0.0
```
Well, actually the policy is running in evaluation mode here. We'll explain it in another blog. For now, you only need to know that we can wrap the policy in an `Agent` to train the policy.
```julia
agent = Agent(;
    policy = p,
    trajectory = VectorSARTTrajectory()
)
```

```
Agent
├─ policy => QBasedPolicy
│  ├─ learner => MonteCarloLearner
│  │  ├─ approximator => TabularApproximator
│  │  │  ├─ table => 3×2 Array{Float64,2}
│  │  │  └─ optimizer => InvDecay
│  │  │     ├─ gamma => 1.0
│  │  │     └─ state => IdDict
│  │  ├─ γ => 1.0
│  │  ├─ kind => ReinforcementLearningZoo.FirstVisit
│  │  └─ sampling => ReinforcementLearningZoo.NoSampling
│  └─ explorer => EpsilonGreedyExplorer
│     ├─ ϵ_stable => 0.1
│     ├─ ϵ_init => 1.0
│     ├─ warmup_steps => 0
│     ├─ decay_steps => 0
│     ├─ step => 1002
│     ├─ rng => Random._GLOBAL_RNG
│     └─ is_training => true
└─ trajectory => Trajectory
   └─ traces => NamedTuple
      ├─ state => 0-element Array{Int64,1}
      ├─ action => 0-element Array{Int64,1}
      ├─ reward => 0-element Array{Float32,1}
      └─ terminal => 0-element Array{Bool,1}
```
```julia
new_hook = TotalRewardPerEpisode()
```
```julia
run(agent, wrapped_env, StopAfterStep(100_000), new_hook)
```

The recorded episode rewards are mostly `-10.0` or `0.0`, with the occasional `1.0e6` win.
```julia
p.learner.approximator.table
```

```
3×2 Array{Float64,2}:
 0.0  1.00773e6
 0.0  47660.8
 0.0  0.0
```
Note
Always remember that each algorithm usually only works in some specific environments, just like the `QBasedPolicy` above. So choose the right tool wisely 😉.
More Complicated Environments
The above `LotteryEnv` is quite simple. Many environments we are interested in fall into the same category. Beyond that, there are still many other kinds of environments. You may take a glimpse at the table to see how many different types of environments are supported in ReinforcementLearningZoo.jl.
To distinguish different kinds of environments, some common traits are defined in ReinforcementLearningBase.jl. Now we'll explain them one by one.
StateStyle
In the above `LotteryEnv`, `state(env::LotteryEnv)` simply returns `true` or `false`. But in some other environments, the function name `state` may be kind of vague. People with different backgrounds often talk about the same thing with different names. You may be interested in this discussion: What is the difference between an observation and a state in reinforcement learning? To avoid confusion when executing `state(env)`, the environment designer can explicitly define `state(::AbstractStateStyle, env::YourEnv)` so that users can fetch the necessary information on demand. The following are some built-in state styles:
```julia
subtypes(RLBase.AbstractStateStyle)
```

```
GoalState
InformationSet
InternalState
Observation
```
Note that every state style may have different representations: `String`, `Array`, `Graph`, and so on. All the above state styles can accept a data type as a parameter. For example:
```julia
RLBase.state(::Observation{String}, env::LotteryEnv) = is_terminated(env) ? "Game Over" : "Game Start"
```
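With this method defined, users can ask for that particular representation explicitly. A small hedged sketch (the returned string depends on whether the game has terminated):

```julia
# Fetch the String representation on demand (sketch)
state(env, Observation{String}())  # "Game Start" or "Game Over"
```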
For environments which support many different kinds of states, developers should specify all the supported state styles. For example:
```julia
tp = TigerProblemEnv();

StateStyle(tp)

state(tp, Observation{Int64}())    # => 1
state(tp, InternalState{Int64}())  # => 2
state(tp)                          # => 1

DefaultStateStyle(tp)
```
DefaultStateStyle
The `DefaultStateStyle` trait returns the first element in the result of `StateStyle` by default.
Algorithm developers usually don't care about the state style. They can assume that the default state style is always well defined and simply call `state(env)` to get the right representation. So for environments with many different representations, `state(env)` will be dispatched to `state(DefaultStateStyle(env), env)`. And we can use the `DefaultStateStyleEnv` wrapper to override the predefined `DefaultStateStyle(::YourEnv)`.
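For example, here is a hedged sketch of such an override for the `TigerProblemEnv` above; the constructor form `DefaultStateStyleEnv{style}(env)` is assumed:

```julia
# Assumed usage of the DefaultStateStyleEnv wrapper: make the internal state
# the default representation returned by state(...)
tp_internal = DefaultStateStyleEnv{InternalState{Int64}()}(tp)
state(tp_internal)  # now dispatches to state(InternalState{Int64}(), tp)
```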
RewardStyle
For games like Chess, Go, or many card games, we only get a reward at the end of the game. We say this kind of game is of the `TerminalReward` style; otherwise we define it as `StepReward`. Actually, `TerminalReward` is a special case of `StepReward` (for non-terminal steps, the reward is `0`). The reason we still want to distinguish these two cases is that for some algorithms there may be a more efficient implementation for `TerminalReward`-style games.
```julia
RewardStyle(tp)
RewardStyle(MontyHallEnv())
```
ActionStyle
For some environments, the set of valid actions may differ at each step. We call this kind of environment `FullActionSet`. Otherwise, we say the environment is of the `MinimalActionSet` style. A typical built-in environment with a `FullActionSet` is the `TicTacToeEnv`. Two extra methods must be implemented: `legal_action_space(env)` and `legal_action_space_mask(env)`:
```julia
ttt = TicTacToeEnv();

ActionStyle(ttt)

legal_action_space(ttt)       # 1, 2, 3, 4, 5, 6, 7, 8, 9
legal_action_space_mask(ttt)  # a 9-element mask, all true for the empty board
```
NumAgentStyle
In the above `LotteryEnv`, only one player is involved in the environment. In many board games, multiple players are usually engaged.

```julia
NumAgentStyle(env)
NumAgentStyle(ttt)
```
For multi-agent environments, some new APIs are introduced. The meanings of some APIs we've already seen are also extended. First, multi-agent environment developers must implement `players` to distinguish the different players.

```julia
players(ttt)
current_player(ttt)
```
| Single Agent | Multi-Agent |
|:-------------------- |:---------------------------- |
| `state(env)` | `state(env, player)` |
| `reward(env)` | `reward(env, player)` |
| `env(action)` | `env(action, player)` |
| `action_space(env)` | `action_space(env, player)` |
| `state_space(env)` | `state_space(env, player)` |
| `is_terminated(env)` | `is_terminated(env, player)` |
Note that the single-agent APIs are still valid; they simply fall back to the perspective of the `current_player(env)`.
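For instance, the following equivalences are expected to hold (a hedged sketch, not part of the original notebook):

```julia
# Single-agent queries fall back to the current player's perspective
state(ttt) == state(ttt, current_player(ttt))                # expected: true
action_space(ttt) == action_space(ttt, current_player(ttt))  # expected: true
```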
UtilityStyle
In multi-agent environments, sometimes the sum of the rewards of all players is always `0`. We call the `UtilityStyle` of these environments `ZeroSum`. `ZeroSum` is a special case of `ConstantSum`. In cooperative games, the reward of each player is the same; in this case, the `UtilityStyle` is `IdenticalUtility`. Other cases fall back to `GeneralSum`.
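As a quick hedged check (the value for `env` matches the traits table shown earlier; the value for `ttt` is what we'd expect for a win/lose board game):

```julia
UtilityStyle(env)  # GeneralSum(), as shown in the LotteryEnv traits table above
UtilityStyle(ttt)  # expected: ZeroSum(), one player's win is the other's loss
```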
InformationStyle
If all players can see the same state, then we say the `InformationStyle` of these environments is `PerfectInformation`. It is a special case of `ImperfectInformation` environments.
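For example (a hedged sketch; the value for `ttt` is what we'd expect since both players see the full board):

```julia
InformationStyle(env)  # ImperfectInformation(), as in the LotteryEnv traits table above
InformationStyle(ttt)  # expected: PerfectInformation()
```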
DynamicStyle
All the environments we've seen so far are of the `Sequential` style, meaning that at each step only ONE player is allowed to take an action. Alternatively, there are `Simultaneous` environments, where all the players take actions at the same time without seeing each other's actions in advance. Simultaneous environments must take a collection of actions from the different players as input.
```julia
rps = RockPaperScissorsEnv();

action_space(rps)  # all 9 ('💎', '📃', '✂') × ('💎', '📃', '✂') pairs, one pick per player

rps(rand(action_space(rps)))  # => true
```
ChanceStyle
If there is no `rng` in the environment and everything is deterministic after taking each action, then we say the `ChanceStyle` of these environments is `Deterministic`. Otherwise, we call them `Stochastic`. One special case is Extensive Form Games, where a chance node is involved and the action probabilities of this special player are known. For these environments, we need to have the `chance_player` and `prob` methods defined:
```julia
kp = KuhnPokerEnv();

chance_player(kp)

prob(kp, chance_player(kp))       # 0.333333, 0.333333, 0.333333
chance_player(kp) in players(kp)  # => true
```
Examples
Finally, we've gone through all the details you need to know to write a customized environment. You're encouraged to take a look at the examples provided in ReinforcementLearningEnvironments.jl. Feel free to create an issue there if you're still not sure how to describe your problem with the interfaces defined in this package.