Project 4: Reinforcement Learning of Behavior
Sample Program
Q-Learning Parameters
pause_ms
- Number of milliseconds from the start of an action
until the state is updated.
actions
- List of possible actions
state_func
- Encodes sensor values into discrete states
reward_func
- Determines the robot’s reward
num_states
- Number of distinct states assigned by state_func
target_visits
- Number of visits to a state/action pair required
to complete the exploration phase
epsilon
- Probability of exploring during the exploitation phase
discount
- Ranges from 0.0 to 1.0, representing how heavily anticipated
future rewards are weighted when updating Q values
rate_constant
- Determines how quickly the learning rate decreases
max_steps
- Number of steps the robot will run before stopping
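To make the roles of discount and rate_constant concrete, here is a
minimal sketch of the tabular Q-Learning update these parameters drive.
This is not lib.py's actual code, and the visit-count decay formula for
the learning rate is an assumed, common choice.

    # Sketch of a tabular Q-Learning update (illustrative, not lib.py's code).
    # discount weights anticipated future reward; the learning rate alpha
    # decays with the visit count at a speed set by rate_constant (assumed).
    num_states, num_actions = 3, 2
    Q = [[0.0] * num_actions for _ in range(num_states)]
    visits = [[0] * num_actions for _ in range(num_states)]

    def q_update(s, a, r, s_next, discount=0.5, rate_constant=10):
        alpha = rate_constant / (rate_constant + visits[s][a])
        visits[s][a] += 1
        Q[s][a] += alpha * (r + discount * max(Q[s_next]) - Q[s][a])

Each call nudges Q[s][a] toward the immediate reward plus the discounted
value of the best action available from the next state.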
Assignment
- Implement the following programs:
Learning to Avoid Obstacles
For this task, your robot’s goals are:
- Drive forward as much as possible
- Avoid hitting objects
Develop a first learning robot as follows:
- Include lib.py in your project without any modifications.
- Include robot.py in your project.
- Modify it as needed to correspond to your robot’s configuration.
- Add one or more functions to represent possible robot turns.
- Base your solution on forward_back.py.
- Replace the robot.go_back function with one of your turn functions.
- Write a main.py that sets everything up and calls lib.run_q()
(a sketch follows this list).
- Run three experiments with the robot in the same obstacle-strewn
area, starting from the same point each time.
- For each experiment, run the robot for 200 iterations.
- Record the total reward the robot earns for each experiment.
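Here is a minimal main.py sketch for the first robot. Everything
robot-specific is an assumption: the robot.* calls (forward, spin_left,
read_bumpers, read_sonar), the reward() signature, and the keyword
interface of lib.run_q() all stand in for whatever lib.py and your
robot.py actually provide.

    # main.py -- a hedged sketch, not forward_back.py itself.
    import lib
    import robot

    FORWARD, OBJECT, BUMPED = 0, 1, 2            # assumed three-state encoding

    def turn_left():
        robot.spin_left(300)                     # hypothetical turn action

    def find_state():
        if robot.read_bumpers():                 # assumed truthy when any bumper is pressed
            return BUMPED
        if robot.read_sonar() < 600:             # assumed detection threshold, in mm
            return OBJECT
        return FORWARD

    def reward(state):
        # Reward driving forward; penalize bumps.  Values are illustrative.
        return {FORWARD: 1.0, OBJECT: 0.0, BUMPED: -1.0}[state]

    lib.run_q(pause_ms=250,
              actions=[robot.forward, turn_left],  # forward move listed first
              state_func=find_state,
              reward_func=reward,
              num_states=3,
              target_visits=5,
              epsilon=0.0,
              discount=0.5,
              rate_constant=10,
              max_steps=200)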
For your second learning robot:
- Replace the turn function you employed in the previous step with
a different turn function.
- Again, run three experiments for 200 iterations each, recording
the total reward the robot earns.
For your third learning robot:
- Replace the BUMPED state with two distinct alternatives, one for
each bump sensor:
- Modify find_state() to return the correct state for each of
the given bump sensor configurations (a sketch follows this list).
- Update num_states in the parameters to be 4.
- Modify reward() accordingly as well.
- The actions should include a forward move, a left turn, and a
right turn. The forward move should be first.
- Consider setting epsilon to a positive value, perhaps 0.05 or 0.1.
It is easy for the robot to get “stuck” on a suboptimal action in this
task.
- Run three experiments for 200 iterations each, recording the total
reward the robot earns.
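Below is a hedged sketch of the four-state find_state(). It continues
the main.py sketch above; robot.read_bumpers() is assumed here to report
the two bump sensors separately, and the state names are illustrative.

    # Hypothetical four-state find_state() (assumes the imports from the
    # main.py sketch above).  read_bumpers() is assumed to return a
    # (left, right) pair of booleans.
    FORWARD, OBJECT, LEFT_BUMPED, RIGHT_BUMPED = 0, 1, 2, 3

    def find_state():
        left, right = robot.read_bumpers()
        if left:
            return LEFT_BUMPED
        if right:
            return RIGHT_BUMPED
        if robot.read_sonar() < 600:             # assumed detection threshold, in mm
            return OBJECT
        return FORWARD

reward() then needs a value (or branch) for each of the four states.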
For your fourth learning robot:
- Use a single BUMPED state, as you did in the first two variations.
- Replace the OBJECT state, modifying find_state() accordingly,
with the following three states (a sketch follows this list):
CLOSE_OBJECT (0-200 mm)
MID_OBJECT (200-400 mm)
FAR_OBJECT (400-600 mm)
- Update num_states in the parameters to be 5.
- The actions should include a forward move and one turn.
- Run three experiments for 200 iterations each, recording the total
reward the robot earns.
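A hedged sketch of the five-state find_state() follows, with thresholds
taken from the ranges listed above; the sensor accessors remain the same
assumptions as before.

    # Hypothetical five-state find_state() (assumes the imports from the
    # main.py sketch above).  Thresholds come from the stated ranges.
    FORWARD, CLOSE_OBJECT, MID_OBJECT, FAR_OBJECT, BUMPED = 0, 1, 2, 3, 4

    def find_state():
        if robot.read_bumpers():                 # any bump sensor pressed
            return BUMPED
        distance = robot.read_sonar()            # assumed to return millimeters
        if distance <= 200:
            return CLOSE_OBJECT
        if distance <= 400:
            return MID_OBJECT
        if distance <= 600:
            return FAR_OBJECT
        return FORWARD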
For your fifth learning robot:
- Put together any combination of ideas, either from the above variations or
your own new ideas, that you hypothesize will result
in an effective obstacle avoider.
- Draw whatever inspiration you like from the previous four versions.
- Run three experiments for 200 iterations each, recording the total
reward the robot earns.
Your choice
- Devise a task for your robot that is amenable to being solved by
using reinforcement learning.
- You will need to create a find_state() function and a reward()
function suitable for the task (one possible task is sketched after
this list).
- You will also need to select which actions will be available for
the robot to employ.
- The color sensor can be a valuable source of additional sensor inputs
to create a more interesting task.
- Create at least two variations of the task parameters, and compare
across three experiments the total reward earned with each variation.
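As one illustration of a task that fits this mold, here is a hedged
sketch of a line-seeking task built on the color sensor. The accessor
robot.read_reflectivity(), its 0-100 scale, and the threshold are pure
assumptions.

    # One hypothetical "your choice" task: learn to stay on a dark line
    # (assumes the imports from the main.py sketch above).
    ON_LINE, OFF_LINE = 0, 1

    def find_state():
        return ON_LINE if robot.read_reflectivity() < 20 else OFF_LINE

    def reward(state):
        # Staying on the line pays; wandering costs.  Values are illustrative.
        return 1.0 if state == ON_LINE else -0.5

With a forward move plus left and right turns as the actions, num_states
would be 2 here.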
Questions
- What were the total rewards earned for each of your obstacle-avoiding
robots?
- What type of turn was most effective for your robot, in terms of the total
reward? Was the total reward a good metric for the effectiveness of
the turn?
- What impact did you observe from considering the bump sensors
separately?
- What impact did you observe from considering different sonar
distances separately?
- In general, is the total reward a useful metric for the effectiveness
of an obstacle avoider? If not, should the reward function be modified?
- How suitable was Q-Learning for your chosen task? Support your answer
with observations of its performance, including its total reward.
- What variations of parameters did you attempt for your chosen task?
What was the impact of those variations?
- Generally speaking, when do you think Q-Learning would be preferable
to hand-coding a robot’s behavior? When would hand-coding behavior
be preferable? Support your answer with observations from your
experiments.
- In comparison to our earlier implementations, to what extent does a
Q-Learning robot exhibit intelligence? Support your answer with observations
of your experiments.