MLNC – Machine Learning & Neural Computation – Dr Aldo Faisal
Coursework 1 – Grid World
To be returned via Blackboard as indicated online.
Your coursework should contain: your name, your CID and your degree course at the top of the first page. Your text should provide brief analytical derivations and calculations as necessary inline, so that the markers can understand what you did. Please use succinct answers to the questions. Your final document should be submitted as a single .zip file, containing one single PDF file, in the format of CID FirstnameLastname.pdf (example: 012345678 JaneXu.pdf), and one single .m file, also in the format CID FirstnameLastname.m. Note that, therefore, all code that you have written or modified must be within that one Matlab file. Do not submit multiple Matlab files, and do not modify other Matlab files. Your Matlab script should contain a function that takes no arguments and is called RunCoursework(); it should produce all the Matlab-based results of your coursework (in clearly labelled text and/or figure output). This function should be able to run on its own, in a clean Matlab installation and directory with only the code we provided for the coursework present.
Please additionally paste the same fully-commented Matlab source code in the appendix of your PDF submission. You are allowed to use all built-in Matlab functions and any Matlab functions supplied by the course or written by you.
The markers may subtract points for badly commented code, coding that does not run and coding that does not follow the specifications. Figures should be clearly readable, labelled and visible – poor quality or difficult to understand figures may result in a loss of points.
Your coursework should not be longer than 4 single-sided pages with 2 centimetre margins all around and 12pt font. You are encouraged to discuss with other students, but your answers should be yours, i.e., written by you, in your own words, showing your own understanding. You have to produce your own code. If you have questions about the coursework please make use of labs or Piazza, but note that GTAs cannot provide you with answers that directly solve the coursework.
Marks are shown next to each question. Note that the marks are only indicative.
Specification
[Figure 1: the Grid World. Cells are labelled s1 to s14; two grey cells are walls. State s2 is the Goal state (G, green) and s3 the Penalty state (P, red). An inset shows the action dynamics: the agent moves in the desired direction with probability p, and fails to the left or to the right (perpendicular to the desired direction) with probability (1−p)/2 each.]
This coursework uses the simple Grid World shown in Figure 1. There are 14 states, corresponding to locations on a grid – two cells (marked in grey) are walls and therefore cannot be occupied. This Grid World has two terminal states, s2 (the Goal state, in green) and s3 (the Penalty state, in red).
- The starting state can vary. In each simulation there is an equal probability of starting from one of the states s11, s12, s13, s14 (i.e. there is a 1/4 probability of starting from any of these states).
- Possible actions in this world are N, E, S and W (North, East, South, West), which correspond to moving in the four cardinal directions of the compass.
- The effects of actions are not deterministic: they succeed in moving the agent in the desired direction only with probability p. Otherwise, the agent moves perpendicular to its desired direction, in either adjacent direction, with probability (1−p)/2 each. After the movement direction is determined, if a wall blocks the agent's path the agent stays where it is; otherwise it moves to the corresponding adjacent state. For example, with p = 0.8, an agent at state s5 that chooses to move north will move north to state s1 with probability 0.8; will move east to state s6 with probability 0.1; or will attempt to move west with probability 0.1, in which case it will bang into the wall and come to rest in state s5.
- The agent receives a reward of −1 for every transition (i.e. a movement cost), except for movements ending in state s3 (marked P for penalty) or state s2 (marked G for goal). For transitioning to s3 there is a penalty of −10; for transitioning to s2 there is a reward of 0.
- We provide the code PersonalisedGridWorld.p. It contains the function that sets up the Grid World with probability p of a successful transition and returns the full MDP information. Note that .p files are similar to normal Matlab functions/scripts, but are not human-readable (i.e. you do not/should not edit it).
>> [NumStates, NumActions, TransitionMatrix, ...
       RewardMatrix, StateNames, ActionNames, AbsorbingStates] ...
       = PersonalisedGridWorld(p);
With NumStates being the number of states in the Grid World, and NumActions the number of actions the agent can take. The TransitionMatrix is a NumStates × NumStates × NumActions array of the specified transition probabilities between (first dimension) successor state, (second dimension) prior state, and (third dimension) action. RewardMatrix is the NumStates × NumStates × NumActions array of reward values between (first dimension) successor state, (second dimension) prior state, and (third dimension) action. StateNames is a NumStates × 1 matrix containing the name of each state. ActionNames is a NumActions × 1 matrix containing the name of each action. Finally, AbsorbingStates is a NumStates × 1 matrix specifying which states are terminal.
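The [successor, prior, action] array convention described above can be illustrated on a toy example. The sketch below is in Python rather than Matlab (purely for illustration; the world shown is a hypothetical 2-state, 2-action one, not the coursework grid):

```python
import numpy as np

# Hypothetical 2-state, 2-action world, illustrating the convention used by
# PersonalisedGridWorld: TransitionMatrix[s_prime, s, a] = P(s' | s, a).
S, A = 2, 2
T = np.zeros((S, S, A))
T[:, 0, 0] = [0.9, 0.1]   # from state 0 under action 0: stay w.p. 0.9
T[:, 1, 0] = [0.0, 1.0]
T[:, 0, 1] = [0.2, 0.8]
T[:, 1, 1] = [0.5, 0.5]

# With this layout, the probabilities over successor states s' must sum to 1
# for every (prior state, action) pair, i.e. over the first dimension.
assert np.allclose(T.sum(axis=0), 1.0)

# Probability of landing in state 1 when taking action 1 in state 0:
print(T[1, 0, 1])  # 0.8
```

The same check (columns over the first dimension summing to 1) is a useful sanity test on the TransitionMatrix returned by the provided code.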
The coursework is personalised by your CID number. Throughout the exercise we set p = 0.5 + 0.5 × x/10 and γ = 0.2 + 0.5 × y/10, where x is the penultimate digit of your College ID (CID), and y is the last digit of your CID. If your CID is 876543210 we have x = 1 and y = 0, resulting in p = 0.55 and γ = 0.2.
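The personalisation rule can be checked with a short sketch (shown in Python for illustration, using the example CID from the text):

```python
# Minimal sketch of the personalisation rule: x is the penultimate digit of
# the CID, y the last digit. Using the example CID 876543210 from the text.
cid = "876543210"
x = int(cid[-2])            # penultimate digit -> 1
y = int(cid[-1])            # last digit -> 0
p = 0.5 + 0.5 * x / 10      # success probability of an action
gamma = 0.2 + 0.5 * y / 10  # discount factor
print(p, gamma)             # 0.55 0.2
```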
Questions
Points per question are indicative only. Questions become progressively more challenging.
1. (1 point) State your CID and personalised p and γ (no need to show derivation).
2. (15 points) Assume the MDP is operating under an unbiased policy πu. Compute the value function V^πu(s) for every non-terminal state (s1, s4, ..., s14) by any dynamic programming method of your choice. Report your result in the following format:
State   s1   s4     s5   s6   s7   s8   s9   s10   s11   s12   s13   s14
Value   2    2.54   1    3    ...
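One dynamic programming method that fits this question is iterative policy evaluation. The sketch below (Python for illustration, on a hypothetical 3-state chain rather than the coursework grid) uses the same T[s', s, a] / R[s', s, a] tensor layout as PersonalisedGridWorld and averages over actions, since a uniform ("unbiased") policy gives each action equal probability:

```python
import numpy as np

# Hypothetical 3-state chain (state 2 terminal), 2 actions; illustration only.
S, A = 3, 2
T = np.zeros((S, S, A))
R = np.full((S, S, A), -1.0)     # movement cost of -1 per transition
T[1, 0, 0] = 1.0; T[0, 0, 1] = 1.0   # action 0 moves right, action 1 stays
T[2, 1, 0] = 1.0; T[0, 1, 1] = 1.0
T[2, 2, :] = 1.0                      # terminal state self-loops
R[2, 1, 0] = 0.0                      # reaching the terminal state costs nothing
gamma = 0.9
terminal = np.array([False, False, True])

# Iterative policy evaluation under the uniform policy:
# V(s) = (1/A) * sum_a sum_s' T[s',s,a] * (R[s',s,a] + gamma * V(s'))
V = np.zeros(S)
for _ in range(1000):
    V_new = np.zeros(S)
    for s in range(S):
        if terminal[s]:
            continue                  # terminal states keep value 0
        V_new[s] = np.mean([np.sum(T[:, s, a] * (R[:, s, a] + gamma * V))
                            for a in range(A)])
    if np.max(np.abs(V_new - V)) < 1e-10:   # converged
        V = V_new
        break
    V = V_new
print(V)
```

The repeated Bellman-expectation sweep is a contraction for γ < 1, so the loop converges to the unique V^πu of the toy chain.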
3. (25 points) Assume you are observing the following state transitions from the above MDP: {s14, s10, s8, s4, s3}, {s11, s9, s5, s6, s6, s2}, {s12, s11, s11, s9, s5, s9, s5, s1, s2}.
a. What is the likelihood that the above observed 3 sequences were generated by an unbiased policy πu? Report the value of the likelihood.
b. Find a policy πM for the observed 3 sequences that has a higher likelihood than πu of having generated these sequences. Note that, as not all states are visited by these 3 sequences, you only have to report the policy for visited, non-transient states. Report your result using the following format:
State    s1   s4   s5   s6   s7   s8   s9   s10   s11   s12   s13   s14
Action   N    S    W    ...
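Because only states are observed (not the actions taken), the likelihood of a single transition marginalises over actions: P(s' | s) = Σ_a π(a | s) · T(s', s, a). A hedged sketch (Python for illustration, with a toy hypothetical tensor rather than the coursework one):

```python
import numpy as np

def sequence_likelihood(seq, T, pi):
    """Likelihood of an observed state sequence when actions are hidden.
    seq: list of state indices; T[s_prime, s, a] = P(s'|s,a); pi[a, s] = pi(a|s)."""
    like = 1.0
    for s, s_next in zip(seq[:-1], seq[1:]):
        # marginalise over the unobserved action at each step
        like *= sum(pi[a, s] * T[s_next, s, a] for a in range(T.shape[2]))
    return like

# Toy 2-state, 2-action example with a uniform ("unbiased") policy.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = [0.2, 0.8]
T[:, 0, 1] = [0.9, 0.1]
T[:, 1, :] = [[1.0, 1.0], [0.0, 0.0]]   # state 1 always returns to state 0
pi = np.full((2, 2), 0.5)

# First step: 0.5*0.8 + 0.5*0.1 = 0.45; second step: 1.0; product ≈ 0.45.
print(sequence_likelihood([0, 1, 0], T, pi))
```

The likelihood of several independent sequences is the product of the per-sequence likelihoods; a deterministic policy that always picks the most compatible action raises each factor, which is the idea behind part b.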
4. (39 points)
a. Assume an unbiased policy πu in this MDP. Generate 10 traces from this MDP and write them out. When writing them out, use one line for each trace, with state symbols S1, S4, ..., S14, actions N, E, S, W, and the rewards, in the following format (please make sure we can easily copy and paste these values from the PDF in one go, so that we can paste the text from your PDF into our automatic testing software), e.g.:
S12,W,-1,S11,N,-1,S9,N,-1,S5,N,-1,S1,N,-1,S1,E,0
S14,E,-1,S10,E,-1,S8,W,-1,S7,S,-1,S6,N,0
b. Apply First-Visit Batch Monte-Carlo Policy Evaluation to estimate the value function V̂^πu from these 10 traces alone. Report the value function for every non-terminal state (s1, s4, ..., s14) using the format specified in Question 2.
c. Quantify the difference between V̂^πu obtained in Q4.b and V^πu obtained in Q2 by defining a measure that reports in a single number how similar these two value functions are. Justify your choice of measure. Then plot the value of the proposed similarity measure against the number of traces used: start with the first trace only, then the first and second traces, and so forth. Comment on how increasing the number of traces affects the similarity measure.
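The two steps of Q4.b and Q4.c can be sketched together: first-visit Monte-Carlo evaluation averages, for each state, the discounted return following its first visit in each trace; a root-mean-square error is one possible single-number similarity measure. A hedged Python illustration (the trace format and state names here are hypothetical, not the coursework's):

```python
from collections import defaultdict

def first_visit_mc(traces, gamma):
    """First-visit Monte-Carlo policy evaluation from a batch of traces.
    Trace format (illustrative): [(s0, r1), (s1, r2), ..., (s_T, None)],
    where r_{t+1} is the reward received on leaving state s_t."""
    returns = defaultdict(list)
    for trace in traces:
        states = [s for s, _ in trace]
        rewards = [r for _, r in trace[:-1]]
        # discounted return G_t from each time step, computed backwards
        G, Gs = 0.0, [0.0] * len(rewards)
        for t in range(len(rewards) - 1, -1, -1):
            G = rewards[t] + gamma * G
            Gs[t] = G
        seen = set()
        for t, s in enumerate(states[:-1]):
            if s not in seen:            # first visit of s in this trace only
                seen.add(s)
                returns[s].append(Gs[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

def rmse(V_hat, V_ref):
    """One candidate similarity measure: RMSE over the common states."""
    common = V_hat.keys() & V_ref.keys()
    return (sum((V_hat[s] - V_ref[s]) ** 2 for s in common) / len(common)) ** 0.5

# Tiny hand-made batch on hypothetical states 'a', 'b' (undiscounted).
traces = [[('a', -1), ('b', 0), ('end', None)],
          [('a', 0), ('end', None)]]
V_hat = first_visit_mc(traces, gamma=1.0)
print(V_hat)   # {'a': -0.5, 'b': 0.0}
```

For the plot in Q4.c, the same estimator would be rerun on growing prefixes of the trace batch and the measure evaluated against the DP solution each time.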
5. (20 points)
a. Implement ε-greedy first-visit Monte-Carlo control. Evaluate the learning and control for two settings of ε, namely 0.1 and 0.75. For each setting of ε, plot two types of learning curves:
   - Reward against episodes.
   - Trace length per episode against episodes.
Note: an episode is one complete trace. A trial is many episodes starting from an initialisation of the agent. The learning curves are stochastic quantities; you may need to run a good number of repeated learning experiments to average out the variability across trials. Specify the number of trials and plot the mean ± standard deviation of your learning curves.
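The exploration rule at the heart of ε-greedy control can be sketched in a few lines (Python for illustration; the function name and interface are hypothetical, not part of the provided code):

```python
import random

def epsilon_greedy(Q_s, epsilon, rng=random):
    """Pick an action index for the current state from its action values Q_s:
    with probability epsilon a uniformly random action, otherwise a greedy one
    (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q_s))          # explore
    best = max(Q_s)
    greedy = [a for a, q in enumerate(Q_s) if q == best]
    return rng.choice(greedy)                   # exploit

# With epsilon = 0.1 the greedy action should dominate: it is chosen with
# probability 0.9 + 0.1/3, roughly 93% of the time here.
random.seed(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[epsilon_greedy([1.0, 2.0, 0.5], 0.1)] += 1
print(counts)
```

With ε = 0.75 the same rule explores far more, which is exactly the trade-off the two learning-curve comparisons in the question are meant to expose.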