As an important element of this program, I'm going to have the opponent always play randomly against whatever strategy the learning algorithm puts forward. This makes the transition probabilities to future states easy to compute, and possibly even something that can be worked out analytically.
First, we define the Markov Decision Process (MDP). This is a 5-tuple containing the following information:
- S = Set of States -- These are all of the valid board configurations of the game.
- A = Set of Actions -- This is the set of all valid plays at any given time. Since I'm only controlling one player, this does not consider what the opponent's (random) moves might be. The example of the robot in the video was set up so that the moves were fixed directions that could always be invoked, but I don't see any reason, based on the formulas that are worked out, why this would need to be the case.
- P_{sa} = State Transition Probabilities -- For any given decision there is a probabilistic outcome, and this gives the probability of each outcome given that you are in state s and take action a. (Note: For deterministic situations, we just set the probability to 1 for the event that happens and 0 for everything else.) In the TTT game, the outcomes are probabilistic based on the number of moves available to the opponent.
- gamma = Discount Factor -- Since these are finite-length games, I can set this to 1 without any problems. The reason this number is less than 1 in the video is that it accounts for there being infinitely many steps, each contributing a finite value, whose sum would otherwise not converge.
- R = Reward function -- This is +1 when I win and -1 when I either lose or draw. This is an interesting point in the programming, as setting the reward for a draw to 0 instead might result in a different strategy. But I won't know until I get there. For now, I want to play to win.
- V^{pi}(s) = The expected payoff of playing strategy pi starting from position s. (Note: Since the starting position is an empty board, one might think that this is only needed for that one position. However, the calculations are done in an iterative/recursive manner, so this value will need to be computed at all states.)
- Formula for V^{pi}(s): V^{pi}(s) = R(s) + gamma * \sum_{s'} P_{s,pi(s)}(s') V^{pi}(s') -- These are known as Bellman's equations. For TTT, we have gamma = 1. (A code sketch of this evaluation follows the list.)
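To make this concrete, here's a minimal Python sketch of evaluating V^{pi} by repeatedly applying Bellman's equations. All of the names and data structures here (states, R, P, pi) are placeholders of my own invention, not settled design: P[(s, a)] is assumed to map each successor state s' to its probability, and terminal states are simply absent from pi and P.

```python
# Iterative evaluation of V^pi via Bellman's equations (a sketch).
# Assumed placeholder structures:
#   states: iterable of all states
#   R:      dict mapping state -> reward
#   P:      dict mapping (state, action) -> {successor: probability}
#   pi:     dict mapping state -> action (terminal states omitted)

def evaluate_policy(states, R, P, pi, gamma=1.0, sweeps=50):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # Each sweep rebuilds V from the previous sweep's values;
        # terminal states fall back to an empty distribution, so V(s) = R(s).
        V = {
            s: R[s]
            + gamma
            * sum(p * V[s2] for s2, p in P.get((s, pi.get(s)), {}).items())
            for s in states
        }
    return V
```

Because a TTT game lasts at most nine plies, a small fixed number of sweeps should be enough for these values to converge exactly.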
So this works given a particular strategy pi, which will be initialized as random play. The next challenge comes at the learning stage. The goal is to find V^* = max_{pi} V^{pi}, the optimal value function, and then to find the strategy pi^* that achieves it. There's another Bellman equation for V^*: V^*(s) = R(s) + gamma * max_a \sum_{s'} P_{sa}(s') V^*(s'). The strategy pi^* can be obtained (after V^* is known) by using pi^*(s) = argmax_a \sum_{s'} P_{sa}(s') V^*(s').
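Once V^* is known, the argmax step is mechanical. Here's a sketch under the same made-up structures, with a helper actions(s) that returns the legal moves in s (an empty list for terminal states):

```python
def greedy_policy(states, actions, P, V):
    # Recover a policy from a value function via the argmax over actions.
    pi = {}
    for s in states:
        if actions(s):  # terminal states have no legal moves, so no entry
            pi[s] = max(
                actions(s),
                key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()),
            )
    return pi
```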
Because of the max over actions, the equation for V^* is nonlinear, so unlike the equations for a fixed pi it can no longer be solved in a straightforward manner. This leads to the two algorithms that attempt to approximate it: Value Iteration (VI) and Policy Iteration (PI). Here's the VI algorithm:
- Initialize V(s) = 0 for all s.
- Repeatedly update using V(s) := R(s) + gamma * max_a \sum_{s'} P_{sa}(s') V(s') -- This can be done synchronously (update all of the values at once) or asynchronously (update V(s) for one s at a time). I want to try to program both; a sketch of the synchronous version follows.
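Here's a sketch of the synchronous version, reusing the same placeholder structures as before. The asynchronous variant would write each update back into V immediately instead of building V_new.

```python
def value_iteration(states, actions, R, P, gamma=1.0, sweeps=50):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V_new = {}
        for s in states:
            # Terminal states have no actions and hence no future value.
            best = max(
                (
                    sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions(s)
                ),
                default=0.0,
            )
            V_new[s] = R[s] + gamma * best
        V = V_new  # synchronous: swap in the whole sweep at once
    return V
```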
And here's the PI algorithm:
- Initialize pi randomly.
- Repeat:
- V(s) := V^{pi}(s) for all s -- that is, evaluate the current pi by solving Bellman's equations
- pi(s) := argmax_a \sum_{s'} P_{sa}(s') V(s')
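Putting the two steps together, and reusing the evaluate_policy and greedy_policy sketches from above (still hypothetical scaffolding, not a final design):

```python
def policy_iteration(states, actions, R, P, pi, gamma=1.0):
    while True:
        V = evaluate_policy(states, R, P, pi, gamma)
        new_pi = greedy_policy(states, actions, P, V)
        if new_pi == pi:  # the policy is stable, so we are done
            return pi, V
        pi = new_pi
```

The loop terminates because each improvement step can only raise the value of the policy, and there are only finitely many policies.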
Here's my plan of attack:
- Step 1: Devise an indexing system for all possible games and game states so that I can cycle through them in some useful way. (One possible encoding is sketched after this list.)
- Step 2: Implement synchronous VI
- Step 3: Implement asynchronous VI (this should be easy after having done synchronous VI)
- Step 4: Learn how to solve large systems of linear equations numerically. (numpy's linalg.solve appears to do exactly this; see the sketch after this list.)
- Step 5: Implement PI
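For Step 1, one possible encoding (a guess on my part, not a commitment) is to read the nine cells as a base-3 number, with 0 = empty, 1 = X, and 2 = O:

```python
def board_to_index(board):
    # board is a tuple of 9 cells, each 0 (empty), 1 (X), or 2 (O);
    # the result is a unique integer in range(3**9) == range(19683).
    index = 0
    for cell in board:
        index = 3 * index + cell
    return index
```

Only a fraction of the 19,683 codes correspond to legal positions, but the encoding gives a dense way to enumerate candidates. For Step 4: once pi is fixed, Bellman's equations are linear -- V = R + gamma * P_pi V, i.e. (I - gamma * P_pi) V = R -- and numpy.linalg.solve handles exactly such systems. Here P_pi is the |S| x |S| transition matrix under pi, which I'm assuming has been assembled as a dense array:

```python
import numpy as np

def solve_policy_exactly(R, P_pi, gamma=1.0):
    # Solve (I - gamma * P_pi) V = R for the exact value function V.
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```

Even with gamma = 1 this system is nonsingular for TTT: every move adds a mark to the board, so the transition graph is acyclic, some power of P_pi is the zero matrix, and I - P_pi is therefore invertible.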