Sunday, January 17, 2016

Reinforcement Learning for Tic-Tac-Toe

This post is purely theoretical. I want to try to frame the idea of reinforcement learning for Tic-Tac-Toe (TTT). The setup here is based on Lecture 16 of the CS 229 Machine Learning course: https://www.youtube.com/watch?v=RtxI449ZjSc.

One important element of the setup: I'm always going to make the opponent play randomly against whatever strategy the learning algorithm puts forward. This makes the calculation of future states easier, and possibly something that can be solved analytically.

First, we define the Markov Decision Process (MDP). This is a 5-tuple containing the following information:
  • S = Set of States -- These are all of the valid configurations of games.
  • A = Set of Actions -- This would be the set of all valid plays at any given time. Since I'm only controlling one player, this does not include the opponent's (random) moves. The example of the robot in the video was set up so that the moves were fixed directions that could always be invoked, but I don't see any reason why that would need to be the case based on the formulas that are worked out.
  • P_{s_a} = State Transition Probabilities -- For any given decision, the outcome is probabilistic, and this gives the probability of each outcome given that you are in state s and take action a. (Note: For deterministic situations, we just set the probability to be 1 for the event that happens and 0 for everything else.) In the TTT game, the outcomes are probabilistic based on the number of moves available to the opponent; a sketch of how these pieces could be computed follows this list.
  • gamma = Discount Factor -- Since these are finite-length games, I can set this to be 1 without any problems. The reason this number is less than 1 in the video is that it accounts for there being infinitely many steps, each with finite value, which would otherwise lead to nonconvergent sums.
  • R = Reward function -- This is +1 when I win and -1 when I either lose or draw. This is an interesting point in the programming, as setting the reward to 0 for a draw might result in a different strategy. But I won't know until I get there. For now, I want to play to win.
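To make these pieces concrete, here is a minimal sketch (in Python) of how S, A, P_{s_a}, and R might be encoded. The board representation (a 9-tuple with 0 = empty, 1 = my X, 2 = the opponent's O) and all of the function names are just placeholder choices on my part, not anything final; the transition function bakes in the random opponent described above.

```python
# Hypothetical encoding: a board is a 9-tuple, 0 = empty, 1 = my X, 2 = opponent's O.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 1 or 2 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def is_terminal(board):
    """The game is over when someone has won or the board is full."""
    return winner(board) != 0 or 0 not in board

def R(board):
    """Reward on states: +1 when I win, -1 for a loss or a draw, 0 mid-game."""
    if winner(board) == 1:
        return 1
    if is_terminal(board):
        return -1
    return 0

def actions(board):
    """A: the valid plays for me, i.e. the indices of the empty cells."""
    return [i for i, c in enumerate(board) if c == 0]

def transitions(board, a):
    """P_{s_a}: a list of (probability, next_state) pairs after I place an X at
    cell a and the opponent replies uniformly at random among the empty cells."""
    after_me = list(board)
    after_me[a] = 1
    after_me = tuple(after_me)
    if is_terminal(after_me):
        return [(1.0, after_me)]            # deterministic: probability 1 for the game-ending state
    empties = [i for i, c in enumerate(after_me) if c == 0]
    out = []
    for o in empties:
        nxt = list(after_me)
        nxt[o] = 2
        out.append((1.0 / len(empties), tuple(nxt)))
    return out

def reachable_states():
    """S: every board that can occur with me to move (plus terminal boards),
    found by searching forward from the empty board."""
    start = (0,) * 9
    seen, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        if is_terminal(s):
            continue
        for a in actions(s):
            for _, s2 in transitions(s, a):
                if s2 not in seen:
                    seen.add(s2)
                    frontier.append(s2)
    return seen
```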
I will need a couple definitions and formulas here to describe the calculations that are about to happen:
  • V^{pi}(s) = The expected payoff of playing strategy pi starting from position s. (Note: Since the starting position is an empty board, one might think this value is only needed for that one position. However, the calculations are done in an iterative/recursive manner, and so this value will need to be calculated at all states.)
  • Formula for V^{pi}(s): V^{pi}(s) = R(s) + (gamma) * \sum_{s'} P_{s,pi(s)}(s') V^{pi}(s') -- These are known as Bellman's equations. For TTT, we have gamma = 1.
Bellman's equations give us a linear system of equations that we could actually solve analytically. For the TTT game, a completely filled board amounts to picking the locations of 5 Xs out of the 9 positions and letting the rest be Os, so there are at most C(9,5) = 9!/(5! * 4!) = 126 full boards; counting the partially filled boards reached along the way, every game state is one of at most 3^9 = 19683 assignments of X, O, or blank to the nine cells, and far fewer are actually reachable in legal play. Either way, the state space is small enough that this should turn out to be an explicit calculation that I should be able to do. I think this will be my first project. The challenge seems to be indexing the possible game states in a useful way. I need to ensure that play does not go beyond the point at which someone has won. There also seem to be some problems with the precise order of play. For example, if after three moves the board is XOX across the top row, there are two distinct ways that could have come about depending on the order in which X made his moves, but this game state has the same value either way. (I think this may be solved by coming up with a useful indexing of the game states.)
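As a sketch of what that analytic solution could look like: for a fixed strategy pi, Bellman's equations are linear, so they can be stacked as (I - gamma * P_pi) V = R and handed to a linear solver. This assumes the placeholder state/action/transition/reward functions from the sketch above, with pi stored as a dict from state to action; numpy.linalg.solve does the actual work.

```python
import numpy as np

def evaluate_policy(states, pi, transitions, R, is_terminal, gamma=1.0):
    """Solve Bellman's equations V = R + gamma * P_pi V exactly, as the linear
    system (I - gamma * P_pi) V = R. pi maps each non-terminal state to a move."""
    states = list(states)
    idx = {s: i for i, s in enumerate(states)}        # index the game states
    A = np.eye(len(states))
    b = np.array([R(s) for s in states], dtype=float)
    for s in states:
        if is_terminal(s):
            continue                                   # terminal states: V(s) = R(s)
        for p, s2 in transitions(s, pi[s]):
            A[idx[s], idx[s2]] -= gamma * p
    V = np.linalg.solve(A, b)                          # one exact linear solve
    return {s: V[idx[s]] for s in states}
```

Even with gamma = 1 the system should have a unique solution here, because every TTT game reaches a terminal state.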

So this works given a particular strategy pi, which will be initialized as random play. The next challenge comes at the learning stage. The goal is to find V^* = max_{pi} V^{pi}, which is the optimal value function, and then to find the strategy pi^* that achieves that. There's another Bellman equation for V^*: V^*(s) = R(s) + (gamma) * max_a \sum_{s'} P_{s_a}(s') V^*(s'). The strategy pi^* can be obtained (after V^* is known) by using pi^*(s) = argmax_{a} \sum_{s'} P_{s_a}(s') V^*(s').
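The argmax step is a direct computation once the values are known. A small sketch, again using the placeholder pieces from the earlier sketches (it works for V^* or for any other value function stored as a dict):

```python
def greedy_policy(states, actions, transitions, V, is_terminal):
    """pi(s) = argmax_a sum_{s'} P_{s_a}(s') V(s'), computed state by state."""
    pi = {}
    for s in states:
        if is_terminal(s):
            continue                                   # no move to choose in a finished game
        pi[s] = max(actions(s),
                    key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
    return pi
```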

Because of the max over actions, the formula for V^*(s) is no longer a linear system, so it cannot be solved in the same straightforward manner. This leads to the two algorithms that attempt to approximate it: Value Iteration (VI) and Policy Iteration (PI). Here's the VI algorithm:
  • Initialize V(s) = 0 for all s.
  • Repeatedly update using V(s) := R(s) + (gamma) * max_a \sum_{s'} P_{s_a}(s') V(s') -- This can be done synchronously (update all at once) or asynchronously (update V(s) for one s at a time). I want to try to program both; a sketch of the synchronous version follows this list.
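Here is that minimal sketch of the synchronous version, under the same placeholder assumptions as the earlier sketches (the stopping tolerance is an arbitrary choice of mine); the asynchronous version would simply write each new value straight back into V inside the loop instead of building new_V.

```python
def value_iteration(states, actions, transitions, R, is_terminal,
                    gamma=1.0, tol=1e-10):
    """Synchronous VI: start from V = 0 and repeatedly apply the Bellman update
    to every state at once until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            if is_terminal(s):
                new_V[s] = R(s)                       # no actions left, just the reward
            else:
                new_V[s] = R(s) + gamma * max(
                    sum(p * V[s2] for p, s2 in transitions(s, a))
                    for a in actions(s))
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            return V
```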
And here's the algorithm for Policy Iteration:
  • Initialize pi randomly.
  • Repeat:
    • V := V^{pi} (i.e., solve Bellman's equations for the current pi)
    • pi(s) := argmax_a \sum_{s'} P_{s_a}(s') V(s')
This one seems a bit more difficult to program because I need to actually solve Bellman's equations to get V^{pi}. This means solving the large system of equations that I alluded to above. But in theory, that's it.
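As a sketch, that loop could reuse the evaluate_policy and greedy_policy sketches from above, so the large system of equations gets solved once per iteration:

```python
import random

def policy_iteration(states, actions, transitions, R, is_terminal, gamma=1.0):
    """PI: evaluate the current pi exactly, make it greedy, repeat until it stops changing."""
    # Initialize pi randomly, as in the algorithm above.
    pi = {s: random.choice(actions(s)) for s in states if not is_terminal(s)}
    while True:
        V = evaluate_policy(states, pi, transitions, R, is_terminal, gamma)   # solve Bellman's equations
        new_pi = greedy_policy(states, actions, transitions, V, is_terminal)  # greedy update
        if new_pi == pi:
            return pi, V
        pi = new_pi
```

With the first sketch's enumeration, the whole pipeline would then be something like policy_iteration(reachable_states(), actions, transitions, R, is_terminal).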

Here's my plan of attack:
  • Step 1: Devise an indexing system for all possible games and game states so that I can cycle through them in some useful way.
  • Step 2: Implement synchronous VI
  • Step 3: Implement asynchronous VI (this should be easy after having done synchronous VI)
  • Step 4: Learn how to solve large systems of equations numerically. (numpy.linalg.solve looks like it does exactly this, as in the sketch above, but I'll need to check.)
  • Step 5: Implement PI
And this would take me to the first landing point for machine learning TTT. One "flaw" of this that I can see is that Step 1 is actually where all the real work lies. After telling the program exactly what's going to happen, it can figure out how to play to win. But eventually I want to do something where I don't enumerate all of the possibilities first. I'll have to think about how I can create some sort of game simulator. And then maybe set two different programs to play against each other and learn against each other. But that's a couple steps down the line.
