In August 2009, Sergey Karakovskiy and Julian Togelius held a competition on behalf of the ICE-GIC (IEEE Consumer Electronics Society Games Innovation Conference) for contestants to build a soft-bot that played Super Mario.
The Super Mario engine they used was a modified version of one built in Java by the one and only Markus Persson, a.k.a. “Notch” of Minecraft fame. The engine is named Infinite Mario Bros., and is still playable online for free. Entrants into the competition had to build an agent that would tell the engine Mario’s next move every 40 milliseconds.
The competition was held in 2009, and again in 2010. Robin Baumgarten won this competition with his bot, AStarAgent, both years. I stumbled upon this competition in late August 2010 and was immediately intrigued. I had wanted to build an agent that played a known game (any game, really) for some time. A modern first-person shooter like Halo or an RTS like Starcraft would be mind-blowing, exciting, and most of all, recognizable to family and friends.
(Warning, this is a very long article!)
Fortunately I was starting some research as an undergrad at the University of Dayton, but was lukewarm on my current idea: building a LEGO Mindstorms bot to find its way through a maze using genetic algorithms. Here, with Mario, I saw a bigger and better opportunity. I was going to build a Mario bot that was better than the best one, and do it with a Genetic Algorithm.
What is a Genetic Algorithm? Well, the entire field of Evolutionary Computation is relatively new, even for the field of Computer Science. Essentially, Evolutionary Computation attempts to replicate, in software, portions of the biological phenomenon of evolution. If complex, intelligent beings like humans evolved out of some random primordial goop, then why couldn’t an adequate and capable program evolve out of some random operators?
The subset of Evolutionary Computation known as Genetic Algorithms, or GAs, is, in its simplest form, just evolution in code. Automated trial and error, if you will. Start with a random population of chromosomes, use natural selection to choose the best ones, mate them, produce children, introduce mutations into the children’s DNA, use natural selection to choose the best children, rinse and repeat. That is virtually the entire logic behind a Genetic Algorithm, and it’s not extraordinarily difficult to implement in code, either.
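That loop really can fit in a few lines. Here is a minimal, generic sketch; the real-valued genome, the truncation-style selection, and the parameters are all illustrative, not the specifics of my Mario GA:

```python
import random

def evolve(pop_size, genome_len, fitness, generations, mutation_rate=0.05):
    """Minimal generational GA: select the best, mate, mutate, repeat."""
    # Start with a random population of real-valued chromosomes.
    pop = [[random.random() for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # "Natural selection": keep the fitter half as potential parents.
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)      # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(genome_len):                # random point mutation
                if random.random() < mutation_rate:
                    child[i] = random.random()
            children.append(child)
        pop = children
    return max(pop, key=fitness)
```

With a toy fitness like `sum` (bigger genes are better), a few dozen generations are enough to push the best chromosome well above a random one.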
The hard part is this thing we’re calling “natural selection”. In the real world, it’s obvious: if you can’t survive, your genes don’t get passed on to the next generation; you are “selected out”. In code it’s a little different. A programmer needs a method for testing a member of the population to determine that member’s fitness level, and what makes a member “fit” isn’t always abundantly clear. Sure, if you’re trying to write a GA to evolve an organism to walk, then you can objectively say “If organism A is able to walk farther than organism B, organism A is more fit.” But it’s not always that easy. For example, if you were writing a GA to evolve a population of notes in various keys into pleasant music, first you need to be able to define what “pleasant” music is. How do you write code that says “Yup, this music is pleasant” or “Nope, this music is discordant”?
Luckily for me (and Mario), one fitness function stands out: if Mario A can travel farther in a level than Mario B, then Mario A is better. Of course, my eventual fitness function is slightly more complex, and on top of that, this type of fitness function complicates things.
Mankind’s history of evolution goes back billions of years, through billions of generations of organisms. If I want to write a GA that runs even 100,000 generations, each generation should be evaluated pretty quickly, right? If one generation takes 2 full seconds to evaluate, then just 100K generations will take me just over 2 days. If my fitness function for Mario is directly related to how far the bot can travel in its world, we’ve got some issues.
First off, Mario’s world may appear simple to us, but it’s way more complex than something like a sliding-tiles game. The bot needs to think as much as it can during those 40 milliseconds. Since Mario can move forward, move backward, duck, jump, optionally “go faster”, and do many of these things in combination, he has more than a handful of possible moves to choose from at any one time. Plus, he needs to account for Koopas, Piranha Plants, and bottomless pits to top it off. How can a bot possibly make an intelligent decision given all these variables?
Robin’s bot used a sort of A* search, hence the name AStarAgent, so that seemed like a pretty good starting point. Lucky for me, I showed up to the Mario party late, and the competition had already run. This meant that the source code for all the bots had been released, and I could actually go dissect Robin’s bot to see what made him tick and how I could improve it.
By Robin’s own admission, his bot was hastily put together, and written for speed, not elegance. And it showed. After spending weeks poring over his code, slowly figuring out what everything meant and how it all worked together, I threw Robin’s entire A* search out. Yup, sometimes you’re better off working with something you understand as if you had written it (because you DID write it!).
I didn’t throw everything out, however. What I think made Robin’s bot better than other similar ones was the accuracy of his agent’s internal representation of the world.
Some Mario bots were neural nets that spewed out decisions based on the matrix of input, some used hand-written algorithms specific to Mario, but the better ones kept an internal representation and ran some sort of search on the information available to discover the best move available.
Boiled down to its basics, an A* search works as follows:
- Start with some sort of starting state (the root of a tree, or in this case, Mario’s current location).
- Generate all possible children states from the starting state, evaluate them, and place them on a list (the OPEN list).
- Take the best child from the OPEN list, and generate its possible children states. Evaluate those children, and place them on the OPEN list, too.
- Continue removing the best child from the OPEN list until you find a solution, run out of time, or you have exhausted all possibilities (no solution).
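The steps above can be sketched with a priority queue standing in for the OPEN list. The grid world and unit step costs in the example are just a toy stand-in for Mario’s state space:

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Best-first search over an OPEN list ordered by f = g + h."""
    open_list = [(heuristic(start), 0, start, [start])]  # (f, g, node, path)
    closed = set()
    while open_list:
        _, g, node, path = heapq.heappop(open_list)  # take the best child
        if node == goal:
            return path                              # found a solution
        if node in closed:
            continue
        closed.add(node)
        for nxt, cost in neighbors(node):            # generate children states
            if nxt not in closed:
                heapq.heappush(
                    open_list,
                    (g + cost + heuristic(nxt), g + cost, nxt, path + [nxt]))
    return None                                      # OPEN exhausted: no solution
```

For Mario, `neighbors` would be the simulated outcome of each button combination and the heuristic would come from the SEF discussed below; the toy grid just shows the mechanics of the OPEN list.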
An implementation is a little more complex than that, but you get the idea. The hardest part about writing an A* search, just like for our GA, is this function that evaluates how “good” a state is. In a GA we’d call the function the fitness function. In the A* search we’ll call it the Static Evaluation Function (SEF).
So Robin’s bot used an A* search with an SEF that favored states where Mario was not injured and had the most potential to move forward. Essentially, states that were farther right and moving faster across the screen were favorable, so long as Mario wasn’t taking damage.
Well, crud, that’s a pretty good static evaluation function, in theory at least. Robin had placed some arbitrary, no doubt empirically discovered weights on the values of speed, position, and health. The first improvement I saw, outside of rewriting his code to run more smoothly, was to improve these weights. But why stop there?
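To make the weighting concrete, here is a sketch of what a linear SEF over such features might look like. The feature names and weights are hypothetical illustrations, not Robin’s actual values or mine:

```python
def evaluate_state(state, weights):
    """Score one simulated Mario state as a weighted sum of features.
    `state` is a dict of hypothetical world features; higher is better."""
    w_pos, w_speed, w_height, w_damage = weights
    return (w_pos * state["x"]              # progress to the right
            + w_speed * state["x_speed"]    # forward momentum
            + w_height * state["height"]    # staying above enemies and pits
            - w_damage * state["damage"])   # penalize getting hit
```

Once the SEF has this shape, “improving the weights” just means searching over the `weights` tuple, which is exactly what a GA is good at.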
In Mario, oftentimes if you can stay really high in the air, then you don’t have to worry about enemies or holes in the ground. What if Mario valued a mix of vertical and horizontal position? What if Mario made an effort to grab a Koopa Troopa’s shell so that he could release it later to remove enemies in his path? What if Mario allowed himself to take some damage so long as he didn’t die and it made him move faster through the level? These are all good thoughts, but how do I weight these things against each other? It seemed Robin had discovered his weights through trial and error. Not only did I not want to spend the time running through trial and error myself, but with the possible combinations of all the weights I was thinking of, it could take years! Thus the motivation to automate it.
I said before that you could think of GAs as automated trial and error, and that’s exactly what I needed. When I rewrote Robin’s A* algorithm, I needed to discover a better SEF, and what better way to do that than through evolution? Before I continue, I should probably make it abundantly clear exactly how this works:
- Mario receives a percept of his surroundings from the game engine.
- Mario decides on his next course of action and tells the engine his next action.
- The engine steps forward and gives Mario a new percept.
- Repeat until Mario dies or finishes the level.
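The four steps above amount to a simple percept-act loop. The `engine` and `agent` interfaces below are stand-ins for the real competition API, which works the same way, one 40 ms tick at a time:

```python
def run_episode(engine, agent, max_ticks):
    """Drive one bot through one level: percept in, action out, repeat."""
    percept = engine.reset()
    for _ in range(max_ticks):
        action = agent.decide(percept)        # step 2: agent picks a move
        percept, done = engine.step(action)   # steps 3: engine advances a tick
        if done:                              # step 4: Mario died or finished
            break
    return engine.distance_travelled()        # raw material for the fitness
```

The returned distance is what the outer GA will use to score whichever SEF the agent was carrying.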
In step 2, when Mario is deciding his next course of action, he is running an A* search, using his SEF to tell him which states are good and which are not. A Genetic Algorithm to evolve the SEF complicates things, turning the sequence into this:
- GA gets the next SEF from a population of SEFs waiting to be tested, and inserts the SEF into the Mario bot.
- GA tests the Mario bot (steps 1-4 above)
- GA records the evaluated SEF and its fitness, and returns it to the population for selection.
So you can see that the GA encompasses the game. Since the fitness of each SEF depends on how well it causes Mario to play a level, testing just one SEF will be tedious. That is, just one generation of five SEFs causes a bot to run through a level 5 times. Tedious and time consuming, right?
Right. If this is the route I was going to take (and I was), I needed to make sure that the evaluation of each Mario bot took as little time as possible. A normal length level could take more than 20 seconds to complete if the bot performed flawlessly, and if the bot simply stood still, then one evaluation would be over a minute long. Let’s see, 40 seconds per evaluation, 5 bots per generation, 10,000 generations, comes to… just over 23 days. Better computing power wasn’t going to help me here, I needed to make sure that I had a short, yet challenging test level, and a small time limit. Eventually I got down to a level that an optimal bot could complete in about 13 seconds. Running just 800 generations with a time limit capped at 25 seconds, the simulation ran for roughly 24 hours.
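The arithmetic behind those estimates is worth writing down, since it drove every decision about level length and time limits:

```python
def runtime_days(secs_per_eval, bots_per_gen, generations):
    """Sequential GA cost: every SEF needs its own full playthrough."""
    return secs_per_eval * bots_per_gen * generations / 86_400  # seconds per day
```

At 40 seconds per evaluation, 5 bots per generation, and 10,000 generations this gives about 23.1 days; capping each evaluation at 25 seconds and stopping at 800 generations bounds the run at roughly 28 hours, and since most bots finish (or die) well before the cap, the actual run came in around 24 hours.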
And the results were… interesting. There were roughly three types of bots: bots that didn’t get anywhere, bots that made it to some random point near the middle, and bots that finished fairly quickly. A typical distribution, right? Well, yeah, except that this was the case for just about every generation; it wasn’t just a snapshot of early evolution. That is, bots in the very first few generations showed this kind of distribution (which could be expected, since they start out random), but bots in the last few generations showed it too.
Why is that? Well, I’ve decided it has to do with how I structured my genetic algorithm. In general, you want a genetic algorithm to converge to one or a few solutions, and rarely perform worse (I can’t say never, because random mutations can always result in an unfit member). I, however, did not write my program to converge so much as to explore more possible bots, given that I only had 800 generations to do it.
One way to perform the “mating” stage of a genetic algorithm is to allow every member in the population a chance to be chosen as a parent. However, these chances are weighted against each member’s score compared to the entire population. If the total population score is 100, and a member has a score of 10, that member has a 10% chance of being chosen. Likewise, a member with a score of 1 has a 1% chance to be chosen.
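That scheme is usually called roulette-wheel, or fitness-proportionate, selection, and it can be sketched like this (the `(score, member)` tuple representation is just for illustration):

```python
import random

def roulette_select(population):
    """Fitness-proportionate selection: a member holding 10% of the
    total score is picked roughly 10% of the time."""
    total = sum(score for score, _ in population)
    pick = random.uniform(0, total)   # spin the wheel
    running = 0.0
    for score, member in population:
        running += score
        if pick <= running:
            return member
    return population[-1][1]          # guard against float round-off
```

Over many spins, a member with 90 out of 100 total points should be chosen about nine times out of ten.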
What I did was disallow identical members with identical scores in my population. That is, if a newly created static evaluation function was identical to one already in my population and evaluated to the same score, it was thrown out. Since fitter SEFs are chosen more often, I foresaw that more and more identical high-scoring SEFs would be introduced to the population and the GA would quickly converge; that’s normal, but I favored more searching over convergence.
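The duplicate filter boils down to a single check before a child enters the population (again, the `(score, genome)` tuples here are illustrative):

```python
def try_insert(population, score, genome):
    """Reject a child whose genome AND score both duplicate an existing
    member, sacrificing faster convergence for broader exploration."""
    if any(s == score and g == genome for s, g in population):
        return False                      # exact duplicate: throw it out
    population.append((score, genome))
    return True
```

Note that a duplicate genome with a *different* score (possible when evaluation is noisy) still gets in; only exact genome-and-score matches are dropped.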
A future version of this program will run on a Beowulf cluster to take advantage of parallel computing; that way I can evaluate multiple SEFs simultaneously instead of sequentially. Some tweaks to the GA itself (like allowing identical members in my population) are planned as well.
After the first run of the GA, I ended up with some fairly well-performing bots. But they still weren’t better than Robin Baumgarten’s. I couldn’t accept defeat, so I made another, longer level, took the top 5 bots from the first GA, and used them as the starting population for round two. This time I only ran for 320 generations, as the level was nearly 3 times as long as the first one, and the execution took almost 50 hours!
The results from this run were impressive. The program finally evolved a bot or two that could best Robin’s on a few levels. Some manual tweaking later, and I have a bot that always performs as well as or better than Robin’s!
The video at the top of the article shows my bot running through a fairly lengthy level. The biggest difference between my bot and Robin’s is that mine will gladly take some damage just to beat the level a little faster.
As this article is running a little long, I’ll cut it off here. If you read this far, good for you! If there’s enough interest, I’ll post a follow-up article on any aspect of the Mario bot I didn’t cover here (of which there are many). I’m glad to answer any questions as well. Since you got this far, here is a funny picture: