Synchronous RTS Engines 2: Sync Harder

July 24, 2011

This is a follow-up to my last post, Synchronous RTS Engines and a Tale of Desyncs, which, much to my surprise, was quite popular! It’s received at least four times as many views as all my other posts combined. Even better were the numerous user comments on the post itself, on Hacker News, and on Reddit. If it generated interesting discussion elsewhere, please let me know, as I’d love to see more reader comments.

For Part 2 I have two goals. First, to answer questions that frequently came up in the various comments sections. Second, to share some of the fantastic replies that came from various readers.

Before all that, let’s recap key points from the first post. If you haven’t read it yet I strongly recommend doing so now. My experience with synchronous engines is from working with the Supreme Commander engine at Gas Powered Games so that’s what I talked about.

If that’s too much to keep in the front lobes at once then I recommend re-reading the original post.

Command Messages

One area I glossed over was how command messages are handled across the network. Luckily, I work with the person who wrote it, William Howe-Lott, so I asked for details! Player input is handled in the form of commands to groups of units. The command (attack, move, stop, etc.) for a group of units is bundled up and sent across the network to all other players. Because the engine is synchronous, the message specifies a Sim Tick slightly in the future on which every player will execute it.

SupCom uses what I shall unscientifically call the AckAck method. Assume four players A, B, C, and D. Player A issues an attack command sometime during Sim Tick 7. Player A sends the command to players B, C, and D to execute on Sim Tick 11. Player B sends an acknowledgement, an Ack, to player A to say that they got it. Player B also sends a message to players C and D (an AckAck, for lack of a better term) to tell them that player B has player A’s input. Players C and D do the same thing. Each player will only process a command when every player in the game has acknowledged that they have it. For example, player B will not execute player A’s command until they know that player C and player D also have the command. It’s a lot of messages, but it works and it shipped.
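To make the bookkeeping concrete, here is a minimal sketch of one player's view of the AckAck rule. It is not SupCom's actual code: the class name, method names, and command ids are all made up for illustration, and real networking (retries, ordering, timeouts) is left out entirely.

```python
from collections import defaultdict


class AckAckPeer:
    """One player's record of who is known to hold each command (hypothetical)."""

    def __init__(self, me, players):
        self.me = me
        self.players = frozenset(players)
        self.commands = {}               # cmd_id -> (issuer, execute_tick, action)
        self.holders = defaultdict(set)  # cmd_id -> players known to hold it

    def on_command(self, cmd_id, issuer, execute_tick, action):
        # Receiving the command proves that the issuer and I both have it.
        self.commands[cmd_id] = (issuer, execute_tick, action)
        self.holders[cmd_id] |= {issuer, self.me}
        # Everyone else gets told: an Ack to the issuer, AckAcks to the rest.
        return sorted(self.players - {self.me})

    def on_ack(self, cmd_id, sender):
        # Another player reports that they hold the command.
        self.holders[cmd_id].add(sender)

    def can_execute(self, cmd_id):
        # Execute only once every player in the game is known to have it.
        return cmd_id in self.commands and self.holders[cmd_id] == self.players
```

From player B's side: after `b = AckAckPeer("B", "ABCD")` and `b.on_command(1, "A", 11, "attack")`, the command is not yet executable. It becomes executable only after both `b.on_ack(1, "C")` and `b.on_ack(1, "D")` have arrived.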

An alternative method, and popular suggestion, would be to process a tick as soon as you have commands from all other players. There’d be no need to wait on Acks or AckAcks. As soon as player B has messages from players A (attack), C (empty), and D (empty) it’s good to go. This could work, but has a nasty edge case. Imagine that player A sends an attack command for Sim Tick 11 to players B and C and then disconnects before the message reaches player D. Player D is stuck while players B and C have processed Sim Tick 11 and moved on to 12. This is recoverable, as player B or C could send player A’s command to player D. This would allow player D to execute Sim Tick 11, and then everyone can boot player A from the game and carry on. It can become pretty messy.
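The edge case is easier to see in a toy model. The sketch below is purely illustrative: `LockstepPeer` and `simulate_drop` are invented names, and messages are delivered by direct function calls rather than over a network.

```python
class LockstepPeer:
    """Processes a tick as soon as it holds a command from every player."""

    def __init__(self, me, players):
        self.me = me
        self.players = frozenset(players)
        self.inbox = {}  # tick -> {player: command, or None for "empty"}

    def receive(self, tick, player, command=None):
        self.inbox.setdefault(tick, {})[player] = command

    def ready(self, tick):
        return set(self.inbox.get(tick, {})) == self.players

    def missing(self, tick):
        return sorted(self.players - set(self.inbox.get(tick, {})))


def simulate_drop():
    """Player A reaches B and C with its tick-11 command, then disconnects."""
    peers = {name: LockstepPeer(name, "ABCD") for name in "BCD"}
    # B, C, and D all exchange (empty) commands for tick 11, own entry included.
    for sender in "BCD":
        for peer in peers.values():
            peer.receive(11, sender)
    # A's attack command only makes it to B and C.
    for target in "BC":
        peers[target].receive(11, "A", "attack")
    return peers
```

After `peers = simulate_drop()`, players B and C report `ready(11)` while `peers["D"].missing(11)` is `["A"]`. Forwarding the command with `peers["D"].receive(11, "A", peers["B"].inbox[11]["A"])` unblocks player D, which is the recovery path described above.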

Disconnect Recovery

Two of the most popular questions were on the topic of recovering from a game disconnect or from a desync. Let’s look at the disconnect case first. Say you are playing a skirmish with one or more friends and your internet connection blips out causing you to get booted. Can you rejoin? How can you get back in sync? There are two obvious approaches.

First, a full resync. Starting from scratch, you need to receive a full copy of the entire state from the other players. This is rather similar to loading a save game file. In SupCom all weapons are simulated projectiles with physics. With multiple massive armies this can lead to managing on the order of 8000 entities. Some forum searching tells me that save files for SupCom 1 were in the 50–200 MB range uncompressed. Multiplayer games may be even larger. Compressed size is on the order of tens of megabytes. Keeping in mind 2006 user upload speeds (upload, not download), the time to transfer state will average a few minutes, with many users taking up to half an hour if not longer.
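As a back-of-the-envelope check on those transfer times, here is the arithmetic as a tiny function. The 30 MB payload and the two upload rates in the usage note are illustrative assumptions, not measured figures.

```python
def transfer_minutes(size_mb, upload_kbit_per_s):
    """Minutes needed to push size_mb megabytes through an upload link."""
    bits = size_mb * 1_000_000 * 8                 # megabytes -> bits
    seconds = bits / (upload_kbit_per_s * 1_000)   # kbit/s -> bit/s
    return seconds / 60
```

Under those assumptions, `transfer_minutes(30, 1000)` gives 4 minutes on a 1 Mbit/s uplink, and `transfer_minutes(30, 128)` gives a little over 31 minutes on a 128 kbit/s uplink.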

Second, re-sim the game. This would be similar to playing a replay file. Reconnect to the server, receive all input commands, and run the game from scratch until you catch up. The input data is tiny so that’s good. You can also ignore the user layer and run the sim layer as fast as possible (keeping a per frame “time delta” of 100ms) which helps. Worst case scenario is still awful. For a game two hours deep it will take many users a full hour to resimulate.
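A minimal sketch of that catch-up loop, assuming the sim is a deterministic step function driven only by recorded commands. The `toy_step` function and its state fields are invented stand-ins for a real sim; the only detail taken from the engine is the fixed 100ms time delta.

```python
SIM_DT_MS = 100  # fixed time delta per Sim Tick


def toy_step(state, commands, dt_ms):
    """Hypothetical stand-in for the real sim: one unit moving along a line."""
    for name, value in commands:
        if name == "set_speed":
            state["speed"] = value  # distance units per second
    # Integer math keeps this toy sim trivially deterministic.
    state["pos_milli"] += state["speed"] * dt_ms
    state["tick"] += 1


def resimulate(command_log, target_tick, step=toy_step):
    """Replay recorded inputs from tick 0 with no rendering and no sleeping."""
    state = {"tick": 0, "pos_milli": 0, "speed": 0}
    while state["tick"] < target_tick:
        commands = command_log.get(state["tick"], [])
        # The delta is always 100 ms, no matter how fast the loop runs.
        step(state, commands, SIM_DT_MS)
    return state
```

The loop never looks at the wall clock, so it runs as fast as the CPU allows while still producing the same state as the original game, provided the step function itself is deterministic.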

A third option I haven’t put much thought into would be to keep the full game state when getting disconnected and use that as a base. I don’t think a delta-based resync from that point would be useful. Given the amount of data devoted to short-lifespan projectiles and units that continuously move and die, the savings wouldn’t be amazing. Running the sim from the disconnect point, however, would help a lot. The game could be saved on disconnect and reloaded on reconnect, leaving only a few minutes of Sim Ticks to catch up on.

All of these methods, however, are far, far easier said than done. I have a question for the readers. How many RTS games have supported reconnections? I do not believe any Blizzard game has thus far. I’m sure others have, but I can’t name any.

Desync Recovery

Now to discuss desync recovery. It’s very much an extension of disconnect recovery so everything stated above applies. How could we implement it?

First thought is to find the offending bits of data and fix them. For a two user game it’s impossible to know which user has the correct state. For multiple users the incorrect user may be obvious. The issue with desyncs is that the bug is always there, it just shows up some of the time. If any user has desync’d it’s likely all users have as well. Worst case would be all users desync’d in a different manner. There is no correct state! Even worse would be two users who performed an undefined operation (using deleted memory) but did so in the same way! Technically they aren’t desync’d, but the game is possibly “wrong”. Desyncs are evil sons of bitches.
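To illustrate why the two-user case is hopeless while larger games at least offer a suspect, here is a sketch of a majority vote over per-player state checksums for a single Sim Tick. The function name and the idea of voting are assumptions for illustration, not how any shipped engine does it.

```python
from collections import Counter


def find_desynced(checksums):
    """checksums maps player -> state hash for the same Sim Tick.

    Returns (agreed_hash, suspects). agreed_hash is None when no strict
    majority exists, in which case every player is a suspect.
    """
    counts = Counter(checksums.values())
    top_hash, top_count = counts.most_common(1)[0]
    if top_count * 2 <= len(checksums):
        return None, sorted(checksums)
    suspects = sorted(p for p, h in checksums.items() if h != top_hash)
    return top_hash, suspects
```

With two players who disagree there is no majority, so both are suspects. With three players and one odd hash, the odd one out is flagged. Keep in mind that a majority only tells you who agrees, not who is correct.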

At this point we’re quickly going down the rabbit hole. No matter what you do there will always be unresolvable scenarios. Writing a recovery mechanism would take a huge amount of time to fix an issue that frankly shouldn’t have happened to begin with. Man-months of time could be devoted to developing a recovery system, or you could just fix the desyncs.

Reader Highlights

Numerous game devs from other projects posted about their games working in a similar manner for multiplayer. Little Big Planet 1/2, Commandos, Praetorians, Command & Conquer, Age of Empires, Halo Wars, Starcraft, Warcraft, Madden/NCAA football, Halo co-op/firefight, and many more. There were also some posts discussing the pain required to create a deterministic game for replay purposes, particularly for physics-based games. I apologize to all devs who had nightmare flashbacks due to my post. Sorry about that. :)

Spring RTS is an open source RTS engine that I had never heard of. It’s quite cool looking. Support for online or LAN, massive numbers of units, giant maps, and other neat stuff. It even appears to support join in progress via re-simulation.

Back in 2001 Paul Bettner and Mark Terrano wrote an article similar to mine for Age of Empires, “1500 Archers on a 28.8: Network Programming in Age of Empires and Beyond.” Their implementation follows the same core ideas as SupCom. It’s an old article but highly relevant even today. It goes more in-depth than I did, so if you are working on a synchronous type game I highly recommend giving it a read.

My favorite comment, by far, is the following. “Posts like this make me want to switch fields again! Video games can have so many interesting problems to solve.” So very, very true. If there is anything video games do not have a shortage of it would be interesting problems to solve. :)