Deterministic Simulation Testing in Go with synctest // Guido Battiston

Concurrency bugs are miserable to test for. A test passes on your machine, fails once in CI, and you can’t reproduce it because it lived in some interleaving the scheduler picked that one time. Deterministic Simulation Testing (DST) is one way out of that. The idea is to take the things in your program that vary between runs and funnel them through a single bus, one operation at a time, all driven by a seed. Fix the seed and you get the same run back, so a failure stops being a fluke and becomes something you can replay and debug properly.

In practice that means getting a handle on four things: time, randomness, network and the scheduling of concurrent work. The first three are just regular interfaces you wrap and route through the bus, more or less the same in any language. Scheduling is the hard one, and it is where Go has historically been awkward due to a few inherent language constraints. There are both commercial and open source solutions for implementing or retrofitting DST onto your project. However, since they depend heavily on the language and tech stack, the implications for general architecture and coding discipline can be significant. A recent addition to the standard library, testing/synctest, changes that.

What do we want to achieve?

Only one goroutine performs work at every single point in time
A single scheduler picks ops deterministically using a fixed seed
The delta between prod and test code is as small as possible

The easy part: interface injection

At its core DST is just wrapping the interfaces whose behavior differs across runs. Randomness is the trivial case, you hand around a rand.New(rand.NewSource(seed)) instead of the global functions. Network is the one worth showing, since hijacking it without touching call sites is the non-obvious part:

resp, err := http.Get("http://server/users/42")

To make it deterministic you swap the transport’s dialer. Every call site stays the same:

client := &http.Client{
    Transport: &http.Transport{
        DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
            return bus.Dial(ctx, addr)
        },
        DisableKeepAlives: true, // optional: a fresh connection per request
    },
}
resp, err := client.Get("http://server/users/42")

The net.Conn returned from bus.Dial is our own implementation, and it routes ops to the scheduler. Write turns the bytes into an op, pushes it onto the bus, and parks until the scheduler delivers it; Read wakes when the scheduler hands over an op addressed to this conn. As a sketch:

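func (c *busConn) Write(b []byte) (int, error) {
    // io.Writer must not retain b, so the op gets its own copy.
    body := append([]byte(nil), b...)
    c.bus.Push(&Op{From: c.from, To: c.to, Body: body}) // park until the scheduler delivers it
    return len(b), nil
}

func (c *busConn) Read(b []byte) (int, error) {
    if len(c.rest) == 0 {
        c.rest = <-c.inbox // wakes when an op addressed here arrives
    }
    n := copy(b, c.rest)
    c.rest = c.rest[n:] // c.rest (an assumed []byte field) keeps what didn't fit in b
    return n, nil
}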

On the server side you hand the matching listener to a normal http.Server:

srv := &http.Server{Handler: mux}
srv.Serve(bus.Listener("server"))

Because everything rides on net.Conn, the standard library still serializes the HTTP request for you; each Write it makes becomes an op the scheduler orders. Any code using this client, including dependencies, is captured with no call-site changes.
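To make the “no call-site changes” claim concrete, here is a minimal runnable sketch of the wiring without any scheduler behind it. memNet is a hypothetical stand-in for the bus: it plays both roles of bus.Dial and bus.Listener using net.Pipe, and all names in it are my own assumptions, not part of any library.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"sync"
)

// memNet is a hypothetical in-memory network. Dial hands one end of a pipe
// to the client and queues the other end for the server's Accept.
type memNet struct {
	conns chan net.Conn
	done  chan struct{}
	once  sync.Once
}

func newMemNet() *memNet {
	return &memNet{conns: make(chan net.Conn), done: make(chan struct{})}
}

func (m *memNet) Dial(ctx context.Context, addr string) (net.Conn, error) {
	client, server := net.Pipe()
	select {
	case m.conns <- server:
		return client, nil
	case <-m.done:
		return nil, net.ErrClosed
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// Accept, Close and Addr make memNet a net.Listener.
func (m *memNet) Accept() (net.Conn, error) {
	select {
	case c := <-m.conns:
		return c, nil
	case <-m.done:
		return nil, net.ErrClosed
	}
}

func (m *memNet) Close() error {
	m.once.Do(func() { close(m.done) })
	return nil
}

func (m *memNet) Addr() net.Addr { return memAddr{} }

type memAddr struct{}

func (memAddr) Network() string { return "mem" }
func (memAddr) String() string  { return "server" }

// fetch runs a normal http.Server and http.Client over the in-memory
// network and performs a single GET.
func fetch(url string) (string, error) {
	mn := newMemNet()
	mux := http.NewServeMux()
	mux.HandleFunc("/users/42", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "user 42")
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(mn)
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			return mn.Dial(ctx, addr) // the only line that knows about the fake network
		},
		DisableKeepAlives: true,
	}}
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetch("http://server/users/42")
	fmt.Println(body, err)
}
```

The host name "server" is never resolved, because the dialer is the only thing that touches the network. A real bus would replace net.Pipe with conns whose Read and Write park on the scheduler.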

[Figure: diagram with the components SCHEDULER, BUS, net/http, CLIENT and SERVER]

This visualization sums up how simple DST is at its core. A network op is identified by the tuple (from, to), which the bus uses to keep the pending ops in a deterministic order. The scheduler then uses the seed to draw from that ordered set and decide which op runs next. Depending on your architecture, from and to can literally be service names, or they might be more fine-grained. If a service opens multiple connections to the same endpoint, you must distinguish them somehow. I recommend doing that through an explicit marker in the production code. A natural carrier for this marker is context.Context, since it should probably be handed across cancellable functions anyway, including connections. Keep in mind, though, that an ambiguous ordering of ops is a critical error, because it leads to non-determinism. So panic whenever two pending ops compare as equal, which forces an ambiguous ordering to fail loudly instead of silently going non-deterministic. That lets you add deterministic tests iteratively while making sure the premises don’t break.
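A minimal sketch of that ordering rule, under my own assumptions: Op, Bus, Push and Next are hypothetical names, and Key stands for the explicit marker that tells apart several connections between the same two services. Pending ops are sorted by (From, To, Key) before the seeded generator picks one, and pushing an op that is equal to a pending one panics.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// Op is a pending operation. Key is the (hypothetical) marker for telling
// apart multiple connections between the same pair of services.
type Op struct {
	From, To, Key string
}

// less defines the total order the bus relies on.
func less(a, b Op) bool {
	if a.From != b.From {
		return a.From < b.From
	}
	if a.To != b.To {
		return a.To < b.To
	}
	return a.Key < b.Key
}

// Bus holds the ops that are parked and waiting for the scheduler.
type Bus struct {
	rng     *rand.Rand
	pending []Op
}

func NewBus(seed int64) *Bus {
	return &Bus{rng: rand.New(rand.NewSource(seed))}
}

// Push parks an op. Two equal pending ops would make the order ambiguous,
// so that case fails loudly.
func (b *Bus) Push(op Op) {
	for _, p := range b.pending {
		if p == op {
			panic(fmt.Sprintf("ambiguous op: %s -> %s (key %q) is already pending", op.From, op.To, op.Key))
		}
	}
	b.pending = append(b.pending, op)
}

// Next sorts the pending ops so arrival order is irrelevant, then lets the
// seeded generator choose one of them.
func (b *Bus) Next() (Op, bool) {
	if len(b.pending) == 0 {
		return Op{}, false
	}
	sort.Slice(b.pending, func(i, j int) bool { return less(b.pending[i], b.pending[j]) })
	i := b.rng.Intn(len(b.pending))
	op := b.pending[i]
	b.pending = append(b.pending[:i], b.pending[i+1:]...)
	return op, true
}

// drain pushes ops in the given arrival order and returns the order in
// which the scheduler picks them.
func drain(seed int64, arrivals []Op) []string {
	bus := NewBus(seed)
	for _, op := range arrivals {
		bus.Push(op)
	}
	var order []string
	for op, ok := bus.Next(); ok; op, ok = bus.Next() {
		order = append(order, op.From+"->"+op.To)
	}
	return order
}

func main() {
	a := []Op{{From: "api", To: "db"}, {From: "worker", To: "db"}, {From: "api", To: "cache"}}
	b := []Op{a[2], a[0], a[1]} // same ops, different arrival order
	fmt.Println(drain(42, a))
	fmt.Println(drain(42, b)) // same seed, same picks
}
```

The linear duplicate check is fine for a sketch; a real bus would likely keep the pending set in an ordered structure instead of re-sorting on every pick.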

Network and randomness are straightforward injections. Which leaves the one part Go genuinely makes hard.

The historically hard part: scheduling

Rust, for what it’s worth, gets this part almost for free. Async Rust ships no built-in runtime: the executor is a library you choose, and the async model suits DST from the start, so instead of forcing the scheduler to behave you swap the executor for one you control. madsim works like that. In Rust, scheduling is just another interface you can control.

Go gives you no such seam, you can’t directly replace the scheduler, so the existing attempts have all had to work around it. Polar Signals forked the Go runtime to seed goroutine scheduling with GORANDSEED, then ran everything as single-threaded WASM with a syscall shim. gosim goes the other way and source-translates your program, so go foo() turns into gosimruntime.Go(foo) against a custom runtime. Both actually work, and both run into the same problem from different directions: what you test stops being what you ship. The fork pins you to single-threaded WASM, so you also inherit a different threading model and compilation target than production, on top of having to rebase the fork on every Go release and live with what wasip1 can’t do (no cgo, a thinned-out syscall surface). Source translation keeps native compilation but inserts a translator between your source and the binary, so the artifact under test is one the tool generated, and the translator itself becomes something you have to trust. Interface injection has a delta too, but it’s the smallest of the three: the business logic compiles and runs as written, and only the implementations behind a few interfaces change.

Controlling scheduling in Go is especially hard because goroutines are a first-class primitive (for better or worse). The deterministic scheduler needs to determine which component is allowed to run next, assuming 1 component = 1 goroutine. But if components can spawn goroutines freely, we end up with hidden concurrency that can’t be controlled and will probably lead to a deadlock if the same injected interfaces are used. On top of that, one of our premises was that only one goroutine at a time can truly execute logic, and having no control over the number of running goroutines would break that. One solution could be to wrap the raw calls that spawn goroutines, similar to gosim, and track each one through a counter so the scheduler knows how many goroutines it has to account for. That’s basically a semaphore, which becomes quite brittle once you have external dependencies that decide to spawn goroutines on their own. Moreover, it imposes quite a lot on the actual business logic being written.

How do we detect if all goroutines are blocked?

You can only detect inactivity in goroutines you actually know about. If you know the exact number x of goroutines, you can simply wait until x ops are parked in your bus. But as soon as goroutines spawn each other inside a component and the real number exceeds x, quiescence detection becomes impossible. Fortunately, there is now synctest. The testing/synctest package landed as an experiment in Go 1.24 and became stable in 1.25, giving us the ability to detect inactivity of Go code wrapped in synctest.Test (the 1.24 experiment called this function synctest.Run, which is the name used in the quote below).

Run calls a function in a new goroutine. This goroutine and any goroutines started by it exist in an isolated environment which we call a bubble. Wait waits for every goroutine in the current goroutine’s bubble to block on another goroutine in the bubble.

— The Go Blog

With this we can let goroutines freely spawn other goroutines as long as they use our hijacked deterministic interfaces, and then detect when every one of them is done or parked, waiting for our scheduler to choose an op. Which goroutine an op comes from doesn’t really matter anymore. Keep in mind, though, that if an op is indistinguishable from others (as in the network injection), we still have to find a key for it. Fortunately, this is hierarchical, meaning you only need to assign a key to an op within a specific branch. For example, if all your goroutines establish connections to different services, there is no need for keys at all.

Time

synctest also makes time deterministic. Inside a bubble the clock starts at a fixed instant (midnight UTC, Jan 1 2000) and stays there until every goroutine is blocked. Then it advances to the next pending timer and fires it.

If everything is blocked and there’s no timer to fire, synctest panics, it treats that as a deadlock. The scheduler gets around this with synctest.Wait, which blocks until all the other goroutines are parked and then returns. So the scheduler is the one routine still running once the workers go idle, and that’s when it delivers the next op. You only hit the panic if the workers are stuck and nothing is feeding them ops, which usually just means your scheduler is broken.

Only timers move the clock: Sleep, After, Tick, Timer/Ticker, AfterFunc and context deadlines. CPU work moves it by zero, which is great for reproducibility but means real durations are gone. An auth check and a database write both finish in the same instant unless a timer is involved. If you want the write to take longer you have to model that yourself: give the op a delay drawn from the seed and sleep on it so the clock moves.

Other sources of non-determinism

A couple of them are specific to Go:

map iteration

for _, peer := range peers {
    peer.Send(update) // each Send becomes an op
}

This unfortunately is not deterministic in Go, and has to be avoided, especially if an operation crossing the boundary between components is influenced by such an iteration. Go randomizes map iteration order on purpose, so the op stream above ends up seeded by the runtime instead of your scheduler. Iterate over a sorted slice of the keys instead, so the Send order is the same every run.
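One way to do that, as a minimal sketch: collect the keys, sort them, and range over the slice. Peer and broadcast are hypothetical names, and the send callback stands in for whatever turns a Send into an op.

```go
package main

import (
	"fmt"
	"sort"
)

// Peer is a placeholder for whatever the map holds.
type Peer struct {
	Addr string
}

// broadcast calls send once per peer, always in ascending ID order.
func broadcast(peers map[string]Peer, send func(id string, p Peer)) {
	ids := make([]string, 0, len(peers))
	for id := range peers {
		ids = append(ids, id) // order is still random here
	}
	sort.Strings(ids) // and fixed from here on
	for _, id := range ids {
		send(id, peers[id])
	}
}

func main() {
	peers := map[string]Peer{
		"c": {Addr: "10.0.0.3"},
		"a": {Addr: "10.0.0.1"},
		"b": {Addr: "10.0.0.2"},
	}
	broadcast(peers, func(id string, p Peer) {
		fmt.Println("send update to", id, "at", p.Addr)
	})
}
```

On Go 1.23 or newer, slices.Sorted(maps.Keys(peers)) does the collecting and sorting in one call.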

select case

select is a tricky one, because usually this shouldn’t cause any problems. select statements are often used in Go to terminate servers, waiting on a done channel.

select {
case <-done:
    return
case req := <-work:
    handle(req)
}

You might expect select to check its cases top to bottom and take the first ready one, like a switch does. It doesn’t. When more than one case is ready, Go picks between them at random. The reason is fairness: if select always took the first ready case, a channel that’s constantly ready would starve everything below it, and your done channel might never get a turn.

So running one goroutine at a time doesn’t save you here, the randomness is baked into select itself. In the termination pattern above you almost always have a single case ready at a time, so it’s fine in practice. But the moment two channels can be ready in the same step, the ordering is back in the runtime’s hands and you have to force it yourself, for example by draining work before you ever look at done.
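A sketch of the “drain work first” variant: a non-blocking receive on work comes first, and only when nothing is queued does the code block on both channels. next is a hypothetical helper, not a pattern from any library.

```go
package main

import "fmt"

// next returns the next request and true, or false once done is closed and
// no work is queued. When both channels are ready, work always wins.
func next(work <-chan string, done <-chan struct{}) (string, bool) {
	// Look only at work first. With a single channel case plus default,
	// there is nothing for the runtime to randomize.
	select {
	case req := <-work:
		return req, true
	default:
	}
	// Nothing queued: now it is safe to block on both.
	select {
	case req := <-work:
		return req, true
	case <-done:
		return "", false
	}
}

func main() {
	work := make(chan string, 2)
	done := make(chan struct{})
	work <- "first"
	work <- "second"
	close(done) // both channels are ready from here on

	for {
		req, ok := next(work, done)
		if !ok {
			fmt.Println("done")
			return
		}
		fmt.Println("handle", req)
	}
}
```

This handles both requests in order and only then stops, on every run. The second select can still see two ready cases if something fills work between the two statements, so the guarantee depends on the simulation’s premise that only one goroutine runs at a time.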

Caveats

Using blocked channels as a coordination mechanism is not the most efficient method, and clearly beaten by injected runtimes like madsim in pure performance.
In this post I advocated for an application-layer approach that works at the interface level. In an ideal world you would probably want everything to be handled by the overarching runtime. This is the Antithesis approach, and it has the huge advantage that the boundaries of your exploration space can be stretched massively. Why mock out MongoDB when you can just run it as is, with the correct version? This isn’t an ad for Antithesis, and I won’t pretend the invisible approach is strictly worse, because being able to run the real dependency is a genuine advantage we’re giving up. The value proposition of DST lives at the application layer: how does your protocol respond under different orderings, and how does your database state stay consistent under multiple writers and readers? Having to decide where to inject interfaces and where to draw boundaries doesn’t just help you refine them, it also helps you decide which interleavings or patterns even matter in the first place.

The seams are where the bug classes we care about live. Deciding that two writers to the same aggregate must be ordered, and then watching what the protocol does when the scheduler picks the losing order, is a test we only think to write because we were forced to name that boundary. A black-box explorer can reach the same state, but it won’t tell us which boundary mattered. And breadth isn’t the whole story: you can run your application deterministically a million times through automation, and a domain expert can still find a major bug by probing the right configuration. So using DST at the application layer is less of an ideal technical solution and more of a sophisticated way to approach development.
If your app falls into the territory of “classic CRUD”, you might get less mileage out of a DST approach, simply because ordering becomes close to irrelevant: one backend receives requests in a single queue, performs a handful of direct operations and writes back to a database. That describes quite a few systems, which is fine. Keeping it simple can be a virtue here. However, if your system grows beyond that, for example if the backend now also signals multiple other services about the new data and these services influence the state as a whole, you are back in a design that could benefit from DST. A naive litmus test is to ask whether your whole system can be expressed as a set of pure functions: input goes in (e.g. request and user data), output goes out (e.g. a database write), with no side effects. In that case coordination tests between components rarely matter, because each component can be covered cleanly by its own tests.
DST can’t catch low-level data races. This is by design, though: if only one goroutine is allowed to run at a time, there is no way for one to write data while another is reading (the exception being the first run to quiescence).

What now?

The rest is writing the test harness that provides the central bus, the scheduling algorithm that orders ops deterministically, and the interface wrappers that stay as close to the prod interface as possible. That is quite a lot of work unless you start from a fresh codebase, but it is well worth it, especially if you use it as a general playground for your development, which has the potential to reduce iteration time massively. It is also a one-time task with very little regression, provided the injection layer in the actual prod code stays thin, which remains one of the key challenges.

Start with the network layer. That’s where many distributed bugs occur, and it’s usually the outermost layer you have to tackle.
