Casuality Blog | Alexander Mikhailian

The Concurrency Problem: what's broken, what almost worked, and what a real fix might look like


What Async Promised and What it Delivered

Tue, 2026-04-21 02:00

OS threads are expensive: an operating system thread typically reserves a megabyte of stack space and takes roughly a millisecond to create. Context switches happen in kernel space and burn CPU cycles. A server handling thousands of concurrent connections and dedicating one thread per connection means thousands of threads each consuming memory and competing for scheduling. The system spends time managing threads that could be better spent doing useful work.

This is the C10K problem, named by Dan Kegel in 1999. If you were building a web server, a chat system, or anything with a large number of simultaneous connections, you needed a way to handle concurrency without a thread per connection.

The answer came in waves, each solving the previous wave’s worst problem while introducing new ones. Previously we’ve looked at channels in Go and actors in Erlang. Now we turn to async, which is everywhere these days.

Callbacks

The first wave was straightforward: don’t block the thread. Instead of waiting for an i/o operation to complete, register a function to be called when it finishes and move on to the next piece of work. Event loops (select, poll, epoll, kqueue) multiplexed thousands of connections onto a handful of threads, and callbacks were the programmer’s interface to this machinery.

Node.js built an entire ecosystem on this model, handling thousands of concurrent connections on a single thread. Nginx’s event-driven architecture was a major reason it displaced Apache for high-concurrency workloads.

This nicely solved the performance problem, but at a cost: callbacks invert control flow. Instead of writing “do A, then B, then C” as three sequential statements, you write “do A, and when it’s done call this function, which does B, and when that’s done call this other function, which does C.” The programmer’s intent becomes scattered across nested closures. JavaScript developers named this “callback hell” and built an entire website to commiserate.

Callbacks have deeper problems than aesthetics, such as fracturing error handling. Each callback needs its own error path. Errors can’t propagate naturally up the call stack because there is no call stack (callbacks run in a different context from where they are registered). Handling partial failure in a chain of callbacks means threading error state through every function in the chain.
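A minimal sketch of that missing call stack, assuming a hypothetical readLater helper that stands in for any i/o completion. The try/catch wraps only the registration of the callback, so it has already exited by the time the failure arrives:

```javascript
// Hypothetical helper: invokes `callback` on a later event-loop turn,
// the way an i/o completion would.
function readLater(callback) {
  setTimeout(() => callback(new Error("disk failed"), null), 0);
}

function load(onDone) {
  let caughtByTry = false;
  try {
    readLater((err, data) => {
      // load() has already returned here, so its try/catch is gone.
      // The error has to be handed along explicitly as a value.
      onDone({ caughtByTry, reportedByCallback: err ? err.message : null });
    });
  } catch (e) {
    caughtByTry = true; // not reached for asynchronous failures
  }
}

load(result => console.log(result));
// { caughtByTry: false, reportedByCallback: 'disk failed' }
```

Every step in a longer chain needs the same manual hand-off, which is the "threading error state" cost described above.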

Plus, callbacks have no notion of cancellation. If you start an asynchronous operation and then decide you don’t need the result, there’s no general way to stop it. The callback will fire eventually, and your code needs to handle the case where it no longer cares about the result.
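In practice, "cancelling" a callback-based operation often amounts to something like the following sketch. The names fetchSlowly and startRequest are made up for illustration; the point is that the cancel function only flips a flag while the underlying work still runs to completion:

```javascript
// Hypothetical slow operation; completes after 10 ms no matter what.
function fetchSlowly(callback) {
  setTimeout(() => callback(null, "result"), 10);
}

function startRequest(onResult) {
  let abandoned = false;
  fetchSlowly((err, value) => {
    // The operation finished anyway; all we can do is ignore the outcome.
    if (abandoned) return;
    onResult(err, value);
  });
  // "Cancel" only records that the caller lost interest.
  return () => { abandoned = true; };
}

const delivered = [];
const cancel = startRequest((err, value) => delivered.push(value));
cancel();
setTimeout(() => console.log(delivered.length), 50); // 0: the result was discarded
```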

Callbacks solved the resource problem (too many threads) by creating an ergonomics problem (code that’s hard to write, read, and get right).

Promises and Futures

The next wave started with a good idea: what if, instead of passing a callback for later invocation, an asynchronous operation immediately returned an object representing its eventual result?

This is a promise (JavaScript) or future (Java, Rust, etc). The concept dates to Baker and Hewitt in 1977, but it took the C10K pressure of the 2010s to push it into mainstream programming. JavaScript standardized native Promises in ES2015 following the community-driven Promises/A+ spec, and Java 8 introduced CompletableFuture.

Promises are more ergonomic than callbacks. First, promises are composable: promise.then(f).then(g) reads as a pipeline instead of a nested pyramid. Error handling also consolidates: a .catch() at the end of a chain handles failures from any step. And promises are values that you can store, pass around, and return from functions. A first-class handle to an eventual value moves the conversation away from raw threads and toward data dependencies. The idea that “this value depends on a computation that hasn’t finished yet” is a useful thing to be able to express.

Here’s JavaScript reading a user profile and then fetching their recent orders, first with callbacks, then with promises:

    // Callbacks: nested, error handling at every level
    getUser(userId, (err, user) => {
      if (err) return handleError(err);
      getOrders(user.id, (err, orders) => {
        if (err) return handleError(err);
        render(user, orders);
      });
    });

    // Promises: chained, error handling consolidated
    getUser(userId)
      .then(user => getOrders(user.id).then(orders => [user, orders]))
      .then(([user, orders]) => render(user, orders))
      .catch(handleError);

The promise-based version is not a huge improvement on this small example, but the difference grows with complexity: five steps deep in callbacks is nearly unreadable, while five .then() calls chained together are at least linear.

But promises introduced their own problems:

Promises are one-shot. A promise resolves exactly once. This makes them unsuitable for modeling streams, events, repeated messages, or any ongoing communication. A WebSocket that receives a stream of messages doesn’t map onto “a value that will exist later.” This forces a split: promises for request-response patterns, and something else (event emitters, observables, callbacks again) for everything else.

Composition is clunky. The example above hints at it: getting both user and orders into the final .then() requires nesting or awkward gymnastics with Promise.all. Two independent async operations are easy (Promise.all([a, b])), but anything more complex (conditional branching, loops over async operations, early exit) requires increasingly elaborate combinator patterns. These patterns work but they’re a functional programming idiom grafted onto an imperative language and they don’t feel natural.

Errors vanish silently. JavaScript promises that reject without a .catch() handler originally just swallowed the error. The value was lost, and the failure became invisible. This was bad enough that Node.js eventually changed unhandled rejections from a warning to a process crash, and browsers added unhandledrejection events. A feature designed to improve error handling managed to create an entirely new class of silent failures that didn’t exist with callbacks.

The type split. Every function now returns either a value or a promise of a value. So callers need to know which one they’re getting and libraries need to decide which one to provide. A function that was synchronous becomes asynchronous when you add a database call to it, and now every caller needs to handle a promise instead of a value. This is a mild form of the coloring problem that the next wave would make even worse.
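The one-shot limitation is easy to see directly. In this small sketch (makeMessageSource is an illustrative name, not an API from any library), the second resolve call is ignored, and every subscriber observes the same single value:

```javascript
// A promise settles once; later calls to resolve() have no effect.
function makeMessageSource() {
  return new Promise(resolve => {
    resolve("first message");
    resolve("second message"); // silently dropped
  });
}

const seen = [];
const source = makeMessageSource();
source.then(m => seen.push(m));
source.then(m => seen.push(m)); // a second subscriber gets the same value again

source.then(() => console.log(seen));
// [ 'first message', 'first message' ]
```

There is no way for "second message" to reach anyone through this object, which is why streams end up needing a different abstraction.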

Async/Await

Promise chains still looked nothing like the sequential code developers wrote for everything else. Async/await, pioneered by C# in 2012 and adopted by JavaScript (ES2017), Python (3.5), Rust (1.39), Kotlin, Swift, and Dart, closed that gap by letting asynchronous code read like sequential code:

    // Promise chains
    function loadDashboard(userId) {
      return getUser(userId)
        .then(user => getOrders(user.id)
          .then(orders => [user, orders]))
        .then(([user, orders]) => render(user, orders));
    }

    // Async/await
    async function loadDashboard(userId) {
      const user = await getUser(userId);
      const orders = await getOrders(user.id);
      return render(user, orders);
    }

The async/await version reads like sequential code. Variables bind naturally. You can use try/catch instead of .catch(). Loops work with await inside them. It’s an ergonomic win for linear sequences of asynchronous operations.
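As a rough illustration of the try/catch and loop points, here is a sketch that pages through results until a fetch fails. fetchPage is a hypothetical stand-in for real i/o that succeeds for pages 1 to 3 and rejects afterwards:

```javascript
// Hypothetical i/o stand-in: resolves with one item per page, rejects past page 3.
function fetchPage(n) {
  return new Promise((resolve, reject) =>
    setTimeout(
      () => (n <= 3 ? resolve([`item-${n}`]) : reject(new Error("no such page"))),
      0
    )
  );
}

async function fetchAll(pageCount) {
  const items = [];
  try {
    // An ordinary for loop; each iteration waits for the previous page.
    for (let n = 1; n <= pageCount; n++) {
      items.push(...(await fetchPage(n)));
    }
  } catch (err) {
    // A rejection from any iteration lands here, like a thrown exception.
    return { items, error: err.message };
  }
  return { items, error: null };
}

fetchAll(4).then(result => console.log(result));
// { items: [ 'item-1', 'item-2', 'item-3' ], error: 'no such page' }
```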

The industry adopted it fast, with JavaScript frameworks going all-in, Python’s asyncio becoming the standard approach for concurrent i/o, and Rust stabilizing async/await as the path to high-performance networking. Within a few years, async/await was the default way to write concurrent i/o code in most mainstream languages.

Paying the Function Coloring Tax

In 2015, right as async/await was gaining steam, Bob Nystrom published “What Color is Your Function?”, a thought experiment about a language where every function is either “red” or “blue.” Red functions can call blue functions, but blue functions can’t call red functions without special ceremony. Every function must choose a color, and if you call a red function from a blue one, the blue one must become red, spreading virally throughout the codebase.

This was an analogy to async/await: async functions are red, sync functions are blue. An async function can call a sync function without issue, but calling an async function from a sync function requires blocking the thread or restructuring the code. Every function in your program must choose a color, and that choice propagates through every caller.
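A small sketch of what that looks like in JavaScript (the function names are illustrative). The sync caller does not get a string back, it gets a promise, and the only clean fix is for the caller to become async too:

```javascript
async function getUserName(id) {      // "red": async
  return `user-${id}`;
}

function greetSync(id) {              // "blue": sync
  const name = getUserName(id);       // a Promise, not a string
  return `Hello, ${name}`;
}

async function greetAsync(id) {       // the caller had to change color
  const name = await getUserName(id);
  return `Hello, ${name}`;
}

console.log(greetSync(1));            // "Hello, [object Promise]"
greetAsync(1).then(console.log);      // "Hello, user-1"
```

Anything that calls greetAsync now faces the same choice, which is how the color spreads up the call graph.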

Nystrom’s post stuck because it put a name to something developers had been experiencing without a vocabulary for it. Function coloring reshapes entire codebases and ecosystems.

The Rust async ecosystem fragmented around competing runtimes (Tokio, async-std, smol) that provide incompatible implementations of fundamental types like TCP streams and timers. A library written for Tokio can’t easily be used with async-std. The popular HTTP client reqwest simply requires Tokio, and if your project uses a different runtime, that’s your problem. Now library authors either pick Tokio (locking out alternatives) or attempt runtime-agnostic abstractions (adding complexity and sometimes performance overhead).

Tokio’s dominance is function coloring at ecosystem scale. The tax shows up at other scales too:

At the function level, adding a single i/o call to a previously synchronous function changes its signature, its return type, and its calling convention. Every caller must be updated, and their callers must be updated. The change ripples through the call graph until it hits a framework entry point or a main function. A one-line database lookup can require modifying dozens of files.

At the library level, authors face a choice: write a sync library and exclude async users, or write an async library and force sync users to add runtime dependencies (or maintain both). Many choose “both,” doubling the API surface, the test matrix, and the maintenance burden. In Python, the requests library (sync) and aiohttp (async) are separate projects by separate authors doing the same thing. httpx eventually appeared to offer both interfaces from one package, an improvement that is needed only because of the split.

At the ecosystem level, the Rust example above is the norm, not the exception. Every library that touches i/o must choose a color, and that choice limits which other libraries it can work with. The Rust async book itself notes that “sync and async code also tend to promote different design patterns, which can make it difficult to compose code intended for the different environments.”

And the costs aren’t just logistical: async/await introduced entirely new categories of bugs that threads don’t have. O’Connor documents a class of async Rust deadlocks he calls “futurelocks”: a future acquires a lock, then stops being polled while another future tries to acquire the same lock. With threads, a thread holding a lock always makes progress toward releasing it (unless you do something everyone knows is dangerous, like SuspendThread). With async Rust, the standard tools like select!, buffered streams, and FuturesUnordered routinely stop polling futures that hold resources. The original futurelock at Oxide required core dumps and a disassembler to diagnose.

A Sequential Trap

A subtler cost that gets less attention is that async/await’s greatest strength, making asynchronous code look sequential, is also a cognitive trap.

    async function loadDashboard(userId) {
      const user = await getUser(userId);
      const orders = await getOrders(user.id);
      const recommendations = await getRecommendations(user.id);
      return render(user, orders, recommendations);
    }

This fetches orders and recommendations sequentially: getRecommendations doesn’t start until getOrders finishes. But these two operations are independent, because recommendations don’t depend on orders. So they could run in parallel, but don’t. The code looks clean and correct while leaving performance on the table.

The parallel version requires the programmer to explicitly break out of sequential style:

    async function loadDashboard(userId) {
      const user = await getUser(userId);
      const [orders, recommendations] = await Promise.all([
        getOrders(user.id),
        getRecommendations(user.id)
      ]);
      return render(user, orders, recommendations);
    }

The pattern scales poorly beyond small examples. In a real application with dozens of async calls, determining which operations are independent and can be parallelized requires the programmer to manually analyze dependencies and restructure the code accordingly. The sequential syntax actively obscures the dependency structure, i.e. the one piece of information that would tell you what can run in parallel.
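One way to make the hidden difference visible is to count how many operations are in flight at once. The sketch below uses a made-up tracker object in place of real network calls; nothing in the sequential version looks wrong, yet its peak concurrency never rises above one:

```javascript
// Hypothetical instrumentation: records the peak number of overlapping calls.
function makeTracker() {
  let inFlight = 0;
  const tracker = {
    peak: 0,
    fetch(label) {
      inFlight++;
      tracker.peak = Math.max(tracker.peak, inFlight);
      return new Promise(resolve =>
        setTimeout(() => { inFlight--; resolve(label); }, 5));
    },
  };
  return tracker;
}

async function sequential() {
  const t = makeTracker();
  await t.fetch("orders");            // finishes before the next call starts
  await t.fetch("recommendations");
  return t.peak;
}

async function parallel() {
  const t = makeTracker();
  await Promise.all([t.fetch("orders"), t.fetch("recommendations")]);
  return t.peak;
}

sequential().then(peak => console.log("sequential peak:", peak)); // 1
parallel().then(peak => console.log("parallel peak:", peak));     // 2
```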

Async/await was introduced to make asynchronous code easier to write. It made “what can run concurrently” something the programmer must determine manually and express through combinator patterns that break the sequential flow that was the whole point.

What Async Got Right

To be fair, async abstractions did improve things.

Async/await’s ergonomics for linear sequences are better than callbacks or promise chains. For code that’s inherently sequential but happens to include i/o, async/await removes real syntactic noise. It’s easier to read and debug than callback-based code.

And some languages learned the right lessons from the coloring problem. For example, Go deliberately chose goroutines over async/await, accepting a heavier runtime in exchange for no function coloring at all. Java’s Project Loom (virtual threads in Java 21) made the same bet: lightweight threads that look and behave like regular threads, so no code needs to change color. The Loom team explicitly cited function coloring as a problem they wanted to avoid.

Zig went further: it removed its compiler-level async/await entirely and rebuilt around an Io interface parameter that i/o operations accept. The runtime (threaded, event-loop, whatever the user supplies) fulfills the interface. Function signatures don’t change based on how they’re scheduled, and async/await become library functions rather than language keywords. Though some argue that the Io parameter itself is a form of coloring.

Language designers who studied the async/await experience in other ecosystems concluded that the costs of function coloring outweigh the benefits and chose different paths.

Accumulating Costs

Each solution solved a problem but introduced new costs. And those costs are structural, affecting the shape of every program, library, and API in the codebase.

| Wave | Solved | Introduced |
| --- | --- | --- |
| Callbacks | Thread-per-connection resource exhaustion | Inverted control flow, fragmented error handling, callback hell |
| Promises | Nesting, error consolidation, values over callbacks | One-shot limitation, silent error swallowing, mild type split |
| Async/Await | Ergonomics for linear async sequences | Function coloring, ecosystem fragmentation, new deadlock classes, sequential trap |

Each wave made the local experience of writing async code more pleasant while making the global experience more complex. The developer writing a single async function has never had it better, while the team maintaining a large codebase with mixed sync/async code, managing dependency compatibility across runtimes, and trying to find parallelism opportunities hidden behind sequential-looking await chains is carrying a burden that didn’t exist before these abstractions were introduced.

This isn’t a case of bad engineering. The people who designed callbacks, promises, and async/await were solving real problems, and each step was a reasonable response to the previous step’s failures. But fifteen years and several iterations in, the accumulated tax is sizable, and a pattern is visible: each fix treats symptoms while leaving the structure intact.

The callbacks-to-promises-to-async/await arc may be the clearest illustration yet of a theme running through this series: approaches that start by asking “how do we manage concurrent execution?” keep generating new problems at every level of abstraction. You can watch this one play out in real time, across a single ecosystem, within a single decade.

References

Baker, Henry and Carl Hewitt. “The Incremental Garbage Collection of Processes.” ACM SIGART Bulletin 64 (1977): 55–59.
Kegel, Dan. “The C10K Problem.” 1999.
Nystrom, Bob. “What Color is Your Function?” February 1, 2015.
Elizarov, Roman. “How Do You Color Your Functions?” Medium, November 18, 2019.
Cro, Loris. “Zig’s New Async I/O.” Blog post, 2025.
“Virtual Threads in Java.” Oracle Java Magazine.
Corrode Rust Consulting. “The State of Async Rust: Runtimes.” Blog post.
O’Connor, Jack. “Never Snooze a Future.” Blog post, 2026.

Categories: Software

The Isolation Trap

Mon, 2026-03-09 01:00

We’re continuing to look at the state of concurrency in programming languages and identifying what’s wrong with it. As a reminder, in “Message Passing Is Shared Mutable State” I argued that Go’s channels are shared mutable state with extra steps. But you may think Go’s channels aren’t true message passing. The honor for “best case for message passing” probably falls to Erlang. So let’s look at Erlang.

The Best Case

The actor model has been a highly influential idea in concurrent programming since Carl Hewitt proposed it in 1973. The core idea is appealing: concurrent entities (actors) communicate exclusively through messages. Each actor has its own private state that no other actor can touch. If actors can’t share state, they can’t have shared-state bugs.

Various languages and frameworks have implemented versions of this idea. Akka brings actors to the JVM, but since they share a Java heap the isolation is enforced by convention, not by the runtime. Akka actors can pass a reference to a mutable object in a message, and now two actors share mutable state. Swift added actors in version 5.5 with stronger compile-time checks, but allows reentrant calls by default, reintroducing some of the problems actors are supposed to prevent. Orleans and Dapr offer “virtual actors” that solve lifecycle management but not the fundamental concurrency model.

If you were going to build the strongest possible version of the actor model, prioritizing safety and fault tolerance above everything else, you’d probably end up with something close to what Joe Armstrong and the Ericsson team built.

Erlang processes have separate heaps, so they can’t share memory. Messages are copied from one process to another, not shared by reference. If a process dies, its state dies with it, and supervision trees handle recovery. There’s no mechanism for one process to corrupt another process’s memory, because there’s no shared memory to corrupt.

This isn’t just academic elegance: it kept phone switches running with five nines of availability. It scaled WhatsApp to hundreds of millions of users on a small team. It has thirty years of production battle-testing in systems where downtime means real consequences.

Erlang is the strongest form of the isolation argument, and it deserves to be taken seriously, which is why what happens next matters.

The Familiar Problems

In the first essay I argued that message passing is shared mutable state. The communication mechanism itself (channel, mailbox, message queue) is a shared mutable resource and inherits the problems that shared mutable state has always had. Erlang’s mailboxes are no exception.

Erlang’s single-owner mailbox design is more disciplined than Go’s channels: only the owning process can read from a mailbox, and sends are asynchronous. Yet the four failure modes of shared mutable state still show up, just in different shapes.

Consider two Erlang servers that each need data from the other:

    %% Server A handles a request by calling Server B
    handle_call(request, _From, State) ->
        Result = gen_server:call(server_b, sub_request),
        {reply, Result, State}.

    %% Server B handles a request by calling Server A
    handle_call(sub_request, _From, State) ->
        Result = gen_server:call(server_a, request),
        {reply, Result, State}.

This isn’t obviously wrong. Two servers collaborating is a normal architecture. But if a request arrives at Server A that triggers a call to Server B while Server B is already handling a request that calls Server A, then both block. Each is waiting for a reply that cannot arrive, because the other server is waiting too and won’t process new messages until its own call returns. With the default 5-second timeout of gen_server:call/2 the wait ends in a timeout rather than lasting forever, but the shape is the same: this is a mutex deadlock expressed through message passing.
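The circular wait can be reproduced outside Erlang. The following is a rough JavaScript simulation, not Erlang semantics: the Server class, its one-message-at-a-time mailbox, and the 20 ms timeout are all assumptions made for illustration. A handler that fails is treated as a crash, loosely mirroring how an uncaught call timeout takes down the calling process:

```javascript
// Hypothetical single-mailbox server: handles one request at a time.
class Server {
  constructor(name, handler) {
    this.name = name;
    this.handler = handler;
    this.mailbox = [];
    this.busy = false;
    this.crashed = false;
  }
  // Enqueue a request; resolve with the reply or reject after timeoutMs.
  call(request, timeoutMs) {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error(`call to ${this.name} timed out`)), timeoutMs);
      this.mailbox.push({ request, reply: v => { clearTimeout(timer); resolve(v); } });
      this.drain();
    });
  }
  // The next message is not looked at until the current handler finishes.
  async drain() {
    if (this.busy || this.crashed) return;
    this.busy = true;
    while (this.mailbox.length > 0) {
      const { request, reply } = this.mailbox.shift();
      try {
        reply(await this.handler(request));
      } catch (err) {
        this.crashed = true;      // treat a failed handler as a process crash
        this.mailbox.length = 0;  // pending requests are lost with it
        return;
      }
    }
    this.busy = false;
  }
}

const TIMEOUT_MS = 20;
// A answers by asking B, and B answers by asking A.
const serverA = new Server("A", () => serverB.call("sub_request", TIMEOUT_MS));
const serverB = new Server("B", () => serverA.call("request", TIMEOUT_MS));

serverA.call("request", TIMEOUT_MS)
  .then(reply => console.log("reply:", reply))
  .catch(err => console.log(err.message)); // logs a timeout: neither server could answer
```

Neither handler is wrong in isolation. The failure only exists in the combination, and what the original caller sees is a timeout with no indication of the cycle behind it.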

Erlang developers know this pattern and OTP design guidelines discourage it. But knowing about it doesn’t prevent it. Researchers found previously unknown instances in production OTP libraries written by experts following the guidelines. And a 2026 OOPSLA paper by Fowler and Hu proves a stronger result: two protocols that are individually deadlock-free can still combine to deadlock in an actor system. The only solutions are restricting each actor to a single session at a time (too limiting for real servers) or building a flow-sensitive type system to thread protocol state through every function call. The problem isn’t that developers write circular calls by accident. It’s that deadlock-freedom doesn’t compose.

The other three failure modes follow the same pattern. Erlang mailboxes are unbounded and provide no automatic backpressure, so if a process receives messages faster than it can handle them, the mailbox grows until the node runs out of memory and crashes. Fred Hébert (author of Erlang in Anger) built an entire library called pobox specifically for this problem, noting that “high throughput Erlang applications often get bitten by the fact that Erlang mailboxes are unbounded.” Message interleaving from multiple senders is nondeterministic, creating ordering races that the language can’t prevent. And Erlang messages are dynamically typed, so a process can send any term to any other process with no compile-time check that the receiver expects it.
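For contrast, here is a minimal sketch of what a bounded mailbox looks like, written in JavaScript for illustration. This is not pobox's API or anything Erlang provides; BoundedMailbox and its methods are hypothetical. The essential difference is that send can refuse, so overload becomes visible to the sender instead of accumulating in memory:

```javascript
// Hypothetical bounded mailbox: refuses messages once it is full.
class BoundedMailbox {
  constructor(capacity) {
    this.capacity = capacity;
    this.queue = [];
    this.dropped = 0;
  }
  // Returns false instead of growing without limit; the sender must react.
  send(message) {
    if (this.queue.length >= this.capacity) {
      this.dropped++;
      return false;
    }
    this.queue.push(message);
    return true;
  }
  // Returns the oldest message, or undefined when the mailbox is empty.
  receive() {
    return this.queue.shift();
  }
}

const box = new BoundedMailbox(2);
console.log(box.send("a"), box.send("b"), box.send("c")); // true true false
console.log(box.dropped);                                  // 1
```

Whether to drop, block, or signal the sender when the limit is hit is a policy decision, and an unbounded mailbox leaves that decision unmade.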

These are real bugs in real Erlang systems. The mailbox design makes some of them less likely than their Go channel equivalents, but doesn’t make any of them structurally impossible.

The Mitigations

Erlang has answers for each of these problems, and they’re good answers:

| Problem | Erlang’s Shape | Mitigation | Enforced by |
| --- | --- | --- | --- |
| Deadlock | Circular gen_server:call chains | Prefer async casts, use timeouts | Convention at design time |
| Leak | Unbounded mailbox growth | Monitor sizes, use pobox, back-pressure | Monitoring at runtime |
| Race | Nondeterministic message interleaving | Careful protocol design, testing | Discipline at design time |
| Protocol violation | Untyped messages, unmatched clauses | OTP behaviors, code review | Convention at design time |

These mitigations work. They’re a big part of why Erlang systems are as reliable as they are.

But look at the last column: convention, monitoring, discipline. Every one of these falls on the programmer. Not one is enforced by the language or the compiler, and one can’t even be enforced until the system is running in production under real load. Every mitigation depends on the programmer doing the right thing, and properties that aren’t guaranteed by the system will eventually be violated by the humans using it.

Skip the gen_server and use raw message passing? You lose the protocol structure. Use gen_server:call in a circular dependency? There’s a 5-second timeout by default so you won’t deadlock, but you’ll get a cascading timeout failure that’s hard to trace. Neglect to monitor mailbox sizes? An overflow crashes the node at 3 AM. Use the wrong receive clause pattern? Silent misbehavior.

(Edited Mar 14: Originally this paragraph said “Forget to set a timeout on a gen_server:call?” which implied the default has no timeout. This was incorrect: gen_server:call/2 has a default timeout of 5 seconds. The circular call example would time out rather than deadlock. Thanks to commenters on Hacker News for the correction.)

Each mitigation individually is reasonable, but they accumulate. A new developer joining an Erlang team doesn’t just need to learn the language, they need to learn which conventions are load-bearing, which tools to run, which patterns are safe, and which innocent-looking code has a deadlock hiding inside it. Each new thing the programmer has to remember is one more thing the programmer can forget.

This is the discipline tax. It works when the team is experienced, the codebase is well-maintained, and the conventions are followed consistently. It erodes when any of those conditions weaken, and given enough time and enough turnover they do.

The Bottleneck

Even when all the mitigations are in place and the team follows every convention, the isolation model has a structural performance limitation.

Every process’s state is accessed through its mailbox. One process, one mailbox, one message at a time. All access to that process’s data is serialized. If you want to read its state, you send it a message and wait for a reply.

This is fine under moderate load, but it becomes a problem when a thousand other processes all need to read from the same data: a routing table, a configuration store, a session registry, a shared lookup cache. The pure model says “send a message, wait for a reply.” Each reader waits in line, the mailbox becomes a funnel, and throughput collapses.
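The funnel can be sketched in a few lines. OwnerProcess below is a hypothetical JavaScript stand-in for a process that owns some state; every read is chained behind the previous one, so the thousandth reader is served only after the 999 ahead of it:

```javascript
// Hypothetical owner process: all reads pass through one queue, in order.
class OwnerProcess {
  constructor(state) {
    this.state = state;
    this.tail = Promise.resolve(); // end of the request queue
    this.served = 0;
  }
  read(key) {
    // Each read is appended behind whatever is already waiting.
    const result = this.tail.then(() => {
      this.served++;
      return { value: this.state[key], position: this.served };
    });
    this.tail = result;
    return result;
  }
}

const owner = new OwnerProcess({ route: "node-1" });
const reads = [];
for (let i = 0; i < 1000; i++) reads.push(owner.read("route"));

Promise.all(reads).then(results => {
  console.log(results[0].position, results[999].position); // 1 1000
});
```

The readers here never conflict with each other, since they all want the same immutable answer, yet the model gives them no way to say so.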

This isn’t a bug or a design oversight, but rather a direct consequence of the isolation model. If safety comes from “no process can access another process’s state directly,” then all state access must go through the owning process, and the owning process becomes a serialization bottleneck.

Safety through isolation means that safety and performance are in tension. I think Erlang’s creators understood this quite well.

The Escape Hatch

ETS (Erlang Term Storage) exists because of this bottleneck. ETS tables are mutable, concurrent, in-memory data structures sitting outside the process model. Any process can read from or write to a public ETS table without sending a message to anyone.

This was a principled engineering decision by the Erlang team, not a mistake. They recognized that pure isolation couldn’t meet real-world performance requirements and provided a carefully designed escape hatch.

And the pressure didn’t stop at ETS. OTP 21.2 added persistent_term, a global immutable store optimized for data that is read constantly and written rarely (e.g. configuration, routing tables, compiled regular expressions), because even ETS had too much overhead for those access patterns. The same release added the atomics and counters modules: direct shared-memory operations with no copying, no message passing, and no process involvement at all. Each addition moved further from the isolation model, because each addressed a performance gap the previous escape hatch couldn’t close.

But an escape hatch is still an escape hatch. These mechanisms bypass the process isolation model entirely. They are shared state outside the process model, accessible concurrently by any process, with no mailbox serialization and no ownership semantics. (Edited Mar 14: originally this sentence included “no message copying”. ETS does copy terms on read and write, and individual key operations are atomic.) ETS does copy terms, and individual key operations are atomic, but multi-key operations are not. Traversals don’t provide snapshot consistency, and none of these mechanisms participate in the process isolation model that Erlang’s safety story is built on.

And when you introduce shared state into a system built on the premise of having none, you reintroduce the bugs that premise was supposed to eliminate.

Experienced Erlang developers are well aware of this tradeoff. Large systems routinely shard state across many processes, combine actors with ETS for read-heavy workloads, and use persistent_term for global configuration. These are effective engineering patterns. But their existence is itself the point: they are ways of relaxing isolation when isolation becomes the bottleneck. The question isn’t whether Erlang engineers can work around the limitation, but what it means that they have to.

The Consequences

Maria Christakis and Konstantinos Sagonas built a static race detector for Erlang and integrated it into Dialyzer, Erlang’s standard static analysis tool. They ran it against OTP’s own libraries, which are heavily tested and widely deployed.

They found previously unknown race conditions. Not in obscure corners of the codebase. Not in exotic edge cases. In the kind of code that every Erlang application depends on, code that had been running in production for years.

The races clustered around three categories, all at the points where isolation breaks down:

ETS table races. Process A reads a key from a public table, decides to update it, but Process B modifies it between the read and the write. Classic check-then-act, also known as TOCTOU (“time of check to time of use”). ETS explicitly documents that table traversals provide no snapshot consistency: concurrent inserts during iteration can cause keys to be missed or visited twice. Individual key operations are atomic, but multi-key operations are not.

Process registration races. Erlang allows processes to be registered under a name in a global mutable namespace. Two processes racing to register the same name, or one process looking up a name and sending a message while the named process dies and the name gets re-registered to a different process. These are typical shared mutable state TOCTOU bugs.

Process dictionary races. The process dictionary is per-process mutable state, essentially a thread-local mutable hash map that breaks referential transparency and creates subtle ordering dependencies when combined with operations that cross process boundaries.
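The check-then-act shape behind the ETS and registration races can be shown with a small JavaScript analogue. This is an illustrative sketch only: a Map stands in for a public table, and a yield to the event loop stands in for the scheduler switching to another process between the read and the write:

```javascript
// Yields to the event loop, standing in for a switch to another process.
const yieldToScheduler = () => new Promise(resolve => setTimeout(resolve, 0));

async function incrementRacy(table) {
  const current = table.get("counter");  // check
  await yieldToScheduler();              // another "process" runs here
  table.set("counter", current + 1);     // act on a value that may be stale
}

// Read-modify-write as one uninterrupted step, which is the role
// ets:update_counter/3 plays for counters in Erlang.
function incrementAtomic(table) {
  table.set("counter", table.get("counter") + 1);
}

async function demo() {
  const racy = new Map([["counter", 0]]);
  await Promise.all([incrementRacy(racy), incrementRacy(racy)]);

  const safe = new Map([["counter", 0]]);
  incrementAtomic(safe);
  incrementAtomic(safe);

  return { racy: racy.get("counter"), safe: safe.get("counter") };
}

demo().then(result => console.log(result)); // { racy: 1, safe: 2 }
```

Both racy updates read 0 and both write 1, so one increment is lost without any error being raised.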

These are not Erlang-specific problems. They are precisely the same categories of bugs that shared mutable state has always produced: check-then-act races, concurrent modification without atomicity, TOCTOU on a global namespace. They were found in a language designed to address them.

The Pattern

Let’s step back and look at the full picture.

The actor model’s promise is concurrency through isolation. Erlang is its strongest implementation: separate heaps, copied messages, single-owner mailboxes. The community develops sophisticated mitigations for the problems that still leak through: OTP behaviors, supervision trees, cultural conventions, monitoring tools, static analysis. And then performance pressure forces the introduction of shared mutable state, which bypasses all those mitigations and reintroduces the problems that the model and all its accumulated safeguards were supposed to prevent.

Weaker actor implementations like Akka don’t even get this far. They start with shared mutable state available from day one and rely entirely on programmer discipline to avoid using it. Erlang at least enforces isolation at the runtime level before performance pressure erodes it.

This is pretty much the same cycle from the first essay, viewed from a different angle. Go’s channels looked like an escape from shared memory but turned out to be shared memory in disguise. Erlang’s isolation genuinely isn’t shared memory, until the real world forces shared memory back in through the door marked “performance.”

Different starting points leading to a similar destination.

And this isn’t a criticism of Erlang’s engineering or the actor model as a concept. The Erlang team made the right tradeoffs given their foundation. The problem is the foundation itself. Any concurrency model that achieves safety through isolation will face this pressure, because when multiple computations need the same data at the same time, they need concurrent access to it. Isolation can only provide access through serialization. When serialization can’t keep up, the choice is between safety and performance, and in production, performance often wins.

In these two essays we’ve seen two approaches both hitting the same-shaped wall. The common thread isn’t channels or actors or any specific mechanism. It’s that both approaches start from the same assumption: safety comes from controlling how threads interact. So far, that assumption has a perfect track record of leading back to the problems it was supposed to solve.

References

Hewitt, Carl, Peter Bishop, and Richard Steiger. “A Universal Modular ACTOR Formalism for Artificial Intelligence.” IJCAI 1973. PDF
Armstrong, Joe. “Making Reliable Distributed Systems in the Presence of Software Errors.” PhD thesis, Royal Institute of Technology, Stockholm, 2003. PDF
Christakis, Maria and Konstantinos Sagonas. “Static Detection of Race Conditions in Erlang.” PADL 2010. PDF
Christakis, Maria and Konstantinos Sagonas. “Static Detection of Deadlocks in Erlang.” TFP 2011. PDF
Fowler, Simon and Raymond Hu. “Speak Now: Safe Actor Programming with Multiparty Session Types.” OOPSLA 2026. arXiv
Hébert, Fred. pobox: External buffer processes to protect against mailbox overflow in Erlang. GitHub.
Hébert, Fred. “Handling Overload.” Blog post.

Categories: Software

Message Passing Is Shared Mutable State

Fri, 2026-02-20 01:00

Something about the way we write concurrent programs has always felt wrong to me. When I pick up a new language and look at its concurrency model I get the same uneasy feeling. The APIs change, the terminology changes, but the underlying patterns look strangely familiar.

Maybe you’ve felt this too. The tools get better, the abstractions get nicer, but the core problem never seems to go away.

Any software developer who has tackled concurrency in a serious project has the battle scars of dealing with the pitfalls of multi-threaded and concurrent programs: the touchy, often clunky APIs and synchronization mechanisms, the dread of debugging data races and deadlocks, and the brain-bending non-locality of it all.

It’s taken me a while to understand what feels so off about these models well enough to articulate it, but I think I’m finally ready. Let’s start with a somewhat recent language: Go.

The Prediction

In 2006, Edward Lee published The Problem with Threads. His argument was stark: threads are “wildly nondeterministic,” and the programmer’s job becomes pruning that nondeterminism rather than expressing computation.

But Lee went further than criticizing threads: he argued that the shared-memory vs. message-passing debate was a false dichotomy. Both approaches model concurrency as threads of execution that need to be coordinated. Switching the coordination mechanism from locks to messages doesn’t change the underlying model, it changes the syntax of failure.

At the time, this was a contrarian position, and the mainstream languages were moving the other way.

The Experiment

Three years later Go launched with a concurrency philosophy built on the opposite bet. “Do not communicate by sharing memory,” the Go documentation urged. “Instead, share memory by communicating.” Channels (typed, first-class message-passing primitives) were Go’s answer to the concurrency mess.

Go wasn’t a toy. Backed by Google, it was adopted by the infrastructure that runs the modern internet: Docker, Kubernetes, etcd, gRPC, CockroachDB. These systems are among the most heavily used Go codebases in existence and are maintained by experienced teams with extensive code review and testing practices. Tens of thousands of developers wrote concurrent code following Go’s guidance, using channels instead of mutexes, sharing memory by communicating.

It was the most prominent, well-resourced, real-world test of the message-passing hypothesis the industry has ever run.

The Results

In 2019 Tengfei Tu and colleagues studied 171 real concurrency bugs across these flagship Go projects and published Understanding Real-World Concurrency Bugs in Go. The findings were striking: message passing turned out to be no less error-prone than shared memory.

Around 58% of blocking bugs (i.e. goroutines stuck, unable to make progress) were caused by message passing, not shared memory. The thing that was supposed to be the cure was producing the same problems as the disease.

To be clear, message passing does eliminate one important class of concurrency bugs: unsynchronized memory access. If two goroutines communicate only through channels, they cannot simultaneously mutate the same variable. But eliminating data races does not eliminate coordination failures. Deadlocks, leaks, protocol violations, and nondeterministic scheduling remain.

Go ships with a built-in deadlock detector, but it only caught 2 of the 21 blocking bugs the researchers tested. Two. The race detector fared better on non-blocking bugs, catching roughly half, which still means about half of the sampled non-blocking bugs were invisible to the tool that was designed to find them.

Most of these bugs had long lifetimes. They were committed, shipped, ran in production, and weren’t discovered until someone happened to trigger the right interleaving. Testing didn’t find them, and code review didn’t find them. Instead they hid in some of the most heavily scrutinized Go codebases.

Lee’s prediction that switching the coordination mechanism wouldn’t address the root cause was confirmed.

The Code

Here’s a simplified bug in Kubernetes from the paper. A function spawns a goroutine to handle a request with a timeout:

func finishReq(timeout time.Duration) ob {
    ch := make(chan ob)
    go func() {
        result := fn()
        ch <- result // blocks forever if timeout wins
    }()
    select {
    case result := <-ch:
        return result
    case <-time.After(timeout):
        return nil
    }
}

If fn() takes longer than the timeout then the parent returns nil and nobody ever reads from ch. The child goroutine blocks on ch <- result and will never be cleaned up. Go garbage-collects objects, but it doesn’t garbage-collect goroutines blocked on channels that will never be read.

In Kubernetes (the system managing your production container infrastructure) every one of these leaked goroutines hangs onto references and never lets the memory go. Under load, they accumulate, and the process will slowly degrade until it crashes or gets OOM-killed. This is a reliability failure in the software responsible for keeping your other software running, caused by a single missing buffer in a channel.

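The fix is a single extra argument: change make(chan ob) to make(chan ob, 1). With a buffer of one, the send completes even when nobody is left to receive, so the goroutine can exit.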

Now look at the same logic in Java:

BlockingQueue<Result> queue = new SynchronousQueue<>(); // direct handoff, like an unbuffered channel

new Thread(() -> {
    Result result = computeResult();
    try {
        queue.put(result); // blocks until a consumer takes the value
    } catch (InterruptedException e) { }
}).start();

try {
    Result result = queue.poll(timeout, TimeUnit.SECONDS);
    if (result != null) {
        return result;
    } else {
        // thread still running, still blocked on put()
        // queue object still holds a reference
        // nothing will ever clean this up
        return null;
    }
} catch (InterruptedException e) {
    return null;
}

No Java developer would look at this and say “I’m doing message passing.” They’d say “I’m using a shared concurrent queue,” because BlockingQueue lives in java.util.concurrent, right next to Semaphore and the lock classes. They’d know it carries all the risks of shared mutable state.

But this is the Go channel code. Same shared mutable data structure, same blocking semantics, same bug. If the timeout fires then nobody consumes from the queue and the producer blocks forever. The thread leaks. The structure is identical, the only thing that changed is the vocabulary.

In Java, we call this a shared concurrent queue and we understand the risks. In Go, we call it a channel and pretend it’s something different.

Message passing is often presented as an alternative to shared mutable state, but in practice it frequently reintroduces shared coordination structures under another name.
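To make the equivalence concrete, here is a rough sketch of a channel-like object built from nothing but a mutex, two condition variables, and a slice. Queue, NewQueue, Send, and Recv are hypothetical names, and this is not how the Go runtime implements channels; it only shows that the blocking send and receive semantics fall out of ordinary shared mutable state. The sketch assumes a capacity of at least one and does not model unbuffered rendezvous.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is a hypothetical bounded FIFO guarded by a single mutex.
type Queue[T any] struct {
	mu       sync.Mutex
	notEmpty *sync.Cond
	notFull  *sync.Cond
	items    []T
	capacity int // assumed to be >= 1
}

func NewQueue[T any](capacity int) *Queue[T] {
	q := &Queue[T]{capacity: capacity}
	q.notEmpty = sync.NewCond(&q.mu)
	q.notFull = sync.NewCond(&q.mu)
	return q
}

// Send blocks while the queue is full, like a send on a full buffered channel.
func (q *Queue[T]) Send(v T) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == q.capacity {
		q.notFull.Wait()
	}
	q.items = append(q.items, v)
	q.notEmpty.Signal()
}

// Recv blocks while the queue is empty, like a receive on an empty channel.
func (q *Queue[T]) Recv() T {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.notEmpty.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	q.notFull.Signal()
	return v
}

func main() {
	q := NewQueue[int](1)
	go func() {
		for i := 1; i <= 3; i++ {
			q.Send(i) // blocks whenever the single slot is occupied
		}
	}()
	for i := 0; i < 3; i++ {
		fmt.Println(q.Recv())
	}
}
```

Anyone holding a pointer to the Queue can call Send or Recv, and a sender whose consumer has gone away waits on notFull forever. That is the same leak as before, now spelled with a lock.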

Why This Keeps Happening

Arthur O’Dwyer, writing about the paper, identified what he called the “original sin” of Go channels: they aren’t really channels at all. A true channel has two distinct endpoints, a producer end and a consumer end, with different types and capabilities. If the last consumer disappears, the runtime can detect it, unblock producers, and clean up.

A Go channel has none of this. It’s a single object, a concurrent queue, shared between however many goroutines happen to hold a reference. Any goroutine holding a bidirectional reference can send, and any can receive. There are no distinct endpoint objects and no way for the runtime to detect when one side is gone. Go’s directional types (chan<- T and <-chan T) only restrict what a given reference may do; they are views onto the same shared object. It is a mutable data structure shared between multiple threads, where any thread can mutate the shared state by pushing or popping.
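For contrast, here is a sketch of what a two-endpoint design could look like if built by hand in Go. Sender, Receiver, NewPipe, and ErrNoReceiver are hypothetical names; nothing like them exists in the standard library. The receiving end has to announce its departure explicitly with Close, because the Go runtime will not notice on its own, which is exactly the limitation described above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoReceiver is returned by Send once the receiving end has been closed.
var ErrNoReceiver = errors.New("receiver is gone")

// Sender is the hypothetical producer end: it can only send.
type Sender[T any] struct {
	data chan T
	gone <-chan struct{}
}

// Receiver is the hypothetical consumer end: it can receive and close.
type Receiver[T any] struct {
	data chan T
	gone chan struct{}
}

// NewPipe returns the two ends of an unbuffered one-to-one pipe.
func NewPipe[T any]() (Sender[T], Receiver[T]) {
	data := make(chan T)
	gone := make(chan struct{})
	return Sender[T]{data, gone}, Receiver[T]{data, gone}
}

// Send blocks until the value is received or the receiver has been closed.
func (s Sender[T]) Send(v T) error {
	select {
	case s.data <- v:
		return nil
	case <-s.gone:
		return ErrNoReceiver
	}
}

// Recv blocks until a value arrives.
func (r Receiver[T]) Recv() T { return <-r.data }

// Close announces that nobody will receive again. Call it exactly once.
func (r Receiver[T]) Close() { close(r.gone) }

func main() {
	tx, rx := NewPipe[int]()
	result := make(chan error, 1)
	go func() { result <- tx.Send(42) }()
	rx.Close() // the consumer gives up, as the timeout branch would
	fmt.Println(<-result) // prints "receiver is gone"
}
```

The producer is unblocked with an error instead of leaking. The cost is that correctness now depends on every consumer remembering to call Close, which is the same kind of discipline the design was meant to remove.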

Once you see this, the bug categories in the study become predictable rather than surprising. Every classic failure mode of shared mutable state has a channel equivalent:

Deadlock. Goroutine A sends to a channel and waits for a response on another. Goroutine B does the reverse. Both block. This is a circular dependency on shared state, i.e. the same structure as a mutex deadlock but expressed through queues instead of locks. These issues were found in Docker, Kubernetes, and gRPC.

Leak. Nobody reads from a channel, so the sender blocks forever. The shared queue retains a reference to the goroutine, preventing cleanup. The Kubernetes bug above is this pattern: a resource leak caused by a dangling reference to shared state.

Race. If multiple goroutines read from the same channel, which one gets each message? The answer is nondeterministic: the runtime’s scheduler picks one. This is concurrent access to a shared resource, with the nondeterminism mediated by the scheduler instead of explicit locking. The paper documents these in etcd and CockroachDB.

Protocol violation. A goroutine sends a message the receiver doesn’t expect, or sends on a closed channel (which panics in Go), or closes an already-closed channel. The shared object’s implicit contract was violated, the same category of bug that shared mutable state has always produced.

Every one of these is a classic shared-mutable-state bug wearing a message-passing costume.

And this isn’t just a Go problem. Message passing as a concurrency model doesn’t eliminate shared state, it relocates it. The data being communicated may transfer cleanly from sender to receiver, but the communication mechanism itself (channel, mailbox, or message queue) is a shared mutable resource. And that resource inherits every problem shared mutable state has always had.

Even Erlang demonstrates this. Erlang processes are genuinely isolated with separate heaps, no shared references, and messages copied between processes. These are the strongest form of the message-passing guarantee available anywhere, and yet researchers found previously unknown race conditions hiding in Erlang’s own heavily-tested standard library.

The races clustered around ETS tables, Erlang’s escape hatch from pure actor isolation, which are shared mutable storage that exists because the pure actor model didn’t meet performance requirements. The safety model promised isolation, yet reality demanded a shared mutable escape hatch. The escape hatch reintroduced exactly the bugs the model was supposed to prevent.

Message passing solves concurrency bugs the way moving your mess from one room to another solves clutter.

So Now What?

When a Go programmer hits a channel deadlock and considers reaching for a mutex, they’re choosing between two approaches that fail for the same structural reason. “Go channels are fine if you use them correctly” is a true statement. So is “mutexes are fine if you use them correctly.” They’re the same statement.

Lee saw this in 2006. The shared-memory vs. message-passing debate is an argument about which coordination mechanism to use. It has never questioned whether we’re even asking the right question.

If both sides of the dichotomy fail then maybe the dichotomy itself is wrong. Maybe the problem isn’t which tool we use to coordinate concurrent execution. Maybe there’s something deeper about the foundation that both approaches share, something we haven’t questioned yet.

I think there is. Some languages have tried different foundations and attacked aspects of the problem with real insight, but none of them have fully broken through to the mainstream. It’s worth exploring why, so that’s where we’re headed.

References

Lee, Edward A. “The Problem with Threads.” IEEE Computer 39.5 (2006): 33–42. PDF
Tu, Tengfei, et al. “Understanding Real-World Concurrency Bugs in Go.” ASPLOS 2019. PDF
O’Dwyer, Arthur. “Understanding Real-World Concurrency Bugs in Go.” Blog post, June 6, 2019.
Christakis, Maria and Konstantinos Sagonas. “Static Detection of Race Conditions in Erlang.” PADL 2010. PDF

Categories: Software