Your event-driven architecture is a distributed monolith — and that's fine, if you measure it

Async messaging didn't decouple your services; it hid the coupling. Here's how to find the seams again with traces, schemas, and a healthy fear of fan-out.

So you rewrote the monolith into events.

Every team now owns a service. Every service owns its own database. The architecture diagram has 47 arrows on it and a legend that says “asynchronous, fire-and-forget”. Standups are shorter. Deploys are calmer. The principal engineer who pushed for this rewrite has, in fact, been promoted.

And then on a Tuesday around 3pm somebody from billing pings you on Slack and asks why the dashboard team’s deploy is causing refund events to vanish.

Welcome to the distributed monolith. The good news is you’re not alone: this is, in my completely unscientific estimate, what about three out of every four “event-driven” architectures look like in practice. The bad news is the people who built it almost never realise it until something breaks loudly. And the worse news is that the usual fix — “let’s just move some of these back to synchronous calls” — is, in my experience, the wrong fix.

This post is about the right fix, which is mostly boring and mostly about discipline.

What actually happened

The pitch for going event-driven was: services are decoupled, deploys are independent, blast radius is small. That pitch is sometimes true. It is true when the contract between services is explicit, versioned, and validated. It is almost never the default state.

What actually happens, more often:

order-service ──▶  orders.created  ─┐
                                    ├──▶ inventory-service
                                    ├──▶ shipping-service
                                    ├──▶ billing-service
                                    ├──▶ analytics
                                    └──▶ that one Lambda nobody owns

The bus is technically decoupling those services. The schema of the event is not. Inventory, shipping, billing, analytics, and the mystery Lambda all parse the same JSON blob, and they all parse it in slightly different ways. Field gets renamed. Field gets repurposed. Field gets a new value that the old consumers interpret as “unknown, proceed with defaults”. Defaults are wrong. Refunds vanish.

You didn’t decouple anything. You hid the coupling behind a message broker, where nobody can see it without grep and a war room.
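
To make the “unknown, proceed with defaults” failure concrete, here is a minimal sketch of one consumer written two ways. The kind field, its values, and both function names are invented for illustration; nothing here comes from a real payload.

```python
# Hypothetical consumer logic: "kind" and its values are made up for this sketch.
KNOWN_KINDS = {"sale": 1, "refund": -1}


def lenient_sign(event: dict) -> int:
    # Unknown or missing kinds silently fall back to "sale".
    return KNOWN_KINDS.get(event.get("kind"), 1)


def strict_sign(event: dict) -> int:
    # Unknown kinds fail loudly, so the message can be retried or dead-lettered.
    kind = event.get("kind")
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unrecognised kind: {kind!r}")
    return KNOWN_KINDS[kind]


# The producer starts emitting a value this consumer has never seen.
event = {"order_id": "o-1", "kind": "partial_refund", "total_cents": 500}
print(lenient_sign(event) * event["total_cents"])  # 500: the refund is booked as a sale
```

The lenient version never crashes, which is exactly why nobody notices. The strict version turns the same drift into an error you can alert on.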

The thing about message brokers

A message broker is a transport. It is not a contract, it is not a schema, it is not an interface. It is a pipe. If you stick a typed RPC call into a pipe and call the result “decoupled”, you’ve changed the shape of the coupling from “synchronous and observable” to “asynchronous and silent”. The total amount of coupling didn’t go down. It just got harder to find.

If you only take one thing from this post: the bus is not the contract.

Step one: schemas are not folklore

A schema is folklore if the only way to find out what an orders.created event looks like in 2026 is to ask the engineer who originally wrote the producer. A schema is not folklore if you can pip install it, or at least open a JSON Schema file in a registry and diff two versions.

The smallest possible registry that works is a directory of files in a repo every service depends on:

// schemas/orders.created/v2.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "orders.created.v2",
  "type": "object",
  "required": ["order_id", "customer_id", "created_at", "total_cents", "currency"],
  "properties": {
    "order_id":    { "type": "string", "format": "uuid" },
    "customer_id": { "type": "string", "format": "uuid" },
    "created_at":  { "type": "string", "format": "date-time" },
    "total_cents": { "type": "integer", "minimum": 0 },
    "currency":    { "type": "string", "pattern": "^[A-Z]{3}$" }
  },
  "additionalProperties": false
}

Two important things in there.

The first is additionalProperties: false. Yes, really. I know this is unfashionable. I know “be liberal in what you accept” is the default advice. It is also the reason nobody can ever delete a field from a payload, and the reason your bus carries 14 kB of legacy junk per event seven years after launch. Set it to false. Bump the version when you need a new field. Pay the cost up front.

The second thing is that the schema lives in a repo, not in a wiki. Wikis go stale. Repos have CI. You want a CI job that fails when a producer ships a payload that doesn’t validate against the registered schema, and you want that job to be louder than a green checkmark.

This is unglamorous work. It is the single most valuable hour-per-week you can spend on an event-driven system, and it almost never gets prioritised until after the first big incident. Prioritise it before.
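
As a rough idea of what that CI check does, here is a dependency-free sketch. It is a hand-rolled subset, not a JSON Schema implementation: it only looks at required, additionalProperties, and top-level types, and it ignores format, pattern, and minimum. A real job would load the file from the schemas repo and hand it to a proper JSON Schema validator. The names contract_errors and sample, and the coupon_code field, are hypothetical.

```python
import json

# Trimmed copy of the v2 schema; in CI this would be read from the schemas repo.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["order_id", "customer_id", "created_at", "total_cents", "currency"],
  "properties": {
    "order_id":    {"type": "string"},
    "customer_id": {"type": "string"},
    "created_at":  {"type": "string"},
    "total_cents": {"type": "integer"},
    "currency":    {"type": "string"}
  },
  "additionalProperties": false
}
""")

# The only JSON Schema types this sketch understands.
PY_TYPES = {"string": str, "integer": int}


def contract_errors(payload: dict, schema: dict) -> list[str]:
    """Return human-readable violations; an empty list means the payload passes."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"missing required field: {name}")
    if schema.get("additionalProperties") is False:
        for name in payload:
            if name not in props:
                errors.append(f"unexpected field: {name}")
    for name, rule in props.items():
        if name not in payload:
            continue
        value = payload[name]
        # bool is a subclass of int in Python; this schema has no booleans, so reject them.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[rule["type"]]):
            errors.append(f"wrong type for {name}: expected {rule['type']}")
    return errors


# A producer fixture that quietly grew a field without a version bump.
sample = {
    "order_id": "0b8f4c1e-6d2a-4f7b-9a3e-1c5d7e9f0a2b",
    "customer_id": "5a1d9e3c-2b4f-4c6d-8e7a-9f0b1c2d3e4f",
    "created_at": "2026-01-13T15:00:00Z",
    "total_cents": 1299,
    "currency": "EUR",
    "coupon_code": "SPRING",
}
errors = contract_errors(sample, SCHEMA)
print(errors)  # ['unexpected field: coupon_code']
```

In the actual job you would exit non-zero whenever errors is non-empty, and post the list somewhere people will read it.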

Step two: subscriptions are not folklore either

Question: which services consume orders.created?

If you can’t answer that without grep-ing across every repo in your org, the answer is “you have no idea, and neither does anyone else”. Which means you also have no idea what breaks when you change the event.

The fix is some mechanism — any mechanism — that makes consumer subscriptions explicit and discoverable. I have, at various employers, shipped three versions of this:

  1. A subscriptions.yaml at the root of every service repo. Cheap, ugly, works. The downside is everyone forgets to update it.
  2. A decorator on the handler function that registers it in a tiny internal library. Less forgetful, but only works inside one language ecosystem.
  3. A handler-naming convention strict enough to grep for with ripgrep. Honestly? My favourite of the three. The constraint is the documentation.

Whatever you pick, run a scheduled job that crawls the org, extracts the declarations, and writes them into a graph. Render that graph somewhere visible. I once had a printed version of this graph on the wall beside my desk. People stopped asking me questions about ownership. Highly recommend.
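
Here is a minimal sketch of that crawl for the naming-convention option. The convention itself is an assumption made up for this example: handlers are named handle__ followed by the topic, with double underscores standing in for dots. The sources dict stands in for file contents you would read from checked-out repos.

```python
import re
from collections import defaultdict

# Assumed convention: `def handle__orders__created(...)` consumes "orders.created".
HANDLER = re.compile(r"^\s*(?:async\s+)?def\s+handle__([a-z0-9_]+)\s*\(", re.MULTILINE)


def build_graph(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each topic to the sorted list of services that declare a handler for it."""
    graph = defaultdict(set)
    for service, text in sources.items():
        for match in HANDLER.finditer(text):
            topic = match.group(1).replace("__", ".")
            graph[topic].add(service)
    return {topic: sorted(services) for topic, services in sorted(graph.items())}


# Stand-in for source files pulled from each service repo.
sources = {
    "inventory-service": "def handle__orders__created(msg):\n    ...\n",
    "billing-service": (
        "async def handle__orders__created(msg):\n    ...\n\n"
        "def handle__refunds__issued(msg):\n    ...\n"
    ),
}

graph = build_graph(sources)
for topic, services in graph.items():
    for service in services:
        print(f'"{topic}" -> "{service}"')  # edge lines you can paste into a DOT graph
```

The output format matters less than the schedule. A graph regenerated nightly beats a beautiful one drawn once.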

Step three: trace context, on every message, no exceptions

The thing that made the monolith debuggable was the call stack. The thing that makes a distributed system debuggable is distributed tracing. There is no third option. There is, in particular, no “we’ll just check the logs in each service” option — anyone who has tried to reconstruct a multi-service flow from interleaved Loki queries at 2am can tell you exactly how that goes.

Every message has to carry trace context end-to-end. W3C trace context, OpenTelemetry baggage, whatever your stack speaks natively. The wrapper is about ten lines:

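from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)

# producer side
def publish(topic, payload):
    # inject() writes the active trace context into the dict (a traceparent entry by default)
    headers = {}
    propagate.inject(headers)
    # `bus` stands in for whatever broker client you use
    bus.publish(topic, payload, headers=headers)

# consumer side
def handle(msg):
    # extract() rebuilds the producer's context so this span joins the same trace
    ctx = propagate.extract(msg.headers)
    with tracer.start_as_current_span(f"handle.{msg.topic}", context=ctx):
        process(msg.payload)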

Once that’s deployed everywhere, the question “what happens when I publish orders.created?” becomes a single query in your tracing tool, and you can stop being the team’s human DAG diagram.

If you only have time to do one of the three things in this post, do this one. Schemas help you ship safely. Subscriptions help you plan. Trace context is what saves your weekend.
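
“No exceptions” is easier to enforce if something checks for it. Below is a small, standard-library-only sketch that decides whether a message's headers carry a usable W3C traceparent value. It handles the version 00 four-field layout only, and it assumes headers arrive as a plain dict of strings, which your broker client may not give you. The function names are my own.

```python
import re

# Version-00 layout: version, 32-hex trace id, 16-hex parent id, 2-hex flags.
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str):
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT.fullmatch(value)
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest flag bit means "sampled"
    }


def has_trace_context(headers: dict) -> bool:
    value = headers.get("traceparent")
    return isinstance(value, str) and parse_traceparent(value) is not None


example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(has_trace_context({"traceparent": example}))  # True
print(has_trace_context({}))                        # False
```

Count the messages that fail this check per topic and you have a list of producers that still need the wrapper.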

The fan-out problem

When you finally have visibility, you will notice a thing that may make you uncomfortable: the average user request, traced end-to-end, produces approximately 40 spans. Sometimes 80. Once, on a particularly memorable Black Friday morning, I watched a single “add to cart” produce 312.

This isn’t inherently a problem. Fan-out is fine. Fan-out is sometimes the whole reason to use events. But unbounded, unwatched fan-out is how your tail latency budget evaporates without anyone making a decision about it.

A rough rubric I use, with no claim to it being scientific:

fan-out per request   what I do
under 10              nothing, ship it
10–30                 sample traces, watch the p99.9 tail
30–80                 start asking who’s publishing inside a loop
80+                   treat as an incident, even if nothing’s on fire yet
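
If you want the rubric as something a scheduled job can apply, a sketch might look like this. It uses spans per trace as a stand-in for fan-out, the same way the paragraph above does, and it assumes each exported span is a dict with a trace_id key. The rubric's ranges overlap at 30 and 80, so the cut-offs here (a boundary value goes to the higher band) are my own choice.

```python
from collections import Counter


def rubric(fan_out: int) -> str:
    """Map a fan-out count to an action; boundary values fall into the higher band."""
    if fan_out < 10:
        return "nothing, ship it"
    if fan_out < 30:
        return "sample traces, watch the tail"
    if fan_out < 80:
        return "look for publishing inside a loop"
    return "treat as an incident"


def spans_per_trace(spans: list[dict]) -> Counter:
    # One entry per trace id, counting how many spans that trace produced.
    return Counter(span["trace_id"] for span in spans)


# Stand-in for spans exported from a tracing backend.
spans = [{"trace_id": "a"}] * 3 + [{"trace_id": "b"}] * 45

for trace_id, count in spans_per_trace(spans).items():
    print(f"{trace_id}: {count} spans -> {rubric(count)}")
```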

The biggest culprit, in basically every postmortem I’ve been part of, is a service that publishes events inside a loop iterating over a collection. The author insists this is fine, because “we’re just emitting domain events”. You are not. You are emitting 4,000 domain events, each of which fans out to seven consumers, each of which writes a row to a database. The math doesn’t care about your intent.

If you find one of these in code review, kill it. There is almost always a single “batch” event that captures the same intent. If there genuinely isn’t, that’s a design smell worth pulling on for half an hour.
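
The arithmetic is worth seeing once. This sketch uses a fake bus that only counts deliveries; the topic names, the CountingBus class, and both publish functions are hypothetical.

```python
class CountingBus:
    """Fake broker: every publish is delivered to a fixed number of consumers."""

    def __init__(self, consumers_per_topic: int):
        self.consumers_per_topic = consumers_per_topic
        self.deliveries = 0

    def publish(self, topic: str, payload: dict) -> None:
        self.deliveries += self.consumers_per_topic


def publish_per_item(bus: CountingBus, order_ids: list[str]) -> None:
    # The pattern to catch in review: one event per element of a collection.
    for order_id in order_ids:
        bus.publish("orders.repriced", {"order_id": order_id})


def publish_batch(bus: CountingBus, order_ids: list[str]) -> None:
    # Same intent, one event carrying the whole collection.
    bus.publish("orders.repriced_batch", {"order_ids": list(order_ids)})


order_ids = [f"order-{n}" for n in range(4000)]

loop_bus = CountingBus(consumers_per_topic=7)
publish_per_item(loop_bus, order_ids)
print(loop_bus.deliveries)   # 28000

batch_bus = CountingBus(consumers_per_topic=7)
publish_batch(batch_bus, order_ids)
print(batch_bus.deliveries)  # 7
```

A batch event is not free either: consumers now have to cope with a large payload, and most brokers cap message size. That is still a trade-off someone chose on purpose, which is the point.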

The thing nobody tells you

Here’s the part the “events are decoupled!” pitch leaves out: an event-driven system is strictly more work to operate than a synchronous one. Not less. More.

It has more failure modes. It has worse default error messages. It needs idempotent consumers. It needs replay tooling. It needs schema discipline, subscription discipline, fan-out discipline, trace context discipline, dead-letter discipline. Most of those words are unglamorous and none of them ship features.

In exchange, you get a system where teams can deploy independently without coordinating, where you can replay a day of traffic to recover from a bug, and where the blast radius of any one component failing is contained. Those are real wins. They are not free.

The mistake is going event-driven and then pretending you’re not running a distributed system. You are. Pay the operational tax, keep paying it, and the architecture will keep paying you back. Skip it and, well. See you in the war room.