ekidd an hour ago

Let's assume you're not a FAANG, and you don't have a billion customers.

If you're gluing microservices together using distributed transactions (or durable event queues plus eventual consistency, or whatever), the odds are good that you've gone far down the wrong path.

For many applications, it's easiest to start with a modular monolith talking to a shared database, one that natively supports transactions. When this becomes too expensive to scale, the next step may be sharding your backend. (It depends on whether you have a system where users mostly live in their silos, or where everyone talks to everyone. If your users are siloed, you can shard at almost any scale.)

Microservices make sense when they're "natural". A video encoder makes a great microservice. So does a map tile generator.

Distributed systems are expensive and complicated, and they kill your team's development velocity. I've built several of them. Sometimes, they turned out to be serious mistakes.

As a rule of thumb:

1. Design for 10x your current scale, not 1000x. 10x your scale allows for 3 consecutive years of 100% growth before you need to rebuild. Designing for 1000x your scale usually means you're sacrificing development velocity to cosplay as a FAANG.

2. You will want transactions in places that you didn't expect.

3. If you need transactions between two microservices, strongly consider merging them and having them talk to the same database.
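Concretely, once the two services share a database, what would have been a distributed transaction collapses into an ordinary local one. A minimal sketch, assuming a DB-API style connection and an invented `accounts` table:

```python
# What used to be a cross-service update becomes one local transaction.
# The connection object and the accounts table are illustrative only.
def transfer_credits(conn, from_account, to_account, amount):
    with conn:  # both updates commit together, or neither does
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, from_account),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_account),
        )
```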

Sometimes you'll have no better choice than to use distributed transactions or durable event queues. They are inherent in some problems. But they should be treated as a giant flashing "danger" sign.

  • hggigg 35 minutes ago

    I would add that just because you add these things, it does not mean you can scale afterwards. All the microservices implementations I've seen so far are bolted on top of some existing layer of mud and serve only to make function calls that used to happen inside a process run over the network instead, with added latency and other overheads. The end game is that aggregate latency and cost only increase, with no functional scalability improvement.

    Various engineering leads, happy with what went on their resume, leave and tell everyone how they increased scalability, persuading another generation to repeat the failure.

  • xyzzy_plugh 15 minutes ago

    I frequently see folks fail to understand that when the unicorn rocketship spends a month and ten of its hundreds of engineers replacing the sharded MySQL that is being set ablaze daily by overwhelming load, that is actually pretty close to the correct time for that work. Sure, it may have been stressful, and customers may have been impacted, but it's a good problem to have. Conversely, not having that problem may not mean anything at all, but there's a good chance it means you were solving these scaling problems prematurely.

    It's a balancing act, but putting out the fires before they even begin is often the wrong approach. Often a little fire is good for growth.

atombender a few seconds ago

I find the "forwarder" system here a rather awkward way to bridge the database and Pub/Sub system.

A better way to do this, I think, is to ignore the term "transaction," which is overloaded with too many concepts (such as transactional isolation), and instead to consider the desired behaviour, namely atomicity: you want two updates to happen together, and (1) if one or both fail, you want to retry until both succeed, and (2) if the two updates cannot both be successfully applied within a certain time limit, they should both be undone, or at least flagged for manual intervention.

A solution to both (1) and (2) is to bundle both updates into a single action that you retry. You can execute this with a queue-based system. You don't need an outbox for this, because you don't need to create a "bridge" between the database and the following update. Just use Pub/Sub or whatever to enqueue an "update user and apply discount" action. Using acks and nacks, the Pub/Sub worker system can ensure the action is repeatedly retried until both updates complete as a whole.
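Roughly, the worker for that action can look like this. A minimal sketch: the ack/nack queue client and the two update functions are invented names for illustration, not any particular library's API:

```python
# The queue client (ack/nack) and these two update functions are
# illustrative placeholders, not a specific library's API.
def apply_discount(db, user_id, discount_id): ...   # e.g. one UPDATE statement
def add_user_points(db, user_id, points): ...       # e.g. another UPDATE statement

def handle_apply_discount(message, db, ack, nack):
    payload = message.payload  # e.g. {"user_id": ..., "discount_id": ...}
    try:
        # Both updates run as one unit of work. If either raises, the nack
        # makes the broker redeliver the message and the whole action retries.
        apply_discount(db, payload["user_id"], payload["discount_id"])
        add_user_points(db, payload["user_id"], points=100)
        ack(message)
    except Exception:
        nack(message)  # redelivered and retried later; the updates must be idempotent
```

The one hard requirement is that both updates are idempotent, since the action can run more than once before it finally acks.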

You can build this from basic components like Redis yourself, or you can use a system meant for this type of execution, such as Temporal.

To achieve (2), you extend the action's execution with knowledge about whether it should retry or undo its work. For such a simple action as described above, "undo" means taking away the discount and removing the user points, which are just the opposite of the normal action. A durable execution system such as Temporal can help you do that, too. You simply decide, on error, whether to return a "please retry" error, or roll back the previous steps and return a "permanent failure, don't retry" error.
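Concretely, that means splitting the error handling of the sketch above into a "retry" branch and a "compensate" branch. Again an illustration with invented names, not Temporal's actual API:

```python
# Decide per error whether to retry or to compensate and stop. The error
# classes and the remove_* compensation steps are invented placeholders.
class RetryableError(Exception): ...
class PermanentError(Exception): ...

def remove_discount(db, user_id, discount_id): ...   # undoes apply_discount
def remove_user_points(db, user_id, points): ...     # undoes add_user_points

def handle_apply_discount(message, db, ack, nack):
    payload = message.payload
    try:
        apply_discount(db, payload["user_id"], payload["discount_id"])
        add_user_points(db, payload["user_id"], points=100)
        ack(message)
    except RetryableError:
        nack(message)  # transient failure: redeliver and retry the whole action
    except PermanentError:
        # Undo whatever already succeeded, then ack so it is not retried again.
        remove_discount(db, payload["user_id"], payload["discount_id"])
        remove_user_points(db, payload["user_id"], points=100)
        ack(message)
```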

To tie this together with an HTTP API that pretends to be synchronous, have the API handler enqueue the task, then wait for its completion. The completion can be a separate queue keyed by a unique ID, so each API request filters on just that completion event. If you're using Redis, you could create a separate Pub/Sub channel per request. With Temporal, it's simpler: the API handler just starts a workflow and asks for its result, which is a poll operation.
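With plain Redis that might look roughly like this, using the redis-py client; the task list, channel names, and payload format are all made up for the sketch:

```python
import json
import time
import uuid

import redis  # assumes the redis-py client

r = redis.Redis()

def apply_discount_endpoint(user_id, discount_id, timeout=30):
    request_id = str(uuid.uuid4())
    done = r.pubsub()
    done.subscribe(f"done:{request_id}")  # per-request completion channel

    # Enqueue the action; the worker publishes to done:<request_id> when it finishes.
    r.lpush("tasks", json.dumps({
        "request_id": request_id,
        "action": "apply_discount",
        "user_id": user_id,
        "discount_id": discount_id,
    }))

    deadline = time.time() + timeout
    while time.time() < deadline:
        msg = done.get_message(timeout=1.0, ignore_subscribe_messages=True)
        if msg and msg["type"] == "message":
            return json.loads(msg["data"])
    return {"status": "pending", "request_id": request_id}  # let the client poll later
```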

The outbox pattern is better in cases where you simply want to bridge between two data processing systems, but where the consumers aren't known. For example, you want all orders to create a Kafka message. The outbox ensures all database changes are eventually guaranteed to land in Kafka, but doesn't know anything about what happens next in Kafka land, which could be stuff that is managed by a different team within the same company, or stuff related to a completely different part of the app, like billing or ops telemetry. But if your app already knows specifically what should happen (because it's a single app with a known data model), the outbox pattern is unnecessary, I think.
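For contrast, the outbox version is just two writes in one local transaction plus a dumb forwarder that knows nothing about consumers. A sqlite-flavoured sketch; the table names and the Kafka producer are stand-ins:

```python
import json

def create_order(conn, order):
    with conn:  # one local transaction: both rows commit or neither does
        conn.execute(
            "INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
            (order["id"], order["user_id"], order["total"]),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps(order)),
        )

def forward_outbox(conn, producer):
    # Runs separately; whatever consumes the Kafka topic stays unknown to the app.
    rows = conn.execute("SELECT id, topic, payload FROM outbox ORDER BY id").fetchall()
    for row_id, topic, payload in rows:
        producer.send(topic, payload.encode())  # any producer with a send() method
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
```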

pjmlp an hour ago

As usual, don't try to use the network boundary to do what modules already offer in most languages.

Distributed systems spaghetti is much worse to deal with.

relistan 2 hours ago

This is a good summary of building an evented system. Having built and run one that scaled up to 130 services and nearly 60 engineers, I can say this solves a lot of problems. Our implementation was a bit different but in a similar vein. When the company didn’t do well in the market and scaled down, 9 engineers were (are) able to operate almost all of that same system. The decoupling and lack of synchronous dependencies means failures are fairly contained and easy to rectify. Replays can fix almost anything after the fact. Scheduled daily replays prevent drift across the system, helping guarantee that things are consistent… eventually.

liampulles 2 hours ago

This does a good job of encapsulating the considerations of an event driven system.

I think the author is a little too easily dismissive of sagas, though - for starters, an event driven system is also still going to need to deal with compensating actions; it's just going to result in a larger set of events that various systems potentially need to handle.

The virtue of a saga or some command-driven orchestration approach is that the story of what happens when a user does X is plainly visible. The ability to dictate that upfront, and to figure it out easily later when diagnosing issues, cannot be overstated.
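For example, even a trivially explicit orchestration keeps the whole story of "user places an order", including its compensations, in one place. The step names here are invented placeholders for calls into the relevant services:

```python
# Each step is paired with the compensation that undoes it.
def reserve_inventory(order): ...
def release_inventory(order): ...
def charge_payment(order): ...
def refund_payment(order): ...
def create_shipment(order): ...
def cancel_shipment(order): ...

ORDER_SAGA = [
    (reserve_inventory, release_inventory),
    (charge_payment,    refund_payment),
    (create_shipment,   cancel_shipment),
]

def run_order_saga(order):
    completed = []
    try:
        for action, compensation in ORDER_SAGA:
            action(order)
            completed.append(compensation)
    except Exception:
        # Undo, in reverse order, whatever already succeeded.
        for compensation in reversed(completed):
            compensation(order)
        raise
```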

kunley 23 minutes ago

This is smart, but also: the overall design was so overengineered in the first place...

physicsguy 2 hours ago

> And if your product backlog is full and people who designed the microservices are still around, it’s unlikely to happen.

Oh man I feel this