Every backend team eventually hits the distributed transaction wall. One service deducts inventory, another charges a payment, a third updates a loyalty ledger — and then the network blinks. The textbook answer is either two-phase commit (2PC) or sagas, but the real world is messier than textbooks admit. Teams often try one, fail, switch to the other, and still find themselves patching edge cases at 2 AM. This guide offers a practical lens for reconciling these two approaches, not as rivals but as tools for different parts of the same workflow.
Where the Battle Plays Out
Picture an e-commerce checkout flow. The order service creates an order record, the payment service authorizes a charge, the inventory service reserves items, and the shipping service schedules a pickup. If any step fails, the whole operation must either roll back cleanly or compensate. In a monolithic system, a local database transaction handles this atomically. In a microservices architecture, no single database spans all services, so you need a coordination mechanism.
Two-phase commit was the original solution. It uses a coordinator that asks every participant to prepare (phase one) and then commit (phase two). If any participant votes no, the coordinator tells everyone to abort. This guarantees atomicity, meaning either all services commit or all abort, but it comes with a cost: the coordinator is a single point of failure, participants hold locks from the moment they vote yes until they learn the outcome, and the protocol blocks if the coordinator crashes after participants have prepared but before they receive the decision.
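The prepare-then-decide flow can be sketched in a few lines. This is a minimal in-memory illustration, not a production coordinator: the Participant class and the two_phase_commit function are hypothetical names, and a real implementation would also write the decision to a durable log and handle timeouts.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical in-memory participant; a real one holds locks and a durable log."""
    name: str
    can_commit: bool = True
    state: str = "idle"

    def prepare(self) -> bool:
        # Phase one: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> str:
    # Phase one: collect votes, stopping at the first "no".
    votes = []
    for p in participants:
        vote = p.prepare()
        votes.append(vote)
        if not vote:
            break
    everyone_voted_yes = len(votes) == len(participants) and all(votes)
    decision = "commit" if everyone_voted_yes else "abort"
    # A real coordinator would persist `decision` here, before phase two,
    # so that it can finish the protocol after a crash.
    # Phase two: deliver the same decision to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

If the coordinator in this sketch stopped between the two loops, every participant that voted yes would be left in the "prepared" state with no way to learn the outcome, which is the blocking window described above.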
Sagas emerged as a more resilient alternative. Instead of locking resources, a saga breaks the transaction into a sequence of local transactions, each with a compensating action. If a step fails, the saga runs the compensations for all completed steps. No locks, no blocking coordinator — but atomicity is eventual, not immediate. Other services can see intermediate states, and compensations must be idempotent and correctly ordered.
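A minimal sketch of that idea, assuming each step is a (name, action, compensation) triple: the run_saga function and the Step alias are illustrative names, and the sketch assumes compensations themselves do not fail.

```python
from typing import Callable

# (name, forward action, compensating action); names are illustrative.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> tuple[str, list[str]]:
    completed: list[Step] = []
    log: list[str] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            completed.append(step)
            log.append(f"did {name}")
        except Exception:
            log.append(f"failed {name}")
            # Undo completed steps in reverse order of execution.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"undid {done_name}")
            return "compensated", log
    return "completed", log
```

Compensating in reverse order matters when later steps depend on earlier ones, for example releasing a payment hold before cancelling the order it belongs to.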
Which one should you use? The answer depends on your tolerance for inconsistency, your failure recovery time, and how much you trust your network. We have seen teams choose sagas for speed, only to discover that writing correct compensations is harder than they expected. Others pick 2PC for correctness, then hit availability problems when the coordinator goes down.
The Real Cost of Choosing Wrong
Choosing the wrong pattern can haunt a system for years. A team that picks 2PC for a high-throughput order flow will see latency spikes during prepare phases, especially under load. Another team that picks sagas for a financial settlement system may struggle to reconcile accounts when compensations fail silently. The cost is not just engineering time — it is lost revenue, angry customers, and audit headaches.
Foundations Readers Confuse
Many developers conflate sagas with eventual consistency and 2PC with strong consistency, but the reality is more nuanced. A saga whose coordinator runs every step synchronously and waits for all outcomes can hand the caller a consistent final result, although other services can still observe intermediate states while it runs. Conversely, 2PC can lose its atomicity guarantee if the recovery protocol lets participants resolve in-doubt transactions on their own (so-called heuristic decisions), because one participant may commit while another aborts. The labels matter less than the guarantees your system actually provides.
Another common confusion is between sagas and distributed transactions. A saga is a pattern for managing a long-lived business transaction across services, but it does not provide the ACID guarantees of a database transaction. The A in ACID (atomicity) is replaced by a weaker guarantee: either all steps complete successfully, or all completed steps are undone via compensations. The C (consistency) is maintained by the application, not the infrastructure. The I (isolation) is sacrificed — sagas allow intermediate states to be visible. The D (durability) depends on each local transaction's database.
Two-phase commit, on the other hand, can provide ACID guarantees within the scope of the transaction, but only if all participants support a prepare/commit interface such as the XA protocol and the coordinator is highly available. Many modern databases and message brokers do not support XA, or they support it with performance penalties that make it impractical for high-throughput workloads.
When Isolation Matters
Isolation is the most overlooked difference. In a saga, another service might read an order as 'pending payment' while the saga is still running. If the saga later fails and compensates, that service has seen an inconsistent state. To prevent this, you need semantic locking or versioned states. In 2PC, isolation is guaranteed because participants hold locks until the coordinator decides. But those locks can cause contention and deadlocks in high-concurrency systems.
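One way to picture semantic locking and versioned states together is below. This is a sketch under assumptions: the Order fields, the status names, and the helper functions are all hypothetical, and a real system would enforce the version check inside the database update itself.

```python
from dataclasses import dataclass


class StaleVersion(Exception):
    """Raised when a writer acted on an outdated copy of the order."""


@dataclass
class Order:
    order_id: str
    status: str = "PENDING"  # semantic lock: set while a saga is in flight
    version: int = 0


def transition(order: Order, expected_version: int, new_status: str) -> None:
    # Versioned state: reject writers that read an older version.
    if order.version != expected_version:
        raise StaleVersion(f"expected v{expected_version}, found v{order.version}")
    order.status = new_status
    order.version += 1


def is_settled(order: Order) -> bool:
    # Readers that need final data skip orders a saga still holds.
    return order.status in {"CONFIRMED", "CANCELLED"}
```

A reporting job would call is_settled before counting an order, so a saga that later compensates never leaks its 'pending' state into downstream totals.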
Patterns That Usually Work
After working with dozens of teams, we have seen a few patterns that consistently deliver good results. The first is the choreography-based saga, where each service emits events that trigger the next step. This pattern works well when the workflow is simple and the number of participants is small. The order service emits an 'OrderCreated' event, the payment service listens and emits 'PaymentProcessed', and so on. If a step fails, the service emits a failure event that triggers compensations.
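The event chain can be sketched with a tiny in-process event bus. 'OrderCreated' and 'PaymentProcessed' come from the description above; the 'PaymentFailed' event, the handler names, and the payload fields are assumptions made for illustration, and a real deployment would use a message broker rather than direct function calls.

```python
from collections import defaultdict
from typing import Callable

handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
trace: list[str] = []  # records every published event, useful for debugging


def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    handlers[event_type].append(handler)


def publish(event_type: str, payload: dict) -> None:
    trace.append(event_type)
    for handler in list(handlers[event_type]):
        handler(payload)


def payment_service(order: dict) -> None:
    # Reacts to OrderCreated and announces its own outcome.
    if order["amount"] <= order["card_limit"]:
        publish("PaymentProcessed", order)
    else:
        publish("PaymentFailed", order)


def inventory_service(order: dict) -> None:
    order["status"] = "RESERVED"


def order_service_on_failure(order: dict) -> None:
    # Compensation: the order service cancels its own record.
    order["status"] = "CANCELLED"


subscribe("OrderCreated", payment_service)
subscribe("PaymentProcessed", inventory_service)
subscribe("PaymentFailed", order_service_on_failure)
```

No component here knows the whole workflow; the sequence exists only in the subscriptions, which is why choreography gets harder to trace as participants are added.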
The second pattern is the orchestrator-based saga, where a dedicated saga coordinator manages the workflow. The coordinator sends commands to each service, tracks responses, and calls compensations on failure. This pattern is easier to monitor and debug because the coordinator holds the entire state. It also handles complex workflows with conditional branches and retries more gracefully than choreography.
For two-phase commit, the most reliable pattern is to use a lightweight coordinator that runs as a sidecar or embedded library, rather than a standalone service. This reduces the single-point-of-failure risk because each service has its own coordinator instance and no single process coordinates every transaction. If a coordinator crashes, the participant can recover by consulting a shared transaction log, which must itself be durable and highly available. This approach is used by some distributed databases and message queues.
Hybrid Approaches That Work
Some teams successfully combine both patterns: use 2PC for the critical core of a transaction (like payment and inventory deduction) and sagas for the less critical periphery (like sending emails or updating analytics). This gives you atomicity where it matters most and resilience everywhere else. The key is to keep the 2PC scope small and short-lived, and to design compensations for the saga parts that are idempotent and retryable.
Anti-Patterns and Why Teams Revert
The most common anti-pattern is using sagas for operations that genuinely need atomicity. We have seen teams implement a saga for a funds transfer between two accounts, only to discover that if the credit step succeeds, the debit step then fails, and the compensation that reverses the credit also fails, money is created out of thin air. The correct solution for that case is 2PC or a distributed ledger that supports atomic multi-record updates.
Another anti-pattern is using 2PC for long-running transactions. The prepare phase holds locks that block other operations. If the transaction takes minutes (because it involves human approval or external API calls), the locks cause contention that brings the system to a crawl. Teams that fall into this trap often revert to sagas, but they could have avoided the problem by splitting the long transaction into smaller 2PC segments.
A third anti-pattern is ignoring idempotency in compensations. If a compensation runs twice (due to a retry), it might double-refund a customer or double-cancel an order. Every compensation must be designed to be safe to run multiple times. This is harder than it sounds, especially when compensations involve external systems that do not support idempotency keys.
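A common way to make a compensation safe to repeat is to key it on an idempotency key. The sketch below assumes a hypothetical RefundLedger that remembers which keys it has already applied; in practice that memory would live in a database with a unique constraint on the key.

```python
class RefundLedger:
    """Hypothetical refund store keyed by an idempotency key."""

    def __init__(self) -> None:
        self._applied: dict[str, int] = {}
        self.total_refunded = 0

    def refund(self, idempotency_key: str, amount_cents: int) -> bool:
        # A key that was already processed makes the retry a no-op.
        if idempotency_key in self._applied:
            return False
        self._applied[idempotency_key] = amount_cents
        self.total_refunded += amount_cents
        return True
```

Deriving the key from the saga instance and the step (for example "saga-42:refund-payment") means every retry of the same compensation presents the same key.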
The Revert Loop
We have seen teams cycle between 2PC and sagas multiple times on the same project. They start with 2PC, hit availability problems, switch to sagas, hit inconsistency problems, then try to add 2PC back for the critical parts. This revert loop is a symptom of not analyzing the failure modes upfront. A better approach is to map out all possible failure scenarios and decide which pattern handles each one, rather than committing to a single pattern for the entire workflow.
Maintenance, Drift, and Long-Term Costs
Both patterns incur maintenance costs that grow over time. For sagas, the biggest cost is compensating logic. Every new step in the saga requires a corresponding compensation, and those compensations must be kept in sync with the forward logic. If a business rule changes, both the forward action and the compensation may need updating. Teams often forget to update compensations, leading to drift where the compensation no longer correctly undoes the forward action.
For 2PC, the biggest cost is coordinator availability and transaction log management. The coordinator must be highly available, which often means running it in a cluster with consensus. The transaction log grows with every transaction, and cleaning it up requires careful compaction. If the log grows too large, recovery times increase, and the coordinator becomes a bottleneck.
Monitoring is another hidden cost. Sagas require tracking the state of each saga instance, detecting stalled instances, and implementing timeout-based recovery. 2PC requires monitoring the coordinator's health, the transaction log size, and the number of in-doubt transactions. Both patterns need dashboards and alerts, but the metrics are different.
Drift in Practice
We have seen a team that built a saga for user registration. The saga had steps for creating a user profile, sending a welcome email, and initializing a trial account. Over time, the team added a step for syncing with a CRM system but forgot to add a compensation. When the saga failed after the CRM sync, the compensation ran for the previous steps but left the CRM in an inconsistent state. The bug was not caught until a customer complained about duplicate records. This kind of drift is inevitable without automated tests that verify compensations.
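A cheap guard against this kind of drift is a check, run in CI, that fails when any step is registered without a compensation. The SagaStep type and the helper below are illustrative names, not part of any particular saga framework.

```python
from typing import Callable, NamedTuple, Optional


class SagaStep(NamedTuple):
    name: str
    action: Callable[[], None]
    compensation: Optional[Callable[[], None]]  # None means "nobody wrote one"


def steps_missing_compensation(steps: list[SagaStep]) -> list[str]:
    # Return the names of steps that cannot be undone.
    return [step.name for step in steps if step.compensation is None]
```

Applied to the registration saga above, a test asserting that this list is empty would have flagged the CRM sync step on the day it was added. It does not prove the compensation is correct, only that one exists.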
When Not to Use This Approach
There are situations where neither 2PC nor sagas is the right answer. If your workflow involves a single database, use a local transaction. If your workflow is purely asynchronous and eventual consistency is acceptable, consider an event-driven approach with outbox patterns and idempotent consumers, without formal saga coordination.
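The outbox pattern mentioned above can be sketched with Python's built-in sqlite3 module. The table layout, the function names, and the send callback are assumptions for illustration; the essential point is that the business row and the event row are written in one local transaction.

```python
import json
import sqlite3

outbox_db = sqlite3.connect(":memory:")
outbox_db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: str) -> None:
    # One local transaction covers both the state change and the event record.
    with outbox_db:
        outbox_db.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
        outbox_db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(send) -> int:
    # Publish unpublished rows in order; mark each one only after send() returns.
    rows = outbox_db.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        send(event_type, json.loads(payload))
        with outbox_db:
            outbox_db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes between send() and the UPDATE, the event is published again on the next pass, so delivery is at-least-once and consumers still need to be idempotent.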
If your workflow involves external systems that do not support rollback or compensation (like sending a physical letter), neither pattern can help. In that case, you need a different strategy, such as idempotent delivery with manual reconciliation.
If your workflow has tight latency requirements (under 10 ms), 2PC is likely too slow because of the two-round-trip overhead. Sagas can be faster, but they add complexity that may not be justified if the workflow is simple.
If your team is small and lacks experience with distributed transactions, starting with either pattern is risky. A simpler alternative is to redesign the workflow to avoid distributed transactions altogether. For example, you can use a single service that owns all the data needed for the transaction, or you can use a message queue that offers exactly-once processing (in practice, at-least-once delivery combined with idempotent or transactional consumers) and let each service handle its own consistency.
The No-Coordination Alternative
Some teams achieve acceptable consistency without any coordination by using idempotent operations and conflict resolution. For example, an inventory reservation can be implemented as a 'reserve' operation that is idempotent, and conflicts (like overselling) are resolved by a periodic reconciliation job. This approach avoids both 2PC and sagas, at the cost of eventual consistency and manual intervention when conflicts arise.
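A sketch of that approach, assuming an in-memory Inventory class with hypothetical method names: reserve is idempotent per reservation ID, and oversold plays the role of the periodic reconciliation job.

```python
class Inventory:
    """Hypothetical store: reservations are accepted first, reconciled later."""

    def __init__(self, stock: dict[str, int]) -> None:
        self.stock = stock
        # reservation_id -> (sku, quantity)
        self.reservations: dict[str, tuple[str, int]] = {}

    def reserve(self, reservation_id: str, sku: str, qty: int) -> None:
        # Idempotent: replaying the same reservation_id changes nothing.
        self.reservations.setdefault(reservation_id, (sku, qty))

    def oversold(self) -> dict[str, int]:
        # Reconciliation: report how far reservations exceed stock, per SKU.
        reserved: dict[str, int] = {}
        for sku, qty in self.reservations.values():
            reserved[sku] = reserved.get(sku, 0) + qty
        return {
            sku: total - self.stock.get(sku, 0)
            for sku, total in reserved.items()
            if total > self.stock.get(sku, 0)
        }
```

Anything oversold reports still needs a business decision, such as backordering or cancelling the newest reservation, which is the manual intervention cost noted above.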
Open Questions and FAQ
Can I use sagas with a relational database?
Yes, but you need to store saga state in a durable store, typically a database table. Each saga instance has a row that tracks its current step and status. The saga coordinator reads and updates this table on each step. This works well with relational databases, but you must ensure the saga state table is highly available and not a bottleneck.
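As a rough illustration, the state table can be as small as the one below, shown with sqlite3 for convenience. The table name, column names, and status values are assumptions, and the conditional UPDATE is one possible way to stop two coordinators from advancing the same saga.

```python
import sqlite3

saga_db = sqlite3.connect(":memory:")
saga_db.execute("""
    CREATE TABLE saga_instance (
        saga_id TEXT PRIMARY KEY,
        current_step INTEGER NOT NULL,
        status TEXT NOT NULL,
        updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")


def start_saga(saga_id: str) -> None:
    with saga_db:
        saga_db.execute(
            "INSERT INTO saga_instance (saga_id, current_step, status) "
            "VALUES (?, 0, 'RUNNING')",
            (saga_id,),
        )


def advance(saga_id: str, expected_step: int) -> bool:
    # Compare-and-set on current_step: only one caller can move the saga forward.
    with saga_db:
        cursor = saga_db.execute(
            "UPDATE saga_instance "
            "SET current_step = current_step + 1, updated_at = CURRENT_TIMESTAMP "
            "WHERE saga_id = ? AND current_step = ? AND status = 'RUNNING'",
            (saga_id, expected_step),
        )
    return cursor.rowcount == 1
```

The updated_at column is what a stalled-saga check would read, and the primary key on saga_id keeps a duplicate start request from creating a second instance.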
How do I test compensations?
Compensations should be tested with the same rigor as forward actions. Use integration tests that simulate failures at every step and verify that the system ends in a consistent state. Property-based testing can help generate random failure sequences. Also, test that compensations are idempotent by running them twice and checking that the second run has no effect.
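The "fail at every step, compensate twice" idea can be sketched as a small harness. The three step names and the dictionary used as system state are placeholders; a real test would drive actual services or their fakes.

```python
def make_steps(state: dict[str, int]):
    # Each step is (name, action, compensation) over a shared state dict.
    # Compensations assign absolute values, so repeating them is harmless.
    def step(key: str):
        def action() -> None:
            state[key] = 1

        def compensate() -> None:
            state[key] = 0

        return key, action, compensate

    return [step("order"), step("payment"), step("inventory")]


def run_failing_at(fail_index: int) -> dict[str, int]:
    """Run the saga, simulate a failure at fail_index, return the final state."""
    state = {"order": 0, "payment": 0, "inventory": 0}
    completed = []
    for i, (_, action, compensate) in enumerate(make_steps(state)):
        if i == fail_index:
            for undo in reversed(completed):
                undo()
                undo()  # second run checks idempotency
            break
        action()
        completed.append(compensate)
    return state
```

Looping fail_index over every position and asserting that the state returns to all zeros covers each failure point once; property-based tools can then extend the same harness to random failure sequences.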
What is the best way to handle timeouts in sagas?
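Set a timeout for each step in the saga. If a step does not respond within the timeout, treat it as a failure and run compensations for all completed steps. Because a timed-out step may still have succeeded on the remote side, run its compensation as well, and make sure that compensation is safe to run when the step never took effect. The timeout should be generous enough to avoid false positives but short enough to keep the system responsive. Use a timeout scheduler that checks for stalled sagas periodically.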
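The periodic check itself can be very small. In this sketch the SagaRecord fields and the find_stalled function are hypothetical, and the current time is passed in as an argument so the logic stays easy to test.

```python
from dataclasses import dataclass


@dataclass
class SagaRecord:
    saga_id: str
    status: str
    step_started_at: float  # seconds, on the same clock as `now`


def find_stalled(records: list[SagaRecord], now: float, step_timeout_s: float) -> list[str]:
    # A saga is stalled if it is still running and its current step is overdue.
    return [
        record.saga_id
        for record in records
        if record.status == "RUNNING" and now - record.step_started_at > step_timeout_s
    ]
```

A scheduler would call this on an interval, for example with time.time() as now, and hand each returned ID to the compensation path.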
Is two-phase commit still relevant in 2025?
Yes, but its role has narrowed. It is best suited for short-lived, high-value transactions where atomicity is non-negotiable, such as financial transfers, inventory reservations, and multi-record updates in distributed databases. For most other cases, sagas or eventual consistency patterns are a better fit.
How do I choose between choreography and orchestration for sagas?
Use choreography when the workflow is linear and the number of participants is small (fewer than five). Use orchestration when the workflow has conditional branches, retries, or complex error handling. Orchestration is easier to monitor and debug, but it introduces a single point of coordination. Choreography is more decentralized but harder to trace.
Can I mix 2PC and sagas in the same workflow?
Yes, and this is often the best approach. Use 2PC for the critical subset of steps that require atomicity, and wrap the entire workflow in a saga that manages the non-critical steps. The saga coordinator can trigger a 2PC transaction for the critical steps and then continue with the saga steps. This hybrid approach gives you the best of both worlds if implemented carefully.
What should I do if a compensation fails?
A failed compensation should be retried with exponential backoff. If it continues to fail after a maximum number of retries, the saga should be marked as 'failed' and an alert should be sent to an operator. The operator can then manually resolve the inconsistency. To minimize manual intervention, design compensations to be as robust as possible and include fallback actions.
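A minimal sketch of that retry policy follows. The function name, the default limits, and the injectable sleep parameter are assumptions; the return value "failed" stands in for marking the saga and alerting an operator.

```python
import time
from typing import Callable


def retry_compensation(
    compensate: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(max_attempts):
        try:
            compensate()
            return "compensated"
        except Exception:
            # Back off exponentially, but do not sleep after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return "failed"  # caller marks the saga failed and alerts an operator
```

Passing sleep as a parameter keeps tests fast, and production code would usually add random jitter to the delay so that many failing sagas do not retry in lockstep.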