The Tempox Workflow Compass: Comparing Retry vs. Compensation Strategies

Introduction: Navigating Failure in Workflow Orchestration

Every workflow eventually encounters failure—a network timeout, a service rejection, a data validation error. How you respond determines whether your system degrades gracefully or collapses under pressure. Two fundamental strategies—retry and compensation—form the core of failure handling in workflow orchestration. Yet many teams apply them inconsistently, leading to brittle systems that either retry endlessly or fail without cleanup. This guide provides a conceptual compass, helping you decide when to retry (attempt the same operation again) versus when to compensate (undo already-completed steps). We focus on the underlying principles rather than tool-specific syntax, making this framework applicable across platforms like Temporal, Camunda, AWS Step Functions, and custom orchestrators. By understanding the nature of failures, idempotency constraints, and business consistency requirements, you can design workflows that are both resilient and reliable. This overview reflects widely shared professional practices as of April 2026; verify critical details against current official guidance where applicable.

Core Concepts: Understanding Retry and Compensation

To choose between retry and compensation, you must first understand their core mechanisms. A retry simply re-executes a failed operation, hoping that the underlying issue has resolved. Retries rely on idempotency—repeated execution producing the same result as a single execution. Compensation, in contrast, executes a separate undo action for each completed step, rolling back the workflow to a previous consistent state. This section defines these concepts, explains why they work, and highlights when each is appropriate. We also address common misconceptions, such as the belief that retries are always safer or that compensation eliminates all side effects. By grasping the foundational differences, you can apply these strategies with clarity and confidence.

How Retries Work: Transparency and Safety Nets

A retry is the simplest failure-handling mechanism. When an activity fails (e.g., a database write times out), the orchestrator schedules another attempt, usually with exponential backoff to avoid overwhelming the service. For retries to be safe, the operation must be idempotent: executing it twice must have the same effect as executing it once. For example, setting a user's email address to a given value is idempotent, because repeating the write doesn't change the outcome. In contrast, incrementing a counter is not idempotent without careful design. Teams often implement retries without verifying idempotency, leading to duplicated charges, duplicate orders, or data corruption. A common pattern is to assign a unique idempotency key to each request, allowing the downstream service to deduplicate. When designing workflows, always ask: 'If this operation runs twice, will the system remain consistent?' If the answer is yes, retries are a viable option. If no, you must either redesign the operation to be idempotent or use compensation instead. Retries are best suited for transient failures (network glitches, temporary service unavailability, or resource contention) where the underlying issue is likely to resolve quickly.
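As a minimal sketch of the retry loop described above, the following Python shows one way an orchestrator might re-run an operation with exponential backoff. The names `TransientError` and `retry_with_backoff` are hypothetical and not taken from any particular engine; the `sleep` parameter is injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delay)         # wait before the next attempt
            delay *= multiplier  # each wait is longer than the previous one
    raise AssertionError("unreachable")
```

Any exception other than `TransientError` propagates immediately, so the loop never retries an error it was not told is safe to retry. The operation passed in is assumed to be idempotent.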

Compensation: Undoing Completed Work

Compensation is the opposite of retry: rather than re-attempting a failed step, you undo any steps that have already succeeded, returning the workflow to a consistent state. This is essential for non-idempotent operations or when a failure violates business rules. For instance, consider an e-commerce workflow: charge the customer, reserve inventory, ship the item. If inventory reservation fails, you must refund the charge, which is a compensation action. Compensation actions must be designed to be reliable and should be idempotent themselves, because they may be retried if they fail. A key design principle is that compensation should leave the system in a state indistinguishable from before the workflow started, as far as possible. However, some side effects (like sending an email) cannot be fully undone; in such cases, compensation might involve sending a corrective email. Compensation is central to the Saga pattern, where each step has a corresponding compensating action. Workflow engines support it in different ways: Camunda lets you model undo logic declaratively with BPMN compensation events, while in Temporal compensation is written in workflow code (the Java SDK provides a Saga helper class). When designing compensation, consider the business impact: is it acceptable to have a short window of inconsistency? Often, compensation is executed asynchronously, meaning that users may briefly see an inconsistent state until the compensation completes. This trade-off is acceptable for many business processes but must be communicated to stakeholders.
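To make the saga idea concrete, here is a small engine-agnostic sketch in Python. `SagaStep` and `run_saga` are illustrative names, not an API from Temporal, Camunda, or any other engine. The sketch assumes every step supplies its own undo function and that compensations themselves succeed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # forward operation
    compensation: Callable[[], None]  # undo for the forward operation


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns True if every step succeeded, False if the saga was rolled back.
    """
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not complete, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True
```

With steps charge, reserve, ship, a failure in reserve triggers only the refund: the failed step and the steps that never ran have nothing to undo.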

When to Retry: Scenarios and Criteria

Retries are the default choice for many teams, but they are not always appropriate. This section defines clear criteria for when retries are the right strategy, based on the nature of the failure, the idempotency of the operation, and the business impact. We also discuss common anti-patterns, such as retrying indefinitely or using retries for business logic errors. By applying these criteria, you can avoid the trap of over-relying on retries and use them only where they add value.

Transient Failures: The Natural Domain of Retries

Transient failures are temporary and self-correcting: network timeouts, database deadlocks, service throttling. In these cases, retries are often the most efficient response because the underlying issue is likely to resolve within seconds. For example, an HTTP 503 (Service Unavailable) response often indicates that a server is temporarily overloaded; retrying with exponential backoff gives it time to recover. A well-designed retry policy includes a maximum number of attempts (e.g., 3-5) and a backoff strategy (exponential with jitter). It also considers the failure type: some errors (like 400 Bad Request) are permanent and should never be retried. Many workflow engines allow you to configure retry policies per activity, specifying which errors are retryable. A common mistake is to retry all errors, including validation failures, which masks underlying issues and wastes resources. Instead, classify failures into retryable and non-retryable. For retryable failures, also consider the operation's impact: if the operation has side effects (like sending an email), retries may cause duplicates unless the downstream service deduplicates requests. In practice, teams often combine retries with idempotency keys to ensure safety. For instance, a payment service might accept a unique idempotency key per request, allowing it to safely process duplicate requests without double-charging. When designing retry logic, always ask: 'What happens if this operation runs again after a previous attempt that appeared to fail had actually succeeded?' The answer should be: 'Nothing harmful.'
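One way to express the retryable versus non-retryable split for HTTP calls is a small classifier like the sketch below. The set of status codes is an assumption chosen for illustration; which codes are actually safe to retry depends on the service you are calling.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"  # candidate for retry
    PERMANENT = "permanent"  # retrying will not help


# Assumed split for illustration: timeouts, throttling, and gateway or
# availability errors. 500 is deliberately left out; whether to retry it
# depends on the service.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})


def classify_http_failure(status: int) -> FailureKind:
    """Classify an HTTP error status as transient or permanent."""
    if status < 400:
        raise ValueError(f"{status} is not an error status")
    if status in RETRYABLE_STATUS:
        return FailureKind.TRANSIENT
    return FailureKind.PERMANENT
```

Defaulting unknown codes to permanent is the conservative choice here: an unrecognized error stops the retry loop instead of being retried blindly.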

Idempotency: The Critical Enabler

Idempotency is the property that an operation can be applied multiple times without changing the result beyond the initial application. In the context of retries, idempotency ensures that re-executing a failed activity does not cause unintended side effects. For example, setting a database record status to 'processed' is idempotent because repeating it leaves the status unchanged. However, appending to a log is not idempotent; each retry adds another entry. To make operations idempotent, you can use mechanisms like: (1) idempotency keys, which are unique tokens that the downstream service uses to deduplicate requests; (2) conditional updates, e.g., SET status = 'processed' WHERE status = 'pending', so that only the first attempt changes the state; (3) upsert semantics: insert if not exists, else update. When designing workflows, build idempotency into every external-facing operation. This not only enables safe retries but also simplifies error handling overall. If you cannot make an operation idempotent, retries are risky, and compensation becomes necessary. For instance, a money transfer operation (debit from account A, credit to account B) is not idempotent because repeating it would duplicate the transfer. In such cases, you must either use a two-phase commit (which has its own problems) or implement a compensation that reverses the first part if the second fails. The choice between retry and compensation often starts with a single question: 'Is the operation idempotent?' If yes and the failure is transient, retry. If no, make the operation idempotent or compensate.
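The conditional-update technique can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the helper names are assumptions made for this example.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one pending order (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO orders (id, status) VALUES (1, 'pending')")
    conn.commit()
    return conn


def mark_processed(conn: sqlite3.Connection, order_id: int) -> bool:
    """Move an order from 'pending' to 'processed'.

    Returns True only if this call changed the row, so a retried call is a
    harmless no-op that reports False.
    """
    cur = conn.execute(
        "UPDATE orders SET status = 'processed' WHERE id = ? AND status = 'pending'",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the `WHERE` clause requires the old state, a second execution matches no rows, and the return value tells the caller whether it was the attempt that did the work.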

When to Compensate: Scenarios and Criteria

Compensation is the preferred strategy when an operation is not idempotent or when a failure cannot be resolved by retrying. This section outlines the specific scenarios where compensation is necessary, including business rule violations, non-idempotent side effects, and long-running workflows. We also discuss how to design effective compensation actions and common pitfalls.

Non-idempotent Operations and Business Rules

Any operation that produces a unique effect per execution—such as sending a one-time coupon, creating a unique record, or incrementing a counter—is non-idempotent. Retrying such operations would create duplicates or incorrect state. For example, if a workflow sends a welcome email on user registration, retrying the email send would result in multiple emails. Instead, the workflow should either avoid retrying (by marking the email as sent before attempting) or compensate by sending a follow-up email acknowledging the duplicate. Business rule violations also necessitate compensation. For instance, a travel booking workflow might reserve a hotel room and then book a flight; if the flight booking fails because the customer's credit card is declined, the hotel reservation must be cancelled—a compensating action. Compensation is also used in sagas where each step has a defined undo. The key design principle is that compensation must be idempotent itself, because it may be retried if it fails. For example, cancelling a reservation multiple times should be safe (e.g., the second cancellation returns success without error). When designing compensation, consider the order of execution: compensations should run in reverse order of the original steps. Also, consider the timing: some compensations may need to be delayed (e.g., if the original step has a side effect that takes time to propagate). In practice, compensation often involves calling an API that reverses the previous action, such as refunding a payment or releasing a hold. Teams should test compensation paths thoroughly, as they are often overlooked during happy-path development.
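The requirement that a compensation be safe to run more than once can be illustrated with an in-memory stand-in for a booking API. `ReservationService` and its methods are hypothetical; a real service would persist this state.

```python
from typing import Set


class ReservationService:
    """In-memory stand-in for a hypothetical booking API."""

    def __init__(self, rooms_available: int) -> None:
        self.rooms_available = rooms_available
        self._active: Set[str] = set()

    def reserve(self, reservation_id: str) -> None:
        if reservation_id in self._active:
            return  # same reservation requested again: nothing to do
        if self.rooms_available == 0:
            raise RuntimeError("no rooms available")
        self.rooms_available -= 1
        self._active.add(reservation_id)

    def cancel(self, reservation_id: str) -> bool:
        """Compensation for reserve(); never fails on a repeated call.

        Returns True only if this call actually released a room.
        """
        if reservation_id not in self._active:
            return False  # already cancelled (or never reserved): succeed quietly
        self._active.remove(reservation_id)
        self.rooms_available += 1
        return True
```

A second `cancel` call returns without raising and without releasing a second room, which is what lets the orchestrator retry the compensation freely.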

Long-Running Workflows and Human Intervention

In long-running workflows that span hours or days, failures may occur long after a step completed, making retries impractical. For example, a loan approval workflow might check credit, verify employment, and then underwrite the loan. If underwriting fails a week later, you cannot simply retry the earlier steps; instead, you must compensate by notifying the customer and cancelling any pending actions. Compensation in such cases often involves manual intervention—a human operator reviewing the situation and executing corrective actions. Workflow engines typically support 'compensation scopes' that define which steps are undone together. For long-running workflows, consider the 'compensating transaction' pattern where each step's compensation is stored in a persistent log. If the workflow fails, the engine executes the compensations in reverse order. One challenge is that some compensations may fail (e.g., refund API is down); in such cases, the workflow enters a 'compensation failed' state that requires human intervention. Teams should design dashboards and alerts for such scenarios. Another consideration is that compensation may have business cost—refunding a customer, for instance, creates a negative financial impact. Therefore, compensation should be a deliberate choice, not a default. Use retries first for transient issues, and escalate to compensation only when retries are exhausted or inappropriate. The decision often involves trade-offs between consistency and cost: compensating earlier steps is expensive but maintains data integrity, while leaving incomplete workflows may cause customer frustration. There's no one-size-fits-all answer; each workflow must be analyzed in its business context.
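A persistent compensation log might look like the sketch below, which appends one JSON line per completed step and replays the entries newest-first when the workflow fails. The file format, function names, and handler registry are all assumptions for illustration; a production engine would use durable storage with stronger guarantees than a local file.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# A handler performs one kind of compensation, given the logged arguments.
Handler = Callable[[dict], None]


def record_compensation(log_path: Path, name: str, args: dict) -> None:
    """Append one compensation entry after its forward step has completed."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"name": name, "args": args}) + "\n")


def run_compensations(log_path: Path, handlers: Dict[str, Handler]) -> List[dict]:
    """Replay logged compensations in reverse order.

    Returns the entries whose compensation raised, so they can be routed
    to a human operator.
    """
    if not log_path.exists():
        return []
    lines = log_path.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line]
    failed: List[dict] = []
    for entry in reversed(entries):
        try:
            handlers[entry["name"]](entry["args"])
        except Exception:
            failed.append(entry)  # keep going; this one needs manual review
    return failed
```

Continuing past a failed compensation is a design choice in this sketch: it undoes as much as possible and reports the remainder, instead of stopping at the first error.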

Comparison Table: Retry vs. Compensation

Dimension | Retry | Compensation
Objective | Complete the same operation successfully | Undo previously completed operations
When used | Transient failures, idempotent operations | Non-idempotent operations, business rule violations, retries exhausted
Complexity | Low to moderate (backoff, max attempts) | High (define per-step undo logic, handle partial failures)
Idempotency requirement | Operation must be idempotent (or use idempotency keys) | Compensation action should be idempotent
Resource cost | Low (additional attempts) | Higher (additional calls to undo services)
Business impact | Minimal if idempotent; duplicates otherwise | Can be disruptive (e.g., refunds, cancellations) but preserves consistency
Failure mode | Retries exhausted → workflow fails unresolved | Compensation fails → manual intervention needed
Tool support | Built-in in most workflow engines (e.g., Temporal, Camunda) | Built-in in engines that support Sagas (e.g., Temporal, Camunda, Axon)

Step-by-Step Guide: Designing Your Failure Strategy

Designing a failure strategy requires a structured approach. This step-by-step guide walks you through analyzing your workflow, classifying operations, and deciding between retry and compensation. By following these steps, you can create a consistent and resilient system.

Step 1: Classify Each Operation's Idempotency

For each operation in your workflow, determine if it is idempotent. If the operation is a read or a deterministic write (e.g., 'set status to X'), it is likely idempotent. If it creates unique resources (e.g., 'insert new order') or has side effects (e.g., 'send email'), it is not. Document your findings in a table. For non-idempotent operations, consider redesigning them to be idempotent (e.g., using idempotency keys). If redesign is not feasible, plan for compensation.

Step 2: Identify Transient vs. Permanent Failures

Classify each failure mode as transient (network timeout, 503) or permanent (validation error, 400). Transient failures are candidates for retry; permanent failures should trigger compensation or workflow failure. Also consider partial failures—where some steps succeeded before a failure—as these require compensation for the completed steps.

Step 3: Define Retry Policies

For operations that are idempotent and subject to transient failures, define a retry policy: initial interval (e.g., 1 second), backoff multiplier (e.g., 2), maximum interval (e.g., 60 seconds), and maximum attempts (e.g., 5). Include jitter to prevent thundering herd. Ensure that the policy respects downstream service limits (e.g., rate limits).
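The policy parameters listed above can be captured in a small value object. This is a sketch with hypothetical names, not the configuration schema of any specific engine; it uses "full jitter" (a uniform random wait between zero and the backoff ceiling) as one common jitter variant.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RetryPolicy:
    initial_interval: float = 1.0    # seconds before the first retry
    backoff_multiplier: float = 2.0  # growth factor between retries
    max_interval: float = 60.0       # cap on any single wait, in seconds
    max_attempts: int = 5            # total attempts, including the first

    def backoff_ceiling(self, retry_number: int) -> float:
        """Un-jittered wait before retry `retry_number` (1-based), capped."""
        raw = self.initial_interval * self.backoff_multiplier ** (retry_number - 1)
        return min(raw, self.max_interval)

    def delay(self, retry_number: int, rng: Optional[random.Random] = None) -> float:
        """Full-jitter wait: uniform between 0 and the backoff ceiling."""
        rng = rng or random.Random()
        return rng.uniform(0.0, self.backoff_ceiling(retry_number))

    def schedule(self) -> List[float]:
        """Backoff ceilings for every retry the policy allows."""
        return [self.backoff_ceiling(n) for n in range(1, self.max_attempts)]
```

With the example values from the text, five attempts allow four retries with ceilings of 1, 2, 4, and 8 seconds, so the 60-second cap only comes into play for policies with more attempts.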

Step 4: Define Compensation Actions

For each non-idempotent operation, define a compensating action that reverses its effect. Ensure the compensation is idempotent and reliable. For example, if the operation is 'charge credit card', the compensation is 'refund charge'. Test compensation paths thoroughly, including scenarios where the compensation itself fails.

Step 5: Implement and Monitor

Implement the retry and compensation logic in your workflow engine. Monitor retry attempts, compensation executions, and failure rates. Set up alerts for workflows that enter compensation-failed states. Regularly review and adjust policies based on observed failure patterns.

Real-World Example: E-Commerce Order Workflow

Consider a typical e-commerce order workflow: (1) validate payment, (2) reserve inventory, (3) ship order, (4) send confirmation email. Each step has different failure characteristics. This example illustrates how retry and compensation can be applied together.

Step 1: Validate Payment

This step involves calling a payment gateway. If the gateway returns a transient error (e.g., timeout), retry with exponential backoff (up to 3 times). If the payment is declined (permanent error), do not retry; instead, cancel the order and notify the customer. The payment validation itself is idempotent if we use an idempotency key (e.g., order ID).

Step 2: Reserve Inventory

Reserving inventory is idempotent only if the reservation is keyed by order ID, so that a repeated request for the same order does not reserve the item twice; an availability check alone does not prevent a double reservation. With that in place, transient failures (e.g., database timeout) can be retried. However, if inventory is insufficient (business rule violation), the workflow must compensate by cancelling the payment (step 1's compensation). This is a classic compensation scenario.

Step 3: Ship Order

Shipping is non-idempotent; you cannot ship the same order twice. If the shipping service fails transiently, you can retry (with idempotency key). If it fails permanently (e.g., invalid address), you must compensate: refund payment and release inventory reservation. Note that releasing inventory is compensation for step 2, and refunding is compensation for step 1.

Step 4: Send Confirmation Email

Email sending is non-idempotent; retrying would send multiple emails. Instead, keep a per-order 'confirmation sent' record in the database, check it before every attempt, and set it once the email service accepts the message. If the email service fails transiently, you can retry, since the record turns any attempt after a confirmed send into a no-op. (Setting the flag before the call would prevent duplicates but would also cause a retry to skip an email that was never sent.) If it fails permanently, you may accept the inconsistency (email not sent) or compensate by sending a notification via another channel. This example shows how compensation can be used selectively, and how idempotency design reduces the need for compensation.

Common Pitfalls and How to Avoid Them

Even experienced teams encounter pitfalls when implementing retry and compensation strategies. This section highlights the most common mistakes and provides guidance on how to avoid them.

Pitfall 1: Retrying Non-idempotent Operations

The most common mistake is retrying operations that are not idempotent, leading to duplicated side effects. For example, retrying a 'charge customer' operation without an idempotency key can double-charge. To avoid this, always use idempotency keys for any operation that could be retried. Alternatively, design operations to be naturally idempotent (e.g., using upsert semantics). If you cannot guarantee idempotency, do not retry; use compensation instead.
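The double-charge scenario and its fix can be shown with an in-memory stand-in for a payment API that deduplicates by key. `PaymentGateway` is hypothetical; real gateways differ in how they accept and expire idempotency keys.

```python
from typing import Dict


class PaymentGateway:
    """In-memory stand-in for a hypothetical payment API that deduplicates by key."""

    def __init__(self) -> None:
        self._charges: Dict[str, int] = {}  # idempotency key -> amount charged
        self.total_charged = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        """Charge once per key; a repeated key returns the original charge ID."""
        if idempotency_key in self._charges:
            if self._charges[idempotency_key] != amount_cents:
                # Same key with different parameters is a caller bug, not a retry.
                raise ValueError("idempotency key reused with a different amount")
            return f"charge-{idempotency_key}"
        self._charges[idempotency_key] = amount_cents
        self.total_charged += amount_cents
        return f"charge-{idempotency_key}"
```

Using a stable value such as the order ID as the key means every retry of the same order maps to the same charge, while a new order gets a new key.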

Pitfall 2: Infinite Retries Without Backoff

Some teams configure retries without a maximum attempt limit or without exponential backoff, causing the system to retry infinitely and overload downstream services. Always set a maximum number of attempts (e.g., 5) and use exponential backoff with jitter. Also, consider circuit breaker patterns to stop retries when a service is consistently failing.
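A circuit breaker can be sketched in a few lines. This is a simplified illustration with hypothetical names and thresholds: it counts consecutive failures, rejects calls while open, and allows a trial call once the reset timeout has passed. The `clock` parameter is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0                       # consecutive failures so far
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each retry attempt in `breaker.call(...)` lets the retry loop stop early when the downstream service is consistently failing, instead of spending its whole attempt budget.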

Pitfall 3: Ignoring Compensation Failures

Compensation actions can themselves fail, leaving the system in an inconsistent state. Teams often overlook this scenario. Design compensation to be as reliable as possible, and implement a manual intervention process for cases where compensation fails. Use dead letter queues or alerts to notify operators.
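One possible shape for handling a failing compensation is to retry it a bounded number of times and then hand it to operators. In this sketch the dead letter queue is just a Python list and the function name is hypothetical; backoff between attempts is omitted for brevity.

```python
from typing import Callable, List


def compensate_or_escalate(
    name: str,
    compensation: Callable[[], None],
    dead_letters: List[dict],
    max_attempts: int = 3,
) -> bool:
    """Try a compensation up to `max_attempts` times.

    Returns True if it succeeded. Otherwise records an entry in
    `dead_letters` for manual intervention and returns False.
    """
    last_error = "not attempted"
    for _ in range(max_attempts):
        try:
            compensation()  # assumed idempotent, so repeating it is safe
            return True
        except Exception as exc:
            last_error = repr(exc)
    dead_letters.append(
        {"compensation": name, "attempts": max_attempts, "error": last_error}
    )
    return False
```

An alerting hook would typically watch the dead letter store, so that a stuck refund or cancellation is seen by a person instead of silently leaving the system inconsistent.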

Pitfall 4: Mixing Retry and Compensation Incorrectly

Some workflows attempt to retry after compensation has started, leading to confusion. Define clear boundaries: retry only before any compensation has been executed. Once compensation begins, do not retry the original operation. Workflow engines typically enforce this automatically, but custom orchestrators may need explicit guards.
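For a custom orchestrator, an explicit guard could be as simple as tracking the workflow phase and refusing retries once compensation has started. The class and method names below are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "running"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"


class RetryNotAllowed(Exception):
    """Raised when a retry is requested after compensation has begun."""


class WorkflowState:
    def __init__(self) -> None:
        self.phase = Phase.RUNNING

    def begin_compensation(self) -> None:
        # Idempotent: calling this again while compensating changes nothing.
        if self.phase is Phase.RUNNING:
            self.phase = Phase.COMPENSATING

    def finish_compensation(self) -> None:
        if self.phase is not Phase.COMPENSATING:
            raise RuntimeError("compensation has not started")
        self.phase = Phase.COMPENSATED

    def check_retry_allowed(self, step: str) -> None:
        """Call before every retry of a forward step."""
        if self.phase is not Phase.RUNNING:
            raise RetryNotAllowed(f"cannot retry {step!r} while {self.phase.value}")
```

The phase only moves forward (running, compensating, compensated), so once rollback begins there is no path back to retrying the original operation.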

FAQ: Retry vs. Compensation

Q: Can I use both retry and compensation for the same step?
A: Yes. Typically, you retry a step first (for transient failures), and if retries are exhausted, you trigger compensation for all previously completed steps. This is a common pattern in sagas.

Q: How do I decide the maximum number of retries?
A: Consider the business impact and the typical duration of transient failures. For example, if a service usually recovers within 5 seconds, 3 retries with exponential backoff (waits of 1s, 2s, and 4s, about 7 seconds in total) should suffice. Avoid retrying beyond the user's patience threshold (e.g., 30 seconds for synchronous requests).

Q: What if compensation fails? Should I retry it?
A: Yes, compensation actions should be retried with the same principles as normal operations. They should be idempotent and resilient. If retries are exhausted, escalate to manual intervention.

Q: Is compensation always required for non-idempotent operations?
A: Not always. If the non-idempotent operation is the last step and its failure can be tolerated (e.g., sending a notification), you may choose to skip compensation and log the failure. However, for operations that affect data consistency (e.g., payments), compensation is essential.
