· 5 min read ·
Salesforce · Integration Architecture · Apex · Reliability

Building Resilient Salesforce Integrations: Failure Modes and Recovery Patterns

How to design Salesforce integrations that survive network failures, API timeouts, governor limits, and partial failures - with concrete Apex patterns for retry, idempotency, and dead letter queues.


Integrations fail. Networks partition, APIs go down, governor limits get hit at 3am when nobody is watching. The difference between a fragile integration and a resilient one is not whether it fails - it’s whether the failure is recoverable. Here are the patterns that apply across large-scale Salesforce integration projects.

The Failure Taxonomy

Before designing for resilience, categorise your failure modes:

Transient failures - temporary, self-resolving: network blip, 429 rate limit, 503 service unavailable. Correct response: retry with backoff.

Permanent failures - not going to resolve themselves: invalid data, authentication failure, business rule rejection (duplicate record, invalid field value). Correct response: route to dead letter queue and alert.

Partial failures - some records succeed, some fail within a batch. The most insidious type - easy to miss because the operation “succeeded” from a monitoring perspective. Correct response: inspect and handle each record’s result individually.

Governor limit failures - Salesforce-specific: too many SOQL queries, heap size exceeded, callout limit reached. Correct response: redesign the architecture, not just retry.
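
This taxonomy can be encoded directly, so every callout path shares one classification rule instead of scattering status-code checks. A minimal sketch - the FailureType enum and classifier are illustrative, not part of any standard library:

public enum FailureType { TRANSIENT, PERMANENT }

public class FailureClassifier {
    public static FailureType classify(HttpResponse res) {
        Integer code = res.getStatusCode();
        // 429 and 5xx tend to resolve themselves - retry with backoff
        if (code == 429 || code >= 500) {
            return FailureType.TRANSIENT;
        }
        // Remaining 4xx are bad requests, auth failures, business
        // rejections - retrying will not help; route to the DLQ
        return FailureType.PERMANENT;
    }
}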

Retry with Exponential Backoff

Never retry immediately on failure. Immediate retries amplify load on already-struggling systems.

public class RetryableCallout {
    private static final Integer MAX_RETRIES = 3;
    private static final Integer BASE_DELAY_MS = 1000;

    public static HttpResponse callWithRetry(HttpRequest req) {
        Integer attempt = 0;
        Exception lastException = null;

        while (attempt < MAX_RETRIES) {
            try {
                HttpResponse res = new Http().send(req);

                if (res.getStatusCode() == 429 || res.getStatusCode() >= 500) {
                    // Transient - hand off to an async retry with backoff
                    Integer delayMs = BASE_DELAY_MS * Math.pow(2, attempt).intValue();
                    // Apex has no Thread.sleep - schedule a retry job instead
                    scheduleRetry(req, attempt + 1, delayMs);
                    return null;
                }

                return res; // success or permanent failure (4xx)

            } catch (CalloutException e) {
                lastException = e;
                attempt++;
            }
        }

        // All retries exhausted
        routeToDeadLetterQueue(req, lastException != null
            ? lastException.getMessage() : 'Unknown callout failure');
        return null;
    }
}

Apex has no Thread.sleep, so true backoff requires a Queueable chain or a scheduled job. Use a RetryJob__c custom object to store each pending retry with its due timestamp.
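
The scheduled-retry hand-off can be sketched as follows, assuming a RetryJob__c object with Endpoint__c, Payload__c, AttemptCount__c and NextAttemptAt__c fields (names are illustrative), plus a RetryProcessor Queueable that re-sends the stored requests:

public static void scheduleRetry(HttpRequest req, Integer attempt, Integer delayMs) {
    // Persist the retry - it survives the current transaction
    insert new RetryJob__c(
        Endpoint__c = req.getEndpoint(),
        Payload__c = req.getBody(),
        AttemptCount__c = attempt,
        NextAttemptAt__c = Datetime.now().addSeconds(delayMs / 1000)
    );
}

// A scheduled sweep (e.g. every 5 minutes) re-enqueues due retries
public class RetrySweeper implements Schedulable {
    public void execute(SchedulableContext ctx) {
        List<RetryJob__c> due = [
            SELECT Id, Endpoint__c, Payload__c, AttemptCount__c
            FROM RetryJob__c
            WHERE NextAttemptAt__c <= :Datetime.now()
            LIMIT 50
        ];
        if (!due.isEmpty()) {
            System.enqueueJob(new RetryProcessor(due));
        }
    }
}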

Idempotency Keys

If a callout succeeds but the response doesn’t reach Salesforce (network failure between the external system sending 200 and Salesforce receiving it), a naive retry sends the same request twice. You need idempotency.

Pattern: include a stable, deterministic idempotency key with every mutating request.

public static String buildIdempotencyKey(Id recordId, String operation, Datetime timestamp) {
    // Same record + operation + timestamp = same key = safe to retry
    String raw = recordId + ':' + operation + ':' + timestamp.format('yyyyMMddHH');
    return EncodingUtil.convertToHex(
        Crypto.generateDigest('SHA-256', Blob.valueOf(raw))
    ).left(32);
}

// In the callout
req.setHeader('Idempotency-Key', buildIdempotencyKey(order.Id, 'create', order.CreatedDate));

The external system uses this key to detect duplicate requests and return the original response without processing twice. Ensure your integration partners support idempotency keys - this is a requirement to specify in integration contracts, not an assumption.

The Outbox Pattern

The most reliable pattern for Salesforce-to-external integrations: don’t call the external API in the same transaction that creates/updates the Salesforce record. Use an outbox.

Record saved ──► OutboxEvent__e fired ──► Platform Event Trigger ──► Callout Queueable

// In the trigger or flow
OutboxEvent__e event = new OutboxEvent__e(
    RecordId__c = order.Id,
    Payload__c = JSON.serialize(order),
    Operation__c = 'CREATE_ORDER',
    CreatedTimestamp__c = Datetime.now().getTime()
);
EventBus.publish(event);

// Platform Event trigger → Queueable
trigger OutboxEventTrigger on OutboxEvent__e (after insert) {
    List<OutboxMessage> messages = new List<OutboxMessage>();
    for (OutboxEvent__e event : Trigger.new) {
        messages.add(new OutboxMessage(event));
    }
    System.enqueueJob(new OutboxProcessor(messages));
}

Benefits:

  • The Salesforce record is saved regardless of external API availability
  • The Queueable handles the callout asynchronously with proper retry logic
  • Platform Events provide at-least-once delivery guarantee
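
At-least-once delivery also means duplicates are possible, which is where the idempotency key earns its keep. A sketch of the OutboxProcessor Queueable referenced above - OutboxMessage, the Named Credential and the endpoint path are illustrative:

public class OutboxProcessor implements Queueable, Database.AllowsCallouts {
    private List<OutboxMessage> messages;

    public OutboxProcessor(List<OutboxMessage> messages) {
        this.messages = messages;
    }

    public void execute(QueueableContext ctx) {
        for (OutboxMessage msg : messages) {
            HttpRequest req = new HttpRequest();
            req.setEndpoint('callout:External_API/orders'); // Named Credential
            req.setMethod('POST');
            req.setBody(msg.payload);
            // Duplicate deliveries carry the same key, so the
            // receiver can de-duplicate safely
            req.setHeader('Idempotency-Key', msg.idempotencyKey);

            HttpResponse res = new Http().send(req);
            if (res.getStatusCode() == 429 || res.getStatusCode() >= 500) {
                scheduleRetry(req, 1, 1000); // transient - back off and retry
            }
        }
    }
}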

Handling Partial Batch Failures

When sending a batch to an external API, you’ll often get back partial success:

{
  "results": [
    {"id": "order-001", "status": "success"},
    {"id": "order-002", "status": "error", "message": "duplicate order"},
    {"id": "order-003", "status": "success"}
  ]
}

Never treat this as a success. Never treat it as a failure. Process each result individually:

List<Id> successIds = new List<Id>();
List<Id> failedIds = new List<Id>();

for (IntegrationResult result : responseResults) {
    if (result.status == 'success') {
        successIds.add(result.salesforceId);
    } else if (isPermanentFailure(result.message)) {
        failedIds.add(result.salesforceId);
        logPermanentFailure(result);
    } else {
        // Transient - re-queue for retry
        requeueForRetry(result);
    }
}

// Update sync status on every processed record
updateSyncStatus(successIds, 'Synced');
updateSyncStatus(failedIds, 'Failed');

Governor Limit Management

Governor limits are Salesforce’s way of enforcing multi-tenancy. Fighting them is futile; working with them is the design goal.

SOQL in loops is the most common violation. Solution: collect IDs, query once, use maps.
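
The standard fix, sketched for a contact-to-account lookup (the objects and fields here are just for illustration):

// Anti-pattern: one query per record - hits the 100-SOQL limit fast
// for (Contact c : contacts) {
//     Account a = [SELECT Name FROM Account WHERE Id = :c.AccountId];
// }

// Bulkified: collect IDs, query once, look up via map
Set<Id> accountIds = new Set<Id>();
for (Contact c : contacts) {
    accountIds.add(c.AccountId);
}
Map<Id, Account> accountsById = new Map<Id, Account>(
    [SELECT Id, Name FROM Account WHERE Id IN :accountIds]
);
for (Contact c : contacts) {
    Account a = accountsById.get(c.AccountId);
    // ... use a
}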

Callout limits (100 per transaction) break batch scenarios. Solution: bulkify callouts using batch APIs. If the external system doesn’t support batch, use a Queueable chain that processes one record per execution.

Heap size limits (6MB sync, 12MB async) bite JSON serialisation of large datasets. Solution: process in chunks. Never load all records into memory at once.

// Chunked processing Queueable
public class ChunkedProcessor implements Queueable, Database.AllowsCallouts {
    private List<Id> recordIds;
    private Integer offset;
    private static final Integer CHUNK_SIZE = 50;

    public ChunkedProcessor(List<Id> recordIds, Integer offset) {
        this.recordIds = recordIds;
        this.offset = offset;
    }

    public void execute(QueueableContext ctx) {
        // Apex's List has no subList() - copy the chunk manually
        List<Id> chunk = new List<Id>();
        Integer chunkEnd = Math.min(offset + CHUNK_SIZE, recordIds.size());
        for (Integer i = offset; i < chunkEnd; i++) {
            chunk.add(recordIds[i]);
        }
        processChunk(chunk);

        // Chain the next chunk - each transaction gets fresh limits
        if (offset + CHUNK_SIZE < recordIds.size()) {
            System.enqueueJob(new ChunkedProcessor(recordIds, offset + CHUNK_SIZE));
        }
    }
}

Observability: Knowing When Things Break

A resilient integration without observability is just a silent one. Instrument every integration path:

public static void logIntegrationEvent(
    String integrationName,
    String operation,
    String status,
    String errorMessage,
    Integer durationMs
) {
    Integration_Log__c log = new Integration_Log__c(
        Integration_Name__c = integrationName,
        Operation__c = operation,
        Status__c = status,
        Error_Message__c = errorMessage?.left(255),
        Duration_Ms__c = durationMs,
        Timestamp__c = Datetime.now()
    );
    // Use future/queueable to avoid DML limits in callout context
    insertLogAsync(JSON.serialize(log));
}
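
A minimal insertLogAsync, using @future to move the DML into its own transaction (one possible implementation - a Queueable works equally well):

@future
public static void insertLogAsync(String serializedLog) {
    Integration_Log__c log = (Integration_Log__c) JSON.deserialize(
        serializedLog, Integration_Log__c.class
    );
    // Runs in a separate transaction, so the "no DML before
    // callout" restriction in the calling context doesn't apply
    insert log;
}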

Build a dashboard on Integration_Log__c showing:

  • Error rate by integration in the last hour
  • Average latency trend
  • Dead letter queue depth (custom object with unprocessed failed records)

Alert when error rate exceeds 5% for any integration over a 15-minute window.
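
That alert can be a scheduled Apex job over the same log object. A sketch using the fields shown earlier, with a hypothetical notifyOncall helper (email, Slack webhook - your choice); a production version would also group by Integration_Name__c:

public class ErrorRateMonitor implements Schedulable {
    public void execute(SchedulableContext ctx) {
        Datetime windowStart = Datetime.now().addMinutes(-15);
        Integer total = [
            SELECT COUNT() FROM Integration_Log__c
            WHERE Timestamp__c >= :windowStart
        ];
        Integer failed = [
            SELECT COUNT() FROM Integration_Log__c
            WHERE Timestamp__c >= :windowStart AND Status__c = 'Failed'
        ];
        // Alert when the 15-minute error rate crosses 5%
        if (total > 0 && (Decimal) failed / total > 0.05) {
            notifyOncall(failed, total);
        }
    }
}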

Contractual Resilience

Technical patterns only go so far. The other half of integration resilience is contractual:

  • SLA for your integration partner’s API - what’s their uptime commitment? Do they have a status page?
  • API versioning policy - how much notice do they give before deprecating an endpoint?
  • Rate limit documentation - exact limits, burst allowances, headers they return when limiting
  • Sandbox/staging environment - can you test failure scenarios (intentionally unavailable endpoint) without affecting production?

Document these in your integration design documentation and revisit them at each major release. An API you integrated two years ago may have changed its rate limits or versioning policy.
