How to Integrate Third-Party APIs Reliably — Patterns We Use to Handle Failures

Every third-party API integration on this site started the same way: the external service went down, and our application did not handle it gracefully. The payment gateway returned 503, the email provider timed out, the mapping API started throwing 429s — and instead of failing one request, the failure cascaded through thread pools, consumed connections, and took down features that had nothing to do with the failing integration.
That is the problem these third-party API integration reliability patterns solve. Not the happy path, but the unhappy path where a single slow external call threatens your entire application. After enough post-mortems, we settled on seven patterns that every integration in our codebase follows: timeout, retry, circuit breaker, fallback, idempotency, abstraction, and monitoring, plus a testing approach that proves they work. This post covers all of them with NestJS code you can adapt to a project today.
For over 90% of mid-size and large enterprises, a single hour of downtime now costs more than $300,000 (ITIC, 2024). A third-party API that your service calls on every user request can cause that hour of downtime if you treat it as reliable by default. These patterns are the insurance.

The Four Failure Modes
Before the patterns, understand what you are protecting against. Every third-party API integration faces four distinct failure modes:
Timeout. The service does not respond within a reasonable window. The request hangs, holding a connection, a thread, and memory until your server's patience (or resource limit) runs out. This is the most common failure mode and the one that causes the most cascading damage, because a slow service consumes resources without immediately signalling failure.
Error response. The service responds but with a non-2xx status. Some of these are your fault (400 bad request, 401 unauthorized), some are the service's fault (500 internal error, 503 unavailable). The response tells you something is wrong, but it does not tell you whether retrying will help.
Rate limit. The service says "slow down" — usually via a 429 status with a Retry-After header. Unlike a timeout or a 503, this failure mode has a known recovery window: wait the specified duration and try again. Ignoring rate limits gets your API key suspended.
Downtime. The service is completely unreachable — DNS fails, TLS handshake fails, or the TCP connection is refused. No response, no error body, no Retry-After header. Just silence. This is the hardest failure mode to distinguish from a transient network glitch, which is why the patterns below handle it differently than a clear error response.
Each failure mode needs a different response. That is what the seven patterns deliver. (For a deeper look at how this connects to event-driven error handling, our event-driven architecture in NestJS post covers the domain event side of this pattern ecosystem.)
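The four modes can usually be told apart from what the HTTP client hands back. A minimal sketch, assuming an Axios-style failure that exposes an optional HTTP `status` and an optional Node error `code`. The `FailureInfo` shape and the `classifyFailure` name are illustrative and do not come from any library:

```typescript
type FailureMode = 'timeout' | 'error_response' | 'rate_limit' | 'downtime';

// Hypothetical, simplified view of a failed call.
interface FailureInfo {
  status?: number; // HTTP status, present only if the service responded
  code?: string; // Node/Axios error code, e.g. 'ECONNABORTED'
}

const TIMEOUT_CODES = new Set(['ECONNABORTED', 'ETIMEDOUT']);

function classifyFailure(failure: FailureInfo): FailureMode {
  // A 429 is the one response that comes with a known recovery window.
  if (failure.status === 429) return 'rate_limit';
  // Any other status means the service answered, just not with success.
  if (failure.status !== undefined) return 'error_response';
  // No response, but the client reported that it gave up waiting.
  if (failure.code !== undefined && TIMEOUT_CODES.has(failure.code)) return 'timeout';
  // No response and no timeout code: DNS failure, refused connection, TLS error.
  return 'downtime';
}
```

Classifying first keeps the later decisions (retry, wait, trip the breaker, fall back) in one place instead of scattering status checks through the codebase.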
Pattern 1: Timeout Configuration
The simplest pattern with the most impact. Every external HTTP call needs a timeout. Not a generous one — a tight one that fails fast when the service is slow.
In NestJS, the HttpModule from @nestjs/axios wraps Axios, which supports per-request timeouts:
import { HttpService } from '@nestjs/axios';
import { Injectable } from '@nestjs/common';
import { catchError, firstValueFrom, timeout, TimeoutError } from 'rxjs';

@Injectable()
export class ExternalApiService {
  constructor(private readonly httpService: HttpService) {}

  async fetchData(endpoint: string) {
    // firstValueFrom replaces the deprecated toPromise() from RxJS 7 onward.
    return firstValueFrom(
      this.httpService
        // Client-level timeout: Axios aborts the request after 5 seconds.
        .get(endpoint, { timeout: 5000 })
        .pipe(
          // Operation-level timeout: bounds the whole observable.
          timeout(5000),
          catchError((err) => {
            // Axios reports its own timeout as ECONNABORTED.
            if (err instanceof TimeoutError || err.code === 'ECONNABORTED') {
              throw new Error(`API timeout after 5000ms: ${endpoint}`);
            }
            throw err;
          }),
        ),
    );
  }
}

Set timeouts at two levels: the HTTP client timeout (how long to wait for a response) and an overall operation timeout (in case the client timeout is bypassed by streaming or chunked responses). Five seconds is our default for synchronous user-facing calls. Background jobs get a longer leash, fifteen to thirty seconds, because they are not blocking a user request.
A timeout does not solve the failure. It contains it. Without a timeout, a slow API call keeps its socket, its pending promise, and the memory behind them alive indefinitely, and enough of those exhaust your connection pool and memory. With a timeout, the failure is bounded to five seconds and you move on to the next pattern.
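The same operation-level bound can be applied to plain async/await code that does not go through RxJS. A minimal sketch, assuming a runtime with a global `AbortController` (recent Node.js versions). `withDeadline` is an illustrative helper name, not part of NestJS or Axios:

```typescript
// Runs `work` with a hard deadline. The AbortSignal lets the work cancel
// its own I/O (for example by passing it to fetch) when the deadline hits.
async function withDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`Operation timed out after ${ms}ms`));
    }, ms);
  });

  try {
    // Whichever settles first wins; a late result from `work` is ignored.
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

A user-facing caller would pass 5000 here, and a background job something in the fifteen to thirty second range, matching the defaults described above.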
Pattern 2: Retry With Exponential Backoff
When a call times out or returns a 503, the first instinct is to try again immediately. That instinct is wrong. Immediate retries on a struggling service add load at exactly the wrong moment, turning a transient blip into a sustained outage.
Exponential backoff means each successive retry waits longer than the last: 1 second, then 2, then 4, then 8. Add jitter — random variance — to prevent the thundering herd problem where every client retries at the same instant:
import { Injectable } from '@nestjs/common';
import { AxiosError } from 'axios';

interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

@Injectable()
export class RetryService {
  async withRetry<T>(
    fn: () => Promise<T>,
    config: RetryConfig = { maxRetries: 3, baseDelayMs: 1000, maxDelayMs: 10000 },
  ): Promise<T> {
    let lastError: unknown;

    // Attempt 0 is the initial call; attempts 1..maxRetries are retries.
    for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
      try {
        return await fn();
      } catch (err) {
        lastError = err;

        // Client-side errors will not be fixed by retrying.
        if (this.isNonRetryable(err)) {
          throw err;
        }

        if (attempt < config.maxRetries) {
          const delay = this.calculateDelay(attempt, config);
          await this.sleep(delay);
        }
      }
    }

    throw lastError;
  }

  private isNonRetryable(err: unknown): boolean {
    if (err instanceof AxiosError && err.response) {
      const status = err.response.status;
      return [400, 401, 403, 404, 422].includes(status);
    }
    return false;
  }

  private calculateDelay(attempt: number, config: RetryConfig): number {
    // Doubles on every attempt, capped at maxDelayMs.
    const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
    const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
    // Up to one second of random jitter so clients do not retry in lockstep.
    const jitter = Math.random() * 1000;
    return cappedDelay + jitter;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

The isNonRetryable check is critical. Never retry 400, 401, 403, 404, or 422. Those errors are client-side and no number of retries will fix them. Retry on 429 (rate limited), 503 (service unavailable), 504 (gateway timeout), and network-level errors (ECONNRESET, ETIMEDOUT). Note that the check above is a deny-list, so any failure not on it, including a plain 500, is also retried. The distinction between retryable and non-retryable failures is the difference between a resilient integration and one that makes things worse under load.
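A 429 usually arrives with a Retry-After header, and honouring it is better than guessing with backoff alone. A minimal sketch, assuming the header value is passed in as a string. The header may hold either a number of seconds or an HTTP date. `retryAfterMs` and `nextDelayMs` are illustrative helper names, not part of Axios or NestJS:

```typescript
// Returns the wait in milliseconds requested by a Retry-After header,
// or undefined when the header is missing or cannot be parsed.
function retryAfterMs(header: string | undefined, now: Date = new Date()): number | undefined {
  if (!header) return undefined;
  const value = header.trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) {
    return Number(value) * 1000;
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const when = Date.parse(value);
  if (Number.isNaN(when)) return undefined;
  return Math.max(0, when - now.getTime());
}

// Wait at least as long as the server asked, even if backoff alone is shorter.
function nextDelayMs(backoffMs: number, retryAfter: string | undefined, now: Date = new Date()): number {
  return Math.max(backoffMs, retryAfterMs(retryAfter, now) ?? 0);
}
```

On a user-facing path, a hint longer than the request budget is a reasonable signal to stop retrying and fall back instead of holding the request open.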
Pattern 3: Circuit Breaker
Retry handles transient failures. Circuit breaker handles persistent ones — the service that has been returning 503 for five minutes and will keep returning 503 no matter how many times you retry.
The circuit breaker has three states, the same ones the Azure Architecture Center's circuit breaker pattern describes. They are the whole mechanism:
- Closed. Normal operation. Requests pass through. Failures are counted against a threshold.
- Open. The threshold is exceeded. Requests are rejected immediately without calling the service. A fallback runs instead.
- Half-Open. After a cooldown period, one test request is allowed through. Success closes the circuit. Failure reopens it.
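The three states above can be sketched as a small state machine. This is a toy version for illustration only: it trips on consecutive failures rather than a rolling error percentage, and the `SimpleBreaker` class and its method names are made up for this example:

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Toy breaker: trips after `failureThreshold` consecutive failures.
class SimpleBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    // Injectable clock so the cooldown can be tested without waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Call before each request. Returns false when the request must be rejected.
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open'; // cooldown over: let exactly one probe through
      return true;
    }
    return this.state === 'closed';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller checks `allowRequest()` first, runs the fallback when it returns false, and reports the outcome of every real call through `recordSuccess()` or `recordFailure()`.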
In Node.js, the most widely used circuit breaker library is opossum. Here is a NestJS wrapper:
import { Injectable } from '@nestjs/common';
import CircuitBreaker from 'opossum';

interface BreakerOptions {
  timeout: number;
  errorThresholdPercentage: number;
  resetTimeout: number;
  name: string;
}

@Injectable()
export class CircuitBreakerService {
  private breakers = new Map<string, CircuitBreaker>();

  createBreaker<T>(
    // any[] so callers can register functions with typed parameters.
    fn: (...args: any[]) => Promise<T>,
    options: BreakerOptions,
  ): CircuitBreaker {
    const breaker = new CircuitBreaker(fn, {
      timeout: options.timeout,
      errorThresholdPercentage: options.errorThresholdPercentage,
      resetTimeout: options.resetTimeout,
      name: options.name,
    });

    // Log every state transition; these are early warnings.
    breaker.on('open', () =>
      console.warn(`Circuit breaker ${options.name} opened`),
    );
    breaker.on('halfOpen', () =>
      console.log(`Circuit breaker ${options.name} half-open`),
    );
    breaker.on('close', () =>
      console.log(`Circuit breaker ${options.name} closed`),
    );

    this.breakers.set(options.name, breaker);
    return breaker;
  }

  async fire<T>(name: string, ...args: unknown[]): Promise<T> {
    const breaker = this.breakers.get(name);
    if (!breaker) {
      throw new Error(`No circuit breaker registered: ${name}`);
    }
    // The map stores untyped breakers, so the caller supplies the result type.
    return breaker.fire(...args) as Promise<T>;
  }

  getStatus(name: string) {
    return this.breakers.get(name)?.status.stats;
  }
}

Use it to wrap any external API call:
import { HttpService } from '@nestjs/axios';
import { Injectable } from '@nestjs/common';
import { firstValueFrom } from 'rxjs';

@Injectable()
export class PaymentProviderService {
  constructor(
    private readonly circuitBreaker: CircuitBreakerService,
    private readonly httpService: HttpService,
  ) {
    // Register the breaker once, when the service is constructed.
    this.circuitBreaker.createBreaker(
      (chargeId: string) =>
        firstValueFrom(
          this.httpService.get(`https://payment-provider.com/charges/${chargeId}`, {
            timeout: 5000,
          }),
        ),
      {
        timeout: 5000,
        errorThresholdPercentage: 50,
        resetTimeout: 30000,
        name: 'payment-provider',
      },
    );
  }

  async getCharge(chargeId: string) {
    return this.circuitBreaker.fire('payment-provider', chargeId);
  }
}

When fifty percent of requests fail within the window, the breaker trips. All subsequent requests fail immediately: no network call, no connection consumed, no timeout waited. After thirty seconds, one request is allowed through to test recovery. The breaker state machine handles the rest.
Pattern 4: Fallback — What to Do When the Service Is Down
When the circuit breaker is open or retries are exhausted, you cannot return nothing. The user needs a response — even if it is not the ideal one.
A fallback serves a degraded but acceptable response. The most common patterns:
Cached data. Return the last known good response from a cache. The data may be stale, but it is better than an error page. Set a TTL short enough that staleness is bounded (five to fifteen minutes) and long enough that a brief outage does not empty the cache.
async getWeatherData(city: string): Promise<WeatherResponse> {
  try {
    const fresh = await this.weatherApi.fetchWeather(city);
    // Refresh the cache on every successful call (TTL of 300 seconds).
    await this.cacheService.set(`weather:${city}`, fresh, 300);
    return fresh;
  } catch {
    // The live call failed: serve the last known good response if there is one.
    const cached = await this.cacheService.get(`weather:${city}`);
    if (cached) {
      return { ...cached, _stale: true };
    }
    throw new Error('Weather data unavailable');
  }
}

Queue for later. For non-blocking operations (sending emails, generating reports, syncing data), push the request to a queue and process it when the service recovers. The user gets a confirmation that the operation is scheduled, not an error.
Feature degradation. Disable the failing feature explicitly. Show a banner: "Live tracking is temporarily unavailable. Your order is confirmed." The user knows what is happening and is not left staring at a spinner.
The rule: a fallback should never surprise the user with a failure without telling them what happened. A degraded experience with a clear message is better than a 500 error with no context.
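The queue-for-later fallback described above can be sketched with an in-memory queue. This is an illustration under stated assumptions: a production system would use a durable queue so jobs survive a restart, and `DeferredQueue` and its methods are names invented for this example:

```typescript
interface QueuedJob<P> {
  id: number;
  payload: P;
  attempts: number;
}

// In-memory stand-in for a real, durable queue.
class DeferredQueue<P> {
  private jobs: QueuedJob<P>[] = [];
  private nextId = 1;

  // Accept the work now; the caller can confirm "scheduled" to the user.
  enqueue(payload: P): number {
    const id = this.nextId++;
    this.jobs.push({ id, payload, attempts: 0 });
    return id;
  }

  // Try every pending job once. Jobs whose send fails stay queued.
  async drain(send: (payload: P) => Promise<void>): Promise<number> {
    const pending = this.jobs;
    this.jobs = [];
    let delivered = 0;
    for (const job of pending) {
      try {
        await send(job.payload);
        delivered += 1;
      } catch {
        this.jobs.push({ ...job, attempts: job.attempts + 1 });
      }
    }
    return delivered;
  }

  get size(): number {
    return this.jobs.length;
  }
}
```

The request handler calls `enqueue` when the provider is unavailable and returns a "scheduled" confirmation. A background worker calls `drain` periodically, for example whenever the circuit breaker closes again.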

Pattern 5: Idempotency — Safe Retries for Write Operations
Retries for read operations are safe — reading the same data twice is harmless. Retries for write operations (creating a charge, sending an email, updating a record) must be idempotent: executing the same operation multiple times must produce the same result.
Stripe's API popularised the idempotency key pattern, and every write endpoint in a resilient third-party API integration should follow it. Generate a unique key per operation, send it with the request, and the provider deduplicates on their end:
import { HttpService } from '@nestjs/axios';
import { Injectable } from '@nestjs/common';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository } from 'typeorm';

// IdempotencyKey is a TypeORM entity (not shown) with key, response, and
// createdAt columns.
@Injectable()
export class IdempotentApiService {
  constructor(
    private readonly httpService: HttpService,
    @InjectRepository(IdempotencyKey)
    private readonly idempotencyRepo: Repository<IdempotencyKey>,
  ) {}

  async executeWithIdempotency<T>(
    operation: () => Promise<T>,
    idempotencyKey: string,
  ): Promise<T> {
    // Seen this key before: return the stored response instead of re-running.
    const existing = await this.idempotencyRepo.findOneBy({ key: idempotencyKey });
    if (existing) {
      return existing.response as T;
    }

    const response = await operation();

    // Record the result so later retries with the same key are no-ops.
    await this.idempotencyRepo.save({
      key: idempotencyKey,
      response,
      createdAt: new Date(),
    });

    return response;
  }
}

If the external API also supports idempotency keys (Stripe, PayPal, most payment gateways do), pass the key in the request header. If it does not, track idempotency in your own database before calling the external API. The database write and the API call should be in the same transaction or paired with a compensating action.
Without idempotency, retries on write operations create duplicate charges, duplicate emails, and duplicate records. With idempotency, retries become safe — and a retry-safe system is a resilient system.
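Tracking the key before calling the external API means claiming it first, so two concurrent retries cannot both run the operation. A minimal sketch of that claim-then-call flow, using an in-memory map as a stand-in for a table with a unique constraint on the key. `IdempotencyStore` and `runOnce` are illustrative names:

```typescript
type Entry<T> = { state: 'in_progress' } | { state: 'completed'; response: T };

// In-memory stand-in for a table with a unique constraint on `key`.
class IdempotencyStore<T> {
  private entries = new Map<string, Entry<T>>();

  // Returns the existing entry, or undefined if this caller now owns the key.
  claim(key: string): Entry<T> | undefined {
    const existing = this.entries.get(key);
    if (existing) return existing;
    this.entries.set(key, { state: 'in_progress' });
    return undefined;
  }

  complete(key: string, response: T): void {
    this.entries.set(key, { state: 'completed', response });
  }

  release(key: string): void {
    this.entries.delete(key);
  }
}

async function runOnce<T>(
  store: IdempotencyStore<T>,
  key: string,
  operation: () => Promise<T>,
): Promise<T> {
  const existing = store.claim(key);
  if (existing && existing.state === 'completed') return existing.response;
  if (existing) throw new Error(`Operation ${key} is already in progress`);

  try {
    const response = await operation();
    store.complete(key, response);
    return response;
  } catch (err) {
    // Compensating action: free the key so a later retry can run.
    store.release(key);
    throw err;
  }
}
```

Releasing the key on failure is only safe when the failure means the provider did not process the request. After an ambiguous failure such as a timeout, it is safer to keep the claim and reconcile with the provider before retrying.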
Pattern 6: Wrapping Third-Party Clients in an Abstraction Layer
Every third-party integration in our codebase is behind an interface. The service class that calls the external API is the only class that imports the provider's SDK or knows the endpoint URL. Every other part of the application goes through that interface.
This is not theoretical architecture — it is practical survivability. When the email provider changes its API format (which they do, often), you change one file. When the SMS provider goes down and you need to switch to a backup, you implement a second class behind the same interface and toggle it with a config flag:
import { Injectable } from '@nestjs/common';
import Stripe from 'stripe';

// ChargeResponse and RefundResponse are application-level types (not shown),
// so the rest of the codebase never depends on a provider's own types.
export interface PaymentGateway {
  createCharge(amount: number, currency: string): Promise<ChargeResponse>;
  getCharge(id: string): Promise<ChargeResponse>;
  refundCharge(id: string): Promise<RefundResponse>;
}

@Injectable()
export class StripeGateway implements PaymentGateway {
  constructor(private readonly stripe: Stripe) {}

  async createCharge(amount: number, currency: string): Promise<ChargeResponse> {
    const charge = await this.stripe.charges.create({ amount, currency });
    return { id: charge.id, status: charge.status, amount: charge.amount };
  }

  async getCharge(id: string): Promise<ChargeResponse> {
    const charge = await this.stripe.charges.retrieve(id);
    return { id: charge.id, status: charge.status, amount: charge.amount };
  }

  async refundCharge(id: string): Promise<RefundResponse> {
    const refund = await this.stripe.refunds.create({ charge: id });
    return { id: refund.id, status: refund.status };
  }
}

Apply the reliability patterns at this abstraction layer. The timeout, retry, and circuit breaker wrap the interface, not the implementation. That way every provider gets the same resilience treatment automatically:
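import { Inject, Injectable } from '@nestjs/common';

// Interfaces do not exist at runtime, so NestJS needs an explicit token to
// inject one. PAYMENT_GATEWAY is an illustrative name; bind it to
// StripeGateway (or a backup provider) in the module's providers array.
export const PAYMENT_GATEWAY = Symbol('PAYMENT_GATEWAY');

@Injectable()
export class PaymentService {
  constructor(
    @Inject(PAYMENT_GATEWAY) private readonly gateway: PaymentGateway,
    private readonly retryService: RetryService,
    private readonly circuitBreaker: CircuitBreakerService,
  ) {
    // Retry runs inside the breaker, so one exhausted retry cycle counts as
    // one failure against the breaker's threshold.
    this.circuitBreaker.createBreaker(
      (amount: number, currency: string) =>
        this.retryService.withRetry(() =>
          this.gateway.createCharge(amount, currency),
        ),
      { timeout: 10000, errorThresholdPercentage: 50, resetTimeout: 30000, name: 'payments' },
    );
  }

  async charge(amount: number, currency: string) {
    return this.circuitBreaker.fire('payments', amount, currency);
  }
}

A single interface, a single retry wrapper, a single circuit breaker. Swap the provider underneath without touching the resilience layer.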
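The config-flag toggle between a primary and a backup provider can be sketched as a small factory. Everything here is hypothetical: the `SmsGateway` interface, the two provider classes, and the flag values are placeholders for whatever your integration defines:

```typescript
interface SmsGateway {
  readonly name: string;
  send(to: string, body: string): Promise<void>;
}

// Two hypothetical providers behind the same interface. Real classes would
// call the provider SDK; these only record what they were asked to send.
class PrimarySmsGateway implements SmsGateway {
  readonly name = 'primary';
  readonly sent: string[] = [];
  async send(to: string, body: string): Promise<void> {
    this.sent.push(`${to}: ${body}`);
  }
}

class BackupSmsGateway implements SmsGateway {
  readonly name = 'backup';
  readonly sent: string[] = [];
  async send(to: string, body: string): Promise<void> {
    this.sent.push(`${to}: ${body}`);
  }
}

// Picks the implementation from a config value, defaulting to the primary.
function createSmsGateway(provider: string | undefined): SmsGateway {
  switch (provider) {
    case 'backup':
      return new BackupSmsGateway();
    case 'primary':
    case undefined:
      return new PrimarySmsGateway();
    default:
      // Fail loudly on a typo instead of silently using the wrong provider.
      throw new Error(`Unknown SMS provider: ${provider}`);
  }
}
```

In a NestJS module, a factory like this would typically sit in a custom provider's `useFactory`, reading the flag from configuration, so that switching providers is a config change and a restart rather than a code change.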
Pattern 7: Monitoring Third-Party API Health
You cannot fix what you cannot see. Every external API integration needs health tracking — not just "is it up?" but "is it slow, is it erroring, is it approaching rate limits?"
Track these metrics per integration:
- Request latency (p50, p95, p99). A provider whose p95 jumps from 200ms to 3s is failing, even if they are returning 200 OK.
- Error rate. Count 4xx and 5xx responses separately. A spike in 429s means you are hitting rate limits. A spike in 503s means the provider is having issues.
- Circuit breaker state. Log every open, half-open, and close event. An opening breaker is an early warning.
- Fallback invocation rate. How often are you serving stale data? If fallbacks fire more than 1% of requests, the integration needs attention.
A simple approach is a NestJS interceptor that wraps external API calls with metric collection:
import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common';
// InjectMetric is assumed to come from @willsoto/nestjs-prometheus; adjust the
// import if you use a different Prometheus module.
import { InjectMetric } from '@willsoto/nestjs-prometheus';
import { Counter, Histogram } from 'prom-client';
import { Observable, tap } from 'rxjs';

@Injectable()
export class ApiMetricsInterceptor implements NestInterceptor {
  constructor(
    @InjectMetric('api_request_duration_seconds')
    private readonly durationHistogram: Histogram<string>,
    @InjectMetric('api_requests_total')
    private readonly requestCounter: Counter<string>,
  ) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const start = Date.now();
    const apiName = context.getHandler().name;

    return next.handle().pipe(
      tap({
        // Record duration and outcome for successes and failures alike.
        next: () => {
          this.durationHistogram.observe({ api: apiName }, (Date.now() - start) / 1000);
          this.requestCounter.inc({ api: apiName, status: 'success' });
        },
        error: () => {
          this.durationHistogram.observe({ api: apiName }, (Date.now() - start) / 1000);
          this.requestCounter.inc({ api: apiName, status: 'error' });
        },
      }),
    );
  }
}

Export these metrics to Prometheus (or your monitoring platform of choice) and set alerts. A p99 latency above your timeout threshold means requests are timing out. An error rate above 5% means the circuit breaker should have tripped, and if it did not, your threshold is set too high.
Monitor from your application, not from an external uptime checker. Uptime checkers tell you the service is reachable. Application metrics tell you the service is actually working — which is a very different thing.
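To make the alert thresholds from this section concrete, here is a minimal sketch that evaluates them over a batch of recorded calls. In production the monitoring platform would compute these from the exported metrics. The `CallSample` shape and both function names are assumptions made for this example:

```typescript
interface CallSample {
  durationMs: number;
  ok: boolean;
  usedFallback: boolean;
}

// Nearest-rank percentile over an unsorted list of numbers.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Applies the three thresholds discussed above to one integration's samples.
function healthAlerts(samples: CallSample[], timeoutMs: number): string[] {
  const alerts: string[] = [];
  if (samples.length === 0) return alerts;

  const p99 = percentile(samples.map((s) => s.durationMs), 99);
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  const fallbackRate = samples.filter((s) => s.usedFallback).length / samples.length;

  if (p99 >= timeoutMs) alerts.push('p99 latency at or above timeout');
  if (errorRate > 0.05) alerts.push('error rate above 5%');
  if (fallbackRate > 0.01) alerts.push('fallback rate above 1%');
  return alerts;
}
```

The point of the sketch is the shape of the rules, not the arithmetic: each threshold maps to one of the metrics listed earlier, evaluated per integration.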
Testing Integrations With Mock Servers
The final step is proving the first seven patterns work. You cannot test circuit breaker tripping, retry backoff, or fallback serving against a real API: the real API is either up (and passes all tests) or down (and fails all tests). Neither scenario proves the resilience logic works.
Use a mock HTTP server that simulates specific failure modes:
// test setup
import { HttpModule } from '@nestjs/axios';
import { Test } from '@nestjs/testing';
import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';
// Illustrative paths: point these at wherever the services live in your project.
import { ExternalApiService } from './external-api.service';
import { RetryService } from './retry.service';

// Short backoff so the suite stays fast; jitter still adds up to 1s per retry.
const fastRetry = { maxRetries: 3, baseDelayMs: 10, maxDelayMs: 50 };

let calls = 0;

const handlers = [
  // Fail the first two calls with 503, then succeed.
  http.get('https://api.example.com/data', () => {
    calls += 1;
    if (calls <= 2) {
      return HttpResponse.json({ error: 'Service Unavailable' }, { status: 503 });
    }
    return HttpResponse.json({ data: 'success' });
  }),
];

const server = setupServer(...handlers);

async function createServices() {
  const module = await Test.createTestingModule({
    imports: [HttpModule],
    providers: [ExternalApiService, RetryService],
  }).compile();

  return {
    api: module.get(ExternalApiService),
    retry: module.get(RetryService),
  };
}

describe('ExternalApiService', () => {
  beforeAll(() => server.listen());
  beforeEach(() => {
    calls = 0;
  });
  afterEach(() => server.resetHandlers());
  afterAll(() => server.close());

  it('retries on 503 and succeeds on third attempt', async () => {
    const { api, retry } = await createServices();

    // fetchData does not retry on its own, so wrap it in the retry service.
    const result = await retry.withRetry(
      () => api.fetchData('https://api.example.com/data'),
      fastRetry,
    );

    // result is the Axios response; its data property holds the JSON body.
    expect(result.data).toEqual({ data: 'success' });
    expect(calls).toBe(3);
  }, 15000);

  it('throws after exhausting retries on persistent 503s', async () => {
    server.use(
      http.get('https://api.example.com/data', () =>
        HttpResponse.json({ error: 'Service Unavailable' }, { status: 503 }),
      ),
    );

    const { api, retry } = await createServices();
    await expect(
      retry.withRetry(() => api.fetchData('https://api.example.com/data'), fastRetry),
    ).rejects.toThrow();
  }, 15000);
});

Write a test for each pattern: timeout triggers a timeout error, retry succeeds on the third attempt, circuit breaker opens after the threshold, fallback returns cached data when the breaker is open, idempotency returns the cached response on duplicate keys. Each test simulates the exact failure mode the pattern is designed to handle.
Composing the Third-Party API Integration Patterns
Here is how the patterns compose in a single third-party API integration:
Request flow:
- Call the external API through the abstraction layer (Pattern 6)
- Apply a timeout of 5 seconds (Pattern 1)
- On failure, retry up to 3 times with exponential backoff (Pattern 2)
- If retries fail, the circuit breaker tracks the failure count (Pattern 3)
- Once the threshold is exceeded, the breaker opens and a fallback runs — cached data, queued request, or degraded response (Pattern 4)
- For write operations, apply idempotency before retrying (Pattern 5)
- All failures and fallbacks are tracked by the monitoring layer (Pattern 7)
- The entire flow is tested with a mock server (bonus Pattern 8, if we are counting)
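The request flow above can be sketched as one function. This is a simplified illustration, not the code from the earlier sections: it leaves out idempotency and the non-retryable status check, and the `Breaker` interface, `ResilienceOptions`, and `resilientCall` are names invented for this example:

```typescript
interface Breaker {
  allowRequest(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

interface ResilienceOptions<T> {
  call: () => Promise<T>; // the abstraction-layer method (Pattern 6)
  fallback: () => T; // cached or degraded response (Pattern 4)
  breaker: Breaker; // Pattern 3
  timeoutMs: number; // Pattern 1
  maxRetries: number; // Pattern 2
  baseDelayMs: number;
  onEvent?: (event: string) => void; // hook for metrics (Pattern 7)
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timeout')), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

async function resilientCall<T>(o: ResilienceOptions<T>): Promise<T> {
  // Open breaker: skip the network entirely and serve the fallback.
  if (!o.breaker.allowRequest()) {
    o.onEvent?.('breaker_open');
    return o.fallback();
  }

  for (let attempt = 0; attempt <= o.maxRetries; attempt++) {
    try {
      const result = await withTimeout(o.call(), o.timeoutMs);
      o.breaker.recordSuccess();
      o.onEvent?.('success');
      return result;
    } catch {
      o.onEvent?.('attempt_failed');
      // Exponential backoff between attempts (jitter omitted for brevity).
      if (attempt < o.maxRetries) await sleep(o.baseDelayMs * 2 ** attempt);
    }
  }

  // Retries exhausted: count one failure against the breaker, then fall back.
  o.breaker.recordFailure();
  o.onEvent?.('fallback');
  return o.fallback();
}
```

Any breaker that exposes those three methods fits here. The ordering is the part worth copying: breaker check first, timeout inside each attempt, retries inside the breaker, fallback last.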

Conclusion
Third-party API integration reliability is not about choosing the right library or the right framework. NestJS handles the HTTP layer. Opossum, cockatiel, and the built-in retry utilities handle the individual patterns. The hard part is composing them in the right order with the right configuration for each integration.
We have burned through enough provider outages to know these patterns are not optional. They are the difference between a payment gateway going down and your checkout going down — between an email provider throttling your account and your notification system silently dropping messages. Every integration in our codebase follows this playbook now. It adds maybe a day of engineering per integration up front, and it has saved us weeks of midnight debugging since.
Need to implement these patterns across an entire API surface? We covered API design fundamentals — including how to make endpoints safe to retry — in our REST API design mistakes post. Between that post and this one, you have the patterns and the design decisions that make them work.
The next time a third-party API goes down, your application should not go down with it. That is what these seven patterns deliver. Pick the one you are missing and add it this sprint.
Frequently Asked Questions
What are the most important patterns for reliable third-party API integration?
The seven most important patterns are timeout configuration, retry with exponential backoff, circuit breaker, fallback responses, idempotency for safe retries, abstraction layer wrapping, and health monitoring. Together they handle transient failures, persistent outages, and cascading failures in distributed systems.
How do I implement a circuit breaker in NestJS?
Use the opossum npm package, which provides a three-state circuit breaker (Closed, Open, Half-Open). Wrap each external API call in a breaker that monitors failure rates and trips open after a configurable threshold. opossum is framework-agnostic, so it integrates with NestJS via a wrapper service or interceptor.
What is exponential backoff?
Exponential backoff increases the delay between retries exponentially (1s, 2s, 4s, 8s) rather than retrying at a constant interval. This prevents the thundering herd problem where retrying clients overwhelm a recovering service. Add jitter (random variance) to avoid synchronized retry waves.
Which HTTP status codes should I retry?
Retry on 429 (rate limited), 503 (service unavailable), 504 (gateway timeout), and network-level errors. Do not retry on 400 (bad request), 401 (unauthorized), 403 (forbidden), 404 (not found), or 422 (unprocessable entity) because these errors are client-side and retrying will not change the outcome.
How do I test third-party API integrations?
Use mock servers like MSW (Mock Service Worker) or nock to simulate specific failure scenarios (timeouts, 500 errors, rate limits) without calling the real API. Write integration tests that verify your circuit breaker trips correctly, retries execute with backoff, and fallbacks return appropriate responses.
What makes a good fallback?
A good fallback returns cached data if available, queues the request for later processing, or returns a degraded response that tells the user the feature is temporarily unavailable. Never return a generic 500 error. For example, if a weather API is down, serve the last cached forecast with a banner noting it may be stale.
