Failover URLs for when a third-party API goes dark

A failover URL list in a gateway beats retry loops in the browser. How Bridge failover works, where it fits, and the trade-offs to know first.

SaltingIO Team · April 22, 2026 · 6 min read · DevOps & CI/CD
Tags: failover, api-gateway, resilience, bridges, uptime, stripe, openai, third-party-apis

The first time Stripe threw a 503 at our checkout, we had already deployed the fix by the time the on-call engineer finished typing a Slack message. The fix was not clever. It was a retry loop in the frontend that waited two seconds and hit the same endpoint again. When Stripe came back up, the retries succeeded. When Stripe stayed down, we kept hammering a dead API.

That pattern works right up until the day it does not. I want to walk through a different approach, one where the frontend calls a single endpoint you own, and the decision of what to do when the upstream is down lives in a configuration record instead of a retry loop.

The usual failover attempts

Most teams layer three defenses in front of a third-party API, roughly in this order.

First, retries. The frontend catches a 5xx or a timeout and tries again. This helps with transient blips, but it does nothing when the upstream is deeply broken, and it can turn a minor outage into a self-inflicted load spike.
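In code, the naive version is only a few lines, which is part of why it is so common. A sketch, with an arbitrary delay and attempt count:

// Naive frontend retry: wait, then hit the same endpoint again.
// Fine for a blip, a self-inflicted load spike when the outage is real.
async function fetchWithRetry(url, options, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, options);
      if (res.ok) return res;
    } catch (err) {
      // network error: fall through and retry
    }
    if (i < attempts - 1) await new Promise((r) => setTimeout(r, 2000));
  }
  throw new Error('Upstream still failing after retries');
}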

Second, circuit breakers. After N failures, stop calling the upstream for a cooldown period. Better behavior, but now the frontend carries state about whether a third-party API is healthy, and that state is per-tab, per-device, per-deploy. It will not be right.
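A client-side breaker is not much more code, and a sketch of it also shows where it goes wrong: the failure count and cooldown live in a module-level variable, so every tab keeps its own opinion of upstream health. Thresholds here are arbitrary:

// Per-tab circuit breaker: the "is the upstream healthy" state lives only here.
let failures = 0;
let openUntil = 0;

async function guardedFetch(url, options) {
  if (Date.now() < openUntil) {
    throw new Error('Circuit open, skipping upstream call');
  }
  const res = await fetch(url, options);
  if (res.ok) {
    failures = 0;
    return res;
  }
  failures += 1;
  if (failures >= 5) {
    openUntil = Date.now() + 30_000; // 30 second cooldown, this tab only
    failures = 0;
  }
  return res;
}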

Third, a swap-in alternative. Stripe goes down, call a backup payment processor. This is almost never implemented because the two APIs rarely match shape, and writing adapter code for a disaster you hope never happens is easy to skip.

The useful question is: where does the failover decision belong? Not the browser. Not the user's device. Somewhere in the request path that has a global view of upstream health and a single place to change behavior.

What Bridge failover looks like

A SaltingIO Bridge forwards requests from https://api.salting.io/r/{uuid} to an upstream URL. You can attach an ordered list of up to 20 failover URLs to the same Bridge. If the primary fails, the gateway tries each failover in order and returns the first successful response.
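Conceptually, the gateway side amounts to an ordered loop like the sketch below. This is an illustration of the behavior, not SaltingIO's actual implementation, and the per-attempt timeout is made up:

// Conceptual sketch of ordered failover at the gateway, not the real implementation.
async function forward(method, headers, body, primaryUrl, failoverUrls) {
  for (const url of [primaryUrl, ...failoverUrls]) {
    try {
      const res = await fetch(url, {
        method,
        headers,
        body,
        signal: AbortSignal.timeout(5000), // hypothetical per-attempt timeout
      });
      if (res.ok) return res; // first success wins
    } catch (err) {
      // timeout or network failure: move on to the next URL in the list
    }
  }
  // every URL failed; surface an error to the caller
  return new Response(JSON.stringify({ error: 'Upstream unavailable' }), { status: 503 });
}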

The frontend never changes. It hits one endpoint:

// One endpoint in the frontend, whichever upstream ends up serving the request
const response = await fetch(`https://api.salting.io/r/${BRIDGE_UUID}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ amount: 1999, currency: 'usd' })
});

Whether the request went to the primary or a failover is invisible from the browser. Same status code shape, same response body shape (with a caveat I will come back to).

Setting it up for a real service

Here is a concrete scenario. You have an OpenAI Bridge that proxies chat completions. You want to fall back to Anthropic's Messages API when OpenAI starts returning 429s or 5xxs faster than your quota recovers.

In the dashboard, your Bridge primary is:

https://api.openai.com/v1/chat/completions

Headers include the Authorization bearer token and Content-Type: application/json. The method allowlist has POST.
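Written out as a single record, the configuration amounts to something like this, including the failover introduced next. The shape is purely illustrative, not the actual dashboard or API schema, and OPENAI_API_KEY is a placeholder:

// Illustrative sketch of the Bridge record, not the real dashboard schema.
const openaiBridge = {
  primary: 'https://api.openai.com/v1/chat/completions',
  headers: {
    Authorization: `Bearer ${OPENAI_API_KEY}`, // placeholder secret
    'Content-Type': 'application/json',
  },
  methods: ['POST'],
  failovers: [
    // the Anthropic passthrough; how it authenticates is configured separately
    'https://api.anthropic.com/v1/messages',
  ],
};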

Failover URL 1 is an Anthropic passthrough. But the shapes do not match. The request body layouts are similar, but the response shapes differ enough that your frontend code will break if you blindly hand it an Anthropic payload where it expected an OpenAI one.

This is where ?select= earns its keep. On the primary Bridge, shape the response down to what the frontend actually needs:

POST https://api.salting.io/r/{uuid}?select={text:choices.0.message.content}

Then apply the same shape to the failover, pulling from Anthropic's response layout (content.0.text) instead. Both providers then return { "text": "..." } to the browser, and the code consuming it never knows the provider switched.
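With both mappings in place, the browser-side contract collapses to one field. BRIDGE_UUID, the request body fields, and the function name below are placeholders:

// Whichever provider answered, the gateway hands back { "text": "..." }.
async function getCompletion(prompt) {
  const response = await fetch(`https://api.salting.io/r/${BRIDGE_UUID}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] })
  });
  const { text } = await response.json();
  return text;
}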

The part most failover strategies skip

A failover that returns a broken response shape is worse than no failover. Users see a cryptic error, your monitoring sees 200s, and you spend the first hour of the incident chasing a ghost. The ?select= transformation is how you enforce a contract at the gateway boundary instead of hoping every upstream agrees.

There is one more piece: the failover message. A Bridge lets you configure a custom failover message that fires when every URL in the chain has failed. Something like:

{ "error": "Upstream unavailable", "retry_in_seconds": 30 }

Put it in the shape your frontend already knows how to render. Do not leak whatever raw 503 HTML the primary sent.
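Because the failure payload is in a shape the frontend already parses, handling the everything-failed case is one extra branch. showRetryBanner is a hypothetical UI helper:

// One parse path for both outcomes: the happy payload and the configured failover message.
async function handleBridgeResponse(response) {
  const data = await response.json();
  if (response.ok) {
    return data; // e.g. { text: "..." }
  }
  // Every URL in the chain failed; the gateway sent the configured message instead.
  showRetryBanner(data.retry_in_seconds);
  return null;
}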

Where failover does not help

A Bridge failover is a good fit for idempotent reads and for operations where the second provider can substitute for the first. It is a poor fit for a few situations worth calling out.

Writes with side effects are the obvious one. Charging a card twice, because the primary responded slowly enough to trigger the failover but still processed the charge, and the failover then succeeded as well, is a worse outcome than a failed checkout. Keep failover off for any POST where you cannot prove idempotency at the upstream level.
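Within a single provider, idempotency keys are the usual way to prove that; Stripe, for example, accepts an Idempotency-Key header so a repeated delivery of the same charge request cannot create a second charge. Note what this does and does not buy you: it protects retries against the same upstream, assuming the gateway forwards client headers, but it does nothing across two different processors, which is exactly the failover case. A sketch, with CHECKOUT_BRIDGE_UUID as a placeholder:

// Idempotency protects retries against the *same* upstream; it does not make a
// cross-provider failover safe. Assumes the gateway forwards client headers upstream.
const idempotencyKey = crypto.randomUUID(); // generated once per checkout attempt
const response = await fetch(`https://api.salting.io/r/${CHECKOUT_BRIDGE_UUID}`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Idempotency-Key': idempotencyKey,
  },
  body: JSON.stringify({ amount: 1999, currency: 'usd' })
});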

Latency-sensitive hot paths are another. The failover attempt costs at least the primary's timeout window plus the failover's own response time. If your p95 budget is 200ms, burning three seconds on a primary timeout before trying the failover will shred your SLO even when the request eventually succeeds.

Binary payloads and streaming responses need more care. The ?select= transformation is JSON-shaped. If you are proxying file downloads or server-sent events, failover is still useful, but the shape-normalization story goes away and you are back to needing the upstreams themselves to agree on an output format.

A note on configuration as code

The one thing I would change about most gateway setups is that the failover list lives only in a dashboard. Record which failovers exist in your repo alongside the code that calls them, even if the record is just a comment above the fetch. When a failover fires in production, the first question you will ask is "what is the order, and why did we pick these particular backups?" Having that noted in version control, next to the frontend code, turns a 2am mystery into a ten-second grep.
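The lightest version of that record is exactly the comment block described above, sitting next to the call site. Contents here are illustrative:

// Bridge failover chain (mirrors the dashboard config; kept here so it greps):
//   1. https://api.openai.com/v1/chat/completions  (primary)
//   2. https://api.anthropic.com/v1/messages       (fallback on 429/5xx; select maps content.0.text -> text)
// On total failure the gateway returns { error, retry_in_seconds }.
const response = await fetch(`https://api.salting.io/r/${BRIDGE_UUID}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages: [{ role: 'user', content: 'Hello' }] })
});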

Bridge failover is a small feature, but it moves a decision to the right layer. The browser should not be making strategic calls about whether a third-party API is alive. Your gateway should, and the rest of the system should not have to know.

If you want to see the failover list and the select paths wired up together, read the docs.