A 502 Bad Gateway means a reverse proxy or load balancer tried to forward your request to the actual application, didn't get a usable answer back, and gave up. The proxy is fine. The thing behind the proxy is the problem. That's the whole story — but the half-dozen ways "didn't get a usable answer" can happen are what make 502s feel mysterious.
## The literal definition
From RFC 9110: a 502 is returned when "the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed." The page you're trying to load goes through more than one hop on its way back from the origin — your browser hits a CDN, the CDN hits a load balancer, the load balancer hits your application. A 502 is one of those middle hops saying "I asked the next link in the chain and what came back wasn't usable." It's a hop-level failure announcement.
What "not usable" means varies by proxy:
- The upstream returned no response at all (TCP RST, connection closed mid-stream).
- The upstream took longer than the proxy's read timeout (some proxies still call this 502; most return 504).
- The upstream returned malformed HTTP — bad framing, truncated headers, a binary payload claimed as text.
- The upstream wasn't reachable at all (DNS failure on the proxy side, no route to host).
- TLS handshake to the upstream failed (expired cert, hostname mismatch, mutual TLS misconfig).
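Whatever the flavor, the response that reaches you usually names the hop that generated it. A quick way to look, with curl — example.com stands in for the affected site, and the header values below are common cases rather than an exhaustive list:

```bash
# -s silences progress output, -i includes the response headers with the body.
curl -si https://example.com/ | head -20

# The server (and sometimes via) header names the layer that built the response:
#   server: cloudflare     -> the CDN generated the error page
#   server: awselb/2.0     -> an AWS load balancer answered; the app behind it did not
#   server: nginx          -> your own proxy answered; its upstream did not
```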
## 502 vs 503 vs 504
These three get confused constantly, mostly because every one of them gets summarized as "the site is down." They mean different things:
| Code | What it means | Who's responsible |
|---|---|---|
| 502 Bad Gateway | Proxy tried to talk to upstream, upstream sent back nothing or garbage. | Upstream (the application). |
| 503 Service Unavailable | Server is up and answering, but is intentionally refusing to handle the request right now (rate-limited, overloaded, in maintenance, deliberately failing health checks). | The application itself, often deliberately. |
| 504 Gateway Timeout | Proxy waited for upstream, upstream never replied within the timeout. | Upstream — but the failure mode is "too slow," not "broken." |
Quick mental shortcut: 502 is "I asked, got nothing." 503 is "I asked, was told no." 504 is "I asked, gave up waiting."
## What actually causes 502s in production
### 1. The upstream application crashed
Most common cause. The Node/Python/Ruby/Go process behind the load balancer panicked and the OS reaped it. The proxy's connection drops, and the next request gets a 502 because there's nothing to connect to. Auto-restart usually brings it back within seconds. If you see steady 502s for more than a minute, the process is in a crash loop — coming up, immediately failing, getting reaped, repeating.
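A quick way to confirm a crash loop, sketched for either a Kubernetes deployment labeled app=myapp or a systemd unit called myapp (the names and the pod placeholder are stand-ins for whatever you actually run):

```bash
# Kubernetes: a climbing RESTARTS column or a CrashLoopBackOff status is the giveaway.
kubectl get pods -l app=myapp
kubectl logs <crashing-pod> --previous    # logs from the container that just died

# systemd: check whether the unit keeps restarting and what it printed on the way down.
systemctl status myapp
journalctl -u myapp --since "-15min" | tail -50
```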
### 2. Health checks are failing
Most load balancers (AWS ALB, GCP's HTTPS load balancer, nginx upstream blocks) only forward traffic to upstream instances that are passing a periodic health check. If every backend in the pool is failing the check, the proxy has nowhere to send traffic — every request becomes a 502. The application itself may technically be running; it's just failing the health endpoint, often because the endpoint queries a database that's slow.
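It helps to hit the health endpoint the same way the load balancer does and time it. A minimal sketch, assuming the backend listens on 10.0.1.12:8080 with a /healthz path (both are placeholders):

```bash
# Run from a host inside the same network as the load balancer.
curl -s -o /dev/null -w "status=%{http_code} time=%{time_total}s\n" \
  http://10.0.1.12:8080/healthz

# A 200 that takes 4.9 seconds still fails a check with a 5-second timeout;
# a slow health endpoint (usually a slow database query behind it) fails checks
# just as surely as a hard error does.
```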
### 3. A deploy gone wrong
The new version of the application crashes on startup (missing env var, bad config, dependency mismatch) and the deploy strategy doesn't catch it before pulling old instances out of rotation. Rolling deploys can hit a window where every instance is the broken new version, and the load balancer has nothing healthy left. 502s spike, alerts fire, the team rolls back. This is the failure mode that makes engineers love deployment systems with health-gating built in (you don't pull an old instance until the new one passes its health check).
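If the 502s line up with a release, rolling back first is almost always the right move. A sketch of what that looks like on Kubernetes — the deployment name is a placeholder, and other deploy systems have equivalents:

```bash
kubectl rollout history deployment/myapp    # did a new revision land right before the 502s?
kubectl rollout status deployment/myapp     # is the rollout stuck waiting on readiness?
kubectl rollout undo deployment/myapp       # put the previous revision back, debug later
```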
### 4. The proxy can't reach the upstream's network
VPC peering broken, security group changed, route table entry deleted, NAT gateway saturated. The proxy's TCP SYN never reaches the upstream, the connection fails, you see 502. This is the cascade-style 502 — it affects every request equally, has nothing to do with the application code, and tends to arrive alongside other outages at the same time, because whatever broke routing usually broke other things too.
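From the proxy host, two commands separate "the network is broken" from "the application is broken." The upstream address and port here are placeholders:

```bash
# Can we open a TCP connection at all? A hang or timeout here points at routing,
# security groups, or firewall rules, not application code.
nc -zv -w 3 10.0.2.15 8080

# If TCP connects, does the app actually speak HTTP?
curl -sv --connect-timeout 3 --max-time 10 -o /dev/null http://10.0.2.15:8080/
```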
### 5. The upstream's SSL certificate expired
If the proxy talks to the upstream over HTTPS (common for service-mesh / mTLS setups), an expired upstream cert breaks the TLS handshake and the proxy returns 502. This shape of 502 happens predictably on a particular date, hits 100% of traffic at once, and gets fixed by renewing the cert. /ssl is the tool for this on the public-facing edge — for an internal cert it's the same check, just run from inside the private network.
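Checking the cert the upstream actually serves is one openssl pipeline; upstream.internal is a placeholder for the backend hostname the proxy dials:

```bash
# Print the validity window of the certificate the upstream presents.
openssl s_client -connect upstream.internal:443 -servername upstream.internal \
  </dev/null 2>/dev/null | openssl x509 -noout -dates

# Exit non-zero if it expires within 14 days; handy as a cron'd early warning.
openssl s_client -connect upstream.internal:443 -servername upstream.internal \
  </dev/null 2>/dev/null | openssl x509 -noout -checkend $((14 * 24 * 3600))
```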
### 6. CDN can't reach origin
Cloudflare returns its own branded 502 (the orange-cloud "Bad Gateway" page) when it can't reach the origin server. The causes are the same five above, plus a new one: the origin is unreachable specifically from Cloudflare, which can happen if origin firewall rules dropped Cloudflare's IP ranges after a config change. Cloudflare also has more specific error codes of its own (1014, 1016, 522, 523) for particular failure modes — see our Cloudflare error code reference for the full taxonomy.
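Two quick checks when the CDN is the one reporting the 502, assuming you know your origin's address (203.0.113.10 is a documentation placeholder):

```bash
# 1. Does the origin answer when you bypass the CDN entirely?
curl -sv -o /dev/null --resolve example.com:443:203.0.113.10 https://example.com/

# 2. If the origin only allows the CDN's IPs, confirm the firewall still matches
#    Cloudflare's published ranges after the config change.
curl -s https://www.cloudflare.com/ips-v4
```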
## Diagnosing a 502 as a user
If you're hitting a 502 visiting a site, the practical answer is: it's not your problem. 502s aren't caused by anything on your end. There's no DNS flush, browser reset, or VPN trick that changes them. Three useful actions:
- Wait a minute and refresh. Most 502s clear themselves within 30-90 seconds as auto-restart or auto-scaling brings capacity back online.
- Check from another region. If the problem is regional rather than global, a multi-region check tells you fast — try the page on isitdown.io for a 4-region read.
- Stop retrying aggressively. Hammering refresh during a 502 storm makes the recovery slower for everyone — when the upstream comes back up, the queue of pent-up retries is what often takes it down again. This is the "thundering herd" problem; the polite move is to let it recover.
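If a script of yours is the thing hitting the 502, the polite version of "wait and refresh" is a bounded retry with backoff rather than a tight loop. A minimal sketch, with the URL as a placeholder:

```bash
# curl treats 502/503/504 as transient when --retry is set and backs off
# between attempts on its own.
curl --retry 5 -fsS -o /dev/null https://example.com/

# Or spell the backoff out yourself: 2, 4, 8, 16, 32 seconds between attempts.
for i in 1 2 3 4 5; do
  curl -fsS -o /dev/null https://example.com/ && break
  sleep $((2 ** i))
done
```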
## Diagnosing a 502 as an operator
If you're the one running the site, the diagnostic flow is the same six causes in checklist form:
- Are the upstream processes alive? `kubectl get pods`, `docker ps`, `systemctl status`, however you supervise them. Crash loops show up as restart counts climbing every few seconds.
- Are health checks passing? Hit the health endpoint directly from inside the network. If it returns 200 from inside but the load balancer says it's failing, look at the load balancer's source IPs — they may be blocked at the firewall / security group layer.
- Did anyone deploy recently? Check your deploy history. 502s starting within five minutes of a release point at the release. Roll back first, debug second.
- Can the proxy actually reach the upstream? SSH onto the proxy host (or run a debug pod inside its network) and `curl` the upstream directly. If `curl` hangs, you have a network problem; if it returns the expected response, the proxy itself is misconfigured (see the log sketch after this list).
- Are upstream certs valid? `openssl s_client -connect upstream:443` shows expiry; if it's past, you found it.
- If you front through a CDN: check the CDN's status page first. A 502 from Cloudflare or Fastly that hits all of your traffic at once is almost always the CDN, not your origin.
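When `curl` from the proxy host returns the expected response but clients still get 502s, the proxy's own error log usually names the exact failure. A sketch for nginx — the log path is the common default, so adjust for your setup:

```bash
# Classic nginx 502 signatures and what they mean:
#   "connect() failed (111: Connection refused)"   -> nothing is listening on the upstream port
#   "upstream prematurely closed connection"       -> the app died mid-request
#   "no live upstreams"                            -> every backend is failing its checks
grep -E "upstream|connect\(\) failed" /var/log/nginx/error.log | tail -20
```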
## Why 502 storms happen during big incidents
The 2021 Fastly outage was a 502 storm: a single VCL push triggered a latent bug that took out 85% of Fastly's edge network. Every Fastly-fronted site (NYT, Reddit, GOV.UK, Stack Overflow, Spotify, Twitch) returned a 502 for about an hour. The origins were all fine — the proxy layer between origins and users was the failure. That's the canonical "502 means the proxy can't reach upstream" pattern at internet scale.
The 2017 AWS S3 outage produced a different kind of 502 cascade: services that didn't directly use S3 still returned 502s because their CI/CD pipelines, dependency mirrors, or container registries did. When you have a transitive dependency you're not aware of, a 502 from your own site is the surface symptom of an upstream that you didn't know you had. The AWS cascade field guide walks through how this kind of dependency entanglement plays out.
## FAQ
### Is 502 always a server problem?
Yes. There's no client-side cause for a 502. It's a proxy reporting an upstream failure. Anything you do on your machine — DNS flush, browser reset, switching networks — won't change the response, because none of those actions touch the part of the system that's broken.
### Can a 502 persist for hours?
It can, but it's unusual. The common cases (process crash, deploy gone wrong) self-recover within seconds to minutes because supervision systems are designed for it. A multi-hour 502 usually means the issue is at a layer the team didn't expect — a network policy that needs human approval to change, an expired cert nobody owns, or a downstream dependency the application can't restart without.
### How do I tell if a 502 is Cloudflare-specific?
Cloudflare-fronted 502s render Cloudflare's branded error page with the orange-cloud logo and a Ray ID at the bottom. If you see that, the origin server may be fine — Cloudflare just couldn't reach it. Inspecting the response headers (`Server: cloudflare`) confirms it. /headers on isitdown.io will show you the actual response headers if you can't see them in your browser.
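A one-liner for the same check from a terminal, with example.com standing in for the affected site:

```bash
# -I asks for headers only; a "server: cloudflare" line plus a cf-ray header
# means the response was generated at Cloudflare's edge.
curl -sI https://example.com/ | grep -iE "^(server|cf-ray):"
```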
### Are 502s a one-off thing, or do they recur?
Both happen. A one-off 502 from a single crashed instance clears within seconds once the instance restarts, and you never see it again. But if your application has memory leaks, ungated dependencies, or unbounded request handlers, you'll see 502s recur on a schedule — every few hours when memory exhausts, every few days when an upstream's health-check window flips. Recurring 502s point to operational debt; one-off 502s usually don't.