A 504 Gateway Timeout means a reverse proxy or load balancer asked the upstream application to handle a request, waited the full configured timeout, and never got an answer. The upstream isn't down — it's just too slow for the proxy's patience. That difference matters because the fix is different from the fix for a 502.
The literal definition ¶
RFC 9110 defines 504 as: "the server, while acting as a gateway or proxy, did not receive a timely response from an upstream server." The word that does the work is timely. The proxy has a configured deadline — typically tens of seconds — for the upstream to start answering, and the upstream missed it. The TCP connection between proxy and upstream may still be open; nothing about it is broken; the response just hasn't come.
Why this matters: a 504 is fundamentally a latency problem, not a reachability problem. You can't fix it by restarting anything. You fix it by making the slow request faster, or by giving the proxy more patience, or by not making the request synchronous in the first place.
504 vs the family of look-alikes ¶
| Code | What it means | Fix shape |
|---|---|---|
| 504 Gateway Timeout | Proxy timed out waiting for upstream. | Speed up the upstream, or move the work async. |
| 502 Bad Gateway | Upstream gave an empty or malformed response. | Restart / debug the upstream process. |
| 503 Service Unavailable | Upstream is alive but explicitly refusing the request. | Lift the rate limit, add capacity, or clear the maintenance flag. |
| 408 Request Timeout | Server timed out waiting for the client to finish sending the request. | Client-side problem — flaky network, large upload, slow JS. |
| Cloudflare 524 | Cloudflare-specific: origin held the connection open but didn't send an HTTP response within 100 seconds. | Same fix as 504 — the upstream was too slow. Cloudflare separates this from 504 because they want to point the finger at the origin, not their network. |
Quick mental shortcut: 504 is "I waited and gave up." 502 is "I asked, got nothing." 503 is "I asked, was told no." 408 is "you took too long to ask in the first place." 524 is Cloudflare's specific flavor of 504.
Common timeout values ¶
Knowing the default timeouts at each proxy layer helps you debug a 504 — the layer that times out first is what stamps the response. Approximate defaults you'll find in the wild:
- nginx: `proxy_read_timeout` defaults to 60 seconds. Most production configs raise it for slow endpoints.
- AWS ALB: idle timeout defaults to 60 seconds. NLB can be configured higher.
- AWS API Gateway: hard 30-second integration timeout for REST APIs. Can't be raised. This is why long-running operations from API Gateway return 504 like clockwork at exactly 30s.
- GCP HTTPS Load Balancer: backend service timeout defaults to 30 seconds.
- Cloudflare (Free / Pro / Business): 100-second origin response timeout, surfaces as 524 not 504. Enterprise plans can extend to 6000 seconds.
- HAProxy: `timeout server` has no baked-in default (it's whatever the config sets); most operators use 30-60 seconds.
If your 504s start hitting at exactly the 30-second mark, it's API Gateway or GCP. At 60s, nginx or ALB. At 100s, Cloudflare 524. The clock tells you which layer's timeout fired.
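You can capture this fingerprint from the outside by timing the failure itself. A minimal sketch using Python's `requests`; the URL is a placeholder and the generous client-side timeout is arbitrary, chosen only so the proxy gives up before we do:

```python
import time

import requests

# Hypothetical endpoint that has been returning 504s.
URL = "https://example.com/api/reports"

start = time.monotonic()
try:
    # Give our own client more patience than any proxy layer,
    # so the 504 (or 524) arrives before we give up locally.
    response = requests.get(URL, timeout=130)
    elapsed = time.monotonic() - start
    print(f"{response.status_code} after {elapsed:.1f}s")
except requests.exceptions.ReadTimeout:
    elapsed = time.monotonic() - start
    print(f"client-side timeout after {elapsed:.1f}s")

# ~30s  -> API Gateway or GCP backend-service timeout
# ~60s  -> nginx proxy_read_timeout or ALB idle timeout
# ~100s -> Cloudflare 524
```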
What actually causes 504s in production ¶
1. A slow database query
The single most common cause. The application is fine, the proxy is fine, but the request makes a database call that takes 90 seconds because the query is doing a sequential scan on a table that has quietly grown past the size where scanning it was cheap. The proxy's read timeout fires, returns 504, and the user never sees the (eventually-completed) query result. Classic pattern: the page works fine for years, then starts 504-ing once or twice a day, then constantly. Running EXPLAIN ANALYZE on the query and adding an index fixes most of these.
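If you suspect a specific query, EXPLAIN ANALYZE shows whether the planner is falling back to a sequential scan. A minimal Postgres sketch using `psycopg2`; the connection string, table, and query are hypothetical stand-ins for the real slow query pulled from your logs (note that EXPLAIN ANALYZE actually runs the query):

```python
import psycopg2

# Hypothetical connection string and query -- substitute the real slow query
# found in pg_stat_statements or the application's slow-request logs.
conn = psycopg2.connect("dbname=app user=app host=localhost")
slow_query = "SELECT * FROM orders WHERE customer_email = %s"

with conn.cursor() as cur:
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + slow_query, ("user@example.com",))
    for (line,) in cur.fetchall():
        print(line)

conn.close()
# A plan line like "Seq Scan on orders (actual time=... rows=12000000)" means
# the planner never touched an index; an index on customer_email is the fix.
```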
2. The upstream is CPU-saturated
Every request is fast in isolation, but with 100 simultaneous requests competing for 4 CPUs, individual request latency stretches to seconds and the proxy's timeout fires. The fix isn't faster code — it's more capacity, or per-request rate limiting, or moving CPU-heavy work off the request path.
3. Synchronous calls to slow third-party services
Your application calls Stripe, a geocoding service, an LLM provider, or the user's own webhook URL synchronously inside a request handler. The third party is having a slow day, the call takes 45 seconds, your proxy times out at 30 — even though your application would have returned the right answer eventually. The third party's slowness becomes your 504. The architectural fix is to make external calls async (queue + worker, not in-line in the request).
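A defensive measure to apply while you move the call off the request path: give the outbound call a timeout well below the proxy's, so a slow third party produces a controlled error instead of a 504. A sketch with `requests`; the function, URL, and payload are illustrative, not a prescribed API:

```python
import requests

def notify_webhook(url: str, payload: dict) -> bool:
    """Call a customer-supplied webhook without letting it drag the
    whole request past the proxy's 30-second deadline."""
    try:
        # (connect timeout, read timeout) -- both far below the proxy's limit.
        resp = requests.post(url, json=payload, timeout=(3, 10))
        return resp.ok
    except requests.exceptions.RequestException:
        # Slow or unreachable third party: degrade gracefully and retry later
        # from a background job instead of blocking this request.
        return False
```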
4. A database deadlock or lock contention
Two requests both want to update overlapping rows. Postgres / MySQL detect the deadlock and resolve one by killing it, but if the resolution itself takes long enough, the proxy gives up first. Heavy writes during a spike often look like 504 storms because lots of contending writers all stretch each other's latency.
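A related mitigation, assuming Postgres: cap query time and lock-wait time inside the database at something below the proxy's timeout, so the database gives up first and the application can return a clean error before the proxy stamps a 504. A sketch with `psycopg2`; the table, values, and thresholds are illustrative:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")
with conn.cursor() as cur:
    # Fail fast inside the database, well under a 30s proxy timeout:
    # statement_timeout caps total query time, lock_timeout caps time
    # spent waiting on another transaction's row locks.
    cur.execute("SET statement_timeout = '10s'")
    cur.execute("SET lock_timeout = '2s'")
    try:
        cur.execute(
            "UPDATE inventory SET quantity = quantity - 1 WHERE sku = %s",
            ("ABC-123",),
        )
        conn.commit()
    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled):
        conn.rollback()
        # Contended or runaway query: tell the client to retry rather than hang.
        print("query timed out or row locked, retry later")
conn.close()
```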
5. The application is doing CPU-bound work in the request handler
Generating a PDF, encoding a video, running a large computation, calling an inference model. All of these regularly take longer than 30 seconds. The right pattern is to enqueue the work, return a 202 Accepted with a job ID, and have the client poll. Doing it inline guarantees occasional 504s — anyone with a slightly slower input will hit the timeout.
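A minimal sketch of the enqueue-and-poll pattern, assuming Flask and an in-memory job store; a real system would use Redis/RQ, Celery, or similar so jobs survive restarts, and the 45-second sleep stands in for the slow work:

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

from flask import Flask, jsonify

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=4)
jobs = {}  # job_id -> {"status": ..., "result": ...}; in-memory for the sketch

def generate_report(job_id: str) -> None:
    time.sleep(45)  # stand-in for the slow work: PDF render, export, inference
    jobs[job_id] = {"status": "done", "result": "report-contents"}

@app.post("/reports")
def create_report():
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    executor.submit(generate_report, job_id)
    # Respond immediately: the proxy never has to wait on the slow work.
    return jsonify({"job_id": job_id}), 202

@app.get("/reports/<job_id>")
def get_report(job_id):
    job = jobs.get(job_id)
    if job is None:
        return jsonify({"error": "unknown job"}), 404
    return jsonify(job), 200
```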
6. A startup race during deploy
The new application instance comes up and the load balancer routes traffic to it before the application has finished warming caches, opening database connections, or running migrations. First-request latency on a cold instance can be tens of seconds. If the proxy times out before the warmup completes, every request to that fresh instance becomes a 504 until it warms.
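The usual guard is a readiness check the load balancer polls before sending real traffic. A Flask-flavored sketch, assuming your load balancer's health check can be pointed at a path like `/healthz`; the 20-second sleep stands in for whatever warmup the app actually does:

```python
import threading
import time

from flask import Flask

app = Flask(__name__)
ready = threading.Event()

def warm_up():
    # Stand-in for real startup work: open DB pools, prime caches, etc.
    time.sleep(20)
    ready.set()

threading.Thread(target=warm_up, daemon=True).start()

@app.get("/healthz")
def healthz():
    # The load balancer only routes traffic once this returns 200,
    # so cold-instance requests never reach the proxy's timeout.
    if ready.is_set():
        return "ok", 200
    return "warming up", 503
```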
Diagnosing a 504 as a user ¶
- Refresh once after a few seconds. Transient slowness clears itself; refreshing immediately just resubmits to the same overloaded backend.
- Try a different page on the same site. If only one URL 504s and the rest of the site is fine, the issue is one slow endpoint — usually a search, a report, or a specific calculation. There's nothing you can do about it as a user; the site needs to fix the slow query.
- Check our multi-region probe. A 504 visible from all four isitdown.io regions indicates global slowness; a 504 from only one region points to regional load-balancer pressure.
Diagnosing a 504 as an operator ¶
The diagnostic flow follows the timeout-clock fingerprint plus the request fingerprint:
- Which timeout fired? Look at the response time of the failing request. 30s = API Gateway or GCP backend. 60s = nginx / ALB. 100s = Cloudflare 524. That tells you which layer's patience ran out.
- Is one endpoint disproportionately affected? If 504s are concentrated on `/api/reports` or `/search`, the slow code path is identified for you. Profile the upstream's logs for that endpoint (see the log-bucketing sketch after this list).
- Is the database the bottleneck? Check slow-query logs. Postgres' `pg_stat_statements`, MySQL's slow query log, and the cloud DB's "Performance Insights" all surface this fast. A query that used to take 50ms and now takes 30s usually means an index was dropped or data volume crossed a planner threshold.
- Are external calls the bottleneck? Distributed tracing (OpenTelemetry, Datadog APM, etc.) shows what fraction of request time is spent in external calls. If 95% of the latency is one third-party service, the upstream isn't slow; they are.
- Did anyone deploy recently? Same as 502: a deploy in the last 5-15 minutes correlated with 504 spikes points at the deploy. Roll back, investigate after.
- Are you behind a CDN with a 100s ceiling? If you're on Cloudflare Free / Pro / Business, no timeout setting you control will make the 524 disappear when the origin genuinely takes longer than 100 seconds. The fix is to make the request faster, not to argue with the timeout.
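To answer the first two questions from an access log, bucket the 504s by path and by elapsed time. A sketch that assumes an nginx-style combined log with `$request_time` appended at the end of each line; log formats vary, so the regex will need adjusting to yours:

```python
import re
import sys
from collections import Counter

# Assumes lines like: ... "GET /api/reports HTTP/1.1" 504 0 "-" "curl/8.0" 30.001
LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .* (?P<rt>\d+\.\d+)$'
)

paths = Counter()
durations = Counter()

for line in sys.stdin:
    m = LINE.search(line)
    if not m or m.group("status") != "504":
        continue
    paths[m.group("path")] += 1
    durations[round(float(m.group("rt")))] += 1  # bucket to whole seconds

print("504s by path:", paths.most_common(5))
print("504s by elapsed seconds:", durations.most_common(5))
# A spike at exactly 30 or 60 seconds identifies which layer's timeout fired;
# a single dominant path identifies the slow endpoint.
```

Run it as `python bucket_504s.py < access.log` on the proxy host.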
Why "just raise the timeout" is usually wrong ¶
The temptation when a 504 fires is to bump the proxy's timeout from 30s to 60s and move on. This works for one specific kind of 504 — when the upstream genuinely needs that long and the user is willing to wait. For most 504s it just hides the problem until the next time the slow path gets one increment slower. Worse, it makes outages bigger: a request now holds upstream resources for 60 seconds instead of 30, so capacity in a queue-driven system degrades faster under load.
The lasting fix is one of: making the slow operation faster (index, cache, query rewrite), making it async (job queue, return 202), or shedding load before it reaches the slow code (rate limit, circuit breaker). Raising the timeout is the answer when "this operation legitimately takes 45 seconds and there's no way to make it shorter" — for example, a one-off CSV export. For an everyday search endpoint, raising the timeout is a band-aid.
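For the load-shedding option, a toy circuit breaker shows the idea: after a run of failures against a slow dependency, stop calling it for a cooldown window and fail fast instead of queueing more doomed requests behind it. A minimal sketch with illustrative thresholds; production code would normally reach for a maintained library:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                # Fail fast: don't make every request wait out the slow path.
                raise RuntimeError("circuit open, skipping call")
            # Cooldown over: close the circuit and try the dependency again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```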
FAQ ¶
Is a 504 my fault as a user?
No. There's nothing on a user's side that causes a 504. You can't fix it by clearing cookies, flushing DNS, switching browsers, or changing networks. The slow operation is happening on the server's side regardless of which client made the request.
Why does the same site sometimes return 502 and sometimes 504 during an outage?
A spectrum: as upstream capacity degrades, requests first start slowing down (504 territory), then start crashing entirely as resources exhaust (502 territory). A site mid-incident often shows mixed 502s and 504s because some upstream instances are slow and others are dead. The mix shifts toward 502 as the incident worsens; toward 504 as it recovers.
How is Cloudflare 524 different from 504?
It's the same shape — origin too slow — but Cloudflare distinguishes "origin timed out" (524) from "Cloudflare itself had an issue" (504). When you see 524, the origin held the connection but never sent the HTTP response. When you see 504 from Cloudflare, something in Cloudflare's network couldn't reach the origin in time. As a debugger, 524 means "look at your origin"; 504 from Cloudflare means "check Cloudflare's status page."
Can a 504 be a security signal?
Indirectly. Slowloris-style attacks try to keep many connections open without finishing requests, and a flood of 408s and 504s alongside low-throughput connections can be the surface symptom. More commonly, though, 504s come from honest overload rather than anything adversarial. If you see 504s correlated with a traffic spike from a small number of IPs, it's worth a closer look.