"The site is down" is the user-facing summary. Underneath it sits a request path with eight or nine independent moving parts, any one of which can break and produce the same blank page. Most outages are not exotic. They cluster into about a dozen recognizable shapes, each with named historical examples and predictable recovery patterns. This post walks through what's actually breaking when a site goes down, why one failure cascades into others, and why the status page is almost always the last thing to update.
The request path, briefly ¶
Before anything can fail, it helps to know what has to work. When you load a website, your browser walks roughly this path:
- DNS resolution. Your machine asks a recursive resolver for the IP address that example.com maps to. The resolver eventually reaches an authoritative nameserver and gets an answer.
- TCP connect. Your machine opens a connection to that IP on port 443.
- TLS handshake. The server presents a certificate; your browser validates it against a trusted root and checks that it hasn't been revoked (OCSP / CRL).
- Edge / CDN. Most production sites sit behind a CDN (Cloudflare, Fastly, Akamai). The edge node terminates TLS and either serves a cached response or proxies to the origin.
- Load balancer. The origin sits behind a layer-7 load balancer that picks a healthy app server.
- App server. Runs your code. Calls databases, caches, queues, and external APIs.
- Datastores + dependencies. The site's own database, its cache layer (Redis / Memcached), payment gateway, mail provider, search index, object storage.
Each line in that list is a place a failure can start. Most outages affect exactly one of them — but the user sees a single uniform symptom (a blank page or a 5xx error) regardless of which.
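If you'd rather see those hops as distinct failures than as one blank page, you can walk the first few by hand. A stdlib-Python sketch (the hostname is illustrative, not a real monitoring target); each numbered step raises a different exception, which is exactly the signal the browser collapses into "this site can't be reached":

```python
import socket
import ssl

HOST = "example.com"   # illustrative target

def probe(host: str) -> None:
    # 1. DNS: fails with socket.gaierror if the name doesn't resolve
    #    (NXDOMAIN, dead resolver, withdrawn nameserver routes).
    ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]
    print(f"DNS   ok: {host} -> {ip}")

    # 2. TCP: fails with ConnectionRefusedError or a timeout if nothing
    #    is listening (or a firewall is eating the SYN).
    with socket.create_connection((ip, 443), timeout=5) as raw:
        print("TCP   ok")

        # 3. TLS: fails with ssl.SSLCertVerificationError on an expired
        #    or wrong-hostname certificate.
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            print(f"TLS   ok: {tls.version()}")

            # 4. HTTP: a 5xx here means the first three layers are fine
            #    and the failure is at the edge, LB, or app tier.
            tls.sendall(f"GET / HTTP/1.1\r\nHost: {host}\r\n"
                        "Connection: close\r\n\r\n".encode())
            print("HTTP", tls.recv(1024).split(b"\r\n", 1)[0].decode())

probe(HOST)
```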
Failure categories ¶
Public post-mortems from the last decade cluster into a small number of repeatable patterns. The list below isn't exhaustive, but the overwhelming majority of real-world site outages fit one of these shapes.
1. Bad deploy
The most common cause. Someone pushes a release that contains a defect that wasn't caught by tests. The defect can be straightforward (a null-pointer dereference on every request) or subtle (a query that's 10x slower than the previous version, which only matters at peak load).
The classic example is Knight Capital, August 2012: a deploy left old code running on one of eight servers; the old code path was reused for a new flag, and within 45 minutes the firm had taken on roughly $7B of positions it never intended to hold. It lost $440M and needed an emergency rescue; within a year the firm had been acquired. Nobody dies when a web page goes down, but the same shape — bad deploy, narrow blast radius widens fast — repeats constantly at smaller scale.
2. Bad configuration push
Code is one form of change; configuration is another, and config changes get less scrutiny because they "aren't code." That framing error keeps producing outages.
Cloudflare, July 2 2019: a single regex in the Web Application Firewall was deployed globally without staged rollout. The regex hit catastrophic backtracking on legitimate traffic and pinned 100% CPU on every edge server in the network. Cloudflare-fronted sites worldwide returned 502s for ~27 minutes. The fix was a kill-switch that disabled the WAF rule.
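Cloudflare's actual rule was more complex, but the mechanism fits in ten lines. A toy Python demonstration of catastrophic backtracking (this is not the WAF pattern, just the same failure class):

```python
import re
import time

pattern = re.compile(r"(a+)+$")   # nested quantifiers: the danger sign

for n in (18, 20, 22, 24):
    s = "a" * n + "b"             # almost-matches, forcing full backtracking
    t0 = time.perf_counter()
    pattern.match(s)              # returns None, but only after ~2^n attempts
    print(f"n={n}: {time.perf_counter() - t0:.3f}s")
```

Each two extra characters of input roughly quadruple the running time. Against unlucky (or attacker-controlled) production traffic, that arithmetic is how a one-line rule pins every CPU it runs on.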
Fastly, June 8 2021: a customer pushed a valid VCL configuration that triggered a latent bug in Fastly's edge software, taking down 85% of the network. The New York Times, Reddit, GOV.UK, Stack Overflow, Twitch, Amazon, and Spotify all returned errors at the same time. Fastly recovered in about an hour.
3. Capacity exhaustion
The site is doing exactly what it's supposed to, just faster than it can keep up. A traffic spike — a viral post, a promo, a market open, a "thundering herd" of clients reconnecting — pushes load past what the system can serve. Connections queue, eventually time out, and the load balancer starts returning 503.
Capacity-exhaustion outages have a distinctive recovery curve: as soon as you add capacity (or the spike ends), recovery is immediate. Compare with code bugs, which need a deploy to fix.
4. Database overload or corruption
The database is the single point of failure in most outages, even when the rest of the stack is replicated. Common triggers (a toy simulation follows the list):
- Slow query at peak. A query that's fine at 10k QPS pegs CPU at 50k QPS. Connection pools fill, app servers can't get a connection, requests pile up.
- Replication lag. Read replicas fall behind the primary; reads serve stale data; lagging replicas fail health checks; the cluster fails over, and the new primary comes up with cold caches.
- Disk full. Logs, WAL, or table bloat fills the disk; the database goes read-only; every write fails.
- Locking storm. Long-running transactions hold locks that block everyone else; the lock graph cascades; the database stops making progress.
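The first trigger, the slow query filling the pool, is worth seeing in miniature. A toy simulation that assumes nothing about any real database: pool size, latencies, and timeouts are all made up, and the shape of the failure is the point:

```python
# Toy model: 5 "connections", 20 concurrent requests, a 1 s wait budget.
import queue
import threading
import time

POOL_SIZE = 5
pool = queue.Queue()
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")                    # stand-ins for DB connections

def handle_request(req_id: int, query_ms: int) -> None:
    try:
        conn = pool.get(timeout=1.0)         # wait up to 1 s for a connection
    except queue.Empty:
        print(f"req {req_id:2}: 503 (pool exhausted)")
        return
    try:
        time.sleep(query_ms / 1000)          # the "query"
        print(f"req {req_id:2}: 200")
    finally:
        pool.put(conn)

# At query_ms=50 all twenty requests return 200. At query_ms=2000 the
# first five requests hold every connection for 2 s, the other fifteen
# exhaust their 1 s wait and fail: the database is up, the site is down.
threads = [threading.Thread(target=handle_request, args=(i, 2000))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```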
5. DNS failure
If DNS resolution fails, nothing else gets a chance to. Three common shapes (a probe that separates them follows the list):
- Authoritative nameserver outage. The site's DNS provider (Route 53, Cloudflare DNS, Dyn, etc.) is having a bad day and isn't answering queries.
- Misconfiguration. Someone changes a record incorrectly. Traffic gets sent to the wrong IP or — worse — to nothing.
- Withdrawn routes. If the network announcement that points the public internet at the nameservers gets withdrawn, queries can't reach the servers in the first place.
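You can usually tell those three shapes apart from your desk by asking a recursive resolver and the domain's own authoritative servers the same question. A sketch assuming the third-party dnspython package; the domain and resolver IP are illustrative:

```python
import dns.message
import dns.query
import dns.resolver

DOMAIN = "example.com"                       # illustrative target

# 1. The normal path: ask a public recursive resolver.
recursive = dns.resolver.Resolver(configure=False)
recursive.nameservers = ["1.1.1.1"]
print("recursive:", [r.address for r in recursive.resolve(DOMAIN, "A")])

# 2. Look up the domain's authoritative nameservers...
ns_name = str(recursive.resolve(DOMAIN, "NS")[0].target)

# 3. ...and query one directly, bypassing every cache on the path.
ns_ip = recursive.resolve(ns_name, "A")[0].address
response = dns.query.udp(dns.message.make_query(DOMAIN, "A"), ns_ip, timeout=3)
print("authoritative:", [str(rd) for rrset in response.answer for rd in rrset])
```

If step 1 fails but step 3 answers, the resolver (or the path to it) is the problem; if the authoritative query itself times out, you're looking at shape 1 or shape 3.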
Facebook / Meta, October 4 2021 is the canonical example. A command issued during routine maintenance mistakenly disconnected Meta's backbone; the authoritative DNS servers, unable to reach the data centers, withdrew their own BGP routes by design. Without DNS, every Meta property — Facebook, Instagram, WhatsApp, Oculus — became unreachable. Worse, Meta's internal systems used the same DNS, so the company's own engineers couldn't log in to fix it. Badge readers stopped working at some buildings. The outage lasted ~6 hours.
6. Expired or misissued certificate
TLS certificates expire. If the renewal job fails silently and nobody notices, the cert lapses and every browser starts refusing the connection. The site's HTML didn't change; it's a browser-side trust failure.
Expired certs hit smaller sites more than household-name ones (the big shops have monitoring), but a 2018 LinkedIn outage and a 2020 Microsoft Teams outage both included expired-cert components. The fingerprint is unmistakable: every browser shows a cert-error interstitial; curl from the command line confirms an expired cert; nothing on the server side has changed.
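The whole category is preventable with one scheduled probe. A stdlib-Python sketch, not a monitoring product; the host and the 14-day threshold are illustrative:

```python
import socket
import ssl
import time

def cert_days_remaining(host: str, port: int = 443) -> float:
    """Complete a verified TLS handshake and read the leaf cert's expiry."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter is a string like 'Jun  1 12:00:00 2026 GMT'
    return (ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) / 86400

try:
    days = cert_days_remaining("example.com")
    print(f"certificate expires in {days:.0f} days")
    if days < 14:
        print("ALERT: renew now, before the browsers notice")
except ssl.SSLCertVerificationError as err:
    # An already-expired cert fails the handshake itself: the same
    # interstitial-producing error every visitor's browser is seeing.
    print(f"ALERT: handshake failed: {err.verify_message}")
```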
7. Upstream dependency failure (the cascade)
Modern sites depend on dozens of external services they don't control: payment processors, email senders, identity providers, object storage, search APIs, observability vendors. When one of those goes down, every site that depends on it goes down at the same time, and from the user's perspective it looks like a hundred unrelated services all broke simultaneously.
The largest examples in this category are the cloud-region outages we covered separately in What runs on AWS. The shorter version: when AWS us-east-1 has a bad day, Disney+, Slack, Coinbase, Reddit, Notion, Airbnb, and DoorDash go down together because they all share a Virginia data center they didn't tell their users about.
Live cascade tracking for the providers we cover sits at /infra.
8. Network-level failure
Below the application layer, the internet itself can break. The most consequential category here is BGP — the protocol that exchanges routing information between networks. A misconfigured BGP announcement can blackhole a network's traffic globally in minutes.
The Meta DNS outage above was technically a BGP outage. So was Google's June 2 2019 incident, when a misconfiguration in the network's control plane combined with a routine maintenance event produced congested links across multiple regions and degraded YouTube, Gmail, GCP, and Snapchat for ~4 hours. Network-level outages are particularly hard to diagnose because the operator's own monitoring often runs on the same network.
Also in this category: fiber cuts (a backhoe, a ship's anchor, a curious shark), undersea cable damage, and ISP-level peering disputes.
9. DDoS attack
Denial-of-service traffic overwhelms the site's capacity from the outside. Modern DDoS mitigation (Cloudflare, AWS Shield, Akamai Prolexic) handles most volumetric attacks transparently, but novel attack vectors occasionally land. Application-layer attacks — Slowloris, request floods to expensive endpoints — can degrade a site even when total bandwidth is fine, and they're harder to filter than raw bandwidth attacks.
10. Cascading client retry storms
This one is subtle and underrated. When a service has a brief blip, clients (mobile apps, browsers, other services) retry. If retries don't include exponential backoff and jitter, a one-second blip becomes a thirty-minute outage because the recovering service is hit by every client's retry the moment it comes back. Slack's January 4 2021 outage had a thundering-herd component on top of an underlying AWS networking issue: as Transit Gateway recovered, the reconnect surge made the recovery itself slow.
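The standard client-side fix is exponential backoff with full jitter. A minimal sketch; the request callable is a placeholder for whatever network call is being wrapped, and the constants are illustrative:

```python
import random
import time

def call_with_backoff(request, max_attempts=6, base=0.5, cap=30.0):
    """Retry `request` politely: exponential backoff capped at `cap`,
    with full jitter so clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise                # out of attempts, surface the error
            # Full jitter: sleep a random amount in [0, min(cap, base * 2^n)].
            # Without the randomness, every client that failed at the same
            # instant retries at the same instant: the thundering herd.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The full-jitter variant (a random sleep anywhere up to the backoff ceiling, rather than the ceiling itself) is the one AWS's retry guidance popularized, precisely because it spreads the reconnect surge thinnest.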
11. Software dependency in the boot path
The 2024 outage that put 'CrowdStrike' on the lips of every airline gate agent was technically a content-update push, not a code release. CrowdStrike pushed a malformed channel file (a configuration update, not signed code) to its kernel-level Falcon sensor. The sensor parsed it incorrectly and crashed the kernel, and because it ran early in boot the affected Windows machines couldn't recover without manual intervention. Hospitals, airlines, banks, and 911 dispatch went down for hours; some sites took days to recover because every machine needed in-person remediation.
The lesson, generalized: any vendor that runs in your boot path can take you offline globally without you having any local control.
12. Physical / facility events
Power outages, cooling failures, fires, floods. Less common at hyperscale data centers because of redundant power and cooling, but they happen. OVHcloud, March 10 2021: a fire at the Strasbourg data center destroyed SBG2 and damaged SBG1 / SBG3 / SBG4. Customers without offsite backups lost data permanently. Recovery took weeks for some tenants.
Even the resilient providers aren't immune: in July 2022, a cooling failure during a heatwave tripped a portion of Google's St. Ghislain data center offline.
The cascade — why one failure triggers others ¶
Most outages are not single-tier failures. The original failure pushes load or errors into adjacent tiers; those tiers fail in turn; the failure modes interact in ways that the architects of any single tier didn't anticipate.
Three patterns recur:
- Retry amplification. The original blip is brief. Clients retry. Retries multiply load 5–10x. The recovering service is overwhelmed by retries, prolonging the outage. Mitigated by exponential backoff with jitter (sketched in section 10) and circuit breakers.
- Connection pool exhaustion. A slow downstream means each request holds a connection longer. The pool fills. New requests can't get a connection and fail. Even after the downstream recovers, the pool is still full of slow-then-stuck requests. Mitigated by aggressive request timeouts and a circuit breaker (sketched after this list).
- Shared monitoring / control plane. The service that's failing also runs the dashboard you'd use to diagnose it. The status page that should explain the outage is itself returning 503. AWS has had this happen during multiple us-east-1 events; Meta had it during the October 2021 BGP outage; smaller shops hit it constantly when their observability vendor shares a region with their app.
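Two of those mitigations, short timeouts and circuit breakers, are cheap to sketch. A toy circuit breaker with illustrative thresholds: after five consecutive failures it stops calling the downstream for a cooldown window and fails fast instead, which is what keeps a slow dependency from eating every connection in the pool:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: fail fast while a dependency is known-bad."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold    # consecutive failures before opening
        self.cooldown = cooldown      # seconds to stay open before probing
        self.failures = 0
        self.opened_at = None         # None means closed (healthy)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None     # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0             # any success closes the circuit
        return result

# Usage: wrap every downstream call, e.g. breaker.call(lambda: fetch(...)),
# and pair it with a request timeout shorter than the pool wait so stuck
# requests release their connections.
```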
Why status pages always lag ¶
The official status page is consistently the last to acknowledge an outage. Three reasons:
- Verification before announcement. Operators don't want to declare an outage based on a single noisy alert. They wait for confirmation. Confirmation usually arrives 5–15 minutes after the actual impact starts.
- The status-page system runs on the same infrastructure. If you're posting "us-east-1 is having issues" but the CMS that updates the status page is itself in us-east-1, you can't post.
- Approval chains. Public-facing posts often need PR / legal / executive sign-off, especially at large companies. The first draft of "we are experiencing an issue" is technical; the published version has been edited.
This is why a multi-region direct-probe tool catches outages earlier than the provider's own status page: we've never seen the official page beat direct probes during a real incident.
What recovery looks like ¶
Recovery has stages. They go roughly:
- Stop the bleeding. Roll back the bad deploy, kill the malformed config, fail over to the healthy region. Goal: stop the failure from getting worse.
- Drain the queue. Backed-up requests time out or are rejected. The queue empties.
- Warm the caches. Cold caches generate database load that's 10–100x normal. Some operators deliberately ramp traffic back gradually to avoid re-triggering the original failure.
- Stabilize. Watch for secondary failures from the recovery itself.
- Update the status page. Mark resolved.
The retro / post-mortem follows in the days after. The post-mortems from AWS, Cloudflare, and GitHub are some of the best public engineering writing — the operator deconstructs their own failure with more honesty than most marketing copy ever achieves. Worth subscribing to.
How to read an outage in real time ¶
If you're staring at a "this site can't be reached" page right now, the diagnostic order is roughly:
- Multi-region check. Probe the target from multiple regions. If only some regions fail, it's a regional or routing issue. If all regions fail, the site is genuinely down. isitdown.io does this in one click.
- Look at adjacent services. If the target shares infrastructure with other big services (AWS, Cloudflare, Fastly), check whether those neighbors are also down. /infra shows the cascade view.
- Read the status page — but don't trust it for the first 15 minutes. By the time an incident is confirmed there, multi-region probes have usually been failing for a while.
- Wait, then retry. If you've confirmed it's not local to you, there is nothing else for an end user to do. Refreshing every 10 seconds doesn't help.
FAQ ¶
How long does the average outage last?
It depends on category. Bad deploys recover in minutes once a rollback is identified (median ~15–30 min). Bad configs the same. Capacity exhaustion recovers in seconds once capacity is added or the spike ends. DNS / BGP outages tend to last 30 min to 6 hours because the recovery path itself runs through the broken layer. Physical events (fires, floods) can leave permanent damage.
Why don't all outages have a public post-mortem?
The big infrastructure providers — AWS, Google Cloud, Cloudflare, GitHub, Fastly — publish detailed retros because their customers demand them and because the engineering brand benefits. Smaller SaaS vendors often don't because the cost-benefit is different (every disclosed bug is also a security signal to the next attacker). A post-mortem-less outage doesn't mean the operator is hiding something; it usually means the customer base isn't paying for that level of transparency.
What's the difference between "down" and "degraded"?
Down means requests are failing — 5xx errors, timeouts, blank pages. Degraded means requests are succeeding but slowly, or some features are working and others aren't. Most large outages start as degraded and only become down once retries cascade and queues fill. 503 vs. timeout vs. connection refused covers the user-facing differences.
Can a site go down "halfway"?
Yes — and that's the most confusing kind of outage. Read traffic might work while writes fail (database failover in progress). Logged-in users might be fine while logged-out visitors hit a CDN cache that's broken. Some regions might serve traffic while others can't. The user-facing complaint then becomes inconsistent ("it works for me, it's broken for them") and the operator has to debug a partial-outage shape, which is much harder than a clean down.