
When Cloudflare Goes Dark: What the November 2025 Outage Really Exposed About Your Risk Posture


Thursday, November 20, 2025

When Your Edge Fails: Executive Lessons from the November 18 Cloudflare Outage

On the morning of November 18, 2025, a single configuration change at Cloudflare briefly took a significant chunk of the Internet offline. Major platforms and enterprises that depend on Cloudflare’s network saw 5xx errors and “internal server error” pages as traffic stopped flowing through a control plane that front-ends roughly a fifth of the web.

As outlets like Skymet have documented, millions of users suddenly began seeing the now-viral error message:
“Please unblock challenges.cloudflare.com to proceed.”

Cloudflare’s own human-verification pages broke, preventing websites from authenticating visitors. X, ChatGPT, Perplexity, PayPal, crypto exchanges, Canva, and thousands of other services stalled behind a failing challenge page. In a particularly on-the-nose twist, even Downdetector.com—the site everyone uses to confirm outages—was down.

This was not a nation-state attack or a record-breaking DDoS. An internal change to database permissions caused a bot-management feature file to suddenly balloon in size, exceed an internal limit, and crash the core proxy infrastructure that routes traffic for Cloudflare customers globally.

If you are a CISO, CIO, or CEO, the real story is not “Cloudflare had an outage.” The real story is that the outage was an unscheduled, unplanned test of your third-party risk, architecture, and governance.

And many organizations failed that test without even realizing it.

What Actually Happened — The Executive Version

Cloudflare’s own timeline is straightforward:

  • A permissions change in a ClickHouse database was deployed.
  • A bot-management feature file, generated every few minutes from that database query, suddenly contained duplicate rows and doubled in size.
  • The bot module had a hard cap on the number of features it could ingest. When the oversized file hit that limit, the core proxy infrastructure panicked and failed.
  • Traffic entered a strange oscillation as “good” and “bad” feature files propagated, initially looking to internal teams like a possible large-scale attack.
  • Once Cloudflare stopped distributing the bad file, pushed a known-good version, and restarted the core proxy software, error rates began to drop and traffic normalized.

This was not a sophisticated adversary. It was a coupling problem: a critical control (bot management) tightly bound to a critical pipeline (the global proxy), with brittle assumptions around input size and behavior.

From a risk lens, that coupling is what should concern you, not just the outage itself.
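The brittle-input assumption in that timeline is worth internalizing. Below is a minimal sketch, illustrative only and not Cloudflare’s actual code, of the difference between crashing on an oversized feature file and failing back to a known-good one:

```typescript
// Illustrative sketch: a feature-file loader with a hard cap. The names and
// the cap value are hypothetical.
type FeatureFile = { features: string[] };

const MAX_FEATURES = 200; // hypothetical hard cap on ingestable features

let lastKnownGood: FeatureFile = { features: [] };

function loadFeatureFile(candidate: FeatureFile): FeatureFile {
  if (candidate.features.length > MAX_FEATURES) {
    // Throwing here and letting the error propagate to the top of the proxy
    // reproduces the November failure mode; falling back to the previous
    // file instead keeps traffic flowing while operators investigate.
    console.warn(
      `feature file rejected: ${candidate.features.length} > ${MAX_FEATURES}`
    );
    return lastKnownGood;
  }
  lastKnownGood = candidate;
  return candidate;
}
```

The design choice is the point: a hard limit on untrusted-looking input is fine, but the response to exceeding it should be local and graceful, not a global crash.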

The Outage as an Involuntary Penetration Test

Security journalist Brian Krebs framed the event in a way every executive should pay attention to: when some organizations bypassed Cloudflare during the outage to regain availability, they effectively removed their primary security control layer at the edge and exposed their infrastructure directly to the public Internet for several hours. In practice, that window looked a lot like an impromptu penetration test of their architecture and controls.

Several key dynamics emerged:

  • Some customers shifted DNS, temporarily took their sites off Cloudflare, and presented their origin services directly to the Internet.
  • Many of those organizations rely on Cloudflare to filter the OWASP Top 10 and high volumes of automated attack traffic. When that shield dropped, attackers watching DNS and routing changes suddenly had a clear shot.
  • Experts quoted by Brian Krebs noted that this incident operated like a “live tabletop exercise at Internet scale” — a real-time test of how organizations route around their own control plane when under time pressure, and how quickly “shadow IT” appears when official paths are blocked.

In other words, if your team pivoted away from Cloudflare on November 18, you likely ran an unplanned red-team event on yourself. The question is: did anyone treat it that way, or was it just “a bad morning”?

Over-Centralization: Your Edge Provider Is Now a Single Point of Failure

Cloudflare estimates that a substantial portion of the world’s Internet traffic passes through its infrastructure. Combined with the dominance of AWS, Azure, and other hyperscalers—which have had their own large-scale incidents—the pattern is clear: we have re-centralized critical Internet risk into a handful of providers.

Brian Krebs and other experts highlight the same structural issue:

  • Organizations increasingly outsource WAF, DDoS, bot management, and DNS to a single edge provider for simplicity.
  • When that provider stumbles, both availability and security can degrade simultaneously.
  • Many enterprises have never fully tested their ability to operate — securely — without that edge layer, even for a few hours.

What we saw in November was not just a Cloudflare incident. It was a concentration-risk incident for the modern Internet.

This Wasn’t Cloudflare’s First Wake-Up Call

It is also important to say this clearly: November 18 was not the first time a hiccup in Cloudflare’s infrastructure echoed across the Internet, and it will not be the last. The company’s popularity and architectural centrality guarantee that.

In July 2025, Cloudflare’s public DNS resolver 1.1.1.1 went dark globally for about 62 minutes due to a legacy configuration error in its BGP routing systems. The failure withdrew the resolver’s address prefixes, including 1.0.0.0/24, from the Internet, effectively cutting off name resolution for millions of users. Cloudflare’s own postmortem and external analyses all point to a familiar culprit: internal configuration in legacy systems, not an external attacker.

In September 2025, Cloudflare suffered a separate incident that hit its Tenant Service API, multiple public APIs, and the Cloudflare Dashboard. The root cause was a React “useEffect” bug in the dashboard: a dependency mistake that caused the effect to re-run on every render, flooding the Tenant Service API with unnecessary calls until the service, and then other APIs that depended on it, collapsed under the load. In effect, Cloudflare accidentally DDoS’d its own APIs.
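The mechanics of that bug class are easy to reproduce. React decides whether to re-run an effect by comparing each dependency with `Object.is`; the sketch below simulates that check (it is not Cloudflare’s dashboard code) to show why an object literal recreated on every render defeats it:

```typescript
// Minimal simulation of React's dependency comparison: each entry in the
// dependency array is compared with Object.is against the previous render.
function depsChanged(prev: unknown[], next: unknown[]): boolean {
  return (
    prev.length !== next.length ||
    prev.some((value, i) => !Object.is(value, next[i]))
  );
}

// An object literal recreated on each render is never Object.is-equal to the
// previous one, so the effect re-runs, and re-calls the API, on every render:
depsChanged([{ tenantId: "t1" }], [{ tenantId: "t1" }]); // true

// Depending on the primitive value instead keeps the effect stable:
depsChanged(["t1"], ["t1"]); // false
```

The fix pattern is equally simple: depend on stable primitives (or memoized values), not on objects rebuilt every render.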

Across all three events—July DNS resolver outage, September API/dashboard outage, and the November challenge-page meltdown—the pattern is striking:

  • Internal configuration and code issues in critical services.
  • Tightly coupled control planes (DNS, API gateways, challenge systems, proxies).
  • Global blast radius because the same platforms underpin a large share of the Internet.

From a board and regulator perspective, this is not about blaming one vendor. It is about recognizing that our dependence on a small number of deeply embedded infrastructure providers has turned their internal change management into a systemic risk.

–––––––––––––––––––––––––––––––––––––––––––––

Five Hard Lessons for CISOs and CEOs

Here is the actionable, board-level view of what this outage should be telling you.

Stop Confusing Outsourcing with Risk Transfer

Cloudflare mitigates risk; it does not own it.

If your internal posture is “Cloudflare handles that,” you have already accepted that:

  • Your origin applications may not be hardened to direct Internet exposure.
  • A routing or DNS change made under pressure can instantly collapse your defense-in-depth into a flat, exposed surface.

Regulators and boards will not accept “our vendor had a bad day” as a sufficient explanation.

Map Your “Cloudflare Blast Radius”

Most organizations cannot answer a fundamental question in under 30 minutes:

“If Cloudflare disappeared for six hours again tomorrow, which services, customers, and revenue streams are inside that blast radius — and how would we operate?”

You should be able to produce, on demand:

  • An inventory of all apps and APIs behind Cloudflare (including vanity domains and “temporary” projects).
  • A clear list of which protections are active: WAF, bot management, geo blocks, rate limiting, zero trust access, and so on.
  • Pre-approved playbooks for how to bypass the edge in a controlled way if you must — and what compensating controls to enable when you do.

If you had to improvise DNS or routing changes during the outage, that is a governance problem, not just an operations problem.

Design an Intentional “Plan B Edge”

The answer is not “dump Cloudflare.” The answer is to engineer away single points of control.

Practically, that means:

  • Multi-vendor DNS: At least one secondary provider you can fail over to without recreating records in a panic.
  • Tiered protection: Not every workload needs the same controls. Decide in advance which can temporarily accept lower security in exchange for availability—and which cannot.
  • Segmented architectures: Ensure your most critical customer-facing services do not all share a single Cloudflare configuration, rule set, or dependency pattern.

This is not about “Cloudflare versus competitor X.” It is about engineering out structural concentration risk in your edge and DNS layers.
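That “decide in advance” step can be captured in a pre-approved policy rather than improvised mid-incident. A hedged sketch, with illustrative tiers and field names rather than any standard:

```typescript
// Hypothetical tiering model for deciding, ahead of time, which services may
// be failed over to direct DNS during an edge-provider outage.
type Tier = "critical" | "standard" | "low";

interface EdgeService {
  name: string;
  tier: Tier;
  originHardened: boolean; // can the origin survive direct Internet exposure?
}

// Policy: only services whose origins are hardened and whose tier tolerates
// reduced protection may bypass the edge.
function mayBypassEdge(svc: EdgeService): boolean {
  if (!svc.originHardened) return false; // never expose an unhardened origin
  return svc.tier !== "critical"; // critical services wait for edge recovery
}
```

Even a table this simple, agreed before an outage, removes the riskiest improvisation from the incident timeline.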

Treat Outages as Live-Fire Exercises, Not Just Bad Days

Reporting from Brian Krebs made one uncomfortable truth clear: the outage showed how many organizations will happily trade security for availability when the pressure is on, often without logging or reviewing what they changed.

After-action, you should be asking:

  • What did we turn off or bypass (WAF rules, bot controls, geo restrictions), and for exactly how long?
  • Who made emergency DNS or routing changes, and under what authority?
  • Did staff move to personal devices, home Wi-Fi, or unsanctioned SaaS to keep work moving?
  • Were any “temporary” tunnels, shadow services, or vendor accounts spun up — and have they all been unwound?

If the outage ended and the organization simply breathed a sigh of relief and moved on, you wasted an invaluable signal about your true operational resilience.

Re-Baseline MTD, RTO, and RPO for Third-Party Edge Failures

In our work at CloudSkope — including prior analyses of significant financial services outages — we emphasize Maximum Tolerable Downtime (MTD), Recovery Time Objective (RTO), and Recovery Point Objective (RPO) as non-negotiable metrics for critical services.

The Cloudflare incidents show that those metrics must explicitly account for external control planes:

  • What is your MTD if your primary edge provider fails during peak business hours?
  • How fast can you switch to a fallback configuration (RTO) without exposing untested origins?
  • What data and log visibility will you lose, even temporarily, when you bypass the edge (your effective RPO for telemetry and detection)?

If you do not have those numbers at your fingertips, your business continuity plan is telling a more optimistic story than your architecture can support.
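As a back-of-the-envelope illustration, with purely hypothetical figures that you would replace with your own:

```typescript
// Hypothetical assessment of an edge-provider outage against MTD/RTO/RPO.
interface OutageAssessment {
  lostRevenue: number;
  exceedsMTD: boolean;
  telemetryGapHours: number; // edge logs lost after traffic bypasses it (RPO)
}

function assessEdgeOutage(
  revenuePerHour: number,
  outageHours: number,
  mtdHours: number,
  failoverHours: number // your effective RTO to a fallback edge
): OutageAssessment {
  // You are only down until the outage ends or failover completes,
  // whichever comes first.
  const effectiveDowntime = Math.min(outageHours, failoverHours);
  return {
    lostRevenue: revenuePerHour * effectiveDowntime,
    exceedsMTD: effectiveDowntime > mtdHours,
    telemetryGapHours: Math.max(0, outageHours - effectiveDowntime),
  };
}

// Example: $50k/hour service, 6-hour outage, 2-hour MTD, 1-hour failover.
const a = assessEdgeOutage(50_000, 6, 2, 1);
// a.lostRevenue === 50_000, a.exceedsMTD === false, a.telemetryGapHours === 5
```

Note the third output: even a fast failover buys availability at the cost of hours of lost edge telemetry, which is exactly the detection gap an attacker would exploit.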

What Cloudflare Got Right — and What Still Worries Me

To their credit, Cloudflare delivered candid, technically detailed postmortems for all three incidents, describing the July DNS misconfiguration, the September “useEffect” bug, and the November challenge-file and proxy failure.

However, from a risk and engineering standpoint, two concerns remain:

  • Tight coupling of critical paths: A bot-management feature file, a tenant API, or a DNS addressing system should not be able to crash global control planes that front-end such a large portion of the Internet. That indicates an architecture where convenience and performance optimizations were not sufficiently stress-tested against failure modes.
  • Zero-trust inconsistencies: In each incident, trusting internal inputs (developer-driven feature file generation, dashboard code, legacy configuration tooling) violated the very zero-trust principles Cloudflare, and most enterprises, advocate for external traffic.

Outages will always happen. What matters is whether your architecture fails gracefully and locally, or catastrophically and globally.

How CloudSkope Advises Clients After an Event Like This

At CloudSkope, we view incidents like the Cloudflare outages the same way we view major banking and infrastructure meltdowns: as public case studies in what happens when resilience, monitoring, and communication do not keep up with complexity.

For our clients, a typical response includes:

Edge & Third-Party Risk Audit


  • Map all dependencies on Cloudflare and similar providers (DNS, WAF, CDN, SSO, bot management).
  • Quantify the revenue, regulatory, and operational impact of six-hour and 24-hour outages for each dependency.

Business Impact Analysis (BIA) with External Control Planes

  • Integrate edge providers into MTD, RTO, and RPO calculations — the same way you would treat a Tier-1 data center or core banking system.

“Chaos at the Edge” Simulations

  • Controlled exercises where we intentionally simulate Cloudflare, Microsoft, or AWS edge outages and observe how teams route around the problem.
  • Identify where shadow IT, ad-hoc DNS changes, and one-off tunnels appear under pressure.

Architecture Hardening and Multi-Vendor Strategy

  • Design and test a pragmatic, cost-effective multi-edge and multi-DNS approach that reflects your risk appetite and regulatory profile.
  • Harden origin services to withstand temporary direct exposure without becoming trivial targets.

Board-Ready Narrative and Metrics

  • Translate deeply technical findings into the language of risk, resilience, and fiduciary duty — enabling your leadership team to make informed decisions, rather than simply accepting technical reassurances.

The Bottom Line

If the November Cloudflare outage was the first time your organization seriously asked, “What happens if our edge provider fails?”, you are already behind the curve.

The real question for executives is not “Will Cloudflare fail again?” but “How much of my business depends on Cloudflare being perfect — and what am I doing about that today?”

If your team cannot answer that in a single, coherent page, it is time for an audit. CloudSkope helps enterprises investigate, quantify, and mitigate precisely these kinds of structural risks before they become headlines. If you would like an audit-grade review of your Cloudflare and third-party dependency risk, we can start with a focused assessment and a board-level briefing that your leadership can act on immediately.

About the Author

Dipan Mann is the Founder & CEO of CloudSkope, a global risk, resilience, and technology advisory firm that partners with founder-led and enterprise organizations on high-consequence decisions. As a Sand Hill Road alum and former CIO and CISO of large enterprises, he brings more than three decades of experience leading cloud, security, and transformation programs across the public sector and high-growth technology companies, including senior roles at LogicMonitor, Kodiak, LiveVox, and other enterprise companies.

He currently sits on the advisory boards of several organizations, including The Channel Company, the University of Wisconsin-Platteville, and, more recently, ThreatLocker.

At CloudSkope, Dipan works with boards, CEOs, and CISOs to translate complex cyber and operational risk into clear, fiduciary-ready choices, with a focus on enterprise resilience, global security, and business continuity at scale.

Posted on Thursday, November 20, 2025 in the Business Impact Analysis category.

