On 20 October 2025 a seemingly small technical fault in Amazon Web Services’ US‑EAST‑1 region cascaded into a global crisis. Popular apps including Snapchat, Zoom, Roblox, Signal, Coinbase, Canva, Duolingo, Slack and Atlassian ground to a halt. Bank customers couldn’t access their accounts; smart doorbells fell silent. For several hours the world’s digital nervous system flickered, exposing just how fragile our cloud‑dependent economy has become.
Amazon later acknowledged that the outage was triggered by DNS resolution problems linked to the DynamoDB API in its U.S. East (Northern Virginia) region. A misfiring internal subsystem responsible for monitoring network load balancers exacerbated the issue, and more than 64 internal AWS services were affected. Outage trackers recorded over 11 million reports of connectivity issues worldwide. Although recovery began within a few hours, the incident highlighted the risks of centralising a third of the internet’s traffic on a single provider. It reignited calls for digital sovereignty, multicloud strategies, decentralised internet infrastructure and robust public oversight.
This post dissects the October 2025 AWS outage, examines its causes and consequences, and connects the event to broader debates about cloud concentration and resilience. We argue that while technical fixes can reduce the likelihood of similar incidents, the deeper solution is to diversify control over the internet’s backbone, invest in open and federated cloud architectures, and treat cloud services as critical infrastructure subject to democratic accountability.
Living in the Cloud’s Shadow
The modern web is built atop a handful of hyperscale clouds. Amazon Web Services (AWS), Microsoft Azure and Google Cloud together provide the compute, storage and networking that power streaming services, e‑commerce, financial systems, government operations and the digital tools that millions rely on each day. AWS alone commands roughly 30 percent of global cloud services, while Azure and Google hold 20 percent and 13 percent respectively. This concentration has allowed providers to achieve economies of scale and roll out innovations quickly. Yet it also means that when one region stumbles, the impact reverberates far beyond its data centre walls.
Outages are not new; they occur periodically as complex systems fail. AWS’s longest recent outage occurred in late 2021, when a problem in the U.S. East region took down websites and apps for more than five hours. Smaller glitches happened in 2020 and 2017. But the October 2025 incident was particularly dramatic because dependence on cloud services has grown sharply since the pandemic. Remote work, telemedicine, digital payments, online education and AI‑assisted platforms have woven the cloud into daily life. The outage thus served as a wake‑up call for regulators, businesses and citizens who had come to regard the cloud as an invisible, infallible utility.
What happened on 20 October 2025?
A cascading failure in US‑EAST‑1
At around 07:55 UTC on 20 October 2025, internet performance monitors observed increased error rates and latency for multiple AWS services in the Northern Virginia region. Users began reporting that they could not load websites or send messages. According to Amazon’s service health updates and subsequent analyses, the problem was related to DNS (Domain Name System) resolution of the DynamoDB API endpoint. DNS functions as the internet’s address book, translating readable names into IP addresses. When DNS fails, services cannot find the resources they need, and requests time out.
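To make that failure mode concrete, the minimal Python sketch below performs the same kind of DNS lookup a client library does before it can reach the DynamoDB API. The endpoint name is the standard public one for the region; the check is purely illustrative, not a reconstruction of AWS’s internal tooling.

```python
import socket

# Standard public endpoint for DynamoDB in the affected region.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the same kind of DNS lookup an SDK does before it
    # can open a connection to the DynamoDB API.
    records = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    addresses = sorted({entry[4][0] for entry in records})
    print(f"{ENDPOINT} resolves to {addresses}")
except socket.gaierror as exc:
    # When resolution fails, a caller never reaches DynamoDB at all; this is
    # the class of error applications surfaced during the outage.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```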
Ladbible, citing AWS updates, reported that engineers “immediately engaged and were actively working on both mitigating the issue and fully understanding the root cause”. Later statements clarified that the underlying issue was an internal subsystem responsible for monitoring the health of network load balancers, which in turn affected DNS resolution. This seemingly esoteric component at the heart of AWS’s networking stack triggered a domino effect across the region.
By early morning U.S. Eastern Time (around 5 a.m. ET), AWS said it had identified the cause and was “taking additional mitigation steps”. Recovery was observed around 6 a.m. ET, and by 6:30 a.m. ET Amazon reported that “most AWS Service operations are succeeding normally now”. Thousands of queued requests had to be processed, and residual issues persisted for some customers, but core services returned to stability within a few hours.
The scale of the disruption
During the outage, over 64 internal AWS services were affected, meaning not only external customer‑facing products but also the internal building blocks that other AWS services depend on. Outage tracker Downdetector registered more than 11 million reports of connectivity issues globally. Social networks, messaging platforms, online games, finance apps and even doorbell cameras were disrupted. The Associated Press, via a regional news outlet, noted that Amazon’s own services, such as the Ring smart doorbell and Alexa, were impacted. Customers reported being unable to access the Amazon website or download books to their Kindle readers.
Major digital services that rely on AWS were hit. Slack and Atlassian, productivity tools used by millions of workers, experienced outages or performance degradation, as did Snapchat and the gaming platform Roblox. Zoom meetings glitched, Duolingo lessons wouldn’t load, Canva’s design tools were inaccessible and AI startup Perplexity suffered downtime. Signal, Coinbase, Robinhood and other finance or messaging apps also reported issues. Even essential services like banking apps and the UK’s HMRC tax portal struggled. The Bank of Scotland publicly apologised on social media, acknowledging that AWS issues were causing service disruptions.
While some services rerouted traffic to alternative regions, many could not because US‑EAST‑1 is deeply integrated into global architectures. The region hosts critical control planes and global resources; migrating away quickly is nontrivial. As a result, the fragility of centralised cloud architectures became evident.
The Broader Impact: Social, Economic and Political Ramifications
Productivity and economic losses
Mehdi Daoudi, CEO of internet performance firm Catchpoint, told CNN that the financial toll of the October outage would reach into the hundreds of billions of dollars because millions of workers could not do their jobs. Downtime costs accumulate through lost productivity, delayed business operations, and reputational damage. The 2025 outage coincided with peak working hours in Europe and early hours in the United States, amplifying the economic hit. Small businesses reliant on e‑commerce and digital tools faced particular strain.
Disruption of critical services and democratic implications
Digital platforms have become essential infrastructure for democratic participation, journalism and civil society. Corinne Cath‑Speth of Article 19 highlighted that cloud disruptions are not just technical issues; they are democratic failures. When one provider goes dark, media outlets become inaccessible, secure messaging apps like Signal stop functioning and the infrastructure that supports public discourse crumbles. The outage underscored the need for pluralistic and redundant infrastructure to safeguard freedom of expression and public information flows.
Consumer trust and the myth of cloud infallibility
For many consumers the cloud had become invisible. You open a banking app or order groceries without considering the network of servers behind the scenes. The outage shattered this illusion. People were reminded that the internet is a patchwork of services that rely on shared dependencies. As one Ladbible article mused, the incident raised questions about “the fragility of having a third of the world’s internet using the same source platform”. The realisation that even Amazon can falter may nudge businesses and individuals to rethink risk tolerance and backup plans.
Root Causes and Technical Analysis
Understanding DNS and DynamoDB
To grasp the root cause, it helps to understand the interplay between DNS and AWS’s DynamoDB service. DNS translates domain names into IP addresses. When you request an API endpoint like dynamodb.us-east-1.amazonaws.com, your computer queries DNS to find the server’s address. Amazon DynamoDB is a managed NoSQL database used by thousands of applications. If the DNS records for DynamoDB are misconfigured, unreachable or stale, applications cannot locate the database and fail to respond.
AWS’s status page said the outage stemmed from a DNS resolution issue for the DynamoDB API endpoint. This suggests that Route 53 or the internal DNS system responsible for mapping the DynamoDB endpoint to IP addresses malfunctioned, perhaps due to propagation errors or internal load balancer failures. The subsequent update that an “underlying internal subsystem responsible for monitoring the health of our network load balancers” was at fault implies that health checks feeding into DNS decisions were misreporting service status, causing traffic to be directed incorrectly.
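From the application side, one practical defence is to bound how long the SDK waits on a broken endpoint. The sketch below is a minimal illustration using the boto3 SDK and assumes configured AWS credentials; the timeout and retry values are arbitrary examples, not recommendations.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import EndpointConnectionError

# Bound how long the SDK waits on a broken endpoint so requests fail fast
# instead of piling up behind a name that no longer resolves.
fail_fast = Config(
    connect_timeout=2,   # seconds to establish a connection
    read_timeout=2,      # seconds to wait for a response
    retries={"max_attempts": 2, "mode": "standard"},
)

dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=fail_fast)

try:
    response = dynamodb.list_tables(Limit=1)
    print("DynamoDB reachable:", response.get("TableNames", []))
except EndpointConnectionError as exc:
    # Surfacing the dependency failure quickly lets the application degrade
    # gracefully: serve cached data, queue writes, or alert operators.
    print("DynamoDB endpoint unreachable:", exc)
```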
Lessons from networked systems
The outage exemplifies how tight coupling and complex dependencies can create single points of failure. In distributed systems, monitoring components and configuration managers can have outsized influence. When health checks misfire, they can automatically trigger failovers, flush caches or update DNS records that then propagate erroneous information across the network. The key lesson for cloud architects is to design for graceful degradation: when one subsystem fails, others should continue operating. Implementing circuit breakers, multi‑region deployments, staged rollouts and manual holdback mechanisms can prevent automated responses from cascading into system‑wide failures.
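As a minimal sketch of the circuit‑breaker idea, the class below short‑circuits calls to a failing dependency for a cool‑down period instead of hammering it; production libraries add per‑endpoint state, metrics, jittered half‑open trials and fallbacks.

```python
import time


class CircuitBreaker:
    """Deliberately simplified circuit breaker, for illustration only."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to stay open before retrying
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While the circuit is open, skip the call so the failing dependency
        # gets room to recover and the caller can serve a fallback immediately.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: serving fallback instead")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

An application would wrap each dependency call, for example breaker.call(fetch_from_database), and return cached or reduced‑function results whenever the circuit is open.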
Cloud Concentration and Systemic Risk
The Monopoly Problem
AWS’s October outage triggered a broader conversation about market concentration. Analysts noted that AWS accounts for about 30 percent of global cloud services. When combined with Microsoft and Google, these three providers control more than 60 percent of the market. The outage showed that a failure in one region of one provider could affect a significant portion of internet traffic. This raises questions about whether cloud infrastructure should be considered critical national infrastructure subject to stronger regulation and redundancy requirements.
Digital Sovereignty and Multicloud Resilience
For governments, especially in Europe, the outage served as a warning about over‑reliance on foreign hyperscalers. European institutions have been exploring sovereign cloud initiatives such as EuroStack, Gaia‑X and local providers like OVHcloud to reduce dependence on U.S. companies. The outage reinforced the urgency of these efforts. If a single U.S. region can bring down European services, then national security and democratic resilience are compromised. Some policymakers argue for multicloud strategies, requiring critical services to run across multiple providers or on federated open‑source infrastructure that is geographically distributed.
The Role of DePIN and Decentralised Infrastructures
Decentralised physical infrastructure networks (DePIN) such as Akave offer another pathway toward resilience. Rather than centralising compute and storage in a few hyperscale data centres, DePIN distributes workloads across community‑owned or token‑incentivised nodes. By leveraging blockchain and peer‑to‑peer protocols, DePIN can provide redundancy and censorship resistance. In the context of the AWS outage, such networks could keep basic services running even if a large provider fails.
Toward a Resilient Cloud Future
Technical Mitigation
From a technical standpoint, cloud providers can enhance resilience by:
- Diversifying control planes: Ensuring that global services are not tied to a single region’s control plane. Distributing system management functions across regions reduces single points of failure.
- Redundant DNS architectures: Using multiple DNS providers or independent resolution mechanisms to avoid dependence on one vendor’s infrastructure (see the resolver sketch after this list).
- Graceful degradation strategies: Designing systems to operate in a reduced‑function mode when dependencies are unavailable rather than failing completely.
- Enhanced testing and chaos engineering: Regularly simulating outages and injecting failures to test system responses and highlight hidden dependencies.
- Clear communication: Providing timely and transparent updates to customers during incidents to maintain trust and allow them to enact contingency plans.
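To illustrate the redundant DNS item above, here is a minimal client‑side sketch using the third‑party dnspython package; it tries several independent resolvers in turn rather than relying on a single resolution path. The resolver addresses are illustrative, and authoritative‑side redundancy (serving a zone from more than one DNS provider) is the complementary server‑side measure.

```python
import dns.resolver  # third-party package: dnspython

# Independent public resolvers; addresses are illustrative, not a recommendation.
RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]


def resolve_with_fallback(name: str) -> list[str]:
    last_error = None
    for server in RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        resolver.lifetime = 2.0  # give up on this resolver after two seconds
        try:
            answer = resolver.resolve(name, "A")
            return [record.address for record in answer]
        except Exception as exc:  # timeout, SERVFAIL, NXDOMAIN, ...
            last_error = exc
    raise RuntimeError(f"all resolvers failed for {name}: {last_error}")


print(resolve_with_fallback("dynamodb.us-east-1.amazonaws.com"))
```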
Policy and Governance Reforms
The outage underscores the need for public policy that treats cloud infrastructure as critical. Regulators could:
- Mandate resilience standards for hyperscalers, including requirements for cross‑provider redundancy and disclosure of single points of failure.
- Encourage multicloud adoption, especially for critical public services, through procurement policies and funding support.
- Promote sovereign and open cloud initiatives that keep data within national or regional jurisdictions and foster competition.
- Support decentralised and community networks that complement hyperscale providers and provide fallback during outages.
- Enhance transparency and reporting, requiring providers to publish incident reports, root cause analyses and mitigation plans.
Corporate strategies
Businesses also share responsibility for building resilience. Many companies rely on AWS because of convenience and cost savings. However, the outage revealed the risks of single‑provider architectures. Organisations can:
- Implement multicloud architectures, running workloads across at least two providers to avoid a single point of failure.
- Invest in open standards and portability so that applications can move between clouds without major rewrites (see the sketch after this list).
- Perform regular disaster recovery tests, ensuring that failover mechanisms actually work.
- Negotiate service‑level agreements (SLAs) that include penalties for downtime and clear escalation pathways.
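As a concrete example of the open‑standards point above, the sketch below uses the boto3 SDK against any S3‑compatible object store; the endpoint URL, credentials and bucket name are placeholders. Because the S3 API is the common interface, switching providers becomes a configuration change rather than a rewrite.

```python
import os

import boto3

# Endpoint URL, credentials and bucket name are placeholders for illustration.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("OBJECT_STORE_ENDPOINT"),  # None falls back to AWS S3
    aws_access_key_id=os.environ["OBJECT_STORE_KEY"],
    aws_secret_access_key=os.environ["OBJECT_STORE_SECRET"],
)

# The same two calls work against any S3-compatible store; moving providers
# means changing the endpoint and credentials, not the application code.
s3.put_object(Bucket="resilience-demo", Key="hello.txt", Body=b"portable data")
obj = s3.get_object(Bucket="resilience-demo", Key="hello.txt")
print(obj["Body"].read())
```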
Conclusion
The October 20, 2025 AWS outage was a stark reminder that our digital world rests on fragile foundations. A combination of DNS misconfiguration and internal monitoring failures brought down dozens of services, impacting millions of users and causing economic harm. The incident fuelled debates about market concentration, digital sovereignty, and the need for resilient and decentralised infrastructure.
From a technical perspective, the outage underscores the importance of robust DNS architectures, multiregion redundancy and chaos engineering. From a policy perspective, it raises questions about regulation, competition and the role of sovereign cloud initiatives. Businesses and governments cannot assume that the cloud is infallible; they must plan for failure and diversify their dependencies.
Ultimately, building a resilient digital future will require a plurality of infrastructures, including hyperscale clouds, sovereign and federated clouds, decentralised networks, and community‑owned nodes. The AWS outage should not simply be seen as a one‑off glitch but as a catalyst for rethinking how we architect, govern and distribute the internet’s critical services.
Connect with Us
Akave Cloud is an enterprise-grade, distributed and scalable object storage platform designed for large-scale datasets in AI, analytics, and enterprise pipelines. It offers S3 object compatibility, cryptographic verifiability, immutable audit trails, and SDKs for AI agents, all with zero egress fees and no vendor lock-in, saving up to 80% on storage costs vs. hyperscalers.
Akave Cloud works with a wide ecosystem of partners operating hundreds of petabytes of capacity, enabling deployments across multiple countries and powering sovereign data infrastructure. The stack is also pre-qualified with key enterprise apps such as Snowflake and others.
FAQ
- What caused the Oct 20, 2025 incident?
  A DNS/health-monitoring issue in US-EAST-1 that cascaded through dependencies.
- Why did so many global apps fail?
  Centralized control planes and heavy reliance on a single region/provider.
- How do we reduce risk?
  Multi-region/multi-cloud, redundant DNS, open data formats, and verifiable storage.
- Where does Akave help?
  S3-compatible, verifiable object storage with onchain auditability and zero egress fees, improving portability and forensic proof.

