Meta’s outage is a timely reminder of the need to engineer critical systems to maximise their reliability, including minimising central points of failure and employing engineering principles like fault isolation
Melbourne: A major outage is affecting users of popular social media and messaging services including Facebook, Instagram and WhatsApp around the globe.
All these platforms are run by the social media giant Meta.
As news of the outage spread, we learned that it affected almost all of Meta’s products, including Messenger and Threads, as well as Meta’s business products, such as Facebook Ads Manager and the Messenger API for Instagram.
Most services are beginning to come back online.
But what went wrong, and what can we learn from this massive outage?
The scope of the outage
Outages have been reported from the United Kingdom to Canada to the United States and beyond.
The outage was first reported in the US on Wednesday (around 12.30pm in New York, 5.30pm in London, or 4.30am Thursday in Sydney).
Five hours later, Meta posted to X to say it was 99 per cent of the way to resolving the outage.
What might have caused it?
So far, there has been no official word on the cause of the outage. However, we can make some educated guesses based on its scope.
From reporting so far, the outage covered not only Meta’s major social media platforms and messaging services, but also some of its business products. It also affected Meta’s Login with Facebook service, which allows users to log in to third-party sites using their Facebook username and password.
In other words, there seem to be very few Meta products this outage did not impact.
That suggests that whatever went wrong involved a single point of failure: a component that all of Meta’s services rely on and cannot function without.
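The reasoning can be illustrated with a small sketch. The service and component names below are invented for illustration and say nothing about how Meta's systems are actually built. The sketch simply shows that a component needed by every service is a candidate single point of failure, while the failure of a component used by only one service stays contained.

```python
# Hypothetical dependency map. These names are invented for illustration
# and do not describe Meta's real architecture.
DEPENDENCIES = {
    "photo_feed": {"auth", "network_config", "media_store"},
    "messaging": {"auth", "network_config", "message_queue"},
    "ads_manager": {"auth", "network_config", "billing"},
}


def shared_dependencies(deps):
    """Return the components that every service depends on.

    Each of these is a candidate single point of failure.
    """
    sets = list(deps.values())
    if not sets:
        return set()
    return set.intersection(*sets)


def affected_services(deps, failed_component):
    """Return the services that stop working if one component fails."""
    return sorted(
        name for name, needs in deps.items() if failed_component in needs
    )


if __name__ == "__main__":
    print(shared_dependencies(DEPENDENCIES))
    print(affected_services(DEPENDENCIES, "billing"))
    print(affected_services(DEPENDENCIES, "auth"))
```

In this toy map, losing "billing" affects one service, while losing "auth" or "network_config" affects all three, which is the pattern the outage's scope hints at.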
Design for reliability
These kinds of outages are rare. That’s because major internet platforms are designed to be highly reliable.
The main way reliability is achieved is through replication. When you visit Instagram, for example, your computer connects to a server that sends back your Instagram feed. In fact, Instagram content is not stored on just one computer but is replicated across a massive array of computers known as a content delivery network (or CDN).
Practically all major web platforms, including news sites such as The Conversation, large companies, and online services such as YouTube and Google, use content delivery networks to increase the reliability and efficiency of their websites.
The idea behind a content delivery network is that if one computer in the network has a problem, another can take over in its place. This is what makes the networks reliable.
Content delivery networks also help when websites are under heavy demand. If many people are trying to request the same content, those requests can be spread out between many computers in the network, allowing each to be handled efficiently.
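The two ideas above, taking over from a failed computer and spreading requests across many computers, can be sketched in a few lines. This is a simplified illustration rather than how any real content delivery network is implemented: the Replica and ReplicatedStore classes, and the round-robin rotation used to spread requests, are assumptions made for the example.

```python
import itertools


class ReplicaDown(Exception):
    """Raised when a replica cannot serve a request."""


class Replica:
    """A hypothetical server holding a full copy of the content."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.served = 0  # number of requests this replica has answered

    def fetch(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        self.served += 1
        return f"{key} from {self.name}"


class ReplicatedStore:
    """Spreads requests across replicas and fails over when one is down."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        if not self.replicas:
            raise ValueError("at least one replica is required")
        # Round-robin starting point, so load is spread between replicas.
        self._next = itertools.cycle(range(len(self.replicas)))

    def fetch(self, key):
        start = next(self._next)
        count = len(self.replicas)
        # Try the chosen replica first, then each of the others in turn.
        for offset in range(count):
            replica = self.replicas[(start + offset) % count]
            try:
                return replica.fetch(key)
            except ReplicaDown:
                continue  # another replica takes over in its place
        raise RuntimeError("all replicas are down")


if __name__ == "__main__":
    store = ReplicatedStore(
        [Replica("a"), Replica("b", healthy=False), Replica("c")]
    )
    for _ in range(4):
        print(store.fetch("feed"))
```

Requests keep succeeding while replica "b" is down because another copy answers instead. The store only fails when every replica is unavailable, which is why a component that has no replicas at all is the weak point.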
The widespread nature of Meta’s outage suggests it might have happened in a part of Meta’s systems that wasn’t replicated. However, we’ll have to wait for word from Meta on the causes before we know for sure.
Lessons to be learned
Meta’s outage comes in the wake of the major outage caused earlier this year by CrowdStrike’s Falcon security software. Falcon’s design meant it was deeply entangled with Microsoft Windows. That made Falcon a single point of failure so that, when it crashed, it brought down Windows as well – in spectacular fashion.
A key lesson from that outage was that invasive security software such as Falcon should be re-engineered to operate at arm’s length from Windows. This is the principle of fault isolation: systems should be built as a collection of separate components so that the failure of one component cannot bring down the entire system.
This is why modern ships are designed to have multiple internal compartments, with mechanisms to try to make each compartment watertight. That way, if the ship’s hull is breached, water cannot flood the entire ship.
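In software, the watertight-compartment idea can be sketched as follows. This is a minimal illustration under stated assumptions: the component names are hypothetical, and catching errors inside one program is a much weaker boundary than the separate processes or privilege levels a real system would use. It is not a description of how Falcon or Windows work.

```python
def run_isolated(components):
    """Run each component in its own 'compartment'.

    A failure is recorded for that component only; the others still run.
    """
    results = {}
    for name, task in components.items():
        try:
            results[name] = ("ok", task())
        except Exception as exc:  # contain the fault at the boundary
            results[name] = ("failed", repr(exc))
    return results


def run_entangled(components):
    """Run components with no boundary: the first failure stops everything."""
    return {name: ("ok", task()) for name, task in components.items()}


# Hypothetical components, invented for illustration.
def security_scan():
    raise RuntimeError("bad update")


def file_sharing():
    return "files served"


if __name__ == "__main__":
    parts = {"security_scan": security_scan, "file_sharing": file_sharing}
    print(run_isolated(parts))
    try:
        run_entangled(parts)
    except RuntimeError as exc:
        print("whole system failed:", exc)
```

With the boundary in place, the faulty component is reported as failed while file sharing carries on. Without it, one bad update stops everything.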
Meta’s outage is a timely reminder of the need to engineer critical systems to maximise their reliability, including minimising central points of failure and employing engineering principles like fault isolation.
Looking ahead
For now, the precise cause of Meta’s outage remains to be determined.
Many people all over the world rely on Meta’s services. They include businesses that use Instagram as their primary platform for engaging customers online and merchants that rely on Facebook Marketplace as a key revenue stream.
For many families, WhatsApp has become an indispensable way to keep in contact, especially during times of crisis.
One can only hope Meta will be forthcoming about the causes of this outage and the measures it will put in place to make sure it cannot happen again.