Facebook Outage Emphasizes Importance of Change Observability

Introduction

On October 4th, 2021, Facebook's digital properties were unavailable for six hours. This culturally notable event reminds all networking professionals how often configuration changes and commands are the root cause of network downtime or degraded performance. The interplay of operations events, router security, and physical security during the outage is also a reminder of the increasing complexity of network operations, and of the need to correlate multiple data sources to truly understand any significant issue.

In Facebook’s own diagnosis, they said: “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network.” Source: Facebook, “More details about the October 4 outage”, October 5th, 2021.

While Facebook’s outage was caused by a command, other major outages have been caused by configuration changes. For example, Cloudflare wrote of its July 2020 outage: “Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.” Source: Cloudflare, “Cloudflare outage on July 17, 2020”.

Configuration and Command Tracking

While events such as Facebook’s outage make lead stories at news outlets, not all changes are so visible. Some changes cause subtle or limited disruption, and many of these may never be tracked or receive the root cause analysis that a major outage gets. Observing change through configuration tracking, command tracking, and anomaly correlation can expose many untracked and unnoticed anomalies. Change observability creates change timelines, correlates those changes with other operations data, and provides powerful search.
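As a rough illustration of what a change timeline with search might look like, the following Python sketch diffs successive configuration snapshots and stores only the changed lines. The names `ChangeEvent`, `ChangeTimeline`, and `diff_configs` are hypothetical and chosen for this example; they do not describe any particular product's implementation.

```python
import difflib
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChangeEvent:
    device: str
    timestamp: datetime
    diff: list[str]  # changed lines only, prefixed with "+ " or "- "


def diff_configs(before: str, after: str) -> list[str]:
    """Return the lines added to or removed from a configuration snapshot."""
    delta = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in delta if line.startswith(("+ ", "- "))]


@dataclass
class ChangeTimeline:
    """Hypothetical in-memory timeline of configuration changes."""
    events: list[ChangeEvent] = field(default_factory=list)

    def record(self, device: str, timestamp: datetime,
               before: str, after: str) -> None:
        """Store an event only if the two snapshots actually differ."""
        diff = diff_configs(before, after)
        if diff:
            self.events.append(ChangeEvent(device, timestamp, diff))
            self.events.sort(key=lambda event: event.timestamp)

    def search(self, term: str) -> list[ChangeEvent]:
        """Return events whose diff mentions the given term."""
        return [
            event for event in self.events
            if any(term in line for line in event.diff)
        ]
```

Calling `record` each time a snapshot is collected builds the timeline, and `search("mtu")` would then return every recorded change that touched an MTU setting. A production system would persist events and index them, which this sketch leaves out.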

Network Orchestration and Automation

As network automation and orchestration become pervasive, the frequency and volume of change are increasing, and with them the need to track configuration changes, commands, and the current active configuration. Network operations leaders are now asking: “What changes are the orchestration and automation systems making?”
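One simple way to start answering that question is to group logged changes by who or what issued them. The sketch below assumes each audit record is a dictionary with an `actor` field and that automation systems log in under known service accounts. The account names, the record layout, and the function `changes_by_source` are all assumptions made for illustration.

```python
# Hypothetical service accounts used by orchestration and automation systems.
AUTOMATION_ACTORS = frozenset({"svc-orchestrator", "svc-ansible"})


def changes_by_source(records, automation_actors=AUTOMATION_ACTORS):
    """Split audit records into those issued by automation and by humans."""
    grouped = {"automation": [], "human": []}
    for record in records:
        if record["actor"] in automation_actors:
            source = "automation"
        else:
            source = "human"
        grouped[source].append(record)
    return grouped
```

Reviewing `changes_by_source(records)["automation"]` gives operators a direct list of what the automation layer changed, which can then be compared with what it was expected to change.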

Configuration Changes and Cyber Attacks

While major outages are not always caused by cyber attacks, and Facebook has stated theirs was not, a malicious actor, internal or external, can still disrupt an operation through a configuration change or command. An attacker could also alter configuration in ways that make later, more damaging attacks easier. This is another reason why configuration change and command tracking is important.

Increasing Complexity

A configuration change or a command may not have an immediate impact on operational KPIs. The ability to correlate changes with other data is growing in importance as complexity increases. The subtle interaction across data, systems, and time requires a new approach to anomaly detection and ranking. If two or more events are correlated in a data forest and no one notices… then nothing can be done about it.
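Because the impact of a change can be delayed, one basic correlation approach is to look back from an anomaly over a time window and rank the changes that preceded it. The Python sketch below does this using time proximity alone. The `Event` type, the function `rank_candidate_changes`, and the 24-hour default window are hypothetical choices for this example, and closeness in time is only a heuristic, not proof of cause.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    name: str
    timestamp: datetime


def rank_candidate_changes(anomaly, changes, lookback=timedelta(hours=24)):
    """Return (change, lag) pairs for changes made before the anomaly
    and within the lookback window, ordered from smallest lag to largest."""
    candidates = []
    for change in changes:
        lag = anomaly.timestamp - change.timestamp
        # Keep only changes at or before the anomaly, inside the window.
        if timedelta(0) <= lag <= lookback:
            candidates.append((change, lag))
    candidates.sort(key=lambda pair: pair[1])
    return candidates
```

A real ranking would likely weigh more than elapsed time, such as whether the change and the anomaly involve the same device or service, but the windowed lookback shows the core idea of tying a KPI shift to earlier changes.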

Conclusion

Configuration and command tracking is crucial for several reasons: large, visible outages, the many anomalies that go undetected, and network security. Selector is the leading supplier of usable, multivendor, network-focused AIOps including capabilities such as CMDB, configuration tracking, configuration timelines, configuration search, command tracking, and correlated anomaly detection and ranking. For more information on Selector Change Observability, see Selector AIOps : Configuration Observability.
