In December, in the middle of the busy festive shopping season, consumers, particularly on the US East Coast, found themselves locked out of Amazon when its cloud computing arm, AWS, suffered an outage. Unsurprisingly, Prime Video was unavailable too, as were Netflix and Disney+.
Outages of this kind are hitting even the most critical and sophisticated infrastructure providers, such as AWS, and over the past year they have become more frequent. In July, an Akamai edge DNS outage took huge companies offline, including Sony and British Airways. This came hot on the heels of a Fastly CDN outage and, before that, a minor incident with Cloudflare CDN.
While this might be annoying for consumers, it’s much more concerning for the large corporate customers, many of them telecoms companies, who depend on these providers and stand to lose potentially millions of pounds, and their reputations, when their services cannot be delivered.
Outages happen for a multitude of reasons. In the recent case of AWS, underlying network issues resulted in a complex set of failures. In the summer, a configuration update at Akamai triggered a DNS bug that brought its edge DNS service down. Other companies suffer outages when their DNS servers are hit with DDoS attacks that overwhelm them with so many queries that they are unable to respond to legitimate requests.
Building infrastructure resiliency
Uncomfortable as it is, outages are a fact of life, and the only way to combat the risk they pose is to build infrastructure resiliency strategies. One solution is to adopt a multi-cloud and multi-CDN approach, signing up with multiple providers so that if one goes down, another can take over. This is especially effective for companies with a worldwide audience, particularly if that audience is spread across several disparate geographic regions. It eliminates the risks associated with single points of failure that can take systems and services offline.
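The failover idea behind a multi-provider setup can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the `Provider` type, the `pick_provider` function and any provider names are hypothetical, and the sketch assumes the `healthy` flag is supplied by whatever monitoring is already in place.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Provider:
    """A cloud or CDN provider as seen by the failover logic (hypothetical type)."""
    name: str
    healthy: bool  # assumed to come from an external health check


def pick_provider(providers: Sequence[Provider]) -> Optional[Provider]:
    """Return the first healthy provider in priority order.

    The sequence is assumed to be ordered from most to least preferred.
    Returns None when every provider is reported as unhealthy.
    """
    for provider in providers:
        if provider.healthy:
            return provider
    return None
```

Called with a list in which the primary is marked unhealthy and the secondary healthy, `pick_provider` returns the secondary. A simple priority order is only one option; the same structure could equally rank providers by cost or by regional performance.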
Achieving global performance is one consideration, but telcos, who are as dependent on providers for their applications as internet companies are, also need to address their redundancy needs. If they are using a secondary provider, they must ensure that the provider can guarantee their applications will run if their primary provider experiences an outage. It might come at a cost, but if it ensures reliability across all their regions, then it will deliver peace of mind.
Implementing redundancy at every layer of infrastructure also necessitates having observability of application delivery performance and investing in tools that provide the ability to pivot quickly if a cloud service provider or CDN does not perform as it should.
The argument for automation
Telcos themselves are providers and as such can create outages, so they need to use automation tools and strategies when deploying new applications and provisioning new infrastructure to properly mitigate risk. Today’s tooling for network automation may not be as advanced as compute or storage system automation, but with its ability to reduce the risk of human errors and misconfiguration and create more IT stability, telcos should be looking to adopt network automation with some urgency.
Dynamic traffic steering
If downtime and service impact are not acceptable at any time, or in any circumstance, network and application teams have multiple tools at their disposal to establish dynamic steering policies and adjust capacity thresholds to accommodate fluctuating application usage where necessary. If they do the work upfront to prepare playbooks and utilise those tools, workloads can automatically be shifted to available resources in the event of an outage. It is also important to ensure monitoring and observability tools are appropriately calibrated and based on the real-time conditions experienced by end users, so that any issues can be quickly identified.
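As a rough sketch of what a dynamic steering policy with a capacity threshold might look like, the Python below shares traffic among endpoints in proportion to their remaining headroom. Everything here is an assumption made for illustration: the `Endpoint` type, the `steering_weights` function, the default threshold of 0.8 and the idea that utilisation arrives as a fraction between 0 and 1 from real-time monitoring. Real traffic steering products will expose their own policy models.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Endpoint:
    """A delivery endpoint such as a region or CDN (hypothetical type)."""
    name: str
    healthy: bool
    utilisation: float  # fraction of capacity in use, 0.0 to 1.0 (assumed input)


def steering_weights(
    endpoints: Sequence[Endpoint], threshold: float = 0.8
) -> Dict[str, float]:
    """Share traffic in proportion to each endpoint's headroom below the threshold.

    Unhealthy endpoints, and endpoints at or above the capacity threshold,
    receive a weight of zero. If no endpoint has headroom, every weight is
    zero and the caller must decide what to do (for example, raise an alert).
    """
    headroom = {
        e.name: max(threshold - e.utilisation, 0.0) if e.healthy else 0.0
        for e in endpoints
    }
    total = sum(headroom.values())
    if total == 0:
        return {name: 0.0 for name in headroom}
    return {name: h / total for name, h in headroom.items()}
```

When an endpoint is marked unhealthy, its share drops to zero and the remaining endpoints absorb the traffic according to how much spare capacity each has, which is the automatic shift of workloads described above. Weighting by headroom is one possible design choice among several; a playbook might instead prefer latency or geography.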
Outages are likely to become more, not less, frequent, and it is incumbent on companies to build infrastructure resiliency strategies that address this now urgent issue. The aim should be to bring together redundant infrastructure, appropriate configurations, and dynamic traffic steering to ensure that telcos – and their customers – are not impacted by a provider outage.