Google Traces Sunday's Cloud Outage to Faulty Patch
Patch reportedly caused network egress issues, leading to second Compute Engine outage in less than one month
March 9, 2015
Google Compute Engine, the company's Infrastructure-as-a-Service cloud, suffered its second outage in less than one month's time. While not as serious as the Google cloud outage a few weeks ago, the network was again the culprit. Some Google cloud users experienced disruptions for 45 minutes beginning Sunday around 10 a.m. PST.
Google identified a patch problem as the culprit for network egress issues that caused the cloud outage for some users. The configuration change was tested prior to deployment to production, but it still had a negative impact on some VMs when made live.
The configuration change introduced to the network stack was designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM, according to Google.
It was a partial outage. Some users weren’t impacted, some saw slowdown, while some were experiencing timeouts when trying to contact their cloud VMs.
Google engineers are changing the protocol in response to the latest outage. The rollout protocol for network configuration has been changed, so future production changes will be applied incrementally across small fractions of VMs at a time, reducing the exposure if something unpredictable occurs.
The test suite that gave the a-O.K. signal will be modified in response to the incident as well.
"Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident," said the company in a statement.
Last month, a network issue led to loss of connectivity to multiple zones. That cloud outage lasted roughly an hour.
Google’s IaaS cloud had a total of 4.5 hours of downtime last year across more than 70 outages, according to CloudHarmony.
Read more about:
Google AlphabetAbout the Author
You May Also Like