Insight and analysis on the data center space from industry thought leaders.
Assuring the Availability Your Applications Need in the Public Cloud
Definitions for “downtime” and “unavailable” in the SLAs exclude many of the reasons applications fail.
October 24, 2018
Heath Carroll is Senior Software Engineer at SIOS Technology.
Caveat emptor is good advice in the cloud. All Cloud Service Providers (CSPs) promise to deliver “high availability” across their infrastructures, and their Service Level Agreements (SLAs) generally offer from 95 percent to 99.99 percent uptime, with refunds of monthly service fees ranging from 10 percent to 50 percent for not meeting those thresholds. But as with many aspects of IT, the devil is in the details.
With the right approach, it is possible to achieve five-9’s (99.999 percent) high availability (HA) in public and hybrid cloud environments using Amazon Web Services, the Google Cloud Platform and Microsoft Azure. Doing so requires understanding both the limitations in the SLAs and the options for creating highly available configurations.
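To make these percentages concrete, the short sketch below converts an uptime percentage into the downtime it permits per year. The function name and the 365-day year are assumptions chosen for illustration; individual SLAs typically measure uptime per monthly billing cycle and define the period in their own terms.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) that an uptime percentage permits."""
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Illustrative thresholds: the low and high ends of the SLA range, plus five-9's.
    for pct in (95.0, 99.99, 99.999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime allows about {minutes:,.2f} minutes of downtime per year")
```

By this arithmetic, 99.99 percent allows roughly 52.6 minutes of downtime per year, while five-9’s allows only about 5.3 minutes, which leaves very little room for any failure that takes more than a few minutes to detect and recover from.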
High-availability Limitations
Most cloud service providers offer SLAs with a 99.99 percent uptime guarantee, and redundant configurations that span a CSP’s zones and/or regions increase confidence in getting satisfactory availability. But there is a serious problem with this arrangement because the definitions for “downtime” and “unavailable” in the SLAs exclude many of the reasons applications fail.
Among the potential causes of downtime not counted are the customer’s software, any third-party software or technology, planned hardware and software maintenance, and some problems with individual instances or volumes that fall outside the SLA’s specific definitions of unavailability. Also excluded are faulty inputs or instructions, or a lack of action when required, which would seem to cover “human error” as a possible cause.
It is reasonable for CSPs to exclude some causes of failure, but it would be irresponsible for system administrators to use these exclusions as excuses. This makes it necessary to ensure application-level availability by some other means.
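The gap between what an SLA measures and what users experience can be sketched with a small calculation. Everything in the example below is hypothetical: the outage list, the durations and the `counted_by_sla` flag are invented for illustration and do not come from any provider’s actual SLA.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


@dataclass
class Outage:
    cause: str
    minutes: float
    counted_by_sla: bool  # hypothetical flag: does the SLA's definition cover this cause?


def availability_percent(downtime_minutes: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Convert downtime over a period into an availability percentage."""
    return 100 * (1 - downtime_minutes / period_minutes)


def sla_vs_actual(outages):
    """Return (availability as the SLA counts it, availability as users see it)."""
    counted = sum(o.minutes for o in outages if o.counted_by_sla)
    total = sum(o.minutes for o in outages)
    return availability_percent(counted), availability_percent(total)


# Hypothetical month: one infrastructure outage plus three excluded causes.
example_outages = [
    Outage("infrastructure outage", 4, counted_by_sla=True),
    Outage("planned maintenance", 30, counted_by_sla=False),
    Outage("application crash", 20, counted_by_sla=False),
    Outage("operator error", 15, counted_by_sla=False),
]

if __name__ == "__main__":
    sla, actual = sla_vs_actual(example_outages)
    print(f"SLA-measured availability: {sla:.3f}%")
    print(f"Actual availability:       {actual:.3f}%")
```

In this made-up month the provider would still meet a 99.99 percent SLA, because only four minutes count, while the application itself would fall short of even three 9’s.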
Options for Achieving More 9’s
In general, there are three basic options for improving availability in the cloud: provisions in the application software, features built into the operating system, and purpose-built failover clustering.
Many applications offer their own HA provisions. A good example is the carrier-class Always On Availability Groups feature in the Enterprise edition of Microsoft’s SQL Server. The problem with this approach is the need for different HA provisions for different applications, which makes ongoing management a constant and costly struggle.
The second option involves using HA features integrated into the operating system. Windows Server has a native capability in Windows Server Failover Clustering (WSFC), but WSFC lacks data replication capabilities. In a private cloud, this gap is usually filled by some form of shared storage, such as a Storage Area Network (SAN). In the public cloud, however, shared storage is not available, so a separate data replication solution is needed.
On Linux the need for separate HA provisions is considerably greater due to the lack of a native feature like WSFC. Implementing HA therefore requires using open source software like Pacemaker and Corosync to create (and then maintain) custom scripts for each and every application, and only very large organizations have the means to take on the enormous, ongoing effort involved.
The third option is third-party failover clustering software purpose-built for providing a complete high availability and disaster recovery solution for any application running on either Windows or Linux across public, private and hybrid clouds.
These solutions combine, at a minimum, data replication, continuous application-level monitoring and configurable failover/failback recovery policies. Such integration enables the software to detect any and all downtime at the application level, regardless of the cause(s), including those not covered by the various cloud SLAs. Many solutions also offer advanced capabilities, such as support for WAN optimization to enhance performance, and manual switchover of primary and secondary server assignments to facilitate planned maintenance.
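The combination of application-level monitoring and a configurable failover policy can be illustrated with a minimal sketch. This is not how any particular clustering product works; the `FailoverMonitor` class, its `max_failures` threshold and the callback names are assumptions made up for this example.

```python
from typing import Callable


class FailoverMonitor:
    """Trigger a failover after a set number of consecutive failed health checks."""

    def __init__(self, check: Callable[[], bool],
                 on_failover: Callable[[], None],
                 max_failures: int = 3):
        self.check = check              # application-level probe, e.g. a test query
        self.on_failover = on_failover  # action to take when the policy trips
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self.failed_over = False

    def poll(self) -> None:
        """Run one health check; intended to be called on a fixed interval."""
        if self.failed_over:
            return
        try:
            healthy = self.check()
        except Exception:
            healthy = False  # a probe that raises is treated as a failed check
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.failed_over = True
            self.on_failover()


if __name__ == "__main__":
    # Simulated probe results stand in for a real application check.
    results = iter([True, False, True, False, False, False])
    monitor = FailoverMonitor(check=lambda: next(results),
                              on_failover=lambda: print("failing over"),
                              max_failures=3)
    for _ in range(6):
        monitor.poll()
```

Because the probe tests the application itself rather than the underlying instance, the cause of the failure does not matter to the policy. Requiring several consecutive failures is one simple way to avoid failing over on a single transient error.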
While these solutions can work with a SAN in private clouds, most administrators prefer deploying shared-nothing SANless failover clusters. The reasons for this include eliminating potential single points of failure, being able to work in the public cloud, and minimizing the Recovery Point Objective (RPO), Recovery Time Objective (RTO) and Mean Time To Recovery (MTTR).
Five-9’s Failover Cluster Configuration
The figure shows a three-node SANless failover cluster that affords five-9’s high availability, as well as robust disaster recovery protection in a hybrid cloud. The application is a database that uses Failover Cluster Instances (FCIs) in the Standard Edition of SQL Server. SQL1 and SQL2 are located in an enterprise data center with SQL3 in the public cloud. Within the data center, data replication across the LAN is synchronous to minimize the time it takes to complete a failover and, therefore, maximize availability.
This three-node SANless failover cluster is capable of handling two concurrent failures with minimal downtime and no data loss.
In this example, SQL1 is initially the primary or active instance that replicates data continuously to both SQL2 and SQL3. Should SQL1 fail, the application would automatically fail over to SQL2, which would then become the primary and replicate data to SQL3.
Once the problem is rectified, SQL1 could be restored as primary, or SQL2 could continue in that capacity, replicating data to SQL1 and SQL3. Should SQL2 fail before SQL1 is returned to operation, as shown, SQL3 would become the primary. A manual failover is recommended here to prevent data loss due to the higher latency inherent in the WAN link to the public cloud.
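The failover sequence described above can be traced with a small simulation. This is only a sketch under stated assumptions: the `Node` and `Cluster` classes are hypothetical, and the rule that a promotion across sites waits for operator approval is one way to model the manual failover recommended for the WAN link.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    site: str            # "datacenter" or "cloud"
    healthy: bool = True


@dataclass
class Cluster:
    nodes: list          # Node objects, ordered by failover priority
    primary: str = ""
    log: list = field(default_factory=list)

    def __post_init__(self):
        if not self.primary:
            self.primary = self.nodes[0].name

    def _node(self, name):
        return next(n for n in self.nodes if n.name == name)

    def fail(self, name):
        """Mark a node as failed; try an automatic failover if it was primary."""
        self._node(name).healthy = False
        if name == self.primary:
            self.promote(operator_approved=False)

    def promote(self, operator_approved=False):
        """Promote the highest-priority healthy node, gating cross-site moves."""
        current_site = self._node(self.primary).site
        for candidate in self.nodes:
            if not candidate.healthy:
                continue
            if candidate.site != current_site and not operator_approved:
                # Assumed policy: crossing the WAN requires a manual decision.
                self.log.append(f"{candidate.name}: waiting for manual failover")
                return
            self.primary = candidate.name
            self.log.append(f"{candidate.name}: promoted to primary")
            return
        self.log.append("no healthy node available")


if __name__ == "__main__":
    cluster = Cluster([Node("SQL1", "datacenter"),
                       Node("SQL2", "datacenter"),
                       Node("SQL3", "cloud")])
    cluster.fail("SQL1")                     # automatic failover within the LAN
    cluster.fail("SQL2")                     # second failure: cloud node must wait
    cluster.promote(operator_approved=True)  # operator confirms the WAN failover
    print("\n".join(cluster.log))
```

The first failure is handled automatically because SQL2 sits on the same LAN, while the second is held until an operator approves promoting SQL3, mirroring the two-failure scenario in the figure.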
Three-node clusters like this also facilitate planned hardware and software maintenance for all three servers while providing continuous disaster recovery protection for the application and its data. By making effective and efficient use of all resources in a way that is easy to implement and operate, failover clustering software makes five-9’s high availability more affordable, including in hybrid clouds.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating.