Take a quick second to think about something right now: Can you guarantee beyond any reasonable doubt that the people using your web or online services will never experience downtime for more than a few minutes?
What do you think happens when service outages plague a system? If your system is a consumer-facing one, customers move elsewhere, or simply complain because of inaccessibility and poor service.
If it’s a service used for internal business processes, your employees lose productivity, or work simply stops.
This is especially true of time-sensitive services such as online banking, stock trading and the like, which require a strong and predictable level of reliability.
Any crippling downtime or delay can result in lost revenues, reduced output, and decreased productivity.
Enterprises that want to scale or online businesses that are already gaining traction will have to implement strategies for ensuring uptime, thereby also ensuring customer satisfaction.
In both the consumer-facing and B2B environment, one of the best ways to combat these issues is by employing high availability strategies, which keep services running at availability levels very close to 100%.
How to achieve high availability
High availability, often associated with the “five 9s” (99.999% uptime in any given calendar year), refers to a system design and implementation that ensures an agreed level of operational performance, usually uptime, during a given period.
It requires elimination of single points of failure, reliable crossovers, and detecting potential failures in advance.
This type of system all but eliminates the chance that transactions won’t be completed due to traffic spikes or infrastructure overload.
In industries such as financial and banking services, or any industry that employs mission-critical applications, this will ensure that service-level agreements (SLAs) are adequately met.
Compliance with SLAs will mean happy customers and less likelihood of penalties or losses associated with failure to meet such performance targets.
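To make the “five 9s” figure concrete, the following minimal sketch converts an availability target into a yearly downtime budget. The function name is illustrative, and the calculation assumes a 365-day calendar year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap calendar year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999%, the budget works out to a little over five minutes per year, which is why even short outages matter when an SLA is written around that target.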
Here are some of the things you can do to maintain high availability:
1. Use network clustering
High-availability clustering ensures that if an application crashes on the service level, there will be another “failover” server to pick up the slack and continue providing services.
This can involve provisioning failover clusters with on-premises server infrastructure. With a cloud approach, examples include establishing HBase clusters on Windows Azure or using Cluster Compute Instances on AWS.
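The failover idea can be sketched in a few lines. This is only an illustration: the node names are hypothetical, and the health check is passed in as a plain function, whereas a real cluster would probe each node over the network and rely on its clustering software to switch over.

```python
from typing import Callable, Sequence


def pick_active_server(servers: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first server, in priority order, that passes its health check."""
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("no healthy server available")


# Hypothetical scenario: the primary node has crashed, so the standby takes over.
down = {"primary"}
active = pick_active_server(["primary", "standby"], lambda name: name not in down)
print(active)
```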
2. Make strong infrastructure choices
Before you purchase or rent a server, make sure that it has enough capacity to serve you for a long time without having to upgrade in the short term.
Upgrades require downtime.
3. Scale OUT your infrastructure instead of scaling up
Large enterprises, rather than improving the capacity of their existing servers, expand their network of servers by adding new ones. This helps divide the load on your infrastructure quickly and easily.
This is how large-scale services like Facebook, for instance, manage their more than a billion daily users: through a combination of distributed infrastructure and load balancing.
4. Use load balancing
Your servers are each equipped with a certain amount of computing power and bandwidth. Load balancing allows you to distribute network traffic to servers that have more available resources to spare.
This involves various architectures and service layers, including on-premises appliances, cloud-based services, and purely DNS- or software-based approaches.
There is no one-size-fits-all approach to load balancing, although this resource on load balancing choices identifies various approaches and strategies that are applicable to enterprises of all sizes.
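As one illustration of sending traffic to the servers with the most resources to spare, here is a minimal sketch of a least-connections strategy. It is only one of many possible approaches. The server names and connection counts are made up, and the sketch ignores that connections also close over time.

```python
from typing import Dict, Mapping


def least_loaded(active_connections: Mapping[str, int]) -> str:
    """Return the server that currently has the fewest active connections."""
    if not active_connections:
        raise ValueError("no servers configured")
    return min(active_connections, key=lambda name: active_connections[name])


def route_requests(servers: Mapping[str, int], request_count: int) -> Dict[str, int]:
    """Assign each new request to the least-loaded server and return the final load."""
    load = dict(servers)  # copy so the caller's mapping is left untouched
    for _ in range(request_count):
        load[least_loaded(load)] += 1
    return load


# Hypothetical starting point: three servers with uneven load.
final_load = route_requests({"web-1": 5, "web-2": 1, "web-3": 3}, request_count=6)
print(final_load)
```

The six new requests all go to the two lighter servers, so the load evens out instead of piling onto a machine that is already busy.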
5. Learn your RTO/RPO
The recovery time objective (RTO) is the maximum amount of time your service can stay down before your business can no longer function.
The recovery point objective (RPO) is the maximum amount of data, measured in time, that you can afford to lose. This determines how out-of-date your data will be by the time you manage to flip the switch back on.
Both of these values can help you construct a contingency plan that ensures you do not take an excessive amount of time, or lose an excessive amount of data, when recovering from catastrophic outages.
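A contingency plan can be checked against these two values with simple arithmetic. The sketch below assumes that the worst-case data loss equals the interval between backups and that you have a measured or estimated recovery time. The function name and all figures are illustrative.

```python
from datetime import timedelta
from typing import Dict


def meets_objectives(
    rto: timedelta,
    rpo: timedelta,
    measured_recovery: timedelta,
    backup_interval: timedelta,
) -> Dict[str, bool]:
    """Compare a recovery plan against its RTO and RPO targets."""
    return {
        # Recovery must finish within the recovery time objective.
        "rto_met": measured_recovery <= rto,
        # Assumption: worst-case data loss is one full backup interval.
        "rpo_met": backup_interval <= rpo,
    }


result = meets_objectives(
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
    measured_recovery=timedelta(minutes=45),
    backup_interval=timedelta(minutes=30),
)
print(result)
```

In this made-up case the service comes back in time, but backups every 30 minutes cannot satisfy a 15-minute recovery point, so the backup schedule would need to tighten.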
6. Test your recovery plan
How do you know your plan will work? Have you tried it? Run a small-scale copy of your service locally and simulate a failure, then put your recovery plan into action.
How long did it take? Did you meet your RTO? A third-party approach may be necessary, such as the disaster recovery testing solutions by PlatformLab or RES-Q.
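A recovery drill like the one described above can be timed with a small harness. This is a rough sketch, not a substitute for a real disaster recovery test: the three callbacks are hypothetical hooks you would supply for your own service, and the dictionary stands in for a small-scale local copy.

```python
import time
from typing import Callable, Tuple


def run_recovery_drill(
    simulate_failure: Callable[[], None],
    recover: Callable[[], None],
    is_healthy: Callable[[], bool],
    rto_seconds: float,
    poll_seconds: float = 0.01,
) -> Tuple[float, bool]:
    """Break the service, run the recovery plan, and time how long recovery takes."""
    simulate_failure()
    started = time.monotonic()
    recover()
    # Wait until the service reports healthy or the RTO has been exceeded.
    while not is_healthy() and time.monotonic() - started <= rto_seconds:
        time.sleep(poll_seconds)
    elapsed = time.monotonic() - started
    return elapsed, is_healthy() and elapsed <= rto_seconds


# Stand-in for a local, small-scale copy of the service (hypothetical).
service = {"up": True}
elapsed, met_rto = run_recovery_drill(
    simulate_failure=lambda: service.update(up=False),
    recover=lambda: service.update(up=True),
    is_healthy=lambda: service["up"],
    rto_seconds=5.0,
)
print(f"recovered in {elapsed:.3f}s, RTO met: {met_rto}")
```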
Maintaining high availability at all times reduces your risk of lost revenue, dwindling customer bases, and lower conversion. Even a downtime of a few minutes can cost upwards of a few thousand dollars.
Mitigating this risk not only spares your users inconvenience, but also builds a stronger reputation for your capability as a business or service provider.