Why LuxSci Enterprise Class Servers Stay Up when Hardware Fails

February 15th, 2018

The server your email is hosted on had a power supply issue and now your email is down … and will remain down for a few hours until your provider repairs the hardware issue and gets you back online.

Downtime sucks and it can hurt your business. However, you can protect yourself from downtime due to hardware failure issues like this.

LuxSci offers two service options: Business Class and Enterprise Class. The most notable difference between these options is reliability. Enterprise Class services (both dedicated and shared) will keep running even if the underlying server hardware fails. How does that work?

Your Typical Dedicated or Shared Server

Your typical server these days is a virtual private server (i.e., a cloud server). Many of these servers share the same powerful physical machine … each using a “slice” of its CPU, RAM, and other resources. This is very efficient from a cost and performance perspective.

Your typical server from a Public Cloud provider, a Virtual Private Server provider, etc., is simply that … a slice of the RAM, CPU, and disk space on a single physical machine.

By the way, “Cloud” is just a term that means “someone else’s computer … somewhere”.

So, what happens when a CPU fails, or a RAM chip goes bad, or the motherboard has a short, etc.? All of the virtual servers that share that physical machine immediately crash and remain offline until the server provider can diagnose and repair the hardware issue and then reboot them all.

So, with any of these typical servers, which includes LuxSci’s “Business Class” services, there could be downtime for emergency maintenance in the event of a hardware failure. This is familiar to most people, as the same thing would happen if your server were not “virtual” and you had the entire physical server to yourself: if it develops certain hardware issues, it too will be down until they are resolved.

So a typical physical or virtual server is susceptible to unexpected downtime due to hardware failure. While this kind of downtime may be infrequent, it does happen and any services on this server that you rely on will be unavailable until the hardware issue is resolved.

LuxSci Enterprise Class Servers

LuxSci understands that some customers place a high value on service reliability and cannot tolerate unexpected service downtime. Other customers do not like the chance of downtime, but are willing to accept it instead of paying a premium to protect against it.

In order to ensure that LuxSci’s Enterprise Class servers are as reliable as possible, we use a different architecture to protect them from downtime due to hardware failure.

  1. Multiple powerful underlying physical servers (a.k.a. hypervisors) are grouped together.
  2. All disk storage resides on a physically separate, redundant disk array (a private, encrypted Storage Area Network).
  3. The disk array is attached with redundant fiber optic connections to each physical server.
  4. Special software running on this “cluster” of servers can detect hardware failure on one physical server and reboot any affected virtual servers on the other servers automatically and immediately.
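To make the automatic restart in the last step more concrete, here is a minimal Python sketch of the idea. It is a toy model, not LuxSci’s actual clustering software: the `Host` class, the `fail_over` function, the capacity units, and the “most free capacity first” placement rule are all assumptions made for illustration. The key point it shows is that because the disks live on the shared array, a virtual server can be restarted on any healthy physical server that has room for it.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A hypothetical physical server (hypervisor) in the cluster."""
    name: str
    capacity: int                      # illustrative units, e.g. GB of RAM
    healthy: bool = True
    vms: dict = field(default_factory=dict)   # VM name -> size in the same units

    def free(self) -> int:
        return self.capacity - sum(self.vms.values())


def fail_over(hosts, failed_name):
    """Mark one host as failed and restart its VMs on the remaining hosts.

    Returns a mapping of VM name -> name of the host it was restarted on.
    """
    failed = next(h for h in hosts if h.name == failed_name)
    failed.healthy = False
    moved = {}
    # Place the largest VMs first so they are least likely to be stranded.
    for vm, size in sorted(failed.vms.items(), key=lambda kv: -kv[1]):
        candidates = [h for h in hosts if h.healthy and h.free() >= size]
        if not candidates:
            raise RuntimeError(f"no spare capacity to restart {vm}")
        target = max(candidates, key=Host.free)   # assumed placement rule
        target.vms[vm] = size
        del failed.vms[vm]
        moved[vm] = target.name
    return moved


cluster = [
    Host("hv1", 64, vms={"mail-a": 16, "mail-b": 8}),
    Host("hv2", 64, vms={"web-a": 32}),
    Host("hv3", 64, vms={"db-a": 40}),
]
print(fail_over(cluster, "hv1"))
```

In this toy cluster, losing `hv1` causes its two virtual servers to be restarted on `hv2` and `hv3`. Real cluster software also has to detect the failure reliably and make sure the failed machine is truly offline before restarting anything, which this sketch leaves out.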

What does this mean?

In the case where we need to perform proactive software or hardware maintenance on a physical machine, we can move the virtual servers running on it to other physical machines with no downtime and no impact on the customers. It’s the “press of a button.”

In the case where one of the physical servers fails, all of the affected virtual servers are immediately restarted on a different physical server. This results in only a minute or so of downtime as the server is rebooted. The problem physical server itself can then be repaired without the time involved impacting any actual customers.

With this premium behavior, the worst-case scenario that can be caused by underlying hardware failure is a minute or so of downtime as your services are automatically restarted on a different machine. Most people will not even notice.

A Classic Case of “You Get What You Pay For”

In order to protect against hardware failure, LuxSci’s Enterprise Class servers are provisioned on a more complex and expensive hardware infrastructure that includes:

  1. Clustered underlying physical servers with extra unused capacity so that they can run all the virtual servers in the event that one of the physical servers is down or off
  2. Special external disk arrays that support very fast access from all of the physical servers in the cluster (as opposed to using inexpensive local storage)
  3. Very fast and reliable enterprise-grade hard drives
  4. Several extra “hot spare” hard drives
  5. Use of many small RAID arrays to maximize the speed of RAID rebuilds and to minimize the chance of multiple hard drive failures, though at the cost of needing more drives
  6. Redundant network and storage interfaces to prevent issues due to hardware failure at that level
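The “extra unused capacity” in the first item can be expressed as a simple check. The Python sketch below is an illustration only: the function name, the capacity units, and the example numbers are assumptions, not LuxSci’s real sizing figures. It asks whether the cluster could still hold every virtual server if its largest physical server were down.

```python
def survives_any_single_failure(host_capacities, total_vm_load):
    """Return True if the load still fits after losing the largest host.

    host_capacities: capacity of each physical server (hypothetical units)
    total_vm_load:   combined size of all virtual servers, in the same units
    """
    if len(host_capacities) < 2:
        return False   # a single machine has nowhere to fail over to
    worst_case_capacity = sum(host_capacities) - max(host_capacities)
    return total_vm_load <= worst_case_capacity


print(survives_any_single_failure([64, 64, 64], 120))   # load fits on two hosts
print(survives_any_single_failure([64, 64, 64], 140))   # load needs all three
```

This is a deliberately coarse test: it compares totals and ignores whether each individual virtual server fits on a particular remaining machine. It does show why such a cluster costs more, since roughly one machine’s worth of capacity has to sit idle in normal operation.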

As a result, the Enterprise Class environment costs more than our Business Class environment. Customers can decide for themselves how important ultra-reliability is and whether they are willing to pay extra for it … or whether cost is more important and they are happy using all of LuxSci’s services on a less expensive, less reliable server.

For those who want the “next level” beyond Enterprise, LuxSci can set up infrastructure options that span data centers. What is your level of risk?