Introducing Supergiant: Datacenter Total Control System

Posted by Ben Hundley on April 14, 2016

At long last, we present Supergiant.


Supergiant is an application platform for 2016. It's sexy and it's powerful. It's the excitement we had for application platforms back in 2008 -- but without all the crushing disappointment and stifling constraint. 

It's production-grade Docker containers on which you can actually run stateful, clustered datastores. It's portability and it's immutability, and it's made by hillbillies. It's sweet, sweet medicine for the large majority of your ailments (* disclaimer to follow).

Who are we? We are the team behind Qbox.io. Supergiant was built from our blood, sweat, and, primarily, our tears, while we were trying to orchestrate thousands of Elasticsearch nodes in the most performant, stable, and low-cost way possible.


Software deployed on virtual machines is not portable, and that's a big problem. 

It's a problem because servers and VMs fail unexpectedly, and replacing them is not a quick process. They are fragile giants, with bespoke configurations and interwoven processes, yet we gamble entire businesses on the ability to modify them while live. Deployments are an anxiety-ridden affair for many.

Docker is a hot topic because it's a solution to this problem ... partially. It combines code and its host configuration, which are often physically separate entities despite their inherent logical coupling. In other words, it allows developers to declare server configuration right in the relevant code repo, which dramatically reduces unexpected disparity between development and production environments.

But Dockerizing your applications doesn't solve all your production problems. Docker is software and (at the risk of stating the obvious) requires a server and operating system just like your application.

The critical difference is the indirection it provides between your application and servers. Instead of code relying directly on an underlying VM, it relies on a consistent pre-built image with all the necessary dependencies and configuration. A container is produced from the image, which can then move freely among any number of VMs running the Docker engine.
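
To make that concrete, here's a minimal sketch of the image-to-container workflow using the Docker SDK for Python. The base image, tag, and container command are placeholders we made up for illustration, not anything from our actual stack:

    # Minimal sketch of the image -> container workflow, using the Docker
    # SDK for Python (pip install docker). The tag, base image, and command
    # below are made-up placeholders.
    import io
    import docker

    # The "server configuration" lives alongside the code: the base image and
    # start command are declared in one small file. (In a real project the
    # Dockerfile sits in the repo itself and you'd build with path=".".)
    dockerfile = io.BytesIO(
        b'FROM python:3-slim\n'
        b'CMD ["python", "-c", "print(\'same environment on any host\')"]\n'
    )

    client = docker.from_env()

    # Bake the image once...
    image, _ = client.images.build(fileobj=dockerfile, tag="demo-app:1.0")

    # ...then run identical containers from it on any VM with a Docker engine.
    container = client.containers.run("demo-app:1.0", detach=True)
    container.wait()
    print(container.logs().decode())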

The Docker approach improves deployment and restart time, and it restricts disastrous server modifications.  Containers can be quickly replaced, so deployments do not hinge on modifying live configurations. Also, multiple containers can reside on one VM, which can greatly lower infrastructure costs.

Still, there exists the problem of deploying the containers themselves to the host machines (AKA container orchestration). That's where Kubernetes comes in.

When Kubernetes "clicked" for us, it was like a warm fuzzy punch in the face. Imagine living under a rock for 3 years and building disaster recovery systems for large-scale, mission-critical databases. The rock you're imagining here is the one Qbox emerged from not long ago...


Growing Pains

Qbox started in April 2013 with a simple idea. We wanted to host Elasticsearch for application developers and focus solely on the ops aspect. The original implementation was naive, at best. At worst, it was an absolute nightmare.

[Image: Qbox v1]

We stood up servers with external SSDs running Elasticsearch on Rackspace, and on each of those we ran our code that handled API tokens, rate limiting, logging, etc. In other words, they were multi-tenanted clusters with rigidly whitelisted routes for security.

At first, performance was great, but margins were terrible. After a few months, we had stuffed maybe 20 active users onto the biggest cluster, which was costing us $1,600 and generating about $400 each month. We eventually gained a customer who selected the largest usage tier we offered and then proceeded to beat the living crap out of the aforementioned cluster.

Amid the blinding spew of meaningless logs, we discovered the infamous "noisy neighbor" effect, although it was actually less about "noise." It wasn't that requests were slow; it was that requests were dropped entirely for hours at a time in the middle of the night due to timeouts, while we frantically answered emails and prayed to anything listening. No amount of added scale could help the situation, and with such painful margins we couldn't have afforded it anyway.

The 2nd iteration of Qbox was an entirely new codebase with an entirely new approach. We wanted to support multiple clouds, namely AWS, and let users hand-select certain instance types. Users were no longer granted access to a cluster with an API token but instead had a form to configure single-tenant, multi-node clusters running on isolated virtual machines in any region.

[Image: Qbox v2]

Qbox v2 was, somewhat surprisingly, a huge success overnight (relative to a very low starting bar, of course).

The request load increased by one order of magnitude, maybe two. But we could finally sleep at night because there was no single point of failure.

Bottlenecks

Fast forward two years, and we were once again unable to sleep at night: we had four engineers replacing dead nodes and answering support tickets at all hours of the day, every day.

At that point, we concluded it was just the nature of the cloud-hosting beast and that there was no escape. We wrote every possible recovery system we could think of, but issues still occurred.

What made matters worse was the volume of resources allocated compared to the usage. We had thousands of servers with a collective CPU utilization under 30%. We were spending a significant chunk of cheese on processors that were sitting there doing absolutely nothing.

Enter Docker. Our team avoided Docker for a while, probably on the vague assumption that the network and disk performance we had with VMs wouldn't be possible with containers. That assumption turned out to be entirely wrong.

To run performance tests, we had to find a system that could manage networked containers and volumes. That's when we discovered Kubernetes. It was alien to us at first, but by the time we had familiarized ourselves and built a performance-testing tool, we were sold. Not only was performance as good as with our previous VM model, it was possible to do even better.

The performance improvement we observed was due to how many containers we could “pack” on a single machine. Ironically, we began the Docker experiment wanting to avoid “noisy neighbor,” which we assumed was inevitable when several containers shared the same VM. (After all, isolating users to their own infrastructure had been the catalyst for the success of Qbox v2.)

However, that isolation also acted as a bottleneck, both in terms of performance and cost. A fundamental constraint of VMs is that they are a finite resource: if a machine has 2 cores and you need 3, there's a problem. A typical solution is to buy 4 cores (since it's rare to come across 3) and not utilize them fully.

(Remember that the only good thing about Qbox v1, with its big shared clusters, before it came crashing down had been that users had "wiggle room." That is, users were placed on host machines that had more resources than they were requesting. On good days, that meant spare capacity: users who were underutilizing had resources an overutilizer could… utilize. It's probably obvious what happened on bad days: if enough users were overutilizing, the cluster died, causing sweeping downtime.)

This is where Kubernetes really starts to shine. It has the concept of requests and limits, which provides granular control over resource sharing. Multiple containers can share an underlying host VM without the fear of “noisy neighbors”. They can request exclusive control over an amount of RAM, for example, and they can define a limit in anticipation of overflow. It’s practical, performant, and cost-effective multi-tenancy.
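
Here's a rough sketch of what that looks like in a pod spec, built as a plain Python dict and dumped to YAML. The pod name, image, and numbers are illustrative, not our actual settings:

    # Sketch of a Kubernetes pod manifest with resource requests and limits.
    # Built as a Python dict and dumped to YAML (pip install pyyaml); the
    # pod name, image, and numbers below are illustrative only.
    import yaml

    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "es-node-0"},
        "spec": {
            "containers": [{
                "name": "elasticsearch",
                "image": "elasticsearch:2.3",
                "resources": {
                    # The scheduler reserves this much capacity on whatever
                    # host VM the container lands on...
                    "requests": {"cpu": "1", "memory": "4Gi"},
                    # ...and this is the hard ceiling: CPU gets throttled and
                    # the container is OOM-killed past the memory limit, so a
                    # busy tenant can't starve its neighbors on the same VM.
                    "limits": {"cpu": "2", "memory": "6Gi"},
                },
            }],
        },
    }

    # Pipe the output to `kubectl apply -f -` to schedule the pod.
    print(yaml.safe_dump(pod, sort_keys=False))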


Multi-tenancy is at the heart of Supergiant. The word carries a negative connotation, to be sure, but in the context of a single user or organization, multi-tenancy means affordable scale and quick failover.

Supergiant takes that core concept and runs with it.

Over the past several months, we've fused Kubernetes with the last three years' worth of our cloud experience and produced an open-source solution with:

  • Automated server management / capacity control
  • Sharable load balancers
  • Volume management, resizing, backups
  • Extensible deployments
  • Resource monitoring
  • Seriously cool and spacey UI


Supergiant is an active work in progress, so that list will be growing quickly over the next few months. But it’s already being used in production -- with a major impact.

In early February, Qbox discontinued its VM-based offering on AWS and started offering clusters exclusively on Supergiant. Our support engineers are sleeping again -- for real. Our volume of users has continued to increase, and all the while the stream of support tickets has slowed to a trickle. Our users are getting twice the stability and performance at half the price.


Supergiant was Qbox’s "Hail Mary." We were fed up with the cloud’s bullshit. We were done managing servers and focusing on disaster recovery. We wanted to build things again. We wanted enterprise scale and stability without all the dirty work. We’re software engineers. (We’re lazy.)

Supergiant isn’t your new cloud control center ... it’s your new cloud recliner ... your celestial chariot.

It's free, and it's open source. Come and get it.
