How Qbox Saved 5 Figures per Month Using Supergiant

Several months ago, shortly after a fundraising round finally enabled us to undertake long-term thinking, I tasked our team with coming up with an infrastructure model that was better optimized, lower cost, more stable, and more performant: ultimately, the highest and best use of our favorite data exploration and analytics platform.

Our team suggested a containerization route, with Docker being the most appealing option.  However, we found that Docker can be cumbersome with stateful applications like Elasticsearch, Postgres, and MySQL.  Surveying published case studies, we also found that using Docker in a SaaS environment came with its share of performance and “noisy neighbor” problems.

We knew that other companies in our space were taking this approach, but we also knew that their performance was grossly sub-par compared to what we knew we could provide. In order to get the benefits of Docker without the problems, we would have to rethink stateful applications on Docker and rethink multi-tenancy for a very compute-intensive, distributed application like Elasticsearch.  

Kubernetes went part of the way, and our hand-rolled Supergiant did the rest.

My name is Mark Brandon, and I am the CEO and Co-Founder of Qbox.  For 3 years, our business has been providing hosted Elasticsearch, the very popular enterprise search engine and NoSQL database.

The Qbox team is also the creator of Supergiant, the data center management system based on Kubernetes.

The result of this project was a 50% decrease in our AWS bill, which had reached well into 6 figures per month.

This case study tells that story: both how we saved that massive amount of money and why we built Supergiant to go beyond Elasticsearch.

After taking stock of these incredible savings, we realized that Supergiant was perhaps not just the answer for us: it could also be a solution for our customers who want a private-cloud version of the Qbox service. In late February, as we prepared to release Qbox Private Cloud, we decided we shouldn’t stop at Elasticsearch: we should enable this kind of performance and savings for any containerized application.


Background

To tell this story, it might be useful to go back to our beginning.  Qbox was founded in 2012 as Stacksearch, making e-commerce product search components.

In early 2013, as Elasticsearch grew in popularity, our customers kept asking for the back-end goodness that powered our service.  After surveying the relative lack of competition, and noting the analogous success of providers of hosted MongoDB, hosted CouchDB, hosted Hadoop, hosted Redis, etc., we decided to pivot toward hosted Elasticsearch.  After first experimenting with a disastrous shared model -- and learning the hard way about noisy neighbors in an Elasticsearch neighborhood -- we settled on our dedicated-VM model in October 2013.

Demand exploded.  

The economic model then was cost-plus, a flat-out necessity given our status as a bootstrapped startup in flyover country, USA.  We didn’t spin up a VM for a customer on the corresponding infrastructure provider (AWS, Rackspace, SoftLayer, and later Azure) until we had the order.

Doing so provided two major advantages: (1) we didn’t have to invest a tremendous amount of money up front, before we had the usage to justify it; and (2) we could be in ANY data center where our infrastructure partners operated -- a number now approaching 50 data centers worldwide.  We found that customers wanted their Elasticsearch instances next door to their primary infrastructure, so it made sense to be everywhere.  When we did spin up a VM, we added a markup for our world-class support team, and that was that.

The Challenges of Public IaaS

The strain of this model became apparent as our customer list grew to hundreds, managing thousands of nodes.  

First, in the second half of 2015, I got a call from our AWS sales rep exclaiming that Qbox was the LEAST optimized customer in AWS’s entire TOLA territory.  For the uninitiated, that is a huge four-state US territory of 36 million people: Texas, Oklahoma, Louisiana, and Arkansas.  Despite the obvious sales pitch from our AWS rep for reserved instances, my initial response was something along the lines of, “meh… our customers reserve certain node sizes, and we pass along the costs.”

The second factor was much more existential.  New competition in the market -- notably from the very infrastructure providers with whom we had partnered -- required new thinking.  

When your competitors don’t need to sweat the cost of infrastructure, the game has changed.  At that point, we decided that, in order to become more price-competitive, we needed to optimize our infrastructure usage.

This was a tall order, given the specific needs of Elasticsearch, a compute-intensive application, where bulk indexing by any customer could, at any moment, result in the complete devouring of compute resources.  We already knew that resource sharing would not be an option.  (As I mentioned earlier, we had experimented with a shared model in the beginning, and it was disastrously bad.)

Like any SaaS company, the top 10% of our customers make our business worthwhile, and we have a long tail of customers who are valued, but small.  Qbox’s entry-level price point is $25/mo for our smallest node, and although every customer receives the same 24x7 uptime and availability service, in the old model we had hundreds -- literally, hundreds -- of the smallest instance size on AWS (t2.micro) running.  Also like any other SaaS business, we had many small customers who signed up with the best of intentions but whose projects were not urgent.  This was a waste of money for all parties and served only to enrich our IaaS partners.

Containerization

Shortly after we were able to raise funds in the private market, we had a plan to reverse course.  Now that we could invest in multi-tenant infrastructure, it was important to do so -- except that we still had the problem of Elasticsearch clusters not performing well in a multi-tenant environment.  

The solution was Kubernetes, the open-source container management platform donated to the Cloud Native Computing Foundation (a Linux Foundation project).  This open-source cluster manager makes it easier to run distributed applications -- like Elasticsearch, MongoDB, Cassandra, et al. -- in a multi-tenant environment.  Borrowing from Google’s Borg concept, it isolates resources at the process level.

As developers, we’ve all fallen in “like” with the concept of containerization, but plain containers are not ideal for distributed applications.  Creating an Elasticsearch container is relatively trivial.  Creating a cluster of Elasticsearch containers that work together and share resources -- without starving one another of crucial resource pools -- is hard.
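
To make that concrete, here is a minimal sketch of the isolation primitive Kubernetes provides: per-container CPU and memory requests and limits.  It assumes the official kubernetes Python client and a configured kubeconfig; the pod name, image tag, and sizes are illustrative, not our production spec.

```python
# Minimal sketch: per-container CPU/memory requests and limits, enforced by
# Kubernetes at the process (cgroup) level. Names and sizes are illustrative.
from kubernetes import client, config

config.load_kube_config()  # assumes a configured kubeconfig

es_container = client.V1Container(
    name="elasticsearch",
    image="elasticsearch:2.3",  # illustrative image tag
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "2Gi"},  # guaranteed share
        limits={"cpu": "2", "memory": "4Gi"},       # burst ceiling
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="es-data-0", labels={"app": "es"}),
    spec=client.V1PodSpec(containers=[es_container]),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The gap between the request (what a tenant is guaranteed) and the limit (what it may burst to) is what makes dense multi-tenancy workable without noisy neighbors.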

Kubernetes gives you a private cloud of compute resources to utilize.  

Supergiant goes a step further:  it gives you the framework for managing distributed applications on Kubernetes clusters.

The purpose is to abstract DevOps away from application developers and finally make the world application-centric instead of machine-centric.  This has been a holy grail that the containerization era has not yet achieved.  The crucial concept is Supergiant Components, which give app developers the ability either to submit publicly available Supergiant recipes or to keep a private repository of containers that run on Supergiant.

For Qbox, we went from a 1:1 ratio of VMs to customer nodes to approximately 1:11.  Sure, the VMs were larger, but the utilization made a substantial difference: we could pack a whole bunch of little instances onto one big instance without losing any performance.  Smaller users got the added benefit of higher network throughput by virtue of running on bigger hardware, and they also got greater CPU and RAM bursting.
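
The packing intuition can be shown with a toy first-fit-decreasing sketch in Python.  This is only an illustration, not Supergiant’s actual algorithm, which is more sophisticated:

```python
# Toy first-fit-decreasing bin packing: collapse many small tenant nodes
# onto a few large VMs. Illustrative only.
def pack(tenant_cpus, vm_capacity):
    """Map each tenant to a VM index, opening a new VM only when none fit."""
    vms = []          # CPU already committed on each open VM
    placement = {}    # tenant -> VM index
    for tenant, cpu in sorted(tenant_cpus.items(), key=lambda kv: -kv[1]):
        for i, used in enumerate(vms):
            if used + cpu <= vm_capacity:
                vms[i] += cpu
                placement[tenant] = i
                break
        else:
            vms.append(cpu)
            placement[tenant] = len(vms) - 1
    return placement, len(vms)

# Eleven 1-core tenants land on a single 16-core VM instead of eleven VMs:
placement, vm_count = pack({f"tenant-{i}": 1.0 for i in range(11)}, 16.0)
print(vm_count)  # -> 1
```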

In terms of stability, our support ticket volume fell dramatically.  

In a straight containerized approach, there would be a hard limit on the compute resources that could be used.  (Many of our competitors have adopted this approach.)  The moment that maximum was reached, failures would happen, which in turn meant an annoyed customer and a support ticket.

With the Supergiant approach, our users had a soft limit.  They could burst above the theoretical (read: paid-for) maximum -- at least for a period -- without failures and without impacting the experience of other users on the same hardware.  After a short burstability period, Supergiant would gracefully rein in CPU and memory usage.
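
As a hypothetical sketch of what such a soft-limit policy could look like (this is NOT Supergiant’s actual implementation), imagine a loop that tolerates bursting above the paid-for baseline for a grace window and then throttles back.  The callbacks get_cpu_usage and set_cpu_limit are stand-ins for whatever hooks the platform exposes:

```python
# Hypothetical soft-limit policy loop; not Supergiant's real code.
import time

PAID_CPU = 1.0        # cores the tenant pays for (the guaranteed baseline)
BURST_CPU = 4.0       # temporary ceiling while bursting
GRACE_SECONDS = 300   # how long a burst is tolerated

def enforce_soft_limit(tenant, get_cpu_usage, set_cpu_limit):
    burst_started = None
    while True:
        if get_cpu_usage(tenant) > PAID_CPU:
            burst_started = burst_started or time.time()
            if time.time() - burst_started > GRACE_SECONDS:
                set_cpu_limit(tenant, PAID_CPU)   # gracefully rein in
            else:
                set_cpu_limit(tenant, BURST_CPU)  # tolerate the burst
        else:
            burst_started = None                  # burst over; reset the clock
            set_cpu_limit(tenant, BURST_CPU)      # headroom available again
        time.sleep(10)
```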

Breaking Down the Savings

Meanwhile, because our infrastructure needs were far more predictable, we could fully leverage the economic goodness that is AWS Reserved Instances.  We want our customers to have the flexibility to spin resources up, down, and out.  This usually results in more resources being used -- and more to our bottom line.  

However, demand forecasting in this model is that much harder.  

Sure, for our most popular instance types in our most popular data centers, we could forecast demand fairly well and leverage RIs to some degree.  But for the handful of our largest instance types, it was too risky to commit to 12 months without the same commitment from the customer.  With Supergiant, the growth of our nodes became much more predictable, and thus we could fully commit.

This aspect alone drove our costs 40% lower and had the biggest impact on our bottom line.

Even without the tremendous benefit of finally being able to fully utilize RIs, the Supergiant packing algorithm allowed us to use fewer resources overall.  Since the dawn of computing, provisioning for peak usage has been the bane of corporate IT bottom lines.

This pain is especially acute for our customers and, by extension, for just about all users of Elasticsearch, a technology with use cases diverse enough that some users run intensive search operations, others run intensive indexing operations, and still others run both (though in most cases it’s one or the other).  Search performance can slow to a crawl during indexing operations, so customers would over-provision.  (This over-provisioning led to the aforementioned call from our AWS rep about our optimization levels.)

Yet, conversely, when we got support tickets from customers complaining of poor performance, the cause was almost always an undersized cluster that they kept undersized to save money.

We offloaded indexing operations onto the unused compute resources of our Kubernetes cluster.  The result was more stability and fewer support tickets.  The bonus was performance that blew away the direct-VM model.

Customers also told us that Supergiant blew away the performance of the aforementioned competition.

To summarize how we lopped 50% off our infrastructure bill (a rough composition sketch follows the list):

The Supergiant packing algorithm resulted in 25% lower compute-hour usage without sacrificing performance or stability.

Enhanced predictability for the remaining nodes in the cluster enabled us to fully leverage AWS Reserved Instances, resulting in a 40% savings.

Side benefits include enhanced stability, better indexing performance, and better automatic recovery/failover, resulting in fewer support tickets overall.
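
As a back-of-the-envelope illustration of how those effects can compose to roughly half the bill, consider the following; the baseline figure and the share of usage we could reserve are assumptions for the sake of the example, not figures from our actual billing data:

```python
# Back-of-the-envelope only: hypothetical baseline and assumed reservable share.
baseline = 100_000              # hypothetical monthly on-demand bill, USD

packed = baseline * (1 - 0.25)  # packing algorithm: ~25% fewer compute-hours

reserved_share = 0.85           # assume ~85% of remaining usage is predictable
ri_discount = 0.40              # ~40% savings on reserved capacity

final = packed * (reserved_share * (1 - ri_discount) + (1 - reserved_share))
print(f"${final:,.0f}, {final / baseline:.0%} of baseline")  # -> $49,500, 50%
```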

We invite you to try Supergiant.  It now belongs to the community.

When our dev team told me that with just six more weeks of polish, they could turn our planned Private Hosted Elasticsearch into a full-on data center management system, I gave the green light without hesitation.  

Supergiant is now available on GitHub, freely downloadable and usable under the Apache 2.0 license.  For now, Supergiant works on Amazon Web Services; we have Google Cloud, Rackspace, and OpenStack on the roadmap.