AWS IO Performance: What’s Bottlenecking Me Now?


When moving to AWS, it can be difficult to pinpoint where your previously high-functioning system is underperforming. AWS has its own internal rhyme and reason, and anyone migrating must become familiar with how it operates. One tool that can help on this front is AWS CloudWatch, which provides metrics and monitoring on core system performance. The focus of this article is how to configure your I/O subsystem in AWS, which requires using CloudWatch to understand that subsystem’s performance. Are you hitting throughput limits? Are you utilizing all your allocated IOPS? What’s the average queue length?

Defining all of these terms from scratch is out of the scope of this article – we assume you have some basic familiarity with AWS I/O terminology, or at minimum a few reference links to get you started. Our goal here is instead to provide some high-level guidelines on how to begin thinking through your I/O performance in AWS.

It’s Not Just the Volume

Let’s take a step back for a second to let the implications of using EBS with our instances sink in. EBS (Elastic Block Store) is a type of durable, persistent, network-attached storage available for AWS EC2 instances. Emphasis here goes on “network-attached”: whenever you read from or write to an EBS volume, there is an unavoidable network hop sitting directly between the EC2 instance and the EBS storage. In many cases this network component will be your bottleneck unless you are running on a particularly capable instance network setup (2), since it isn’t very difficult to surpass an instance’s EBS limits.

The first step in assessing your I/O woes is validating this network connection from your instance to your EBS volume. To further complicate things, your performance is limited not only by the volume you provision, but also by the instance you attach said volume to! Amazon lists the network limits per instance type, ranging from roughly 4,000 IOPS and 500 Mbps of dedicated EBS bandwidth up to 65,000 IOPS and 10 Gbps, so check your documentation! It is incredibly important that you take your expected disk throughput and IOPS into consideration when selecting an instance. A perfectly performant I/O subsystem can’t outshine a low-throughput network connection to an instance, so be sure to test with every application.

When asking “what’s bottlenecking me now?”, both the EBS volume type and the instance type have to be considered when setting up your system. You could provision 20,000 IOPS on an io1 disk, but because of instance bottlenecks, max out at around 8,000 IOPS for a throughput of 260 MiB/s. That’s 12,000 IOPS you’re paying for and wasting, all because you forgot to consider your instance type as a bottleneck.
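
As a rough sanity check (using the hypothetical numbers above, and assuming an average I/O size of 32 KiB), you can relate an instance’s throughput cap to the effective IOPS ceiling it implies:

`echo $(( 260 * 1024 / 32 ))   # instance cap of 260 MiB/s divided by a 32 KiB average I/O size: prints 8320`

Whichever cap you hit first, IOPS or bandwidth, is the one that governs what your application actually sees.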

When consulting AWS documentation about your instance, be cautious about taking the listed I/O limits at face value. In a fair bit of documentation, AWS achieves the listed instance benchmarks only by assuming a read-only workload in “perfect weather.” The reality is that these numbers are rounded approximations of best-case performance, using block sizes that may not be relevant to your application. Read your documentation carefully and be prepared to test your assumptions. Of course, this is just the tip of the iceberg – you’ll need to weigh other factors as well, such as the file system being used, the read/write mix of your workload, and whether options like encryption are enabled on the EBS volumes connected to the instance.

What About the Disk?

Go ahead and start monitoring your instance to quickly determine where you are particularly struggling in the I/O department (if at all!). AWS CloudWatch metrics make it very easy to validate whether your high-throughput I/O application has disk-related performance bottlenecks. Combining this with standard disk benchmarking tools at the OS level will allow a clear picture of your I/O subsystem to emerge. We’ve found it is fairly simple to get an I/O subsystem to outperform the EBS-to-instance network connection (RAID setups come to mind), so after validating the network pipe limitations, you’ll be much better equipped to provision your EBS volume appropriately for your desired IOPS and/or throughput.
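
As a sketch of what that validation can look like from the command line (the volume ID and time window below are placeholders), the per-volume EBS metrics behind these questions are all available through the CloudWatch API:

`aws cloudwatch get-metric-statistics --namespace AWS/EBS --metric-name VolumeQueueLength --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 --start-time 2016-05-18T00:00:00Z --end-time 2016-05-18T06:00:00Z --period 300 --statistics Average`

`aws cloudwatch get-metric-statistics --namespace AWS/EBS --metric-name VolumeReadOps --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 --start-time 2016-05-18T00:00:00Z --end-time 2016-05-18T06:00:00Z --period 300 --statistics Sum`

Dividing the VolumeReadOps (or VolumeWriteOps) sum by the period gives the average read (or write) IOPS for each five-minute window, which you can compare directly against what you provisioned.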

Appropriate drive selection depends heavily on your primary performance attribute (2): IOPS or bandwidth. For higher-IOPS applications, SSDs are available in general purpose burstable (gp2) and provisioned IOPS (io1) formats. SSDs are more expensive than their HDD counterparts and are optimized around low-latency, smaller-block-size operations. Most general purpose and mission-critical applications are just a matter of appropriately sizing one of these two SSD types. The other EBS storage options offered by AWS are throughput optimized (st1) and cold (sc1) HDD volumes. What is unique about these storage types is their maximum bandwidth per drive, 500 MiB/s and 250 MiB/s respectively. These HDD volumes require a larger block size and a preferably sequential workload to perform effectively due to their painfully low IOPS limits (500 and 250 IOPS respectively).
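
To make the menu concrete, here is roughly what provisioning each flavor looks like from the CLI (the availability zone, sizes, and IOPS below are placeholder values, not recommendations):

`aws ec2 create-volume --availability-zone us-east-1a --volume-type gp2 --size 500`

`aws ec2 create-volume --availability-zone us-east-1a --volume-type io1 --size 500 --iops 20000`

`aws ec2 create-volume --availability-zone us-east-1a --volume-type st1 --size 2000`

The volume type and, for io1, the provisioned IOPS are the knobs you are actually paying for, so tie them back to the instance limits discussed above before provisioning.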


At the end of the day, most AWS applications outside of big data and log processing will boil down to a choice between io1 and gp2 volumes. HDD volumes have their place, but even though their admirably high bandwidth limits might be appealing, they must be considered carefully due to their extreme IOPS limitations.

EBS Snapshots: A Lesson in Prudence


Let’s turn now to some of the functionality provided in tandem with EBS volumes. AWS offers powerful snapshot functionality in conjunction with EBS volumes, allowing the user to make point-in-time backups of a given volume. Better yet, these snapshots are incremental – your first snapshot of a volume will take time, but any subsequent snapshot of the same volume will only snapshot blocks that have been modified since the last snapshot. AWS accomplishes this through a bit of smoke and mirrors that ultimately boils down to two things: pointers and S3. In short, all snapshotted blocks of a given volume are stored in S3. When a new EBS volume is spun up from a snapshot, AWS will initialize the volume using the newest blocks from all snapshots associated with the original drive. This means that an end user can set up a rolling backup schedule for a given EBS volume, and be assured of both data integrity to a point in time, as well as speedy incremental snapshots.
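
That programmatic control amounts to a couple of API calls (the volume and snapshot IDs below are placeholders):

`aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly backup"`

`aws ec2 create-volume --availability-zone us-east-1a --snapshot-id snap-0123456789abcdef0 --volume-type gp2`

Scheduling the first command gives you the rolling, incremental backups described above; the second spins a fresh volume up from any snapshot in that chain.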

Hearing all this, it is easy to get carried away. Incremental snapshots + speedy initialization of a drive from a snapshot + programmatic control of instances and volumes via AWS API calls = all sorts of possibilities, right? One might think of backup systems, easy setup / maintenance of slave nodes or QA environments, or perhaps a data refresh scheme using snapshots? Sounds good, but stop for a second and remember: THERE IS NO FREE LUNCH. Not here, not anywhere. We’re dealing with computers, and everything has a trade-off cost. So where are we paying here?

Remember when we mentioned how AWS stores snapshot blocks in S3? So how do we get those blocks back from S3, an entirely separate system, into a drive that we initialize from a snapshot? Indeed, AWS does you no favors (or perhaps one too many favors!) on this – when you navigate a volume spun up from a snapshot, all your files look like they’re there!

Or are they? Perhaps you might notice slight performance degradation when you access a file from your snapshot-initialized volume. The first time you open a file, your IO performance is poor; the second time, much better. This pattern repeats for every file on your volume.

There it is – no free lunch. When a drive is initialized from a snapshot (whose blocks are stored on S3) and then attached to an instance, AWS fetches blocks from S3 lazily. Until you access a given block on your disk, it is not physically on your volume. So the first time you access a block, you’re paying two costs: network latency to fetch it from S3, and then I/O latency to read the block from disk. And to make things worse, recall from earlier in the article that an EBS volume is essentially storage accessed over a network, which means you have even more layers of latency and bottlenecks to work through for an I/O read!

Okay, so what can we do? Well, if your volume is small and/or you have time on your side, there are a few options. The main option suggested by AWS is, in essence, to use the dd tool (3) to read every block from your newly initialized volume and redirect the output to /dev/null. In other words, read everything so that all blocks are pulled from S3 (the drive is “pre-warmed”), and then discard those reads. Now you’ve eliminated the S3 latency penalty.
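
A minimal sketch of that pre-warming pass, assuming the snapshot-initialized volume is attached as /dev/xvdf (adjust the device name for your instance):

`sudo dd if=/dev/xvdf of=/dev/null bs=1M`

Reading the whole device once forces every block to be pulled down from S3 and then throws the data away, so subsequent reads pay only the normal EBS latency.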

Of course, depending on the size of your drive, this process can be prohibitively slow. It also requires extra manual work or automation from your engineers, which can backfire depending on the complexity of your implementation.

None of this is meant to deter you from making use of snapshots – they’re a wonderful, well-implemented tool. But as with any tool, they come with drawbacks and gotchas, and this is a biggie. So just remember: EBS snapshots are back-ended by S3, and blocks for snapshot-initialized volumes are fetched lazily.

AWS is Not Magic

Your best bet is to test the network connection directly via an I/O load test utility (such as SQLIO.exe or diskspd.exe in the Windows world). Generating an insanely high IO/bandwidth workload is very easy when you’re using the right utility, and it allows you to validate the EBS-to-Instance pipe and disk limitations at not only the AWS layer (CloudWatch), but at the OS layer as well. You can easily find the network pipe limit in terms of bandwidth or I/O using an overly performant drive setup and a correctly tuned load tester utility, such as RAID 0 volumes and diskspd.exe (RAID0 being what it is, of course, exercise caution!).
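
Those are Windows tools; on Linux, a comparable load can be generated with fio. A minimal sketch, assuming a 16 KiB random-read workload against a scratch file on the volume under test (the file path, size, queue depth, and duration are all placeholders to tune):

`fio --name=ebs-read-test --filename=/data/fio-testfile --size=4G --ioengine=libaio --direct=1 --rw=randread --bs=16k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting`

Watching the fio summary alongside the CloudWatch volume metrics makes it clear which layer tops out first: the disk, the EBS volume, or the instance’s network pipe.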

AWS offers flexibility and power that far exceed those of your average data center, but (at risk of a groan) with great power does indeed come great responsibility. You MUST be prepared to play by AWS’s rules rather than conventional datacenter wisdom. This means a few things:

  1. Live and breathe AWS documentation.
  2. At the same time, believe nothing in that documentation until you have tested it yourself.
  3. Perform your proof-of-concepts at scale and as robustly as possible.
  4. Remember that AWS services are nothing more than complex abstractions – you will face bottlenecks from things AWS lists, and bottlenecks from things they never mention once. We refer you back to points one through three.

If nothing else, the intent of this article is to convey the complexity of moving to AWS through the lens of configuring I/O, and to offer some suggestions and principles on how to navigate the world of AWS. The payoff of moving to AWS can be wonderful, but AWS is not magic. Learn to navigate by its rules, and you’ll be rewarded – otherwise, you’ll wonder what all the hype was about!

Reference Links



Data Scientist’s Toolbox for Data Infrastructure I


Recently, the terms “Big Data” and “Data Science” have become important buzzwords; massive amounts of complex data are being produced by businesses, scientific applications, government agencies, and social applications.

“Big Data” and “Data Science” have captured the business zeitgeist because of extravagant visualizations and the amazing predictive power of today’s newest algorithms. Data Science has nearly reached mythical proportions, treated as a kind of all-knowing oracle. In reality, Data Science is more practical and less mystical. We as Data Scientists spend half of our time solving engineering infrastructure problems, designing data architecture solutions, and preparing the data so that it can be used effectively and efficiently. A good data scientist can create statistical models and predictive algorithms; a great data scientist can also handle infrastructure tasks and data architecture challenges while still building impressive and accurate algorithms for business needs.

Throughout this blog series, “Data Scientist’s Toolbox for Data Infrastructure”, we will introduce and discuss three important subjects that we feel are essential for the full stack data scientist:

  1. Docker and OpenCPU
  2. ETL and Rcpp
  3. Shiny

In the first part of this blog series we will discuss our motivations behind adopting Docker and OpenCPU, then follow that discussion with applicable examples of how Docker containers reduce the complexity of environment management and how OpenCPU allows for consistent deployment of production models and algorithms.

 

CONTAINING YOUR WORK

Environment configuration can be a frustrating task. Dealing with inconsistent package versions, diving through obscure error messages, and waiting hours for packages to compile can wear anyone’s patience thin. The following is a true (and recent) story. We used a topic modeling package in R (along with Python, the go-to programming language for Data Scientists) to develop our recommender system. Our recommender system came with several dependencies, one of them being “Matrix” version 1.2-4. Somehow, we upgraded our “Matrix” version to 1.2-5, which (unfortunately for us) was not compatible with the development package containing the recommender system. The terrible part of this situation was that the error messages did not indicate why the failure occurred (ultimately, the version upgrade), which resulted in several hours of debugging to remedy the situation.

Another similar example: our R environment was originally installed on CentOS 6.5. Using ‘yum install’ we could only obtain R version 3.1.2, which was released in October 2014 and was not compatible with many of the dependencies in our development and production environments. Therefore, we decided to build R from source, which took us two days to complete. The delay was due to a bug in the source that we had to dig into the code to find.

This raises the question: how do we avoid these painful, costly, and ultimately avoidable problems?

SIMPLE! With Docker containers, we can easily handle many of our toughest problems simultaneously. We use Docker for a number of reasons, a few of the most relevant of which are mentioned below:

  1. Simplifying Configuration: Docker provides the same capability as a virtual machine without the unneeded overhead. It lets you put your environment and configuration into code and deploy it, similar to a recipe. The same Docker configuration can also be used in a variety of environments. This decouples infrastructure requirements from the application environment while sharing system resources.
  2. Code Pipeline Management: Docker provides a consistent environment for an application from QA to PROD, thereby easing the code development and deployment pipeline.
  3. App Isolation: Docker can help run multiple applications on the same machine. Let’s say, for example, we have two REST API servers with slightly different versions of OpenCPU. Running these API servers in different containers provides a way to escape what we refer to as “dependency hell”.
  4. Open Source Docker Hub: Docker Hub makes it easy to distribute Docker images; it contains over 15,000 ready-to-use images we can download and use to build containers. For example, if we want to use MongoDB, we can easily pull it from Docker Hub and run the image. Whenever we need to create a new Docker container, we can easily pull and run the image from Docker Hub:
`docker pull <docker_image>`

`docker run -t -d --name <container_name> -p 80:80 -p 8004:8004 <docker_image>`
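
As a concrete example (assuming the public opencpu/base image from Docker Hub, which bundles R with the OpenCPU server), those two commands might look like:

`docker pull opencpu/base`

`docker run -t -d --name ocpu-dev -p 80:80 -p 8004:8004 opencpu/base`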


We are now at a point where we can safely develop multiple environments using common system resources, without worrying about any of the aforementioned horror stories, simply by running:

`docker ps`

`docker exec -it <container_name> bash`


Our main structure for personalized results is shown in the image below. We have three Docker containers deployed on a single Amazon EC2 machine, running independently with different environments yet sharing system resources. Raw data is extracted from SQL Server and goes through an ETL process to feed the recommender system. Personalized results are called from a RESTful API through OpenCPU and returned in JSON format.

[Architecture diagram: three Docker containers running independently on a single Amazon EC2 instance]
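
To give a flavor of that API call (the package name “recommender”, the function name “recommend_jobs”, and the argument are hypothetical stand-ins for our internal code), a client simply POSTs to the container’s OpenCPU endpoint and asks for JSON back:

`curl http://localhost:8004/ocpu/library/recommender/R/recommend_jobs/json -d "user_id=12345"`

The JSON response can be consumed directly by the application tier; the R environment itself never leaves the container.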

DISTRIBUTING YOUR WORK

OpenCPU is a system that provides a reliable and interoperable HTTP API for data analysis based on R. The opencpu.js library builds on jQuery to call R functions through AJAX, straight from the browser. This makes it easy to embed R-based computation or graphics in apps, such that you can deploy an ETL job, computation, or model and have everyone using the same environment and code.

For example, say we want to generate 10 samples from a normal distribution with mean 6 and standard deviation 1. We call the “rnorm” function from R’s stats package. Performing an HTTP POST on a function results in a function call, with the HTTP request arguments mapped to the function’s arguments.

`curl https://public.opencpu.org/ocpu/library/stats/R/rnorm/ -d "n=10&mean=6&sd=1"`


The output can be retrieved using HTTP GET. When calling an R function, the output object is always called .val. In this case, we could GET:

`curl https://public.opencpu.org/ocpu/tmp/x0aff8525e4/R/.val/print`


The response to this GET contains our 10 samples, printed as plain text.

Now imagine this type of sharing at a larger scale, where an analytics or data team can develop and internally deploy its products across the company. Consistent, reproducible results are the key to making the best business decisions.

Combining Docker with OpenCPU is a great first step in streamlining the deployment process and moving towards self-serviceable products in a company. However, a Full Stack Data Scientist must also be able to handle data warehousing and understand the tricks of increasing the performance of their code as systems scale. In part 2, we will discuss using R as an ETL tool, which may seem like a crazy idea, but in reality R’s functional characteristics allow for elegant data transformation. To handle performance bottlenecks that may transpire, we will discuss the benefits of Rcpp as a way of increasing performance and memory efficiency by rewriting key functions in C++.


What is a Good Program?


How do you know if the software you are building is “good” software? How do you know if the programmers on your team are “good” programmers? As programmers, how do we systematically improve ourselves?

The goal of most programmers should be to improve their craft of building programs.  A good programmer builds good programs.  A great programmer builds great programs.  As programmers, we need a way to judge the quality of the programs we build if we have any hope of becoming better programmers.

What is the problem you are trying to solve?

A traveler approaches a heavily wooded area he must pass in order to get to his destination. There are several other travelers already there. They are scrambling to cut a path through the woods so they too can get to the other side. The traveler pulls out a machete and starts chopping away at the brush and trees. After several hours of hard physical labor in the hot sun, chopping away with his machete, the traveler steps back to see his progress. He has barely made a dent in the heavily wooded area, his hands are bruised and worn, he has used up most of his water, and he is exhausted. A second traveler walks up to the first traveler and asks him what he is doing. The first traveler responds, “I am trying to cut a path through these woods.” The second traveler responds, “Why?” The first traveler snaps back in frustration, “Obviously, I need to get to the other side of these woods!” The second traveler responds, “That wasn’t obvious at all! You see, I just came down that hill over there, and from there you can clearly see that the wooded area is deep but narrow. You will die before you cut your way through the woods with that machete. It would be much easier to just go around. As a matter of fact, if you look to your right you can see a taxi stand in the distance. He can get you to the other side quickly.”

As programmers, what is the problem we are trying to solve?

The first traveler lost sight of his goal. Once he encountered the wooded area with all of the other travelers already cutting their way through the woods, the problem went from getting to the other side to chopping down trees. Instead of stepping back to evaluate the possibilities and try to find the most efficient way to the other side, he joined the crowd that was already chopping away at the woods.

Programs are tools

Programs are solutions to problems. They are tools to help people accomplish their goals. Just like the first traveler in our story, programmers often lose sight of the problem they are trying to solve, wasting most of their time solving the wrong problems. Understanding the problem you are trying to solve is the key to writing good software.

Programs are tools designed to solve a problem for an intended user.

Good tools are effective. They solve the problem they were built to solve. A good hammer can hammer in nails. A good screwdriver can screw and unscrew screws. A good web browser can browse the web. A good music player can play music. A good program solves the problem it is supposed to solve. A good program is effective.

Good tools are robust. A hammer that falls apart after just one use is not a very good hammer. Similarly a program that crashes with bad inputs is not a very good program. A good program is robust.

Good tools are efficient. An electric screwdriver that takes a long time to screw in a screw is not as good as an electric screwdriver that can screw in screws quickly. Similarly, a web browser that takes a long time to render a web page is not as good as one that does so quickly. Good programs are efficient.

Like any other good tool, good programs are effective, robust, and efficient. Most tools are built to solve a well-defined problem that is not expected to change. A nail will always behave as a nail does, thus a hammer’s job will never need to change. Programs, on the other hand, are typically built to solve a problem that is not well defined. Requirements change all the time. A good program has to be flexible so it can be modified easily when requirements do change.

Good programs are flexible.

Creating flexible programs that can easily be adapted to meet changing requirements is one of the biggest challenges in software development. I stated earlier that the key to building a good program is understanding the problem you are trying to solve. Programming is an exercise in requirements refinement. We start with an understanding of the fundamental problem we are trying to solve, expressed in plain language. In order to create a solution for the problem we start defining requirements. Some requirements are based on fact and some are based on assumptions. Throughout the software development process we refine those requirements, adding more detail at every step. Fully specified, detailed requirements are called code. The code we write is nothing more than a very detailed requirements document that a compiler can turn into an executable program.

The challenge comes from changing requirements over time. Our understanding of a problem may change. The landscape in which we are operating may change. Technology may change. The scope of the problem may change. We have to be ready for it all.

When requirements change we have three choices: do nothing, build a new program, or modify the original program. Building a new program is a perfectly acceptable solution, and may be the right answer in some cases. Most of the time, due to time and budget constraints, the best answer is to modify the original program. A good program will spend most of its life in production. During that time the requirements of the users and the landscape are likely to change. When that happens your program no longer meets our first requirement for good programs: a good program is effective. The requirements of the problem no longer match the requirements specified in your code. Your program no longer solves the problem it was intended to solve. It is broken. You have to fix it as quickly as possible. If the program is flexible enough you can modify it quickly and cheaply to meet the new requirements. Most programs are not built this way and end up failing as a consequence.

To build a flexible program, a programmer should ask the question: “What requirements are least likely to change, and what requirements are most likely to change?” The answer to this question will inform every aspect of your program’s architecture and design. It is impossible to build a good program without the answer to this question.

Good code?

Programs are made of code. Good programs are made of good code.
In this post I have specified the top-level requirements for what a good program is. In the next post I will start to refine these requirements further to answer the question, “What is good code?”
