14 Jul 2020 · 13 mins
Up until late last year, I was the main ‘sales engineer’ at Boemska. While wearing that hat, I spent much of my time demoing our software and using it to show new customers exactly what’s happening on their analytic platforms. I particularly enjoyed doing those presentations because often I would experience that moment when our latest customer figures out exactly how the software they’d owned for years actually works.
“We always thought SAS was a black box, but seeing it like this makes it so easy to understand!”
That aha moment always put a smile on my face. It’s one of the best parts of my job. There’s no better validation of your product than this kind of feedback.
Occasionally though, a few days later, I’d get an email similar to this:
“Hey Nik, remember when you walked us through how all this stuff works? That was awesome! Do you have it documented anywhere, so we can use it to educate our users?”
The way I’d typically answer this question is something I’ve never quite been happy with. I’ve written some papers and online posts that I’ve directed people to, but most of those are pretty specific and already assume a certain level of knowledge. Telling a new customer to go back and refer to vendor documentation when they’ve just bought your product never felt right, but on the other hand, deep-dives about how SAS (or RStudio, or Jupyter, or Netezza) work never really made sense for our own technical documentation either.
Fast forward to July 2020. We’ve had some awesome new people join Boemska, we have a new website, a new blog and a new pandemic lockdown in the UK - all of which, together, seem to have added a few more hours to my day. So, I’ve decided to use that time to write a series of blog posts that cover exactly that subject - something like ‘How Analytic Workloads Work’, as seen through the lens of ESM.
The series will be about fundamentals, but I’m going to make it as “real world” as possible. A big part of that will involve explaining how these fundamentals fit into the modern Cloud landscape, and affect Real-Cloud Performance and Cost. I’ll also write about the more modern architectures too, including SAS Viya. But to get to that, and to truly appreciate it, it’s worth starting from the basics.
Consider a program that looks like this.
data work.inventory; infile '/some/data/inventory.csv'; input someColumns : $someFormats. /* more columns probably */ ; run;
This simple snippet of SAS code uses a SAS DATA step - one of the oldest and most widely used data load mechanisms in SAS - to read a text-based CSV dataset (
/some/data/inventory.csv), and store it as a SAS dataset in the temporary WORK library (
work.inventory). The WORK library is a bit like SAS’ temporary scratch disk, used to store intermediate datasets between processing steps.
So, what is it that actually happens as this program runs? Let’s go back to high-school for a minute and really consider the basics:
For this code to run, and for the machine to do the work we’re asking it to do, the following things must happen:
I realise all of this is quite basic, and you could argue the artwork is more kindergarten than high-school. But these fundamentals are important.
If you’ve ever run out of storage space on your laptop, you’ll know that the amount of data a disk can hold is limited by its capacity. If you then tried to free up some space by moving some of the larger files to an external disk, you’ll have discovered that the rate at which disks are able to read and write data is also very limited - by either the speed of the disk, or the speed of its connection to your laptop.
It’s a common assumption that the speed and age of a machine’s processor dictate its performance. While it’s true that a faster, newer CPU is capable of executing more instructions than an older one, if those instructions are ‘process this data’, that CPU can’t do anything useful if the data it’s meant to be processing hasn’t reached it yet. It will spend a lot of its time waiting for the rest of the system to catch up and supply it with more data; the faster the processor, the more time it’ll likely spend waiting.
The question when processing data is ‘how much data is enough data to keep my processor busy?’. Time for an example from the real world.
For this example, I downloaded a real world sized CSV file from Kaggle, and loaded it using code very similar to the snippet above. The cluster where I ran the code is instrumented with ESM, so I was able to see information about the data load step’s performance. Information like the rate at which the data was reaching the processor, and how busy it was as a result.
The following graph is a straight export from ESM.
From this graph, we can see that this code started executing at 17:00:35 and ran for just under two minutes, until 17:02:30. Here are a few other things the graph shows.
Throughout the time it ran, the code used close to 100% CPU. While the ‘percent CPU’ process metric used to describe CPU utilisation in this context is fairly universally used, it can be quite confusing.
“100% CPU? What does that actually mean, is it 100% of the entire server? How come I can see some jobs going above 100%? How is that even possible?”
I’ve heard this question many times, not just in the context of ESM. Technically it means that ‘a program which is running at 100% CPU is using the amount of CPU cycles equivalent to the capacity of a single processor core, or utilising all processing time available to a single thread of execution’.
A word on threads. Some programming languages, and many SAS Foundation procedures, are able to split their work into multiple independent streams of instruction, enabling them to distribute their work to multiple workers - multiple threads. This is how they’re able to ‘go above 100% CPU’, by utilising more than one CPU core at a time. CAS, the core compute component of SAS Viya, is a good example of a compute engine that does almost all of its processing in this way. CAS not only distributes its work across multiple CPU cores, but is able to also distribute it across multiple nodes in a cluster, achieving ‘massively parallel processing’ and huge gains in performance and processing speed.
However, older procedures like the classic ‘v1’ SAS DATA step, or the Python runtime (with its Global Interpreter Lock), are strictly single-threaded. They are only capable of processing one linear stream of instructions at a time, and a single piece of code can never use more than 100% of CPU - regardless of how many available CPU cores it has as its disposal.
I might go into more depth about threading in another post, but for now think about it like this: when a step flatlines at 100% like the one in this chart, it is a single-threaded step. As such, because it’s flatlining at its peak possible utilisation, it is running at optimal performance, as efficiently as it possibly can. In other words, here the processor is being supplied with enough data to keep it as busy as it could be, and the limiting factor in the completion time of this code is indeed the processor.
But at what rate does the data need to flow to and from the processor for it to maintain this optimal 100% CPU utilization? How much of that CSV data is it capable of converting into that easier-to-read-back-in-later SAS binary format in a given time period, and how much of the resulting binary data does it produce as output in that time?
Hovering over the chart above shows that for this code, this rate was around 100 MB per second for the incoming CSV file being read from disk, and a little over 200MB per second for the binary dataset being produced and written to disk.
I’ve mentioned a couple of times that the SAS binary dataset this step produces will be ‘easier to read next time round’. One reason that future steps will find it easier to read is precisely because they won’t have to figure out where each bit of data starts and stops - the work the data step does when it first reads the CSV file from disk. In the binary dataset, that data is stored in fixed width variables, where each record and each column are aligned and occupy the same amount of space on disk. The downside to this is that, because all those trailing blank characters are also written to disk, the output .sas7bdat binary dataset takes up more disk space. This also explains why, in this graph, the resulting data is written at twice the rate that the source data is read in.
With that in mind, if we were in a situation where we didn’t have enough disk space to store that padded out file, or our target disk device wasn’t fast enough to cope with a 200MB/s write speed, we could tell SAS to compress that output before writing it to disk. This would change the way those blanks are stored and drop the size of the output file by around 60%, and therefore also reduce the rate at which it needed to be written. However, from the analysis earlier, we know that the CPU already has no time spare (it was running at 100%), and asking it to also apply compression to the output data would give it even more work to do. Doing that in this instance would increase the overall completion time for our load step by around 30%, and as we are not currently sharing this instance with anyone else, that’s not something we care about.
You could say that here, we are choosing to keep the computational complexity of our program lower, because it lets our CPU process more data faster. We are doing that because, as our metrics suggest, our instance can cope with the higher disk throughput and extra storage space required in order to execute the data load in the shortest possible time.
Loosely speaking, computational complexity is this idea that sometimes, when performing a complex operation on a given volume of data, the processor can take a long time to finish processing that volume of data. Other times, when performing a different simpler operation on the same volume of data, it can finish that work much, much faster. It all depends on how complex, or computationally expensive, the thing it’s been asked to do with that data is.
Here’s another example program execution. I started with the same code as above, but this time I added an extra step to describe the dataset (a Proc Means), and another to sort it (a Proc Sort).
Here, the load of the CSV data again took around 2 minutes. Then, describing the data then took another 2 minutes, and sorting it took a little over 4 minutes. Let’s focus on the first two steps for now - the Data Step and the Proc Means.
We know what the first step does - it reads data from a CSV file and writes it to a SAS dataset. What the new, second step then does is read the dataset created by the first step, and calculate some summary statistics for it. For this to happen, the same data written by the first DATA step is read back by the second - the Proc Means step. This is why, if you squint a little, you can make out that the volume of blue bars covering the second step is the same as the volume of green bars covering the first.
However, while the two steps do work on the exact same volume of data, there’s an obvious difference between them in terms of the amount of CPU resource they require.
You’d be forgiven for assuming that the second step, the one that performs actual calculations on the data, would require more processing power than the first, which just loads it. But, because in this case the data contains few numeric variables, and the input dataset is already in that SAS binary format, the Means procedure doesn’t actually have to work very hard there. The entire second step barely needs any CPU time to do its thing.
So if that’s the case, if it requires barely any work to process that data, why does the second step take almost as long to complete as the first?
The answer to this may seem obvious by now. It’s because all of that data still needs to be read back from disk for the Means procedure to process it. Unlike the first step, this instance of the Means procedure is a typical example of a step where the storage system isn’t able to supply the processor with data at a sufficient rate to make use of all the available CPU time. To approach 100% CPU, it would need to read it in at about 1.1GB/sec.
When I see a chart like this in ESM, that shows a job only using a fraction of a CPU thread while there is still plenty of CPU resource available on that node, it is almost always an indication that the job is either waiting on more data to be read in, or waiting for the data it’s producing to be written before more can be processed. The hold-up can be either a network connection, a database query, or like in this case, the artificial data throughput limit our cloud provider is applying, limiting the read speed from our virtual disk device to 200MB/second because of the instance type we have selected for this particular node in our cluster.
Finally, in the chart above there is a final step, which Sorts the dataset produced by the first step. I included it as an example of what other, ‘less simple’ procedures can look like. SAS’ PROC SORT is an example of a step that uses memory and depends on memory limits, and only uses the disk if it realises that it can’t do what it needs to within those memory limits. It’s also an example of a multi-threaded operation. I might write more about that next time.
The huge range in the computational complexity of operations that data scientists commonly throw at their data is a key characteristic of analytic workloads. Opening up the “black box” and understanding the fundamentals of how your code uses system resources will help you build better, more efficient data jobs, particularly when building data engineering & data management pipelines.
Why should you care if your code is better or more efficient?
If you are using SAS professionally, there is a very good chance that you share your server with multiple other users, maybe even multiple teams of users. The resources on that server will be limited, and writing code that make best use of what you have available will result in faster completion times and an increase in system responsiveness and overall capacity.
The ability to look at a job profile and recognise what ‘bad code’ looks like can be extremely powerful in quickly identifying the root causes of poor system performance (and I’m not talking about the code used above, this is quite normal - I’ve seen far, far worse examples). The global observability that ESM offers lets experienced developers keep an eye on the workloads their teams are running, and perform on-the-fly root cause analysis, identifying bottlenecks and even mentoring colleagues who might not know that the code they are generating could run faster or be more efficient. At Boemska, we’ve seen this ‘collaborative analysis’ approach result in collective increases in performance that more than double the effective capacity of a platform.
If you are considering moving your analytic workload to the Cloud, a precise understanding of your resource requirements can play a huge role in both your projected cloud infrastructure costs and your confidence in those projections.
All of the major Cloud providers offer a range of instance types that are skewed towards CPU, disk or memory intensive workloads. All of the providers allow you to scale each of those instances up, with the same resource skew, almost indefinitely. Deciding which instance type to choose, or which batches are best suited to which instance type and size, is difficult to do without properly profiling your workloads. Your IT department may be able to derive some of this data from the physical hardware you’re running your workloads on now, but often, this kind of sizing results cloud infrastructure cost projections that are astronomical.
An exact knowledge of the size and skew of your workloads’ resource requirements, and consistency with which they use those resources - broken down by individual flow, job or group of users - lets you identify and select quick win candidates that can easily be ‘lifted and shifted’ to the Cloud in their current state. It can highlight candidates for optimisation that can be augmented and migrated in a second wave, and drive the decision around which parts of your pipeline need to be re-engineered before they are redeployed. It enables a precise, data-first approach to cloud architecture that can eliminate guesswork and make the move much simpler. Digital Transformation with confidence.
The discussion around cloud migration really deserves its own post, but I hope the relevance is clear.
In this post we looked at how the CPU is “fed” with data, we learned about the resource requirements of different steps and operations, and talked about how those requirements can impact overall system performance.
Next up, we’ll look at memory, multi-threading, and some examples of real-world performance optimisations related to those.