2 Jul 2020 · 8 mins
In Bermuda, where I live, we’re starting to open up after several months of strict restrictions. It’s weirdly liberating to be able to go to the supermarket on any day I like. Unlocking the potential of this beautiful country again is going to be tough; ours is an economy that relies heavily on a high throughput of tourists.
Fortunately, unlocking the power of an underperforming SAS® environment is far less complicated.
In this post, I’d like to take you through what a typical analysis with ESM looks like, using an example from one of our more recent customers[1]. The environment we’ll be focusing on here supports the interactive workload generated by their Enterprise Guide users, and is an 8 vCPU Windows virtual server running SAS 9.4. The SAS Administrator for this environment, who is also one of the SAS developers, was noticing that the environment was performing poorly and had plenty of complaints from other SAS users to support this theory! So, we got on a call with them to talk about what they were seeing.
One of the most useful features of ESM is the “Top 50 Heatmap”. It provides a neat, very visual way of quickly seeing what’s actually happening (or happened) on a server at any point in time. It’s a bit like a spectrogram, but for workloads. It can instantly show you what your users and jobs are doing and even highlight if they are affecting each other. In cases like the one in this post, it can also show fundamental issues with the underlying system configuration (spoiler alert!).
What we’re looking at above is a picture of the top 50 most active user sessions on that server, showing their effective CPU utilisation over the selected 5-minute timespan. Looking at the heatmap, you can quite clearly see a pattern where most of those 50 user sessions[2] - I count around 30 - are trying to do sustained work, but are really struggling. Every 30 seconds or so there is a burst of activity, where CPU utilisation increases in perfect sync - but only for a few seconds at a time. Not long after, those sessions go back to barely[3] using any CPU at all.
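As an aside, while ESM collects this data from the SAS sessions it monitors, the underlying idea of the heatmap is simple enough to sketch from raw OS data. The snippet below is only an illustration of the concept - not how ESM works - and it leans on a couple of assumptions: that the psutil and matplotlib Python packages are available, and that anything with “sas” in its process name counts as a session.

```python
# Concept sketch only: sample per-process CPU every few seconds and plot
# sessions (rows) against time (columns), darker = busier.
# This is NOT how ESM gathers its data - just an illustration of the idea.
import time
import psutil
import matplotlib.pyplot as plt

SAMPLES, INTERVAL = 60, 5                      # ~5 minutes at 5-second resolution
procs = [p for p in psutil.process_iter(['name'])
         if 'sas' in (p.info['name'] or '').lower()]   # crude "SAS session" filter

for p in procs:                                # prime the counters (first call is 0)
    try:
        p.cpu_percent(interval=None)
    except psutil.NoSuchProcess:
        pass

history = {p.pid: [] for p in procs}
for _ in range(SAMPLES):
    time.sleep(INTERVAL)
    for p in procs:
        try:
            history[p.pid].append(p.cpu_percent(interval=None))
        except psutil.NoSuchProcess:
            history[p.pid].append(0.0)

# Keep the 50 busiest sessions, busiest at the top - roughly a "Top 50" view
rows = sorted(history.values(), key=sum, reverse=True)[:50]
plt.imshow(rows, aspect='auto', cmap='Reds', vmin=0, vmax=100)
plt.xlabel('time (5-second buckets)')
plt.ylabel('session')
plt.show()
```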
On a day-to-day basis, our customers typically use the Heatmap feature to identify when a process is using a disproportionate amount of a resource and causing problems for everyone else. In that scenario, you would see a pattern similar to this one, but there would be some lines that get darker just as the others get lighter, thus highlighting the greedy process(es) causing the problem. These situations also tend to be a lot more obvious when using ESM’s disk I/O heatmaps, which highlight when disk-hungry or inefficient code is affecting the performance and developer experience for everyone else.
However, in this case, you can see that there aren’t any processes that stand out in that way. This suggests that there may be a genuine environment-wide resource availability issue at play. Looking at overall CPU usage on that node showed that there was plenty of CPU time available during those periods of starvation (there’s a node-level graph further down this post), which points to a problem with getting enough data to the CPU. In other words, the environment either has insufficient or misconfigured storage infrastructure, or it’s just not getting the provisioned disk bandwidth.
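For anyone without ESM who wants to sanity-check that same node-level question, a rough sketch along these lines will do: watch overall CPU alongside throughput on the SASWORK device and see whether the CPU sits idle while the disk tops out. The device name below is an assumption - check psutil.disk_io_counters(perdisk=True).keys() for the real one on your system.

```python
# Rough node-level check: is there CPU headroom while the SASWORK device
# is maxed out (or idle)? Prints one line per second; Ctrl+C to stop.
import time
import psutil

DISK = 'PhysicalDrive1'   # assumed SASWORK device; on Linux this might be 'sdb'

prev = psutil.disk_io_counters(perdisk=True)[DISK]
psutil.cpu_percent(interval=None)            # prime the CPU counter
while True:
    time.sleep(1)
    cur = psutil.disk_io_counters(perdisk=True)[DISK]
    cpu = psutil.cpu_percent(interval=None)
    read_mb = (cur.read_bytes - prev.read_bytes) / 1e6
    write_mb = (cur.write_bytes - prev.write_bytes) / 1e6
    print(f"cpu={cpu:5.1f}%  read={read_mb:7.1f} MB/s  write={write_mb:7.1f} MB/s")
    prev = cur
```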
Time to pick up the phone and speak to the storage team.
Note: The rest of this post isn’t based on this client’s experiences with their infrastructure team. This data is only from a couple of days ago, so we don’t know if they’ve even picked up that phone yet. Still, what follows is a fairly typical series of events based on our experiences elsewhere.
In most organizations - but particularly large enterprises - this is where the fun and games start. We’re confident that the issue is caused by insufficient disk bandwidth. However, especially as this server is virtualized, the storage team will quite likely say something like: “This is not a storage problem. You need to speak to the VMware team. Your machine is probably starved of CPU because you are running on an over-provisioned VM host. We know those guys, they’re always doing that”.
At this point, you may be in for a lengthy game of ping-pong.
However, because the SAS Administrator has ESM, they are able to confidently say that over-provisioning and CPU starvation are not the issue here. Here’s why:
Let’s look at that heatmap again. You will notice that among the sea of simultaneously suffering processes, there are some that don’t seem to be affected by this problem at all: the 5th one from the top, the one 12 sessions below that, then two below that one, then the second one from the bottom. Here is the above image, but highlighting the sessions I’m talking about:
These outliers prove that this is not a CPU starvation problem. If it were, they too would display the exact same CPU starvation phasing pattern, which they do not. These outliers actually appear to be getting almost optimal performance on an otherwise extremely slow host. Whatever they’re doing, it probably doesn’t depend on those slow disks.
Drilling in to look at the performance graphs for each of these processes showed us that, in each case, the amount of disk throughput they were using was very low, and the size of their individual SAS WORK and UTIL directories was close to zero - meaning whatever data they were reading was probably coming from another disk.
In other words, it was clear that these exceptions - these optimally-running processes - were either doing in-memory work or running computational tasks that didn’t seem to be interacting with the SAS WORK disk.
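Outside of ESM, you can approximate that per-session drill-down yourself: ask the OS how much I/O a given workspace process has done, and how much data is sitting in its WORK directory. The PID and WORK path in this sketch are hypothetical placeholders, and it assumes the psutil package is installed.

```python
# Rough per-session drill-down: cumulative disk I/O for one SAS process,
# plus the on-disk size of its WORK directory.
import os
import psutil

def dir_size_mb(path):
    """Total size of all files under 'path', in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for f in files:
            try:
                total += os.path.getsize(os.path.join(root, f))
            except OSError:
                pass                      # files can vanish mid-walk
    return total / 1e6

pid = 12345                               # hypothetical workspace server PID
work = r'D:\saswork\SAS_work1234_HOST'    # hypothetical WORK directory
p = psutil.Process(pid)
io = p.io_counters()
print(f"read  {io.read_bytes / 1e6:10.1f} MB")
print(f"write {io.write_bytes / 1e6:10.1f} MB")
print(f"WORK  {dir_size_mb(work):10.1f} MB")
```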
Now is probably a good time to remind ourselves of the rule of thumb for architecting SAS “mixed analytic workloads”. The rule says to have a SASWORK storage volume capable of at least 100MB/second of sustained write and 100MB/second of sustained read throughput for each CPU core[4].
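Applied to the 8 vCPU server in this post, that rule works out to roughly 800MB/second of sustained throughput in each direction - a figure worth keeping in mind for the next graph:

```python
# Back-of-the-envelope SASWORK bandwidth requirement for this server
cores = 8                   # vCPUs on the server in this post
per_core_mb_s = 100         # rule of thumb, in each direction
required = cores * per_core_mb_s
print(f"SASWORK should sustain ~{required} MB/s read and ~{required} MB/s write")
# -> ~800 MB/s each way
```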
With that in mind, here is the overall node-level chart I mentioned earlier. It shows CPU utilisation in red, with blue/green bars representing device-level reads and writes to/from the SASWORK disk:
Without the support of our session-level heatmap, anyone could assume from these host-level metrics that the server is underutilised, except for those occasional bursts of activity. But, because of our heatmap, we can prove that this is not the case.
What we can observe from this graph is that the sustained disk bandwidth the customer is getting from their SASWORK storage device is clearly not enough for an 8 vCPU machine (note the scale on the right-hand y-axis). However, when provided with sufficient bandwidth during those ~800MB/sec bursts, the SAS workload successfully utilises ~80% of available CPU, which is exactly the optimal performance we would expect to see. The peaks in this graph line up perfectly with the dark phases in our heatmap, and the bursting pattern seen here is typical of storage that depends on a small fast cache capable of providing the required bandwidth.
We’re only getting our required throughput while the cache is available, for a few seconds at a time - enough to get past a simple post-install infrastructure validation test, for example. At all other times, while that cache is getting flushed to disk, we get the underlying disk’s true sustained rate of throughput. That real, sustained rate is clearly not enough to feed the CPU the data it needs. That is why we see the phasing patterns in our heatmaps, and that is why the end users are suffering and complaining about unusually poor performance.
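This is also why a quick copy or a short post-install benchmark won’t catch the problem. If you want to see the effect for yourself, a proper tool like fio, run for several minutes, is the right way to measure it - but even a crude script makes the point: the early chunks land in the cache at full speed, and the later ones show the true sustained rate. The target path below is a hypothetical location on the SASWORK volume.

```python
# Crude sustained-write test: write ~20GB in 100MB chunks and report the
# per-chunk rate. A short test only ever sees the fast (cached) rate;
# running long enough to exhaust the cache reveals the sustained rate.
import os
import time

TARGET = r'D:\saswork\throughput_test.dat'   # assumed SASWORK path - adjust
CHUNK = 100 * 1024 * 1024                    # 100MB per write
TOTAL_CHUNKS = 200                           # ~20GB in total

buf = os.urandom(CHUNK)
with open(TARGET, 'wb', buffering=0) as f:
    for i in range(TOTAL_CHUNKS):
        start = time.time()
        f.write(buf)
        os.fsync(f.fileno())                 # push it past the OS page cache
        mb_s = (CHUNK / 1e6) / (time.time() - start)
        print(f"chunk {i + 1:3d}: {mb_s:7.1f} MB/s")
os.remove(TARGET)
```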
It’s often tough talking to your IT department and trying to explain why you believe there is a problem with the underlying infrastructure they’ve provisioned for you. They’ll look at the graphs they have, see CPU headroom, assume there’s nothing wrong, and tell you the problem probably lies within your application. And to be fair to them, with most other applications they manage, and with most business users they interact with, that would be the correct answer. But SAS works a little differently, and SAS users aren’t typical business users. Our clients very often find that the granular level of observability ESM provides helps them either solve their own problems, or work together with IT to almost instantly resolve issues that they’ve been working on for months - sometimes, years.
1. We’re sure our customer wouldn’t mind us talking about them, but we’ve removed any identifiable data from this post anyway.
2. In ESM, the naming convention for how Workspace sessions are identified is configurable. The configuration shown here is the default, which is a combination of the Lev and the Application Server Context of the Workspace server (SASAppUser, in this case).
3. The scale is not in the image, but a hover tooltip shows us that the darkest of those red tiles represents around 90% utilisation of a single core, which is good performance for typical workloads.
4. Google this topic and you’ll find a number of excellent SAS Global Forum papers by Margaret Crevar and Tony Brown. Boemska’s CTO Nik is also in the process of writing an in-depth series on how analytic workloads actually work, which is going to include some great information on this too. I’ll be sure to update this post when those posts get released!