How to Define Vulnerability Remediation SLAs | Shortcuts
Hello and welcome back to Nucleus Shortcuts. I am your host, Adam Dudley, and today we’re going to discuss how to define vulnerability remediation SLAs. Our expert on the topic is a familiar guest here on Shortcuts, that’s Dave Farquhar – Senior Solution Architect at Nucleus.
Shortcuts: Can you give us a quick take on the basics of SLAs for vulnerability remediation?
Dave: Yeah. SLA means Service Level Agreement. It’s industry standard terminology for how quickly you perform an action. So, in this case, we’re using SLA to mean how quickly your remediation teams apply updates to fix vulnerabilities.
Shortcuts: Can you share some ideas about how security teams should go about setting SLAs for vulnerabilities?
Dave: The industry seems to have settled on 14 days for critical vulnerabilities, 30 days for high, 60 days for medium, and 90 days for low, as measured by CVSS. That specific combination is the one that I see most frequently. Now, that doesn’t mean that anybody’s having any success with that, and it’s extremely expensive (both in terms of tooling and human resources) to reach goals that are that aggressive. When I was a remediator, to reach 21 days for critical, 30 days for high, 60 for medium, and 90 for low across 1,000 systems, we were spending basically $100,000 a year on remediation.
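As a rough illustration of the policy Dave describes, the 14/30/60/90 windows can be written as a lookup from severity to days. This is a minimal Python sketch; the names `SLA_DAYS` and `due_date` are made up for the example and are not part of Nucleus or any scanner.

```python
from datetime import date, timedelta

# Remediation windows in days, keyed by severity (the common policy quoted above).
SLA_DAYS = {"critical": 14, "high": 30, "medium": 60, "low": 90}


def due_date(severity: str, discovered: date) -> date:
    """Return the remediation deadline for a finding discovered on `discovered`."""
    return discovered + timedelta(days=SLA_DAYS[severity.lower()])


print(due_date("critical", date(2024, 1, 1)))  # 2024-01-15
print(due_date("low", date(2024, 1, 1)))       # 2024-03-31
```

Keeping the windows in one table makes it easy to swap in longer values later if the team cannot meet these.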
Next, let’s measure your success rate. What’s your success rate look like? Are you able to meet the deadlines that you’re asking for 85% of the time? Not even going to go for 100 yet. We’ll try for 85 because 85% is something of a magic number when you’re asking human beings to do something. Speed limits are set based on what 85% of the population is willing to comply with. Civil engineers studied that, spent a lot of money studying and solving that problem. Let’s steal from that and apply it here. Now, what we typically find when we implement Nucleus with a large customer, and we define these industry standard policies, or the most common policies that we see, is that they’re agreeable. Everybody’s like, “Hey, yeah, that sounds good. Let’s go with that.” So we go and we implement it, we import your scans, and what we find is more like 85% non-compliance. Exactly the opposite of what we want.
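One way to put a number on that success rate is sketched below in Python. The definition used here is an assumption for illustration, not necessarily how Nucleus calculates it: a finding counts as compliant if it was fixed on or before its due date, as non-compliant if it was fixed late or is still open past its due date, and is left out if it is open but not yet due. The function names and sample data are hypothetical.

```python
from datetime import date
from typing import Optional

TARGET = 0.85  # the 85% compliance goal discussed above


def met_sla(due: date, fixed: Optional[date], today: date) -> Optional[bool]:
    """True if fixed on time, False if late or overdue, None if open and not yet due."""
    if fixed is not None:
        return fixed <= due
    return False if today > due else None


def compliance_rate(findings, today: date) -> float:
    """Share of decided findings that met their due date; undecided ones are skipped."""
    results = [met_sla(due, fixed, today) for due, fixed in findings]
    decided = [r for r in results if r is not None]
    return sum(decided) / len(decided) if decided else 0.0


# (due date, fixed date or None) pairs; made-up sample data
findings = [
    (date(2024, 1, 15), date(2024, 1, 10)),  # fixed on time
    (date(2024, 1, 15), date(2024, 2, 1)),   # fixed late
    (date(2024, 1, 15), None),               # still open and overdue
    (date(2024, 6, 1), None),                # still open, not yet due
]
rate = compliance_rate(findings, today=date(2024, 3, 1))
print(f"{rate:.0%} compliant, target {TARGET:.0%}")  # 33% compliant, target 85%
```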
So, if you’re succeeding 15% of the time, something has to change. It may mean more time. It may mean changing from prioritizing based on CVSS to threat intelligence. Because if I have 1,000 critical vulnerabilities, it’s not realistic to expect my remediation teams to fix that this weekend. And I can tell you from experience, if you want me to fix 1,000 vulnerabilities this weekend, you better not expect to pick which vulnerabilities. I can fix 1,000 random ones by throwing everything at the wall and seeing what sticks. If you want to pick which ones, probably not.
But if I’m using threat intelligence, I don’t have 1,000 criticals that I need to fix. I have more like 20, worst case scenario. More common numbers are like six, seven. Now, if I’ve got seven, that’s still a heroic effort to go and fix seven critical vulnerabilities across 100,000 assets in a single weekend. Maybe that’s something that’s humanly possible. So then we start talking about what metrics should I use to measure my success? What’s going to be different every single month? And then it looks like you’re changing your rules all the time and it just all becomes self-defeating.
Now, on the other hand, if you’re hitting your SLA numbers 85% of the time, any other metric that you choose to look at is going to look pretty good. So that solves the argument of which metrics do I look at. Well, they all look good. So hey, hell yeah, there’s room for improvement. We’re succeeding 85% of the time. If we want to look better, we increase our success rate, but all of them are looking reasonable. So you’re not arguing about metrics anymore. You’re arguing about how to improve your program. And 85% doesn’t necessarily have to be your number forever. It gets you close enough to 99%. You can start talking about and having honest and good faith conversations about what it would take to get more aggressive or with what it is that you’re achieving.
Shortcuts: Let’s talk about how to use the vulnerability lifetime/mean time to remediation measure with SLAs and Nucleus.
Dave: We have this functionality around measuring vulnerability lifetime and measuring SLAs and compliance with SLAs right in Nucleus. We calculate it for you every time you import new scan data. So I’ll show you how to configure it.
The first thing that you do is you go to Automation and you go to Finding Processing… We have a whole bunch of rules set up here because this is a demo environment, but yours will be blank. I recommend having about four rules, especially as a starting point. So, what we need to do next is set up a rule. By default, for severity we are using whatever it is that we’re getting from your scanner. That’s usually not going to be ideal. It’s usually not threat intelligence based, it’s based on CVSS or something CVSS-like. So I recommend that we go with Mandiant Threat Intelligence there.
So, what I’m going to do is set up a processing rule for critical vulnerabilities. We’ll go under Condition and we will pick the Mandiant Risk Rating and we’ll say “critical.” Our rules can do more than one action in many cases. This is one of those cases where you very much want to do two actions in that rule because it eliminates dependencies, makes things run faster, and it eliminates some conflicts. So, the first thing we’ll make this rule do is we will set it to critical, put in a reasoning, be a little bit more verbose than I’m being here, of course. And then we’ll go and we will set a due date – we’ll go with 14 days.
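The shape of that rule, one condition with two actions applied together, can be sketched in plain Python. This is not Nucleus's rule format or API; the `Finding` class, its fields, and `critical_rule` are hypothetical names used only to show the logic.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Finding:
    name: str
    threat_rating: str            # rating from a threat-intelligence feed (hypothetical field)
    discovered: date
    severity: str = "unset"
    due: Optional[date] = None


def critical_rule(finding: Finding) -> Finding:
    """One rule, two actions: set the severity and the due date in the same pass."""
    if finding.threat_rating == "critical":                       # condition
        finding.severity = "critical"                             # action 1
        finding.due = finding.discovered + timedelta(days=14)     # action 2
    return finding


f = critical_rule(Finding("example-finding", "critical", date(2024, 1, 1)))
print(f.severity, f.due)  # critical 2024-01-15
```

Doing both actions in one rule means the due date never has to wait on a separate rule to set the severity first.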
Now going by “Discover Date,” this will allow me to go and see what things look like right now. When you’re first starting out I recommend that you go with this, “Days From Now,” for some agreed upon period of time, maybe 90 days or something like that. Eventually you’re going to want to pivot to “Discover Date” once you are through your backlog. Now, the reason that we have “Remove All Due Dates,” this allows us to do a couple things. I can go back and I can edit my rule, remove all the due dates with that rule so I can shift from “Discover Date” to “Now” as I need to. The other thing it lets me do is run some different scenarios. Did I just set the network on fire with my policy?
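The difference between the two options comes down to which date the clock starts from. The Python sketch below uses hypothetical function names and made-up dates to show why a backlog item is instantly overdue under a discovery-based due date but gets a grace period under a from-now due date.

```python
from datetime import date, timedelta


def due_from_discovery(discovered: date, sla_days: int) -> date:
    """Deadline counted from the day the finding was first seen."""
    return discovered + timedelta(days=sla_days)


def due_from_now(today: date, grace_days: int) -> date:
    """Deadline counted from today, which gives backlog items a fresh start."""
    return today + timedelta(days=grace_days)


today = date(2024, 6, 1)
discovered = date(2023, 10, 1)  # an old backlog item

print(due_from_discovery(discovered, 14))  # 2023-10-15, already overdue
print(due_from_now(today, 90))             # 2024-08-30
```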
Let’s go and see. So we’ll go ahead and we’ll set it by “Discover Date,” and then we do “Save and Finish.” I’m not going to do that here because I already have a rule somewhere else in here that is setting criticals. So this’ll work for our purposes. Now, these rules will run automatically the next time that I import scans. But what I can also do, if I am impatient or I need results now, is click “Run Now” next to those rules. It’ll go and run and it’ll take a few minutes to churn through all of it. You’ll get a prompt somewhere around here in the UI saying, “Hey, it’s finished.” And then you know that you’re done.
So, now let me show you what the results of these look like. If I go to the project dashboard and scroll down to the middle of the screen, I see SLA Metrics, and what I’m seeing here is what I do not want to see. I want to see these green bars at 85%. This is a demo environment, so it needs to be intentionally vulnerable. So this is a little bit worse than what we typically see in the real world, but it looks halfway believable.
But if I see this, then that tells me, okay, this is an indication that we need to go and adjust the policies. Go back into your finding processing rules. Edit the rule to remove the due dates, run the rule now, try setting some longer due dates to see what’s realistic with your data. And that reminds me, I forgot to show you one other very important thing. So why is this so bad? What insight can I give you into why this looks so bad? Well, if I scroll down here to the unique high risk vulnerability metrics, this just gives me some very high level data about high and critical vulnerabilities here. Now, what you can see down here is average days to remediate. So, we have a track record here of it taking more than 200 days to fix a high or a critical. That’s why we have this sea of red up here.
So realistically, I need to start asking some questions to whoever is remediating this network. Why does it take 245 days to remediate a critical or a high vulnerability? Ask non-judgmentally so that you’ll get an honest answer, “Oh, well, we get one maintenance window per quarter.” Oh okay. That explains why it is that this is happening. But you have to be able to go and have these kinds of honest conversations about what the limiting factors are. And now you can go in and, even before you have those conversations, run some scenarios, go through and figure out, hey, what kind of policy do I have to put in place that would be green 85% of the time? Because then I can come to the conversation with that: hey, if we go with the industry standard of 14, 30, 60, and 90, we fail 92% of the time. But if we go with one year, two years, three years, four years, we could succeed 85% of the time. Why does it take us that long? What can we do to speed that up?
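The scenario exercise Dave describes can also be approximated outside the tool if you have historical days-to-remediate figures. The Python sketch below is an assumption-laden illustration, not a Nucleus feature: it looks only at findings that were eventually fixed, uses made-up numbers, and the function names are hypothetical. It answers two questions: how often would a given window have been met, and what is the shortest window that would have been met at least 85% of the time?

```python
import math


def success_rate(days_to_fix, window: int) -> float:
    """Share of past fixes that landed within `window` days."""
    return sum(d <= window for d in days_to_fix) / len(days_to_fix)


def window_for_target(days_to_fix, target: float = 0.85) -> int:
    """Smallest window (in days) that at least `target` of past fixes would have met."""
    if not days_to_fix:
        raise ValueError("need at least one remediation time")
    ordered = sorted(days_to_fix)
    needed = math.ceil(target * len(ordered))  # how many fixes must fall inside the window
    return ordered[needed - 1]


# Days from discovery to fix for closed critical findings (made-up data)
history = [5, 12, 20, 45, 60, 88, 90, 120, 200, 245]

print(success_rate(history, 14))    # 0.2
print(window_for_target(history))   # 200
```

With these sample numbers, a 14-day window would have been met 20% of the time, while a 200-day window would have cleared the 85% bar, which is the kind of gap worth bringing to the conversation.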
Now I have credibility because I’ve obviously done a little bit of analysis on the data and I’ve run some different scenarios. And being able to give these kinds of insights, that increases your credibility and credibility in this field is everything.
Nucleus: The analysis that you can do with Nucleus through the processes you just demonstrated really can enable some conversation between people that can lead to positive changes in process. That’s where the conversation moves outside of Nucleus and into the organization, where the collaboration and the process can be streamlined.
Dave: There is no tool that can replace that human component and having those conversations and collaborating. Everyone wants to fix vulnerabilities faster, but if your organization doesn’t have the resources to fix critical vulnerabilities in 14 days, you need to find out why that is and then adjust accordingly. Because what you don’t want to get into is a situation where your remediation teams disengage. If 90 days is the best that they can do, taking a hard line on 14 days is just going to result in disengagement. And vulnerability management and patch management are, I’ll argue, the two most difficult problems in security and IT respectively. And when you’re dealing with complex problems, slow and steady progress beats disengagement every time.