The terror of a "ZERO CVE" metric and how the bureaucrats lost

Using CVE counts as a primary security metric is counterproductive, distracting teams from addressing genuine security risks and creating a culture of security theater.

Intro: Fixing those CVEs

The manager walks into the room, face red. Behind him, the head of software security fumbles with his laptop like he has seen a ghost.

Manager: Okay, everyone – listen up. As of today, we’ve started scanning our codebases and Docker images for CVEs… and we already found over a thousand.

Bobby starts muttering “We’re doomed, our security is like Swiss cheese now” before bolting from the room. I can already see where this is going. A new bureaucracy has arrived: from now on, every demo meeting will open with a report on our CVE count. Now this new number has to go down to zero and stay there. If I were the CEO, I would even spice this number game up and have a CVE fixer of the month, or better, a wall of shame where the team with the most CVEs gets a place on the wall!

So what is a CVE, you may ask? Well, thanks for asking. It stands for ‘Common Vulnerabilities and Exposures’ and it is essentially a public record of security problems. Each CVE is identified by a number and describes the conditions under which the vulnerability can be exploited and how it is best mitigated. Each one is assigned a severity score to show how dangerous it is.

For example: CVE-2021-44228, better known as Log4Shell, describes a remote code execution flaw in the Log4j logging library and carries the maximum severity score of 10.0.

So here begins the mandate: “Zero CVEs”, which seems reasonable at first. The obvious response is to start upgrading libraries.

So we start upgrading the services and libraries

We start setting up Dependabot, which floods us with pull requests, and before long we’re chewing through CI minutes at an alarming rate. In our Go projects the code compiles with most upgrades, and when a PR contains breaking changes it blows up at build time, allowing us to patch the issues before merging. After merging the PR, our automated releases kick in.
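
For the curious, a minimal sketch of the kind of Dependabot configuration this involves, assuming a Go module at the repository root; the schedule and PR limit are illustrative:

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "gomod"    # watch go.mod / go.sum
    directory: "/"                # module lives at the repo root
    schedule:
      interval: "weekly"          # batch the flood a little
    open-pull-requests-limit: 10  # cap simultaneous PRs
```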

The teams with dynamic and JVM languages are less fortunate. The Python teams catch some of the issues at test time, and mypy catches a few more. For the Java team the code compiles and the tests are green, so everything must be working. But then they deploy and find issues with transitive dependencies on the Guava library. A typical JVM problem: the build doesn’t catch binary incompatibilities.
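
When that happens, the first question is which Guava version actually wins on the classpath. A quick way to check, assuming a Maven or Gradle build:

```sh
# Maven: show every path by which Guava enters the build
mvn dependency:tree -Dincludes=com.google.guava:guava

# Gradle: explain why a particular Guava version was selected
./gradlew dependencyInsight --dependency com.google.guava:guava
```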

Most of our apps now have slight behavioral changes which we didn’t expect and didn’t catch. Sadly, no one tests external libraries to track behavioral changes, so let’s hope this doesn’t cause more issues down the line.

The bright side is that automating dependency updates forces you to improve your CI/CD infrastructure. The downside? If you don’t, you’ll descend straight into update hell. Soon enough, you’ll be drowning in artifacts, 1.0.4 through 1.0.71, all reflecting minor dependency bumps with no functional changes, except version 1.0.40, where a new feature was slipped in. So much for clean semantic versioning.

Bobby, look busy. If the bikeshedders see this, we’re in for weeks of debate about versioning. Who are the bikeshedders? The people who obsess over naming integration tests vs. platform tests and hold six months of meetings to classify test types without writing a single one. They’re different from the bureaucrats, who create new policies and chase metrics like the CVE count without measuring whether their efforts actually improve security.

Bobby: Shouldn’t we be building something for the customers? Maybe fix those missing backups so we can sleep again?

Nice idea, Bobby, but no. CVE policy says our old database must go. And worse, all services using the database need updating as well to avoid potential incompatibilities. The dream would be for each team to take care of their own app to spread the load.

Bobby: Maybe we could ask them?

Begging, you mean? Unless a manager mandates it, it won’t happen.

Bobby: How about we invent a metric to track this?

No, Bobby. That’s how bureaucracies spread. It would end up with more meetings, with all the teams staring at yet another metric. Instead, we form a dedicated strike team. We go repo by repo, patch the queries, bump the dependencies, and blast out automated pull requests labeled ‘Repo clear of SQL problems’. Everyone knows these PRs are mostly mechanical, so they get merged fast without questions. As a specialized team we can create our own CLIs and add some linting to the SQL queries; the cost is amortized over all the projects anyway. There is no need to train others, so that’s a lot of time saved, allowing us to be lean and mean, while freeing the other teams from the heavy database CVE burden.

Time to start cloning those repos, upgrading the database dependency, and checking that the tests are green.
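
Something like the following loop, a hypothetical sketch: the repository names, organization, and driver module are placeholders, and `gh pr create` assumes the GitHub CLI is authenticated:

```sh
# Bump the database driver in every repo, run the tests, open a mechanical PR.
for repo in service-a service-b service-c; do
  git clone "git@github.com:example-org/$repo.git" && cd "$repo"
  git checkout -b chore/bump-db-driver
  go get github.com/lib/pq@latest && go mod tidy
  # Red tests mean this repo needs a human; skip it for now.
  go test ./... || { echo "$repo needs manual patching"; cd ..; continue; }
  git commit -am "Repo clear of SQL problems: bump database driver"
  gh pr create --fill --label "mechanical-upgrade"
  cd ..
done
```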

Bobby: I sure hope they have some dockerized tests hitting a test database.

Only one way to find out. Deploy to the test cluster and see what catches fire.

Update hell, regressions and CVE whack-a-mole

A month later, I’ve grown a long beard. Bobby has his first gray hairs. Miraculously, no one had custom plugins. Tech debt forced us to upgrade on someone else’s schedule, but we made it.

And then comes demo day. The CVE count is down. And up. Because our shiny new database already has new CVEs. Welcome to the treadmill. The bureaucrats won this round.

Next week, we upgrade Loki. Everything looks fine: metrics are green, services start up healthy, and there are no alerts. We are about to cut a release. Then someone notices… no logs. Stop the train! Delay the release. Hopefully we will add a test that queries the logs in Grafana.
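
Even a crude smoke test would have caught it. A hedged sketch using Loki’s logcli, assuming LOKI_ADDR points at the test cluster and that the label selector matches something your services actually emit:

```sh
# Ask Loki whether any log line at all arrived in the last five minutes.
lines=$(logcli query --quiet --since=5m --limit=1 '{namespace="production"}')
if [ -z "$lines" ]; then
  echo "no logs ingested in the last 5m, blocking the release"
  exit 1
fi
```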

Another week, another upgrade. This time it is Prometheus. We have to upgrade both the internal one and the ones for each tenant. It all looks good, so we decide to ship, since we still have more things to update.

But no one checked whether the tenant Prometheus instances still retained their data. The upgrade deleted the Persistent Volumes, and there are no alerts tracking the disk usage of tenant services, since the bikeshedders never agreed on whether to add a test or an alert for it.
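
The missing alert is a few lines of Prometheus rule configuration. A minimal sketch, assuming kubelet volume metrics are being scraped; the threshold and labels are illustrative:

```yaml
groups:
  - name: tenant-storage
    rules:
      - alert: PersistentVolumeAlmostFull
        expr: |
          kubelet_volume_stats_available_bytes
            / kubelet_volume_stats_capacity_bytes < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} has less than 10% free space"
```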

Bobby: If only we had those backups.

Well, we have to pick our battles; hopefully the manager can smooth-talk these customers.

Manager: Lucky us, the SLA stated it was only meant for operational metrics, not for financial data. Too bad one of the tenants used it for settlements. But hey, at least we fixed a CVE.

If we had time for proper system health checks, we might’ve caught it. But we’re too busy chasing upgrades to work on useful safety nets. And if you don’t keep your estimates low enough to cram the work into the sprint, it gets rejected. We only got the database upgrade done by slipping it through with a manager’s blessing.

An Alternative Security Strategy: Triage the CVE

This is where I snap. Wasting a day every week fixing builds for pointless dependency bumps in the name of chasing an arbitrary metric is a farce.

Bobby: What if we made a dashboard? Color-coded, pie charts, CVE counts per team and severity. This gives us real insight into the problem.

And that’s how you become the thing you swore to fight, Bobby. Bureaucracies live for invented KPIs. Slip it into the next governance deck and watch them debate it for weeks, then write a new policy about dashboard color palettes. Bonus points if we hand it to the bikeshedders: they’ll spend six months arguing over which shade of orange means ‘critical’.

Bobby: Or… maybe we could read the CVEs?

You genius, finally something useful.

First up, from our Docker image scan: CVE-2025-37918, a kernel Bluetooth vulnerability.

Bobby: But… we don’t use Bluetooth on our Kubernetes nodes. Why is this even in our Docker images? Let’s switch over to a better base image like distroless. Then we also catch the issues related to jq, curl, and the other CLIs, since if we just upgraded them, a new CVE would pop up anyway.
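
The distroless switch is a small Dockerfile change. A sketch for a Go service, with the binary name and build path as placeholders; the final image ships no shell, no package manager, and no jq or curl for a scanner to flag:

```dockerfile
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Distroless base: no shell, no package manager, runs as a non-root user.
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```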

Next: CVE-2024-45338, showing up in Prometheus. An HTML parsing slowdown under specific conditions. Turns out it’s an irrelevant code path; the maintainers already dismissed it.

Remember Log4Shell? The real issue wasn’t the CVE, it was unrestricted outbound network access from the JVMs, allowing people to pull in random code and mine bitcoins.

Bobby: So if the JVMs couldn’t reach the internet, it wouldn’t have mattered?

Exactly.

We checked a few CVE vulnerabilities in ten minutes. It’s amazing how much we learn from them; we even started to wonder whether we have similar problems in our own code. Each CVE is a chance to uncover an entire class of vulnerabilities and strengthen the system. If we were just running on the update treadmill like a hamster, we wouldn’t have improved our security posture in any meaningful way.

Defense in Depth Makes CVEs Irrelevant

Dependency upgrades are important, but on their own, they’re no substitute for a layered defense strategy. Start by blocking internet access, locking it down with egress policies. Remove the package manager and make sure nothing runs as root in the container images. Build containers without a shell, like distroless, and use immutable nodes based on Bottlerocket. If nothing can reach the internet and disks are immutable, half your CVEs become irrelevant. We still scan images before deployment and reject only those with CVEs genuinely critical in our environment. The rest can safely wait.
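
The egress lockdown amounts to one Kubernetes NetworkPolicy per namespace. A minimal sketch that denies all outbound traffic except in-cluster DNS; the namespace name is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP    # allow DNS lookups only
          port: 53
```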

Ironically, we were not allowed to move to Bottlerocket AMIs because the bureaucrats mandated the use of virus scanners on the Kubernetes EC2 nodes. Bottlerocket is intentionally immutable and you can’t install arbitrary packages, including virus scanners. As a result we were forced to maintain a mutable, non-CIS-compliant AMI. The punchline? The virus scanner itself was outdated and already deprecated. It existed solely to feed another security metric.

And the beauty of building real layered defenses is that you win back time. You stop chasing CVE whack-a-mole and instead show up to meetings not with arbitrary numbers but with actual risk assessments: yes, these CVEs exist, but they can’t be exploited from this part of the system.

You’ll still need to be very careful with internet-facing services; those are fragile and the weakest part of the system. At a previous company we used private DNS and stuffed everything inside a VPN, which drastically reduced our exposure to the outside world. If you can’t reach a service, it becomes harder to exploit. Running subfinder against our domains confirmed it:

```sh
$ subfinder -d "restore.energy,restore.net"
www.restore.energy
www.restore.net
```

Nothing exposed. No surprises.

Conclusion

Chasing a zero CVE count through endless dependency upgrades is a productivity trap. You burn weeks on build churn, introduce new bugs, and spend engineering time achieving nothing measurable. We were forced to do major upgrades in a hasty fashion because of a CVE instead of focusing on the business needs first.

Instead, you could spend that day reading the CVEs and trying to understand which ones impact you. Fix what matters and ignore what doesn’t. Each time you address a CVE, take the opportunity to improve your infrastructure in a general way.

Security isn’t a CVE number or a simple metric. It’s whether your system is hard to exploit.

Appendix

“You can’t fight a bureaucracy with more bureaucracy.” (Boris Smidt)
