Strike team calling for duty

How small, unofficial strike teams quietly solved the problems nobody else would touch.

Intro

Does your company have technical debt spanning multiple teams? Has it been lingering for many quarters? Is it still on nobody’s roadmap despite many rounds of lobbying? Do you consider it low-hanging fruit within your reach? Then consider yourself lucky: Bobby and I have the perfect solution. But before we get to the good stuff, a business announcement from our favorite bureaucrat…

The lead architect, builder of a cathedral of bureaucracy, introduces his new KPI: a count of non-compliant projects. ‘This will speed up the great migration,’ he proclaims proudly. Then he leaves the problem to each individual team. Better yet, it becomes the hottest topic at our weekly meeting, where the bikeshedders can dig into all the little details. Certified to generate maximum visibility while achieving zero outcomes.

Bobby had already stormed out of the announcement meeting to start a revolution against the company culture and its way of handling technical debt. I took a more pragmatic stance and asked the team lead for two weeks of uninterrupted work: no meetings, no standups, just me and the problem. Frame it as ‘training time’ to get buy-in if you must. Worst case we lose two weeks; best case we solve the problem.

I’ve come to call this approach a strike team: a small, highly focused group tackling the problems nobody else wants. Since it’s unofficial, it has to stay under the bureaucratic radar, avoiding the spiderweb of red tape. It moves fast by cutting waste: no handovers, no cross-team meetings, no official planning. Instead, we keep a logbook of obstacles and decisions for traceability.

The people on the strike team know the state of the problem and its challenges best. They should have the freedom to pick the solutions and tradeoffs they see fit. Otherwise, they end up in minute discussions where authority forces them into pursuing solutions they already know won’t work, wasting precious time and momentum.

Over the years, I’ve been on many such strike ops, and I’d like to share a few of them.

My stories on the strike team

Extra screens

Let us start with a light story: I often came in late and was often left without an external monitor. What started as a simple quest ended as a real lobbying job, like the movie Ikiru.

First, I asked IT support, who forwarded me to his manager. Even the manager, a person with a corporate credit card, was unable to greenlight a simple screen.

Bobby: Maybe we should just buy our own monitors? Right, and next we’ll pay for our own coffee?

Then, one lucky day, I bumped into a higher-up at the coffee machine. ‘How are things?’ he asked. Given my natural aversion to small talk and saying ‘Good and you?’, I saw it as an opportunity. I mentioned the lack of monitors and pointed him at Bobby, who was slouching over his laptop, clearly suffering the same fate.

A week later, five new monitors arrived at my desk. Soon after, people jokingly came to thank me, declaring ‘You fixed the monitors!’. I guess I wasn’t the only one suffering.

Code formatters

Now for a story I tell with deep regret: I accidentally helped lay the cornerstone of the cathedral, paving the way for my downfall. Back then, there was no architect position yet.

When I checked an already-accepted pull request, I spotted an obvious bug and rejected it. To my surprise, the only review comments were trivial: ‘Add a space between the comma and the parameter in function(a,b)’. Had nobody even tried to find the bug?

This looked like a simple thing to automate, freeing people’s time to review the code more deeply. I picked Scalafmt and created a custom config with a few small tweaks.

Bobby: We should go through the official channels. That way we all have the same formatting.

Great idea, Bobby! Let’s book a meeting. Several man-days’ worth of meetings later, we still had no decision on the exact format. I should have been wiser and used the default config. The person destined to become the cathedral builder scrutinized every single option in pursuit of the ‘perfect’ config. After I showed a few code examples side by side, he finally conceded, and we had our standard. We then posted the config in a channel and told everyone to use it.

In my view, this is where the future architect earned the credibility to start building his cathedral. He often joined such initiatives, made them a lot harder, and then claimed victory.

But a bureaucracy that doesn’t enforce its rules rigorously, with the heavy hand of infrastructure, is one without teeth. We went through all the hassle of deciding on something as trivial as a formatter config, but we never rolled it out across all projects. Unknowingly, I had become a bikeshedder, feeding the bureaucracy. On the other hand, people stopped nitpicking, even though nothing had really changed.
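
In practice, the ‘heavy hand of infrastructure’ is nothing more than a CI gate that fails when code isn’t formatted. A minimal sketch in Python; the formatter command is a placeholder for whatever check-only mode your formatter offers:

import subprocess
import sys

# Placeholder: scalafmt's check-only mode for a Scala codebase; swap in the
# equivalent flag of whatever formatter your project uses.
FORMAT_CHECK = ["scalafmt", "--test"]

if subprocess.run(FORMAT_CHECK).returncode != 0:
    print("Formatting check failed: run the formatter and commit the result.")
    sys.exit(1)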

A few years later, it had become an industry-wide trend to avoid such silly discussions:

Gofmt’s style is no one’s favorite, yet gofmt is everyone’s favorite. Rob Pike (from go-proverbs.github.io)

Today, formatters like Black ship with minimal config. Many tools, like Grafana Alloy, now ship with an fmt subcommand.

GRPC client migration

Before we start, let me explain what binary incompatibility is. Java distributes libraries as precompiled JARs. When you call a method from a library, the method is looked up and linked at runtime. As a consequence, when a method signature or a class name changes, the JVM can no longer find the right method and throws a NoSuchMethodError.

This example shows how a simple change can result in binary incompatibility:


def add(x: Int, y: Int): Int = x + y

// Adding a parameter with a default value is source-compatible but not
// binary-compatible: already-compiled callers still look up the old
// two-argument method.
def add(x: Int, y: Int, z: Int = 0): Int = x + y + z

When our own code is compiled directly against the new library, the compiler either adapts to the new signature or fails loudly at compile time. When the change arrives through a transitive dependency, however, the already-compiled code breaks at runtime. In the transitive case, it is almost impossible to patch the code on the client side. You could try to overwrite the class on the classpath, but that is brittle and a hacky solution at best.

So when were we challenged by binary incompatibility? Each project published its own GRPC client JAR. One client was published against a newer GRPC runtime, which was binary-incompatible with the old one.

To put out the fire, we downgraded the new client. But now we had another migration on our hands: either upgrade all the clients to the new runtime at once, or keep the version frozen indefinitely.

Imagine missing a deadline over such a triviality. To me, it seemed like a problem waiting to be fixed, so I searched around for tools to help. Back then, there weren’t any tools solving this kind of problem, so we had to build our own. Today there is buf.

My first attempt was to write a Gradle script that published a ‘GRPC’ package containing only the Protobuf definitions. This allowed us to build the clients on the consumer side with the same GRPC version, and each project could pick the code generator it wanted to use, allowing a gradual migration to the new runtime.

But the tradeoff didn’t feel right. All the complexity was pushed to the consumer. Worse, it became impossible to build a higher-level library on top of existing GRPC clients.

For my second attempt, I downloaded all the ‘GRPC’ packages and added them to a single repo, which would build all the clients of the company.

I made a pragmatic decision and published all clients with the same version. The scheme I used was a date version: YYYY.MM.DD.seconds. The ‘right way’ would have been a semantic Major.Minor.Patch version for each client, which would let us mark whether a client was backward compatible. But it had multiple issues: How do we mark which clients need to be released? Who decides the version bump? Which versions of the clients share a compatible runtime version? The date-based version was elegant: all clients with the same version are compatible, and the other questions were simply sidestepped.
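
As a rough sketch of the idea (not the original build code), deriving that shared version string is trivial:

from datetime import datetime, timezone

def calendar_version(now=None):
    """Build a YYYY.MM.DD.seconds version shared by every client artifact."""
    now = now or datetime.now(timezone.utc)
    seconds_of_day = now.hour * 3600 + now.minute * 60 + now.second
    return f"{now.year:04d}.{now.month:02d}.{now.day:02d}.{seconds_of_day}"

print(calendar_version())  # e.g. 2021.03.15.54012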

Since the company was using Scala, I took the opportunity to cross-build the clients for multiple Scala versions.

In retrospect, if we had been using another language like C, Rust, or Go, we would have published a single package and let the compiler strip away the unused code. Those languages would have caught the incompatibility at compile time rather than at runtime, because they build the whole program from source.

Sometimes the ‘right way’ stands in the way of ‘good enough’. The theoretically ‘right’ way of compiling clients on the consumer side would have complicated every build, especially considering we had Python projects, which would then have been forced to publish their Protobuf definitions as JARs. By centralizing the issue, we only had to tackle the problem once, at the cost of semantic versioning.

The Scala IO wars

We were in the midst of the microservice era, which involved a lot of HTTP calls. Because threads in Java were expensive, it was important to avoid blocking them. This was before virtual threads existed.

To keep costs in check, threads had to be shared between HTTP calls. What followed was a short period known as callback hell: each call accepted a function to continue the computation after the call finished.


function getAdminUsers(handle: function(users: Seq[str])) {
  http.get("http://server/users", function(err, users) {
    for (user in users) {
      http.get("http://server/access-level/${user}", function(err, access) {
        if (access == ADMIN) {
          ...
        }
      })
    }
  })
}

Scala was the cool kid in town and solved this problem in a better way: it added a concept called a Future, a value that will exist at some point in the future. This allowed for a much cleaner version of the code above.


function getAdminUsers(): Future[Seq[str]] {
  return http.get("http://server/users").flatMap(users =>
    Future.sequence(
      users.map(user => http.get("http://server/access-level/${user}"))
    ).map(accessLevels =>
      users.zip(accessLevels)
        .filter((user, access) => access == ADMIN)
        .map((user, _) => user)
    )
  )
}

The Future system was eager, which made it tricky to use correctly. My example above fires all the requests in parallel and thus hammers the server. So, like mushrooms, dozens of IO libraries sprang into existence: Cats IO, ZIO, Monix, Akka, and ScalaRx, all trying to replace plain Futures and callback hell. This was the start of the IO wars. Each new solution was incompatible with the others, and they all had different tradeoffs and ways of expressing the code.

They all came from different periods. Akka, inspired by Erlang, used actors and predated Kubernetes. ZIO and Cats IO were pure monadic handlers.

Bobby: Oh no, not the monadic handlers. If he starts on those, we won’t be done by tomorrow.

Balkanization was a common pattern in the Scala ecosystem. There were many JSON libraries: play-json, spray-json, argonaut, json4s, upickle, zio-json, jackson-module-scala… And don’t get me started on the number of HTTP libraries: akka-http, http4s, zio-http, and all of the already existing Java HTTP servers.

Scala was such a big language that you could lose yourself in learning all the different libraries. The brave among us would venture deep and learn macros. The wicked even considered compiler plugins.

I suspect the lead architect and his darn bureaucracy loved this. Each team could run a separate experiment and pick a unique stack. After a few years, we would have a victorious team with the best stack, and all the other teams would have guaranteed job security rewriting their code. Maybe it was accidental, but to me it looked like pure evil.

I didn’t sit on the sidelines: I took a side in the Scala wars and added a Monix backend for the GRPC clients. With its observables, it fit the model of GRPC streams nicely. Bobby was on the same page. Ironically, the bureaucracy didn’t make any decision at all, thereby accidentally choosing the only right one: doing nothing until the problem resolved itself.

Scala could have avoided its shrinking user base if it had taken one simple step: standardize a solution to the IO problem. While at it, throw in a standard logging library, test library, and HTTP library. Hell, it seemed easy enough to make a JSON library in Scala, so why not add one to the standard library?

Ironically, Scala used to ship XML support, even with dedicated syntax, so in the past its view was clearly different.

Eventually, the industry caught up. JavaScript came with the now wildly popular async/await model, making the above code look like blocking code with some awaits sprinkled in:


const users = await http.get("http://server/users");

const admins = [];
for (const user of users) {
  const access = await http.get(`http://server/access-level/${user}`);
  if (access == ADMIN) admins.push(user);
}

Golang had the concept of goroutines: they are cheap, so blocking code is the default. This model was later adopted by Java with its virtual threads.


users, _ := getUsers()

admins := []User{}
for _, user := range users {
  access, _ := getAccess(user)
  if access == Admin {
    admins = append(admins, user)
  }
}

return admins

Essentially, the structured concurrency model won.

When you, as a company, pick a very unopinionated language and don’t take a strong stance, you import the whole ecosystem and end up hunting for the ‘best’ solution yourself. When the community doesn’t do the hard work of standardizing, you will have to do it. Otherwise, you end up with separate teams all doing their own thing, making it very hard to switch teams without learning yet another framework. That time could have been spent building libraries for the things that actually mattered.

Wake-up call: the stuck Kubernetes migration

When I walked into the Danish office after my flight, I found both of my colleagues in war mode. No greetings, no nothing; they were intensely focused on their computers. Kubernetes was failing to spin up new nodes in production. Our node images were pulling extra packages at startup. Obviously, they shouldn’t have been doing this, but it had been working for years.

That day, however, our Debian version had been archived. We found a workaround: we updated the startup script to point at the Debian archive repository. The fire was put out and we could spin up new nodes again, but we had to find a better solution.

How did we end up here? A Debian release takes a long time to get archived. We were migrating from Kubernetes 1.11 to 1.22, and every application had to be updated to support the new version.

The migration was high on the agenda, but it was spread across multiple teams and always competing with business goals. There were coordination meetings, but essentially the problem was handed over to the teams without much tooling or support.

My team lead and I agreed we’d had enough of this. Time to bring out the big guns and get the migration on track. I went to investigate the open problems and came up with this list:

1) Do we have to do a live migration?

2) How do we make sure the helm charts are valid for both 1.11 and 1.22?

3) Can we automate the migration?

While chatting with the infra team of the Danish office, we quickly hit on a great idea to break the status quo: what if we deployed all services to the new cluster, but scaled down to 0 replicas? Then we could already validate all the application configs and know whether they worked on the new cluster. This gave us a burn-down list of things to migrate.

The solution paved the way for simplifying other things. The migration itself became as simple as scaling down all the services on the old cluster and scaling them up on the new one.

1) Do we have to do a live migration?

Sadly, the answer was yes, so we couldn’t do a big reboot. Instead, we had to find a way to make both Kubernetes clusters talk to each other.

Could we stitch the two clusters together, so the services of one cluster would be visible to the other? Yes, but only from 1.16 onward. Maybe we could rewrite the service DNS records to an ingress DNS name in CoreDNS? Nope: the internal service ports differed from the ingress ports, and DNS only resolves hosts, not ports. It could be done with SRV records, but those were quite niche. The only option left was to use real DNS entries and run all cross-cluster traffic over the ingress.

Our services were not very ‘Kubernetes friendly’ and used config files with templating. To help with the migration, I wrote a linter that detected when a service was using a bare Kubernetes service name instead of the DNS record. This gave us a second list of ‘things to fix’.

Making this linter wasn’t as easy as it seems. We had to replicate what our deployment pipeline did: stitch together multiple repos to get the templates filled in.

To help the developers, I made the ‘devtool’: a tool with some generic commands to resolve the configs and lint them.

Developers could install it with pipx and resolve their templates locally to make sure their config worked before deploying, something we had been unable to do before. If I were to redo the devtool today, I would write it in Golang; a single binary is easier to distribute than Python.
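
To give a feel for the core check, here is a rough sketch. The real linter first resolved the templates across several repos; the rule below, flagging dot-less hostnames in resolved configs, is a simplified stand-in:

import re
import sys
from pathlib import Path

# Matches the host part of an http(s) URL in a resolved config file.
URL = re.compile(r"https?://([^\s/:'\"]+)")

def lint_file(path: Path) -> list[str]:
    """Flag URLs that use a bare Kubernetes service name (no dots), which
    will not resolve once the call has to cross clusters via the ingress."""
    findings = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for host in URL.findall(line):
            if "." not in host:
                findings.append(
                    f"{path}:{lineno}: bare service name '{host}' "
                    "should be a full DNS entry"
                )
    return findings

if __name__ == "__main__":
    problems = [f for arg in sys.argv[1:] for f in lint_file(Path(arg))]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)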

2) Making all helm charts cross-Kubernetes compatible

The whole config problem would have been a lot easier with a GitOps setup. We could have written a transformer that reads in all the resources and scales them to 0. We could also have inspected the config files and even updated them in place: branch main/1.11 would hold the old cluster’s files, and branch main/1.22 would hold the generated new ones. But our old Kubernetes version didn’t support ArgoCD, so we had to stick with our current setup: Bamboo’s push-based deploy system.

Inspired by the GitOps approach, we added a post-render hook to our helm charts to set the scale, which we baked into the deployment scripts I maintained (more about hooks).
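
A Helm post-renderer is simply an executable that receives the fully rendered manifests on stdin and prints the (possibly modified) manifests to stdout. A minimal sketch of the scale-override idea; the file name, environment variable, and details are illustrative, not our actual deployment scripts:

#!/usr/bin/env python3
"""Helm post-renderer sketch: override the replica count of every Deployment.

Helm pipes the rendered manifests to this script on stdin, e.g.:

    helm template my-service ./chart --post-renderer ./set_scale.py

The replica count comes from an environment variable so the deploy scripts
can park a service at 0 replicas or scale it back up later.
"""
import os
import sys
import yaml  # PyYAML

REPLICAS = int(os.environ.get("TARGET_REPLICAS", "0"))

def set_scale(doc):
    if isinstance(doc, dict) and doc.get("kind") == "Deployment":
        doc.setdefault("spec", {})["replicas"] = REPLICAS
    return doc

docs = [set_scale(d) for d in yaml.safe_load_all(sys.stdin) if d is not None]
yaml.safe_dump_all(docs, sys.stdout)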

Luckily, I was already maintaining the deployment scripts. I had written some quick-and-dirty scripts that updated Bamboo’s deployment state directly in its database. My team lead covered my back and helped me create a backup first.

3) Can we automate the migration?

Bobby was pushing for an official project to automate the migration, and the debate could get heated. The cathedral builder insisted on a manual migration and was already mapping out the ‘startup’ order of the services. The teams were firing on all engines to ‘click-ops’ their way out of this mess. Bobby saw this as an excessive risk: people would make mistakes, and the process would be hard to reproduce.

After closer investigation, there seemed to be multiple types of services, each with its own migration strategy. The read-only services could run on both clusters at the same time. The write instances had to run on one cluster or the other. Then there was the category ‘others’: services exposed to external systems over a VPN connection. We decided to migrate those manually, since they needed coordination with external parties, something the bureaucracy handles perfectly.

Finally, I wrote the migration CLI. It did the migration in several stages. First, it deployed exactly the same versions of the services on both clusters. Then it scaled up the read-only apps on the new cluster. Afterwards, the real migration would start: shutting down the write services and switching over the DNS entries. At this point everything still ran in ‘dry run’ mode, so we could already investigate the behavior of the tool.

To simplify the CLI tool, the services were hardcoded, and whenever I spotted a ‘problem’ I could add a special case for that specific service. For example, some services didn’t re-resolve the host after migrating and had to be restarted.
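
To give an idea of the shape of such a tool, here is a heavily simplified sketch. The service names, stage names, and special-case table are hypothetical, and the real CLI drove the clusters instead of printing actions:

#!/usr/bin/env python3
"""Sketch of a staged migration CLI with a dry-run default."""
import argparse

# Hardcoded inventory: service name -> category that drives its strategy.
SERVICES = {
    "users-api": "read-only",
    "billing": "write",
    "partner-gateway": "other",  # VPN-exposed, migrated manually
}

# Per-service quirks discovered along the way.
POST_MIGRATION_FIXES = {
    "billing": ["restart"],  # did not re-resolve the host after the switch
}

def run(action: str, dry_run: bool) -> None:
    prefix = "[dry-run] " if dry_run else ""
    print(f"{prefix}{action}")

def migrate(stage: str, dry_run: bool) -> None:
    if stage == "mirror":
        for svc in SERVICES:
            run(f"deploy {svc} to the new cluster at scale 0", dry_run)
    elif stage == "read-only":
        for svc, kind in SERVICES.items():
            if kind == "read-only":
                run(f"scale up {svc} on the new cluster", dry_run)
    elif stage == "switch":
        for svc, kind in SERVICES.items():
            if kind != "write":
                continue
            run(f"scale down {svc} on the old cluster", dry_run)
            run(f"switch DNS for {svc} to the new cluster", dry_run)
            run(f"scale up {svc} on the new cluster", dry_run)
            for fix in POST_MIGRATION_FIXES.get(svc, []):
                run(f"{fix} {svc}", dry_run)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="staged cluster migration")
    parser.add_argument("stage", choices=["mirror", "read-only", "switch"])
    parser.add_argument("--execute", action="store_true",
                        help="actually apply changes; the default is dry-run")
    args = parser.parse_args()
    migrate(args.stage, dry_run=not args.execute)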

Then came test day on the development cluster. I got half a day to try out the CLI tool live. Most of the applications worked and the migration was quick. I noted down all the services that didn’t want to boot up and grabbed their logs for later investigation. Now we had our last burn-down list.

Since I was already on a dedicated strike team, I could temporarily join another team for a few sprints to fix their deployments and services. I did the grunt work; they only had to validate that everything still worked.

To make sure the CLI tool and the applications worked, I did a few more migrations after office hours. The migration was reversible, and I could gather statistics on how long it took. At some point, the CLI tool and the applications worked as expected. Our work was done, and the strike team could disband. The technical problem was fixed.

Sadly, the business found even a little downtime too risky, so the migration ended up in limbo for more than half a year. If only somebody had had the balls to push the big red button and finish it. The strike team concept worked: it made the migration technically possible. But we were powerless against the human problems. I resisted the urge to run the migration in the middle of the night and handle the fire the day after; my team lead would never have approved such a surprise attack.

Why strike teams work

In large systems, some problems are everyone’s concern but no one’s responsibility. These issues typically get stuck in limbo. A strike team works because it seizes that neglected ownership. They become even more effective when they maintain a shadow roadmap of problems, building solutions not just for today, but also for tomorrow.

Often, these projects are labeled as risky, difficult, or time-consuming. People are accustomed to fixing them manually. But with the right solutions, they can be automated, creating new standards that simplify subsequent migrations.

Short, fast feedback loops are possible because a strike team has full authority over the solution, allowing it to set up custom tooling. Since that cost is amortized over the entire project, it becomes like any other coding problem, just with a tight scope.

The Fatal Flaw

Strike teams have a fatal flaw: they make ‘hard’ problems look easy. Because the problem was fixed under the radar by a single person or a small team, it didn’t get a lot of attention. They preemptively fixed problems that were not even on the roadmap yet, so there was no recognition and thus no promotion. You should only go down this route if you want to learn, or out of pure frustration. Not to build a career.

They don’t solve human problems, and some cultural forces dislike this style of project. They prefer to do things the ‘right way’, while strike teams just want things ‘fixed’.

On the ‘right way’ end of the spectrum are languages like Haskell, Scala, Rust, and ML. On the other end are the simple and hacky ones like C, Python, and Golang. The former are powerful tools that try to maximize your freedom of expression; as a developer, mastering such a complicated tool is rewarding in itself. The latter make the choices for you and provide standardized solutions; in exchange, you get a system that lets you focus on solving the real problem.

Playing chess with my future self

When you are on a strike team, don’t think of it as a one-off mission; you will end up on a few more. Try to avoid short-sighted decisions; they only lead to new technical debt. The choices you make today will shape the freedom you have tomorrow. It is essential to keep a roadmap of related issues.

The tooling you build during the mission needs to be polished afterwards so it can be reused by other colleagues. For example, when building a new deployment system, make sure there is a CLI to dry-run the deployment, or even to press the buttons from a script. Eventually, you’ll end up maintaining a ‘devtool’ that makes many people more efficient.

If you create a linter, make sure people can turn off rules, even on a per-file basis. Inevitably there will be false positives, and then the linter just stands in people’s way.

In other cases, add escape hatches so features can be disabled, but make them explicit. You can then track down the missing features and bugs in your tools.
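
As a tiny illustration of such an escape hatch (the pragma format below is invented), a linter can honor an in-file comment that disables a rule, while still reporting that it did so:

import re
from pathlib import Path

# Hypothetical pragma: "# devtool-lint: disable=<rule-name>" anywhere in a file.
DISABLE_PRAGMA = re.compile(r"#\s*devtool-lint:\s*disable=([\w-]+)")

def disabled_rules(path: Path) -> set[str]:
    """Collect explicitly disabled rules so the skips stay visible."""
    rules = set(DISABLE_PRAGMA.findall(path.read_text()))
    for rule in rules:
        # The escape hatch is explicit: every skipped rule shows up in the output.
        print(f"note: {path} disables rule '{rule}'")
    return rules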

Make sure you have plenty of time to maintain the tools, so you can quickly adapt them to new needs. This keeps your users happy; otherwise, they will stop using the tools.

The aftermath

The bureaucracy had an adverse effect. Consensus remained elusive, as teams went their own way, coordinating in the shadows to adopt new tooling, only to fragment the system further. Alliances were fleeting, shifting with immediate needs. They had their own “strike teams” too, but crucially, they didn’t make the big push to standardize.

Essentially, the company had absorbed all the cultural problems of Scala itself. This left us with flamewars over whether to aim for “better Java,” embrace Actors, or go all-in on functional programming. Agreement was impossible, and nobody with authority made the hard decisions. Strike teams weren’t effective at resolving these political problems; all they did was prolong the inevitable before someone decided to abandon Scala altogether.

After many projects as a strike team, my philosophy shifted from a ‘maximalist’ (‘do the right thing’) approach to a pragmatic, trade-off mindset. I even moved from Scala to Python and eventually Golang. I realized I couldn’t go back to a “normal” Scala team after tasting the high-agency, high-leverage world of strike teams.

Eventually, the bureaucracy truly caught up with us. The cathedral builder decided he wanted to be part of the new infrastructure initiative. With him at the helm, I knew the informal, high-trust dynamics that made our work possible wouldn’t survive. We’d be stuck endlessly debating solutions we never even asked for.

Bobby: My revolution failed.

Yes, Bobby. You can’t change a company with a grassroots movement when leadership is invested in maintaining the status quo. Our opinionated, integrated tooling was never truly supported or accepted by them. It was seen as pragmatic and useful, but not something they genuinely wanted to invest in.

The people who quietly told us they liked what we did, those who saw the same problems, never stepped forward to tip the balance. They stayed in check, kept their heads down. It wasn’t their fight.

And that’s how these things die. Not in some dramatic blow-up, but in silence. In the end, it feels lonely; you get tired of pouring your energy into it. At that point, it was time to move on.

Thank you

This entire journey, the triumphs of Boris, and the heartbreaking ‘death’ of Bobby became my most valuable, albeit painful, education. It shifted my entire perspective on engineering, organizations, and the nature of real change.

If you made it this far: thank you, it was a wild ride. This post was mainly for myself; it gave me valuable insight into my own career, and I hope you learned something from it as well. My future posts should be less personal and more technical.

Appendix: stories that were cut:

  • Bureaucracy at its best: make higher management fill in forms for a new tool like the rest of us. For one higher manager, the hours spent filling in the forms were worth more than the license cost.