
Post-mortem of a platform engineering attempt

Posted on: October 28, 2023 at 10:00 PM

A few days ago I stumbled upon (or it was blasted in my face, I don’t really know) DevOps is Bullshit by Cory O’Daniel.

The title is quite provocative, but that’s not the important part. What resonated with me was:

Slowly your team feels the malleable nature of DevOps stagnate into rigid infrastructure as teams stop making the hard operations decisions and just bend their software around what they have.

And well, that is totally the case at my current company. Our code has to be bent and distorted to fit the infra. Oh, you got a new project, in 2023? Welp, you gotta use that Elixir 1.8.0 version from 2019, and it will run on Ubuntu 16.04. Nice, isn’t it? /s

And if that were the only issue, I would be happy. Deploying is becoming harder and harder, and our whole ecosystem gets less stable with every deployment. Which is pretty bad.

So, around February 2023, the company decided to start its plan to move from our current platform (very scary, unstable, and sedimented) to a new, Kubernetes-based one. There were a few objectives:

I was assigned to the project, along with another dev (though not full-time on it), and worked with the DevOps team (of 3, then 4 members).

And, we tried. We worked with some consultants, their contract (or something) ended, we kept working, we started running late, then super late, then we changed the scope of the project (from Kubernetes to just making things better and less restrictive, and then we would talk about Kubernetes again), and we kept falling behind schedule, and now it’s November and everything is on hold because another super-priority project came along and management decided to change things.

So, how could everything go so wrong?

Platform Engineering gone wrong

Not enough overlapping experience

Another quote to begin with:

For every operations person without software development skills, there are FORTY engineers without cloud operations skills. If you are going to build an internal platform, you’ll need experts with overlapping experience in both fields working together.

And well, it’s true. I have absolutely no knowledge of AWS. I don’t even have a developer account; I mostly know the different products and their uses from every bit of information I ingest. My skills in the DevOps world stop at Docker, CI/CD, and mostly the ‘running stuff on Kubernetes’ part; even though I’ve had the pleasure of playing with nginx, it was nothing too serious.

Basically, any time we hit an issue because something was misconfigured, or the application couldn’t get its permission to access an S3 bucket, that was already outside my expertise. (And of course it happened, lol, not with S3 but with DynamoDB, and that was the beginning of the end.) At least the other dev on the project could work on it, but he wasn’t assigned full-time.
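To give a sense of the kind of artifact such a fix lives in (an entirely hypothetical IAM policy, with a made-up table name and account ID), it’s a different world from application code:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppDynamoAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/my-app-table"
    }
  ]
}
```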

On the other hand, our DevOps team has absolutely zero code knowledge. I mean, maybe the basics, but when the health check of the application fails because the code is unable to read some git commit’s hash and timestamp (don’t ask me why we need those and why they’re not in an environment variable, I know), well, they have to ask me.
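For what it’s worth, the usual way out of that particular trap (a minimal sketch; module name hypothetical, and it assumes git is available in the build image) is to capture the commit info at compile time, so the runtime image never needs a .git directory:

```elixir
defmodule MyApp.BuildInfo do
  # Both attributes are evaluated once, at compile time, inside the build
  # image where .git and the git binary exist. The health check then reads
  # plain constants instead of shelling out at runtime.
  @commit_hash (case System.cmd("git", ["rev-parse", "--short", "HEAD"], stderr_to_stdout: true) do
                  {hash, 0} -> String.trim(hash)
                  _ -> "unknown"
                end)
  @built_at DateTime.utc_now() |> DateTime.to_iso8601()

  def commit_hash, do: @commit_hash
  def built_at, do: @built_at
end
```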

And I already have a lot of work. So things are already on fire on my end, and I have to play firefighter multiple times a day on different fronts. I became a bottleneck almost instantly, and I had to be present for almost every deployment and debugging session of the project, which takes a lot of time. Some weeks I would spend 3 or 4 hours straight in a call debugging this thing or that app in the testing cluster, and then redo it all on preprod with another application.

Maybe we also didn’t have enough devs on the project, I don’t know. But then again, you want devs who have at least a bit of knowledge of DevOps stuff, otherwise it’s the DevOps team that becomes the bottleneck.

Migrating slowly and migrating the wrong project

Migrate an auxiliary service to it quickly. Get feedback from your engineering customers. Yeah. ENGINEERING CUSTOMERS. They aren’t your team anymore. They are your business’s second set of customers, but if these customers aren’t buying it, you end up with morale problems, engineers pining for “the old way,” a boatload of debt, and a bunch of wasted time and effort.

Once again, we did it wrong. We decided to migrate a project unused by clients and devs alike. Before the migration, the latest commit was probably 1 or 2 years old.

And maybe that was a good decision at first, to test things out. But then, well, no one was using the migrated app. You can’t get feedback on the CI/CD part, or on the ‘how to configure stuff so you can deploy the app’ part. Literally zero feedback can be given.

So, you’re 6 months in, and you haven’t even deployed to prod, because we thought/were told we would migrate everything to testing first, then preprod, then prod, just to be sure everything was OK.

And you’ve got no feedback, you can’t deploy to prod, you can’t make major changes to any project (at least on main) because the changes would either be incompatible with the old platform or wouldn’t reach prod for another 6 months, so nobody uses the new platform, and voilà, the snake is eating its own tail.

So we were told to/decided to use branches and keep all the work on our side. Which means, again, no feedback, ever (and a ton of work to merge everything, because you didn’t think the branches would live for 6 months).

Changes of scope

In theory, we should have been done in May. We’re not done now, in almost-November. Around July/August, management decided that maybe we should scale the scope down. Maybe instead of migrating to the new platform, we could prepare the transition: at least add the security measures, and do some upgrades and security patches where we could.

So we had to settle for ‘good enough’. For something in between. So, even without talking about the migration to Kubernetes, a lot of the features/stuff we wanted to enable on the new platform can’t be added, because they’re deemed ‘not urgent’ and we should aim for a good-enough state first.

Can we be sure that, after reaching the ‘good enough’ step, we will move on to the ‘yes, we are proud of ourselves’ step? Considering the sad state of the platform we have today, I’m pretty sure we will settle for ‘good enough’ and stay in this hellish limbo for years to come. Probably.

The things we have done

To end this first blog post (probably) on a positive note, though!

Now we have one Dockerfile per project. That means every project can run a different Elixir or NodeJS version (we’ve got a few NodeJS apps, yeah). That’s a pretty big step.
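To give an idea (a sketch with made-up image tags, app name, and paths; the real files differ per project), each Dockerfile pins its own toolchain, and the multi-stage split is also what keeps build-time secrets out of the final image:

```dockerfile
# syntax=docker/dockerfile:1
# Build stage: this project pins its own Elixir version; another project
# can pin a different one in its own Dockerfile.
FROM elixir:1.15 AS build
WORKDIR /app
RUN mix local.hex --force && mix local.rebar --force
COPY mix.exs mix.lock ./
# The SSH key is mounted for this step only (BuildKit, built with
# `docker build --ssh default`); it never ends up in an image layer.
RUN --mount=type=ssh mix deps.get --only prod
COPY . .
RUN MIX_ENV=prod mix release

# Runtime stage: no compiler, no secrets, no .git. Add whatever runtime
# libraries the release actually needs.
FROM debian:bookworm-slim AS runtime
WORKDIR /app
COPY --from=build /app/_build/prod/rel/my_app ./
CMD ["bin/my_app", "start"]
```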

We have a somewhat good-enough pipeline. We use caching, and we have parallel jobs: for tests, for stuff that needs the app to be compiled, and for stuff that doesn’t. We measure test and doc coverage, track cross-module compilation dependencies, run a linter, run security scans on the code, and get notified if a library has a known vulnerability. We also check the formatting (this one is controversial, but I’m really happy to have slipped it in 🙏🏻).
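Roughly, the shape of it (a sketch assuming GitLab CI syntax and the usual Elixir tooling: Credo for linting, Sobelow and mix_audit for the security side, mix format for formatting; job names and stage layout are illustrative, not our exact pipeline):

```yaml
image: elixir:1.15

stages: [build, test]

cache:
  key: "$CI_COMMIT_REF_SLUG"
  paths: [deps/, _build/]

compile:
  stage: build
  script:
    - mix deps.get
    - mix compile --warnings-as-errors

tests:
  stage: test
  needs: [compile]
  script:
    - mix test --cover       # test coverage

format:
  stage: test
  needs: []                  # doesn't need the compiled app, starts right away
  script:
    - mix format --check-formatted

lint:
  stage: test
  needs: []
  script:
    - mix credo --strict

security:
  stage: test
  needs: [compile]
  script:
    - mix sobelow            # static security scan of the code
    - mix deps.audit         # known vulnerabilities in dependencies
```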

We also keep more secrets actually secret, like the SSH key no longer being present in the runtime image.

And, maybe more importantly, we are upgrading everything to Elixir 1.15.6 💪.

It has been a pretty wild ride, and I’m absolutely fed up with working on this project, but since we’re so close to the ‘good enough’ singularity point, I do want to see it through. We may also try to change how we communicate with the DevOps team a bit, just to make priorities clear and confirm which projects they are working on, which project I am migrating, and so on.

But well, that will be for next time :D

(Also, I have a ticket to create for my AWS dev account!)