Follow your Costs

Nov 3

Most people think of costs as an area of profit loss. They definitely are that, but I tend to think of them as an indicator of weakness. As an example, if I have a Kubernetes cluster and it costs $100,000/year to run, is that good or bad? The answer is: it depends. If I’m running a Bagillion (technical term) services and handling 2 Gazillion (technical term) requests per second, then $100k seems pretty cheap. If I’m running 60 rps on 1 service, $100k means I might need to fire my infrastructure team. That’s what I mean by using cost as an indicator of weakness. You can look at your costs to find areas where you could save money, but you should also look at them to understand where you don’t have expertise. Those areas are likely causing you to lose out on both potential savings AND revenue.
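To make the “it depends” concrete, here is a minimal sketch that turns an annual bill into a cost per million requests. The $100k and 60 rps figures come from the example above; the 20,000 rps figure is a made-up stand-in for “a lot of traffic”, and the function name is just for illustration.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def cost_per_million_requests(annual_cost_usd: float, avg_rps: float) -> float:
    """Return dollars spent per one million requests served."""
    if avg_rps <= 0:
        raise ValueError("avg_rps must be positive")
    requests_per_year = avg_rps * SECONDS_PER_YEAR
    return annual_cost_usd / requests_per_year * 1_000_000


# Same $100k/year bill, two very different traffic levels.
low_traffic = cost_per_million_requests(100_000, 60)
# 20,000 rps is an assumed number, not one from the article.
high_traffic = cost_per_million_requests(100_000, 20_000)

print(f"60 rps:     ${low_traffic:.2f} per million requests")
print(f"20,000 rps: ${high_traffic:.2f} per million requests")
```

The same bill works out to roughly $53 per million requests at 60 rps and about 16 cents per million at the assumed higher volume, which is why the raw dollar amount alone tells you very little.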

How much should my Stack Cost?

This is obviously dependent on your exact setup, but in general, unless you are running an incredibly complex stack with thousands of requests per second, you shouldn’t be spending over $1M/year. For hundreds of requests per second, it isn’t quite 1:1 to say you should be spending less than $100k/year, but you should be thinking along those lines. You might be looking at your AWS bill and thinking: “Well shit, we are way over that.” The good news (bad news?) is that this happens all the time. The most common areas with massive overspend are logs and metrics. I’ve yet to see a company not crazily overspend on those initially. The next most common are storage services like ECR (Docker Hub for AWS) and S3. Those usually aren’t crazy expensive, but they do start to add up, and they take 10 minutes to trim down. Getting a handle on those common areas is almost always a free way to save 10-20% (or more) on costs with very limited work. After that, you’ll have to take a look at your services and why you run so many, and why in the world your database is a 16xlarge when you handle 300 RPS at peak. Once you start diving into these changes, you will probably start to ask yourself: “Why in the world does my stack cost even close to this much? … And also, you mentioned losing out on profit?”
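A rough way to start is to rank your bill by line item and see how much of it sits in the usual suspects named above. This is only a sketch: the category names and dollar amounts below are placeholder values invented for illustration, not real billing data, and you would substitute an export of your own bill.

```python
# Hypothetical monthly line items in USD (placeholder numbers only).
bill = {
    "compute": 40_000,
    "logs": 22_000,
    "database": 18_000,
    "metrics": 13_000,
    "object_storage": 4_000,
    "container_registry": 3_000,
}

# The areas the article calls out as common overspend.
USUAL_SUSPECTS = {"logs", "metrics", "object_storage", "container_registry"}


def rank_line_items(items: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (name, cost, share_of_total) sorted from most to least expensive."""
    total = sum(items.values())
    ordered = sorted(items.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, cost, cost / total) for name, cost in ordered]


def usual_suspect_share(items: dict[str, float]) -> float:
    """Fraction of the bill spent on logs, metrics, and storage."""
    total = sum(items.values())
    flagged = sum(cost for name, cost in items.items() if name in USUAL_SUSPECTS)
    return flagged / total


for name, cost, share in rank_line_items(bill):
    marker = "  <- look here first" if name in USUAL_SUSPECTS else ""
    print(f"{name:20s} ${cost:>8,.0f}  {share:5.1%}{marker}")

print(f"Usual suspects: {usual_suspect_share(bill):.0%} of the bill")
```

If the flagged share is large, that is where the cheap, quick trimming is likely to be before you touch service count or database sizing.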

How did we get here?

I will get to the profit part, but I want you to understand the what and why of costs first. Companies in the startup phase build stuff to get to production, and worry about cost later. After that, the problem of success takes hold and new features outweigh managing costs. Then things really start to spiral as people think of costs as an area of profit loss, rather than noting that those high costs mean they have areas they really need to fix up. Essentially, every area you pay a high amount for is an area where you don’t have true focused expertise. When I say focused, I mean you have a dev who does this every day, not only when there is a crisis. The outcome is that you don’t notice that your infra cost is high, because you don’t have the right people, or the right number of people, to build it. Without these people, your company will continue to lack the understanding of what it should cost. The same goes for every service or app you have. Also note that not having the expertise to understand costs likely means you didn’t have the expertise to build things right, in a way that keeps feature development fast and easy going forward.

This is where it hits your revenue. I’ve done multiple re-platforms and major refactors to this point, and for every one of them, not having the right setup in place cost the company millions of dollars in potential revenue. If I (really the teams I was on) could have been tasked with developing new features on a solid foundation, rather than rebuilding the foundation, that alone would have been worth a lot. Add to that all the teams that had to move over to the new foundation, and you can see how you could massively increase your revenue by building right from the start. Very successful companies spend millions of dollars to gain back the efficiency they have lost, but less successful companies never have that opportunity. All because the initial platform they built didn’t support their long-term growth.

How do I fix this?

In general, everything in tech is very simple. I don’t mean that to say the job is simple, just that putting buttons on a screen and data in a database is a solved problem. The unsolved problem is just your specific product. The way you fix this is by converting your code to an ELI5 (explain like I’m five) setup. I bet you can already think of a high-usage area of your code that only 1 person really understands. That is where you start. Not only do you need to get away from the single subject matter expert (SME), you will probably notice that is also where the costs are high. And the funny thing is they are high in both the money you give AWS and the cost of adding a feature or fixing a bug. Those two are almost always linked, as complicated code isn’t conducive to being cheap in any area. It’s also very, very rarely needed. Unless the millisecond you are saving with that crazy assembly-looking thing you did is worth a ton of money, don’t do that. Cut the abstractions, have your SME make a diagram with a workflow of how things work, and then spend some time fixing that up. The payoff of moving your codebase to an ELI5 setup is that you not only allow more developers to work on it, you also make it easier for AI to be right when it generates code for you.

When you are moving towards your ELI5 setup, you are going to ask questions like:

Why are we using ElastiCache as a queue rather than something like SQS? (cost is in the 10x range for making this mistake)
Why are we storing logs in the DB? (cost is heavy here as the query times are terrible, and with enough traffic your DB will be several times larger than it should be)
Why are we logging every client request in Datadog? (You probably have alerts tied to this, and need to swap them to metrics to save about 10x here)
I’m paying for both CloudWatch logs and Datadog logs? And it costs how much? (You should be using Fluent Bit to go direct to Datadog, also stop logging success cases!!!!)
Why do I only have 1 person who knows how our 5 most important things work? (You should be adding expertise to more people, and you might also start asking why the person implementing isn’t making sure people know how these things work in the first place)
Why am I running 100 instances for 300 rps? (That is 3 requests per second per instance. You probably have a misconfigured server setup, or an incredibly bloated app. Fix this now or it is actually going to get worse).
Fivetran costs how much?!?! (Invest in Kafka Connect, and upgrade your Data Engineering org so you stop paying 100x (you read that right) what you should be for data pipelines.)
Snowflake costs how much?!?! (But for real, if these costs are crazy high you need to upgrade your Data Engineering org)
My RDS cluster looks asleep, but if I scale it down my app performance is garbage. (That isn’t a question, but the answer is that you have some really bad tables and queries you need to fix)
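The “stop logging success cases” and “swap them to metrics” advice above can be sketched with Python’s standard logging module. This is only an illustration under a few assumptions: the `status` field on each record, the `DropSuccessFilter` and `ListHandler` names, and the in-process `Counter` standing in for a real metrics client are all hypothetical and not part of any vendor’s API.

```python
import logging
from collections import Counter

# Hypothetical stand-in for a real metrics client (assumption for this sketch).
request_counts = Counter()


class DropSuccessFilter(logging.Filter):
    """Count every request as a metric, but only let failures reach the log."""

    def filter(self, record: logging.LogRecord) -> bool:
        status = getattr(record, "status", None)
        if status is None:
            return True  # not a request log, leave it alone
        request_counts[status // 100] += 1  # 2 for 2xx, 4 for 4xx, 5 for 5xx
        return status >= 400  # keep only client and server errors


class ListHandler(logging.Handler):
    """Collects records in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("requests_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(DropSuccessFilter())
logger.addHandler(handler)

for status in (200, 200, 204, 500, 404):
    logger.info("request finished", extra={"status": status})

print(f"log lines kept: {len(handler.records)}")
print(f"requests counted by class: {dict(request_counts)}")
```

Every request still shows up in the counts, so alerts can be driven from the metric, while only the failures are shipped and paid for as log lines.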

These are nowhere close to all the questions (and non-questions) you might have, but you also might notice that the answer to your question is more questions. In my experience, a 5-year-old has more questions than you have answers, which is why it is best to use their mentality as a North Star. Just start asking questions about the stuff that is high cost (in both money and development time) and find your weaknesses. Some of them are just collected cruft you can clean up in an afternoon. Some of them are architectural issues that you’d better start addressing quickly. Either way, the fix for your high costs is asking questions. After that, you might also want to work with Project vNext to get a little extra staffing. We can help lower your costs in whatever area you deem the most concerning.

Derek Rushing
