CaelNCSU All American 7187 Posts
Names changed to protect the dead and innocent.
How I learned to stop worrying and love Kubernetes
The bright afternoon sun shines into my open office. It's 2:05 PM, I notice, as I stare blankly at my laptop.
"It's gone," says Fred, one of the senior DevOps engineers.
"What's gone?" I ask.
"ALL OF IT," Fred replies.
I fire up Chrome and type `https://player.bigcorp.tv`. Nothing. Small Monaco text on a white background reads: "Server Error". I press F12 and check developer tools. The status code column shows a single 502. Our front-end is served from one service and the backend API, for which I am responsible, is served from another. From a terminal I hit the backend with a curl command to see if I can reach publicly accessible data. Another 502. I type `https://internal.bigcorp.tv/status` into the location bar. Our internal status page is also throwing the now familiar "Server Error." "Where is Logan?" I ask. "He left for lunch," replies Fred.
We log in to the root AWS account and start checking our super resilient, uncrashable Mesos infrastructure that was costing up to $100,000 a month. Mesos has worker nodes that require a connection to the control plane nodes to operate. Fred explains that the control plane nodes, running on virtual machines in AWS, had been corrupted and rebuilt that morning. The reason our infrastructure was uncrashable was that our lead DevOps engineer, Ryan, had extra control plane nodes running in Mesos itself. This self-hosted model allowed us to always spin up Mesos masters in such an event. The DevOps Manager, Logan, was away at lunch.
The control plane running in Mesos had split-brained and was unable to take over the leader role when the AWS nodes were rebuilt. Fred had thought restarting the nodes would quickly synchronize the whole control plane, and if anything went wrong we had backups of the control plane data. Unbeknownst to Fred, the restart wiped the only good copy of the master data: Docker containers have ephemeral disk space unless a volume from the host is mounted into them. "The last backup was 4 months ago," Fred informs me. "Does Will know?" I ask. Will is my boss and the Director of our Digital TV Product.
"IF THIS IS NOT UP IMMEDIATELY THEY ARE GOING TO FIRE ME AND ALL OF YOU," yells Will. As usual, the prospect of not earning income while his high-profile attorney wife cucks him is first among Will's concerns. "WHERE THE FUCK ARE LOGAN AND BRADEN?" Braden is the mastermind of this Mesos architecture and could likely fix it. However, Braden is spending his vacation visiting Burning Man for the first time. His boss Logan is MIA at lunch. Will gives Logan a call and informs him the entire platform is down for customers and for every internal media management employee at the company.
2:45 PM.
Logan enters the office and scowls at me. Surprisingly, he walks straight to where the action is. I half-expected him to retreat to the DevOps room and tinker with his modular synthesizer, letting everyone else clean up the mess. I've had multiple public fights with him about the cost of running this Mesos Frankenstein. My budget partially pays for the $100,000 a month infrastructure. "The point of a container orchestration platform is to scale down so we can spend LESS money," I would scream after noting the bill had gone up more than 10x a month after migrating to his Mesos infrastructure. Now, despite how awesome the Mesos platform is, and how all my teams would love it, it is down.
By 4:00 PM afternoon bar patrons, I mean alcoholics, have been without bar TV almost 3 hours. The only thing Logan has accomplished is pacing nervously around the office and occasionally breathing stale, hungover breath into Fred's ear. The head of North America is now hovering, along with Will, my boss. Unfortunately the mobile and TV apps are crashing every time you tap or click the TV icon. Our platform provides digital TV to a European TV product. Europe is asleep by now, but the customers, and our European execs, will definitely start calling by morning. If you have never been yelled at by a German exec, it's just as terrifying as old WW2 clips would have you believe. Everyone looks increasingly nervous. "We can just run the services on bare AWS without Mesos," I suggest.
Logan had finally made contact with Braden via text. He stated the Mesostein was in jeopardy and if they didn't fix it, he would certainly lose his job. Even though Braden had not slept the night before, and may or may not have had pharmaceutical assistance with not sleeping, he decided to hurriedly leave Burning Man and drive the 8 hours back to the office, targeting arrival before 11 PM.
The backend is a simple Golang app which is easy to run with a single command. I demo to Will and the NA lead how I can route the DNS directly to the box via auto scaling groups, which gets us scaling out of the box. The database for the backend was not impacted and running in RDS, so this works, and we see TV show titles and M3U8 playlist URLs in a JSON blob. At 4:15 PM we have a strategy.
It didn't take long to get the front-end application, a simple Node.js app, running again. Hitting the staging URL, the site was back up, though its internal management tools were not. At 5:15 PM we shout "HOORAY!" The site is back up. Unfortunately, the service which serves the M3U8 playlists responsible for playing the video is a Java web service, whose lead has just left the company. In parallel the video playout team had been trying to get the service running, but they are not familiar with the dark arts of the Linux CLI and the AWS console. Fred and the DevOps team are still trying to jolt Frankenmesos back to life, so they are of little help.
A new engineer on the video team, frustrated with the verbosity and complexity of the service, had built a skunkworks project which generated M3U8 playlists for video. It was missing the advertising stitching capability, but it would play video without ads if you pointed our video player at it. We demoed this to Will. "We can just change the M3U8 URL in the database for the non-working video service to this service," we say. "But it's completely untested," Will says. "Yeah, but no video works. What do we have to lose?" I reply. "Fuck it," he says.
The next 4 hours we spent spinning up new services by hand on AWS running the video service with nohup after SSHing into them on public IP addresses. Around 7PM I receive a phone call from Braden. He tells me to check my text messages. I look and see a message he's sent with a picture. The picture contains a small BMW hatchback connected to a small UHaul. Both are totally destroyed, the contents of the UHaul are strewn about the highway complete with a bunny onesie and California mountains in the background.
Midnight.
We were now watching video on our player, streaming from a completely untested service. The database was updated with the new service URL just as our European customers started waking up to watch. The video team was now free to go home, and the DevOps team was directed to help them get the old video service running in the morning.
At 7 AM I arrive back in the office with a $5 pour over coffee in hand. The video team beat me there. They were still trying to save face and get their application working after being bested by a junior engineer and weekend project. Despite the panic of the previous day, they were close to getting back up and running.
I take the team and Fred out to a food truck for takeout lunch. We pass Logan as we approach the second office building. "I'm out," he says. "Leaving already after yesterday?" I ask, but shit like this isn't surprising for him. "They fired me," he replies. "I guess we will always have Mesos," I say. "Hey, come over into my office," Will says, catching me off guard as I hold my butter chicken, which is burning my hand, but soon to burn my asshole. "How would you like to run the DevOps team as well?" he says. "Only if we can delete that hell spawn of an infrastructure," I reply.
Our next month's bill came to $7,000--a savings of $93,000 a month.
[Edited on January 6, 2025 at 9:50 PM. Reason : a] 1/6/2025 9:41:39 PM |
qntmfred retired 40885 Posts
Idk, somebody probably should have been fired for insisting on a 100k a month system that crashes. This stuff is really common, especially setting up Mesos (or Nomad). 1/7/2025 7:34:36 AM |
CaelNCSU All American 7187 Posts
I advocated for it for months but didn't have the political capital to make it happen until it blew up. 1/7/2025 8:19:56 AM |
StTexan THINK POSITIVE! 7694 Posts
I am positive it will make more sense when I fully read it.
Quote : | "The story recounts a chaotic day at a tech company when their entire platform, built on a costly and complex Mesos infrastructure, crashes. The outage disrupts both customer services and internal tools, sparking panic and desperation.
Key events:
• The Problem: A corrupted control plane caused a cascade of failures, exacerbated by outdated backups and Mesos’ overcomplicated setup.
• Attempts to Fix: The DevOps team struggles to revive the platform while key personnel, including the system architect (at Burning Man) and the DevOps manager (at lunch), are unavailable.
• Creative Solutions: The protagonist proposes running services directly on AWS. This works for some systems, but a critical video service remains down. A junior engineer’s untested project is used as a stopgap to restore video streaming without ads.
• The Fallout: Despite some successes, the DevOps manager is fired, and the protagonist is offered the chance to lead the team. They accept on the condition that the bloated infrastructure is scrapped.
Outcome: The team rebuilds a simpler, more cost-effective setup on AWS, reducing monthly costs from $100,000 to $7,000." |
[Edited on January 7, 2025 at 11:37 PM. Reason : Yep] 1/7/2025 11:30:26 PM |
smoothcrim Universal Magnetic! 18974 Posts
Quote : | " "We can just run the services on bare AWS without Mesos," I suggest. " |
I have fought this fight multiple times. I have no idea why these people insist on the leaning tower of abstractions. k8s and mesos are pretty trash and offer zero value in public cloud, especially if you're only using 11/9/2025 3:01:58 PM |
moron All American 34335 Posts
^ Because knowing those technologies is how you get a job in big tech, so you have to trick a small tech company into using it so you can put it on your resume 1/11/2025 11:53:28 PM |
CaelNCSU All American 7187 Posts
^ very true that's what people think and why people want to do it. In addition, praying for silver bullets to make their job easier.
In reality to get the job it's grinding leetcode for 6 months and getting a referral. Alternatively, being the 1 or 2 guys in the 100 person data structures class that got an A+ and destroyed everyone's curve also works. 1/12/2025 7:55:44 PM |
emnsk All American 2894 Posts
What exactly is this 1/15/2025 7:39:14 PM |
FroshKiller All American 51920 Posts
I posted this somewhere several months ago, but you might find it interesting. Kind of a low-stakes bug, not exactly a firefight or war story.
---
One of the applications I work on has an API endpoint for updating customer information. For whatever reason, updates to the name of the customer's company are passed as a header in the HTTP request named CompanyName.
Support escalated a case where a customer could not successfully sync their information to this endpoint. The API sits behind Cloudflare's Web Application Firewall. The WAF was rejecting the request with the reason "Invalid UTF-8 encoding."
Let's say for the sake of example that the name of the company is Télébec LP. We captured a request and saw these bytes for the value of the CompanyName header:
0x54 0xE9 0x6C 0xE9 0x62 0x65 0x63 0x20 0x4C 0x50
In UTF-8, the character é is two bytes: 0xC3 0xA9. We don't see those bytes here. So whatever the heck this is, it isn't UTF-8, and the WAF was right to block it according to that ruleset.
The expected UTF-8 encoding is this:
0x54 0xC3 0xA9 0x6C 0xC3 0xA9 0x62 0x65 0x63 0x20 0x4C 0x50
Or to align it with the characters:
0x54 T
0xC3 0xA9 é
0x6C l
0xC3 0xA9 é
0x62 b
0x65 e
0x63 c
0x20 (space)
0x4C L
0x50 P
What's interesting here is that all the bytes apart from the ones representing the character é are identical. Whatever this encoding is, it seems to be a fixed-length encoding where each character is a single byte and which has some overlap with UTF-8 when it comes to representing your common Latin alphabet characters and the space character.
There's a pretty well known fixed-length encoding that overlaps with UTF-8 like that: ASCII, or US-ASCII if you prefer. US-ASCII doesn't have a representation of the character é, but there are a few extended ASCII encodings that do, namely ISO-8859-1 and its bastard cousin, Windows-1252. Both of these encode the character é as the single byte 0xE9. In UTF-8, 0xE9 is the lead byte of a three-byte sequence and must be followed by two continuation bytes in the range 0x80 to 0xBF. Here it is followed by 0x6C and 0x62, so the sequence is not valid UTF-8, hence Cloudflare's objection.
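If you want to poke at the bytes yourself, here is a minimal sketch. It's in Python rather than .NET purely for brevity, and it assumes the client really is serializing as Windows-1252, which is only my suspicion. The helper name hex_bytes is made up for the example.

```python
# Compare how the same string serializes under Windows-1252 and UTF-8.
NAME = "T\u00e9l\u00e9bec LP"  # "Télébec LP", the stand-in company name


def hex_bytes(data: bytes) -> str:
    """Format bytes the same way as the dumps in this post."""
    return " ".join(f"0x{b:02X}" for b in data)


cp1252_bytes = NAME.encode("cp1252")  # what the client appears to send
utf8_bytes = NAME.encode("utf-8")     # what the WAF expects

print(hex_bytes(cp1252_bytes))
print(hex_bytes(utf8_bytes))

# A strict UTF-8 decoder rejects the Windows-1252 bytes at the first 0xE9.
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"invalid UTF-8 at byte offset {exc.start}: {exc.reason}")
```

ISO-8859-1 ("latin-1" in Python) produces the same bytes for this particular string, so the dump alone can't tell the two encodings apart.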
But we're in a pickle here. You might think it'd be simple to serialize the string with UTF-8 rather than what I suspect is Windows-1252 since the request is coming from an older .NET Framework application hosted on Windows Server, but it's not that simple. We are stuck with a particular HTTP client implementation. We are not at liberty to change the dependency or add a new one. This serialization behavior is a bug in the HTTP client library that we can't change or fix ourselves.
My teammate working on the problem suggested that we could Base64-encode the string. This is appealing at first. Base64 only uses printable ASCII characters, so it's safe according to the strict HTTP header spec and overlaps completely with UTF-8, so the WAF will be satisfied. But it would mean that the API endpoint itself would have to decode the header value, which means a code change in two places. Worse, either all other client implementations would have to be updated to encode the value or the server would have to detect whether the value requires decoding. If we got that part wrong, "Télébec LP" would become "VMOpbMOpYmVjIExQ" instead.
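To make the detection problem concrete, here is a rough Python sketch. The functions encode_header and naive_decode are hypothetical stand-ins, not anything from our codebase, and the "if it parses as Base64, decode it" rule is just one guess at how a server might try to support both old and new clients.

```python
import base64


def encode_header(value: str) -> str:
    """Client side: Base64 of the UTF-8 bytes, which is pure ASCII."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def naive_decode(header: str) -> str:
    """Hypothetical server-side guess: if it parses as Base64, decode it."""
    try:
        return base64.b64decode(header, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return header


encoded = encode_header("T\u00e9l\u00e9bec LP")
print(encoded)                     # the header an updated client would send
print(naive_decode(encoded))       # round-trips back to the original name
print(repr(naive_decode("Acme")))  # a plain value from an old client gets mangled
```

A plain four-letter name like "Acme" happens to be valid Base64, so the guessing server decodes it into garbage. Any detection heuristic of this kind has false positives, which is why the approach needs every client updated at once.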
One thing we could do is try to transcode the string to ASCII using some kind of a replacement strategy for the unsupported characters. By default in .NET, that'd result in "Télébec LP" becoming "T?l?bec LP" (note the question marks), and that sucks. We could approximate it with a transliteration, something like "Telebec LP" or even "Te'le'bec LP" if you like, but these also suck in my opinion. Would you know what the question mark is supposed to be? Would you know whether e' was literally e' or meant to be é (or something else completely)?
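Here is roughly what those two lossy options look like, sketched in Python as an analog of the .NET behavior (the "?" replacement is Python's own errors="replace" handling, and strip_accents is a made-up helper name).

```python
import unicodedata

NAME = "T\u00e9l\u00e9bec LP"  # "Télébec LP"

# Option 1: replace anything outside ASCII with "?".
replaced = NAME.encode("ascii", errors="replace").decode("ascii")


def strip_accents(text: str) -> str:
    """Option 2: decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(replaced)             # question marks where the accents were
print(strip_accents(NAME))  # plain letters, accents silently dropped
```

Neither result can be turned back into the original name, which is the whole problem.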
What we need is a way to escape the string so that it consists entirely of valid US-ASCII characters. If the string doesn't contain any characters outside the US-ASCII set, it shouldn't even change. And if there are escaped characters, we should still be able to recognize the non-escaped characters. And ideally, we should be able to tell the escaped characters are escaped and what they should actually be.
It turns out there's a good encoding for this already: numeric character references, specifically the type you've probably seen in HTML and XML before. The character é can be represented as "&#233;" this way. The 233 is the decimal value of 0xE9, the extended ASCII code point of the character, which is also its code point in the Universal Coded Character Set (UCS). In .NET, we can use the HtmlEncode and HtmlDecode methods of the HttpUtility class to handle encoding and decoding strings with values outside the US-ASCII range.
That means "Télébec LP" becomes "T&#233;l&#233;bec LP" in the header, or in terms of specific bytes:
0x54 0x26 0x23 0x32 0x33 0x33 0x3B 0x6C 0x26 0x23 0x32 0x33 0x33 0x3B 0x62 0x65 0x63 0x20 0x4C 0x50
Aligned with the characters:
0x54 T
0x26 &
0x23 #
0x32 2
0x33 3
0x33 3
0x3B ;
0x6C l
0x26 &
0x23 #
0x32 2
0x33 3
0x33 3
0x3B ;
0x62 b
0x65 e
0x63 c
0x20 (space)
0x4C L
0x50 P
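A minimal Python approximation of the escaping, for anyone without a .NET environment handy. This is not HttpUtility itself and may differ from it in edge cases; escape_for_header and unescape_from_header are names I made up for the sketch.

```python
import html


def escape_for_header(value: str) -> str:
    """Escape &, < and >, then turn non-ASCII characters into &#NNN; references."""
    escaped = html.escape(value, quote=False)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def unescape_from_header(value: str) -> str:
    """Reverse the escaping on the server side."""
    return html.unescape(value)


name = "T\u00e9l\u00e9bec LP"  # "Télébec LP"
header_value = escape_for_header(name)

print(header_value)
print(" ".join(f"0x{b:02X}" for b in header_value.encode("ascii")))
print(unescape_from_header(header_value) == name)
```

Escaping the ampersand first matters: a company name that literally contains "&" would otherwise be indistinguishable from the start of a character reference when decoded.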
Breaking this down in sequence:
1. We start with a string which may (or may not!) contain characters that cannot be represented in printable ASCII.
2. We use HttpUtility.HtmlEncode to escape any characters with equivalent numeric character references in the UCS, i.e. replace those characters with escape sequences composed of US-ASCII characters.
3. We give that string to the buggy HTTP client implementation when we set the CompanyName header. The client technically still serializes the string to bytes using the wrong encoding (Windows-1252), but the output is ASCII-compatible because Windows-1252 and ASCII overlap in the range of characters used here.
4. The request sails through the Cloudflare WAF with ease, because the header's value is now a valid UTF-8 sequence--again, thanks to overlap between UTF-8 and US-ASCII in this range.
This preserves the intended name of the company yet sets the header in a way that satisfies the WAF's ruleset. The only downside is that we do technically need to make a change in our API to decode the header value, but even if we didn't do that, we would still get a valid string from the header that a human could recognize and correct. And while that human might not know off the top of their head that "&#233;" represents the character é, it's still better than seeing a question mark and having no idea what character was intended.
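The four steps can be simulated end to end. In this Python sketch, buggy_client_serialize and waf_accepts are stand-ins I invented: the first assumes the client always serializes as Windows-1252, and the second assumes the WAF rule amounts to "reject anything that isn't well-formed UTF-8". Neither is the real client or the real Cloudflare rule.

```python
import html


def escape(value: str) -> str:
    """Step 2: replace non-ASCII characters with numeric character references."""
    return html.escape(value, quote=False).encode("ascii", "xmlcharrefreplace").decode("ascii")


def buggy_client_serialize(header_value: str) -> bytes:
    """Step 3 stand-in: the HTTP client always serializes as Windows-1252."""
    return header_value.encode("cp1252")


def waf_accepts(raw: bytes) -> bool:
    """Step 4 stand-in: accept only well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "T\u00e9l\u00e9bec LP"  # step 1: "Télébec LP"

raw_wire = buggy_client_serialize(name)              # before the hack
escaped_wire = buggy_client_serialize(escape(name))  # after the hack

print(waf_accepts(raw_wire))      # rejected: lone 0xE9 bytes
print(waf_accepts(escaped_wire))  # accepted: pure ASCII
print(html.unescape(escaped_wire.decode("utf-8")) == name)  # server recovers the name
```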
So that's the hack. Now, what's the right solution? Well, we shouldn't be putting stuff like this in HTTP headers. It belongs in the request's body, where we can use whatever bytes we want, and we should specify the encoding to be used in the Content-Type header.
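For illustration only, here is what a body-based request could look like, sketched with Python's standard library. The URL, the PUT method, and the companyName field are all hypothetical since I haven't described the real API's shape, and the request is built but never sent.

```python
import json
import urllib.request

# Hypothetical payload: the company name travels in the body, not a header.
payload = {"companyName": "T\u00e9l\u00e9bec LP"}
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

request = urllib.request.Request(
    "https://api.example.com/customers/123",  # placeholder URL
    data=body,
    headers={"Content-Type": "application/json; charset=utf-8"},
    method="PUT",
)

# Inspect what would go on the wire without actually sending it.
print(request.get_method())
print(request.get_header("Content-type"))  # urllib normalizes header name casing
print(request.data)
```

The body carries the real UTF-8 bytes for é (0xC3 0xA9) and the Content-Type header says so, which means nothing downstream has to guess.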
Of course, we can't break anything by changing the existing endpoint. That means it'd be a new endpoint that client applications would have to add support for, which means we'd still be on the hook to maintain the older endpoint until clients moved to the new one. But the hack allows us to move forward with minimal changes on both sides. And if we add the new endpoint right away, then the next client application that runs into this problem has one clear solution: use the new endpoint! 1/17/2025 9:28:08 AM |
CaelNCSU All American 7187 Posts
^ good one
Twitter famously used to ask variations on UTF-8 bugs in their interviews. "Write a function to validate 140 characters."
"In a previous role, what is a major incident or outage that occurred, and what was your role in fixing it? After fixing it, what are some impacts your contributions made?"
My answer to a common interview question and an attempt to make it more entertaining.
[Edited on January 17, 2025 at 9:30 AM. Reason : A] 1/17/2025 9:28:42 AM |
moron All American 34335 Posts
???? 😮👍
[Edited on January 17, 2025 at 12:37 PM. Reason : ] 1/17/2025 12:35:34 PM |