Fess up. You know it was you.
Accidentally deleted an entire column in a police department’s evidence database early in my career 😬
Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets. Spent two days rebuilding that.
And if you couldn’t reconstruct, you still had backups, right? … right?!
Oh sweet summer child
What the fuck is a “backups”?
He’s the guy that sits next to fuckups
deleted an entire column in a police department’s evidence database
Based and ACAB-pilled
Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.
Explain more?
Noob was told to change some parameters on an AWS EC2 instance, requiring a stop/start. Selected terminate instead, killing the instance.
Crappy company, running production infrastructure in AWS without providing proper training or securing a suitable backup process.
“Stop” is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you’re done testing or developing something for the day but you’ll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it’s just like power cycling a computer.
“Terminate” is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It’s the “nuke it from orbit” option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you’re very, very lucky, you restore from backups (or an AMI).
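The difference can be sketched as a toy state model. To be clear, this is just an illustration of the semantics described above, not the AWS API; the class and names are made up:

```python
# Toy model of EC2 instance lifecycle semantics (NOT the AWS API).
# Illustrates: Stop keeps config and storage; Terminate destroys both,
# unless termination protection is enabled.

class FakeInstance:
    def __init__(self, config, volumes):
        self.state = "running"
        self.config = config        # survives Stop, not Terminate
        self.volumes = volumes      # storage, delete-on-terminate here
        self.termination_protected = False

    def stop(self):
        # Like power cycling: config and storage are untouched.
        self.state = "stopped"

    def start(self):
        self.state = "running"

    def terminate(self):
        if self.termination_protected:
            raise PermissionError("termination protection is enabled")
        self.state = "terminated"
        self.config = None
        self.volumes = []           # nuke it from orbit

inst = FakeInstance({"type": "t3.micro"}, ["vol-1"])
inst.stop()
assert inst.state == "stopped" and inst.volumes == ["vol-1"]
inst.start()
inst.termination_protected = True   # the real AWS feature is the
try:                                # disableApiTermination attribute
    inst.terminate()
except PermissionError:
    pass                            # protected: instance survives
assert inst.state == "running"
```

On real AWS the analogous operations are `aws ec2 stop-instances`, `aws ec2 start-instances`, and `aws ec2 terminate-instances`, with termination protection set via the `disableApiTermination` instance attribute.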
Apparently Terminate means stop and destroy. Definitely something to use with care.
Maybe there should be some warning message… Maybe a question requiring you to manually type “yes I want it” or something.
Maybe an entire feature that disables it so you can’t do it accidentally, call it “termination protection” or something
It doesn’t help that the web UI used to hide Stop. I think it still does.
I didn’t call out a specific dimension on a machined part; instead I left it to the machinist to understand and figure out what needed to be done without explicitly making it clear.
That part was a 2 ton forging with two layers of explosion-bonded cladding on one side. The machinist faced all the way through a cladding layer before realizing something was off.
The replacement had a 6 month lead time.
That’s hilarious, actually. Pretty recently I “caused” a line stop because a marker feature (for visuals at assembly, so a pretty meaningless dimension overall) was very much over-dimensioned (we’re talking depth, radius, width, location from step), and to top it off, instead of a spot drill just doing a .01 plunge, they interpolated it (why, I have zero clue). So it had been leaving dwell marks for at least the past 10 months, and because it was over-dimensioned, all of them had to be put on hold, because the DOD demands perfection (aircraft engine parts).
I once “biased for action” and removed some “unused” NS records to “fix” a flaky DNS resolution issue without telling anyone on a Friday afternoon before going out to dinner with family.
Turns out my fix did not work and those DNS records were actually important. Checked on the website halfway into the meal and freaked the fuck out once I realized the site went from resolving 90% of the time to not resolving at all. The worst part was when I finally got the guts to report I messed up on the group channel, DNS was somehow still resolving for both our internal monitoring and for everyone else who tried manually. My issue got shoo-shoo’d away, and I was left there not even sure of what to do next.
I spent the rest of my time on my phone, refreshing the website and resolving domain names in an online Dig tool over and over again, anxiety growing, knowing I couldn’t do anything to fix my “fix” while I was outside.
Once I came home I ended up reversing everything I did, which seemed to bring it back to the original flaky state. Learned the value of SOPs and taking things slow after that (and also not to screw with DNS).
If this story has a happy ending, it’s that we did eventually fix the flaky DNS issue later, going through a more rigorous review this time. On the other hand, how and why I, a junior at the time, became the de facto owner of an entire product’s DNS infra remains a big mystery to me.
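One likely reason the site kept resolving for everyone else is resolver caching: a deleted record keeps being served from caches until its TTL runs out. A minimal sketch of that behaviour, with hypothetical names and no real DNS involved:

```python
import time

# Minimal TTL-cache sketch: why a deleted DNS record can keep "working".
# This is a toy model of resolver caching, not a real resolver.
class TTLCache:
    def __init__(self):
        self._store = {}    # name -> (value, expires_at)

    def put(self, name, value, ttl):
        self._store[name] = (value, time.monotonic() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[name]   # expired: next lookup misses
            return None
        return value

cache = TTLCache()
cache.put("example.internal", "10.0.0.7", ttl=3600)
# The authoritative record gets deleted here... but cached answers
# keep resolving until the hour-long TTL expires.
assert cache.get("example.internal") == "10.0.0.7"
```

Which is exactly why internal monitoring and other people’s machines can keep resolving a name for hours after you’ve broken the zone.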
Hopefully you learned a rule I try to live by despite not listing it: “no significant changes on Friday, no changes at all on Friday afternoon”.
“Man who deployed Friday works Saturday.”
It was the bad old days of sysadmin, where literally every critical service ran on an iron box in the basement.
I was on my first on-call rotation. Got my first call from helpdesk: Exchange was down, it’s 3 AM, and the on-call backup and Exchange SMEs weren’t responding to pages.
Now I knew Exchange well enough, but I was new to this role and this architecture. I knew the system was clustered, so I quickly pulled the documentation and logged into the cluster manager.
I reviewed the docs several times, we had Exchange server 1 named something thoughtful like exh-001 and server 2 named exh-002 or something.
Well, I’d reviewed the docs, and helpdesk and stakeholders were desperate to move forward, so I initiated a failover from clustered mode with 001 as the primary to unclustered mode pointing directly at server 10.x.x.xx2.
What’s that you ask? Why did I suddenly switch to the IP address rather than the DNS name? Well that’s how the servers were registered in the cluster manager. Nothing to worry about.
Well… Anyone want to guess which DNS name 10.x.x.xx2 was registered to?
Yeah. Not exh-002. For some crazy legacy reason the DNS names had been remapped in the distant past.
So anyway that’s how I made a 15 minute outage into a 5 hour one.
On the plus side, I learned a lot and didn’t get fired.
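The sanity check that would have caught this is resolving both directions before trusting either name or address. A rough stdlib sketch, where the hostname and IP are placeholders for whatever your cluster manager shows:

```python
import socket

def name_matches_ip(hostname, expected_ip):
    """Return True only if forward resolution of hostname yields expected_ip."""
    try:
        resolved = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return resolved == expected_ip

# e.g. before failing over to a raw IP, confirm it really is the box
# you think it is (names here are placeholders):
#   assert name_matches_ip("exh-002", "10.x.x.xx2")
assert name_matches_ip("localhost", "127.0.0.1")
```

Thirty seconds of paranoia versus a five-hour outage; when DNS names have been remapped “for legacy reasons”, the paranoia wins.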
Worked for an MSP, we had a large storage array which was our cloud backup repository for all of our clients. It locked up and was doing this semi-regularly, so we decided to run an “OS reinstall”. Basically these things install the OS across all of the disks, on a separate partition to where the data lives. “OS Reinstall” clones the OS from the flash drive plugged into the mainboard back to all the disks and retains all configuration and data. “Factory default”, however, does not.
This array was particularly… special… In that you booted it up, held a paperclip into the reset pin, and the LEDs would flash a pattern to let you know you’re in the boot menu. You click the pin to move through the boot menu options, each time you click it the lights flash a different pattern to tell you which option is selected. First option was normal boot, second or third was OS reinstall, the very next option was factory default.
I head into the data centre. I had the manual, I watched those lights like a hawk and verified the “OS reinstall” LED flash pattern matched up, then I held the pin in for a few seconds to select the option.
All the disks lit up, away we go. 10 minutes pass. Nothing. Not responding on its interface. 15 minutes. 20 minutes, I start sweating. I plug directly into the NIC and head to the default IP filled with dread. It loads. I enter the default password, it works.
There staring back at me: “0B of 45TB used”.
Fuck.
This was in the days where 50M fibre was rare and most clients had 1-20M ADSL. Yes, asymmetric. We had to send guys out as far as 3 hour trips with portable hard disks to re-seed the backups over a painful 30ish days of re-ingesting them into the NAS.
The worst part? Years later I discovered that, completely undocumented, you can plug a VGA cable in and you get a text menu on the screen that shows you which option you have selected.
I (somehow) did not get fired.
You still remember so. That means you learned and probably won’t do it again.
I spent over 20 years in the military in IT. I took down the network at every base I was ever at, each time finding a new way to do it. Sometimes, but rarely, intentionally.
Took out a node center by applying the patches gd recommended… took an entire weekend to restore all the shots, and my ass got fed 3/4 of the way into the woodchipper before it came out that the vendor was at fault for this debacle.
I fixed a bug and gave everyone administrator access once. I didn’t know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of returning the user who had just logged in, returned the first user in the DB, “admin”.
Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.
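A hypothetical reconstruction of that class of bug, with made-up table and function names: a login lookup that is missing its filter returns the first row in the table, and the first row happens to be the admin.

```python
import sqlite3

# Hypothetical sketch of the bug class described above: the login
# lookup is missing its WHERE clause, so every successful login
# gets back the first user row -- which is "admin".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("admin", 1), ("alice", 0), ("bob", 0)])

def buggy_lookup(conn, username):
    # BUG: username is never used -- first row wins every time.
    return conn.execute("SELECT name, is_admin FROM users LIMIT 1").fetchone()

def fixed_lookup(conn, username):
    return conn.execute(
        "SELECT name, is_admin FROM users WHERE name = ?", (username,)
    ).fetchone()

assert buggy_lookup(conn, "bob") == ("admin", 1)   # everyone is admin now
assert fixed_lookup(conn, "bob") == ("bob", 0)
```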
Plugged a serial cable into a UPS that was not expecting RS232. Took down the entire server room. Beyoop.
That’s a common one I have seen on r/sysadmin.
I think APC is the company with the stupid issue.
You don’t have two unrelated power inputs? (UPS and regular power)
This was 2001 at a shoestring dialup ISP that also did consulting and had a couple small software products. So no.
Took down the entire server room
ow, goddamn…
Updated WordPress…
Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.
Fuck WordPress for static sites…
UPDATE without a WHERE.
Yes in prod.
Yes it can still happen today (not my monkey).
Yes I wrap everything in a rollback now.
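The “wrap everything in a rollback” habit can be sketched like this: run the UPDATE inside a transaction, check the affected-row count, and only commit if it looks sane. A minimal sqlite sketch with made-up table names:

```python
import sqlite3

# Sketch of the "wrap it in a rollback first" habit: run the UPDATE
# in a transaction, inspect the affected-row count, and only commit
# when it matches what you expected to touch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 100), (2, 200), (3, 300)])
conn.commit()

cur = conn.execute("UPDATE accounts SET balance = 0")   # oops: no WHERE
if cur.rowcount != 1:       # we meant to touch exactly one row
    conn.rollback()         # nope -- undo before it becomes prod lore
else:
    conn.commit()

# The rollback saved us: every balance is back to its committed value.
assert conn.execute("SELECT SUM(balance) FROM accounts").fetchone() == (600,)
```

Same idea works in any database that gives you a transaction and an affected-row count; the point is that the count is your last chance to notice the missing WHERE before you commit.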
I did something similar. It was a list box with a hidden first row representing the ID. Somehow the header row got selected and an UPDATE WHERE id=id got run.
I did this once. But only once. The panic I felt in that moment is something I will never forget. I was able to restore the data from a recent backup before it became a problem, though.
- Create a database,
- have the organisation manually populate it with lots of records using a web app,
- accidentally delete the database.
All between backup windows.
“Acknowledge all” used to behave a bit differently in Cisco UCS Manager. Well, at least the notifications of pending actions all went away… because they were no longer pending.
Light switch is right next to the main power breaker.
And they looked the same, no cover or anything??!!
Why waste time looking when you can just reach behind you to where you’re pretty sure it is?
Mmm, fair.
My first time shutting down a factory at the end of second shift for the weekend, I shut down the compressors first, and that hard-stopped a bunch of other equipment that relied on the air pressure. Lesson learned. I spent another hour restarting and then properly shutting down everything. Never did that again.
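The ordering problem here is really a dependency graph: anything that relies on air pressure has to go down before the compressors that supply it. Python’s stdlib can sketch a safe shutdown order; the equipment names and dependencies below are made up for illustration:

```python
from graphlib import TopologicalSorter

# Made-up dependency map: each machine -> what it depends on to run.
depends_on = {
    "press":      {"compressor"},
    "paint_line": {"compressor", "conveyor"},
    "conveyor":   set(),
    "compressor": set(),
}

# A topological order lists dependencies before dependants, i.e. a
# startup order. Reversing it gives a shutdown order where everything
# stops before the thing it relies on does.
startup_order = list(TopologicalSorter(depends_on).static_order())
shutdown_order = list(reversed(startup_order))

assert shutdown_order.index("press") < shutdown_order.index("compressor")
assert shutdown_order.index("paint_line") < shutdown_order.index("conveyor")
```

Reversing the startup order is the whole trick: compressors come up first, so they go down last.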