this post was submitted on 23 Oct 2025

591 points (99.2% liked)

A single DNS race condition brought AWS to its knees (go.theregister.com)

submitted 1 week ago by mhzawadi@lemmy.horwood.cloud to c/selfhosted@lemmy.world

[–] IsoKiero@sopuli.xyz 196 points 1 week ago (3 children)

So it is always DNS

[–] mhzawadi@lemmy.horwood.cloud 116 points 1 week ago (1 children)

can confirm, it's always DNS. Even when it looks like a network issue, it's DNS

[–] aarRJaay@lemmy.world 27 points 1 week ago (1 children)

Spotted the Network guy

[–] ramble81@lemmy.zip 27 points 1 week ago (3 children)

Oh man. At one of my old companies, the devs would always blame the network. Even after we spent a year upgrading and removing all SPOFs, they'd blame the network…

“Your application is somehow producing 2 billion packets per second and your SQL queries are returning 5GB of data”…. “See! The network is too slow and it has problems”

[–] rumba@lemmy.zip 15 points 6 days ago

Dev: My app's getting a 400 hitting the server. Your firewall changes broke it.

Me: You're getting to the server, it's giving you back a malformed request error. Most likely it's a problem in your client.

Dev: it worked fine until you made that change in QA.

Me: Your server is in production.

After that, I just get too busy to look at it for a while.... They figure it out eventually.

[–] Dubiousx99@lemmy.world 33 points 1 week ago

It’s always DNS

[–] AtariDump@lemmy.world 16 points 6 days ago

[image]

[–] magic_lobster_party@fedia.io 189 points 1 week ago (3 children)

It’s not DNS

There’s no way it’s DNS

It was DNS

[–] evidences@lemmy.world 131 points 1 week ago

[image]

[–] possiblylinux127@lemmy.zip 35 points 1 week ago (1 children)

That and BGP

[–] MelodiousFunk@slrpnk.net 26 points 1 week ago (1 children)

If I had a nickel for every time clearing the ARP tables fixed a problem, I'd have a shitload of nickels.

[–] possiblylinux127@lemmy.zip 17 points 6 days ago (1 children)

If clearing the ARP tables fixes the issue you have bigger problems

[–] MelodiousFunk@slrpnk.net 30 points 6 days ago (1 children)

These things happen when a skinflint company contracts out network setup for a decade, gets acquired by another skinflint company who axes the contractors and doesn't hire on-site network personnel, gradually builds out infra on top of the unsupported foundation, and then hires c suite buddies who want to bring in their own people to further muddy the waters.

[–] the_q@lemmy.zip 30 points 1 week ago

[image]

[–] GreenKnight23@lemmy.world 124 points 6 days ago (1 children)

oh sure, when they fuck up DNS it's a "race condition".

when I fuck up DNS it's a "fireable offense".

[–] sommerset@thelemmy.club 12 points 6 days ago (6 children)

It's funny the AWS report didn't mention that 40% of sysops were replaced by AI. https://blog.stackademic.com/aws-just-fired-40-of-its-devops-team-then-let-ai-take-their-jobs-d9db9d298bfa

[–] falseWhite@lemmy.world 98 points 1 week ago (6 children)

That's what you get when you let go hundreds of employees from your cloud computing unit in favour of AI.

I hope they end up having to compensate all the billions of losses they caused to all the businesses and people.

[–] otacon239@lemmy.world 76 points 1 week ago (1 children)

Consequences? For Amazon?

lol… lmao even

[–] falseWhite@lemmy.world 33 points 1 week ago* (last edited 1 week ago) (7 children)

They do have contracts and are obligated to provide a certain "up time", which is usually 99% or so. If they fail to provide that, they are liable to compensate for the losses.

Or do you think that Amazon is above the law and no other company could sue them?

It all depends on what kind of contracts they have.
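For a sense of scale, the downtime an uptime percentage allows can be worked out directly. This is only a sketch: the 30-day billing window and the function name `allowed_downtime_minutes` are assumptions for illustration, not terms from any actual AWS agreement.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` days at a given uptime percentage.

    The 30-day default is an assumed billing window, not a quoted contract term.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime over 30 days allows about {minutes:.1f} minutes of downtime")
```

At 99% that comes to 432 minutes (a little over 7 hours) in a 30-day month, so a long outage can still sit inside a loose guarantee.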

[–] Onomatopoeia@lemmy.cafe 27 points 1 week ago* (last edited 1 week ago)

Much of this stuff is automatic - I've worked with such contracted services where uptime is guaranteed. The contracts dictate the terms and conditions for refunds. We see them on a monthly basis when uptime is missed, and it's not done by a person.

I imagine many companies have already seen refunds for outage time, and Amazon scrambled to stop the automation around this.

They'll have little to stand on in court for something this visible and extensive, and could easily lose their shirt with fines and penalties when a big company sues over breach when they choose to not renew.

Just cause they're big doesn't mean all their clients are small or don't have legal teams of their own.

[–] WASTECH@lemmy.world 9 points 1 week ago (2 children)

These contracts do not stipulate reimbursement for lost revenue. The “uptime guarantee” just gets you a partial discount or service refund for the impacted services.

It is on the customer to architect their environment for high availability (use multiple regions or even multiple hyperscalers, depending on the uptime need).

Source: I work at an enterprise that is bound by one of these agreements (although not with AWS).

[–] CheezyWeezle@lemmy.world 8 points 6 days ago

SLA contracts can have a plethora of stipulations, including fines and damages for missing SLO. It really depends on how big and important the customer is. For example, you can imagine government contracts probably include hefty fines for causing downtime or data loss, although I am not involved with or familiar with public sector/ government contracts or their terms.

You can imagine that a customer that is big enough to contract a cloud provider to build new locations and install a bunch of new hardware just for them, would also be big enough to leverage contract terms that include fines and compensation for extended downtime or missing SLO.

I work at a data center for a major cloud provider, also not AWS

[–] BakerBagel@midwest.social 7 points 1 week ago (2 children)

Amazon has more money than most countries. They can outlast any company in court, or just ban you from their services in the future.

[–] Onomatopoeia@lemmy.cafe 10 points 1 week ago

Depends on who we're talking about. Companies like finance orgs are all about legal contracts and would be able to hold their feet to the fire.

You don't want to go to court against a finance company or any very large org where contract law is their bread and butter (basically any large/multinational corp).

Amazon's not hosting just small operations.

[–] BCsven@lemmy.ca 6 points 1 week ago (2 children)

Most services have a clause that they are not liable for unforeseen issues. It depends how good the lawyers were when formalizing the contracts.

[–] bigboitricky@lemmy.world 18 points 1 week ago

Oops! All slop!

[–] possiblylinux127@lemmy.zip 16 points 1 week ago

Mistakes happen with or without AI

The problem is that the current internet is structured in a way that creates high risk systems that can cause a massive outage. We went from having thousands of independent companies to a handful of massive ones. A mistake by a single company shouldn't be able to black out half the internet.

[–] phoenixz@lemmy.ca 12 points 1 week ago (2 children)

Was it proven that AI was the cause?

I'm not saying it wasn't, just that if it really was, I'd like a source for that claim

[–] jaybone@lemmy.zip 7 points 6 days ago

There was an article in my lemmy all feed yesterday claiming so. But it was a super questionable shady site, which people were calling out.

[–] slothrop@lemmy.ca 64 points 1 week ago

I DNS see that coming.

[–] WhatsHerBucket@lemmy.world 46 points 1 week ago (1 children)

It was the best race anyone has ever seen 🫲🍊🫱

[–] BrianTheeBiscuiteer@lemmy.world 10 points 6 days ago* (last edited 6 days ago) (1 children)

Let's be honest, not all races are equal 🫲🍊🫱

[–] sommerset@thelemmy.club 43 points 6 days ago* (last edited 6 days ago) (2 children)

It's funny the AWS report didn't mention that 40% of AWS sysops people were replaced by AI right before it happened: https://blog.stackademic.com/aws-just-fired-40-of-its-devops-team-then-let-ai-take-their-jobs-d9db9d298bfa

[–] regedit@lemmy.zip 43 points 6 days ago (2 children)

Unbelievable, racism even exists in networking!

[–] TommySoda@lemmy.world 34 points 1 week ago (2 children)

This is purely anecdotal, but I have been running into a lot of DNS issues over the past couple months where I work. 3 of the computers and even one of the laptops for remote work were having DNS issues that needed to be fixed. One even needed Windows reinstalled after fixing the DNS issue (Which was probably unrelated, but worth mentioning)

I'm honestly starting to think that the internet in general might be imploding. Not sure why, but replacing so many developers and programmers with AI might be responsible. Who knows, but it's definitely very strange.

[–] possiblylinux127@lemmy.zip 56 points 1 week ago (2 children)

The biggest issue is how centralized the internet has become. It went from a bunch of local servers to a handful of cloud providers.

We need to spread things out again

[–] metaStatic@kbin.earth 7 points 6 days ago

That's not how capitalism works though

[–] ubergeek@lemmy.today 27 points 1 week ago (2 children)

A huge problem is developers who lack a fundamental understanding of how the internet even works. I've had to explain how short, unqualified names resolve vs how FQDNs resolve. Or why you may not even be able to reach another node in your proverbial cluster, because they are on different subnets. Or why using GUIDs as hostnames is a generally bad idea, and will cause things to fail in unpredictable ways, especially with deeply nested subdomains.
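To make the short-name vs FQDN point concrete, here is a rough sketch of how a stub resolver with a resolv.conf-style `search` list and `ndots` setting decides which names to try, plus a check for whether two addresses share a subnet. It is a simplified model, not any resolver's actual code; the function names, example domains, and addresses are all made up for illustration.

```python
import ipaddress


def candidate_names(name: str, search_domains: list[str], ndots: int = 1) -> list[str]:
    """Names a stub resolver would try, in order (simplified search/ndots model)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    searched = [f"{name}.{domain}." for domain in search_domains]
    as_is = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [as_is] + searched
    # Too few dots: walk the search list first, try the bare name last.
    return searched + [as_is]


def same_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """True if both addresses fall in the same network for the given prefix length."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


if __name__ == "__main__":
    search = ["corp.example.com", "example.com"]  # hypothetical search list
    print(candidate_names("db", search))
    print(candidate_names("db.example.com.", search))
    print(same_subnet("10.0.1.5", "10.0.2.5", 24))
```

The short name `db` gets expanded through every search domain before it is tried bare, so the same string can resolve to different hosts depending on where the client sits. Nodes on different subnets can only talk if something routes between them, which is why "same cluster" does not guarantee reachability.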

[–] GreenKnight23@lemmy.world 13 points 6 days ago (1 children)

I have worked with too many devs that didn't even know what the 7 layers/OSI are or why they exist.

they didn't know what a network port was used for and why it's important to not expose 3306 to the internet.

they couldn't understand that fragmentation of a message bus occurs when you don't dedupe the contents.

you know, morons.

[–] metaStatic@kbin.earth 14 points 6 days ago

Ah, the common clay of the new Web

[–] Appoxo@lemmy.dbzer0.com 5 points 6 days ago (4 children)

GUIDs?
Could you expand on that topic? :)

[–] pokexpert30@jlai.lu 33 points 6 days ago (1 children)

Just one more layer bro, just one more automated planning system bro and this time it will be entirely faultless please bro one more layer

[–] ReedReads@lemmy.zip 33 points 1 week ago

Ironically, my pihole is blocking that link. So here’s a clean one: https://www.theregister.com/2025/10/23/amazon_outage_postmortem/

[–] HeartyOfGlass@piefed.social 23 points 1 week ago

Racist DNS!

[–] Cyber@feddit.uk 13 points 1 week ago (1 children)

I'm glad these things happen... it keeps everyone aware that cloud is fragile and Plan B should be considered for mission critical tasks.

I'm also hoping that it will improve cloud resiliency because a complete / partial restart of cloud systems needs a whole different approach than maintaining a running system.

[–] Flax_vert@feddit.uk 9 points 1 week ago (9 children)

Makes sense. DNS is quite a single point of failure
