this post was submitted on 18 Mar 2025
70 points (96.1% liked)

SourceHut continues to face disruptions due to aggressive LLM crawlers. We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users.

top 4 comments
thatsnothowyoudoit@lemmy.ca 32 points 1 week ago

We return NGINX’s non-standard status 444, which closes the connection without sending a response, to every LLM crawler we see.
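
A minimal sketch of that approach in nginx config, assuming a hypothetical user-agent list (the patterns below are examples, not an authoritative inventory):

```nginx
# Inside the http context: flag requests whose User-Agent matches
# known LLM crawlers. Extend the pattern list as needed.
map $http_user_agent $llm_crawler {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*CCBot      1;
    ~*Bytespider 1;
}

server {
    listen 80;
    server_name example.com;

    # 444 is nginx-specific: drop the connection without sending a response.
    if ($llm_crawler) {
        return 444;
    }

    # ... normal site config ...
}
```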

Caddy has a similar “close the connection” option: the abort directive, which is part of its static response handling.
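
Roughly the same idea as a Caddyfile sketch, again with illustrative user-agent patterns:

```caddyfile
example.com {
	# Match common LLM crawler user agents (example patterns only).
	@llm_crawlers header_regexp User-Agent (?i)(GPTBot|ClaudeBot|CCBot|Bytespider)

	# abort closes the connection without writing any response.
	abort @llm_crawlers

	reverse_proxy localhost:8080
}
```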

HAProxy has the “silent-drop” option, which likewise closes the TCP connection without notifying the client.
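
And a comparable HAProxy frontend sketch (the ACL substrings are placeholders to adapt):

```haproxy
frontend web
    bind :80

    # Example ACL: case-insensitive substring match on the User-Agent header.
    acl llm_crawler hdr_sub(User-Agent) -i gptbot claudebot ccbot bytespider

    # silent-drop cuts the connection without telling the client anything,
    # leaving the crawler to time out on its own.
    http-request silent-drop if llm_crawler

    default_backend app
```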

I’ve found crawling attempts, especially attacks, end more quickly using this option, but my sample size is relatively small.

Edit: we do this because too often we’ve seen them ignore robots.txt. They believe all data is theirs. I do not.
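
For context, the kind of robots.txt these crawlers are ignoring is nothing exotic; a well-behaved bot would honor something like:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```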

mesamunefire@lemmy.world 15 points 1 week ago

I had the same issue. OpenAI’s crawler was slamming my tiny little server and ignoring robots.txt. I had to install an LLM black hole and put very basic password protection around my git server frontend, since the crawler kept hammering it.
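
Basic password protection of that sort can be a few lines in a reverse proxy; a hypothetical nginx sketch (hostname, credentials file, and upstream port are placeholders):

```nginx
server {
    listen 80;
    server_name git.example.com;

    location / {
        # Prompt for credentials before anything reaches the git frontend.
        auth_basic           "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass http://127.0.0.1:3000;
    }
}
```

The credentials file can be created with `htpasswd -c /etc/nginx/.htpasswd someuser` from apache2-utils.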

As much as I don’t like Google, I did see their crawler come in, fetch robots.txt, and then make no other calls for a week. That’s how it should work.

Treczoks@lemmy.world 19 points 1 week ago

I wonder how many of the load problems I observe with lemmy.world are due to AI crawlers.

roguelazer@lemmy.world 9 points 1 week ago

The companies that run these residential proxy networks are sketchy as shit and in a better world would be criminally prosecuted. They're tricking random low-information users into installing VPNs and other backdoored software that turns their machines into a veritable botnet.