this post was submitted on 02 Nov 2025

93 points (96.0% liked)

What is lemmy doing about bot scrapers? (lemmy.eco.br)

submitted 2 days ago by flango@lemmy.eco.br to c/asklemmy@lemmy.world

64 comments

I've been following the struggle of the bearblog developer to manage the current war between bot scrapers and people who are trying to keep a safe and human-oriented internet. What is Lemmy doing about bot scrapers?

Some context from bearblog dev

The great scrape

https://herman.bearblog.dev/the-great-scrape/

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the author's permission and all content being opt-in by default.

Needless to say, this is unethical. But as Meta has proven, it's much easier to ask for forgiveness than permission. It is unlikely they will be ordered to "un-train" their next generation models due to some copyright complaints.

Aggressive bots ruined my weekend

https://herman.bearblog.dev/agressive-bots/

It's more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I've blocked close to 2 million malicious requests across several hundred blogs.

What's wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I'm still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.

[–] ewigkaiwelo@lemmy.world 9 points 2 days ago (19 children)

So it doesn't stop LLMs from data farming but makes them spend more energy doing so? If that's the case, it sounds like it's making things even worse.

[–] AMoistGrandpa@lemmy.ca 19 points 2 days ago* (last edited 2 days ago) (10 children)

As I understand, Anubis doesn't make the user do anything. Instead, it runs some JavaScript in the client's browser that does the calculations, and then sends the result back to the server. In order for an LLM to get through Anubis, the LLM would need to be running a real JavaScript engine (since the requested calculation is too complicated for an LLM to do natively), and that would be prohibitively expensive for bot farms at any real scale. Since all real people accessing the site will be doing so through a browser, which has JavaScript built in, and most bots will just download the website and send the source code right into the LLM without being able to execute it, real people will be able to get through Anubis while bots won't. The total amount of extra energy consumed by adding Anubis isn't actually that high since bot farms aren't doing the extra work.

Take that all with a grain of salt; that info is based on a blog post which I read like 6 months ago, and I may be remembering incorrectly.
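For readers who want to see the kind of calculation described above, here is a minimal sketch of a hashcash-style proof-of-work exchange. It assumes a SHA-256 challenge where the client must find a nonce whose hash starts with a given number of zero hex digits. The function names, the challenge format, and the difficulty value are all hypothetical choices for illustration; this is not Anubis's actual code.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string (hypothetical format)."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce whose SHA-256 digest
    starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's answer."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = make_challenge()
    # difficulty=3 is an assumed value; it needs about 16**3 = 4096 hashes on average.
    nonce = solve(challenge, difficulty=3)
    print(nonce, verify(challenge, nonce, difficulty=3))
```

The asymmetry is the point of the design: the client performs thousands of hashes on average, while the server checks the result with a single hash. A client that never executes the challenge script never produces a valid nonce.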

[–] FaceDeer@fedia.io 3 points 2 days ago (4 children)

The extra work and energy expenditure is being done by every single user using the site. The server wastes everyone else's resources to provide benefits for itself.

Bots can be designed to run JavaScript too, so if a site's contents are worth scraping it can still be done.

[–] Cocodapuf@lemmy.world 4 points 1 day ago* (last edited 1 day ago) (1 children)

Do you realize how much extra work your browser has to do every time you visit a site that makes money on ads? All the additional scripts being run in the background, it's astonishing. Trust me, the additional work that users' machines have to do for this is totally insignificant when viewed in the greater context of what we actually do with computers.

Watching a 10 minute YouTube video, that's your computer doing more work than it would loading a million text based pages running Anubis.

[–] FaceDeer@fedia.io 1 points 1 day ago (1 children)

> Do you realize how much extra work your browser has to do every time you visit a site that makes money on ads?

I have uBlock Origin and Ghostery, so very little.

> Watching a 10 minute YouTube video, that's your computer doing more work than it would loading a million text based pages running Anubis.

Given that AI trainers are training on YouTube videos too, that sounds like Anubis isn't going to impose meaningful costs on them.

[–] Cocodapuf@lemmy.world 2 points 1 day ago* (last edited 1 day ago)

> Given that AI trainers are training on YouTube videos too, that sounds like Anubis isn't going to impose meaningful costs on them.

Well, does it work?

You don't need to guess about it, you can simply look at traffic records and see how much it changes after installing Anubis. If it works for now, great. Like all things like this, it's a cat and mouse game.
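As a rough illustration of that before-and-after check, here is a minimal sketch that compares average requests per day in equal windows on either side of a deployment date. It assumes the request timestamps have already been parsed out of the server's access log; the function names, the window size, and the sample data are all hypothetical and made up for the example.

```python
from datetime import datetime, timedelta


def requests_per_day(timestamps, start, end):
    """Average requests per day in the half-open window [start, end)."""
    days = (end - start) / timedelta(days=1)
    if days <= 0:
        raise ValueError("end must be after start")
    hits = sum(1 for t in timestamps if start <= t < end)
    return hits / days


def change_after_deploy(timestamps, deployed_at, window):
    """Return (before, after) request rates around the deployment time."""
    before = requests_per_day(timestamps, deployed_at - window, deployed_at)
    after = requests_per_day(timestamps, deployed_at, deployed_at + window)
    return before, after


if __name__ == "__main__":
    # Synthetic data for illustration only: 10 requests the day before
    # the assumed deployment date and 2 requests the day after.
    deployed_at = datetime(2025, 1, 2)
    timestamps = [datetime(2025, 1, 1, h) for h in range(10)]
    timestamps += [datetime(2025, 1, 2, h) for h in range(2)]
    before, after = change_after_deploy(timestamps, deployed_at, timedelta(days=1))
    print(before, after)  # prints "10.0 2.0"
```

A raw request count does not separate bots from people, so a real comparison would also want to look at something like user agents or per-IP request rates.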

Also, the way your computer interprets a YouTube video and the way a scraper interprets a YouTube video may well be different. But in general, for a browser, streaming and decoding video is a relatively heavy and high-bandwidth operation. Video is much higher bandwidth and has much higher CPU processing requirements than audio, which likewise is heavier and higher bandwidth than text. As a result, video and text barely compare; they're totally different orders of magnitude in bandwidth and processing needs. So does an AI scraper have to do all that decoding? I actually have no idea, but there definitely could be shortcuts, ways to just avoid it. For instance, they may only care about the audio, or perhaps the transcripts are good enough for them.
