this post was submitted on 24 May 2025
1 point (100.0% liked)

Science Memes

(page 3) 50 comments
[–] ZeffSyde@lemmy.world 0 points 14 hours ago (1 children)

I'm imagining a bleak future where, in order to access data from a website, you have to pass a three-tiered system of tests that makes 'click here to prove you aren't a robot' and 'select all of the images that have a traffic light' seem like child's play.

[–] Tiger_Man_@lemmy.blahaj.zone 0 points 12 hours ago (1 children)

All you need to protect data from AI is to use a non-HTTP protocol, at least for now.

[–] Bourff@lemmy.world 0 points 12 hours ago

Easier said than done. I know of IPFS, but how widespread and easy to use is it?

[–] Tiger_Man_@lemmy.blahaj.zone 0 points 12 hours ago (1 children)

How can I make something like this?

[–] Iambus@lemmy.world 0 points 12 hours ago

Typical bluesky post

[–] gmtom@lemmy.world 0 points 11 hours ago (1 children)

Cool, but as with most of the anti-AI tricks, it's completely trivial to work around. So you might stop them for a week or two, but they'll add like 3 lines of code to detect this and it'll become useless.
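For illustration of the "few lines of code" workaround this comment describes, here is a hypothetical sketch (the function name, depth limit, and per-host cap are all illustrative, not taken from any real crawler): a scraper can sidestep an infinite link maze simply by bounding crawl depth and pages-per-host.

```python
from urllib.parse import urlparse

# Illustrative limits a crawler might use to escape a link maze.
MAX_DEPTH = 3
MAX_PAGES_PER_HOST = 256

def should_follow(url: str, depth: int, seen_per_host: dict) -> bool:
    """Return True if a toy crawler should fetch this URL.

    A tar pit relies on the crawler following links forever; capping
    depth and per-host page counts defeats that in a few lines.
    """
    host = urlparse(url).netloc
    seen = seen_per_host.setdefault(host, 0)
    if depth > MAX_DEPTH or seen >= MAX_PAGES_PER_HOST:
        return False
    seen_per_host[host] = seen + 1
    return True
```

This is why simple mazes only buy time; the more durable defenses discussed below aim at the economics of scraping rather than at trapping the crawler outright.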

[–] JackbyDev@programming.dev 0 points 10 hours ago (2 children)

I hate this argument. All cyber security is an arms race. If this helps small site owners stop small bot scrapers, good. Solutions don't need to be perfect.

[–] ByteOnBikes@slrpnk.net 0 points 7 hours ago (2 children)

I worked at a major tech company in 2018 that didn't take security seriously, because that was literally their philosophy: refuse to do anything until it's an absolutely perfect security solution, and treat everything else as wasted resources.

I've since left, and I continue to see them in the news for data leaks.

Small brain people man.

[–] Joeffect@lemmy.world 0 points 6 hours ago

Did they lock their doors?

[–] JackbyDev@programming.dev 0 points 6 hours ago

So many companies let perfect become the enemy of good, and it's insane. Recently, a discussion about getting our team to use a consistent formatting scheme devolved into this type of thing. If the thing being proposed is better than what we currently have, let's implement it as is; then, if you have concerns about ways to make it better, let's address those later in another iteration.

[–] moseschrute@lemmy.world 0 points 8 hours ago

I bet someone like Cloudflare could bounce them around traps across multiple domains under their DNS and make the trap harder to detect.

[–] stm@lemmy.dbzer0.com 0 points 10 hours ago

Such a stupid title, great software!

[–] antihumanitarian@lemmy.world 0 points 6 hours ago (1 children)

Some details: one of the major players using the tar-pit strategy is Cloudflare. They're a giant in networking and infrastructure, and they use AI (more traditional models, not LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.

Generating nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they're a cheap way to get training data. If you make a non-zero portion of training data poisonous, scrapers have to spend increasingly many resources to filter it out. The better the nonsense, the harder it is to detect. Cloudflare is known to use small LLMs to generate the nonsense, hence requiring systems at least that complex to differentiate it.

So, in short, the tar pit with garbage data actually decreases the average value of scraped data for bots that ignore do-not-scrape instructions.
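To make the poisoning idea concrete, here is a toy Markov-chain babbler; this is a sketch of the general technique only, not Nepenthes' or Cloudflare's actual implementation (all names here are invented). Each word is chosen based only on the previous word, so the output is locally plausible yet globally meaningless, which is exactly what makes it cheap to produce and costly to filter.

```python
import random
from collections import defaultdict

def build_chain(corpus: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain: dict, length: int = 50, seed: int = 0) -> str:
    """Emit locally plausible but meaningless text from the chain."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word never followed by anything): restart anywhere.
        word = rng.choice(followers) if followers else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)
```

Because every bigram in the output occurs somewhere in real text, simple statistical filters pass it, which is why, as the comment notes, telling good garbage apart can require models comparable to the ones being poisoned.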

[–] fossilesque@mander.xyz 0 points 4 hours ago

The fact the internet runs on lava lamps makes me so happy.

[–] mlg@lemmy.world 0 points 5 hours ago

--recurse-depth=3 --max-hits=256

[–] Novocirab@feddit.org 0 points 4 hours ago* (last edited 4 hours ago) (1 children)

There should be a federated system for blocking IP ranges that other server operators within a chain of trust have already identified as belonging to crawlers.

(Here's an advantage of Markov chain maze generators like Nepenthes: Even when crawlers recognize that they have been served garbage and delete it, one still has obtained highly reliable evidence that the IPs that requested it do, in fact, belong to crawlers.)
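A minimal sketch of that evidence-gathering idea, assuming a hypothetical trap path `/maze/` that only the maze generator links to and serves (the path, function names, and export format are all illustrative): any request under it is strong evidence of an automated link-follower, and the flagged IPs could then be shared with trusted peer operators.

```python
import json
import time

TRAP_PREFIX = "/maze/"  # hypothetical path served only by the tar pit

def record_crawler(ip: str, path: str, evidence: dict) -> bool:
    """Flag an IP as a crawler if it requested a trap-only URL.

    No human-facing page links into TRAP_PREFIX, so a hit there is
    reliable evidence of automated link-following, even if the
    crawler later discards the garbage it was served.
    """
    if not path.startswith(TRAP_PREFIX):
        return False
    evidence[ip] = {"path": path, "ts": int(time.time())}
    return True

def export_blocklist(evidence: dict) -> str:
    """Serialize flagged IPs for sharing with trusted peer operators."""
    return json.dumps(sorted(evidence), indent=2)
```

A federated version would mainly need to add signing and a chain-of-trust check before importing another operator's list, which is roughly the niche CrowdSec (mentioned below in the thread) occupies commercially.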

[–] Opisek@lemmy.world 0 points 4 hours ago (2 children)

You might want to take a look at CrowdSec if you don't already know it.

[–] rekabis@lemmy.ca 0 points 3 hours ago* (last edited 3 hours ago)

Holy shit, those prices. Like, I wouldn't be able to afford any package at even 10% of the going rate.

Anything available for the lone operator running a handful of Internet-addressable servers behind a single symmetrical SOHO connection? As in, anything for the other 95% of us who don't have literal mountains of cash to burn?

[–] Novocirab@feddit.org 0 points 3 hours ago* (last edited 3 hours ago) (1 children)

Thanks. Makes sense that things roughly along those lines already exist, of course. CrowdSec's pricing, which apparently starts at $900/month, seems forbiddingly expensive for most small-to-medium projects, though. Do you, or does anyone else, know of a similar solution for small or even nonexistent budgets? (Personally, I'm not running any servers or projects right now, but I may in the future.)
