I'm imagining a bleak future where, in order to access data from a website, you have to pass a three-tiered system of tests that make 'click here to prove you aren't a robot' and 'select all of the images that have a traffic light' seem like child's play.
Science Memes
Welcome to c/science_memes @ Mander.xyz!
All you need to protect data from AI is to use a non-HTTP protocol, at least for now.
Easier said than done. I know of IPFS, but how widespread and easy to use is it?
How can I make something like this?
Typical bluesky post
Cool, but as with most of the anti-AI tricks, it's completely trivial to work around. So you might stop them for a week or two, but they'll add like 3 lines of code to detect this and it'll become useless.
I hate this argument. All cyber security is an arms race. If this helps small site owners stop small bot scrapers, good. Solutions don't need to be perfect.
I worked at a major tech company in 2018 that didn't take security seriously because that was literally their philosophy: they refused to do anything until there was an absolutely perfect security solution, and treated everything else as wasted resources.
I've since left, and I continue to see them in the news for data leaks.
Small brain people man.
Did they lock their doors?
So many companies let perfect become the enemy of good and it's insane. Recently a discussion about getting our team to use a consistent formatting scheme devolved into this type of thing. If the thing being proposed is better than what we currently have, let's implement it as is; then, if you have concerns about ways to make it better, let's address those later in another iteration.
I bet someone like Cloudflare could bounce them around traps across multiple domains under their DNS and make it harder to detect the trap.
Such a stupid title, great software!
Some details. One of the major players doing the tar pit strategy is Cloudflare. They're a giant in networking and infrastructure, and they use AI (more traditional, not LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.
Making nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they're a cheap way to get training data. If you make a nonzero portion of training data poisonous, they'd have to spend increasingly many resources to filter it out. The better the nonsense, the harder it is to detect. Cloudflare is known to use small LLMs to generate the nonsense, hence requiring systems at least that complex to differentiate it.
So in short, the tar pit with garbage data actually decreases the average value of scraped data for bots that ignore do-not-scrape instructions.
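To put rough numbers on the economics argument, here is a toy calculation. It is only a sketch: the function name, the units, and every figure in it are made-up assumptions for illustration, not measurements from Cloudflare or from any real scraper.

```python
def net_value_per_page(poison_fraction, value_clean=1.0, harm_poison=2.0,
                       filter_cost=0.0, filter_recall=0.0):
    """Expected net value of one scraped page, in arbitrary hypothetical units.

    poison_fraction: share of scraped pages that are tar pit garbage
    value_clean:     value of one clean page to the scraper
    harm_poison:     cost of one garbage page that ends up in the training set
    filter_cost:     per-page cost of running a garbage detector
    filter_recall:   share of garbage pages the detector catches
    """
    clean = (1.0 - poison_fraction) * value_clean
    # Garbage that slips past the filter costs the scraper harm_poison per page.
    leaked = poison_fraction * (1.0 - filter_recall) * harm_poison
    return clean - leaked - filter_cost


no_poison = net_value_per_page(0.0)                 # baseline: every page is clean
unfiltered = net_value_per_page(0.2)                # 20% garbage, no filtering
filtered = net_value_per_page(0.2, filter_cost=0.3, filter_recall=0.9)
print(no_poison, unfiltered, filtered)
```

With these invented numbers the scraper is worse off than the clean baseline whether it filters or not, which is the whole point: the garbage does not have to be undetectable, it only has to make every page more expensive.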
The fact the internet runs on lava lamps makes me so happy.
--recurse-depth=3 --max-hits=256
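For anyone wondering what those two limits would do in practice, here is a minimal sketch of a crawler loop with a depth limit and a page budget. The flag names come from the comment above and are treated here as plain hypothetical parameters, not as options of any particular real crawler; `get_links` and `infinite_maze` are stand-ins invented for the example, and no network access happens.

```python
from collections import deque


def bounded_crawl(start, get_links, recurse_depth=3, max_hits=256):
    """Breadth-first crawl that stops at a depth limit or a page budget."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_hits:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == recurse_depth:
            continue  # do not follow links any deeper than the limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def infinite_maze(url):
    """Stand-in for a tar pit: every page links to ten new, unique pages."""
    return [f"{url}/{i}" for i in range(10)]


pages = bounded_crawl("https://example.invalid/maze", infinite_maze)
print(len(pages))
```

Even against a maze that never ends, the crawl stops once the page budget is spent. The site operator still got up to `max_hits` requests out of it, though, which matters for the detection argument made elsewhere in this thread.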
There should be a federated system for blocking IP ranges that other server operators within a chain of trust have already identified as belonging to crawlers.
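A rough sketch of how the merging side of such a system might look, using only Python's standard `ipaddress` module. Everything here is an assumption made for illustration: the "chain of trust" is flattened into a simple set of trusted operator names, the peer names and report format are invented, and the addresses are documentation-only ranges.

```python
import ipaddress
from collections import Counter


def merge_blocklists(reports, trusted, min_reports=2):
    """Return the networks reported by at least min_reports trusted peers."""
    counts = Counter()
    for peer, cidrs in reports.items():
        if peer not in trusted:
            continue  # ignore operators outside the (simplified) chain of trust
        for cidr in set(cidrs):  # a peer counts once per range
            counts[ipaddress.ip_network(cidr)] += 1
    return {net for net, n in counts.items() if n >= min_reports}


def is_blocked(ip, blocklist):
    """True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocklist)


# Hypothetical reports from three operators; "mallory" is not trusted.
reports = {
    "alice": ["192.0.2.0/24", "198.51.100.0/24"],
    "bob": ["192.0.2.0/24"],
    "mallory": ["203.0.113.0/24", "198.51.100.0/24"],
}
blocklist = merge_blocklists(reports, trusted={"alice", "bob"})
print(is_blocked("192.0.2.55", blocklist), is_blocked("198.51.100.7", blocklist))
```

Requiring more than one trusted report per range is one possible guard against a single operator getting innocent ranges blocked for everyone; a real system would also need signing, expiry, and some way to appeal.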
(Here's an advantage of Markov chain maze generators like Nepenthes: Even when crawlers recognize that they have been served garbage and delete it, one still has obtained highly reliable evidence that the IPs that requested it do, in fact, belong to crawlers.)
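To illustrate the idea, here is a minimal sketch of a Markov chain maze page generator that also records who requested its pages. It is not how Nepenthes itself is implemented; the corpus, the function names, and the `suspect_ips` set are all assumptions made up for this example.

```python
import random
from collections import defaultdict

# Tiny made-up seed text; a real maze would use a much larger corpus.
CORPUS = ("the crawler follows the link and the link leads the crawler "
          "to the maze and the maze never ends")


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain, rng, length=30):
    """Random walk through the chain, restarting at dead ends."""
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)


suspect_ips = set()


def serve_maze_page(path, client_ip, chain, links_per_page=5):
    """Return (text, links) for a maze path and log who asked for it."""
    suspect_ips.add(client_ip)  # assumption: humans never reach these paths
    rng = random.Random(path)   # same path always yields the same page
    text = babble(chain, rng)
    links = [f"{path}/{rng.randrange(10**6)}" for _ in range(links_per_page)]
    return text, links


chain = build_chain(CORPUS)
text, links = serve_maze_page("/maze/a", "203.0.113.9", chain)
print(text)
print(links)
```

Seeding the generator with the request path keeps each page stable across visits, so the maze looks like a static site rather than random noise, and `suspect_ips` is the "reliable evidence" part: whatever the crawler later does with the text, its address is already on record.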
You might want to take a look at CrowdSec if you don't already know it.
Holy shit, those prices. Like, I wouldn’t be able to afford any package at even 10% the going rate.
Anything available for the lone operator running a handful of Internet-addressable servers behind a single symmetrical SOHO connection? As in, anything for the other 95% of us that don’t have literal mountains of cash to burn?
Thanks. Makes sense that things roughly along those lines already exist, of course. CrowdSec's pricing, which apparently starts at $900/month, seems forbiddingly expensive for most small-to-medium projects, though. Do you or does anyone else know a similar solution for small or even nonexistent budgets? (Personally I'm not running any servers or projects right now, but may do so in the future.)