this post was submitted on 09 Aug 2025
129 points (95.1% liked)

Fediverse

A community to talk about the Fediverse and all its related services using ActivityPub (Mastodon, Lemmy, KBin, etc).

If you want help with moderating your own community, head over to !moderators@lemmy.world!

Learn more at these websites: Join The Fediverse Wiki, Fediverse.info, Wikipedia Page, The Federation Info (Stats), FediDB (Stats), Sub Rehab (Reddit Migration)

founded 2 years ago
all 29 comments
[–] FaceDeer@fedia.io 46 points 4 days ago (4 children)

I don't see why everyone's surprised about this. The Fediverse is running on ActivityPub, an open protocol whose purpose is to broadcast the content we post here to anyone who wants it. Of course it's being used to train AI, why wouldn't it be?

[–] OpenStars@piefed.social 33 points 4 days ago (3 children)

Except iirc, they aren't scraping "properly" (read: efficiently at least, setting aside morality for the sake of discussing this component in isolation), and are causing traffic troubles. If only they took the time to install an actual instance themselves then nobody would care in the slightest (again, ignoring the morality part, for now).

TLDR: they are being dicks about it, bc offering everything we have for free is not enough for them.

[–] scytale@piefed.zip 11 points 4 days ago (3 children)

But if they do it the “proper” way, they won’t be able to grab the data if instances defederate from them, right? And that’s what the majority of instances will do.

[–] FaceDeer@fedia.io 9 points 4 days ago (1 children)

Assuming you know which instances are the ones they're collecting data from. It could be any instance.

[–] OpenStars@piefed.social 6 points 4 days ago* (last edited 4 days ago)

You are absolutely correct there, in that hypothetical scenario if they were to attempt to hide their traffic among normal instance activities.

To add a bit more detail to my previous answer, there were some prior discussions about this topic, citing some of the most popular instances of the entire Threadiverse having been targeted by their normal DDOS-like approach:

[–] OpenStars@piefed.social 1 points 4 days ago

I do not know enough about the ActivityPub protocol to answer that. I did think that federation at least used to be the default many years ago but am not sure about the current status of that. Indeed, detection and subsequent blocking will always be a cat-and-mouse game, but use of ActivityPub might at least delay the former part? And how would anyone find out, compared to e.g. if not a single-person household then at least a small community instance just wanting to pull down all the content across the Fediverse to read up on?

[–] MrKaplan@lemmy.world 9 points 4 days ago (1 children)

of all the scrapers we see, the requests identified as originating from Meta seem to be well behaved overall. they appear to (mostly) be respecting robots.txt where present and their request volume to Lemmy.World is only averaging slightly above 5 requests per minute over the last 2 weeks. they also don't spoof their user agents to pretend to be web browsers, or at least I have not seen credible accusations of this happening.
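To make "respecting robots.txt" and "5 requests per minute" a bit more concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.social` URLs, and the helper names `is_allowed` and `requests_per_minute` are all hypothetical and only for illustration; `meta-externalagent` is assumed here as the crawler's user-agent token, and this is not how Lemmy.World actually measures anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real instance's file will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: meta-externalagent
Disallow: /

User-agent: *
Disallow: /api/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def requests_per_minute(request_count: int, days: float) -> float:
    """Average request rate over a period given in days."""
    return request_count / (days * 24 * 60)


if __name__ == "__main__":
    post_url = "https://example.social/post/1"
    api_url = "https://example.social/api/v3/post"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "meta-externalagent", post_url))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", post_url))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", api_url))
    print(requests_per_minute(100_800, 14))
```

A well-behaved crawler runs a check like `is_allowed` before each fetch and skips anything disallowed. For scale, an average of 5 requests per minute sustained over 2 weeks works out to roughly 100,800 requests in total.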

[–] OpenStars@piefed.social 3 points 4 days ago

Thank you for sharing that data 😊

[–] carotte@lemmy.blahaj.zone 5 points 4 days ago

i mean, that’s exactly what they did with threads, and many instances defederated from it because they didn’t want to have their data scraped by meta

[–] Microw@piefed.zip 5 points 4 days ago (1 children)

That doesn't necessarily mean that training AI on this data is legal. Especially when several of these instances had legal documents in place specifically forbidding this kind of use.

[–] FaceDeer@fedia.io 10 points 4 days ago

There are some lawsuits in motion about this and the early signs are that it is indeed legal. For example, in Kadrey et al v. Meta the judge issued a summary judgment that training an AI on books was "highly transformative" and fell under fair use, and similarly in Bartz, Graeber and Johnson v. Anthropic the judge ruled that training an AI on books was fair use. I always expected this would be the case since an AI model does not literally contain the training material it was trained on, it learns patterns from the training material but that's not the same as the literal expression of the training material. Since the training material isn't being copied there's nothing for copyright to restrict here.

[–] Eggyhead@lemmings.world 4 points 4 days ago

At this point, I appreciate that anyone can scrape it. Not just Reddit or Meta exclusively, but any startup that wants to compete. Sure, Meta and the biggies have an easier time of it, but at least they don’t get it all only for themselves.

[–] Corelli_III@midwest.social 0 points 4 days ago

it isn't about surprise, silly goose, it's about moving the interaction from a suspected unknown to a known interaction in our collective threat models

silly goose

[–] Excrubulent@slrpnk.net 25 points 5 days ago
[–] kbal@fedia.io 14 points 5 days ago (1 children)

I see that shitposter.club is on the list. Good to know they're using only the highest-quality training material.

[–] sentient_loom@sh.itjust.works 3 points 4 days ago (1 children)

I tried to visit but their security certificate is expired. Are they still a legit site?

[–] kbal@fedia.io 1 points 4 days ago

Moved to shitposter.world according to their site with the expired cert, but I haven't seen as much on fedi from the new domain as I used to from the old one.

[–] flamingos@feddit.uk 12 points 4 days ago (2 children)

Copy and pasting my own list from here

List of instances

beehaw.org
furry.engineer
ibe.social
fediworld.de
framatube.org
trailers.ddigest.com
nrw.social
lemmynsfw.com
video.hardlimit.com
digitalcourage.social
xn--baw-joa.social
tube.kockatoo.org
equestria.social
wisskomm.social
social.anoxinon.de
freiburg.social
toobnix.org
toot.bike
mstdn.lalafell.org
peertube.linuxrocks.online
social.rebellion.global
mastodon.cipherbliss.com
social.sdf.org
corteximplant.com
typo.social
www.404media.co
mastodon.ml
video.liberta.vip
tilvids.com
todon.eu
hessen.social
digipres.club
shigusegubu.club
mastodon.me.uk
zdf.social
mastodon.sdf.org
spore.social
kolektiva.media
gruene.social
share.tube
nso.group
mastouille.fr
masto.es
vivaldi.com
literatur.social
mstdn.mx
kirche.social
mastodon.hams.social
federation.network
lile.cl
todon.nl
betweenthelions.link
ipv6.social
linuxrocks.online
peertube.otakufarms.com
pawb.social
mastodon-belgium.be
jasette.facil.services
machteburch.social
mastodont.cat
mastodon.eus
eupolicy.social
social.bau-ha.us
toot.berlin
amicale.net
hexbear.net
mastodon.bida.im
reddthat.com
shelter.moe
mastodon.nl
dju.social
bonn.social
mstdn.chrisalemany.ca
social.sciences.re
tldr.nettime.org
lemy.lol
climatejustice.social
rollenspiel.social
mastodon.org.uk
social.kyiv.dcomm.net.ua
pouet.chapril.org
ecoevo.social
social.politicaconciencia.org
darmstadt.social
peertube.tv
lemmus.org
libretooth.gr
hackers.town
tooter.social
anarchism.space
diode.zone
video.infosec.exchange
mastodon.thirring.org
aussie.zone
social.bund.de
apobangpo.space
shitpost.cloud
berlin.social
toot.aquilenet.fr
social.beachcom.org
lemmygrad.ml
mastodon.radio
nerdculture.de
programming.dev
decayable.ink
kafeneio.social
functional.cafe
things.uk
fuzzies.wtf
diaspodon.fr
dalek.zone
sunbeam.city
tooting.ch
fediscience.org
mastodon.tetaneutral.net
social.librem.one
im-in.space
lemmy.sdf.org
legal.social
post.lurk.org
mastodon.uy
noc.social
tube.pol.social
lemmy.ml
don.linxx.net
infosec.pub
kolektiva.social
masto.bike
furries.club
zhub.link
lemmy.world
openbiblio.social
mastodon.zaclys.com
mamot.fr
clacks.link
discuss.tchncs.de
cyberplace.social
graz.social
pl.kitsunemimi.club
mastodonczech.cz
masto.nobigtech.es
hostux.social
pawb.fun
mastodon.trueten.de
norden.social
systemli.social
mander.xyz
ciberlandia.pt
woem.men
sopuli.xyz
lemmy.ca
feddit.uk

[–] rollin@piefed.social 1 points 4 days ago

off-topic but wow, it's great to see so many lemmy instances up and running 🥰

it really looks like we're well on the way to hitting critical mass

[–] Cris_Color@lemmy.world 10 points 5 days ago (1 children)

This is only a loosely related thought, but are there any new foss licenses or anything that prohibit ai usage? I know it'll be ignored but it feels like explicitly disallowing things could be important in opening the door to successful legal challenges to ai scraping and theft...

[–] FaceDeer@fedia.io 13 points 4 days ago (1 children)

Case law is still pretty young in this area, but it's looking like there's nothing actually against copyright about the training of AI on copyrighted content. It's not something that a license can restrict because the trainers can simply reject the license and carry on training under the basics of what the law allows them to do anyway.

Open source licenses only have power because they grant permissions that people normally wouldn't have and put conditions on those permissions. If you don't need those permissions then you don't have to be bound by those conditions.

[–] Cris_Color@lemmy.world 6 points 4 days ago

Ahhh, that sucks ass :(

Thank you for expanding my understanding of the problem!

[–] shnizmuffin@lemmy.inbutts.lol 8 points 5 days ago

Those tasteless frauds!

[–] Mpeach45@lemmy.world 2 points 4 days ago

Can we poison our posts by putting a nonsense “signature” at the end of each of them?

[–] ms_lane@lemmy.world 1 points 2 days ago

They're training on Hexbear

That's... amusing.