overview for ell1e

I was wrong about robots.txt in c/technology@lemmy.world

ell1e@leminal.space 2 points 2 days ago* (last edited 1 day ago)

You look up what Googlebot does. No AI.

The page seems written to suggest that, but it doesn't explicitly say the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead users here.

Edit: I found a quote where it says Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent [...]". I guess Cloudflare doesn't trust Google to abide by the access controls, which seems sensible to me. Edit 2: What exactly the CEO believes was perhaps rightfully disputed below; it was just my guess.
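To make the quoted point concrete, here is a rough sketch of how a site could express "search yes, AI no" in robots.txt, checked with Python's standard `urllib.robotparser`. The robots.txt contents, the `allowed` helper, and the example.com URLs are made up for illustration. The sketch only shows what the rules say per token; since Google-Extended has no HTTP user agent string of its own, the server can't tell from the request which purpose a crawl serves, so whether the rule is honored is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow search crawling, opt out via the
# Google-Extended token, and keep /private/ off limits for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    for agent in ("Googlebot", "Google-Extended", "SomeOtherBot"):
        print(agent, allowed(ROBOTS_TXT, agent, page))
```

Note that the two rules are keyed on different tokens, but both would be applied to requests that arrive with the same existing Google user agent, so the split exists only on the crawler's side.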

I was wrong about robots.txt in c/technology@lemmy.world

ell1e@leminal.space 2 points 2 days ago* (last edited 2 days ago)

Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.

I was wrong about robots.txt in c/technology@lemmy.world

ell1e@leminal.space 3 points 2 days ago* (last edited 2 days ago)

So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.

I was wrong about robots.txt in c/technology@lemmy.world

ell1e@leminal.space 8 points 2 days ago

And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/

I was wrong about robots.txt in c/technology@lemmy.world

ell1e@leminal.space 8 points 2 days ago* (last edited 2 days ago)

See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.

I was wrong about robots.txt in c/technology@lemmy.world

ell1e@leminal.space 27 points 2 days ago* (last edited 2 days ago)

Often it is respected, but the resulting problem is that platforms bundle their regular crawlers with questionable AI scraping to blackmail websites into feeding AI.

For example, Googlebot, if enabled, won't just list you for search, but will also scrape your contents for Google's AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as source. I imagine LinkedinBot, given it's Microsoft's, will feed some other AI of theirs as well, on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn't going to improve.