Science Memes

Black Mirror AI (mander.xyz)

submitted 5 months ago by fossilesque@mander.xyz to c/science_memes@mander.xyz

[–] kassiopaea@lemmy.blahaj.zone 0 points 5 months ago (3 children)

Wouldn't Google's crawlers respect robots.txt though? Is it naive to assume that anything would?

[–] Zexks@lemmy.world 0 points 5 months ago

Lol. And they'll delist you. Unless you're really important, good luck with that.

robots.txt

User-agent: *
Disallow: /some-page.html

If you disallow a page in robots.txt, Google won't crawl the page. Even when Google finds links to the page and knows it exists, Googlebot won't download the page or see the contents. Google will usually not choose to index the URL, but that isn't guaranteed. Google may include the URL in the search index along with words from the anchor text of links to it if it feels that it may be an important page.
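For anyone who wants to poke at this, here is a minimal sketch of how a well-behaved crawler might check that rule, using Python's standard `urllib.robotparser`. The `example.com` URLs, the `"Googlebot"` user-agent string, and the helper names are assumptions for illustration, and the `User-agent: *` line is included so the `Disallow` rule belongs to a group the parser will apply. It only models the crawl decision; it says nothing about whether the URL still ends up in an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. The User-agent line is an assumption added
# so that the Disallow rule is part of a rule group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /some-page.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# example.com is a placeholder host, not a real site from the thread.
blocked = may_crawl(parser, "Googlebot", "https://example.com/some-page.html")
allowed = may_crawl(parser, "Googlebot", "https://example.com/other-page.html")

print(blocked, allowed)  # prints: False True
```

Note that nothing here enforces anything. The check only works if the crawler chooses to run it before fetching.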

[–] Aux@feddit.uk 0 points 5 months ago

It does respect robots.txt, but that doesn't mean it won't index a URL that is blocked by robots.txt. The file controls crawling, not indexing. Here's an example.

Site X has a link to sitemap.html on the front page, and that page is blocked in X's robots.txt. When Google's crawler visits site X, it first loads robots.txt, follows its instructions, and skips sitemap.html.

Now there's site Y, and it also links to sitemap.html on X. Y's robots.txt doesn't block anything on X (and it cannot), and X's robots.txt still applies to every URL on X, so the crawler still won't fetch sitemap.html. But Google now knows the URL exists from Y's link, so it may index the bare URL along with the anchor text, without ever seeing the page contents.

This behaviour is intentional.

[–] jaschen@lemm.ee 0 points 5 months ago (1 children)

It's naive to assume that Google's crawlers respect robots.txt.

[–] rosco385@lemm.ee 0 points 5 months ago

It'd be more naive to have a robots.txt file on your webserver and be surprised when web crawlers don't stay away. 😂