During some work with Tess, I noticed that my test instance was running horribly slowly. The CPU was spiking, and Postgres was not happy, using pretty much all the available compute.

Investigating, I found the culprit to be some crawler or possibly malicious actor sending a massive number of unscoped requests to /api/v3/comment/list. By "unscoped" I mean requests that aren't limited to a post ID. I'm not sure if this is a bug in Lemmy or if there's a legit use for fetching comments outside of a post, but I digress as that's another discussion.

After disallowing unscoped requests to the comment list endpoint (see mitigation further down), no more issue.

The kicker seemed to be that this bot / jackass was sorting by "Old" and requesting pages thousands deep.

Requests looked like this: GET /api/v3/comment/list?limit=50&sort=Old&page=16413

Since I officially shut down Dubvee, I'm not keeping logs as long as I used to. I saw other page numbers in the access log, but they were all above 10,000. From the logs I have available, the requests seem to be coming from these 3 IP addresses, though I have insufficient data to confirm this is all of them (it probably isn't).

134.19.178.167
213.152.162.5
134.19.179.211

Log Excerpt

Note that I log the query string as well as the URI. I've run a custom Nginx setup for so long, I actually don't recall if the query string is logged by default or not. If you're not logging the query string, you can still look for the 3 (known) IPs above making requests to /api/v3/comment/list and see if entries similar to these show up.

2025-09-21T14:31:59-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:00-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:01-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:01-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:12-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:13-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:13-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
2025-09-21T14:32:13-04:00 {LB_NAME}: dubvee.org, https, {LB_IP}, 134.19.179.211, - , NL, Amsterdam, North Holland, 52.37590, 4.89750, TLSv1.3, TLS_AES_256_GCM_SHA384, "GET", "/api/v3/comment/list", "limit=50&sort=Old&page=16413"
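
To check your own logs for the same pattern, here is a minimal Python sketch that counts deep, unscoped comment-list requests per client IP. It assumes the comma-separated log format shown in the excerpt above (client IP in the fourth field, URI and query string as the last two quoted fields). The names `parse_line`, `is_deep_unscoped`, `count_offenders` and the `min_page=1000` threshold are illustrative assumptions, not part of any existing tool, so adjust them to your own log format.

```python
from collections import Counter
from urllib.parse import parse_qs

ENDPOINT = "/api/v3/comment/list"


def parse_line(line):
    """Return (client_ip, uri, query) for one log line, or None if it doesn't fit.

    Assumes the custom format from the excerpt: fields separated by ", ",
    client IP in the fourth field, URI and query string quoted at the end.
    """
    fields = line.strip().split(", ")
    if len(fields) < 6:
        return None
    client_ip = fields[3]
    uri = fields[-2].strip('"')
    query = fields[-1].strip('"')
    return client_ip, uri, query


def is_deep_unscoped(uri, query, min_page=1000):
    """True for comment-list requests with no post_id and a page >= min_page."""
    if uri != ENDPOINT:
        return False
    params = parse_qs(query)  # blank values are dropped by default
    if "post_id" in params:
        return False
    try:
        page = int(params.get("page", ["1"])[0])
    except ValueError:
        return False
    return page >= min_page


def count_offenders(lines, min_page=1000):
    """Count matching requests per client IP across an iterable of log lines."""
    hits = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        client_ip, uri, query = parsed
        if is_deep_unscoped(uri, query, min_page):
            hits[client_ip] += 1
    return hits
```

Any iterable of lines works, so an open file object for your access log can be passed straight in, and calling `.most_common(10)` on the result lists the busiest addresses.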

Mitigation:

First, I blocked the IPs making these requests, but they would come back from a different one. Finally, I implemented a more robust solution.

My final mitigation was to simply reject requests to /api/v3/comment/list that did not have a post ID in the query parameters. I did this by creating a dedicated location block in Nginx that is an exact match for /api/v3/comment/list and doing the checks there.

I could probably add another check to see if the page number is beyond a reasonable number, but since I'm not sure what, if any, clients utilize this, I'm content just blocking unscoped comment list requests entirely. If you have more info / better suggestion, leave it in the comments.
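
A page-depth check along those lines could look roughly like the Python sketch below. It only models the decision so the rule can be tried against sample query strings before being ported to a proxy config; it is not what the Nginx config in this post does (that one rejects every unscoped request). The function name `should_reject` and the `MAX_UNSCOPED_PAGE = 100` threshold are assumptions made for illustration.

```python
from urllib.parse import parse_qs

# Assumed cutoff for illustration only; pick whatever depth is reasonable for you.
MAX_UNSCOPED_PAGE = 100


def should_reject(query, max_page=MAX_UNSCOPED_PAGE):
    """Decide whether a /api/v3/comment/list query string should be rejected.

    Scoped requests (post_id or saved_only present) always pass.
    Unscoped requests pass only while the page number stays at or below max_page.
    """
    params = parse_qs(query)  # parameters with blank values are dropped
    if "post_id" in params or "saved_only" in params:
        return False
    raw_page = params.get("page", ["1"])[0]
    try:
        page = int(raw_page)
    except ValueError:
        # A non-numeric page on an unscoped request is treated as hostile here.
        return True
    return page > max_page
```

For example, `should_reject("limit=50&sort=Old&page=16413")` is `True`, while the same query with a `post_id` added is `False`. The trade-off is that shallow unscoped browsing keeps working, at the cost of still serving the first pages to anyone who asks.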

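# Basically an and/or for has post_id or has saved_only.
# Despite the name, this resolves to 1 when the request is scoped (allowed)
# and to 0 when it should be rejected.
map $has_post_id:$has_saved_only $comment_list_invalid {
  "1:0"   1;
  "0:1"   1;
  "1:1"   1;
  default 0;
}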

server {

...

location = /api/v3/comment/list {

  # You'll need the standard proxy_pass headers such as Host, etc. I load those from an include file.
  include conf.d/includes/http/server/location/proxy.conf;

  # Create variables to hold a 0/1 state.
  # Both need a default, otherwise the map sees an empty value and falls through to 0.
  set $has_post_id 0;
  set $has_saved_only 0;

  # If the URL query string contains 'post_id' set the variable to 1
  if ($arg_post_id) {
    set $has_post_id 1;
  }
  # Same for 'saved_only'
  if ($arg_saved_only) {
    set $has_saved_only 1;
  }

  # If the comment_list_invalid map resolves to 0, "send" a 444 response
  # 444 is an Nginx-specific return code that immediately closes the connection
  # and wastes no further resources on the request
  if ($comment_list_invalid = 0) {
    return 444;
  }

  # Otherwise, proxy pass to the API as normal
  # (replace this with whatever your upstream name is for the Lemmy API)
  proxy_pass "http://lemmy-be/";
}

}


[–] Nawor3565@lemmy.blahaj.zone 27 points 3 months ago

That "jackass" sounds like an AI training set scraper. They're known for being incredibly brutal to the sites they scrape, ignoring robots.txt and other honor-based systems for preventing the site from getting overloaded.

[–] blaise@champserver.net 21 points 3 months ago* (last edited 3 months ago) (1 children)

I found the same IPs doing the same thing on my server, but one thing I noticed in the access log was that nginx was returning a 499 status code. That code means that the client closed the connection before the server answered the request. So this seems to be a deliberate attack rather than the rash of bots many have been dealing with recently. They just firehose out requests to DoS the server, since pagination on services with dynamic data is expensive.

I ended up creating a fail2ban rule to add any IP to my firewall blocklist that makes a bunch of 499 entries.

Edit: I also set a rate limit in nginx for any url that has a "page" query included

[–] admiralpatrick@lemmy.world 7 points 3 months ago

Good idea with the f2b integration.

I thought about that before just blocking unscoped requests to that endpoint in Nginx.

[–] admiralpatrick@lemmy.world 20 points 3 months ago* (last edited 3 months ago) (1 children)

Can't edit the post (Thanks Cloudflare! /s) but additional info:

I truncated the log excerpts in the post. The user agent string in these requests isn't shown here, but it is blank in the actual logs.
This is for Lemmy admins only. It might apply to others in some form, but this seems to be specifically exploiting a Lemmy API endpoint
My Nginx solution may have room for improvement; I was just trying to block that behavior without breaking comments in posts and move on with my day. Suggestions for improvement are welcome.

[–] ademir@lemmy.eco.br 5 points 3 months ago

I am gonna try to make it for caddy too

[–] carrylex@lemmy.world 15 points 3 months ago* (last edited 3 months ago)

Get a blocklist and set it up.

Literally all of the IPs are known bots for up to 3 years:

Oh and maybe also a rate-limiter...

[–] Blaze@lemmy.zip 10 points 3 months ago

Thanks for sharing

[–] poVoq@slrpnk.net 9 points 3 months ago (1 children)

Sounds like such unscoped requests should not be allowed in the first place? Maybe worth reporting in a Lemmy issue?

[–] admiralpatrick@lemmy.world 7 points 3 months ago (1 children)

That was my thought, but also wasn't sure since there might be a use-case I'm unfamiliar with. I vaguely recall seeing a feature request for Photon a while back to be able to just browse comments, so I assume that would be how it worked.

But yeah as it is now, it can be abused.

[–] asudox@lemmy.asudox.dev 5 points 3 months ago* (last edited 3 months ago)

It's only useful with the ModeratorView type. I've only heard of a few people using it for anything other than moderation purposes. It is useful for some types of bots, for example, but I think they should opt for a solution built on the upcoming plugin system (a webhook, for example) or on mentions. Polling this endpoint is not very efficient, and it is quite possible to miss some comments.

So I think this endpoint should be just for the modview type and authorization should therefore be required.

Or the rate limits should be more finely tunable. There are only about 4 configurable rate limits, and they cover all endpoints.

[–] Demigodrick@lemmy.zip 9 points 3 months ago* (last edited 3 months ago) (2 children)

FYI these are all on ASN 49453

The other (lazier) option is to block/challenge the ASN

[–] admiralpatrick@lemmy.world 6 points 3 months ago

That's my normal go-to, but more than once I've accidentally blocked locations that Let's Encrypt uses for secondary validation, so I've had to be more precise with my firewall blocks

[–] ademir@lemmy.eco.br 1 points 3 months ago

Good, I am challenging all ASN 49453

[–] ademir@lemmy.eco.br 8 points 3 months ago* (last edited 3 months ago) (2 children)

For Cloudflare users:
Security Rules:

(http.request.uri.path eq "/api/v3/comment/list" and not http.request.uri.query contains "post_id")

For Caddy users:

  # >>> Specific handler for /api/v3/comment/list with post_id check
  # Note: handle_path strips the matched path prefix before proxying;
  # use `handle` instead if the upstream needs to see the original path
  handle_path /api/v3/comment/list {
    # Match requests where the 'post_id' query parameter is present
    @hasPostId {
      query post_id=*
    }
    # Proxy those requests to the back-end as normal
    handle @hasPostId {
      reverse_proxy http://localhost:8536/
    }
    # Abort the connection for all requests that did not match @hasPostId
    abort
  }

[–] supakaity@lemmy.blahaj.zone 5 points 2 months ago

I found that the Caddy handler above blocked many third party clients and even Tesseract.

So instead I'm using this CEL expression to return a 444 error on match of the unscoped old-sorted 50 per-page comments past page 99:

@block_comment_spam expression <<CEL
    {http.request.uri.path} == "/api/v3/comment/list" &&
    {http.request.uri.query.limit} == "50" &&
    {http.request.uri.query.sort} == "Old" &&
    int({http.request.uri.query.page}) > 99 &&
    {http.request.uri.query.post_id} == ""
CEL

handle @block_comment_spam {
    respond 444
}

[–] admiralpatrick@lemmy.world 2 points 2 months ago

Very nice!

[–] rglullis@communick.news 6 points 3 months ago

Maybe we should introduce a gated API and charge $12 for 50k requests...

[–] chrisbit@leminal.space 6 points 3 months ago

We had this issue on and off for a few weeks at least, causing massive postgres CPU spikes. I ended up blocking large page params with an nginx regex.
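
The exact regex isn't given in this comment. As a rough illustration of the idea, the Python sketch below tests a pattern that flags any `page` parameter with four or more digits. Both the pattern and the four-digit cutoff are assumptions, not the rule actually deployed on leminal.space; a similar expression could be adapted to match against the query string in a proxy config.

```python
import re

# Matches a 'page' parameter whose value has 4+ digits (i.e. 1000 and up).
# The (?:^|&) and (?:&|$) anchors keep it from matching e.g. 'subpage=5000'
# or a value like 'page=1000abc'.
DEEP_PAGE = re.compile(r"(?:^|&)page=\d{4,}(?:&|$)")


def has_deep_page(query):
    """True when the raw query string asks for page 1000 or deeper."""
    return DEEP_PAGE.search(query) is not None
```

One caveat with a pure regex approach is that zero-padded values such as `page=0001` also match, which is probably acceptable for a blocking rule but worth knowing.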

[–] OpenStars@piefed.social 6 points 3 months ago (2 children)

This is for Lemmy I presume (or also for Piefed or Mbin)? You've modified yours heavily though, I thought, which could complicate matters. I wonder if you are having those bot scraping issues that semi-recently (a month or so ago?) started increasing in frequency. So many instances now have a human detector before letting you in whereas before it was not necessary.

[–] freamon@preferred.social 4 points 3 months ago (2 children)

PieFed has a similar API endpoint. It used to be scoped, but was changed at the request of app developers. It's how people browse sites by 'New Comments', and - for a GET request - it's not really possible to document and validate that an endpoint needs to have at least one of something (i.e. that none of 'post_id', 'user_id' or 'community_id' is individually required, but at least one of them must be present).

It's unlikely that these crawlers will discover PieFed's API, but I guess it's no surprise that they've moved on from basic HTML crawling to probing APIs. In the meantime, I've added some basic protection to the back-end for anonymous, unscoped requests to PieFed's endpoint.

[–] OpenStars@piefed.social 2 points 3 months ago

Good thinking!:-)

[–] mapto@feddit.bg 1 points 2 months ago (1 children)

Could you elaborate on this:

it's not really possible to document and validate that an endpoint needs to have at least one of something

In what sense is it not possible? I can easily see it being done in the code.

[–] freamon@preferred.social 1 points 2 months ago (1 children)

It's straightforward enough to do in back-end code, to just reject a query if parameters are missing, but I don't think there's a way to define a schema that then gets used to auto-generate the documentation and validate the requests. If the request isn't validated, then the back-end never sees it.

For something like https://freamon.github.io/piefed-api/#/Misc/get_api_alpha_search, the docs show that 'q' and 'type_' are required, and everything else is optional. The schema definition looks like:

/api/alpha/search:
    get:
      parameters:
        - in: query
          name: q
          schema:
            type: string
          required: true
        - in: query
          name: type_
          schema:
            type: string
            enum:
              - Communities
              - Posts
              - Users
              - Url
          required: true
        - in: query
          name: limit
          schema:
            type: integer
          required: false

required is a simple boolean for each individual field - you can say every field is required, or no fields are required, but I haven't come across a way to say that at least one field is required.
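
For comparison, the hand-written back-end check mentioned above ("reject a query if parameters are missing") might look something like this minimal Python sketch. It is not PieFed's actual code: the function names, the parameter list, and the error payload are all assumptions for illustration, and `args` stands for any mapping of query parameter names to string values.

```python
# Hypothetical list of parameters that count as "scoping" a comment listing.
SCOPING_PARAMS = ("post_id", "user_id", "community_id")


def missing_scope(args, scoping_params=SCOPING_PARAMS):
    """True when none of the scoping parameters has a non-empty value."""
    return not any(args.get(name) for name in scoping_params)


def validate_comment_list(args):
    """Return an (HTTP status, body) pair for a comment-list query.

    Each parameter stays optional on its own; the request is only
    rejected when every scoping parameter is absent or empty.
    """
    if missing_scope(args):
        wanted = ", ".join(SCOPING_PARAMS)
        return 400, {"error": f"at least one of {wanted} is required"}
    return 200, {}
```

This keeps every field marked as optional in the schema and enforces the "at least one" rule in code, which is the part the per-field `required` boolean cannot express.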

[–] mapto@feddit.bg 2 points 2 months ago (1 children)

Ah, I see, so you are talking about this.

Of course it is nice if things get auto-generated, but doing it yourself, both in code and documentation should never be excluded as an option.

[–] freamon@preferred.social 1 points 2 months ago

Exactly that, yeah. Thank you for the link.

[–] admiralpatrick@lemmy.world 4 points 3 months ago* (last edited 3 months ago) (1 children)

Lemmy. I added a comment above since LW wouldn't let me edit the post.

Mine's only extended with some WAF rules and I've got a massive laundry list of bot user agents that it blocks, but otherwise it's pretty bog standard.

If instances have Anubis set up correctly (i.e. not in front of /api/...) then that might not help them since this is calling the API endpoint.

[–] OpenStars@piefed.social 3 points 3 months ago

All of a sudden your edits went through - perhaps a delay caused by this same issue?

Also some related posts:

another one reporting similar attack-like activities https://lemmy.world/post/36413045
a month ago similarly https://lemmy.world/post/34310429

[–] fmstrat@lemmy.nowsci.com 5 points 3 months ago (1 children)

Isn't the same endpoint used to list all of a user's comments in their profile?

[–] admiralpatrick@lemmy.world 5 points 3 months ago

No, that's just /api/v3/user which returns both posts and comments.

[–] rglullis@communick.news 3 points 3 months ago

Not sure if there's a legit use for just fetching only comments outside of a post

The ability to see all comments is right there at the Lemmy UI.

[–] paraphrand@lemmy.world 3 points 3 months ago (1 children)

Things have been slow for me off and on in recent weeks. And today it’s quite slow.

[–] admiralpatrick@lemmy.world 3 points 3 months ago (1 children)

Unfortunately, there are many, many reasons that could be the case. I'm just putting this out there since it's easy to check for and mitigate against.

[–] paraphrand@lemmy.world 2 points 2 months ago

I appreciate the effort!

[–] ademir@lemmy.eco.br 1 points 1 month ago

I think this could end up blocking https://lemmy-meter.info/ since they make requests without a post ID.