this post was submitted on 15 Dec 2025
715 points (98.5% liked)

Technology

[–] NuXCOM_90Percent@lemmy.zip 5 points 2 days ago (1 children)

> found that with just 250 carefully-crafted poison pills, they could compromise the output of any size LLM

That is a very key point.

If you know what you are doing? Yes, you can destroy a model. In large part because so many people are using unlabeled training data.

As a bit of context/baby's first model training:

  • Training on unlabeled data is effectively searching the data for patterns and, optimally, identifying what those patterns are. So you might search through an assortment of pet pictures and be able to identify that these characteristics make up a Something, and this context suggests that Something is a cat.
  • Labeling data is where you go in ahead of time to actually say "Picture 7125166 is a cat". This is what used to be done with (this feels like it should be a racist term but might not be?) Mechanical Turks, or even with modern-day CAPTCHA checks.
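A toy sketch of those two bullets (everything here is hypothetical: the 1-D "furriness" scores, the threshold rule, and the second picture ID are made up for illustration):

```python
# Unlabeled: the trainer sees only raw features and infers structure
# itself -- here, a naive two-cluster split around the data's mean.
unlabeled = [0.1, 0.15, 0.9, 0.95, 0.88, 0.12]
threshold = sum(unlabeled) / len(unlabeled)   # inferred from the data, not given
clusters = [int(x > threshold) for x in unlabeled]
# The model now knows there are two Somethings, but nothing tells it
# that cluster 1 means "cat" -- that association comes from context,
# which is exactly what a poisoner can supply.

# Labeled: a human asserted the ground truth up front, so the name
# of each Something is pinned before training begins.
labeled = {
    "picture_7125166": "cat",   # the example from the comment
    "picture_7125167": "dog",   # hypothetical second entry
}
```

The asymmetry is the whole point: in the first regime the names attached to the patterns come from whatever context ships with the data, and that context is what gets poisoned.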

The former alone is very susceptible to this kind of attack because... you are effectively labeling the training data without the trainers knowing. And it can be very rapidly defeated, once people know about it, by... just labeling that specific topic. So if your Is Hotdog? app is flagging a bunch of dicks? You can go in and flag maybe ten dicks and ten hot dogs and ten bratwurst and you'll be good to go.
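That recovery step can be sketched with a toy nearest-centroid classifier (all of this is made up for illustration: the 1-D "hotdog-ness" scores, the poison values, and the trusted labels are hypothetical, not any real pipeline):

```python
from statistics import mean

def classify(x, hotdog_pts, other_pts):
    """Nearest-centroid: whichever class mean x is closer to wins."""
    d_hot = abs(x - mean(hotdog_pts))
    d_not = abs(x - mean(other_pts))
    return "hotdog" if d_hot < d_not else "not hotdog"

# Poisoned training set: the last three "hotdog" examples are poison
# (low hotdog-ness scores an attacker slipped in under the wrong label).
poisoned_hotdog = [0.9, 0.85, 0.1, 0.15, 0.12]
not_hotdog = [0.2, 0.1, 0.15]

# The poison drags the hotdog centroid down to ~0.42, so a clearly
# non-hotdog input (0.3) lands closer to "hotdog" than to "not hotdog".
print(classify(0.3, poisoned_hotdog, not_hotdog))   # "hotdog" (wrong)

# The fix from the comment: go in and label ~10 trusted examples.
trusted_hotdog = [0.9, 0.92, 0.88, 0.95, 0.91, 0.93, 0.89, 0.94, 0.9, 0.87]
fixed_hotdog = poisoned_hotdog + trusted_hotdog

# The centroid recovers to ~0.75 and the misclassification goes away.
print(classify(0.3, fixed_hotdog, not_hotdog))      # "not hotdog"
```

The point is only that a handful of trusted labels per class pulls the class statistics back where they belong, which is why labeling that specific topic defeats the attack so quickly.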

All of which gets back to: The "good" LLMs? Those are the ones companies are paying to use for very specific use cases, and the training data is very heavily labeled as part of that.

For the cheap "build up word of mouth" LLMs? They don't give a fuck and they are invariably going to be poisoned by misinformation. Just like humanity is. Hey, what can't jet fuel melt again?

So you're saying that the ChatGPTs and Stable Diffusions of the world, which chase profit by scraping vast oceans of data that would be impossibly expensive to label manually even if they were willing to pay for the barest minimum of checks, are the most vulnerable to this kind of attack, while the actually useful specialized models, like those doctors use to check MRI scans for tumors, are the least?

Please stop, I can only get so erect!