this post was submitted on 29 Aug 2025
296 points (99.7% liked)

Not The Onion


"The surveillance, theft and death machine recommends more surveillance to balance out the death."

[–] tal@lemmy.today 4 points 9 hours ago* (last edited 9 hours ago)

$3-10k...not getting the speeds and quality

I mean, that's true. But the hardware that OpenAI is using costs more than that per pop.

The big factor in the room is that unless the tech nerds you mention are using the hardware for something that requires keeping it under constant load (which occasionally interacting with a chatbot isn't going to do), it's probably going to be cheaper to share the hardware with others, because sharing keeps the (quite expensive) hardware at a higher utilization rate.

I'm also willing to believe that there is some potential for technical improvement. I haven't been closely following the field, but one thing that I'll bet is likely technically possible (if people aren't banging on it already) is redesigning how LLMs work so that they don't need to be fully loaded into VRAM at any one time.

Right now, the major limiting factor is the amount of VRAM available on consumer hardware. Models get fully loaded onto a card. That makes for nice, predictable computation times on a query, but it's the equivalent of...oh, having video games limited by needing to load an entire world onto the GPU's memory. I would bet that there are very substantial inefficiencies there.

The largest consumer GPU you're going to get has something like 24GB of VRAM, and only some workloads can be split across multiple cards to pool the VRAM of several of them.

You can partially mitigate that with something like a 128GB Ryzen AI Max+ 395 processor-based system. But you're still not going to be able to stuff the largest models into even that.
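To put rough numbers on why even 128GB gets tight, the raw weights alone need (parameter count × bytes per parameter), before activations or KV cache. A quick back-of-the-envelope sketch (figures are illustrative, not benchmarks):

```python
# Rough VRAM math: weight storage is just parameter count times bytes
# per parameter. Activations and KV cache come on top of this.

def weights_gib(n_params_billion: float, bytes_per_param: float) -> float:
    """Size of the raw weights in GiB."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

# A 70B-parameter model at fp16 (2 bytes/param): ~130 GiB,
# far beyond a single 24GB card.
print(round(weights_gib(70, 2.0), 1))

# The same model 4-bit quantized (0.5 bytes/param): ~33 GiB,
# still over 24GB, though it would fit in a 128GB unified-memory box.
print(round(weights_gib(70, 0.5), 1))
```

This is why quantization alone only goes so far: shrinking bytes-per-parameter helps, but the biggest models outrun any single consumer card regardless.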

My guess is that it is probably possible to segment sets of neural-net edge weights into "chunks" that are unlikely to be needed at the same time, keep the currently-unimportant chunks unloaded, and skip running them. One would need a mechanism to identify when a chunk likely does become important, and swap chunks in and out. That will make query times less predictable, but also probably a lot more memory-efficient.
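The swap-on-demand idea is basically a cache. A toy sketch, with a dict standing in for disk/CPU memory and a fixed-capacity LRU cache standing in for VRAM (all names here are hypothetical, not any real framework's API):

```python
# Toy sketch of swapping weight "chunks" on demand: chunks live in slow
# storage, and only a fixed number fit in fast memory at once. Evict the
# least-recently-used chunk when space runs out.
from collections import OrderedDict

class ChunkCache:
    def __init__(self, storage: dict, capacity: int):
        self.storage = storage          # "disk": chunk name -> weights
        self.capacity = capacity        # how many chunks fit in "VRAM"
        self.resident = OrderedDict()   # currently loaded chunks
        self.loads = 0                  # swap-ins: the expensive operation

    def get(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)        # mark recently used
        else:
            self.loads += 1                        # simulate a slow transfer
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict least recent
            self.resident[name] = self.storage[name]
        return self.resident[name]

disk = {f"layer{i}": [i] for i in range(8)}
cache = ChunkCache(disk, capacity=3)
for name in ["layer0", "layer1", "layer0", "layer2", "layer3", "layer0"]:
    cache.get(name)
# 6 accesses, but only 4 slow loads: repeated chunks stayed resident.
print(cache.loads)
```

The unpredictability mentioned above shows up directly here: a query whose chunks are all resident is fast, while one that triggers several swap-ins pays the transfer cost.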

IIRC from my brief skim, some models do have specialized sub-neural-networks, an architecture called "MoE", for "Mixture of Experts", where only a few "expert" sub-networks activate for any given token. It might be possible to unload some of those experts, though one is going to need more logic to decide when to include and exclude them, and existing systems are probably not optimal for this:

kagis

Yeah, sounds like it:

https://arxiv.org/html/2502.05370v1

fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving

Despite the computational efficiency, MoE models exhibit substantial memory inefficiency during the serving phase. Though certain model parameters remain inactive during inference, they must still reside in GPU memory to allow for potential future activation. Expert offloading [54, 47, 16, 4] has emerged as a promising strategy to address this issue, which predicts inactive experts and transfers them to CPU memory while retaining only the necessary experts in GPU memory, reducing the overall model memory footprint.