this post was submitted on 21 Nov 2025
983 points (97.8% liked)
Technology
77035 readers
1301 users here now
This is a most excellent place for technology news and articles.
Our Rules
- Follow the lemmy.world rules.
- Only tech related news or articles.
- Be excellent to each other!
- Mod approved content bots can post up to 10 articles per day.
- Threads asking for personal tech support may be deleted.
- Politics threads may be removed.
- No memes allowed as posts, OK to post as comments.
- Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
- Check for duplicates before posting, duplicates may be removed
- Accounts 7 days and younger will have their posts automatically removed.
Approved Bots
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Thanks a ton, saves me having to navigate the slopped up search results ('AI' as a search term is SEOd to death and back a few times)
That system has the 3080 12GB and 64GB RAM but I have another 2 slots so I could go up to 128GB. I don't doubt that there's a GLM quant model that'll work.
Is ollama for hosting the models and LM Studio for chatbot work still the way to go? Doesn't seem like there's much to improve in that area once there's software that does the thing.
Oh no, you got it backwards. The software is everything, and ollama is awful. It’s enshittifying: don’t touch it with a 10 foot pole.
Speeds are basically limited by CPU RAM bandwidth. Hence you want to be careful doubling up RAM, and doubling it up can the max speed (and hence cut your inference speed).
Anyway, start with this. Pick your size, based on how much free CPU RAM you want to spare:
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
The “dense” parts will live on your 3080 while the “sparse” parts will run on your CPU. The backend you want is this, specifically the built-in llama-server:
https://github.com/ikawrakow/ik_llama.cpp/
Regular llama.cpp is fine too, but it’s quants just aren’t quite as optimal or fast.
It has two really good built-in web UIs: the “new” llama.cpp chat UI, and mikupad, which is like a “raw” notebook mode more aimed at creative writing. But you can use LM Studio if you want, or anything else; there are like a bazillion frontends out there.
And IMO… your 3080 is good for ML stuff. It’s very well supported. It’s kinda hard to upgrade, in fact, as realistically you're either looking at a 4090 or a used 3090 for an upgrade that’s actually worth it.