I mean, there are many. TTS and self-hosted automation are huge in the local LLM scene.
We even have open-source "omni" models now that can ingest and output speech tokens directly, which means they pick up more semantic understanding from tone and such, they 'choose' the tone to reply with, and the output streams word-by-word. They support all sorts of tool calling.
...But they aren't easy to run. It's still in the realm of homelabs with at least an RTX 3060 and hacky Python projects.
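Concretely, the "hacky Python project" layer is often just an OpenAI-compatible client pointed at whatever local server is hosting the model. Here's a minimal sketch under assumptions: the port, the model name, and the server accepting base64 audio through the `input_audio` content part (vLLM does for audio models it supports) are all placeholders for your own setup.

```python
import base64
from openai import OpenAI

# Point the client at a local OpenAI-compatible server instead of OpenAI.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Encode a recorded voice command as base64 so it can ride in the request.
with open("kitchen_command.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# "local-omni" is a placeholder for whatever name your server registers.
stream = client.chat.completions.create(
    model="local-omni",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "What did I ask you to do?"},
        ],
    }],
    stream=True,  # word-by-word streaming, as described above
)

# Print the reply as it streams in.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```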
If you're mad, you can self-host LongCat Omni:
https://huggingface.co/meituan-longcat/LongCat-Flash-Omni
And blow Alexa out of the water with an MIT-licensed model from, I kid you not, a Chinese food delivery company.
EDIT
For the curious, see:
Audio-text-to-text (and sometimes TTS): https://huggingface.co/models?pipeline_tag=audio-text-to-text&num_parameters=min%3A6B&sort=modified
TTS: https://huggingface.co/models?pipeline_tag=text-to-speech&num_parameters=min%3A6B&sort=modified
"Anything-to-anything," generally image/video/audio/text -> text/speech: https://huggingface.co/models?pipeline_tag=any-to-any&num_parameters=min%3A6B&sort=modified
The links filter to models bigger than 6B parameters to exclude toy/test models.
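If you'd rather pull those lists from a script than click through, something like this works (assuming a recent huggingface_hub where list_models takes a pipeline_tag argument; the >6B filter from the URLs isn't exposed as an argument as far as I know, and I'm sorting by downloads here rather than last-modified):

```python
from huggingface_hub import list_models

# Same three pipeline tags as the links above, most-downloaded first.
for tag in ("audio-text-to-text", "text-to-speech", "any-to-any"):
    print(f"\n== {tag} ==")
    for m in list_models(pipeline_tag=tag, sort="downloads", direction=-1, limit=10):
        print(m.id)
```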
I do wish there were a smaller LongCat model available. My current AI node has a hard 16GB VRAM limit (yay AMD UMA limitations), so 27B can't really fit. An 8B dynamically loaded model would fit and run much better.
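For a rough sense of why 16GB is the cutoff, this is the back-of-envelope math I'm working from; it's a sketch that only counts dense weights, while KV cache, activations, and the UMA carve-out eat a few more GB on top:

```python
# Weight-only VRAM estimate; real usage is higher (KV cache, runtime overhead).
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for params in (8, 27):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits:>2}-bit ~ {weight_gib(params, bits):5.1f} GiB")
```

Even at 4-bit, 27B of weights is roughly 12.6 GiB before cache and overhead, which is why it doesn't realistically fit in 16GB, while an 8B quant leaves plenty of headroom.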