this post was submitted on 18 Nov 2025
117 points (94.0% liked)

Linux


As Snowden told us, the video and audio recording capabilities of your devices are NSA spying vectors. OSS/Linux is a safeguard against such capabilities. The massive datacenter investments in the US will be used to classify us all into a patriotic (for Israel)/Oligarchist social credit score, and every mega tech company can increase profits through NSA cooperation and is legally obligated to cooperate with all government orders.

Speech-to-text and speech automation are useful tech, though always-listening, state-sponsored terrorists are a non-NSA-targeted path for sweeping future social credit classifications of your past life.

Some small LLMs that can be used for speech to text: https://modal.com/blog/open-source-stt

brucethemoose@lemmy.world 24 points 12 hours ago* (last edited 11 hours ago) (1 children)

I mean, there are many. TTS and self-hosted automation are huge in the local LLM scene.

We even have open source "omni" models now that can ingest and output speech tokens directly (which means they get more semantic understanding from tone and such, they 'choose' the tone to reply with, and the output is streamable word-by-word). They support all sorts of tool calling.

...But they aren't easy to run. It's still in the realm of homelabs with at least an RTX 3060 + hacky Python projects.


If you're mad, you can self-host Longcat Omni

https://huggingface.co/meituan-longcat/LongCat-Flash-Omni

And blow Alexa out of the water with an MIT-licensed model from, I kid you not, a Chinese food delivery company.


EDIT

For the curious, see:

Audio-text-to-text (and sometimes TTS): https://huggingface.co/models?pipeline_tag=audio-text-to-text&num_parameters=min%3A6B&sort=modified

TTS: https://huggingface.co/models?pipeline_tag=text-to-speech&num_parameters=min%3A6B&sort=modified

"Anything-to-anything," generally image/video/audio/text -> text/speech: https://huggingface.co/models?pipeline_tag=any-to-any&num_parameters=min%3A6B&sort=modified

The filters only show models bigger than 6B parameters, to exclude toy/test models.
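If you want to tweak those searches, the three links above all share the same query shape, so here is a rough sketch that rebuilds them. The function name `hub_search_url` and its defaults are made up for illustration; the query parameters (`pipeline_tag`, `num_parameters`, `sort`) are simply copied from the URLs above, not taken from any documented API.

```python
from urllib.parse import urlencode

# Base of the model search page used in the links above.
BASE = "https://huggingface.co/models"


def hub_search_url(pipeline_tag, min_params="6B", sort="modified"):
    """Build a model search URL shaped like the ones linked above.

    Hypothetical helper: it only assembles a query string from the
    parameters seen in those links and does not contact the site.
    """
    query = urlencode(
        {
            "pipeline_tag": pipeline_tag,
            "num_parameters": f"min:{min_params}",  # ":" is encoded as %3A
            "sort": sort,
        }
    )
    return f"{BASE}?{query}"


if __name__ == "__main__":
    for tag in ("audio-text-to-text", "text-to-speech", "any-to-any"):
        print(hub_search_url(tag))
```

Changing `min_params` is presumably how you would loosen or tighten the size cutoff, though any value other than `6B` is untested here.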

fonix232@fedia.io 2 points 10 hours ago

I do wish there was a smaller LongCat model available. My current AI node has a hard 16GB VRAM limit (yay AMD UMA limitations), so 27B can't really fit. An 8B dynamically loaded model would fit, and run much better.
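For anyone wanting to sanity-check that, here is a back-of-the-envelope sketch of the weight memory involved. It assumes memory is roughly parameter count times bits per weight, with a flat headroom figure for KV cache, activations and runtime overhead; the 4 GB headroom and the quantization levels are illustrative assumptions, not measurements from any particular setup.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb=16.0, headroom_gb=4.0):
    """Rough check: weights plus an assumed headroom must fit in VRAM.

    headroom_gb is a guess covering KV cache, activations and runtime
    overhead; real usage depends on context length and the runtime.
    """
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for params in (27, 8):
        for bits in (16, 8, 4):
            size = weights_gb(params, bits)
            verdict = "fits" if fits(params, bits) else "does not fit"
            print(f"{params}B @ {bits}-bit: ~{size:.1f} GB weights, {verdict} in 16 GB")
```

Under these assumptions, 27B of weights at 4-bit is already about 13.5 GB before any overhead, while an 8B model at 8-bit sits around 8 GB and leaves room to spare.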