So can a lot of other models.
"This load can be towed by a single vehicle"
7B trash model?
The Chinese AI lab also released a smaller, “distilled” version of its new R1, DeepSeek-R1-0528-Qwen3-8B, that DeepSeek claims beats comparably sized models on certain benchmarks.
Most models come in 1B, 7-8B, 12-14B, and 27+B parameter variants. According to the docs, they benchmarked the 8B model using an NVIDIA H20 (96 GB VRAM) and got between 144 and 1,198 tokens/sec. Most consumer GPUs probably aren't going to be able to keep up with that.
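Back-of-envelope, the weights alone tell you which of those variants even fit on a consumer card. Here's a rough sketch using the standard params × bytes-per-param approximation (it ignores KV cache, activations, and runtime overhead, which add a few more GB on top, so treat the numbers as lower bounds):

```python
# Rough VRAM estimate for model weights alone: params * bytes_per_param.
# Ignores KV cache, activations, and runtime overhead, which add more on top.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, dtype: str) -> float:
    # params_billions * 1e9 params * bytes/param, converted back to GB
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for size in (1, 8, 14, 27):
    row = ", ".join(f"{d}: {weight_vram_gb(size, d):.1f} GB" for d in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")

# 8B -> fp16: 16.0 GB, fp8: 8.0 GB, int4: 4.0 GB
# So an 8B model at FP8 or INT4 fits a 12 GB card like a 3060; FP16 doesn't.
```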
It proved sqrt(2) irrational at 40 tok/s on a 3090 here. The 32B R1 did it at 32 tok/s but thought a lot longer.
On my Mac mini running LM Studio, it managed 1,702 tokens at 17.19 tok/sec and thought for 1 minute. If accurate, high-performance models could run on consumer hardware, I would use my 3060 as a dedicated inference device.
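If you want to sanity-check tok/s numbers like these yourself, here's a minimal sketch that times one completion against LM Studio's local OpenAI-compatible server. The default endpoint is http://localhost:1234/v1; the model id is an assumption, swap in whatever LM Studio shows for your loaded model:

```python
import time
from openai import OpenAI  # pip install openai

# LM Studio's default local server; the api_key is a placeholder it ignores.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="deepseek-r1-0528-qwen3-8b",  # assumption: use your loaded model's id
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=2048,
)
elapsed = time.perf_counter() - start

# Elapsed time includes prompt processing, so this slightly understates
# pure generation speed.
completion_tokens = resp.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.2f} tok/s")
```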
Depends on the quantization.
7B is small enough to run in FP8 or a Marlin quant with SGLang/vLLM/TensorRT, so you can probably get very close to the H20 on a 3090 or 4090 (or even a 3060) if you know a little Docker.
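For example, here's a minimal sketch of FP8 serving with vLLM's offline Python API. Assumptions: the Hugging Face id deepseek-ai/DeepSeek-R1-0528-Qwen3-8B, and a GPU whose generation supports your chosen quant; native FP8 wants Ada/Hopper, so on Ampere cards like a 3090/3060 a pre-quantized GPTQ/Marlin checkpoint is the usual route instead:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# quantization="fp8" quantizes weights on the fly; FP8 support varies by
# GPU generation, so on Ampere swap in a GPTQ/Marlin checkpoint instead.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    quantization="fp8",
    max_model_len=8192,           # keep the KV cache small enough for 24 GB cards
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Prove that sqrt(2) is irrational."], params)
print(outputs[0].outputs[0].text)
```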
It's distilled, so it's going to be smaller than any non-distilled model of the same quality.
I'm genuinely curious: what do you do that makes a 7B model "trash" to you? Like yeah, sure, ChatGPT now tends to beat out a Mistral 7B, but I'm pretty happy with my Mistral most of the time, if I ever even need AI at all.