ClockBench: Even the best AI models can't reliably read the clock (clockbench.ai)

submitted 2 days ago by Pro@programming.dev to c/technology@lemmy.world

cross-posted from: https://programming.dev/post/37407786

[–] MHLoppy@fedia.io 19 points 2 days ago (2 children)

The human level accuracy is less than 90%!?

[–] panda_abyss@lemmy.ca 23 points 2 days ago

Some of those don’t have tick marks. I hate clocks like that, they’re difficult to read.

I’m surprised it’s near 90, a whole generation has grown up with digital clocks everywhere.

[–] CouldntCareBear@sh.itjust.works 14 points 2 days ago (1 children)

Have a look at the clock faces they're using to benchmark and it'll make more sense.

[–] MHLoppy@fedia.io 6 points 2 days ago

Really wish they published the whole dataset. They don't specify on the page or in the paper what the full set was like, and the GitHub repo only has one of the easy-to-read ones. If >=10% of the set is comprised of clock faces designed not to be readable then fair enough.

[–] Khuda@lemmy.world 6 points 2 days ago

we need a human bench for how many people can read the room

[–] SnoringEarthworm@sh.itjust.works 6 points 2 days ago* (last edited 2 days ago) (1 children)

This seems like a dumb benchmark.

ClockBench evaluates whether models can read analog clocks - a task that is trivial for humans, but current frontier models struggle with.

What do you mean trivial? Most humans I know can't read the most basic white-background-big-black-numbers clocks.

Someone rigged the jury to get 90% on this:

[image not shown]

[–] MCasq_qsaCJ_234@lemmy.zip 2 points 1 day ago

Rather, ClockBench will end up improving AI in this regard over the next few years. Developers need benchmarks like this to identify a model's strengths and weaknesses so they can improve it in future versions.

[–] Endymion_Mallorn@kbin.melroy.org 2 points 2 days ago

So LLMs operate like blind people - like every other web scraper and chatbot to exist.