Technology

AI agents wrong ~70% of time: Carnegie Mellon study (www.theregister.com)

submitted 4 months ago by eli001@lemmy.world to c/technology@lemmy.world

177 comments

(page 2) 50 comments

[–] szczuroarturo@programming.dev 7 points 4 months ago (2 children)

I actually have a fairly positive experience with AI (Copilot using Claude, specifically). Is it wrong a lot if you give it a huge task? Yes, so I don't do that, and I use it as a very targeted solution when I'm feeling lazy. Is it fast? Also no. I could actually be faster than the AI in some cases. But is it good when you've been working for 6h and just don't have enough mental capacity left for the rest of the day? Yes. You can prompt it specifically enough to get the desired result and just accept the correct responses. Is it always good? Not really, but good enough. Do I also suck after 3pm? Yes.
My main issue is actually that it saves first and then asks whether you want to use the result. Usually not a problem, but if it crashes, the generated code stays in place, so that part sucks.

[–] Frenezul0_o@lemmy.world 7 points 4 months ago

I notice that the research didn't include DeepSeek. It would have been nice to see how it compares.

[–] Affidavit@lemmy.world 6 points 4 months ago (1 children)

"...for multi-step tasks"

[–] loonsun@sh.itjust.works 5 points 4 months ago (1 children)

It's about agents, which implies multi-step, since agents are meant to execute a series of tasks, as opposed to studies looking at base LLM performance.
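A rough sketch of why multi-step matters, under a strong simplifying assumption that is mine and not the study's: if every step succeeded independently with the same probability, the chance of finishing the whole chain would shrink geometrically with its length. The 0.9 per-step figure and the name `chain_success` are hypothetical, picked only for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds.

    Assumes steps are independent and equally reliable, which real
    agent runs are not; this is an illustration, not a model of the study.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Hypothetical per-step success rate of 0.9 across chains of growing length.
    for steps in (1, 5, 10, 20):
        print(steps, round(chain_success(0.9, steps), 3))
```

Under that assumption, even a step that works nine times out of ten leaves a ten-step task finishing well under half the time.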

[–] vane@lemmy.world 5 points 4 months ago

Reading with a CEO mindset: 3 out of 10 employees can be fired.

[–] gargle@lemmy.world 5 points 4 months ago

I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions, Claude restated its first draft, without proper BS2000 incantations, without a PERFORM statement, and without any self-referential REDEFINES. It's a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.
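For anyone unfamiliar with the term used in the comment above: a quine is a program that prints its own source exactly. The sketch below is in Python, not COBOL, and is not the BS2000 solution from the linked thread. It keeps a classic two-line quine in a string and checks that running it reproduces that string. The names `QUINE` and `run` are my own, for illustration only.

```python
import contextlib
import io

# A classic two-line Python quine, held as a string so it can be run and compared.
# The template contains %r, which gets filled with the template's own repr.
QUINE = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)\n"


def run(source: str) -> str:
    """Execute source in a fresh namespace and return everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


if __name__ == "__main__":
    # Show the program, then whether its output matches its own source.
    print(QUINE, end="")
    print(run(QUINE) == QUINE)
```

The trick is the same in any language: the program carries a description of itself as data and prints that data twice, once quoted and once as code. Fixed-format languages such as COBOL make the quoting part considerably more awkward.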

[–] brown567@sh.itjust.works 5 points 4 months ago

70% seems pretty optimistic based on my experience...
