this post was submitted on 29 Jun 2025

90 points (95.0% liked)

Technology

72017 readers

2739 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

L3s@lemmy.world

enu@lemmy.world

technopagan@lemmy.world

L4s@lemmy.world

L3s@hackingne.ws

L4s@hackingne.ws

[JS Required] How Performant are LLM Agents(AI Chatbots) on Real World Work Tasks? They Fail 70% or More of The Time. (the-agent-company.com)

submitted 1 day ago* (last edited 1 day ago) by Pro@programming.dev to c/technology@lemmy.world

19 comments fedilink hide all child comments

collapsed inline media

Source.

top 19 comments

sorted by: hot top controversial new old

[–] wise_pancake@lemmy.ca 25 points 1 day ago (1 children)

That tracks with my experience

You have to very carefully scope things for them and have a plan for when they inevitably screw up.

[–] criss_cross@lemmy.world 4 points 1 day ago (1 children)

They’re great for bootstrapping in my experience but then really fall apart when you need it to do something surgical on a larger codebase.

[–] wise_pancake@lemmy.ca 2 points 1 day ago (2 children)

Mine too

I’ve been working on an app and it was fantastic for the basics, then I decided to refactor an API and Claude code would run for hours without really getting there.

Also a good warning: I just had to completely rewrite an mcp server I had Claude build because when I needed to update it, the whole server was one giant if/else statement and utterly unmaintainable.

[–] criss_cross@lemmy.world 2 points 1 day ago (1 children)

Yeah I was trying to pull out a nested react component and styles out of a larger component that got to be almost 1500 lines. Claude and GPT both struggled to get down what styles were required and what that subcomponent was actually doing. And generating tests around just made a fuck ton of spaghetti.

Which is fine. LLMs don’t have to be great at everything. But it’d be nice if people stopped saying I’m gonna be out of a job because of em.

Also a good warning: I just had to completely rewrite an mcp server I had Claude build because when I needed to update it, the whole server was one giant if/else statement and utterly unmaintainable.

I’ve noticed that in some of my bootstrapped code (also an MCP server :) ). I think it tends to bias towards single file solutions so it tends to be a lot less maintainable.

[–] wise_pancake@lemmy.ca 1 points 16 hours ago

Maybe fastmcp is too new for Claude, it’s much less code and still one file

Is that why they like tailwind so much? Philosophically tailwind just seems unsustainable to me, css specifying the intent of an element seems nicer.

[–] AA5B@lemmy.world 1 points 5 hours ago* (last edited 5 hours ago) (1 children)

I’m having this argument with one of my junior guys who wants to just go with the generated code. We finally got his code functional, months late, and now need to get it maintainable

AI is a useful tool that can help speed up some of the tasks of coding but it’s not magical. It’s never a final result

AI could really help me get more done if we could weed out people following it blindly

[–] wise_pancake@lemmy.ca 1 points 15 minutes ago

At least with AI it's easy to see how shitty it gets as the codebase grows working on even a toy project over a week.

Then again, if you have no frame of reference maybe that doesn't feel as awful as it should.

[–] credo@lemmy.world 3 points 1 day ago (2 children)

I use LLMs for one thing only, turn my own ADHD ideas into something others can understand.

[–] Zos_Kia@lemmynsfw.com 2 points 1 day ago (1 children)

I use it to role play historical counter factuals, like how I could win the battle of Cannae through tactics, or how I could invent the telegraph in 13th century France. It's worth every watt <3

[–] mapleseedfall@lemmy.world 1 points 1 day ago

Wait a second this is brilliant! You can roleplay like a general in any history battles and see if you can do it differently!

[–] AA5B@lemmy.world 1 points 5 hours ago

Hah, I do sort of the opposite. My management has drunk the koolaid on ai - now I use it to translate my specs into something obviously ai generated for more acceptance

[–] Dasus@lemmy.world 2 points 15 hours ago

Yesterday I honestly had the bot flip and flop on an answer while deeply apologising and everytime saying "now you can actually trust, I checked and rechecked and now it's definitely correct".

Like a simple question with two choices.

It didn't know the answer so then it prompted to choose from two answers. One which confidently said one thing and one which confidently said the other. I called it out on making me choose the answer to a question I asked. Then it decided itself. I then questioned it. It changed it mind. And around and around we went for like 20 minutes, everytime it swearing this time it isn't hallucinating.

They suck so bad for most things, but they're useful for some very niche things, but even for those, they still sort of suck at those as well a not non-significant. They're definitely shouldn't be used for official shit, but they very much are.

[+] unpossum@sh.itjust.works -8 points 1 day ago (1 children)

It’s easy to forget how fucking sci-fi the existence of these models is. I’m kind of excited to see where agent frameworks are in five years time, as well as a bit apprehensive…

[–] SMillerNL@lemmy.world 16 points 1 day ago (3 children)

We clearly read very different stories, in mine the computers are usually more competent than a 30% success rate.

Imagine if the internet at its inception failed to connect you 70% of the time. It’s not as impressive as most other inventions.

[–] Zos_Kia@lemmynsfw.com 7 points 1 day ago

Don't have to imagine it when you can just remember it. Getting online in the late 90s was a horror show, seriously dialup was super unreliable. And that was 20 years after it's inception, it was shit but also extremely popular.

[–] RedstoneValley@sh.itjust.works 6 points 1 day ago (1 children)

Similar to the crypto hype. Adoption is imminent, bro. Just a few more months, bro. Please, bro

[–] Fizz@lemmy.nz 2 points 1 day ago

Adoption is actually already there. The problem at the moment is getting people to pay for it because currently they lose money on each prompt even for paying users.

[–] unpossum@sh.itjust.works 5 points 1 day ago

Heh. Dial-up bbs, internet, and the like were fairly unstable way back when, not to mention expensive if you weren’t at a university. It’s come a long way, and I imagine artificial intelligence will as well. My main point was that even a 66% failure rate on complex real-world tasks didn’t seem possible even this century, just a few years ago. Transformers with attention really were a game changer in AI, and you have to be preternaturally blasé to ignore that. The problem, especially around here, has been how it’s sold (and to some extent that it’s sold at all), and the bubble that the hype has formed. I don’t disagree too much with that, I just think it’s a shame that it overshadows the very exciting and slightly scary tech at the bottom of the hype well, and leads to people dismissing it as advanced autocomplete, when it’s clearly something of a different degree.