AI Chatbots Remain Overconfident — Even When They’re Wrong: Large Language Models appear to be unaware of their own mistakes, prompting concerns about common uses for AI chatbots. : technology

[–] SnotFlickerman@lemmy.blahaj.zone 105 points 5 days ago (10 children)

That's because they aren't "aware" of anything.

[–] Perspectivist@feddit.uk 56 points 5 days ago (7 children)

Large language models aren’t designed to be knowledge machines - they’re designed to generate natural-sounding language, nothing more. The fact that they ever get things right is just a byproduct of their training data containing a lot of correct information. These systems aren’t generally intelligent, and people need to stop treating them as if they are. Complaining that an LLM gives out wrong information isn’t a failure of the model itself - it’s a mismatch of expectations.

[–] rc__buggy@sh.itjust.works 27 points 5 days ago

However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations

This is what everyone with a fucking clue has been saying for the past 5, 6? years these stupid fucking chatbots have been around.

[–] Modern_medicine_isnt@lemmy.world 22 points 5 days ago (3 children)

It's easy, just ask the AI "are you sure?" until it stops changing its answer.

But seriously, LLMs are just advanced autocomplete.

[–] cley_faye@lemmy.world 7 points 5 days ago

Ah, the monte-carlo approach to truth.

[–] Lfrith@lemmy.ca 7 points 5 days ago (2 children)

They can even get math wrong, which surprised me. I had to tell it the answer was wrong for it to recalculate and then get the correct answer. I had asked for simple percentages of a list of numbers.

[–] jj4211@lemmy.world 4 points 5 days ago (1 children)

Fun thing: when it gets the answer right, tell it it was wrong and then see it apologize and "correct" itself to give the wrong answer.

[–] saimen@feddit.org 2 points 5 days ago

I once gave some kind of math problem (how to break down a certain amount of money into bills) and the llm wrote a python script for it, ran it and thus gave me the correct answer. Kind of clever really.
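
For illustration, a script of the kind described might look like the minimal sketch below. The function name `break_into_bills` and the default denominations are assumptions made up for this example, not what the LLM actually wrote. The greedy approach (largest bill first) is assumed to be acceptable here because it gives the fewest bills for standard denominations like these.

```python
def break_into_bills(amount, denominations=(100, 50, 20, 10, 5, 1)):
    """Break a whole-number amount into bills, largest denomination first.

    Hypothetical helper: the denominations are an assumption for the example.
    Returns a dict mapping bill value -> number of bills used.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")

    breakdown = {}
    remaining = amount
    for bill in sorted(denominations, reverse=True):
        # divmod gives how many of this bill fit and what is left over
        count, remaining = divmod(remaining, bill)
        if count:
            breakdown[bill] = count

    if remaining:
        # Only reachable if the denominations cannot express the amount
        raise ValueError(f"cannot represent remainder {remaining}")
    return breakdown


if __name__ == "__main__":
    for bill, count in break_into_bills(286).items():
        print(f"{count} x {bill}")
```

The point is the one made in the comment: the arithmetic is done by running code, not by the text generator guessing digits.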

[–] jj4211@lemmy.world 6 points 5 days ago (1 children)

I kid you not, early on (mid 2023) some guy mentioned using ChatGPT for his work and not even checking the output (he was in some sort of non-techie field that was still in the wheelhouse of text generation). I expressed that LLMs can include some glaring mistakes and he said he fixed it by always including in his prompt "Do not hallucinate content and verify all data is actually correct."

[–] Passerby6497@lemmy.world 5 points 5 days ago (1 children)

Ah, well then, if he tells the bot to not hallucinate and validate output there's no reason to not trust the output. After all, you told the bot not to, and we all know that self regulation works without issue all of the time.

[–] jj4211@lemmy.world 5 points 5 days ago (1 children)

It gave me flashbacks when the Replit guy complained that the LLM deleted his data despite being told in all caps not to multiple times.

People really really don't understand how these things work...

[–] fodor@lemmy.zip 17 points 5 days ago

What a terrible headline. Self-aware? Really?

[–] Lodespawn@aussie.zone 17 points 5 days ago* (last edited 5 days ago) (1 children)

Why is a researcher with a PhD in social sciences researching the accuracy confidence of predictive text? How has this person gotten to where they are without being able to understand that LLMs don't think? Surely that came up when he started even considering this brainfart of a research project?

[–] rc__buggy@sh.itjust.works 10 points 5 days ago (1 children)

Someone has to prove it wrong before it's actually wrong. Maybe they set out to discredit the bots

[–] Lodespawn@aussie.zone 7 points 5 days ago (1 children)

I guess, but it's like proving your phone's predictive text has confidence in its suggestions regardless of accuracy. Confidence is not an attribute of a math function; they are attributing intelligence to a predictive model.

[–] FanciestPants@lemmy.world 3 points 5 days ago (1 children)

I work in risk management, but don't really have a strong understanding of LLM mechanics. "Confidence" is something that I quantify in my work, but it has different terms that are associated with it. In modeling outcomes, I may say that we have 60% confidence in achieving our budget objectives, while others would express the same result by saying our chances of achieving our budget objective are 60%. Again, I'm not sure if this is what the LLM is doing, but if it is producing a modeled prediction with a CDF of possible outcomes, then representing its result with 100% confidence means that the LLM didn't model any possible outcomes other than the answer it is providing, which does seem troubling.

[–] Lodespawn@aussie.zone 2 points 5 days ago (1 children)

Nah, so their definition is the classical "how confident are you that you got the answer right". If you read the article, they asked a bunch of people and 4 LLMs a bunch of random questions, then asked each respondent whether they/it had confidence their answer was correct, and then checked the answer. The LLMs initially lined up with people (overconfident), but when they iterated, shared results and asked further questions, the LLMs' confidence increased while people's tended to decrease to mitigate the overconfidence.

But the study still assumes intelligence enough to review past results and adjust accordingly, and disregards the fact that an AI isn't intelligence, it's a word prediction model based on a data set of written text tending to infinity. It's not assessing validity of results, it's predicting what the answer is based on all previous inputs. The whole study is irrelevant.
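
To make the "stated confidence vs. checked answer" comparison concrete, here is a minimal sketch of one common way to score it: mean stated confidence minus the fraction of answers that were actually correct. The function name `overconfidence` and the sample numbers are invented for illustration; this is not the study's data or necessarily its method.

```python
def overconfidence(confidences, correct):
    """Mean stated confidence minus observed accuracy.

    confidences: stated probabilities of being right, each in [0, 1]
    correct:     booleans, whether each answer was actually right
    Positive result = overconfident, negative = underconfident.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must be between 0 and 1")

    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_confidence - accuracy


if __name__ == "__main__":
    # Made-up example: high stated confidence, half the answers wrong
    stated = [0.9, 0.8, 1.0, 0.7]
    checked = [True, False, True, False]
    print(f"overconfidence: {overconfidence(stated, checked):+.2f}")
```

A respondent that adjusts after seeing results would push this number toward zero on the next round; the behavior described in the article is the number going the other way.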

[–] jj4211@lemmy.world 2 points 5 days ago

Well, not irrelevant. Lots of our world is trying to treat the LLM output as human-like output, so if humans are going to treat LLM output the same way they treat human generated content, then we have to characterize, for the people, how their expectations are broken in that context.

So as weird as it may seem to treat a statistical content extrapolation engine in the context of social science, there's a great deal of the reality and investment that wants to treat it as "person equivalent" output and so it must be studied in that context, if for no other reason than to demonstrate to people that it should be considered "weird".

[–] Baggie@lemmy.zip 14 points 4 days ago (1 children)

Oh god I just figured it out.

It was never that they are good at their tasks, faster, or more money efficient.

They are just confident to stupid people.

Christ, it's exactly the same failing upwards that produced the c suite. They've just automated the process.

[–] SnotFlickerman@lemmy.blahaj.zone 8 points 4 days ago* (last edited 4 days ago)

Oh good, so that means we can just replace the C-suite with LLMs then, right? Right?

An AI won't need a Golden Parachute when they inevitably fuck it all up.

[–] jj4211@lemmy.world 12 points 5 days ago* (last edited 5 days ago) (1 children)

They are not only unaware of their own mistakes, they are unaware of their successes. They are generating content that is, per their training corpus, consistent with the input. This gets eerie, and the 'uncanny valley' of the mistakes is all the more striking, but they are just generating content without any concept of 'mistake' or 'success', or of the content being a model for something else and not just a blend of stuff from the training data.

For example:

Me: Generate an image of a frog on a lilypad.
LLM: I'll try to create that — a peaceful frog on a lilypad in a serene pond scene. The image will appear shortly below.

Me (lying): That seems to have produced a frog under a lilypad instead of on top.
LLM: Thanks for pointing that out! I'm generating a corrected version now with the frog clearly sitting on top of the lilypad. It’ll appear below shortly.

It didn't know anything about the picture, it just took the input at its word. A human would have stopped to say "uhh... what do you mean, the lilypad is on water and frog is on top of that?" Or if the human were really trying to just do the request without clarification, they might have tried to think "maybe he wanted it from the perspective of a fish, and he wanted the frog underwater?". A human wouldn't have gone "you are right, I made a mistake, here I've tried again" and included almost the exact same thing.

But the training data isn't predominantly people blatantly lying about such obvious things or second guessing things that were done so obviously normally correct.

[–] vithigar@lemmy.ca 14 points 4 days ago* (last edited 4 days ago) (1 children)

The use of language like "unaware" when people are discussing LLMs drives me crazy. LLMs aren't "aware" of anything. They do not have a capacity for awareness in the first place.

People need to stop talking about them using terms that imply thought or consciousness, because it subtly feeds into the idea that they are capable of such.

[–] CosmoNova@lemmy.world 9 points 5 days ago

Is that a recycled piece from 2023? Because we already knew that.

[–] cley_faye@lemmy.world 9 points 5 days ago

prompting concerns

Oh you.

[–] BeMoreCareful@lemmy.world 4 points 4 days ago

There goes middle management

[–] El_guapazo@lemmy.world 4 points 5 days ago

AI evolved their own form of the Dunning Kruger effect.

[+] SGGeorwell@lemmy.world 3 points 5 days ago (1 children)

[deleted]

[–] Whitebrow@lemmy.world 12 points 5 days ago (1 children)

Not even a good use case either, especially when it spews such bullshit like “there’s no recorded instance of trump ever having used the word enigma” and “there’s 1 r in strawberry”.

LLMs are a copy paste machine, not a rationalization engine of any sort (at least as far as all the slop that we get shoved in our face, I don’t include the specialized protein folding and reconstructive models that were purpose built for very niche applications)
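
For reference, the strawberry claim is trivial to check deterministically. A minimal sketch, using only Python's built-in `str.count` (the helper name `count_letter` is made up for the example):

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))
```

Counting the characters directly gives three, not one.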

[–] Quill7513@slrpnk.net 4 points 5 days ago

they're a solid starting point for shopping now that wirecutter, slant, and others are enshittified. i hate it and it makes me feel dirty to use, and you can't just do whatever the llm says. but asking it for a list of options to then explore is currently the best way i've found to jump into things like outdoor basketball shoe options

[–] melsaskca@lemmy.ca 3 points 5 days ago (2 children)

If you don't know you are wrong, when you have been shown to be wrong, you are not intelligent. So A.I. has become "Adequate Intelligence".

[–] MonkderVierte@lemmy.zip 4 points 5 days ago* (last edited 5 days ago)

That definition seems a bit shaky. Trump & co. are mentally ill but they do have a minimum of intelligence.

[–] etherphon@lemmy.world 2 points 5 days ago (1 children)

Sounds pretty human to me. /s

[–] shalafi@lemmy.world 3 points 5 days ago

Sounds pretty human to me. no /s

[–] Etterra@discuss.online 2 points 5 days ago

Confidently incorrect.

[–] kameecoding@lemmy.world 2 points 5 days ago

Oh shit, they do behave like humans after all.

[–] RoadTrain@lemdro.id 2 points 5 days ago

About halfway through the article they quote a paper from 2023:

Similarly, another study from 2023 found LLMs “hallucinated,” or produced incorrect information, in 69 to 88 percent of legal queries.

The LLM space has been changing very quickly over the past few years. Yes, LLMs today still "hallucinate", but you're not doing anyone a service by reporting in 2025 the state of the field over 2 years before.
