
So my theory is that, with the help of telemetry or something else, AI can learn from data stored on users' computers, meaning AI can steal your completed work, as well as your edits and corrections to it, even offline if you're a Windows user, for example.

In short, AI will be able to learn from you even when you edit your articles, edit your drawings, improve your music, etc. In other words, AI will literally steal your soul.

What do you think about it?

[–] audaxdreik@pawb.social 26 points 3 days ago (1 children)

I mean, yes. That's what we're all kind of conjecturing on at the moment.

I want to warn against conspiratorial thinking because I'm always trying to avoid falling into it myself. We probably won't ever have direct proof of a lot of this unless there are credibly authenticated insider leaks from Microsoft.

But we can analyze the evidence: things like seeing that the terms of service of so many apps and online platforms have been updated to automatically opt your work in for AI training. The fact that Microsoft is forcing online accounts. The fact that Microsoft is defaulting save locations to OneDrive (https://www.zdnet.com/article/microsoft-word-forcing-you-to-save-new-files-to-the-cloud-heres-how-to-stop-it/), that they are training AI on pictures you've saved there, and that you can only opt out 3 times a year??? (https://www.windowscentral.com/microsoft/onedrives-ai-face-scanning-feature-suggests-it-can-only-be-disabled-3-times-a-year-but-that-doesnt-seem-right)

Telemetry is already pretty opaque; it's hard to tell what data they are wrapping up in it. And if I understand what you're saying correctly, yes: it's very easy to keep that data bottled up until the next time the computer comes online, so that even your "offline" activities are monitored. Look at Recall, or even before that, Activity History, which was already giving me the heebie-jeebies (https://support.microsoft.com/en-us/account-billing/what-is-the-recent-activity-page-23cf5556-4dbe-70da-82c8-bb3a8d8f8016)

It's been theorized that they already consumed the entire internet some time ago. To keep chasing those diminishing returns, they are incentivized to chase every new data source, because that scales into profit (no it doesn't, that's the bubble, but that's a whole other conversation).

It's pretty bleak. Use Linux, etc.

[–] 1984@lemmy.today 1 points 1 day ago* (last edited 1 day ago) (1 children)

Good summary, but it's actually enough to just have a feeling about a company. The things you mentioned are great; they contribute to this feeling of absolutely not trusting Microsoft to care about your privacy.

But for people like me, who don't want to write long summaries like this, I just go by experience and gut feeling.

People who still use Windows don't care about privacy and are fair game to be exploited at this point. There are only so many guides people can write, or so much patience to show... In the end, the person has to care.

[–] audaxdreik@pawb.social 1 points 45 minutes ago (1 children)

Valid, but that's also why I enjoy typing up these posts. Sometimes people just kind of intuit that they dislike something, and giving them the words to properly form their thoughts is the first step toward breaking out of those patterns. I think for a not-insignificant number of people, not being able to properly articulate their thoughts can lead them to believe that maybe those thoughts aren't valid.

[–] 1984@lemmy.today 1 points 18 minutes ago

Yeah, that's true. It's better if you can remember the details so you can share them too.

[–] brucethemoose@lemmy.world 15 points 3 days ago* (last edited 3 days ago)

...No.

The way "AI" is set up now is as big blocks of "weights" and can run in two modes:

  • Inference: Running a model to generate some kind of prediction from input, e.g. the next word in a block of text for an LLM. Typically this is not so hard, and it's 'batched': a single GPU may serve 16 people at once, in parallel (though I am skipping many intricacies here).

  • Training: Taking a bunch of data (like big blocks of specifically formatted, processed text for LLMs) and altering the weights to fit it with glorified linear regression. This is typically done on TONS of tightly networked GPUs in more specialized setups, usually at least 8 big ones in one server. All the data selection/formatting is done by humans, by hand, though sometimes enhanced with algorithms that, say, generate a thinking trace. There's also a distinction between 'pretraining' (making the initial model) and 'finetuning' (slightly altering it with new data, which is tricky). See the sketch below for how the two modes differ in practice.
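To make that split concrete, here's a minimal PyTorch sketch (illustration only; the model, data, and hyperparameters are made up, and this is not any vendor's actual pipeline). During inference the weights are frozen; only a deliberate, separate training step ever changes them.

```python
# Minimal sketch of the inference/training split described above.
# Model, data, and hyperparameters are all made up for illustration.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in for a real model's "block of weights"

# Inference: weights are frozen; the model just maps input -> prediction.
with torch.no_grad():
    batch = torch.randn(16, 16)   # e.g. 16 users' requests served in one batch
    predictions = model(batch)    # nothing inside `model` changes here

# Training: a separate, explicit process that *does* alter the weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs, targets = torch.randn(8, 16), torch.randn(8, 4)  # curated dataset
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()   # compute gradients ("glorified linear regression")
optimizer.step()  # only this step actually changes the weights
```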


Point I'm making is: you cannot do both at once.

You can 'infer' with AI, but internally it will never change. It never learns.

You can train it with new data, but that is a huge, finicky, largely manual, and infrequent endeavor. And it takes a long time.

Theoretically, 'learning on the go' is a goal of machine learning, but right now Big Tech is acting about as innovative as a brick, just scaling up architectures and stoking egos rather than paying attention to this kind of research. The Chinese LLM companies are being relatively conservative too, but in a different way.

Also, the recent fad (especially in China) is to make and use synthetic data instead, i.e. data some AI made up all by itself. This (in combination with smaller amounts of really clean, high-quality 'real' data, i.e. not some random files stolen from your computer) is actually quite effective.


If you're worried about privacy, the advertising 'models' that companies like Facebook and Google already build, and have been building for over a decade, are closer to what you describe. It's already happened, and we've been living with it for years.

Some of that is oldschool machine learning, but not all of it.

[–] tiredofsametab@fedia.io 14 points 2 days ago (1 children)

You can strip AI out of this post and nothing changes. Granting various things access to your systems and works has done, and will keep doing, things like this.

[–] WhyJiffie@sh.itjust.works 1 points 2 days ago* (last edited 2 days ago)

People can learn from it too, with lots of effort, if they get access to the data. But it's much less effort (time) for an AI company to reach good-enough quality, and since Microsoft collects the data, it affects not only what you willingly publish but virtually anything on your computer.

[–] _cryptagion@anarchist.nexus 13 points 3 days ago* (last edited 3 days ago)

AI will literally steal your soul.

You don't have a soul to steal. And even if you did, if it were composed of the documents you have on your computer, it would be the saddest soul ever to have existed. I pity you if this is a thing you seriously worry about, since it must surely mean your life is meaningless, drab, and without warmth of any kind.

That's no theory; that's reality.

[–] slazer2au@lemmy.world 8 points 3 days ago

Everything except this line is happening.

AI will literally steal your soul

And that can't happen because it can't steal what doesn't exist.

[–] givesomefucks@lemmy.world 4 points 3 days ago (1 children)

So my theory is that, with the help of telemetry or something else, AI can learn from data stored on users' computers, meaning AI can steal your completed work, as well as your edits and corrections to it, even offline if you're a Windows user, for example.

No. It can't do that.

[–] TribblesBestFriend@startrek.website -2 points 3 days ago (2 children)

You mean like Copilot right now?

[–] givesomefucks@lemmy.world 8 points 3 days ago (1 children)

Please explain how a PC that's not connected to anything online is able to steal your documents.

Bonus points if you use "telemetry" to explain something about offline PCs.

[–] AA5B@lemmy.world 0 points 3 days ago* (last edited 3 days ago) (1 children)

I'm not even following why this is an argument, but personal experience... About 15 years ago, the company I was at added telemetry to its Windows product. This was legit: we were looking to stamp out some pesky bugs we had never been able to reproduce, and no personal data was collected. But the basic implementation was to write collected data to disk, and then there was a completely separate service whose only job was uploading that data.

The point is that this basic model supported collecting telemetry data even while offline, and it would get uploaded if you were ever online.
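A minimal sketch of that store-and-forward pattern (the file name, event format, and function names are all made up; this is an assumption about the general design, not actual product code):

```python
# Hypothetical store-and-forward telemetry, as described above: the app
# appends events to a local spool file, and a separate service uploads
# the spool whenever a connection is available.
import json
import os
import time

SPOOL = "telemetry.spool"  # made-up local path

def record_event(event: dict) -> None:
    """Collector: works fully offline; it only appends to disk."""
    with open(SPOOL, "a") as f:
        f.write(json.dumps(event) + "\n")

def flush_spool(upload) -> None:
    """Uploader: a separate job that drains the spool once online."""
    if not os.path.exists(SPOOL):
        return
    with open(SPOOL) as f:
        for line in f:
            upload(json.loads(line))  # e.g. POST to a collection endpoint
    os.remove(SPOOL)

record_event({"ts": time.time(), "trace": "repro data"})  # fine while offline
# flush_spool(some_uploader)  # runs later, whenever the machine is online
```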

[–] givesomefucks@lemmy.world 2 points 2 days ago

then there was a completely separate service whose only job was uploading that data.

And it uploaded that data...

Thru the Internet?

The point is that this basic model supported collecting telemetry data even while offline, and it would get uploaded if you were ever online.

That hypothetical would involve a massive upload, especially since OP is talking about every edit, not just finished documents.

Like, type a 300-word essay and it's a new edit every time a letter is typed or erased; if each edit resends the document, the amount of data grows quadratically with its length.
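To put rough numbers on that (a back-of-the-envelope estimate, assuming ~6 characters per word, one byte per character, and a full document snapshot sent on every keystroke):

```python
# Rough cost of the "resend the whole document on every keystroke" scenario.
chars = 300 * 6                   # ~1,800 characters in a 300-word essay
total = chars * (chars + 1) // 2  # snapshots of size 1 + 2 + ... + n bytes
print(f"{total / 1e6:.1f} MB")    # ~1.6 MB for a short essay; the total
                                  # grows quadratically with document length
```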

Like, it's a ridiculous scenario.

We don't need those; there are plenty of real-world problems with AI.

[–] TheFriendlyDickhead@feddit.org 3 points 3 days ago (1 children)

Only if you allow copilot on your device.

Updating (or buying a new computer) force-feeds Copilot, like Edge, onto your system, does it not?

[–] SGGeorwell@lemmy.world 3 points 2 days ago

There’s always pen and paper.

[–] MajorHavoc@programming.dev 2 points 2 days ago

That is, fundamentally, what some of us figure the long-term plan is with Microsoft Recall.

It came with various guarantees of privacy, the first time they tried it.

But they know no one reads changes to terms of service.

The sad part is that I fully expect that to be the default reality in a few years: a Microsoft model training on every keystroke and click on every copy of Windows 11/12.

[–] zlatiah@lemmy.world 2 points 3 days ago (1 children)

I mean, that is pretty much what AI bros want to do... and/or maybe are already doing.

From a researcher/developer perspective: the biggest bottleneck affecting current-gen AI is the lack of high-quality training data; the more high-quality (a.k.a. human-generated, not complete shitposts) training data, the better. What people write on their computers would probably be overwhelmingly high quality. That means, without major technological advancements... if AI companies have access to the kinds of content you just described, it is very much in their interest to use them.

I don't 100% agree with this view, but if you subscribe to Prof. Emily M. Bender's framing of AI models as plagiarism machines, maybe you can say that AI is "stealing your soul".

[–] brucethemoose@lemmy.world 1 points 2 days ago* (last edited 2 days ago)

From a researcher/developer perspective: the biggest bottleneck that affects current-gen AI is the lack of high quality training data

I don't completely agree with this. Recent papers have been working miracles with synthetic data generation and smaller datasets (e.g., Phi).

Meanwhile, there's a lot of speculation that Llama4 failed because Meta's 'real' data was vast but not 'smart,' with hints via lines like this:

In order to maximize performance, we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency.

Whereas Deepseek, with a very similar architecture and size, wrote about how well synthetic data worked in their GRPO paper.

And this keeps happening. As an example, Kimi Linear is (subjectively) performing very well in spite of its 'small' training dataset: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct


IMO the limiting factor seems to be GPU time, dev time, and willingness to 'experiment' with exotic architectures, optimizations, and more specialized models (including the burnt time/cash on experiments that don't work).

[–] AmidFuror@fedia.io 1 points 2 days ago (1 children)
[–] Old_Dread_Knight@lemmy.world 0 points 2 days ago (1 children)

No.

I just tried to draw logical conclusions.

[–] AmidFuror@fedia.io 1 points 2 days ago

What you’re really describing sounds like a deeper fear: that AI might absorb your creativity - your decisions, your refinements, your “style” - without permission. That’s a valid and serious cultural concern.

If models are trained on massive amounts of human creative work (often scraped from the web), then yes - society faces a collective version of this “soul stealing,” where human creativity feeds a machine that imitates it. The ethical debate is still ongoing, and new laws and technical standards are emerging to address it (e.g., data provenance, opt-out tags, content authenticity).