ChatGPT Reportedly Using Grokipedia To Cite Information, Prompting ‘LLM Grooming’ Concerns

Low Boon Shen

We know that AI models like ChatGPT rely on a boatload of data for training, which has already sparked a long debate over what constitutes fair use. However, as AI increasingly dominates the internet, there comes a point where AI-generated content itself becomes part of the citation and training data. As The Guardian found, this is already happening with OpenAI’s latest large language model.

ChatGPT Sourcing Grokipedia, AI Trains AI

Let’s lay out what’s involved here. The Guardian reported that ChatGPT’s GPT-5.2 has been sourcing some of its information from Grokipedia. If that name sounds familiar, it’s Grok’s entirely AI-generated competitor to Wikipedia. (Grok originates from xAI, which is owned by Elon Musk, who also owns X.) The report indicates that the information sourced from Grokipedia includes uncommon topics like Iranian politics. It should also be noted that editors cannot directly edit what Grokipedia shows; the only way is to give it prompts describing how the content should be edited.

Now, irrespective of what Grokipedia’s political bias may be, this kind of AI-feeds-AI situation is, on paper, potentially disastrous for today’s AI models. There are two aspects to this: model collapse and LLM grooming. The first is a fairly well-known theory that, when AI is trained on enough hallucinated or nonsensical AI-generated data over time, output quality degrades and the information becomes less reliable. Essentially, it’s like a snake eating its own tail.
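
To get a feel for the feedback loop behind model collapse, here is a minimal toy sketch in Python. It is only an analogy and not how ChatGPT or any real model is trained: the "model" is just a bell curve described by a mean and a spread, and each generation is fitted solely to a small batch of samples produced by the previous generation. The function name `simulate_collapse` and its parameters are made up for this illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy feedback loop: each generation learns only from the previous one's output.

    Returns the spread (standard deviation) of the "model" at every generation,
    starting with the original spread of 1.0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for real, human-made data
    history = [std]
    for _ in range(generations):
        # The current model generates a small batch of synthetic data...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic batch alone.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 200):
        print(f"generation {generation:3d}: spread = {history[generation]:.6f}")
```

In this toy setup the spread tends to shrink as the generations pile up, because each round can only reproduce (and slightly under-represent) the variety it was shown. Real language models are far more complex, so treat this as an intuition aid rather than evidence of what happens inside GPT-5.2.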

The second part, LLM grooming, is the weaponization of that principle. Threat actors can publish large amounts of nonsensical data, or even disinformation, which AI models will eventually scrape into their training datasets; over time, this shifts the model’s behavior toward what the threat actors intended, such as treating the disinformation as legitimate and presenting it to the user as real information. (There’s also the ‘Nightshade’ technique that research teams presented back in 2023, which follows similar logic.)

This certainly won’t help the companies behind these AI models deal with the constant skepticism from some parts of the public. It should be stressed that, despite improvements, AI models fundamentally cannot eliminate every possible cause of hallucinations, so it is best to double-check whatever they output.

Pokdepinion: Perhaps AI companies will eventually have to decide whether introducing AI-generated datasets is worth the risk of model collapse.
