
DeepSeek Under Scrutiny: Did the AI Lab Use Google’s Gemini to Train Its Latest Model?
Chinese AI lab DeepSeek is facing scrutiny amid speculation that its latest R1 reasoning model may have been trained, in part, on outputs from Google’s Gemini family of AI models. The updated model, released last week, performs well on math and coding benchmarks, but DeepSeek has not disclosed the source of its training data.
Sam Paech, a developer who builds emotional intelligence evaluations for AI, has presented evidence suggesting a link between DeepSeek’s R1-0528 model and Gemini outputs. In an X post, Paech noted that the DeepSeek model favors words and expressions similar to those preferred by Google’s Gemini 2.5 Pro.
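Paech has not published his full methodology, but the underlying idea, comparing the lexical fingerprints of two models’ outputs, is simple to illustrate. Below is a minimal sketch in Python; the file names are hypothetical placeholders for collections of outputs sampled from each model on identical prompts, and this is not Paech’s actual tooling.

```python
import re
from collections import Counter
from math import sqrt

def word_profile(path: str) -> Counter:
    """Count lowercase word frequencies in a file of sampled model outputs."""
    with open(path, encoding="utf-8") as f:
        return Counter(re.findall(r"[a-z']+", f.read().lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical files: outputs sampled from each model on identical prompts.
r1 = word_profile("deepseek_r1_0528_outputs.txt")
gemini = word_profile("gemini_2_5_pro_outputs.txt")
print(f"Lexical similarity: {cosine_similarity(r1, gemini):.3f}")
```

A high similarity score between two models, relative to a baseline of unrelated models, is the kind of signal this style of analysis looks for; it is suggestive rather than conclusive.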
Another developer, known as the creator of SpeechMap, an AI “free speech eval,” noted that the DeepSeek model’s internal “thoughts,” the intermediate reasoning it generates on its way to an answer, resemble Gemini traces.
This isn’t the first time DeepSeek has been accused of training on data from rival AI models. Last December, developers noticed that DeepSeek’s V3 model often identified itself as ChatGPT, raising suspicions that it had been trained on ChatGPT chat logs.
Earlier this year, the Financial Times reported that OpenAI had found evidence linking DeepSeek to distillation, a technique in which a smaller model is trained on the outputs of a larger, more capable one. Bloomberg further reported that in late 2024, Microsoft detected large amounts of data being exfiltrated through OpenAI developer accounts believed to be linked to DeepSeek.
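In practice, the data-gathering side of distillation can be as simple as querying a stronger model’s API and saving prompt-completion pairs for supervised fine-tuning of a smaller student model. The sketch below uses the OpenAI Python SDK’s chat-completions interface; the prompt list and teacher model name are illustrative only, and this is not a description of DeepSeek’s actual pipeline.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt set; a real effort would use a large, diverse corpus.
prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that reverses a linked list.",
]

# Harvest teacher outputs and store (prompt, completion) pairs as JSONL,
# a common format for supervised fine-tuning of a smaller student model.
with open("distillation_pairs.jsonl", "w", encoding="utf-8") as out:
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",  # the "teacher": any stronger model behind an API
            messages=[{"role": "user", "content": prompt}],
        )
        pair = {"prompt": prompt, "completion": resp.choices[0].message.content}
        out.write(json.dumps(pair) + "\n")
```

This is exactly the pattern OpenAI’s terms of service forbid when the resulting data is used to build a competing model.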
While distillation is a common practice, OpenAI’s terms of service prohibit the use of its model outputs to develop competing AI.
Complicating matters, AI models often misidentify themselves and converge on similar phrasing, because the open web that supplies their training data is increasingly saturated with AI-generated content. This “contamination” makes it difficult to thoroughly filter AI outputs from training datasets.
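Such filtering is typically attempted with heuristics, and their limits help explain why contamination persists. A minimal sketch, with a deliberately tiny and hypothetical list of marker phrases:

```python
# Hypothetical marker phrases; real pipelines use far larger lists plus
# trained classifiers, and still miss most AI-generated text.
AI_MARKERS = (
    "as an ai language model",
    "i was trained by openai",
    "i am chatgpt",
)

def looks_ai_generated(doc: str) -> bool:
    """Crude heuristic: flag documents containing known self-identification strings."""
    lowered = doc.lower()
    return any(marker in lowered for marker in AI_MARKERS)

def filter_corpus(docs: list[str]) -> list[str]:
    """Keep only documents that pass the heuristic; everything else is dropped."""
    return [d for d in docs if not looks_ai_generated(d)]

corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "As an AI language model, I cannot provide an opinion on that.",
]
print(filter_corpus(corpus))  # -> only the first document survives
```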
Despite these challenges, experts such as Nathan Lambert, a researcher at AI2, believe it is plausible that DeepSeek trained on Gemini data. Lambert suggested on X that generating synthetic data from the best available API model would be a sensible strategy for DeepSeek, which is short on GPUs but flush with cash; it is effectively a way to buy more compute.
In response to concerns about distillation, AI companies have been tightening security. OpenAI now requires organizations to complete an ID verification process to access certain advanced models through its API, and Google has begun summarizing the traces generated by models available through its AI Studio platform, making it harder to train rivals on them. Anthropic has announced plans to do the same, citing the need to protect its competitive advantages.