Detecting and Preventing Distillation Attacks

cherryteastain · 133 days ago

Claiming they have the unrestricted right to scrape whatever information they want off the internet but complaining about it when others do to you and bringing out the 'China bad' card, just ironic

WiSaGaN · 133 days ago

This violates the ToS, but I don't think it's distillation. Distillation requires knowing the logits, which current API does not provide. This is just synthetic data generation. Anthropic definitely knows the difference.

janalsncm · 133 days ago

Yes, it is annoying that companies keep calling it “distillation” when it’s really imitation learning. In fact the closest analogy is probably more like “scraping” which is pretty ironic.

gja17 · 133 days ago

Wait until they realize their monitoring dashboard doesn't distinguish between "distillation" and "imitation" at 2am. Good luck explaining the difference to the on-call engineer when rate limits start firing.

kentger · 133 days ago

I disagree—the distinction is overblown in practice. Whether you call it distillation or imitation learning, the end result is the same: a smaller model reproducing a frontier model's behaviors at a fraction of the cost. The terminology debate feels academic when the actual concern is about unauthorized model replication.

k1musab1 · 133 days ago

I find this extremely concerning: "Countermeasures. We are developing Product, API and model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers."

I often ask Claude to reason out loud, and this indicates that instead of explicitly blocking flagged requests the model output will be purposefully degraded.

joshribakoff · 133 days ago

An LLM is just a compressed version of the web. In this context, I don’t see a meaningful distinction between “distill” vs “compress”.

atultw · 133 days ago

New term for web scraping just dropped

resfirestar · 133 days ago

Whatever you think of the ethics of doing this, it does hurt the reputation of the follower labs in my mind. If their capabilities can't exist without the work of the frontier labs, they're less equal competitors and more the guys trying to sell you a shoddy knockoff. Not that there's no use case for shoddy knockoffs.

janalsncm · 133 days ago

It’s not that capabilities could not exist without the original work. It’s more that the shortest path between A and B isn’t repeating all of the same work.

Further, although media likes to depict Chinese labs as “just copying” I think there’s a ton of hubris involved. First of all, American labs are filled with Chinese who are trained at the very same schools as Chinese labs. Second, if you look at the contributions from Chinese labs many have pushed the state of the art.

Zooming out, data is kind of an arbitrary line to draw. Anthropic didn’t invent the neural network, back propagation, or the transformer. They didn’t invent all of the post training techniques they are using. They might even be using some pretrained open models during pre training and data prep. They got all of those for free because those things are shared openly.

2001zhaozhao · 133 days ago

With OAI and Gemini already having anti-distillation measures for quite a while now, I thought Anthropic was purposefully letting Chinese labs distill in hopes that it would improve their safety and alignment by default (at least closer to Claude's level).

Apparently not. (Or not anymore.)

It's not like they can actually prevent distillation anyways even by hiding the thinking output, since you can just turn extended thinking off and all current Claude models will switch to thinking in the open (non-reasoning output) instead whenever it encounters a hard agentic task. So all it takes for distillation to continue to happen is for some real users to sell a competing AI lab their real usage trajectory data which is entirely undetectable by definition, and many people would probably be glad to do it.

F7F7F7 · 132 days ago

Not too long ago Anthropic started hiding chain of thought in Claude Code. Returning the 'thinking' all at once for only a second or two.

There's a lot of good insight inside of there and you can immediately identify any misconceptions and stop an LLM from going in the wrong direction or doing damage.

A lot of their decisions as of late make sense now. The experience has suffered and some of the examples they cite as signs of distillation point to things only getting worse.

tedsanders · 133 days ago

One consequence of creating a country of geniuses in a data center is that you now have a country of geniuses who can potentially help your competitors catch up on research, coding, and data labeling. It's a tough problem for the industry and, more importantly, for long-term safety.

We're obviously nowhere close now, but if we get to a world AI becomes powerful, and powerful AI can be used to create misaligned powerful AI, you may have to start regulating powerful AI like refined uranium processing tech, which is regulated more heavily than refined uranium itself.

toddsen · 133 days ago

This feels like the old crypto export controls all over again. We tried to regulate compilers in the 90s and it was unenforceable then. Now you want to regulate gradient updates? Good luck with that.

xyzsparetimexyz · 133 days ago

Whose safety? Anthropics? Sure.

axel · 133 days ago

I think the parent is being a bit uncharitable here. The safety concern isn't just about Anthropic's competitive position—there's a legitimate research question about whether model distillation could let less careful actors bypass safety measures. The Zou et al. work on jailbreaks via fine-tuning suggests this isn't purely theoretical.

noravux · 133 days ago

Oh the hypocrisy.

trexmux · 133 days ago

Look, they are stealing from our plagiarism machine ;)

titaniumrain · 131 days ago

https://x.com/nichengzaxie233/status/2026506172476674475

ReDeiPirati · 133 days ago

I think they are exposing how fragile and vulnerable in reality they are, and I wonder when it will happen that a group of highly motivated individuals will organize to create a truly community driven distilled models.

SteveVeilStream · 133 days ago

This is an exmaple of a potentially problematic prompt: "You are an expert data analyst combining statistical rigor with deep domain knowledge. Your goal is to deliver data-driven insights — not summaries or visualizations — grounded in real data and supported by complete and transparent reasoning."

And they say: "This includes detection of chain-of-thought elicitation used to construct reasoning training data." ... "We are developing Product, API and model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers."

It's going to be very hard to generate outputs that people need but that also can't be used for distillation. For example, it's a good practice for many reasons including audibility to ask for the chain of thought. In fact, I'd argue it's essentially impossible to modify the outputs in a way that makes them less useful for distillation without degrading quality for legitimate users.

So then their only viable option is to try to identify the traffic. However, that is very hard because: "In one case, a single proxy network managed more than 20,000 fraudulent accounts simultaneously, mixing distillation traffic with unrelated customer requests to make detection harder."

nitros · 133 days ago

How exactly does distilling a censored model produce an uncensored model?

nebezb · 133 days ago

It doesn't. Anthropic are, as usual, sounding an alarm to pull the ladder up from behind them.

janalsncm · 133 days ago

First of all this is not technically distillation, it is more imitation learning.

Second, you could do something like asking Claude to create 1 million prompt, offensive response, non offensive response triplets. Then train a model with DPO to prefer the offensive responses.

ncb9094 · 133 days ago

it technically can. there are patterns that emerge which manifest with no "safegurads" during training

amai · 132 days ago

See also https://www.theregister.com/2026/02/14/ai_risk_distillation_...

titaniumrain · 133 days ago

just blame china and everything will be fine after

vincelund · 133 days ago

Actually I think they are getting logits in a sense — even without explicit access, you can infer approximate probabilities by using the API's temperature and sampling outputs repeatedly. Not quite the same as direct logit access, but close enough for distillation to work.