Paper: Antidistillation Sampling
Authors: Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter (CMU)
Published: April 17, 2025 (arXiv)
Problem Background
Detailed reasoning traces generated by large language models (LLMs), while powerful, create a vulnerability: competitors can harvest these publicly exposed traces for model distillation, cheaply replicating a powerful model's capabilities. This leads to intellectual property leakage and potential security risks (e.g., bypassing safety alignments).
Proposed Method: Antidistillation Sampling
- Core Idea: Poison the reasoning traces generated by the original (teacher) model to hinder distillation, without significantly compromising the teacher model’s own performance.
- Implementation: Applied as a sampling strategy during token generation:
  - At each decoding step, it combines the teacher model's original next-token probabilities with an "antidistillation" adjustment term.
  - This adjustment term uses a small proxy model and the gradient of a loss on a downstream task to estimate which tokens would be most useful to a distilling student; such tokens are penalized so that selecting them degrades distillation.
  - The next token is then sampled from this adjusted probability distribution.
- Key Aspect: The original teacher model is not modified. The adjustment happens only during inference sampling, and the poisoning strength is controlled to minimize impact on the teacher’s utility.
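The adjusted sampling step described above can be sketched numerically. This is a minimal illustration, not the paper's exact estimator: the `penalty` vector is a hypothetical stand-in for the proxy-gradient adjustment term, `lam` is the poisoning strength, and the logits are toy values rather than real model outputs.

```python
import numpy as np

def antidistillation_sample(teacher_logits, penalty, lam=0.5, temperature=1.0, rng=None):
    """Sample the next token from an adjusted distribution.

    teacher_logits : next-token logits from the (unmodified) teacher model.
    penalty        : per-token estimate of how useful each candidate token
                     would be to a distilling student (hypothetical stand-in
                     for the paper's proxy-gradient term).
    lam            : poisoning strength; lam=0 recovers ordinary sampling.
    """
    rng = rng or np.random.default_rng()
    adjusted = teacher_logits / temperature - lam * penalty
    adjusted = adjusted - adjusted.max()            # numerical stability
    probs = np.exp(adjusted) / np.exp(adjusted).sum()
    return rng.choice(len(probs), p=probs), probs

# Toy example with a 5-token vocabulary: token 0 is the teacher's favorite
# but is also (by assumption) the most useful one for a distilling student.
teacher_logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
penalty = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
tok, probs = antidistillation_sample(
    teacher_logits, penalty, lam=1.0, rng=np.random.default_rng(0)
)
# With lam=1.0, probability mass shifts away from token 0.
```

Note how the teacher itself is untouched: only the sampling distribution is reshaped, and `lam` lets one trade off teacher utility against distillation resistance.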
Experimental Results
- Effectiveness: Antidistillation sampling sharply reduces the accuracy of student models distilled from the poisoned traces, while largely preserving the teacher model's own accuracy on benchmarks such as GSM8K and MATH.
- Superiority: Compared with simply raising the sampling temperature (which drastically degrades teacher performance), antidistillation sampling offers a much better trade-off between teacher utility and distillation resistance.
- Overhead: The main cost is two additional forward passes through the small proxy model for each generated token.
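One way an adjustment term can fit this two-pass budget is a finite difference of proxy log-probabilities between the proxy's original weights and weights perturbed along a downstream-loss gradient direction. The sketch below illustrates only that cost structure with a toy one-layer "proxy"; the `proxy_logprobs` function and the random stand-in gradient are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, DIM = 5, 8

def proxy_logprobs(weights, hidden):
    """Toy proxy 'model': one linear layer followed by log-softmax."""
    logits = hidden @ weights
    logits = logits - logits.max()                  # numerical stability
    return logits - np.log(np.exp(logits).sum())

theta = rng.normal(size=(DIM, VOCAB))      # proxy weights
grad_dir = rng.normal(size=(DIM, VOCAB))   # stand-in for a downstream-loss gradient
grad_dir /= np.linalg.norm(grad_dir)
hidden = rng.normal(size=DIM)              # current decoding context (toy)
eps = 1e-3

# The two extra proxy forward passes per generated token:
lp_base = proxy_logprobs(theta, hidden)                   # pass 1: original weights
lp_pert = proxy_logprobs(theta + eps * grad_dir, hidden)  # pass 2: perturbed weights

# Per-token finite-difference adjustment over the vocabulary.
delta = (lp_pert - lp_base) / eps
```

Because the proxy is small relative to the teacher, these two extra passes add only modest latency per generated token.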