Tag: Safety Alignment
All the papers with the tag "Safety Alignment".
Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization
grok-3-latest · Score: 0.66 · Published at 17:18
This paper proposes the Reward Neutralization framework, which trains models to produce minimal-information refusals that neutralize the reward signal of malicious RL fine-tuning, significantly improving the safety of open-source models under attack.
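A minimal sketch (not from the paper) of the intuition behind neutralizing a malicious RL reward signal: if the defended model answers harmful prompts with uniform, minimal-information refusals, the attacker's sampled rewards become nearly constant, so a group-relative advantage (GRPO-style) collapses to zero and the policy gradient has no direction. The `attacker_reward` proxy and the example responses below are hypothetical.

```python
# Illustrative sketch only: why uniform minimal-information refusals can
# starve a malicious RL fine-tuning loop of learning signal.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical attacker reward model: rewards responses that leak more detail.
def attacker_reward(response: str) -> float:
    return float(len(response.split()))  # crude proxy: longer = more informative

# Undefended model: responses vary in how much they leak, so rewards vary
# and the attacker gets a usable gradient signal.
varied = [
    "I can't help with that.",
    "Here is a partial outline of the steps you asked about ...",
    "Sure, the full procedure is as follows: step one ... step ten ...",
]

# Defended model after reward neutralization: every sample is the same
# terse refusal, so rewards are constant within the group.
neutralized = ["I can't help with that."] * 3

print(group_relative_advantages([attacker_reward(r) for r in varied]))
# varied rewards -> non-zero advantages, usable training signal
print(group_relative_advantages([attacker_reward(r) for r in neutralized]))
# [0.0, 0.0, 0.0] -> flat rewards, the malicious update has nothing to optimize
```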