
Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets

grok-3-latest
Score: 0.48
Published: at 14:00

Summary: This paper proposes A2I, a method that uses adversarial attacks to detect and correct spurious correlations introduced by the model itself in self-explaining rationalization frameworks, significantly improving rationale quality.

Keywords: LLM, Rationalization, Spurious Correlations, Adversarial Attack, Interpretability

Authors: Wei Liu, Zhongyu Niu, Lang Gao, Zhiying Deng, Jun Wang, Haozhao Wang, Ruixuan Li

Institution(s): Huazhong University of Science and Technology, Central China Normal University, iWudao Tech

Problem Background

Rationalization methods in self-explaining frameworks, such as Rationalizing Neural Predictions (RNP), extract the key information (the rationale) from the input through a cooperative game between a generator and a predictor, and make predictions from it. However, even on clean datasets, the generator's sampling process can introduce spurious correlations, causing the predictor to rely on features semantically unrelated to the label and undermining the model's interpretability and reliability.
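The generator-predictor pipeline described above can be sketched as follows. This is a minimal NumPy toy, not the paper's architecture: the hard top-k selection standing in for the generator's sampling, the linear sigmoid predictor, and all inputs are illustrative assumptions. The key structural point it shows is that the predictor only ever sees generator-selected tokens, so any systematic bias in the selection (e.g. consistently co-selecting an uninformative token with one class) becomes a shortcut the predictor can exploit even on a clean dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(scores, k):
    """Select the k highest-scoring tokens as the rationale (hard top-k mask).
    In RNP this selection is sampled and trained jointly with the predictor."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

def predictor(embeddings, mask, w):
    """Linear predictor that only sees rationale tokens: sigmoid(w . mean(masked))."""
    masked = embeddings * mask[:, None]            # zero out non-rationale tokens
    pooled = masked.sum(axis=0) / max(mask.sum(), 1.0)
    return 1.0 / (1.0 + np.exp(-pooled @ w))

# Toy input: 5 tokens with 4-dim embeddings; scores and weights are hypothetical
embeddings = rng.normal(size=(5, 4))
scores = np.array([0.1, 2.0, 0.3, 1.5, 0.2])      # e.g. from a scoring network
w = rng.normal(size=4)

mask = generator(scores, k=2)
prob = predictor(embeddings, mask, w)
print(mask)   # → [0. 1. 0. 1. 0.]  (tokens 1 and 3 form the rationale)
print(prob)
```

Because the predictor's input distribution is entirely determined by the generator, the two can converge on a mutually consistent but spurious signaling scheme; the paper's A2I method uses adversarial attacks to expose such schemes.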

Method

Experiment

Further Thoughts

The model-introduced spurious correlations revealed in this paper prompt several questions: do other self-explaining methods (e.g., attention mechanisms) exhibit similar biases? Could adversarial attacks serve as a general-purpose tool for detecting various kinds of model bias? Furthermore, could rationalization act as a data-cleaning step, extracting high-quality subsets for efficient fine-tuning of large language models, reducing compute costs while improving controllability?