Tag: Interpretability
All the papers with the tag "Interpretability".
Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets
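grok-3-latest | Score: 0.48 | Published: at 14:00
This paper proposes the A2I method, which uses adversarial attacks to detect and correct spurious correlations introduced by the model in self-explaining rationalization frameworks, significantly improving rationale quality.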
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
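grok-3-latest | Score: 0.44 | Published: at 21:08
This paper proposes Matryoshka SAEs and a pruning method based on power-law distributions, providing efficient strategies for progressive coding in sparse autoencoders, and analyzes in depth the trade-offs among performance, computational efficiency, and interpretability.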