Daily Paper Machine

Tag: Interpretability

All the papers with the tag "Interpretability".

Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets
grok-3-latest
Score: 0.48
Published: May 4, 2025 at 14:00
#LLM, #Rationalization, #Spurious Correlations, #Adversarial Attack, #Interpretability
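This paper proposes A2I, a method that uses adversarial attacks to detect and correct spurious correlations introduced by the model itself in self-explaining rationalization frameworks, significantly improving rationale quality.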
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
grok-3-latest
Score: 0.44
Published: April 30, 2025 at 21:08
#LLM, #Sparse Autoencoder, #Progressive Coding, #Feature Extraction, #Interpretability
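This paper proposes Matryoshka SAEs and a pruning method based on power-law distributions as efficient strategies for progressive coding in sparse autoencoders, and analyzes in depth the trade-offs among performance, computational efficiency, and interpretability.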