Daily Paper Machine

Tag: Reward Modeling

All the papers with the tag "Reward Modeling".

RM-R1: Reward Modeling as Reasoning
grok-3-latest
Score: 0.83
Published: May 5, 2025 at 06:11
#LLM, #Reward Modeling, #Reasoning, #Distillation, #RLHF
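This paper introduces the RM-R1 model family, which recasts reward modeling as a reasoning task and uses distillation and reinforcement learning to significantly improve the interpretability and performance of reward models, outperforming larger open-source and commercial models.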
RM-R1: Reward Modeling as Reasoning
grok-3-latest
Score: 0.83
Published: May 5, 2025 at 06:11
#LLM, #Reward Modeling, #Reasoning, #Distillation, #Reinforcement Learning
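This paper proposes a paradigm that treats reward modeling as a reasoning task, training the RM-R1 models through reasoning-chain distillation and reinforcement learning. The approach significantly improves the interpretability and performance of reward models, outperforming larger open-source and closed-source models.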
Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning
grok-3-latest
Score: 0.48
Published: May 4, 2025 at 19:32
#Imitation Learning, #World Model, #Density Estimation, #Reward Modeling, #Latent Space
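This paper proposes Coupled Distributional Random Expert Distillation (CDRED), a reward model based on density estimation, which significantly improves the stability and performance of world-model online imitation learning and successfully addresses the challenges posed by adversarial training.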