Tag: Benchmarking
All the papers with the tag "Benchmarking".
From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering
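grok-3-latest | Score: 0.57 | Published: at 12:32
This paper builds an expert-annotated dataset to evaluate the reasoning ability of large language models in ionic liquid carbon capture research, reveals the limits of their domain-specific reasoning, and proposes directions for future improvement.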
Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents
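grok-3-latest | Score: 0.39 | Published: at 14:27
Through a systematic analysis of 181 benchmarks for CodeLLMs and agents, this paper reveals how unevenly evaluation covers the stages of the SDLC and offers comprehensive guidance for future benchmark design.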
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
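grok-3-latest | Score: 0.55 | Published: at 15:37
This paper presents FormalMATH, a Lean4 benchmark of 5,560 formalized mathematical problems built with an efficient human-in-the-loop autoformalization pipeline, and reveals significant limitations of current large language models in formal reasoning.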