AI LLM (Large Language Model) benchmark metrics and platforms: evaluation metrics, evaluation standards, and evaluation platforms for LLM tasks such as writing, programming, mathematics, and so on.
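See also the main item: /LLM-.

Combined Information (actively updated)

https://llm-stats.com/ includes prices, model sizes, etc., with human votes.

Hallucination Leaderboard

Training that scores "no output" as zero and "output containing errors" as merely a low score rewards guessing over abstaining, which leads LLMs to hallucinate.
Newer models may hallucinate more than older ones because they tend to speculate more boldly. For example, Gemini 3 (13.5%) has a much higher hallucination rate than Gemini 2.5 (3.3%).
vectara/hallucination-leaderboard

Coding, Programming

Baseline: human experts, 97%.
SWE-bench leaderboard (last update 2025-02): top score 33%, Claude 3.7 Sonnet.
Focuses on practical application. Developed by Princeton University. Models must fix real GitHub issues, which requires editing across files and passing unit tests.
measured in Elo rating
A dataset that tests systems' ability to solve GitHub issues automatically. It contains 2,294 Issue-Pull Request pairs from 12 popular Python repositories. ...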
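
A percentage score such as the 33% quoted above can be read as the share of issues whose unit tests all pass after the model's patch is applied. The following is a minimal sketch of that calculation under simplifying assumptions: the `InstanceResult` record, its field names, and the sample data are hypothetical, and the actual SWE-bench harness applies its own criteria, which this sketch does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of one benchmark instance (hypothetical record layout)."""
    instance_id: str
    tests_passed: int
    tests_total: int


def is_resolved(result: InstanceResult) -> bool:
    # Assumption: an issue counts as resolved only if every associated
    # unit test passes; an instance with no tests is never resolved.
    return result.tests_total > 0 and result.tests_passed == result.tests_total


def resolved_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances whose tests all pass (0.0 for an empty list)."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up sample data, for illustration only.
results = [
    InstanceResult("repo-a-1", 5, 5),  # all tests pass: resolved
    InstanceResult("repo-a-2", 3, 4),  # one failing test: not resolved
    InstanceResult("repo-b-1", 2, 2),  # resolved
    InstanceResult("repo-b-2", 0, 6),  # not resolved
]
print(f"{resolved_rate(results):.1f}% resolved")  # prints "50.0% resolved"
```

Partial progress earns nothing here: an instance with 3 of 4 tests passing scores the same as one with 0 of 6, which is one reason such scores stay well below the human baseline.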