BeyondBench Leaderboard
| Rank | Model (Param) | Easy Acc (%) | Easy Inst (%) | Easy Tokens (avg) | Medium Acc (%) | Medium Inst (%) | Medium Tokens (avg) | Hard Acc (%) | Hard Inst (%) | Hard Tokens (avg) | Overall Acc (%) | Overall Inst (%) | Overall Tokens (avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
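Each difficulty tier reports three metrics per model: accuracy (Acc), instruction-following rate (Inst), and average response length in tokens. As a rough illustration of how the Overall columns could be derived from the per-tier numbers, here is a minimal Python sketch assuming a micro-average over all evaluated problems; the `TierResult` structure and its field names are hypothetical, not BeyondBench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TierResult:
    """Per-tier tallies for one model (hypothetical structure, not the official schema)."""
    correct: int   # problems answered correctly
    followed: int  # responses that obeyed the output-format instructions
    total: int     # problems attempted in this tier
    tokens: int    # total response tokens across the tier

def overall(tiers: list[TierResult]) -> tuple[float, float, float]:
    """Micro-average Acc (%), Inst (%), and avg tokens over all problems."""
    n = sum(t.total for t in tiers)
    acc = 100.0 * sum(t.correct for t in tiers) / n
    inst = 100.0 * sum(t.followed for t in tiers) / n
    avg_tokens = sum(t.tokens for t in tiers) / n
    return acc, inst, avg_tokens

# Example: aggregate Easy / Medium / Hard tallies into the Overall columns.
print(overall([TierResult(90, 98, 100, 12_000),
               TierResult(70, 95, 100, 20_000),
               TierResult(40, 90, 100, 31_000)]))
# -> approximately (66.67, 94.33, 210.0)
```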
Benchmark-Free Evaluation of Reasoning in Language Models
BeyondBench introduces a benchmark-free approach to evaluating the reasoning capabilities of language models, replacing traditional static benchmarks with problems generated on the fly. The system dynamically generates novel instances across 44 distinct reasoning tasks with 117 variations, so models cannot recall memorized solutions and must demonstrate genuine reasoning.
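To make the idea concrete, below is a minimal sketch of one dynamically generated task in this spirit; the task type, parameter ranges, and function name are illustrative assumptions, not BeyondBench's actual generator.

```python
import random

def make_sequence_problem(rng: random.Random) -> tuple[str, int]:
    """Generate a fresh arithmetic-sequence completion problem (illustrative task).

    Because the parameters are drawn at evaluation time, each run yields
    novel instances that cannot appear verbatim in any static training set.
    """
    start = rng.randint(1, 50)
    step = rng.randint(2, 9)
    length = rng.randint(4, 6)
    seq = [start + i * step for i in range(length)]
    prompt = (f"The sequence {', '.join(map(str, seq))} continues with which number? "
              f"Answer with a single integer.")
    answer = start + length * step  # next term after the shown prefix
    return prompt, answer

rng = random.Random()  # unseeded: a new problem instance every evaluation run
prompt, answer = make_sequence_problem(rng)
print(prompt)          # e.g. "The sequence 7, 12, 17, 22 continues with which number? ..."
print("gold:", answer)
```

In a setup like this, scoring reduces to comparing the model's reply against the gold `answer` (feeding an accuracy metric), while checking that the reply is a single integer covers the format constraint (feeding an instruction-following metric).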
If you use BeyondBench in your work, please cite:

@article{srivastava2025beyondbench,
  title={BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models},
  author={Srivastava, Gaurav and Hussain, Aafiya and Bi, Zhenyu and Roy, Swastik and Pitre, Priya and Lu, Meng and Ziyadi, Morteza and Wang, Xuan},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}