Evaluating Math Reasoning & Overthinking in LLMs

LLMThinkBench is a comprehensive framework for rigorously evaluating the basic math reasoning capabilities of Language Models and for identifying overthinking: instances where models apply unnecessarily complex reasoning to simple problems.

PyPI v0.1.6 · GitHub · Preprint (arXiv)
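
To make the evaluation concrete, the sketch below shows the kind of check such a benchmark runs: pose a trivial arithmetic question many times, grade the final answer, and measure how verbose each response is. This is a minimal illustration, not LLMThinkBench's actual API; `query_model` and the `verbose_threshold` heuristic are hypothetical stand-ins for whatever inference backend and cutoff you use.

```python
# Minimal sketch of the style of check a math-reasoning benchmark performs.
# NOT LLMThinkBench's actual API: query_model is a hypothetical stand-in
# for your inference backend (an API client or a local model).
import random
import re

def query_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with a call to your inference backend."""
    raise NotImplementedError

def eval_simple_addition(n_trials: int = 100, verbose_threshold: int = 50) -> dict:
    """Grade a model on two-digit addition and flag overly long responses."""
    correct = 0
    word_counts = []
    overthought = 0
    for _ in range(n_trials):
        a, b = random.randint(10, 99), random.randint(10, 99)
        response = query_model(f"What is {a} + {b}? Give only the final answer.")
        words = len(response.split())
        word_counts.append(words)
        if re.search(rf"\b{a + b}\b", response):  # expected number appears in output
            correct += 1
        if words > verbose_threshold:  # long reply to a trivial question
            overthought += 1
    return {
        "accuracy": correct / n_trials,
        "avg_words": sum(word_counts) / n_trials,
        "overthinking_rate": overthought / n_trials,
    }
```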

🏆 Category Winners

The leaderboard highlights three category winners, populated from the live results:

🥇 Best Overall Accuracy
🥈 Least Overthinking Model
🥉 Best Instruction Following

Advanced Filters

Results can be filtered by accuracy range (0%–100%) and by minimum metric score.

The leaderboard table reports the following columns for each model: Rank, Model, Parameters, Accuracy, Efficiency Score, Instruction Following, Overthinking Ratio, Avg Tokens, Avg Words, and Avg Chars.
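
The verbosity columns (Avg Tokens, Avg Words, Avg Chars) can be reproduced directly from raw responses, as in the sketch below. The overthinking-ratio formula shown is an assumption for illustration only, since this page does not state the exact definition: it compares mean response length against the handful of tokens a bare final answer would need.

```python
# Illustrative sketch; the overthinking-ratio definition below is an
# ASSUMPTION, not LLMThinkBench's documented metric.
from statistics import mean

def verbosity_stats(responses: list[str], token_counts: list[int]) -> dict[str, float]:
    """Average response length. token_counts come from the model's own
    tokenizer, supplied by the caller, since tokenizers differ per model."""
    return {
        "avg_tokens": mean(token_counts),
        "avg_words": mean(len(r.split()) for r in responses),
        "avg_chars": mean(len(r) for r in responses),
    }

def overthinking_ratio(token_counts: list[int], minimal_answer_tokens: int = 5) -> float:
    """ASSUMED definition: mean response length relative to the few tokens a
    bare final answer needs. Values far above 1.0 indicate heavy elaboration
    on simple problems."""
    return mean(token_counts) / minimal_answer_tokens
```

For example, a model that spends 120 tokens answering a two-digit addition would score 24.0 under this assumed definition, while one that replies "The answer is 42" would score near 1.0.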

📊 Performance Insights

Interactive charts cover six views:

Accuracy vs Efficiency Trade-off
Model Size vs Performance
Overthinking vs Accuracy
Token Efficiency by Model Size
Instruction Following vs Accuracy
Top 20 Models Performance Trends

The leaderboard also supports exporting data and selecting models for side-by-side comparison.