Evaluating Math Reasoning & Overthinking in LLMs

LLMThinkBench is a comprehensive framework for rigorously evaluating the basic math reasoning capabilities of Language Models and for identifying overthinking: instances where models apply unnecessarily complex reasoning to simple problems.

PyPI v0.1.6 · GitHub · Preprint (arXiv)
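
To make the evaluation concrete, the sketch below shows the kind of check such a benchmark runs: pose a trivial arithmetic question many times, grade the final answer, and measure how verbose each response is. This is a minimal illustration, not LLMThinkBench's actual API; `query_model` and the `verbose_threshold` heuristic are hypothetical stand-ins for whatever inference backend and cutoff you use.

```python
# Minimal sketch of the style of check a math-reasoning benchmark performs.
# NOT LLMThinkBench's actual API: query_model is a hypothetical stand-in
# for your inference backend (an API client or a local model).
import random
import re

def query_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with a call to your inference backend."""
    raise NotImplementedError

def eval_simple_addition(n_trials: int = 100, verbose_threshold: int = 50) -> dict:
    """Grade a model on two-digit addition and flag overly long responses."""
    correct = 0
    word_counts = []
    overthought = 0
    for _ in range(n_trials):
        a, b = random.randint(10, 99), random.randint(10, 99)
        response = query_model(f"What is {a} + {b}? Give only the final answer.")
        words = len(response.split())
        word_counts.append(words)
        if re.search(rf"\b{a + b}\b", response):  # expected number appears in output
            correct += 1
        if words > verbose_threshold:  # long reply to a trivial question
            overthought += 1
    return {
        "accuracy": correct / n_trials,
        "avg_words": sum(word_counts) / n_trials,
        "overthinking_rate": overthought / n_trials,
    }
```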

🏆 Category Winners

The leaderboard highlights three category winners, populated from the live results:

🥇 Best Overall Accuracy
🥈 Least Overthinking Model
🥉 Best Instruction Following

Advanced Filters

Results can be filtered by accuracy range (0%–100%) and by minimum metric score.

The leaderboard table reports the following columns for each model: Rank, Model, Parameters, Accuracy, Efficiency Score, Instruction Following, Overthinking Ratio, Avg Tokens, Avg Words, and Avg Chars.
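
The verbosity columns (Avg Tokens, Avg Words, Avg Chars) can be reproduced directly from raw responses, as in the sketch below. The overthinking-ratio formula shown is an assumption for illustration only, since this page does not state the exact definition: it compares mean response length against the handful of tokens a bare final answer would need.

```python
# Illustrative sketch; the overthinking-ratio definition below is an
# ASSUMPTION, not LLMThinkBench's documented metric.
from statistics import mean

def verbosity_stats(responses: list[str], token_counts: list[int]) -> dict[str, float]:
    """Average response length. token_counts come from the model's own
    tokenizer, supplied by the caller, since tokenizers differ per model."""
    return {
        "avg_tokens": mean(token_counts),
        "avg_words": mean(len(r.split()) for r in responses),
        "avg_chars": mean(len(r) for r in responses),
    }

def overthinking_ratio(token_counts: list[int], minimal_answer_tokens: int = 5) -> float:
    """ASSUMED definition: mean response length relative to the few tokens a
    bare final answer needs. Values far above 1.0 indicate heavy elaboration
    on simple problems."""
    return mean(token_counts) / minimal_answer_tokens
```

For example, a model that spends 120 tokens answering a two-digit addition would score 24.0 under this assumed definition, while one that replies "The answer is 42" would score near 1.0.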

📊 Performance Insights

Interactive charts cover six views:

Accuracy vs Efficiency Trade-off
Model Size vs Performance
Overthinking vs Accuracy
Token Efficiency by Model Size
Instruction Following vs Accuracy
Top 20 Models Performance Trends

The leaderboard also supports exporting data and selecting models for side-by-side comparison.