Towards Reasoning Ability of Small Language Models

1Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
2Department of Physics, Clarendon Laboratory, University of Oxford, OX1 3PU, UK
3NVIDIA Corporation

Abstract

Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale (~100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability, yet there has been no systematic study of the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: Can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure robust performance assessment. Additionally, we analyze the impact of different prompting strategies on small models. Beyond accuracy, we also evaluate robustness under adversarial conditions and the quality of intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression, serving as efficient alternatives to LLMs for reasoning tasks.
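
To make the evaluation protocol concrete, below is a minimal Python sketch of scoring one SLM on GSM8K and averaging three repeated runs, assuming a Hugging Face causal LM; the model ID, 100-example subset, and last-number answer extraction are illustrative assumptions, not the paper's exact harness.

# Sketch: score a small model on GSM8K three times and average the runs.
# Assumptions: `transformers` and `datasets` are installed; the model ID and
# the final-number extraction heuristic below are illustrative only.
import re
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # assumption: any small causal LM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
gsm8k = load_dataset("gsm8k", "main", split="test")

def last_number(text):
    """Return the last number in a string, the usual GSM8K answer heuristic."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def run_once(seed):
    torch.manual_seed(seed)
    correct = 0
    for ex in gsm8k.select(range(100)):          # small subset for illustration
        prompt = ex["question"] + "\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        gold = ex["answer"].split("####")[-1].strip()
        correct += int(last_number(pred) == last_number(gold))
    return correct / 100

# Three repetitions, as in the benchmark protocol; with greedy decoding this
# mainly smooths out run-to-run nondeterminism. Report the mean.
scores = [run_once(seed) for seed in (0, 1, 2)]
print(f"GSM8K accuracy (mean of 3 runs): {sum(scores) / len(scores):.3f}")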

Small Language Models Reasoning Leaderboard

Performance of small language models across 14 reasoning benchmarks and 6 sorting tasks along with GPU memory and disk size.
The models are sorted by their overall average performance on the reasoning benchmarks.

Model Params Quantization GPU Memory (GB) Disk Size (GB) GSM8K (Direct I/O) GSM8K (COT) GSM8K (5-Shot) GSM8K (5-Shot COT) GSM8K (8-Shot) ARC-Easy ARC-Challenge CommonsenseQA Sort-8 (+ve) Sort-8 (mixed) Sort-16 (+ve) Sort-16 (mixed) Sort-32 (+ve) Sort-32 (mixed) Sorting Average Overall Average
Qwen2.5-0.5B 0.5B GPTQ 4-bit 1.12 0.45 34.62 32.85 28.15 27.52 27.80 52.58 37.63 36.42 5.33 3.33 0.33 0.00 0.00 0.00 1.50 19.04
Qwen2-0.5B 0.5B GPTQ 4-bit 1.12 0.71 21.51 25.32 14.38 16.76 14.23 52.05 37.03 43.11 2.00 0.00 0.00 0.00 0.00 0.00 0.33 22.64
Qwen2-0.5B 0.5B W: INT4 & A: INT16 1.51 0.71 25.42 27.32 18.09 18.35 16.40 50.56 36.63 42.42 5.67 0.67 0.00 0.00 0.00 0.00 1.06 24.05
Qwen2.5-0.5B 0.5B GPTQ 8-bit 0.71 0.62 46.85 47.18 42.20 44.20 42.25 61.74 44.43 46.19 12.67 3.00 0.00 0.00 0.00 0.00 2.61 24.23
Qwen2.5-0.5B 0.5B None 2.02 0.95 46.80 46.88 42.73 43.19 42.28 62.50 44.28 46.90 11.67 3.67 0.33 0.00 0.00 0.00 2.61 24.37
Qwen2-0.5B 0.5B W: INT8 & A: INT8 1.38 0.87 37.60 37.50 26.23 26.99 25.78 55.36 40.27 47.45 7.33 0.33 0.00 0.00 0.00 0.00 1.28 25.35
Qwen2-0.5B 0.5B GPTQ 8-bit 0.71 1.4 38.08 37.91 26.33 27.27 26.59 56.13 40.30 47.50 7.67 0.33 0.00 0.00 0.00 0.00 1.33 25.84
Qwen2-0.5B 0.5B W: INT8 & A: INT16 1.38 0.61 37.68 38.13 26.43 26.54 26.81 56.51 39.87 47.23 11.67 0.33 0.00 0.00 0.00 0.00 2.00 26.10
Qwen2-0.5B 0.5B None 2.02 0.95 37.25 38.31 26.38 28.46 26.76 56.41 40.44 48.13 10.33 0.00 0.00 0.00 0.00 0.00 1.72 26.25
Qwen2-0.5B 0.5B FP8 -- -- 35.20 35.94 23.17 25.25 22.52 56.61 40.13 46.76 6.33 0.67 0.00 0.00 0.00 0.00 1.17 29.26
Llama-3.2-1B 1B FP8 2.47 1.9 36.42 39.63 31.16 30.88 31.87 67.03 48.01 48.48 42.67 1.33 0.67 0.00 0.00 0.00 7.45 29.47
Llama-3.2-1B 1B None 4.73 2.4 36.39 38.99 33.69 32.73 33.13 67.23 47.50 48.38 44.67 1.33 1.00 0.00 0.00 0.00 7.83 29.67
Llama-3.2-1B 1B W: INT8 & A: INT8 1.53 1.9 36.87 39.63 32.07 30.58 32.88 67.45 47.90 48.10 50.33 1.67 1.33 0.00 0.00 0.00 8.89 30.17
Llama-3.2-1B 1B FP8-dynamic 2.47 2.0 36.21 40.86 32.93 31.24 32.83 67.02 48.69 48.40 45.67 2.00 2.67 0.00 0.00 0.00 8.39 30.29
Minitron 4B None 16.01 7.9 27.95 28.68 35.41 34.80 34.07 -- -- -- -- -- -- -- -- -- -- 32.18
SmolLM2 1.7B None 6.55 3.2 46.17 43.75 44.23 41.47 44.78 75.04 54.21 53.18 55.33 28.00 14.67 2.67 0.33 0.00 16.83 36.99
Hymba 1.5B None -- 2.9 53.75 53.53 52.87 52.99 52.74 84.57 66.78 64.73 34.67 12.00 1.00 0.00 0.00 0.00 7.95 38.10
Qwen2-1.5B 1.5B GPTQ 4-bit 1.81 2.4 56.31 57.54 49.41 52.99 49.66 82.03 63.99 68.99 33.00 13.00 3.33 0.00 0.00 0.00 8.22 42.00
Qwen2-1.5B 1.5B W: INT4 & A: INT16 3.14 1.6 57.90 58.55 48.40 53.10 48.29 81.64 63.51 66.42 43.00 17.67 6.00 0.33 0.00 0.00 11.17 42.87
Mistral-v0.3 7B W: INT4 & A: INT16 11.17 3.9 53.93 56.03 51.91 53.47 50.87 88.33 74.97 69.83 54.00 25.00 16.00 3.00 4.00 0.00 17.00 42.95
Mistral-v0.3 7B W: INT8 & A: INT8 34.84 7.1 52.11 55.60 53.88 55.75 52.79 88.65 75.97 70.52 55.33 38.67 14.67 5.00 0.67 1.00 19.22 44.04
Qwen2-1.5B 1.5B FP8 -- -- 61.97 63.38 53.88 57.27 54.28 83.77 66.33 68.93 42.33 19.33 7.00 0.00 0.00 0.00 11.44 45.21
Mistral-v0.3 7B FP8 -- -- 54.13 54.99 53.96 57.67 53.85 88.64 76.39 69.48 58.33 45.67 21.00 5.33 2.67 0.00 22.17 46.09
Qwen2-1.5B 1.5B W: INT8 & A: INT8 2.48 2.2 62.45 63.00 54.13 58.73 55.75 83.64 66.84 69.72 46.33 21.33 5.67 0.00 0.00 0.00 12.22 46.13
Mistral-v0.3 7B W: INT8 & A: INT16 14.36 7.1 54.26 55.85 54.13 56.86 52.82 89.07 76.68 70.22 62.00 45.00 24.00 5.33 2.00 0.00 23.06 46.15
Qwen2-1.5B 1.5B None 7.09 2.9 62.83 64.85 56.46 59.51 55.88 84.34 67.29 69.78 44.67 21.33 7.33 0.00 0.00 0.00 12.22 46.19
Qwen2-1.5B 1.5B GPTQ 8-bit 2.54 3.1 62.85 63.86 57.16 59.79 57.24 84.19 66.55 69.97 46.33 20.00 7.33 0.00 0.00 0.00 12.28 46.42
Qwen2-1.5B 1.5B W: INT8 & A: INT16 2.51 1.7 62.98 64.04 56.41 59.72 57.19 83.96 66.84 70.19 47.67 21.33 5.67 0.00 0.00 0.00 12.45 46.67
Mistral-v0.3 7B None 27.67 14 54.84 55.98 54.76 57.90 54.23 88.99 76.82 69.83 60.33 48.33 21.33 5.67 2.00 1.00 23.11 46.93
Qwen2.5-1.5B 1.5B GPTQ 4-bit 1.81 1.1 64.92 64.92 62.40 63.28 62.42 86.25 70.25 69.10 60.00 46.67 12.67 7.33 0.00 0.00 21.11 47.02
Qwen2.5-1.5B 1.5B None 6.68 2.9 70.00 70.20 69.72 68.46 69.90 87.58 73.81 71.85 66.33 65.33 34.33 7.33 1.33 0.00 29.11 53.22
Qwen2.5-1.5B 1.5B GPTQ 8-bit 2.54 1.7 70.33 70.33 70.03 68.99 69.52 87.78 73.72 72.10 68.33 65.33 36.67 8.00 1.33 0.00 29.94 54.10
Qwen2.5-3B 3B GPTQ 4-bit 2.88 2.0 81.78 81.60 81.58 81.91 81.78 92.12 80.86 71.96 72.67 65.67 17.67 19.67 0.00 1.00 29.45 61.29
Qwen2-7B 7B GPTQ 4-bit 6.48 5.3 85.54 86.35 85.92 84.96 85.42 93.45 85.52 78.92 80.67 72.00 33.33 23.33 4.33 0.33 35.67 64.91
Llama-3.2-3B 3B FP8 6.44 4.2 74.07 75.31 72.91 73.04 71.19 88.03 74.03 68.74 96.00 41.33 61.67 18.00 19.00 0.00 39.33 65.57
Qwen2.5-3B 3B None 12.42 5.8 84.74 84.38 85.44 84.96 85.44 93.49 83.73 76.25 78.33 75.33 47.67 34.33 2.67 1.00 39.89 65.78
Llama-3.2-3B 3B FP8-dynamic 6.44 4.2 73.49 75.13 72.71 73.34 72.05 87.53 73.58 69.75 94.00 52.33 60.33 16.00 18.00 0.00 40.11 65.80
Qwen2-7B 7B W: INT4 & A: INT16 12.96 5.3 84.53 85.57 85.32 84.91 85.19 94.22 84.95 78.98 79.67 77.00 43.00 26.67 5.33 0.00 38.61 65.85
Qwen2.5-3B 3B GPTQ 8-bit 4.21 3.3 85.17 84.99 84.38 84.38 84.71 93.55 83.53 76.77 80.33 75.00 47.67 32.67 2.00 1.00 39.78 65.87
Llama-3.2-3B 3B W: INT8 & A: INT8 3.66 4.2 72.58 75.23 73.39 73.74 72.68 87.22 74.37 69.31 95.67 49.67 62.67 17.33 15.00 0.00 40.06 66.06
Llama-3.2-3B 3B None 13.21 6.0 73.54 75.18 74.02 72.73 72.61 87.84 74.63 69.72 96.67 55.33 73.33 17.33 17.00 0.00 43.28 67.19
Qwen2-7B 7B FP8 -- -- 86.66 87.14 86.05 86.56 86.13 94.26 85.41 80.32 81.00 83.00 47.00 29.00 13.33 1.00 42.39 67.45
Qwen2-7B 7B W: INT8 & A: INT16 9.42 8.2 86.40 87.06 86.15 85.97 86.38 93.91 85.47 80.13 84.00 80.67 43.67 35.00 14.33 1.67 43.22 67.59
Qwen2-7B 7B W: INT8 & A: INT8 9.58 8.2 87.11 87.31 86.63 86.58 86.56 94.02 85.38 79.66 79.33 83.67 40.33 31.67 17.00 0.67 42.11 67.69
Qwen2-7B 7B None 30.05 15 87.14 87.34 86.58 85.82 86.40 94.21 85.52 80.54 83.33 80.33 45.00 36.33 15.00 2.67 43.78 68.35
Qwen2-7B 7B GPTQ 8-bit 9.63 8.3 87.16 87.54 86.56 86.50 86.40 94.28 85.64 80.04 84.33 82.67 44.00 32.33 14.67 3.00 43.50 68.35
Qwen2.5-7B 7B GPTQ 4-bit 6.48 5.3 90.62 91.23 90.65 90.73 90.85 95.62 89.19 82.69 80.67 15.00 58.33 15.67 31.67 1.00 33.72 68.35
Phi-3.5-mini 3.8B None 14.6 7.2 85.47 87.14 82.97 80.74 82.89 95.09 86.89 76.11 90.33 77.33 68.67 18.33 29.00 0.33 47.33 70.44
Llama-3.1 8B W: INT4 & A: INT16 12.6 5.4 82.21 83.80 82.13 80.74 81.70 90.49 76.62 73.57 82.67 66.67 69.67 52.00 56.67 6.67 55.73 74.38
Phi-3-small 7B None -- 17.95 70.10 81.73 83.14 86.02 83.62 97.12 91.38 79.85 98.00 93.33 69.00 52.00 9.33 0.67 53.72 74.69
Mistral-Nemo 12B W: INT4 & A: INT16 61.98 7.8 84.74 85.67 84.61 83.67 84.99 91.82 81.80 71.33 97.00 79.00 77.33 42.33 59.67 7.33 60.44 76.02
Mistral-Nemo 12B FP8 -- -- 87.31 86.58 85.67 85.77 85.29 92.19 83.16 73.41 95.00 78.67 77.33 50.33 48.33 9.00 59.78 76.43
Mistral-Nemo 12B None 57.89 23 86.76 86.08 85.57 84.94 85.34 92.79 83.70 72.78 95.00 81.33 78.33 54.67 49.33 6.67 60.89 77.02
Llama-3.1 8B FP8-dynamic 21.09 8.5 83.27 84.86 82.97 83.88 84.69 92.33 81.00 74.09 81.67 74.67 74.33 53.00 65.33 5.00 59.00 77.13
Llama-3.1 8B FP8 14.44 8.5 82.89 84.63 83.42 84.94 83.83 92.17 79.52 73.93 81.00 81.33 72.00 51.67 61.00 6.00 58.83 77.15
Llama-3.1 8B W: INT8 & A: INT8 8.98 8.5 83.37 85.27 83.45 84.41 83.32 92.33 79.98 73.63 82.33 77.00 70.67 58.00 62.33 4.67 59.17 77.53
Llama-3.1 8B None 30.65 15 83.45 85.27 83.45 84.51 83.50 92.07 79.58 74.28 86.00 78.67 74.67 56.33 59.67 5.33 60.11 77.93
Llama-3.1 8B W: INT8 & A: INT16 15.94 8.5 83.95 84.89 83.78 83.75 83.62 92.34 80.32 73.87 86.33 79.00 73.67 56.00 65.00 5.67 60.95 78.01
Qwen2.5-7B 7B None 30.05 15 91.76 92.19 91.05 91.89 91.33 96.03 90.53 82.66 94.33 90.00 69.67 47.00 39.33 5.67 57.67 78.36
Qwen2.5-7B 7B GPTQ 8-bit 9.63 8.3 91.84 92.22 91.81 91.56 91.31 96.03 90.64 82.58 94.00 92.00 71.33 49.00 41.33 5.67 58.89 78.87
Qwen2.5-14B 14B GPTQ 4-bit 10.65 9.4 94.74 94.69 94.01 94.31 93.63 97.57 93.17 83.10 95.00 95.67 82.33 64.00 54.33 26.00 69.56 83.53
Qwen2.5-14B 14B None 57.04 28 94.29 94.57 94.06 94.54 93.86 97.87 93.37 84.08 96.33 95.33 84.00 72.00 61.33 38.67 74.61 85.80
Qwen2.5-14B 14B GPTQ 8-bit 17.24 16 94.49 94.95 93.71 94.59 94.11 97.90 93.71 84.22 96.33 95.00 84.00 72.00 65.00 36.33 74.78 86.02
Qwen2.5-32B 32B None 125 62 95.40 95.78 95.20 95.55 94.92 98.26 95.25 87.11 99.00 99.33 93.33 92.33 79.00 60.00 87.17 91.77
Qwen2.5-32B 32B GPTQ 4-bit 52.42 19 95.73 95.73 94.92 95.43 95.12 98.09 95.19 87.06 100.00 100.00 98.33 91.67 77.33 56.33 87.28 91.80
Qwen2.5-32B 32B GPTQ 8-bit 33.81 33 95.73 95.86 95.50 95.60 95.25 98.34 95.16 86.62 99.00 99.00 93.33 92.33 79.67 61.00 87.39 91.86
Llama-3.1 70B W: INT8 & A: INT16 138.64 68 92.92 94.36 93.96 94.39 93.51 97.59 92.89 80.04 100.00 99.00 98.00 95.00 99.00 85.33 96.06 93.14
Llama-3.1 70B W: INT4 & A: INT16 107.34 38 95.15 95.20 94.82 95.12 94.90 98.26 94.51 82.77 100.00 100.00 98.67 97.00 99.67 76.33 95.28 94.09
Llama-3.1 70B None 269.17 132 95.10 95.27 94.72 94.44 94.64 98.34 94.43 83.73 100.00 100.00 99.00 97.00 100.00 88.00 97.33 94.74
Llama-3.1 70B W: INT8 & A: INT8 69.34 68 94.72 95.00 94.52 94.62 94.54 98.43 94.62 83.92 100.00 100.00 99.33 96.67 100.00 85.33 96.89 94.82
Llama-3.1 70B FP8 107.32 68 94.87 95.40 94.67 94.52 94.74 98.36 94.71 83.87 100.00 100.00 98.67 96.33 100.00 86.67 96.95 94.92
Llama-3.1 70B FP8-dynamic 176.63 68 94.64 95.38 95.00 95.10 94.52 98.46 94.54 83.70 100.00 100.00 98.67 97.67 100.00 86.00 97.06 95.04

Note: The sorting tasks involve arranging lists of numbers, with (+ve) containing only positive numbers and (mixed) containing both positive and negative numbers.
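
For concreteness, here is a rough sketch of how such a sorting probe can be constructed and scored with exact match on the returned list; the prompt wording, value range, and parsing heuristic are illustrative assumptions rather than the exact task definitions used in the leaderboard.

# Sketch: build positive-only and mixed sorting prompts of increasing length
# and score a completion by exact match against the correctly sorted list.
# The prompt wording, value range, and parsing are illustrative assumptions.
import random
import re

def make_sorting_prompt(n, mixed=False, seed=0):
    rng = random.Random(seed)
    low = -100 if mixed else 1
    nums = [rng.randint(low, 100) for _ in range(n)]
    prompt = ("Sort the following numbers in ascending order and return only "
              f"the sorted list: {nums}")
    return prompt, sorted(nums)

def score_sorting(model_output, gold):
    """Exact match: every number must appear, in order, in the model output."""
    pred = [int(x) for x in re.findall(r"-?\d+", model_output)]
    return int(pred == gold)

# Example: 8-, 16-, and 32-element lists, positive-only and mixed.
for n in (8, 16, 32):
    for mixed in (False, True):
        prompt, gold = make_sorting_prompt(n, mixed=mixed)
        # completion = generate(prompt)   # plug in any SLM here
        completion = str(gold)            # stand-in so the sketch runs end to end
        label = "mixed" if mixed else "+ve"
        print(f"Sort-{n} ({label}): {score_sorting(completion, gold)}")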

Key Findings

Small ≠ Weak: 32B Qwen rivals GPT-4 Turbo!

Qwen2.5-32B matches GPT-4 Turbo on intermediate reasoning (MR-GSM8K: 55.6 vs. 53.0) and reaches 95% on GSM8K while using roughly one-fifth of the parameters, overturning the “>100B parameters for reasoning” myth.

Quantization is (almost) free!

4- to 8-bit GPTQ quantization cuts GPU memory by up to 75% yet preserves ≥99% of accuracy across GSM8K, ARC, CommonsenseQA, and robustness benchmarks, enabling laptop-scale deployment of formerly heavyweight models.
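
As a rough illustration of that trade-off, the sketch below loads a full-precision checkpoint and its GPTQ 4-bit counterpart and compares peak GPU memory during a short generation; the Qwen2.5 model IDs are used as plausible examples, and the measurement loop is illustrative rather than the paper's benchmarking code.

# Sketch: compare peak GPU memory for a full-precision model vs. its GPTQ
# 4-bit checkpoint. Model IDs assume the publicly released Qwen2.5 variants;
# loading GPTQ weights additionally requires a GPTQ backend (e.g. auto-gptq)
# to be installed alongside transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def peak_memory_gb(model_id, prompt="What is 17 * 24?"):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=64, do_sample=False)
    gb = torch.cuda.max_memory_allocated() / 1024**3
    del model
    torch.cuda.empty_cache()
    return gb

for model_id in ("Qwen/Qwen2.5-7B-Instruct",
                 "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"):
    print(f"{model_id}: {peak_memory_gb(model_id):.1f} GB peak")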

Keep prompts simple!

On GSM8K, direct I/O prompts match or outperform Chain-of-Thought and few-shot variants; additional “think step by step” instructions often confuse SLMs rather than help them.
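
The prompt variants compared above roughly correspond to templates like the ones in this sketch; the exact instruction wording and exemplars used in the benchmark may differ, so treat these strings as illustrative.

# Sketch: the prompting strategies compared on GSM8K, as illustrative templates.
# The exact instruction wording in the benchmark may differ from these strings.

def direct_io(question):
    # Direct input/output: just the question, no reasoning instruction.
    return f"Question: {question}\nAnswer:"

def zero_shot_cot(question):
    # Chain-of-Thought: ask the model to reason step by step before answering.
    return f"Question: {question}\nLet's think step by step.\nAnswer:"

def few_shot(question, exemplars):
    # k-shot: prepend k solved (question, answer) pairs before the target question.
    demo = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in exemplars)
    return f"{demo}\n\nQuestion: {question}\nAnswer:"

exemplars = [("A pen costs $2. How much do 3 pens cost?", "3 * 2 = 6. The answer is 6.")]
q = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether?"
print(direct_io(q))
print(zero_shot_cot(q))
print(few_shot(q, exemplars))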

Sequence length is the Achilles’ heel!

Accuracy on sorting drops from >80% (8-item lists) to <40% (32-item mixed lists), and negative numbers exacerbate errors, revealing a context-length bottleneck for algorithmic reasoning.

Robustness scales with size—but survives quantization!

Larger SLMs (32B, 70B) are the most resilient to adversarial GSM-Plus inputs, and their quantized versions show negligible degradation, whereas pruned counterparts collapse.
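
A minimal sketch of how such an adversarial check can be grouped by perturbation type is shown below, assuming the GSM-Plus data is available on the Hugging Face hub; the dataset ID, split, and field names are assumptions to be verified against the actual release, and the model call is a stand-in stub.

# Sketch: group accuracy on adversarial GSM-Plus variants by perturbation type.
# Assumptions: the dataset ID, split, and field names ("question", "answer",
# "perturbation_type") follow the public GSM-Plus release; verify against the
# actual schema before relying on this.
import re
from collections import defaultdict
from datasets import load_dataset

def last_number(text):
    """Illustrative numeric normalizer: take the last number in a string."""
    nums = re.findall(r"-?\d+\.?\d*", str(text).replace(",", ""))
    return nums[-1] if nums else None

def answer_with_slm(question):
    """Stand-in for a real model call; replace with any SLM's generate()."""
    return "0"

gsm_plus = load_dataset("qintongli/GSM-Plus", split="test")  # assumed dataset ID

per_type = defaultdict(lambda: [0, 0])       # perturbation type -> [correct, total]
for ex in gsm_plus:
    ok = last_number(answer_with_slm(ex["question"])) == last_number(ex["answer"])
    per_type[ex["perturbation_type"]][0] += int(ok)
    per_type[ex["perturbation_type"]][1] += 1

for ptype, (correct, total) in sorted(per_type.items()):
    print(f"{ptype}: {correct / total:.3f}")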

Pruning hurts, sometimes fatally!

Weight-pruned 8B models lose 30–50 points on reasoning tasks and score ~0 on MR-GSM8K, showing that aggressive sparsification cripples logical consistency even after recovery fine-tuning.

BibTeX

@article{srivastava2025towards,
  title={Towards reasoning ability of small language models},
  author={Srivastava, Gaurav and Cao, Shuxiang and Wang, Xuan},
  journal={arXiv preprint arXiv:2502.11569},
  year={2025}
}