ThinkSLM: Towards Reasoning in Small Language Models

Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Department of Physics, Clarendon Laboratory, University of Oxford, OX1 3PU, UK
NVIDIA Corporation
EMNLP 2025 Main Conference

Abstract

Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale (~100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability, yet there has been no systematic study of the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure a robust performance assessment. Additionally, we analyze the impact of different prompting strategies on small models. Beyond accuracy, we also evaluate model robustness under adversarial conditions and the quality of intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities, developed through structured training or post-training compression, serve as efficient alternatives to LLMs on reasoning tasks.

Small Language Models Reasoning Leaderboard

Performance of small language models across 14 reasoning benchmarks and 6 sorting tasks, along with GPU memory and disk size.
The models are sorted by their overall average performance on the reasoning benchmarks.

Model Params Quantization GPU Memory (GB) Disk Size (GB) Overall Average GSM8K (Direct I/O) GSM8K (COT) GSM8K (5-Shot) GSM8K (5-Shot COT) GSM8K (8-Shot) ARC-Easy ARC-Challenge CommonsenseQA Sort-8 (+ve) Sort-8 (mixed) Sort-16 (+ve) Sort-16 (mixed) Sort-32 (+ve) Sort-32 (mixed) Sorting Average
Llama-3.2-1B 1B None 4.73 2.4 36.39 38.99 33.69 32.73 33.13 67.23 47.50 48.38 44.67 1.33 1.00 0.00 0.00 0.00 7.83 29.67
Llama-3.2-1B 1B W: INT8 & A: INT8 1.53 1.9 36.87 39.63 32.07 30.58 32.88 67.45 47.90 48.10 50.33 1.67 1.33 0.00 0.00 0.00 8.89 30.17
Llama-3.2-1B 1B FP8 2.47 1.9 36.42 39.63 31.16 30.88 31.87 67.03 48.01 48.48 42.67 1.33 0.67 0.00 0.00 0.00 7.45 29.47
Llama-3.2-1B 1B FP8-dynamic 2.47 2.0 36.21 40.86 32.93 31.24 32.83 67.02 48.69 48.40 45.67 2.00 2.67 0.00 0.00 0.00 8.39 30.29
Llama-3.2-3B 3B None 13.21 6.0 73.54 75.18 74.02 72.73 72.61 87.84 74.63 69.72 96.67 55.33 73.33 17.33 17.00 0.00 43.28 67.19
Llama-3.2-3B 3B W: INT8 & A: INT8 3.66 4.2 72.58 75.23 73.39 73.74 72.68 87.22 74.37 69.31 95.67 49.67 62.67 17.33 15.00 0.00 40.06 66.06
Llama-3.2-3B 3B FP8 6.44 4.2 74.07 75.31 72.91 73.04 71.19 88.03 74.03 68.74 96.00 41.33 61.67 18.00 19.00 0.00 39.33 65.57
Llama-3.2-3B 3B FP8-dynamic 6.44 4.2 73.49 75.13 72.71 73.34 72.05 87.53 73.58 69.75 94.00 52.33 60.33 16.00 18.00 0.00 40.11 65.80
Llama-3.1 8B None 30.65 15 83.45 85.27 83.45 84.51 83.50 92.07 79.58 74.28 86.00 78.67 74.67 56.33 59.67 5.33 60.11 77.93
Llama-3.1 8B W: INT8 & A: INT8 8.98 8.5 83.37 85.27 83.45 84.41 83.32 92.33 79.98 73.63 82.33 77.00 70.67 58.00 62.33 4.67 59.17 77.53
Llama-3.1 8B W: INT8 & A: INT16 15.94 8.5 83.95 84.89 83.78 83.75 83.62 92.34 80.32 73.87 86.33 79.00 73.67 56.00 65.00 5.67 60.95 78.01
Llama-3.1 8B W: INT4 & A: INT16 12.6 5.4 82.21 83.80 82.13 80.74 81.70 90.49 76.62 73.57 82.67 66.67 69.67 52.00 56.67 6.67 55.73 74.38
Llama-3.1 8B FP8 14.44 8.5 82.89 84.63 83.42 84.94 83.83 92.17 79.52 73.93 81.00 81.33 72.00 51.67 61.00 6.00 58.83 77.15
Llama-3.1 8B FP8-dynamic 21.09 8.5 83.27 84.86 82.97 83.88 84.69 92.33 81.00 74.09 81.67 74.67 74.33 53.00 65.33 5.00 59.00 77.13
Llama-3.1 70B None 269.17 132 95.10 95.27 94.72 94.44 94.64 98.34 94.43 83.73 100.00 100.00 99.00 97.00 100.00 88.00 97.33 94.74
Llama-3.1 70B W: INT8 & A: INT8 69.34 68 94.72 95.00 94.52 94.62 94.54 98.43 94.62 83.92 100.00 100.00 99.33 96.67 100.00 85.33 96.89 94.82
Llama-3.1 70B W: INT8 & A: INT16 138.64 68 92.92 94.36 93.96 94.39 93.51 97.59 92.89 80.04 100.00 99.00 98.00 95.00 99.00 85.33 96.06 93.14
Llama-3.1 70B W: INT4 & A: INT16 107.34 38 95.15 95.20 94.82 95.12 94.90 98.26 94.51 82.77 100.00 100.00 98.67 97.00 99.67 76.33 95.28 94.09
Llama-3.1 70B FP8 107.32 68 94.87 95.40 94.67 94.52 94.74 98.36 94.71 83.87 100.00 100.00 98.67 96.33 100.00 86.67 96.95 94.92
Llama-3.1 70B FP8-dynamic 176.63 68 94.64 95.38 95.00 95.10 94.52 98.46 94.54 83.70 100.00 100.00 98.67 97.67 100.00 86.00 97.06 95.04
Mistral-v0.3 7B None 27.67 14 54.84 55.98 54.76 57.90 54.23 88.99 76.82 69.83 60.33 48.33 21.33 5.67 2.00 1.00 23.11 46.93
Mistral-v0.3 7B W: INT8 & A: INT8 34.84 7.1 52.11 55.60 53.88 55.75 52.79 88.65 75.97 70.52 55.33 38.67 14.67 5.00 0.67 1.00 19.22 44.04
Mistral-v0.3 7B W: INT8 & A: INT16 14.36 7.1 54.26 55.85 54.13 56.86 52.82 89.07 76.68 70.22 62.00 45.00 24.00 5.33 2.00 0.00 23.06 46.15
Mistral-v0.3 7B W: INT4 & A: INT16 11.17 3.9 53.93 56.03 51.91 53.47 50.87 88.33 74.97 69.83 54.00 25.00 16.00 3.00 4.00 0.00 17.00 42.95
Mistral-v0.3 7B FP8 -- -- 54.13 54.99 53.96 57.67 53.85 88.64 76.39 69.48 58.33 45.67 21.00 5.33 2.67 0.00 22.17 46.09
Mistral-Nemo 12B None 57.89 23 86.76 86.08 85.57 84.94 85.34 92.79 83.70 72.78 95.00 81.33 78.33 54.67 49.33 6.67 60.89 77.02
Mistral-Nemo 12B W: INT4 & A: INT16 61.98 7.8 84.74 85.67 84.61 83.67 84.99 91.82 81.80 71.33 97.00 79.00 77.33 42.33 59.67 7.33 60.44 76.02
Mistral-Nemo 12B FP8 -- -- 87.31 86.58 85.67 85.77 85.29 92.19 83.16 73.41 95.00 78.67 77.33 50.33 48.33 9.00 59.78 76.43
SmolLM2 1.7B None 6.55 3.2 46.17 43.75 44.23 41.47 44.78 75.04 54.21 53.18 55.33 28.00 14.67 2.67 0.33 0.00 16.83 36.99
Minitron 4B None 16.01 7.9 27.95 28.68 35.41 34.80 34.07 -- -- -- -- -- -- -- -- -- -- 32.18
Hymba 1.5B None -- 2.9 53.75 53.53 52.87 52.99 52.74 84.57 66.78 64.73 34.67 12.00 1.00 0.00 0.00 0.00 7.95 38.10
Phi-3.5-mini 3.8B None 14.6 7.2 85.47 87.14 82.97 80.74 82.89 95.09 86.89 76.11 90.33 77.33 68.67 18.33 29.00 0.33 47.33 70.44
Phi-3-small 7B None -- 17.95 70.10 81.73 83.14 86.02 83.62 97.12 91.38 79.85 98.00 93.33 69.00 52.00 9.33 0.67 53.72 74.69
Qwen2-0.5B 0.5B None 2.02 0.95 37.25 38.31 26.38 28.46 26.76 56.41 40.44 48.13 10.33 0.00 0.00 0.00 0.00 0.00 1.72 26.25
Qwen2-0.5B 0.5B GPTQ 8-bit 0.71 1.4 38.08 37.91 26.33 27.27 26.59 56.13 40.30 47.50 7.67 0.33 0.00 0.00 0.00 0.00 1.33 25.84
Qwen2-0.5B 0.5B GPTQ 4-bit 1.12 0.71 21.51 25.32 14.38 16.76 14.23 52.05 37.03 43.11 2.00 0.00 0.00 0.00 0.00 0.00 0.33 22.64
Qwen2-0.5B 0.5B W: INT8 & A: INT16 1.38 0.61 37.68 38.13 26.43 26.54 26.81 56.51 39.87 47.23 11.67 0.33 0.00 0.00 0.00 0.00 2.00 26.10
Qwen2-0.5B 0.5B W: INT8 & A: INT8 1.38 0.87 37.60 37.50 26.23 26.99 25.78 55.36 40.27 47.45 7.33 0.33 0.00 0.00 0.00 0.00 1.28 25.35
Qwen2-0.5B 0.5B W: INT4 & A: INT16 1.51 0.71 25.42 27.32 18.09 18.35 16.40 50.56 36.63 42.42 5.67 0.67 0.00 0.00 0.00 0.00 1.06 24.05
Qwen2-0.5B 0.5B FP8 -- -- 35.20 35.94 23.17 25.25 22.52 56.61 40.13 46.76 6.33 0.67 0.00 0.00 0.00 0.00 1.17 29.26
Qwen2-1.5B 1.5B None 7.09 2.9 62.83 64.85 56.46 59.51 55.88 84.34 67.29 69.78 44.67 21.33 7.33 0.00 0.00 0.00 12.22 46.19
Qwen2-1.5B 1.5B GPTQ 8-bit 2.54 3.1 62.85 63.86 57.16 59.79 57.24 84.19 66.55 69.97 46.33 20.00 7.33 0.00 0.00 0.00 12.28 46.42
Qwen2-1.5B 1.5B GPTQ 4-bit 1.81 2.4 56.31 57.54 49.41 52.99 49.66 82.03 63.99 68.99 33.00 13.00 3.33 0.00 0.00 0.00 8.22 42.00
Qwen2-1.5B 1.5B W: INT8 & A: INT16 2.51 1.7 62.98 64.04 56.41 59.72 57.19 83.96 66.84 70.19 47.67 21.33 5.67 0.00 0.00 0.00 12.45 46.67
Qwen2-1.5B 1.5B W: INT8 & A: INT8 2.48 2.2 62.45 63.00 54.13 58.73 55.75 83.64 66.84 69.72 46.33 21.33 5.67 0.00 0.00 0.00 12.22 46.13
Qwen2-1.5B 1.5B W: INT4 & A: INT16 3.14 1.6 57.90 58.55 48.40 53.10 48.29 81.64 63.51 66.42 43.00 17.67 6.00 0.33 0.00 0.00 11.17 42.87
Qwen2-1.5B 1.5B FP8 -- -- 61.97 63.38 53.88 57.27 54.28 83.77 66.33 68.93 42.33 19.33 7.00 0.00 0.00 0.00 11.44 45.21
Qwen2-7B 7B None 30.05 15 87.14 87.34 86.58 85.82 86.40 94.21 85.52 80.54 83.33 80.33 45.00 36.33 15.00 2.67 43.78 68.35
Qwen2-7B 7B GPTQ 8-bit 9.63 8.3 87.16 87.54 86.56 86.50 86.40 94.28 85.64 80.04 84.33 82.67 44.00 32.33 14.67 3.00 43.50 68.35
Qwen2-7B 7B GPTQ 4-bit 6.48 5.3 85.54 86.35 85.92 84.96 85.42 93.45 85.52 78.92 80.67 72.00 33.33 23.33 4.33 0.33 35.67 64.91
Qwen2-7B 7B W: INT8 & A: INT16 9.42 8.2 86.40 87.06 86.15 85.97 86.38 93.91 85.47 80.13 84.00 80.67 43.67 35.00 14.33 1.67 43.22 67.59
Qwen2-7B 7B W: INT8 & A: INT8 9.58 8.2 87.11 87.31 86.63 86.58 86.56 94.02 85.38 79.66 79.33 83.67 40.33 31.67 17.00 0.67 42.11 67.69
Qwen2-7B 7B W: INT4 & A: INT16 12.96 5.3 84.53 85.57 85.32 84.91 85.19 94.22 84.95 78.98 79.67 77.00 43.00 26.67 5.33 0.00 38.61 65.85
Qwen2-7B 7B FP8 -- -- 86.66 87.14 86.05 86.56 86.13 94.26 85.41 80.32 81.00 83.00 47.00 29.00 13.33 1.00 42.39 67.45
Qwen2.5-0.5B 0.5B None 2.02 0.95 46.80 46.88 42.73 43.19 42.28 62.50 44.28 46.90 11.67 3.67 0.33 0.00 0.00 0.00 2.61 24.37
Qwen2.5-0.5B 0.5B GPTQ 8-bit 0.71 0.62 46.85 47.18 42.20 44.20 42.25 61.74 44.43 46.19 12.67 3.00 0.00 0.00 0.00 0.00 2.61 24.23
Qwen2.5-0.5B 0.5B GPTQ 4-bit 1.12 0.45 34.62 32.85 28.15 27.52 27.80 52.58 37.63 36.42 5.33 3.33 0.33 0.00 0.00 0.00 1.50 19.04
Qwen2.5-1.5B 1.5B None 6.68 2.9 70.00 70.20 69.72 68.46 69.90 87.58 73.81 71.85 66.33 65.33 34.33 7.33 1.33 0.00 29.11 53.22
Qwen2.5-1.5B 1.5B GPTQ 8-bit 2.54 1.7 70.33 70.33 70.03 68.99 69.52 87.78 73.72 72.10 68.33 65.33 36.67 8.00 1.33 0.00 29.94 54.10
Qwen2.5-1.5B 1.5B GPTQ 4-bit 1.81 1.1 64.92 64.92 62.40 63.28 62.42 86.25 70.25 69.10 60.00 46.67 12.67 7.33 0.00 0.00 21.11 47.02
Qwen2.5-3B 3B None 12.42 5.8 84.74 84.38 85.44 84.96 85.44 93.49 83.73 76.25 78.33 75.33 47.67 34.33 2.67 1.00 39.89 65.78
Qwen2.5-3B 3B GPTQ 8-bit 4.21 3.3 85.17 84.99 84.38 84.38 84.71 93.55 83.53 76.77 80.33 75.00 47.67 32.67 2.00 1.00 39.78 65.87
Qwen2.5-3B 3B GPTQ 4-bit 2.88 2.0 81.78 81.60 81.58 81.91 81.78 92.12 80.86 71.96 72.67 65.67 17.67 19.67 0.00 1.00 29.45 61.29
Qwen2.5-7B 7B None 30.05 15 91.76 92.19 91.05 91.89 91.33 96.03 90.53 82.66 94.33 90.00 69.67 47.00 39.33 5.67 57.67 78.36
Qwen2.5-7B 7B GPTQ 8-bit 9.63 8.3 91.84 92.22 91.81 91.56 91.31 96.03 90.64 82.58 94.00 92.00 71.33 49.00 41.33 5.67 58.89 78.87
Qwen2.5-7B 7B GPTQ 4-bit 6.48 5.3 90.62 91.23 90.65 90.73 90.85 95.62 89.19 82.69 80.67 15.00 58.33 15.67 31.67 1.00 33.72 68.35
Qwen2.5-14B 14B None 57.04 28 94.29 94.57 94.06 94.54 93.86 97.87 93.37 84.08 96.33 95.33 84.00 72.00 61.33 38.67 74.61 85.80
Qwen2.5-14B 14B GPTQ 8-bit 17.24 16 94.49 94.95 93.71 94.59 94.11 97.90 93.71 84.22 96.33 95.00 84.00 72.00 65.00 36.33 74.78 86.02
Qwen2.5-14B 14B GPTQ 4-bit 10.65 9.4 94.74 94.69 94.01 94.31 93.63 97.57 93.17 83.10 95.00 95.67 82.33 64.00 54.33 26.00 69.56 83.53
Qwen2.5-32B 32B None 125 62 95.40 95.78 95.20 95.55 94.92 98.26 95.25 87.11 99.00 99.33 93.33 92.33 79.00 60.00 87.17 91.77
Qwen2.5-32B 32B GPTQ 8-bit 33.81 33 95.73 95.86 95.50 95.60 95.25 98.34 95.16 86.62 99.00 99.00 93.33 92.33 79.67 61.00 87.39 91.86
Qwen2.5-32B 32B GPTQ 4-bit 52.42 19 95.73 95.73 94.92 95.43 95.12 98.09 95.19 87.06 100.00 100.00 98.33 91.67 77.33 56.33 87.28 91.80

Note: The sorting tasks involve arranging lists of numbers, with (+ve) containing only positive numbers and (mixed) containing both positive and negative numbers.
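
To make the sorting setup concrete, the sketch below shows how such prompts and exact-match scoring could be constructed. This is an illustrative reconstruction, not the paper's generation script: the ascending-order instruction, value ranges, and prompt wording are assumptions.

import random
import re

def make_sorting_task(length, mixed, seed=0):
    """Build one sorting prompt and its reference answer.
    mixed=False mirrors the (+ve) setting (positive integers only);
    mixed=True also draws negative integers. Ranges are illustrative."""
    rng = random.Random(seed)
    low = -100 if mixed else 1
    numbers = [rng.randint(low, 100) for _ in range(length)]
    prompt = ("Sort the following numbers in ascending order and return only "
              f"the sorted list: {numbers}")
    return prompt, sorted(numbers)

def exact_match(model_output, reference):
    """Count a completion as correct only if every number matches in order.
    Assumes the completion contains only the sorted list."""
    predicted = [int(tok) for tok in re.findall(r"-?\d+", model_output)]
    return predicted == reference

# Example: one 8-item mixed list, as in the Sort-8 (mixed) column.
prompt, reference = make_sorting_task(length=8, mixed=True)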

Key Findings

Breakthrough insights that challenge conventional wisdom about small language models and their reasoning capabilities

Small ≠ Weak

32B Qwen Rivals GPT-4 Turbo

Qwen2.5-32B matches GPT-4-Turbo on intermediate reasoning (MR-GSM8K: 55.6 vs. 53.0) and reaches 95% on GSM8K while using roughly one-fifth the parameters, overturning the "> 100B for reasoning" myth.

Quantization is Free

75% Memory Reduction

4- to 8-bit GPTQ quantization cuts GPU memory by up to 75% while preserving ≥ 99% of accuracy on GSM8K, ARC, CommonsenseQA, and robustness benchmarks, enabling laptop-scale deployment of formerly heavyweight models.
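
As a rough illustration of that memory saving, here is a minimal sketch (not the paper's evaluation harness) of loading a full-precision checkpoint and a GPTQ-quantized counterpart with Hugging Face transformers and comparing peak GPU memory. The checkpoint names are illustrative, and the GPTQ load assumes a GPTQ backend (e.g., optimum with auto-gptq or gptqmodel) is installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def peak_memory_gb(model_id):
    """Load a model, run one short generation, and report peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("Question: What is 17 * 24?\nAnswer:",
                       return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=32)
    peak = torch.cuda.max_memory_allocated() / 1024 ** 3
    del model
    torch.cuda.empty_cache()
    return peak

# Illustrative checkpoint names; substitute any full-precision / GPTQ pair
# from the leaderboard above.
print(peak_memory_gb("Qwen/Qwen2.5-7B"))            # full precision
print(peak_memory_gb("Qwen/Qwen2.5-7B-GPTQ-Int8"))  # 8-bit GPTQ (illustrative ID)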

Keep Prompts Simple

Direct I/O Wins

On GSM8K, direct I/O prompts match or outperform Chain-of-Thought and multi-shot variants; additional "think step-by-step" instructions often confuse SLMs rather than helping them.
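
For clarity, the two settings contrasted here look roughly like the following. These are illustrative templates, not the paper's exact prompts; the question is taken from GSM8K.

# Illustrative GSM8K prompt templates for direct I/O vs. zero-shot CoT.
question = ("Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia sell "
            "altogether in April and May?")

# Direct input/output: ask for the answer with no reasoning instructions.
direct_io_prompt = f"Question: {question}\nAnswer:"

# Zero-shot chain-of-thought: add an explicit "think step by step" cue,
# which the finding above suggests often hurts rather than helps SLMs.
cot_prompt = f"Question: {question}\nLet's think step by step.\nAnswer:"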

Sequence Length Limitation

80% → 40% Drop

Accuracy on sorting drops from > 80% (8-item lists) to < 40% (32-item mixed lists), and negative numbers exacerbate errors, revealing a sequence-length bottleneck for algorithmic reasoning.

Robustness Scales

Size Matters for Defense

Larger SLMs (32B, 70B) are the most resilient to adversarial GSM-Plus inputs, and their quantized versions show negligible degradation, whereas pruned counterparts collapse.

Pruning Hurts

30-50 Point Loss

Weight-pruned 8B models lose 30-50 points on reasoning tasks and score ~0 on MR-GSM8K, showing that aggressive sparsification cripples logical consistency even after recovery fine-tuning.

BibTeX

@article{srivastava2025towards,
  title={Towards reasoning ability of small language models},
  author={Srivastava, Gaurav and Cao, Shuxiang and Wang, Xuan},
  journal={arXiv preprint arXiv:2502.11569},
  year={2025}
}