Small ≠ Weak — 32 B Qwen rivals GPT‑4 Turbo!
Qwen 2.5‑32B matches GPT‑4‑Turbo on meta‑reasoning (MR‑GSM8K: 55.6 vs 53.0) and reaches 95 % on GSM8K with roughly one‑fifth the parameters, overturning the “> 100 B for reasoning” myth.
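For context, GSM8K accuracy is typically scored by extracting the final numeric answer from each completion and exact‑matching it against the gold answer. The sketch below illustrates that convention in plain Python; it is an illustration of the metric, not the authors' evaluation harness.

```python
import re

def extract_final_number(text: str) -> float | None:
    """Return the last number in the text (GSM8K convention:
    the final numeric value is treated as the answer)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def gsm8k_accuracy(completions: list[str], gold_answers: list[str]) -> float:
    """Exact-match accuracy on the extracted final numbers."""
    hits = sum(
        extract_final_number(c) == extract_final_number(g)
        for c, g in zip(completions, gold_answers)
    )
    return hits / len(gold_answers)

# Two toy completions scored against gold answers -> 0.5
print(gsm8k_accuracy(["... so the total is 42.", "The answer is 17"], ["42", "18"]))
```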
Quantization is (almost) free!
4‑ to 8‑bit GPTQ quantization cuts GPU memory by up to 75 % yet preserves ≥ 99 % of accuracy across GSM8K, ARC, CommonsenseQA (CQA), and robustness benchmarks, enabling laptop‑scale deployment of formerly heavyweight models.
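The 75 % figure follows directly from bit‑width arithmetic: 4‑bit weights take a quarter of the space of 16‑bit weights. A back‑of‑the‑envelope sketch (weights only, ignoring KV cache, activations, and quantization scale/zero‑point overhead):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(32e9, 16)  # ~64 GB for a 32 B model
int4 = weight_memory_gb(32e9, 4)   # ~16 GB after 4-bit GPTQ
print(f"fp16: {fp16:.0f} GB, int4: {int4:.0f} GB, saving: {1 - int4 / fp16:.0%}")
```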
Keep prompts simple!
On GSM8K, direct input–output (I/O) prompts match or outperform Chain‑of‑Thought and multi‑shot variants; additional “think step by step” instructions often confuse SLMs rather than help them.
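To make the prompt styles concrete, here is a minimal sketch of the three variants being compared; the wording and the example questions are illustrative, not the exact templates from the study.

```python
question = ("Q: A bakery bakes 7 trays of 12 muffins each. "
            "How many muffins is that?\nA:")

# Direct I/O: ask for the answer with no extra instructions.
direct_prompt = question

# Chain-of-Thought: append an explicit reasoning cue.
cot_prompt = question + " Let's think step by step."

# Multi-shot: prepend solved examples before the target question.
multishot_prompt = (
    "Q: Tom has 3 boxes of 5 pens. How many pens?\nA: 15\n\n" + question
)
```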
Sequence length is the Achilles’ heel!
Accuracy on sorting falls from > 80 % (8‑item lists) to < 40 % (32‑item mixed lists), and negative numbers exacerbate errors, revealing a context‑length bottleneck for algorithmic reasoning.
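A minimal harness for reproducing this kind of probe is sketched below; `query_model` stands in for whatever inference call you use and is not part of the original setup.

```python
import random

def make_sorting_task(n_items: int, allow_negative: bool, seed: int = 0):
    """Build a sorting prompt over random integers and its gold answer."""
    rng = random.Random(seed)
    low = -99 if allow_negative else 0
    items = [rng.randint(low, 99) for _ in range(n_items)]
    prompt = ("Sort these numbers in ascending order, comma-separated: "
              + ", ".join(map(str, items)))
    return prompt, sorted(items)

def is_correct(model_output: str, expected: list[int]) -> bool:
    """Parse a comma-separated reply and compare it to the true sort."""
    try:
        predicted = [int(x) for x in model_output.replace(" ", "").split(",")]
    except ValueError:
        return False
    return predicted == expected

prompt, expected = make_sorting_task(n_items=32, allow_negative=True)
# answer = query_model(prompt)            # placeholder inference call
# print(is_correct(answer, expected))
```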
Robustness scales with size—but survives quantization!
Larger SLMs (32 B, 70 B) remain the most resilient to adversarial GSM‑Plus inputs, and their quantized versions show negligible degradation, whereas pruned counterparts collapse.
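GSM‑Plus perturbs GSM8K problems while keeping the underlying reasoning intact, for example by substituting one number for another. The snippet below illustrates that kind of perturbation on a paraphrased GSM8K‑style problem; it is an illustration of the idea, not the GSM‑Plus generation code.

```python
import re

def substitute_number(problem: str, old: str, new: str) -> str:
    """Swap one numeric value for another, leaving the reasoning
    structure of the problem unchanged (numerical-substitution style)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, problem, count=1)

original = ("A farm's hens lay 16 eggs per day. The farmer eats 3 and sells "
            "the rest at $2 per egg. How much does the farmer earn daily?")
perturbed = substitute_number(original, "16", "23")
# A robust model should still solve the perturbed version: (23 - 3) * 2 = 40.
```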
Pruning hurts, sometimes fatally!
Weight‑pruned 8 B models lose 30–50 points on reasoning tasks and score ~0 on MR‑GSM8K, showing that aggressive sparsification cripples logical consistency even after recovery fine‑tuning.
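For reference, unstructured magnitude pruning of the kind discussed here can be reproduced with PyTorch's built‑in pruning utilities; the toy layer below is a stand‑in for a single transformer projection, not the paper's exact pipeline.

```python
import torch
import torch.nn.utils.prune as prune

# Toy stand-in for one transformer projection layer.
layer = torch.nn.Linear(4096, 4096)

# Zero out the 50 % of weights with the smallest absolute magnitude
# (unstructured L1 pruning, i.e. aggressive sparsification).
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity after pruning: {sparsity:.0%}")  # ~50%
```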