ThinkSLM: Towards Reasoning in Small Language Models

Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Department of Physics, Clarendon Laboratory, University of Oxford, OX1 3PU, UK
NVIDIA Corporation
EMNLP 2025 Main Conference

Abstract

Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale (~100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability, yet there has been no systematic study of the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure a robust performance assessment. Additionally, we analyze the impact of different prompting strategies on small models. Beyond accuracy, we also evaluate model robustness under adversarial conditions and the quality of intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities, developed through structured training or post-training compression, serve as efficient alternatives to LLMs on reasoning tasks.

Small Language Models Reasoning Leaderboard

Performance of small language models across 14 reasoning benchmarks and 6 sorting tasks, along with GPU memory and disk size.
The models are sorted by their overall average performance on the reasoning benchmarks.

Model Params Quantization GPU Memory (GB) Disk Size (GB) Overall Average GSM8K (Direct I/O) GSM8K (COT) GSM8K (5-Shot) GSM8K (5-Shot COT) GSM8K (8-Shot) ARC-Easy ARC-Challenge CommonsenseQA Sort-8 (+ve) Sort-8 (mixed) Sort-16 (+ve) Sort-16 (mixed) Sort-32 (+ve) Sort-32 (mixed) Sorting Average
Llama-3.2-1B 1B None 4.73 2.4 36.39 38.99 33.69 32.73 33.13 67.23 47.50 48.38 44.67 1.33 1.00 0.00 0.00 0.00 7.83 29.67
Llama-3.2-1B 1B W: INT8 & A: INT8 1.53 1.9 36.87 39.63 32.07 30.58 32.88 67.45 47.90 48.10 50.33 1.67 1.33 0.00 0.00 0.00 8.89 30.17
Llama-3.2-1B 1B FP8 2.47 1.9 36.42 39.63 31.16 30.88 31.87 67.03 48.01 48.48 42.67 1.33 0.67 0.00 0.00 0.00 7.45 29.47
Llama-3.2-1B 1B FP8-dynamic 2.47 2.0 36.21 40.86 32.93 31.24 32.83 67.02 48.69 48.40 45.67 2.00 2.67 0.00 0.00 0.00 8.39 30.29
Llama-3.2-3B 3B None 13.21 6.0 73.54 75.18 74.02 72.73 72.61 87.84 74.63 69.72 96.67 55.33 73.33 17.33 17.00 0.00 43.28 67.19
Llama-3.2-3B 3B W: INT8 & A: INT8 3.66 4.2 72.58 75.23 73.39 73.74 72.68 87.22 74.37 69.31 95.67 49.67 62.67 17.33 15.00 0.00 40.06 66.06
Llama-3.2-3B 3B FP8 6.44 4.2 74.07 75.31 72.91 73.04 71.19 88.03 74.03 68.74 96.00 41.33 61.67 18.00 19.00 0.00 39.33 65.57
Llama-3.2-3B 3B FP8-dynamic 6.44 4.2 73.49 75.13 72.71 73.34 72.05 87.53 73.58 69.75 94.00 52.33 60.33 16.00 18.00 0.00 40.11 65.80
Llama-3.1 8B None 30.65 15 83.45 85.27 83.45 84.51 83.50 92.07 79.58 74.28 86.00 78.67 74.67 56.33 59.67 5.33 60.11 77.93
Llama-3.1 8B W: INT8 & A: INT8 8.98 8.5 83.37 85.27 83.45 84.41 83.32 92.33 79.98 73.63 82.33 77.00 70.67 58.00 62.33 4.67 59.17 77.53
Llama-3.1 8B W: INT8 & A: INT16 15.94 8.5 83.95 84.89 83.78 83.75 83.62 92.34 80.32 73.87 86.33 79.00 73.67 56.00 65.00 5.67 60.95 78.01
Llama-3.1 8B W: INT4 & A: INT16 12.6 5.4 82.21 83.80 82.13 80.74 81.70 90.49 76.62 73.57 82.67 66.67 69.67 52.00 56.67 6.67 55.73 74.38
Llama-3.1 8B FP8 14.44 8.5 82.89 84.63 83.42 84.94 83.83 92.17 79.52 73.93 81.00 81.33 72.00 51.67 61.00 6.00 58.83 77.15
Llama-3.1 8B FP8-dynamic 21.09 8.5 83.27 84.86 82.97 83.88 84.69 92.33 81.00 74.09 81.67 74.67 74.33 53.00 65.33 5.00 59.00 77.13
Llama-3.1 70B None 269.17 132 95.10 95.27 94.72 94.44 94.64 98.34 94.43 83.73 100.00 100.00 99.00 97.00 100.00 88.00 97.33 94.74
Llama-3.1 70B W: INT8 & A: INT8 69.34 68 94.72 95.00 94.52 94.62 94.54 98.43 94.62 83.92 100.00 100.00 99.33 96.67 100.00 85.33 96.89 94.82
Llama-3.1 70B W: INT8 & A: INT16 138.64 68 92.92 94.36 93.96 94.39 93.51 97.59 92.89 80.04 100.00 99.00 98.00 95.00 99.00 85.33 96.06 93.14
Llama-3.1 70B W: INT4 & A: INT16 107.34 38 95.15 95.20 94.82 95.12 94.90 98.26 94.51 82.77 100.00 100.00 98.67 97.00 99.67 76.33 95.28 94.09
Llama-3.1 70B FP8 107.32 68 94.87 95.40 94.67 94.52 94.74 98.36 94.71 83.87 100.00 100.00 98.67 96.33 100.00 86.67 96.95 94.92
Llama-3.1 70B FP8-dynamic 176.63 68 94.64 95.38 95.00 95.10 94.52 98.46 94.54 83.70 100.00 100.00 98.67 97.67 100.00 86.00 97.06 95.04
Mistral-v0.3 7B None 27.67 14 54.84 55.98 54.76 57.90 54.23 88.99 76.82 69.83 60.33 48.33 21.33 5.67 2.00 1.00 23.11 46.93
Mistral-v0.3 7B W: INT8 & A: INT8 34.84 7.1 52.11 55.60 53.88 55.75 52.79 88.65 75.97 70.52 55.33 38.67 14.67 5.00 0.67 1.00 19.22 44.04
Mistral-v0.3 7B W: INT8 & A: INT16 14.36 7.1 54.26 55.85 54.13 56.86 52.82 89.07 76.68 70.22 62.00 45.00 24.00 5.33 2.00 0.00 23.06 46.15
Mistral-v0.3 7B W: INT4 & A: INT16 11.17 3.9 53.93 56.03 51.91 53.47 50.87 88.33 74.97 69.83 54.00 25.00 16.00 3.00 4.00 0.00 17.00 42.95
Mistral-v0.3 7B FP8 -- -- 54.13 54.99 53.96 57.67 53.85 88.64 76.39 69.48 58.33 45.67 21.00 5.33 2.67 0.00 22.17 46.09
Mistral-Nemo 12B None 57.89 23 86.76 86.08 85.57 84.94 85.34 92.79 83.70 72.78 95.00 81.33 78.33 54.67 49.33 6.67 60.89 77.02
Mistral-Nemo 12B W: INT4 & A: INT16 61.98 7.8 84.74 85.67 84.61 83.67 84.99 91.82 81.80 71.33 97.00 79.00 77.33 42.33 59.67 7.33 60.44 76.02
Mistral-Nemo 12B FP8 -- -- 87.31 86.58 85.67 85.77 85.29 92.19 83.16 73.41 95.00 78.67 77.33 50.33 48.33 9.00 59.78 76.43
SmolLM2 1.7B None 6.55 3.2 46.17 43.75 44.23 41.47 44.78 75.04 54.21 53.18 55.33 28.00 14.67 2.67 0.33 0.00 16.83 36.99
Minitron 4B None 16.01 7.9 27.95 28.68 35.41 34.80 34.07 -- -- -- -- -- -- -- -- -- -- 32.18
Hymba 1.5B None -- 2.9 53.75 53.53 52.87 52.99 52.74 84.57 66.78 64.73 34.67 12.00 1.00 0.00 0.00 0.00 7.95 38.10
Phi-3.5-mini 3.8B None 14.6 7.2 85.47 87.14 82.97 80.74 82.89 95.09 86.89 76.11 90.33 77.33 68.67 18.33 29.00 0.33 47.33 70.44
Phi-3-small 7B None -- 17.95 70.10 81.73 83.14 86.02 83.62 97.12 91.38 79.85 98.00 93.33 69.00 52.00 9.33 0.67 53.72 74.69
Qwen2-0.5B 0.5B None 2.02 0.95 37.25 38.31 26.38 28.46 26.76 56.41 40.44 48.13 10.33 0.00 0.00 0.00 0.00 0.00 1.72 26.25
Qwen2-0.5B 0.5B GPTQ 8-bit 0.71 1.4 38.08 37.91 26.33 27.27 26.59 56.13 40.30 47.50 7.67 0.33 0.00 0.00 0.00 0.00 1.33 25.84
Qwen2-0.5B 0.5B GPTQ 4-bit 1.12 0.71 21.51 25.32 14.38 16.76 14.23 52.05 37.03 43.11 2.00 0.00 0.00 0.00 0.00 0.00 0.33 22.64
Qwen2-0.5B 0.5B W: INT8 & A: INT16 1.38 0.61 37.68 38.13 26.43 26.54 26.81 56.51 39.87 47.23 11.67 0.33 0.00 0.00 0.00 0.00 2.00 26.10
Qwen2-0.5B 0.5B W: INT8 & A: INT8 1.38 0.87 37.60 37.50 26.23 26.99 25.78 55.36 40.27 47.45 7.33 0.33 0.00 0.00 0.00 0.00 1.28 25.35
Qwen2-0.5B 0.5B W: INT4 & A: INT16 1.51 0.71 25.42 27.32 18.09 18.35 16.40 50.56 36.63 42.42 5.67 0.67 0.00 0.00 0.00 0.00 1.06 24.05
Qwen2-0.5B 0.5B FP8 -- -- 35.20 35.94 23.17 25.25 22.52 56.61 40.13 46.76 6.33 0.67 0.00 0.00 0.00 0.00 1.17 29.26
Qwen2-1.5B 1.5B None 7.09 2.9 62.83 64.85 56.46 59.51 55.88 84.34 67.29 69.78 44.67 21.33 7.33 0.00 0.00 0.00 12.22 46.19
Qwen2-1.5B 1.5B GPTQ 8-bit 2.54 3.1 62.85 63.86 57.16 59.79 57.24 84.19 66.55 69.97 46.33 20.00 7.33 0.00 0.00 0.00 12.28 46.42
Qwen2-1.5B 1.5B GPTQ 4-bit 1.81 2.4 56.31 57.54 49.41 52.99 49.66 82.03 63.99 68.99 33.00 13.00 3.33 0.00 0.00 0.00 8.22 42.00
Qwen2-1.5B 1.5B W: INT8 & A: INT16 2.51 1.7 62.98 64.04 56.41 59.72 57.19 83.96 66.84 70.19 47.67 21.33 5.67 0.00 0.00 0.00 12.45 46.67
Qwen2-1.5B 1.5B W: INT8 & A: INT8 2.48 2.2 62.45 63.00 54.13 58.73 55.75 83.64 66.84 69.72 46.33 21.33 5.67 0.00 0.00 0.00 12.22 46.13
Qwen2-1.5B 1.5B W: INT4 & A: INT16 3.14 1.6 57.90 58.55 48.40 53.10 48.29 81.64 63.51 66.42 43.00 17.67 6.00 0.33 0.00 0.00 11.17 42.87
Qwen2-1.5B 1.5B FP8 -- -- 61.97 63.38 53.88 57.27 54.28 83.77 66.33 68.93 42.33 19.33 7.00 0.00 0.00 0.00 11.44 45.21
Qwen2-7B 7B None 30.05 15 87.14 87.34 86.58 85.82 86.40 94.21 85.52 80.54 83.33 80.33 45.00 36.33 15.00 2.67 43.78 68.35
Qwen2-7B 7B GPTQ 8-bit 9.63 8.3 87.16 87.54 86.56 86.50 86.40 94.28 85.64 80.04 84.33 82.67 44.00 32.33 14.67 3.00 43.50 68.35
Qwen2-7B 7B GPTQ 4-bit 6.48 5.3 85.54 86.35 85.92 84.96 85.42 93.45 85.52 78.92 80.67 72.00 33.33 23.33 4.33 0.33 35.67 64.91
Qwen2-7B 7B W: INT8 & A: INT16 9.42 8.2 86.40 87.06 86.15 85.97 86.38 93.91 85.47 80.13 84.00 80.67 43.67 35.00 14.33 1.67 43.22 67.59
Qwen2-7B 7B W: INT8 & A: INT8 9.58 8.2 87.11 87.31 86.63 86.58 86.56 94.02 85.38 79.66 79.33 83.67 40.33 31.67 17.00 0.67 42.11 67.69
Qwen2-7B 7B W: INT4 & A: INT16 12.96 5.3 84.53 85.57 85.32 84.91 85.19 94.22 84.95 78.98 79.67 77.00 43.00 26.67 5.33 0.00 38.61 65.85
Qwen2-7B 7B FP8 -- -- 86.66 87.14 86.05 86.56 86.13 94.26 85.41 80.32 81.00 83.00 47.00 29.00 13.33 1.00 42.39 67.45
Qwen2.5-0.5B 0.5B None 2.02 0.95 46.80 46.88 42.73 43.19 42.28 62.50 44.28 46.90 11.67 3.67 0.33 0.00 0.00 0.00 2.61 24.37
Qwen2.5-0.5B 0.5B GPTQ 8-bit 0.71 0.62 46.85 47.18 42.20 44.20 42.25 61.74 44.43 46.19 12.67 3.00 0.00 0.00 0.00 0.00 2.61 24.23
Qwen2.5-0.5B 0.5B GPTQ 4-bit 1.12 0.45 34.62 32.85 28.15 27.52 27.80 52.58 37.63 36.42 5.33 3.33 0.33 0.00 0.00 0.00 1.50 19.04
Qwen2.5-1.5B 1.5B None 6.68 2.9 70.00 70.20 69.72 68.46 69.90 87.58 73.81 71.85 66.33 65.33 34.33 7.33 1.33 0.00 29.11 53.22
Qwen2.5-1.5B 1.5B GPTQ 8-bit 2.54 1.7 70.33 70.33 70.03 68.99 69.52 87.78 73.72 72.10 68.33 65.33 36.67 8.00 1.33 0.00 29.94 54.10
Qwen2.5-1.5B 1.5B GPTQ 4-bit 1.81 1.1 64.92 64.92 62.40 63.28 62.42 86.25 70.25 69.10 60.00 46.67 12.67 7.33 0.00 0.00 21.11 47.02
Qwen2.5-3B 3B None 12.42 5.8 84.74 84.38 85.44 84.96 85.44 93.49 83.73 76.25 78.33 75.33 47.67 34.33 2.67 1.00 39.89 65.78
Qwen2.5-3B 3B GPTQ 8-bit 4.21 3.3 85.17 84.99 84.38 84.38 84.71 93.55 83.53 76.77 80.33 75.00 47.67 32.67 2.00 1.00 39.78 65.87
Qwen2.5-3B 3B GPTQ 4-bit 2.88 2.0 81.78 81.60 81.58 81.91 81.78 92.12 80.86 71.96 72.67 65.67 17.67 19.67 0.00 1.00 29.45 61.29
Qwen2.5-7B 7B None 30.05 15 91.76 92.19 91.05 91.89 91.33 96.03 90.53 82.66 94.33 90.00 69.67 47.00 39.33 5.67 57.67 78.36
Qwen2.5-7B 7B GPTQ 8-bit 9.63 8.3 91.84 92.22 91.81 91.56 91.31 96.03 90.64 82.58 94.00 92.00 71.33 49.00 41.33 5.67 58.89 78.87
Qwen2.5-7B 7B GPTQ 4-bit 6.48 5.3 90.62 91.23 90.65 90.73 90.85 95.62 89.19 82.69 80.67 15.00 58.33 15.67 31.67 1.00 33.72 68.35
Qwen2.5-14B 14B None 57.04 28 94.29 94.57 94.06 94.54 93.86 97.87 93.37 84.08 96.33 95.33 84.00 72.00 61.33 38.67 74.61 85.80
Qwen2.5-14B 14B GPTQ 8-bit 17.24 16 94.49 94.95 93.71 94.59 94.11 97.90 93.71 84.22 96.33 95.00 84.00 72.00 65.00 36.33 74.78 86.02
Qwen2.5-14B 14B GPTQ 4-bit 10.65 9.4 94.74 94.69 94.01 94.31 93.63 97.57 93.17 83.10 95.00 95.67 82.33 64.00 54.33 26.00 69.56 83.53
Qwen2.5-32B 32B None 125 62 95.40 95.78 95.20 95.55 94.92 98.26 95.25 87.11 99.00 99.33 93.33 92.33 79.00 60.00 87.17 91.77
Qwen2.5-32B 32B GPTQ 8-bit 33.81 33 95.73 95.86 95.50 95.60 95.25 98.34 95.16 86.62 99.00 99.00 93.33 92.33 79.67 61.00 87.39 91.86
Qwen2.5-32B 32B GPTQ 4-bit 52.42 19 95.73 95.73 94.92 95.43 95.12 98.09 95.19 87.06 100.00 100.00 98.33 91.67 77.33 56.33 87.28 91.80

Note: The sorting tasks involve arranging lists of numbers, with (+ve) containing only positive numbers and (mixed) containing both positive and negative numbers.
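
To make the sorting setup concrete, the sketch below shows how such prompts and exact-match scoring could be constructed. This is an illustrative reconstruction, not the paper's generation script: the ascending-order instruction, value ranges, and prompt wording are assumptions.

import random
import re

def make_sorting_task(length, mixed, seed=0):
    """Build one sorting prompt and its reference answer.
    mixed=False mirrors the (+ve) setting (positive integers only);
    mixed=True also draws negative integers. Ranges are illustrative."""
    rng = random.Random(seed)
    low = -100 if mixed else 1
    numbers = [rng.randint(low, 100) for _ in range(length)]
    prompt = ("Sort the following numbers in ascending order and return only "
              f"the sorted list: {numbers}")
    return prompt, sorted(numbers)

def exact_match(model_output, reference):
    """Count a completion as correct only if every number matches in order.
    Assumes the completion contains only the sorted list."""
    predicted = [int(tok) for tok in re.findall(r"-?\d+", model_output)]
    return predicted == reference

# Example: one 8-item mixed list, as in the Sort-8 (mixed) column.
prompt, reference = make_sorting_task(length=8, mixed=True)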

Key Findings

Breakthrough insights that challenge conventional wisdom about small language models and their reasoning capabilities

Small ≠ Weak

32B Qwen Rivals GPT-4 Turbo

Qwen2.5-32B matches GPT-4-Turbo on intermediate reasoning (MR-GSM8K: 55.6 vs. 53.0) and reaches 95% on GSM8K while using roughly one-fifth the parameters, overturning the "> 100B for reasoning" myth.

Quantization is Free

75% Memory Reduction

4- to 8-bit GPTQ quantization cuts GPU memory by up to 75% while preserving ≥ 99% of accuracy on GSM8K, ARC, CommonsenseQA, and robustness benchmarks, enabling laptop-scale deployment of formerly heavyweight models.
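
As a rough illustration of that memory saving, here is a minimal sketch (not the paper's evaluation harness) of loading a full-precision checkpoint and a GPTQ-quantized counterpart with Hugging Face transformers and comparing peak GPU memory. The checkpoint names are illustrative, and the GPTQ load assumes a GPTQ backend (e.g., optimum with auto-gptq or gptqmodel) is installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def peak_memory_gb(model_id):
    """Load a model, run one short generation, and report peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("Question: What is 17 * 24?\nAnswer:",
                       return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=32)
    peak = torch.cuda.max_memory_allocated() / 1024 ** 3
    del model
    torch.cuda.empty_cache()
    return peak

# Illustrative checkpoint names; substitute any full-precision / GPTQ pair
# from the leaderboard above.
print(peak_memory_gb("Qwen/Qwen2.5-7B"))            # full precision
print(peak_memory_gb("Qwen/Qwen2.5-7B-GPTQ-Int8"))  # 8-bit GPTQ (illustrative ID)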

Keep Prompts Simple

Direct I/O Wins

On GSM8K, direct I/O prompts match or outperform Chain-of-Thought and multi-shot variants; additional "think step-by-step" instructions often confuse SLMs rather than helping them.
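
For clarity, the two settings contrasted here look roughly like the following. These are illustrative templates, not the paper's exact prompts; the question is taken from GSM8K.

# Illustrative GSM8K prompt templates for direct I/O vs. zero-shot CoT.
question = ("Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia sell "
            "altogether in April and May?")

# Direct input/output: ask for the answer with no reasoning instructions.
direct_io_prompt = f"Question: {question}\nAnswer:"

# Zero-shot chain-of-thought: add an explicit "think step by step" cue,
# which the finding above suggests often hurts rather than helps SLMs.
cot_prompt = f"Question: {question}\nLet's think step by step.\nAnswer:"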

Sequence Length Limitation

80% → 40% Drop

Accuracy on sorting drops from > 80% (8-item lists) to < 40% (32-item mixed lists), and negative numbers exacerbate errors, revealing a sequence-length bottleneck for algorithmic reasoning.

Robustness Scales

Size Matters for Defense

Larger SLMs (32B, 70B) are the most resilient to adversarial GSM-Plus inputs, and their quantized versions show negligible degradation, whereas pruned counterparts collapse.

Pruning Hurts

30-50 Point Loss

Weight-pruned 8B models lose 30-50 points on reasoning tasks and score ~0 on MR-GSM8K, showing that aggressive sparsification cripples logical consistency even after recovery fine-tuning.

BibTeX

@article{srivastava2025towards,
  title={Towards reasoning ability of small language models},
  author={Srivastava, Gaurav and Cao, Shuxiang and Wang, Xuan},
  journal={arXiv preprint arXiv:2502.11569},
  year={2025}
}