Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale (~100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability, yet there is no systematic study of the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points, and we repeat all experiments three times to ensure a robust performance assessment. We further analyze the impact of different prompting strategies on small models. Beyond accuracy, we also evaluate robustness under adversarial conditions and the quality of intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities are developed through structured training or post-training compression, serving as efficient alternatives to LLMs for reasoning tasks.
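As a rough illustration of this evaluation protocol, the sketch below aggregates accuracy over repeated runs and measures judge–human agreement. It is a minimal sketch rather than the paper's actual harness: the helper names (`score_run`, `judge_agreement`) and the example numbers are hypothetical.

```python
import statistics

def score_run(predictions, references):
    """Exact-match accuracy (%) for one evaluation run (hypothetical scorer)."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

def aggregate_runs(run_scores):
    """Mean and standard deviation over repeated runs (the paper repeats 3 times)."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)

def judge_agreement(judge_labels, human_labels):
    """Fraction of items on which an LLM judge and a human annotator agree."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Example: three repeated runs of one model on one benchmark (illustrative numbers).
runs = [83.45, 83.95, 82.89]
mean_acc, std_acc = aggregate_runs(runs)
print(f"accuracy: {mean_acc:.2f} ± {std_acc:.2f}")

# Example: LLM-judge labels vs. human labels on a shared subset (e.g., 800 items).
judge = [1, 1, 0, 1]   # 1 = judged correct, 0 = judged incorrect
human = [1, 0, 0, 1]
print(f"judge–human agreement: {judge_agreement(judge, human):.2%}")
```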
Performance of small language models on reasoning benchmarks (GSM8K under five prompting settings, ARC-Easy, ARC-Challenge, CommonsenseQA) and six sorting tasks, along with GPU memory and disk size.
Models are grouped by family and listed in order of increasing parameter count; within each group, quantized variants follow the unquantized baseline.
Model | Params | Quantization | GPU Memory (GB) | Disk Size (GB) | GSM8K (Direct I/O) | GSM8K (CoT) | GSM8K (5-Shot) | GSM8K (5-Shot CoT) | GSM8K (8-Shot) | ARC-Easy | ARC-Challenge | CommonsenseQA | Sort-8 (+ve) | Sort-8 (mixed) | Sort-16 (+ve) | Sort-16 (mixed) | Sort-32 (+ve) | Sort-32 (mixed) | Sorting Average | Overall Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama-3.2-1B | 1B | None | 4.73 | 2.4 | 36.39 | 38.99 | 33.69 | 32.73 | 33.13 | 67.23 | 47.50 | 48.38 | 44.67 | 1.33 | 1.00 | 0.00 | 0.00 | 0.00 | 7.83 | 29.67 |
Llama-3.2-1B | 1B | W: INT8 & A: INT8 | 1.53 | 1.9 | 36.87 | 39.63 | 32.07 | 30.58 | 32.88 | 67.45 | 47.90 | 48.10 | 50.33 | 1.67 | 1.33 | 0.00 | 0.00 | 0.00 | 8.89 | 30.17 |
Llama-3.2-1B | 1B | FP8 | 2.47 | 1.9 | 36.42 | 39.63 | 31.16 | 30.88 | 31.87 | 67.03 | 48.01 | 48.48 | 42.67 | 1.33 | 0.67 | 0.00 | 0.00 | 0.00 | 7.45 | 29.47 |
Llama-3.2-1B | 1B | FP8-dynamic | 2.47 | 2.0 | 36.21 | 40.86 | 32.93 | 31.24 | 32.83 | 67.02 | 48.69 | 48.40 | 45.67 | 2.00 | 2.67 | 0.00 | 0.00 | 0.00 | 8.39 | 30.29 |
Llama-3.2-3B | 3B | None | 13.21 | 6.0 | 73.54 | 75.18 | 74.02 | 72.73 | 72.61 | 87.84 | 74.63 | 69.72 | 96.67 | 55.33 | 73.33 | 17.33 | 17.00 | 0.00 | 43.28 | 67.19 |
Llama-3.2-3B | 3B | W: INT8 & A: INT8 | 3.66 | 4.2 | 72.58 | 75.23 | 73.39 | 73.74 | 72.68 | 87.22 | 74.37 | 69.31 | 95.67 | 49.67 | 62.67 | 17.33 | 15.00 | 0.00 | 40.06 | 66.06 |
Llama-3.2-3B | 3B | FP8 | 6.44 | 4.2 | 74.07 | 75.31 | 72.91 | 73.04 | 71.19 | 88.03 | 74.03 | 68.74 | 96.00 | 41.33 | 61.67 | 18.00 | 19.00 | 0.00 | 39.33 | 65.57 |
Llama-3.2-3B | 3B | FP8-dynamic | 6.44 | 4.2 | 73.49 | 75.13 | 72.71 | 73.34 | 72.05 | 87.53 | 73.58 | 69.75 | 94.00 | 52.33 | 60.33 | 16.00 | 18.00 | 0.00 | 40.11 | 65.80 |
Llama-3.1 | 8B | None | 30.65 | 15 | 83.45 | 85.27 | 83.45 | 84.51 | 83.50 | 92.07 | 79.58 | 74.28 | 86.00 | 78.67 | 74.67 | 56.33 | 59.67 | 5.33 | 60.11 | 77.93 |
Llama-3.1 | 8B | W: INT8 & A: INT8 | 8.98 | 8.5 | 83.37 | 85.27 | 83.45 | 84.41 | 83.32 | 92.33 | 79.98 | 73.63 | 82.33 | 77.00 | 70.67 | 58.00 | 62.33 | 4.67 | 59.17 | 77.53 |
Llama-3.1 | 8B | W: INT8 & A: INT16 | 15.94 | 8.5 | 83.95 | 84.89 | 83.78 | 83.75 | 83.62 | 92.34 | 80.32 | 73.87 | 86.33 | 79.00 | 73.67 | 56.00 | 65.00 | 5.67 | 60.95 | 78.01 |
Llama-3.1 | 8B | W: INT4 & A: INT16 | 12.6 | 5.4 | 82.21 | 83.80 | 82.13 | 80.74 | 81.70 | 90.49 | 76.62 | 73.57 | 82.67 | 66.67 | 69.67 | 52.00 | 56.67 | 6.67 | 55.73 | 74.38 |
Llama-3.1 | 8B | FP8 | 14.44 | 8.5 | 82.89 | 84.63 | 83.42 | 84.94 | 83.83 | 92.17 | 79.52 | 73.93 | 81.00 | 81.33 | 72.00 | 51.67 | 61.00 | 6.00 | 58.83 | 77.15 |
Llama-3.1 | 8B | FP8-dynamic | 21.09 | 8.5 | 83.27 | 84.86 | 82.97 | 83.88 | 84.69 | 92.33 | 81.00 | 74.09 | 81.67 | 74.67 | 74.33 | 53.00 | 65.33 | 5.00 | 59.00 | 77.13 |
Llama-3.1 | 70B | None | 269.17 | 132 | 95.10 | 95.27 | 94.72 | 94.44 | 94.64 | 98.34 | 94.43 | 83.73 | 100.00 | 100.00 | 99.00 | 97.00 | 100.00 | 88.00 | 97.33 | 94.74 |
Llama-3.1 | 70B | W: INT8 & A: INT8 | 69.34 | 68 | 94.72 | 95.00 | 94.52 | 94.62 | 94.54 | 98.43 | 94.62 | 83.92 | 100.00 | 100.00 | 99.33 | 96.67 | 100.00 | 85.33 | 96.89 | 94.82 |
Llama-3.1 | 70B | W: INT8 & A: INT16 | 138.64 | 68 | 92.92 | 94.36 | 93.96 | 94.39 | 93.51 | 97.59 | 92.89 | 80.04 | 100.00 | 99.00 | 98.00 | 95.00 | 99.00 | 85.33 | 96.06 | 93.14 |
Llama-3.1 | 70B | W: INT4 & A: INT16 | 107.34 | 38 | 95.15 | 95.20 | 94.82 | 95.12 | 94.90 | 98.26 | 94.51 | 82.77 | 100.00 | 100.00 | 98.67 | 97.00 | 99.67 | 76.33 | 95.28 | 94.09 |
Llama-3.1 | 70B | FP8 | 107.32 | 68 | 94.87 | 95.40 | 94.67 | 94.52 | 94.74 | 98.36 | 94.71 | 83.87 | 100.00 | 100.00 | 98.67 | 96.33 | 100.00 | 86.67 | 96.95 | 94.92 |
Llama-3.1 | 70B | FP8-dynamic | 176.63 | 68 | 94.64 | 95.38 | 95.00 | 95.10 | 94.52 | 98.46 | 94.54 | 83.70 | 100.00 | 100.00 | 98.67 | 97.67 | 100.00 | 86.00 | 97.06 | 95.04 |
Mistral-v0.3 | 7B | None | 27.67 | 14 | 54.84 | 55.98 | 54.76 | 57.90 | 54.23 | 88.99 | 76.82 | 69.83 | 60.33 | 48.33 | 21.33 | 5.67 | 2.00 | 1.00 | 23.11 | 46.93 |
Mistral-v0.3 | 7B | W: INT8 & A: INT8 | 34.84 | 7.1 | 52.11 | 55.60 | 53.88 | 55.75 | 52.79 | 88.65 | 75.97 | 70.52 | 55.33 | 38.67 | 14.67 | 5.00 | 0.67 | 1.00 | 19.22 | 44.04 |
Mistral-v0.3 | 7B | W: INT8 & A: INT16 | 14.36 | 7.1 | 54.26 | 55.85 | 54.13 | 56.86 | 52.82 | 89.07 | 76.68 | 70.22 | 62.00 | 45.00 | 24.00 | 5.33 | 2.00 | 0.00 | 23.06 | 46.15 |
Mistral-v0.3 | 7B | W: INT4 & A: INT16 | 11.17 | 3.9 | 53.93 | 56.03 | 51.91 | 53.47 | 50.87 | 88.33 | 74.97 | 69.83 | 54.00 | 25.00 | 16.00 | 3.00 | 4.00 | 0.00 | 17.00 | 42.95 |
Mistral-v0.3 | 7B | FP8 | -- | -- | 54.13 | 54.99 | 53.96 | 57.67 | 53.85 | 88.64 | 76.39 | 69.48 | 58.33 | 45.67 | 21.00 | 5.33 | 2.67 | 0.00 | 22.17 | 46.09 |
Mistral-Nemo | 12B | None | 57.89 | 23 | 86.76 | 86.08 | 85.57 | 84.94 | 85.34 | 92.79 | 83.70 | 72.78 | 95.00 | 81.33 | 78.33 | 54.67 | 49.33 | 6.67 | 60.89 | 77.02 |
Mistral-Nemo | 12B | W: INT4 & A: INT16 | 61.98 | 7.8 | 84.74 | 85.67 | 84.61 | 83.67 | 84.99 | 91.82 | 81.80 | 71.33 | 97.00 | 79.00 | 77.33 | 42.33 | 59.67 | 7.33 | 60.44 | 76.02 |
Mistral-Nemo | 12B | FP8 | -- | -- | 87.31 | 86.58 | 85.67 | 85.77 | 85.29 | 92.19 | 83.16 | 73.41 | 95.00 | 78.67 | 77.33 | 50.33 | 48.33 | 9.00 | 59.78 | 76.43 |
SmolLM2 | 1.7B | None | 6.55 | 3.2 | 46.17 | 43.75 | 44.23 | 41.47 | 44.78 | 75.04 | 54.21 | 53.18 | 55.33 | 28.00 | 14.67 | 2.67 | 0.33 | 0.00 | 16.83 | 36.99 |
Minitron | 4B | None | 16.01 | 7.9 | 27.95 | 28.68 | 35.41 | 34.80 | 34.07 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 32.18 |
Hymba | 1.5B | None | -- | 2.9 | 53.75 | 53.53 | 52.87 | 52.99 | 52.74 | 84.57 | 66.78 | 64.73 | 34.67 | 12.00 | 1.00 | 0.00 | 0.00 | 0.00 | 7.95 | 38.10 |
Phi-3.5-mini | 3.8B | None | 14.6 | 7.2 | 85.47 | 87.14 | 82.97 | 80.74 | 82.89 | 95.09 | 86.89 | 76.11 | 90.33 | 77.33 | 68.67 | 18.33 | 29.00 | 0.33 | 47.33 | 70.44 |
Phi-3-small | 7B | None | -- | 17.95 | 70.10 | 81.73 | 83.14 | 86.02 | 83.62 | 97.12 | 91.38 | 79.85 | 98.00 | 93.33 | 69.00 | 52.00 | 9.33 | 0.67 | 53.72 | 74.69 |
Qwen2-0.5B | 0.5B | None | 2.02 | 0.95 | 37.25 | 38.31 | 26.38 | 28.46 | 26.76 | 56.41 | 40.44 | 48.13 | 10.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.72 | 26.25 |
Qwen2-0.5B | 0.5B | GPTQ 8-bit | 0.71 | 1.4 | 38.08 | 37.91 | 26.33 | 27.27 | 26.59 | 56.13 | 40.30 | 47.50 | 7.67 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.33 | 25.84 |
Qwen2-0.5B | 0.5B | GPTQ 4-bit | 1.12 | 0.71 | 21.51 | 25.32 | 14.38 | 16.76 | 14.23 | 52.05 | 37.03 | 43.11 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 22.64 |
Qwen2-0.5B | 0.5B | W: INT8 & A: INT16 | 1.38 | 0.61 | 37.68 | 38.13 | 26.43 | 26.54 | 26.81 | 56.51 | 39.87 | 47.23 | 11.67 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 26.10 |
Qwen2-0.5B | 0.5B | W: INT8 & A: INT8 | 1.38 | 0.87 | 37.60 | 37.50 | 26.23 | 26.99 | 25.78 | 55.36 | 40.27 | 47.45 | 7.33 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.28 | 25.35 |
Qwen2-0.5B | 0.5B | W: INT4 & A: INT16 | 1.51 | 0.71 | 25.42 | 27.32 | 18.09 | 18.35 | 16.40 | 50.56 | 36.63 | 42.42 | 5.67 | 0.67 | 0.00 | 0.00 | 0.00 | 0.00 | 1.06 | 24.05 |
Qwen2-0.5B | 0.5B | FP8 | -- | -- | 35.20 | 35.94 | 23.17 | 25.25 | 22.52 | 56.61 | 40.13 | 46.76 | 6.33 | 0.67 | 0.00 | 0.00 | 0.00 | 0.00 | 1.17 | 29.26 |
Qwen2-1.5B | 1.5B | None | 7.09 | 2.9 | 62.83 | 64.85 | 56.46 | 59.51 | 55.88 | 84.34 | 67.29 | 69.78 | 44.67 | 21.33 | 7.33 | 0.00 | 0.00 | 0.00 | 12.22 | 46.19 |
Qwen2-1.5B | 1.5B | GPTQ 8-bit | 2.54 | 3.1 | 62.85 | 63.86 | 57.16 | 59.79 | 57.24 | 84.19 | 66.55 | 69.97 | 46.33 | 20.00 | 7.33 | 0.00 | 0.00 | 0.00 | 12.28 | 46.42 |
Qwen2-1.5B | 1.5B | GPTQ 4-bit | 1.81 | 2.4 | 56.31 | 57.54 | 49.41 | 52.99 | 49.66 | 82.03 | 63.99 | 68.99 | 33.00 | 13.00 | 3.33 | 0.00 | 0.00 | 0.00 | 8.22 | 42.00 |
Qwen2-1.5B | 1.5B | W: INT8 & A: INT16 | 2.51 | 1.7 | 62.98 | 64.04 | 56.41 | 59.72 | 57.19 | 83.96 | 66.84 | 70.19 | 47.67 | 21.33 | 5.67 | 0.00 | 0.00 | 0.00 | 12.45 | 46.67 |
Qwen2-1.5B | 1.5B | W: INT8 & A: INT8 | 2.48 | 2.2 | 62.45 | 63.00 | 54.13 | 58.73 | 55.75 | 83.64 | 66.84 | 69.72 | 46.33 | 21.33 | 5.67 | 0.00 | 0.00 | 0.00 | 12.22 | 46.13 |
Qwen2-1.5B | 1.5B | W: INT4 & A: INT16 | 3.14 | 1.6 | 57.90 | 58.55 | 48.40 | 53.10 | 48.29 | 81.64 | 63.51 | 66.42 | 43.00 | 17.67 | 6.00 | 0.33 | 0.00 | 0.00 | 11.17 | 42.87 |
Qwen2-1.5B | 1.5B | FP8 | -- | -- | 61.97 | 63.38 | 53.88 | 57.27 | 54.28 | 83.77 | 66.33 | 68.93 | 42.33 | 19.33 | 7.00 | 0.00 | 0.00 | 0.00 | 11.44 | 45.21 |
Qwen2-7B | 7B | None | 30.05 | 15 | 87.14 | 87.34 | 86.58 | 85.82 | 86.40 | 94.21 | 85.52 | 80.54 | 83.33 | 80.33 | 45.00 | 36.33 | 15.00 | 2.67 | 43.78 | 68.35 |
Qwen2-7B | 7B | GPTQ 8-bit | 9.63 | 8.3 | 87.16 | 87.54 | 86.56 | 86.50 | 86.40 | 94.28 | 85.64 | 80.04 | 84.33 | 82.67 | 44.00 | 32.33 | 14.67 | 3.00 | 43.50 | 68.35 |
Qwen2-7B | 7B | GPTQ 4-bit | 6.48 | 5.3 | 85.54 | 86.35 | 85.92 | 84.96 | 85.42 | 93.45 | 85.52 | 78.92 | 80.67 | 72.00 | 33.33 | 23.33 | 4.33 | 0.33 | 35.67 | 64.91 |
Qwen2-7B | 7B | W: INT8 & A: INT16 | 9.42 | 8.2 | 86.40 | 87.06 | 86.15 | 85.97 | 86.38 | 93.91 | 85.47 | 80.13 | 84.00 | 80.67 | 43.67 | 35.00 | 14.33 | 1.67 | 43.22 | 67.59 |
Qwen2-7B | 7B | W: INT8 & A: INT8 | 9.58 | 8.2 | 87.11 | 87.31 | 86.63 | 86.58 | 86.56 | 94.02 | 85.38 | 79.66 | 79.33 | 83.67 | 40.33 | 31.67 | 17.00 | 0.67 | 42.11 | 67.69 |
Qwen2-7B | 7B | W: INT4 & A: INT16 | 12.96 | 5.3 | 84.53 | 85.57 | 85.32 | 84.91 | 85.19 | 94.22 | 84.95 | 78.98 | 79.67 | 77.00 | 43.00 | 26.67 | 5.33 | 0.00 | 38.61 | 65.85 |
Qwen2-7B | 7B | FP8 | -- | -- | 86.66 | 87.14 | 86.05 | 86.56 | 86.13 | 94.26 | 85.41 | 80.32 | 81.00 | 83.00 | 47.00 | 29.00 | 13.33 | 1.00 | 42.39 | 67.45 |
Qwen2.5-0.5B | 0.5B | None | 2.02 | 0.95 | 46.80 | 46.88 | 42.73 | 43.19 | 42.28 | 62.50 | 44.28 | 46.90 | 11.67 | 3.67 | 0.33 | 0.00 | 0.00 | 0.00 | 2.61 | 24.37 |
Qwen2.5-0.5B | 0.5B | GPTQ 8-bit | 0.71 | 0.62 | 46.85 | 47.18 | 42.20 | 44.20 | 42.25 | 61.74 | 44.43 | 46.19 | 12.67 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.61 | 24.23 |
Qwen2.5-0.5B | 0.5B | GPTQ 4-bit | 1.12 | 0.45 | 34.62 | 32.85 | 28.15 | 27.52 | 27.80 | 52.58 | 37.63 | 36.42 | 5.33 | 3.33 | 0.33 | 0.00 | 0.00 | 0.00 | 1.50 | 19.04 |
Qwen2.5-1.5B | 1.5B | None | 6.68 | 2.9 | 70.00 | 70.20 | 69.72 | 68.46 | 69.90 | 87.58 | 73.81 | 71.85 | 66.33 | 65.33 | 34.33 | 7.33 | 1.33 | 0.00 | 29.11 | 53.22 |
Qwen2.5-1.5B | 1.5B | GPTQ 8-bit | 2.54 | 1.7 | 70.33 | 70.33 | 70.03 | 68.99 | 69.52 | 87.78 | 73.72 | 72.10 | 68.33 | 65.33 | 36.67 | 8.00 | 1.33 | 0.00 | 29.94 | 54.10 |
Qwen2.5-1.5B | 1.5B | GPTQ 4-bit | 1.81 | 1.1 | 64.92 | 64.92 | 62.40 | 63.28 | 62.42 | 86.25 | 70.25 | 69.10 | 60.00 | 46.67 | 12.67 | 7.33 | 0.00 | 0.00 | 21.11 | 47.02 |
Qwen2.5-3B | 3B | None | 12.42 | 5.8 | 84.74 | 84.38 | 85.44 | 84.96 | 85.44 | 93.49 | 83.73 | 76.25 | 78.33 | 75.33 | 47.67 | 34.33 | 2.67 | 1.00 | 39.89 | 65.78 |
Qwen2.5-3B | 3B | GPTQ 8-bit | 4.21 | 3.3 | 85.17 | 84.99 | 84.38 | 84.38 | 84.71 | 93.55 | 83.53 | 76.77 | 80.33 | 75.00 | 47.67 | 32.67 | 2.00 | 1.00 | 39.78 | 65.87 |
Qwen2.5-3B | 3B | GPTQ 4-bit | 2.88 | 2.0 | 81.78 | 81.60 | 81.58 | 81.91 | 81.78 | 92.12 | 80.86 | 71.96 | 72.67 | 65.67 | 17.67 | 19.67 | 0.00 | 1.00 | 29.45 | 61.29 |
Qwen2.5-7B | 7B | None | 30.05 | 15 | 91.76 | 92.19 | 91.05 | 91.89 | 91.33 | 96.03 | 90.53 | 82.66 | 94.33 | 90.00 | 69.67 | 47.00 | 39.33 | 5.67 | 57.67 | 78.36 |
Qwen2.5-7B | 7B | GPTQ 8-bit | 9.63 | 8.3 | 91.84 | 92.22 | 91.81 | 91.56 | 91.31 | 96.03 | 90.64 | 82.58 | 94.00 | 92.00 | 71.33 | 49.00 | 41.33 | 5.67 | 58.89 | 78.87 |
Qwen2.5-7B | 7B | GPTQ 4-bit | 6.48 | 5.3 | 90.62 | 91.23 | 90.65 | 90.73 | 90.85 | 95.62 | 89.19 | 82.69 | 80.67 | 15.00 | 58.33 | 15.67 | 31.67 | 1.00 | 33.72 | 68.35 |
Qwen2.5-14B | 14B | None | 57.04 | 28 | 94.29 | 94.57 | 94.06 | 94.54 | 93.86 | 97.87 | 93.37 | 84.08 | 96.33 | 95.33 | 84.00 | 72.00 | 61.33 | 38.67 | 74.61 | 85.80 |
Qwen2.5-14B | 14B | GPTQ 8-bit | 17.24 | 16 | 94.49 | 94.95 | 93.71 | 94.59 | 94.11 | 97.90 | 93.71 | 84.22 | 96.33 | 95.00 | 84.00 | 72.00 | 65.00 | 36.33 | 74.78 | 86.02 |
Qwen2.5-14B | 14B | GPTQ 4-bit | 10.65 | 9.4 | 94.74 | 94.69 | 94.01 | 94.31 | 93.63 | 97.57 | 93.17 | 83.10 | 95.00 | 95.67 | 82.33 | 64.00 | 54.33 | 26.00 | 69.56 | 83.53 |
Qwen2.5-32B | 32B | None | 125 | 62 | 95.40 | 95.78 | 95.20 | 95.55 | 94.92 | 98.26 | 95.25 | 87.11 | 99.00 | 99.33 | 93.33 | 92.33 | 79.00 | 60.00 | 87.17 | 91.77 |
Qwen2.5-32B | 32B | GPTQ 8-bit | 33.81 | 33 | 95.73 | 95.86 | 95.50 | 95.60 | 95.25 | 98.34 | 95.16 | 86.62 | 99.00 | 99.00 | 93.33 | 92.33 | 79.67 | 61.00 | 87.39 | 91.86 |
Qwen2.5-32B | 32B | GPTQ 4-bit | 52.42 | 19 | 95.73 | 95.73 | 94.92 | 95.43 | 95.12 | 98.09 | 95.19 | 87.06 | 100.00 | 100.00 | 98.33 | 91.67 | 77.33 | 56.33 | 87.28 | 91.80 |
Note: The sorting tasks involve arranging lists of numbers, with (+ve) containing only positive numbers and (mixed) containing both positive and negative numbers.
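The sorting tasks can be reproduced with a small generator and an exact-match scorer. The sketch below is an illustration under stated assumptions: the prompt wording, the value range, and the scoring rule (exact match of the fully sorted list) are plausible choices, not taken verbatim from the paper.

```python
import random
import re

def make_sorting_item(n, mixed, lo=-100, hi=100, seed=None):
    """Build one sorting prompt: n integers, positive-only or mixed sign."""
    rng = random.Random(seed)
    low = lo if mixed else 1
    nums = [rng.randint(low, hi) for _ in range(n)]
    prompt = ("Sort the following numbers in ascending order and return them "
              f"as a comma-separated list: {', '.join(map(str, nums))}")
    return prompt, sorted(nums)

def parse_answer(text):
    """Pull integers (including negatives) out of a model's free-form answer."""
    return [int(x) for x in re.findall(r"-?\d+", text)]

def exact_match(pred_text, target):
    """Score 1 only if the model reproduces the fully sorted list."""
    return int(parse_answer(pred_text) == target)

prompt, target = make_sorting_item(n=8, mixed=True, seed=0)
print(prompt)
print("target:", target)
print("score:", exact_match("Sorted: " + ", ".join(map(str, target)), target))
```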
Key findings that challenge conventional assumptions about small language models and their reasoning capabilities
Qwen2.5-32B matches GPT-4-Turbo on intermediate reasoning (MR-GSM8K: 55.6 vs. 53.0) and reaches 95% on GSM8K with approximately 1/5 the parameters, overturning the ">100B parameters for reasoning" assumption.
4- to 8-bit GPTQ quantization cuts GPU memory by up to 75% while preserving ≥ 99% of accuracy across GSM8K, ARC, CommonsenseQA, and robustness benchmarks, enabling laptop-scale deployment of formerly heavyweight models (see the loading sketch after this list).
On GSM8K, direct I/O prompts match or outperform Chain-of-Thought and few-shot variants; additional "think step by step" instructions often confuse SLMs rather than helping them (see the prompt-template sketch after this list).
Accuracy on sorting drops from > 80% on 8-item lists to < 40% on 32-item mixed lists, and negative numbers exacerbate errors, revealing a context-length bottleneck for algorithmic reasoning.
Larger SLMs (32B, 70B) remain the most resilient to adversarial GSM-Plus inputs, and their quantized versions show negligible degradation, whereas pruned counterparts collapse.
Weight-pruned 8B models lose 30-50 points on reasoning tasks and score ~0 on MR-GSM8K, showing that aggressive sparsification cripples logical consistency even after recovery fine-tuning.
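To make the quantization finding concrete, here is a minimal sketch of loading a GPTQ-quantized checkpoint with Hugging Face transformers and reading peak GPU memory. It assumes `transformers`, `accelerate`, and a GPTQ backend (e.g., auto-gptq/optimum) are installed, and the checkpoint name is an illustrative example rather than one necessarily used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"  # assumed example GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Reset the peak counter (weights already resident on the GPU are still counted),
# then run one GSM8K-style generation and report the peak footprint.
torch.cuda.reset_peak_memory_stats()
prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```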
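The prompting comparison covers direct I/O, zero-shot CoT, and few-shot variants. The sketch below shows one plausible way to build the five GSM8K settings from the table; the exact wording and the exemplars are assumptions, not the paper's verbatim prompts.

```python
# Hypothetical templates for the five GSM8K prompting settings in the table.

def direct_io(question):
    return f"Question: {question}\nAnswer:"

def zero_shot_cot(question):
    return f"Question: {question}\nLet's think step by step.\nAnswer:"

def few_shot(question, exemplars, cot=False):
    """exemplars: list of (question, reasoning, answer); 5 or 8 are used in the table."""
    blocks = []
    for q, reasoning, a in exemplars:
        body = f"{reasoning}\nThe answer is {a}." if cot else f"The answer is {a}."
        blocks.append(f"Question: {q}\nAnswer: {body}")
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

exemplars = [("There are 3 apples and you eat 1. How many remain?",
              "3 - 1 = 2.", "2")] * 5  # placeholder exemplars
print(few_shot("Natalia sold 48 clips in April and half as many in May. "
               "How many clips did she sell altogether?", exemplars, cot=True))
```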
@article{srivastava2025towards,
  title={Towards Reasoning Ability of Small Language Models},
author={Srivastava, Gaurav and Cao, Shuxiang and Wang, Xuan},
journal={arXiv preprint arXiv:2502.11569},
year={2025}
}