Small ≠ Weak — 32 B Qwen rivals GPT‑4 Turbo!
Qwen 2.5‑32B matches GPT‑4‑Turbo on meta‑reasoning (MR‑GSM8K: 55.6 vs 53.0) and reaches 95 % on GSM8K with roughly one‑fifth the parameters, overturning the “> 100 B for reasoning” myth.
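For context, GSM8K accuracy is typically scored by extracting the final numeric answer from each completion and exact‑matching it against the gold answer. The sketch below illustrates that convention in plain Python; it is an illustration of the metric, not the authors' evaluation harness.

```python
import re

def extract_final_number(text: str) -> float | None:
    """Return the last number in the text (GSM8K convention:
    the final numeric value is treated as the answer)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def gsm8k_accuracy(completions: list[str], gold_answers: list[str]) -> float:
    """Exact-match accuracy on the extracted final numbers."""
    hits = sum(
        extract_final_number(c) == extract_final_number(g)
        for c, g in zip(completions, gold_answers)
    )
    return hits / len(gold_answers)

# Two toy completions scored against gold answers -> 0.5
print(gsm8k_accuracy(["... so the total is 42.", "The answer is 17"], ["42", "18"]))
```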
Quantization is (almost) free!
4‑ to 8‑bit GPTQ quantization cuts GPU memory by up to 75 % yet preserves ≥ 99 % of accuracy across GSM8K, ARC, CommonsenseQA (CQA), and robustness benchmarks, enabling laptop‑scale deployment of formerly heavyweight models.
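The 75 % figure follows directly from bit‑width arithmetic: 4‑bit weights take a quarter of the space of 16‑bit weights. A back‑of‑the‑envelope sketch (weights only, ignoring KV cache, activations, and quantization scale/zero‑point overhead):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(32e9, 16)  # ~64 GB for a 32 B model
int4 = weight_memory_gb(32e9, 4)   # ~16 GB after 4-bit GPTQ
print(f"fp16: {fp16:.0f} GB, int4: {int4:.0f} GB, saving: {1 - int4 / fp16:.0%}")
```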
Keep prompts simple!
On GSM8K, direct input–output (I/O) prompts match or outperform Chain‑of‑Thought and multi‑shot variants; additional “think step by step” instructions often confuse SLMs rather than help them.
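To make the prompt styles concrete, here is a minimal sketch of the three variants being compared; the wording and the example questions are illustrative, not the exact templates from the study.

```python
question = ("Q: A bakery bakes 7 trays of 12 muffins each. "
            "How many muffins is that?\nA:")

# Direct I/O: ask for the answer with no extra instructions.
direct_prompt = question

# Chain-of-Thought: append an explicit reasoning cue.
cot_prompt = question + " Let's think step by step."

# Multi-shot: prepend solved examples before the target question.
multishot_prompt = (
    "Q: Tom has 3 boxes of 5 pens. How many pens?\nA: 15\n\n" + question
)
```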
Sequence length is the Achilles’ heel!
Accuracy on sorting falls from > 80 % (8‑item lists) to < 40 % (32‑item mixed lists), and negative numbers exacerbate errors, revealing a context‑length bottleneck for algorithmic reasoning.
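A minimal harness for reproducing this kind of probe is sketched below; `query_model` stands in for whatever inference call you use and is not part of the original setup.

```python
import random

def make_sorting_task(n_items: int, allow_negative: bool, seed: int = 0):
    """Build a sorting prompt over random integers and its gold answer."""
    rng = random.Random(seed)
    low = -99 if allow_negative else 0
    items = [rng.randint(low, 99) for _ in range(n_items)]
    prompt = ("Sort these numbers in ascending order, comma-separated: "
              + ", ".join(map(str, items)))
    return prompt, sorted(items)

def is_correct(model_output: str, expected: list[int]) -> bool:
    """Parse a comma-separated reply and compare it to the true sort."""
    try:
        predicted = [int(x) for x in model_output.replace(" ", "").split(",")]
    except ValueError:
        return False
    return predicted == expected

prompt, expected = make_sorting_task(n_items=32, allow_negative=True)
# answer = query_model(prompt)            # placeholder inference call
# print(is_correct(answer, expected))
```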
Robustness scales with size—but survives quantization!
Larger SLMs (32 B, 70 B) remain the most resilient to adversarial GSM‑Plus inputs, and their quantized versions show negligible degradation, whereas pruned counterparts collapse.
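GSM‑Plus perturbs GSM8K problems while keeping the underlying reasoning intact, for example by substituting one number for another. The snippet below illustrates that kind of perturbation on a paraphrased GSM8K‑style problem; it is an illustration of the idea, not the GSM‑Plus generation code.

```python
import re

def substitute_number(problem: str, old: str, new: str) -> str:
    """Swap one numeric value for another, leaving the reasoning
    structure of the problem unchanged (numerical-substitution style)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, problem, count=1)

original = ("A farm's hens lay 16 eggs per day. The farmer eats 3 and sells "
            "the rest at $2 per egg. How much does the farmer earn daily?")
perturbed = substitute_number(original, "16", "23")
# A robust model should still solve the perturbed version: (23 - 3) * 2 = 40.
```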
Pruning hurts, sometimes fatally!
Weight‑pruned 8 B models lose 30–50 points on reasoning tasks and score ~0 on MR‑GSM8K, showing that aggressive sparsification cripples logical consistency even after recovery fine‑tuning.
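For reference, unstructured magnitude pruning of the kind discussed here can be reproduced with PyTorch's built‑in pruning utilities; the toy layer below is a stand‑in for a single transformer projection, not the paper's exact pipeline.

```python
import torch
import torch.nn.utils.prune as prune

# Toy stand-in for one transformer projection layer.
layer = torch.nn.Linear(4096, 4096)

# Zero out the 50 % of weights with the smallest absolute magnitude
# (unstructured L1 pruning, i.e. aggressive sparsification).
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity after pruning: {sparsity:.0%}")  # ~50%
```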