Scrappy Tools

Teaching a 2B model to outperform a 9B model at CLI tool-calling — for $15.

We fine-tuned Qwen3.5 models to route real CLI commands—GitHub CLI, Docker, git, Hugging Face, gcloud—using real command outputs, not synthetic data. A 2-billion parameter model ended up beating a 9-billion parameter model across every category.

The Problem

Our Approach — "Real Data Only"

Seed Generation
305 seed prompts across 8 categories: gh (86), gcloud (48), docker (40), git (35), hf (28), projects (27), honesty (21), package (20). Each prompt represents a real task a developer would actually ask.
Multi-Teacher Expansion
Gemini 2.5 Flash generates diverse prompt variations for volume. 30 hand-crafted Claude examples cover quality edge cases and tricky multi-step scenarios. Human-written honesty examples teach the model to say "I don't know" when appropriate.
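The expansion step is conceptually a seed-times-phrasing cross product. A minimal stdlib sketch, where hand-written templates stand in for the Gemini 2.5 Flash call (the templates and seed tasks here are illustrative, not from the actual pipeline):

```python
import itertools

# Illustrative phrasing templates; the real pipeline had an LLM generate these.
TEMPLATES = [
    "How do I {task}?",
    "What's the command to {task}?",
    "{task} -- which CLI call do I need?",
]

# Hypothetical seed tasks; the real pipeline used 305 seeds across 8 categories.
SEED_TASKS = [
    "list open pull requests on this repo",
    "stop every running docker container",
]

def expand(seeds, templates):
    """Cross every seed task with every phrasing template."""
    return [t.format(task=s) for s, t in itertools.product(seeds, templates)]

variants = expand(SEED_TASKS, TEMPLATES)
print(len(variants))  # 2 seeds x 3 templates = 6 prompts
```

With an LLM in place of the template list, each seed fans out into many paraphrases while the underlying task (and thus the target tool call) stays fixed.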
Real Execution
Every command is actually run on real systems. Real stdout, real stderr, real exit codes. No made-up outputs. If a command fails, we capture that failure—the model needs to learn what errors look like too.
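Capturing real executions is mechanically simple. A minimal sketch, assuming a flat record per command (the field names are our assumption, not the project's actual schema):

```python
import subprocess
import sys

def run_and_record(cmd):
    """Execute a command and keep its real stdout, stderr, and exit code,
    whether it succeeded or failed -- failures are training signal too."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": cmd,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
    }

# A success and a deliberate failure, both captured the same way.
ok = run_and_record([sys.executable, "-c", "print('hello')"])
bad = run_and_record([sys.executable, "-c", "import sys; sys.exit(2)"])
print(ok["exit_code"], bad["exit_code"])  # 0 2
```

The point of keeping the failure path is exactly what the text says: the model has to see what real errors look like, not just happy-path transcripts.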
Schema-Grounded
Full OpenAI-compatible function schemas extracted from actual CLI --help output. Not hand-written, not hallucinated—parsed from the tools themselves.
gh: 84 tools, 655 props · hf: 65 tools, 356 props · gcloud: 63 tools · docker: 37 tools
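Schema extraction boils down to parsing flag lines out of --help text into OpenAI-style function schemas. A simplified sketch; the shortened help text, regex, and `gh_pr_list` name below are illustrative assumptions, not the project's actual parser:

```python
import re

# Hand-shortened, illustrative --help text (real CLI output is much longer).
HELP_TEXT = """\
List pull requests in a repository

FLAGS
  -L, --limit int      Maximum number of items to fetch
  -s, --state string   Filter by state: {open|closed|merged|all}
"""

FLAG_RE = re.compile(r"--(\w[\w-]*)\s+(\w+)\s+(.+)")

def help_to_schema(name, help_text):
    """Turn long-form flag lines into an OpenAI-compatible function schema."""
    type_map = {"int": "integer", "string": "string"}
    props = {}
    for m in FLAG_RE.finditer(help_text):
        flag, typ, desc = m.groups()
        props[flag] = {"type": type_map.get(typ, "string"),
                       "description": desc.strip()}
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": help_text.splitlines()[0],
            "parameters": {"type": "object", "properties": props},
        },
    }

schema = help_to_schema("gh_pr_list", HELP_TEXT)
print(sorted(schema["function"]["parameters"]["properties"]))  # ['limit', 'state']
```

Because the properties come from the tool's own help output, the schema can't drift from what the CLI actually accepts.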
Training
QLoRA on consumer/prosumer GPUs—an RTX 5090 (32GB) and a DGX Spark GB10. Total compute cost across all training runs: ~$15.
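One training example, once executed and schema-grounded, ends up as a chat-format record the SFT trainer can consume. A sketch of the shape, assuming OpenAI-style message fields (the `docker_stop` tool name and argument are hypothetical; the project's exact record layout may differ):

```python
import json

# Hypothetical tool-calling training record in OpenAI chat format.
example = {
    "messages": [
        {"role": "user", "content": "Stop all running docker containers"},
        {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "docker_stop",  # hypothetical tool name
                    "arguments": json.dumps({"all_running": True}),
                },
            }],
        },
    ]
}

# Records like this are typically serialized one-per-line as JSONL for SFT.
line = json.dumps(example)
print(json.loads(line)["messages"][1]["tool_calls"][0]["function"]["name"])
```

The supervision target is the assistant turn: given the user request and the tool schemas, emit the right function name with the right arguments.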

Results — The 2B Surprise

We expected the 9B fine-tune to dominate. It didn't. The 2B model matched or beat it in every single category.

Category     Raw Qwen3.5-9B (baseline)   Fine-tuned 9B (V1)   Fine-tuned 2B (V3)
Overall      0.20                        0.39                 0.55
docker       0.35                        0.55                 0.89
gh           0.10                        0.50                 0.61
git          0.30                        0.30                 0.67
hf           0.08                        0.42                 0.56
package      0.25                        0.62                 0.62
honesty      0.25                        0.00                 0.42
gcloud       0.17                        0.08                 0.25
reasoning    —                           0.12                 0.29
Highlights: +175% improvement over baseline · 89% Docker routing accuracy · 1.9 GB quantized model size · $15 total compute cost

Key Takeaways

  • The 2B model matches or beats the fine-tuned 9B in every category, not just overall.
  • Docker routing hits 89% accuracy. That's production-ready for most workflows.
  • Honesty recovered from catastrophic regression in V1 (0.00 → 0.42). The 9B model learned to never refuse; the 2B model learned when to say "I can't do that."
  • At 1.9 GB quantized, the 2B model fits on practically any GPU, and even runs on a Raspberry Pi with enough RAM.
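An honesty example of the kind described above teaches a refusal with a reason instead of a fabricated tool call. The wording below is illustrative, not taken from the actual training set:

```python
# Illustrative honesty training pair: the target is plain refusal text,
# and crucially the assistant turn carries no tool_calls field at all.
honesty_example = {
    "messages": [
        {"role": "user",
         "content": "Roll back last night's database migration"},
        {"role": "assistant",
         "content": ("I can't do that: none of my available tools manage "
                     "database migrations. Your migration framework's own "
                     "CLI is the right place for this.")},
    ]
}

assert "tool_calls" not in honesty_example["messages"][1]
print(honesty_example["messages"][1]["content"][:10])  # I can't do
```

Mixing records like this into the set is what keeps the model from inventing a tool call for every request, which is the failure mode the 9B V1 run fell into.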

Training Details

Base models: Qwen3.5-2B / Qwen3.5-9B
Method: QLoRA (r=16 for 2B, r=32 for 9B)
Training data: 1,642 examples (821 tool-calling + 821 reasoning)
Training time: ~2.3 h (2B), ~4.7 h (9B)
Hardware: RTX 5090 (32 GB) + DGX Spark GB10
Export: merged → GGUF quantized
2B size: q8_0 = 1.9 GB
9B size: q4_k_m = 5.6 GB
Framework: Unsloth + TRL SFTTrainer
Total cost: ~$15

Lessons Learned

  1. Bigger isn't always better for fine-tuning. The 9B base was already competent enough that training on 1,642 examples added more noise than signal. The 2B had more room to grow.
  2. 2B is the sweet spot. Weak enough to improve dramatically from targeted data, small enough for fast iteration cycles (2.3 hours vs. 4.7), tiny enough to deploy anywhere.
  3. Real data beats synthetic data. Commands that actually ran on real systems, with real outputs, real errors, and real edge cases, produce better training signal than LLM-generated fiction.
  4. Multi-teacher helps. Gemini for volume, Claude for quality edge cases, humans for honesty calibration. Each source catches blind spots the others miss.
  5. Tool distribution matters. 302 examples of one tool vs. 5 of another means the model ignores the rare tool. Balancing the training set across categories was as important as the data quality.
  6. $15 total compute. Fine-tuning is accessible to anyone with a single GPU. You don't need a cluster or a cloud budget; you need good data and a clear objective.
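The balancing step from lesson 5 can be sketched as capping each category's example count (the cap value and the 302-vs-5 skew below are illustrative):

```python
import random
from collections import defaultdict

def balance(examples, cap, seed=0):
    """Downsample over-represented categories so no single tool dominates."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)
    balanced = []
    for cat, exs in sorted(by_cat.items()):
        balanced.extend(rng.sample(exs, cap) if len(exs) > cap else exs)
    return balanced

# Made-up skew mirroring lesson 5: 302 examples of one tool vs. 5 of another.
data = ([{"category": "gh"} for _ in range(302)]
        + [{"category": "hf"} for _ in range(5)])
out = balance(data, cap=50)
print(len(out))  # 50 gh + 5 hf = 55
```

Downsampling is the blunt version; in practice you might instead generate more examples for the rare categories, but the goal is the same flat distribution.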

Why This Matters — The Pipeline, Not Just the Model

The models are useful, but they're not the point. The point is demonstrating a repeatable pipeline that takes a domain, builds real training data, and produces a small model that works.

Get the Models

Everything is open source. Model weights, training data, CLI schemas, eval scripts. Take it, use it, build on it.

View on GitHub

Includes: training data · CLI schemas (gh, hf, gcloud, docker) · eval suite · GGUF weights
License: Apache 2.0