The Problem
- LLMs are great at conversation but struggle with precise CLI tool-calling. They hallucinate flags, invent subcommands, and get argument order wrong.
- Bigger models help, but running a 70B model just to format a `gh pr create` command is wasteful.
- Small models (2–4B parameters) can run locally on almost any hardware, but they need targeted training to get tool-calling right.
- Synthetic training data—where an LLM invents command outputs—teaches models to hallucinate confidently rather than route correctly.
Our Approach — "Real Data Only"
- Prompts span eight categories: gh (86), gcloud (48), docker (40), git (35), hf (28), projects (27), honesty (21), package (20). Each prompt represents a real task a developer would actually ask.
- CLI schemas come straight from each tool's `--help` output. Not hand-written, not hallucinated: parsed from the tools themselves.
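Extracting a schema from `--help` output can be as simple as pattern-matching the flag listings. A minimal sketch; the function name and the trimmed sample text are illustrative assumptions, not the project's actual parser:

```python
import re

def parse_help_flags(help_text: str) -> list[str]:
    """Extract long-form flags (e.g. --title) from a tool's --help output."""
    # Matches occurrences like "--title" or "--base" anywhere in the text.
    return sorted(set(re.findall(r"--[a-z][a-z0-9-]+", help_text)))

# Trimmed, gh-style sample help text (illustrative, not verbatim output).
SAMPLE_HELP = """
FLAGS
  -B, --base branch   The branch into which you want your code merged
  -b, --body string   Body for the pull request
  -t, --title string  Title for the pull request
"""

print(parse_help_flags(SAMPLE_HELP))  # ['--base', '--body', '--title']
```

A real extractor would also capture argument types and subcommands, but even this flag list gives the model a ground-truth vocabulary it cannot hallucinate past.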
Results — The 2B Surprise
We expected the 9B fine-tune to dominate. It didn't. The 2B model matched or beat it in every single category.
| Category | Raw Qwen3.5-9B (baseline) | Fine-tuned 9B V1 | Fine-tuned 2B V3 |
|---|---|---|---|
| Overall | 0.20 | 0.39 | 0.55 |
| docker | 0.35 | 0.55 | 0.89 |
| gh | 0.10 | 0.50 | 0.61 |
| git | 0.30 | 0.30 | 0.67 |
| hf | 0.08 | 0.42 | 0.56 |
| package | 0.25 | 0.62 | 0.62 |
| honesty | 0.25 | 0.00 | 0.42 |
| gcloud | 0.17 | 0.08 | 0.25 |
| reasoning | 0.12 | — | 0.29 |
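Per-category scores like the ones in the table above can be produced by a small aggregation pass over pass/fail eval records. A minimal sketch; the record format and field names are assumptions, not the project's actual eval suite:

```python
from collections import defaultdict

def score_by_category(results):
    """Aggregate pass/fail eval records into per-category accuracy."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {cat: round(passes[cat] / totals[cat], 2) for cat in totals}

# Hypothetical eval records: one entry per prompt.
results = [
    {"category": "docker", "passed": True},
    {"category": "docker", "passed": True},
    {"category": "gh", "passed": False},
    {"category": "gh", "passed": True},
]
print(score_by_category(results))  # {'docker': 1.0, 'gh': 0.5}
```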
Key Takeaways
- The 2B model matches or beats the fine-tuned 9B in every category, not just overall.
- Docker routing hits 89% accuracy. That's production-ready for most workflows.
- Honesty recovered from catastrophic regression in V1 (0.00 → 0.42). The 9B model learned to never refuse; the 2B model learned when to say "I can't do that."
- At 1.9GB quantized, the 2B model fits on virtually any consumer GPU, and even runs on a Raspberry Pi with enough RAM.
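The honesty category above presumably scores how often the model refuses impossible requests instead of inventing a command. A minimal sketch of such a check; the refusal markers and example responses are hypothetical, not the project's actual grader:

```python
# Hypothetical refusal markers; a real eval would use a curated list.
REFUSAL_MARKERS = ("i can't", "i cannot", "not possible", "no such command")

def is_refusal(response: str) -> bool:
    """True if the model declined rather than hallucinating a command."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def honesty_score(cases):
    """Fraction of impossible requests the model correctly refused."""
    return round(sum(is_refusal(r) for r in cases) / len(cases), 2)

responses = [
    "I can't do that: gh has no subcommand for deleting another user's repo.",
    "gh repo nuke --user someone-else",  # confident hallucination
]
print(honesty_score(responses))  # 0.5
```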
Training Details
- 1,642 training examples, fine-tuned on a single consumer GPU.
- Training time: 2.3 hours for the 2B model vs. 4.7 hours for the 9B.
- Total compute cost: roughly $15.
- Weights released as quantized GGUF (1.9GB for the 2B model).
Lessons Learned
1. Bigger isn't always better for fine-tuning. The 9B base was already competent enough that training on 1,642 examples added more noise than signal. The 2B had more room to grow.
2. 2B is the sweet spot. Weak enough to improve dramatically from targeted data, small enough for fast iteration cycles (2.3 hours vs. 4.7), tiny enough to deploy anywhere.
3. Real data beats synthetic data. Commands that actually ran on real systems—with real outputs, real errors, real edge cases—produce better training signal than LLM-generated fiction.
4. Multi-teacher helps. Gemini for volume, Claude for quality edge cases, humans for honesty calibration. Each source catches blind spots the others miss.
5. Tool distribution matters. 302 examples of one tool vs. 5 of another means the model ignores the rare tool. Balancing the training set across categories was as important as the data quality.
6. $15 total compute. Fine-tuning is accessible to anyone with a single GPU. You don't need a cluster. You don't need a cloud budget. You need good data and a clear objective.
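The category balancing described in the tool-distribution lesson above can be sketched as a simple downsampling pass. The function name, record format, and cap are illustrative assumptions, not the project's actual pipeline code:

```python
import random
from collections import defaultdict

def balance_by_category(examples, cap, seed=0):
    """Downsample over-represented categories to at most `cap` examples each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["category"]].append(ex)
    balanced = []
    for bucket in buckets.values():
        rng.shuffle(bucket)  # random sample, reproducible via seed
        balanced.extend(bucket[:cap])
    return balanced

# Hypothetical imbalanced set: 302 docker examples vs. 5 gcloud examples.
examples = [{"category": "docker", "id": i} for i in range(302)]
examples += [{"category": "gcloud", "id": i} for i in range(5)]
balanced = balance_by_category(examples, cap=50)
print(sum(e["category"] == "docker" for e in balanced))  # 50
print(sum(e["category"] == "gcloud" for e in balanced))  # 5
```

Upsampling or re-weighting rare categories is the complementary option when there are too few examples to downsample toward.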
Why This Matters — The Pipeline, Not Just the Model
The models are useful, but they're not the point. The point is demonstrating a repeatable pipeline that takes a domain, builds real training data, and produces a small model that works.
- Same approach, any domain. CLI tools today. Fleet management, bookkeeping, voice commands tomorrow. The pipeline is domain-agnostic.
- Small models + targeted training = production-ready AI without cloud dependency. No API keys, no rate limits, no vendor lock-in. A 1.9GB model running on local hardware you control.
- Accessible by design. $15 in compute, consumer GPUs, open-source tooling. If you have a use case and a dataset, you can build this.
- This is how ScrappyLabs builds. Real hardware, real data, real results. We don't simulate our way to production.
Get the Models
Everything is open source. Model weights, training data, CLI schemas, eval scripts. Take it, use it, build on it.
View on GitHub
Includes: training data · CLI schemas (gh, hf, gcloud, docker) · eval suite · GGUF weights
License: Apache 2.0