The Problem
- LLMs are great at conversation but struggle with precise CLI tool-calling. They hallucinate flags, invent subcommands, and get argument order wrong.
- Bigger models help, but running a 70B model just to format a `gh pr create` command is wasteful.
- Small models (2–4B parameters) can run locally on almost any hardware, but they need targeted training to get tool-calling right.
- Synthetic training data—where an LLM invents command outputs—teaches models to hallucinate confidently rather than route correctly.
Our Approach — "Real Data Only"
- Prompts span eight categories: gh (86), gcloud (48), docker (40), git (35), hf (28), projects (27), honesty (21), package (20). Each prompt represents a real task a developer would actually ask.
- CLI schemas come straight from each tool's `--help` output. Not hand-written, not hallucinated: parsed from the tools themselves.
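Extracting a schema from `--help` output can be as simple as pattern-matching the flag listings. A minimal sketch; the function name and the trimmed sample text are illustrative assumptions, not the project's actual parser:

```python
import re

def parse_help_flags(help_text: str) -> list[str]:
    """Extract long-form flags (e.g. --title) from a tool's --help output."""
    # Matches occurrences like "--title" or "--base" anywhere in the text.
    return sorted(set(re.findall(r"--[a-z][a-z0-9-]+", help_text)))

# Trimmed, gh-style sample help text (illustrative, not verbatim output).
SAMPLE_HELP = """
FLAGS
  -B, --base branch   The branch into which you want your code merged
  -b, --body string   Body for the pull request
  -t, --title string  Title for the pull request
"""

print(parse_help_flags(SAMPLE_HELP))  # ['--base', '--body', '--title']
```

A real extractor would also capture argument types and subcommands, but even this flag list gives the model a ground-truth vocabulary it cannot hallucinate past.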
Results — The 2B Surprise
We expected the 9B fine-tune to dominate. It didn't. The 2B model matched or beat it in every single category.
| Category | Raw Qwen3.5-9B (baseline) | Fine-tuned 9B V1 | Fine-tuned 2B V3 |
|---|---|---|---|
| Overall | 0.20 | 0.39 | 0.55 |
| docker | 0.35 | 0.55 | 0.89 |
| gh | 0.10 | 0.50 | 0.61 |
| git | 0.30 | 0.30 | 0.67 |
| hf | 0.08 | 0.42 | 0.56 |
| package | 0.25 | 0.62 | 0.62 |
| honesty | 0.25 | 0.00 | 0.42 |
| gcloud | 0.17 | 0.08 | 0.25 |
| reasoning | 0.12 | — | 0.29 |
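Per-category scores like the ones in the table above can be produced by a small aggregation pass over pass/fail eval records. A minimal sketch; the record format and field names are assumptions, not the project's actual eval suite:

```python
from collections import defaultdict

def score_by_category(results):
    """Aggregate pass/fail eval records into per-category accuracy."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {cat: round(passes[cat] / totals[cat], 2) for cat in totals}

# Hypothetical eval records: one entry per prompt.
results = [
    {"category": "docker", "passed": True},
    {"category": "docker", "passed": True},
    {"category": "gh", "passed": False},
    {"category": "gh", "passed": True},
]
print(score_by_category(results))  # {'docker': 1.0, 'gh': 0.5}
```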
Key Takeaways
- The 2B model matches or beats the fine-tuned 9B in every category, not just overall.
- Docker routing hits 89% accuracy. That's production-ready for most workflows.
- Honesty recovered from catastrophic regression in V1 (0.00 → 0.42). The 9B model learned to never refuse; the 2B model learned when to say "I can't do that."
- At 1.9GB quantized, the 2B model fits on virtually any consumer GPU, and even runs on a Raspberry Pi with enough RAM.
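The honesty category above presumably scores how often the model refuses impossible requests instead of inventing a command. A minimal sketch of such a check; the refusal markers and example responses are hypothetical, not the project's actual grader:

```python
# Hypothetical refusal markers; a real eval would use a curated list.
REFUSAL_MARKERS = ("i can't", "i cannot", "not possible", "no such command")

def is_refusal(response: str) -> bool:
    """True if the model declined rather than hallucinating a command."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def honesty_score(cases):
    """Fraction of impossible requests the model correctly refused."""
    return round(sum(is_refusal(r) for r in cases) / len(cases), 2)

responses = [
    "I can't do that: gh has no subcommand for deleting another user's repo.",
    "gh repo nuke --user someone-else",  # confident hallucination
]
print(honesty_score(responses))  # 0.5
```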
Training Details
- 1,642 training examples, fine-tuned on a single consumer GPU.
- Training time: 2.3 hours for the 2B model vs. 4.7 hours for the 9B.
- Total compute cost: roughly $15.
- Weights released as quantized GGUF (1.9GB for the 2B model).
Lessons Learned
1. Bigger isn't always better for fine-tuning. The 9B base was already competent enough that training on 1,642 examples added more noise than signal. The 2B had more room to grow.
2. 2B is the sweet spot. Weak enough to improve dramatically from targeted data, small enough for fast iteration cycles (2.3 hours vs. 4.7), tiny enough to deploy anywhere.
3. Real data beats synthetic data. Commands that actually ran on real systems—with real outputs, real errors, real edge cases—produce better training signal than LLM-generated fiction.
4. Multi-teacher helps. Gemini for volume, Claude for quality edge cases, humans for honesty calibration. Each source catches blind spots the others miss.
5. Tool distribution matters. 302 examples of one tool vs. 5 of another means the model ignores the rare tool. Balancing the training set across categories was as important as the data quality.
6. $15 total compute. Fine-tuning is accessible to anyone with a single GPU. You don't need a cluster. You don't need a cloud budget. You need good data and a clear objective.
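The category balancing described in the tool-distribution lesson above can be sketched as a simple downsampling pass. The function name, record format, and cap are illustrative assumptions, not the project's actual pipeline code:

```python
import random
from collections import defaultdict

def balance_by_category(examples, cap, seed=0):
    """Downsample over-represented categories to at most `cap` examples each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["category"]].append(ex)
    balanced = []
    for bucket in buckets.values():
        rng.shuffle(bucket)  # random sample, reproducible via seed
        balanced.extend(bucket[:cap])
    return balanced

# Hypothetical imbalanced set: 302 docker examples vs. 5 gcloud examples.
examples = [{"category": "docker", "id": i} for i in range(302)]
examples += [{"category": "gcloud", "id": i} for i in range(5)]
balanced = balance_by_category(examples, cap=50)
print(sum(e["category"] == "docker" for e in balanced))  # 50
print(sum(e["category"] == "gcloud" for e in balanced))  # 5
```

Upsampling or re-weighting rare categories is the complementary option when there are too few examples to downsample toward.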
Why This Matters — The Pipeline, Not Just the Model
The models are useful, but they're not the point. The point is demonstrating a repeatable pipeline that takes a domain, builds real training data, and produces a small model that works.
- Same approach, any domain. CLI tools today. Fleet management, bookkeeping, voice commands tomorrow. The pipeline is domain-agnostic.
- Small models + targeted training = production-ready AI without cloud dependency. No API keys, no rate limits, no vendor lock-in. A 1.9GB model running on local hardware you control.
- Accessible by design. $15 in compute, consumer GPUs, open-source tooling. If you have a use case and a dataset, you can build this.
- This is how ScrappyLabs builds. Real hardware, real data, real results. We don't simulate our way to production.
Get the Models
Everything is open source. Model weights, training data, CLI schemas, eval scripts. Take it, use it, build on it.
View on GitHub
Includes: training data · CLI schemas (gh, hf, gcloud, docker) · eval suite · GGUF weights
License: Apache 2.0