Building a Construction SLM

6 MIN READ · May 18, 2026

Field notes on building a small language model for construction text: the classification problem that broke a frontier API, the eight-billion-parameter base that fixed it, and the hybrid architecture that ships in production today.

Frontier models can write a sonnet about an RFI. They can't classify one.

That is the sentence we wrote on the whiteboard after our third week of running the Construction Control Tower on a frontier API. The classification accuracy on incoming construction email — the single most upstream task in the pipeline — was 71%. Good enough for a demo. Catastrophic for a superintendent who's about to trust the routing.

So we built a small model. An 8B-parameter base, fine-tuned on construction text. It now does the classification at 94.3% accuracy, in 180 ms, at roughly 1/40th the per-call cost of the frontier model we benchmarked against.

This is the field note on what it took.

What we tried first

The instinct, like everyone else's, was to ride a frontier model as far as possible. GPT-4o, then Claude 3.5 Sonnet, with a careful system prompt and a few-shot library of construction message examples.

The system prompt was 2,400 tokens. The few-shot library was another 8,000. On the cheapest model we were spending $0.012 per inbound email on classification alone, before any of the downstream work: drafting a response, extracting the spec reference, routing to the right project. At a mid-tier GC's volume of roughly 4,200 inbound construction emails per day, that came to about $50/day on the cheapest model and over $200/day on the best.
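A rough sketch of that arithmetic, using only the figures quoted above. The function and constant names are illustrative, and the token count ignores the email body and the model's output, so treat it as a floor rather than a bill.

```python
# Figures are taken from the text; names are illustrative only.
SYSTEM_PROMPT_TOKENS = 2_400
FEW_SHOT_TOKENS = 8_000
COST_PER_EMAIL_USD = 0.012   # classification only
EMAILS_PER_DAY = 4_200


def fixed_prompt_tokens() -> int:
    """Tokens sent on every call before the email itself is appended."""
    return SYSTEM_PROMPT_TOKENS + FEW_SHOT_TOKENS


def daily_prompt_tokens(emails_per_day: int = EMAILS_PER_DAY) -> int:
    """Fixed prompt overhead summed over one day of inbound email."""
    return fixed_prompt_tokens() * emails_per_day


def daily_classification_cost(
    cost_per_email: float = COST_PER_EMAIL_USD,
    emails_per_day: int = EMAILS_PER_DAY,
) -> float:
    """Daily spend on classification alone, in USD."""
    return cost_per_email * emails_per_day


print(f"prompt overhead per call: {fixed_prompt_tokens():,} tokens")
print(f"prompt overhead per day:  {daily_prompt_tokens():,} tokens")
print(f"classification per day:   ${daily_classification_cost():.2f}")
```

The point of the exercise: more than ten thousand tokens of instructions and examples ride along with every email, whether the email is three lines or three pages.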

But the cost was the second problem. The first was that the model didn't know what an RFI was. It knew what RFI stood for. It could not reliably tell the difference between an RFI, a submittal, an ASI, a CCR, an RFP, and the dozen other document types, mostly three-letter acronyms, that look identical to an LLM that has never read CSI MasterFormat.

The training data

We had two advantages most people don't.

One — we had clean, structured email from a live deployment. Six months of an actual GC's inbox, with their PMs' routing decisions captured as labels. That gave us a few hundred thousand classified construction emails.

Two — we had domain ground truth. CSI MasterFormat divisions and sections. AIA contract document references. ASTM standards. The CFR sections that get cited in safety incidents. None of this is in the frontier model's training distribution at any useful density. All of it is encodable as a vocabulary the small model can be taught.

We assembled the training set as three layers:

Classification corpus — 380,000 construction emails, multi-labeled across RFI / submittal / change order / ASI / pay app / safety / schedule / other.
Reference vocabulary — every CSI MasterFormat division and section, every AIA document code, parsed and embedded as a glossary task in the fine-tuning data.
Counterfactual pairs — adversarial examples: "subject says RFI, body is a change order"; "subject is blank, body is a pay application question." This is the layer that broke the frontier model and that the SLM learned to handle.
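One way to picture the counterfactual layer is as a recombination of already-labeled emails. The sketch below is an assumption about how such pairs could be built, not a description of the actual pipeline: the `Email` type, the label names, and the rule that the label follows the body are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Label names mirror the categories listed above; the spelling is illustrative.
LABELS = frozenset(
    {"rfi", "submittal", "change_order", "asi", "pay_app", "safety", "schedule", "other"}
)


@dataclass(frozen=True)
class Email:
    subject: str
    body: str
    labels: frozenset[str]

    def __post_init__(self) -> None:
        unknown = self.labels - LABELS
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")


def make_counterfactual(subject_source: Email, body_source: Email) -> Email:
    """Pair one email's subject with another email's body.

    Assumption: the correct label is whatever the body says, so the
    misleading subject line becomes the adversarial signal.
    """
    if subject_source.labels & body_source.labels:
        raise ValueError("sources must not share a label, or the pair is not adversarial")
    return Email(
        subject=subject_source.subject,
        body=body_source.body,
        labels=body_source.labels,
    )


rfi = Email("RFI 112: slab edge detail", "Please clarify the slab edge at grid C.", frozenset({"rfi"}))
change = Email("CO 7 pricing", "Attached is pricing for the added curb work.", frozenset({"change_order"}))
tricky = make_counterfactual(rfi, change)
print(tricky.subject, "->", sorted(tricky.labels))
```

A blank-subject variant is the same idea with `subject=""`.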

The model

We fine-tuned a Llama 3.1 8B base on a single 8x A100 node over 11 hours. QLoRA with rank 64. We tried larger bases (70B) and smaller (3B). The 8B was the inflection point: meaningfully better than 3B on the counterfactual pairs, statistically indistinguishable from 70B on the in-distribution classification.
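For a sense of scale, here is a back-of-envelope count of the trainable parameters a rank-64 adapter adds to an 8B base. The layer shapes are the commonly published Llama 3.1 8B dimensions and should be checked against the model config. Which projections were adapted is not stated above, so both target sets shown here are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    name: str
    d_in: int
    d_out: int


# Assumed shapes: hidden size 4096, 8 KV heads x 128 dims = 1024,
# MLP width 14336, 32 transformer layers.
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32

ATTENTION = [
    Linear("q_proj", HIDDEN, HIDDEN),
    Linear("k_proj", HIDDEN, KV_DIM),
    Linear("v_proj", HIDDEN, KV_DIM),
    Linear("o_proj", HIDDEN, HIDDEN),
]
MLP = [
    Linear("gate_proj", HIDDEN, MLP_DIM),
    Linear("up_proj", HIDDEN, MLP_DIM),
    Linear("down_proj", MLP_DIM, HIDDEN),
]


def lora_trainable_params(targets: list[Linear], rank: int, n_layers: int = N_LAYERS) -> int:
    """Each adapted matrix gains A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (layer.d_in + layer.d_out) for layer in targets)
    return per_layer * n_layers


attention_only = lora_trainable_params(ATTENTION, rank=64)
all_linear = lora_trainable_params(ATTENTION + MLP, rank=64)
print(f"attention only: {attention_only:,} trainable parameters")
print(f"attention + MLP: {all_linear:,} trainable parameters")
```

Under these assumptions the adapter is roughly 55 million parameters for attention-only targets and roughly 168 million when the MLP projections are included, a small fraction of the frozen base either way.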

That alone is a useful finding. The frontier-vs-fine-tuned argument in construction text isn't about model size. It's about training distribution. A small model trained on the right text beats a large model trained on the wrong text.

Latency is the number the PMs care about. With p95 under 400 ms, classification feels instant. With p95 over two seconds, it doesn't.
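For readers who have not computed a p95 by hand, a minimal sketch using the nearest-rank method. The 400 ms budget comes from the text; the function names and sample data are made up for illustration, and a monitoring stack may use an interpolating percentile definition instead.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def feels_instant(latencies_ms, budget_ms=400.0):
    """True when the p95 latency is under the budget quoted in the text."""
    return percentile_nearest_rank(latencies_ms, 95) < budget_ms


# Hypothetical samples: one slow call in twenty does not move p95, two do.
mostly_fast = [180] * 19 + [2500]
too_many_slow = [180] * 18 + [2500] * 2
print(feels_instant(mostly_fast), feels_instant(too_many_slow))
```

This is why p95 is a better promise to a PM than the average: it describes the experience on a bad call, not a typical one.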

Where the SLM is wrong

It is wrong about anything outside the construction-text distribution. Ask it to summarize a contract clause that cites a Massachusetts state law it hasn't seen — it will produce confident garbage. Ask it to compose a polite three-line response to a delayed RFI — it will be stiff and weirdly formal.

So we kept a frontier model in the loop. The architecture now looks like this:

Classification, extraction, routing, deduplication — Construction SLM. Fast, cheap, accurate where it matters.
Drafting, summarization, tone-adjusted response — Frontier model (Claude or GPT), called only on the messages that need it (roughly 22% of inbound).
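The split above amounts to a static lookup from task to backend. A minimal sketch, with hypothetical enum and function names; the real system presumably wraps model clients rather than returning labels.

```python
from enum import Enum


class Task(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    ROUTING = "routing"
    DEDUPLICATION = "deduplication"
    DRAFTING = "drafting"
    SUMMARIZATION = "summarization"
    TONE_ADJUSTED_RESPONSE = "tone_adjusted_response"


class Backend(Enum):
    SLM = "construction-slm"
    FRONTIER = "frontier"


# Structured tasks stay on the small model; everything else goes to the frontier model.
SLM_TASKS = frozenset(
    {Task.CLASSIFICATION, Task.EXTRACTION, Task.ROUTING, Task.DEDUPLICATION}
)


def backend_for(task: Task) -> Backend:
    return Backend.SLM if task in SLM_TASKS else Backend.FRONTIER


def plan(tasks):
    """Map each requested task to the backend that should handle it."""
    return {task.value: backend_for(task).value for task in tasks}


print(plan([Task.CLASSIFICATION, Task.ROUTING, Task.DRAFTING]))
```

Keeping the mapping as data rather than model behavior means it can be reviewed, versioned, and changed without retraining anything.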

This brought our blended per-email cost from $0.012 to roughly $0.0009. At the GC's volume that's the difference between an AI line item the CFO questions and one nobody notices.
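Put in daily terms, using the two per-email figures above and the 4,200 emails/day volume quoted earlier. The names are illustrative.

```python
COST_BEFORE_USD = 0.012   # per email, frontier-only (from the text)
COST_AFTER_USD = 0.0009   # per email, blended hybrid (from the text)
EMAILS_PER_DAY = 4_200    # the GC's volume (from the text)


def daily_spend(cost_per_email: float, emails_per_day: int = EMAILS_PER_DAY) -> float:
    """Daily spend in USD at a given per-email cost."""
    return cost_per_email * emails_per_day


def reduction_factor(before: float, after: float) -> float:
    """How many times cheaper the new per-email cost is."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


print(f"before: ${daily_spend(COST_BEFORE_USD):.2f}/day")
print(f"after:  ${daily_spend(COST_AFTER_USD):.2f}/day")
print(f"reduction: {reduction_factor(COST_BEFORE_USD, COST_AFTER_USD):.1f}x")
```

That is roughly $50 a day falling to under $4 a day, a reduction of about 13x.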

What we'd do differently

Three things.

Build the eval set before the training set. We didn't. We trained, then evaluated, then realized our eval set had the same selection bias as our training set. We had to rebuild it from a held-out customer's inbox, which is now standard practice for every SLM project we run.

Pre-bake the reference vocabulary harder. CSI MasterFormat isn't optional. We treated it as a fine-tuning task; we should have enforced it with constrained decoding at inference time. The next iteration will use a grammar-constrained decoder for any output that contains a section reference.

Don't make the model decide when to escalate to the frontier. We tried. It was bad at it — biased toward escalating, since the training data didn't punish that. Now a deterministic policy decides which messages get the frontier pass, based on classification confidence and message length.
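A deterministic escalation policy of the kind described in the last point can be very small. The sketch below is hypothetical: the threshold values, and the assumption that low confidence or long messages trigger the frontier pass, are placeholders rather than the production rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    # Both thresholds are made-up placeholders, to be tuned on held-out data.
    min_confidence: float = 0.85
    max_chars: int = 4_000

    def needs_frontier(self, confidence: float, message: str) -> bool:
        """Escalate when the classifier is unsure or the message is long."""
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        return confidence < self.min_confidence or len(message) > self.max_chars


policy = EscalationPolicy()
print(policy.needs_frontier(0.97, "Please confirm the anchor bolt spacing."))
print(policy.needs_frontier(0.55, "See attached."))
```

Because the rule is plain code, the escalation rate can be predicted from a sample of traffic and adjusted by changing two numbers.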

The bigger argument

Most enterprise AI teams are running frontier models on tasks where a 7B model trained on the right text would beat them on every metric that matters: accuracy, latency, cost, predictability.

They don't because fine-tuning has a reputation for being hard and the frontier-model marketing is loud. Neither is a good reason. A construction SLM took us six weeks of engineering work and a single fine-tuning run. The cost of not doing it was a deployment that would have died at scale.

If you are building vertical AI — for construction, manufacturing, healthcare, GovTech, anywhere with a real domain vocabulary — start asking yourself which tasks are actually frontier-shaped and which are SLM-shaped. The answer is almost never "all frontier."

The Construction Control Tower is built on a hybrid architecture of small, fine-tuned models for the structured parts of the workflow and frontier models for the parts that need general reasoning.