On the morning of March 4th, a form submission arrived at our intake endpoint at 04:18:23.117 UTC. By 04:18:23.164 UTC — forty-seven milliseconds later — we'd assigned it a spam score of 3, flagged it as a hot enterprise lead, drafted a reply, and routed it to the head of sales' inbox. The submission was real. The reply went out. The deal closed six weeks later.
This post is about what happens in those forty-seven milliseconds. Specifically: why we run seven different classifiers in parallel instead of one large model, what each one does, and why we think the ensemble approach is the only honest way to score a form submission you've never seen before.
The single-model temptation
The lazy way to build a spam scorer is to fine-tune a large language model on a labeled corpus and call it done. We tried this. It works, sort of. The numbers look good in the eval. But it produces a kind of confidence that's hard to act on — a single 0–100 number with no decomposition. When it's wrong, you can't tell why it's wrong. And it will be wrong, expensively, at scale.
A single number is an opinion. Seven numbers, produced in parallel and then combined, are closer to a verdict.
We wanted decisions we could audit, override, and explain, not an unexplained score from a black box. So we broke the problem into seven pieces and built a classifier for each. The pieces are independent and run in parallel. The orchestration layer combines them, but each one can be inspected and replaced on its own.
The seven classifiers
- Spam confidence. A logistic regression on lexical, structural, and behavioural features. The unglamorous workhorse. Catches 70% of the obvious cases on its own.
- Lead tier. Hot, warm, cold. Trained on outcome data from our design partners — labeled by whether the lead converted, not by a human guess.
- Sentiment. Positive, negative, neutral. Useful less for 'is this spam' and more for routing — a furious customer needs different handling than a curious one.
- Urgency. Time-sensitivity classifier. Looks for phrases like 'by Friday' and 'ASAP' but also softer signals like deadlines mentioned in the past tense.
- Auto-tags. Topical labels — 'pricing', 'enterprise', 'demo', 'support'. A multi-label classifier with around 40 categories.
- Category. Routing classifier. Where should this go? Sales, support, partnerships, careers, recruiter spam.
- Smart reply. A small drafting model that produces a single-paragraph response in the recipient's tone. The user can send it, edit it, or ignore it.
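As a rough illustration of the seven outputs in the list above, here is a minimal sketch of a container type. The class name `AiSignals`, the validation rules, and the 0-100 range for the spam score are assumptions made for the example; only the field names and the tier and sentiment labels come from this post.

```python
from dataclasses import dataclass
from typing import List

# Label sets taken from the list above. Urgency levels are left open,
# since the post only ever shows "high".
TIERS = {"hot", "warm", "cold"}
SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class AiSignals:
    """Hypothetical container for the seven per-classifier outputs."""

    spam_score: int   # assumed 0-100 scale, lower meaning less spam-like
    tier: str         # one of TIERS
    sentiment: str    # one of SENTIMENTS
    urgency: str      # e.g. "high"
    tags: List[str]   # multi-label, so zero or more topical tags
    category: str     # routing target, e.g. "sales"
    reply: str        # drafted response text

    def __post_init__(self) -> None:
        # Reject values outside the label sets before they reach routing.
        if not 0 <= self.spam_score <= 100:
            raise ValueError(f"spam_score out of range: {self.spam_score}")
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment}")
```

Keeping each output as its own named field, rather than folding them into one score, is what makes it possible to inspect or override a single signal later.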
Each classifier emits a structured output. The orchestration layer combines them into a single verdict object. Here's roughly what comes back from the API for the submission I mentioned at the top:
{
  "id": "sub_3kQp9MZ",
  "received_at": "2026-03-04T04:18:23.117Z",
  "ai": {
    "spam_score": 3,
    "tier": "hot",
    "sentiment": "neutral",
    "urgency": "high",
    "tags": ["pricing", "enterprise", "decision-maker"],
    "category": "sales",
    "reply": "Hi Alex — thanks for reaching out…"
  },
  "verdict": "deliver",
  "latency_ms": 47
}
Why parallel beats sequential
The obvious objection to seven models is latency: seven inference calls should be slower than one. In practice they aren't, because the classifiers run in parallel on a shared embedding. The embedding is computed once, and every classifier reuses it. Total wall-clock latency is therefore bounded by the embedding step plus the slowest classifier, not by the sum of all seven.
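The fan-out described above can be sketched with `asyncio`. This is an illustration under stated assumptions, not our production code: `embed`, `classify`, `score`, and the delays in `CLASSIFIER_DELAYS` are hypothetical stand-ins, with `asyncio.sleep` playing the part of model inference.

```python
import asyncio
from typing import Dict, List, Tuple

# Hypothetical per-classifier latencies in seconds; real numbers will differ.
CLASSIFIER_DELAYS: Dict[str, float] = {
    "spam": 0.02, "tier": 0.02, "sentiment": 0.01, "urgency": 0.01,
    "tags": 0.02, "category": 0.01, "reply": 0.03,
}


async def embed(text: str) -> List[float]:
    """Stand-in for the shared embedding model."""
    await asyncio.sleep(0.01)
    return [float(len(text))]


async def classify(name: str, embedding: List[float], delay: float) -> Tuple[str, str]:
    """Stand-in for one classifier reading the shared embedding."""
    await asyncio.sleep(delay)
    return name, f"{name}:ok"


async def score(text: str) -> Dict[str, str]:
    # The embedding is computed once, then all seven classifiers run concurrently.
    embedding = await embed(text)
    results = await asyncio.gather(
        *(classify(name, embedding, delay) for name, delay in CLASSIFIER_DELAYS.items())
    )
    return dict(results)


if __name__ == "__main__":
    print(asyncio.run(score("Hi, we'd like enterprise pricing by Friday.")))
```

With these made-up delays, the wall-clock time is roughly the embedding step plus the slowest stand-in, rather than the sum of all eight sleeps.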
The non-obvious benefit: independence. When one classifier drifts, we notice — because its outputs diverge from the others. We can roll it back, retrain it, or replace it, without touching the rest of the system. A monolithic model doesn't give you this. You retrain everything, or you retrain nothing.
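One simple way to watch for that kind of divergence is to track how often a classifier agrees with the rest over a rolling window. The sketch below is an assumption about how such a check could look, not a description of our monitoring; the class `AgreementMonitor`, the window size, and the thresholds are all hypothetical.

```python
from collections import deque
from typing import Deque


class AgreementMonitor:
    """Flags a classifier whose agreement with its peers drops below a baseline."""

    def __init__(self, window: int = 500, baseline: float = 0.9, tolerance: float = 0.1) -> None:
        self.samples: Deque[bool] = deque(maxlen=window)
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, classifier_flagged: bool, peers_flagged: bool) -> None:
        # Store True when this classifier's decision matched the peers' decision.
        self.samples.append(classifier_flagged == peers_flagged)

    def agreement(self) -> float:
        # Treat an empty window as full agreement.
        return sum(self.samples) / len(self.samples) if self.samples else 1.0

    def drifting(self) -> bool:
        # Only judge once the window is full, to avoid noisy early alarms.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.agreement() < self.baseline - self.tolerance
```

A monitor like this only says that a classifier has started to disagree with the others more than it used to; deciding whether that is drift or a genuine shift in traffic still takes a human look.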
What we got wrong
The first version of the ensemble used a hard majority vote: each classifier got one vote, and the majority won. This was wrong in a specific way: the urgency classifier almost never disagreed with the rest, so its vote added no information. We switched to a learned weighting (a small linear model on top of the seven outputs), and the false-negative rate dropped by a third.
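To make the difference concrete, here is a toy comparison of a hard majority vote with a learned linear weighting. It is a sketch under assumptions: the functions `majority_vote`, `weighted_score`, and `fit`, the logistic form, the plain gradient-descent training loop, and the toy data in the usage note are illustrative choices, not the model we actually ship.

```python
import math
from typing import List, Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Hard vote: every classifier contributes one 0/1 vote; ties resolve to 0."""
    return int(sum(votes) * 2 > len(votes))


def weighted_score(signals: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Learned weighting: a logistic function over a weighted sum of the signals."""
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))


def fit(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.5, epochs: int = 500
) -> Tuple[List[float], float]:
    """Fit the weights by batch gradient descent on the logistic loss."""
    n = len(X[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, target in zip(X, y):
            err = weighted_score(x, weights, bias) - target
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias
```

On a toy dataset such as `X = [[1, 1], [0, 1], [1, 1], [0, 1]]` with labels `y = [1, 0, 1, 0]`, the second signal never changes, and the fitted weights lean on the first signal instead of giving both an equal say, which a one-vote-each scheme cannot do.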
If you want to see this running on your own forms, the API is open and documented. There's a generous free tier and you don't need a credit card. The fastest way to find out whether the seven-classifier thing actually helps you is to try it on your traffic for a week.