Picture a 12-person logistics shop that coordinates regional freight across the St. Lawrence corridor, doing the operational work of a company twice its size. Its inbox is a living thing: carrier confirmations, delay escalations, rate queries, customs flags, and the occasional shipper who's had enough and needs someone, anyone, to respond right now, at 11:30 on a Tuesday night.
Before the build, "right now" meant morning. The team covered business hours with reasonable discipline; after-hours was a single on-call rotation that, in practice, meant emails sat until 7 a.m. Average after-hours first response: six hours and twelve minutes. Not catastrophic by industry standards. Still a competitive liability when two of their five largest accounts were actively shopping alternatives.
What made this hard
A naive "auto-reply" was never an option. Logistics operations carry real liability. An incorrect delivery window, a miscommunicated customs status, or a botched carrier substitution can cascade into thousands in demurrage or a broken client relationship. The system needed to be genuinely correct or explicitly uncertain, not confidently wrong.
The second constraint was data fragmentation. The team operated across three systems: a legacy TMS (transportation management system) with a read-only API, a carrier portal they scraped because no API existed, and a shared ops inbox in Gmail that was, in practice, the source of truth for anything that had fallen through the cracks of the other two. Any agent touching client queries needed to reason across all three, accurately.
Third: the team had to trust it. A system that the ops staff perceives as a liability they're responsible for cleaning up is worse than no system at all. Every draft it produced, every escalation it surfaced, had to feel right. Not plausible, right.
Four layers, one loop
The ingest layer is dumb by design. Three sources (Gmail via webhook, the TMS via polling every 90 seconds, and the carrier portal via a lightweight Playwright scraper) all funnel into a unified event queue. No processing happens here. It's a pipe.
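As a rough illustration of "it's a pipe", here is a minimal sketch of a unified event queue. The `InboundEvent` type, the `ingest` function, and the sample payloads are hypothetical names invented for this example, not the team's actual code; the only point is that every source is wrapped in the same envelope and enqueued without interpretation.

```python
import queue
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class InboundEvent:
    """One raw item from any source, wrapped but not interpreted."""
    source: str                 # "gmail", "tms", or "carrier_portal"
    payload: dict[str, Any]     # untouched source data
    received_at: float = field(default_factory=time.time)


# Single queue shared by all three sources (hypothetical in-process stand-in).
event_queue: "queue.Queue[InboundEvent]" = queue.Queue()


def ingest(source: str, payload: dict[str, Any]) -> None:
    """Enqueue a payload as-is: no parsing, no classification."""
    event_queue.put(InboundEvent(source=source, payload=payload))


# Hypothetical payloads, one per source.
ingest("gmail", {"subject": "Where is my load?"})
ingest("tms", {"shipment_id": "S-1001", "status": "in_transit"})
ingest("carrier_portal", {"carrier": "example-carrier", "eta": "08:00"})
```

Downstream layers would read from `event_queue` and do all the interpretation; a production setup would more likely use a durable queue than an in-process one.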
The classifier is where the first real decision lives. Every inbound message gets an intent label and a confidence score. Labels cover fourteen categories: ETD query, shipment status, rate request, carrier substitution, customs flag, escalation, and eight others. Anything below 0.82 confidence is immediately flagged for human routing. The system doesn't guess at ambiguous messages.
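The threshold rule can be sketched in a few lines. The 0.82 cutoff comes from the description above; the `Classification` type, the `route` function, and the `"human_routing"` label are assumptions made for illustration.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.82  # cutoff stated in the article


@dataclass(frozen=True)
class Classification:
    intent: str        # one of the fourteen intent labels
    confidence: float  # classifier score in [0, 1]


def route(c: Classification) -> str:
    """Return the intent to act on, or hand off to a human when unsure."""
    if c.confidence < CONFIDENCE_THRESHOLD:
        # Strictly below the cutoff: do not act on the label at all.
        return "human_routing"
    return c.intent
```

Note that the low-confidence branch discards the predicted label rather than passing it along as a hint, which mirrors the idea that the system does not guess.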
The grounded copilot, the layer that actually drafts responses, can only assert things it can cite. It pulls context from the TMS retriever, matches it against the inbound query, and generates a draft with inline citations to the source records. If a shipment status can't be retrieved, the draft says so explicitly. The copilot has no access to the send button. It produces text. The confidence budget, the fourth layer, decides what happens next.
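A minimal sketch of citation-only drafting, assuming a retriever that returns either a TMS record or nothing. `TmsRecord`, `Draft`, and `draft_status_reply` are hypothetical names, and the reply wording is invented; the real copilot presumably generates text with a model rather than a template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TmsRecord:
    record_id: str    # hypothetical identifier of the source record
    shipment_id: str
    status: str


@dataclass(frozen=True)
class Draft:
    text: str
    citations: tuple[str, ...]  # record ids backing every claim in text
    grounded: bool


def draft_status_reply(shipment_id: str, record: Optional[TmsRecord]) -> Draft:
    """Draft a reply that asserts only what a retrieved record supports."""
    if record is None:
        # Nothing retrieved: say so explicitly instead of guessing.
        text = (f"We could not retrieve the current status of shipment "
                f"{shipment_id}. A team member will follow up.")
        return Draft(text=text, citations=(), grounded=False)
    text = (f"Shipment {shipment_id} is currently {record.status} "
            f"[{record.record_id}].")
    return Draft(text=text, citations=(record.record_id,), grounded=True)
```

The function returns text and nothing else. Deciding whether that text is sent, reviewed, or escalated belongs to a separate step.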
What gets automated, and why
| Intent | Volume | Handling | Rationale |
|---|---|---|---|
| Shipment status query | 34% | auto-send | Fully grounded in TMS. No ambiguity, no liability. |
| ETD / ETA confirmation | 18% | auto-send | Single source of truth. Carrier scraper validates. |
| Rate query (existing account) | 11% | auto-send | Rates table is current. Confidence budget rarely depletes. |
| Proof of delivery request | 10% | auto-send | Document retrieval, no judgment required. |
| Carrier substitution request | 9% | draft + review | Copilot drafts options. Ops confirms before send. |
| Customs / compliance flag | 6% | draft + review | Liability surface too high for autonomous send. |
| Claim / dispute initiation | 4% | escalate | Requires named ops owner. No automation. |
| Ambiguous / multi-intent | 8% | escalate | Below confidence threshold. Human classifies and routes. |
The decision of what to automate is not a confidence question. It's a liability question. Shipment status and ETD confirmation are fully retrievable from a single system of record, so they auto-send. Carrier substitution is also retrievable, but the downstream consequences of a wrong call (demurrage, late penalties, missed customs windows) are large enough that the team wanted a human in the loop regardless of the copilot's confidence score. Claim and dispute initiation never automates. Ever.
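One way to picture a liability-driven policy is a static lookup keyed by intent, with the classifier's confidence playing no part in the lookup. The handling values and volume shares below are copied from the table; the intent keys, the `handling_for` helper, and the choice to escalate unknown intents are assumptions for this sketch.

```python
from enum import Enum


class Handling(Enum):
    AUTO_SEND = "auto-send"
    DRAFT_REVIEW = "draft + review"
    ESCALATE = "escalate"


# intent key -> (handling, share of inbound volume in percent), per the table
POLICY = {
    "shipment_status": (Handling.AUTO_SEND, 34),
    "etd_eta_confirmation": (Handling.AUTO_SEND, 18),
    "rate_query_existing_account": (Handling.AUTO_SEND, 11),
    "proof_of_delivery": (Handling.AUTO_SEND, 10),
    "carrier_substitution": (Handling.DRAFT_REVIEW, 9),
    "customs_compliance_flag": (Handling.DRAFT_REVIEW, 6),
    "claim_dispute": (Handling.ESCALATE, 4),
    "ambiguous_multi_intent": (Handling.ESCALATE, 8),
}


def handling_for(intent: str) -> Handling:
    """Look up handling by intent alone; unknown intents escalate (assumption)."""
    handling, _share = POLICY.get(intent, (Handling.ESCALATE, 0))
    return handling


# Share of inbound volume eligible for autonomous sending.
auto_send_share = sum(
    share for handling, share in POLICY.values()
    if handling is Handling.AUTO_SEND
)
```

Summing the auto-send rows gives 34 + 18 + 11 + 10 = 73, which is the share of inbound volume the table marks as eligible for autonomous handling.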
Twelve weeks post-launch
| Metric | Before | After | Signal |
|---|---|---|---|
| p50 after-hours first reply | 6h 12m | 4 min | −99% |
| Inbound routed autonomously | 0% | 73% | net new |
| Incorrect auto-sends | n/a | 0 in 90 days | target |
| Ops capacity per FTE | 1.0× | 2.1× | +110% |
| Headcount | 12 | 12 | unchanged |
The zero incorrect auto-sends figure is the one the team cares most about. The confidence budget and citation-only drafting policy meant the system learned to fail visibly rather than fail quietly. Across the modeled 90-day window there were eleven cases where the copilot had enough confidence to draft but not enough grounding to cite. All eleven were correctly caught by the fallback gate and escalated. No shipper received a fabricated status.
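The "confident but uncited" case can be sketched as a gate with two independent checks. How the confidence budget is actually computed is not described here, so this example assumes a single score compared against the 0.82 classifier cutoff; `fallback_gate` and its return values are hypothetical.

```python
def fallback_gate(
    confidence: float,
    citations: list[str],
    threshold: float = 0.82,  # assumption: reuse the classifier cutoff
) -> tuple[str, str]:
    """Return (decision, reason). Both checks must pass to proceed."""
    if confidence < threshold:
        return ("escalate", "confidence below threshold")
    if not citations:
        # Confident enough to draft, but nothing to cite: fail visibly.
        return ("escalate", "no grounding to cite")
    return ("proceed", "grounded and confident")
```

The second branch is the one that matters for the eleven cases: high confidence alone never lets a draft through.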
The 2.1× ops capacity figure is the one leadership cares most about. The team isn't larger. What changed is the mix of work: the hours previously spent on status queries, POD requests, and routine ETD confirmations now go toward the complex cases that actually require a human (the disputes, the carrier negotiations, the exception handling). The ceiling on what 12 people can do moved without adding headcount.
What didn't work
The carrier scraper is fragile. Two of the nine carriers in their network update their portal markup irregularly, which breaks the scraper silently. You don't get an error; you get stale data that the confidence budget treats as fresh. The fix is a staleness timestamp on every scraped record and a hard cap on how old carrier data can be before the copilot is forced to say it doesn't know. That cap is now 4 hours. Beyond that, the draft is flagged regardless of confidence score.
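A staleness check of this kind is small enough to sketch directly. The 4-hour cap is from the text; `is_stale`, the `scraped_at` field name, and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_CARRIER_DATA_AGE = timedelta(hours=4)  # cap stated in the article


def is_stale(scraped_at: datetime, now: datetime) -> bool:
    """True when a scraped record is older than the cap.

    Confidence is deliberately not an input: a stale record flags the
    draft no matter how confident the copilot is.
    """
    return now - scraped_at > MAX_CARRIER_DATA_AGE


# Hypothetical timestamps (timezone-aware, UTC).
now = datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=3, minutes=59)
old = now - timedelta(hours=4, minutes=1)
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test.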
The initial classifier training set was too clean. Real operations email is messy: forwarded chains with three layers of quoted text, messages that switch between French and English mid-sentence, abbreviations that aren't in any training corpus. The classifier's first-week accuracy was 0.74 on real traffic versus 0.91 on held-out test data. Two weeks of production shadow-mode with human labels closed that gap to 0.88.
Finally: the escalation UX wasn't good enough at launch. When the system escalated something, it sent a Slack ping with a link and the intent label. That's not enough context for someone woken up at 2 a.m. Version two bundles the escalation with the full draft, the retrieved context, and the specific reason the confidence budget was triggered. Response time on escalated threads dropped by 40% after that change.
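To make the difference between the two versions concrete, here is a sketch of the escalation payload as a data structure with two renderers. The `Escalation` fields follow the description above; the class itself, the renderer names, and the message layout are invented for this example, and no real Slack API call is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    thread_link: str
    intent: str
    draft: str                          # full copilot draft
    retrieved_context: tuple[str, ...]  # records the draft was built from
    trigger_reason: str                 # why the confidence budget tripped


def render_v1(e: Escalation) -> str:
    """Launch version: link and intent label only."""
    return f"[{e.intent}] {e.thread_link}"


def render_v2(e: Escalation) -> str:
    """Version two: everything needed to act without opening three tools."""
    context = "\n".join(f"  - {item}" for item in e.retrieved_context)
    return (
        f"[{e.intent}] {e.thread_link}\n"
        f"Reason: {e.trigger_reason}\n"
        f"Retrieved context:\n{context}\n"
        f"Draft:\n{e.draft}"
    )
```

The v2 message carries the reason first, on the theory that someone reading at 2 a.m. wants to know why they were paged before reading the draft; that ordering is a choice made for this sketch.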
If you'd like us to look at where an inbound automation stack would and wouldn't pay off in your operation, the contact form is the fastest way. We do free 30-minute reviews for production systems.