How AI helpdesk projects actually fail

1. The "looks confident, is wrong" failure

An LLM-based agent answers a ticket. The answer is grammatically correct, well-formatted, and references the right ticket fields. It's also wrong in a way only someone who knows your client's setup would catch.

This isn't a model problem. It's a review-loop problem. The team didn't define which actions require human approval before the agent takes them. So the wrong answer went straight to the client. The client trusted your team's voice. Now they don't.

Mitigation: every action with non-trivial blast radius (anything that touches billing, security, or external communication) is draft-only. The agent prepares it; a human approves it before it goes out. The agent is a faster typist, not a faster decision-maker.
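
To make the approval rule concrete, here is a minimal sketch of a gate that sits between the agent and anything it can send or change. The category names, the `ProposedAction` type, and the `route` function are all hypothetical, made up for illustration; your PSA and ticket taxonomy will differ.

```python
from dataclasses import dataclass

# Assumed category labels. The three below mirror the examples in the text
# (billing, security, external communication); the names are invented.
HIGH_BLAST_RADIUS = {"billing", "security", "external_communication"}


@dataclass
class ProposedAction:
    ticket_id: str
    category: str
    body: str


def route(action: ProposedAction) -> str:
    """Return 'draft' when a human must approve, otherwise 'auto'."""
    if action.category in HIGH_BLAST_RADIUS:
        return "draft"
    return "auto"


reply = ProposedAction("T-1001", "external_communication", "Hi, your invoice...")
reset = ProposedAction("T-1002", "password_reset", "Reset completed.")
print(route(reply))  # draft
print(route(reset))  # auto
```

One design choice worth considering: flipping this to an allowlist (only named low-risk categories run automatically, everything else drafts) means a new, unclassified action type fails toward human review instead of away from it.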

2. The "automated the wrong 30%" failure

The team picks the easiest 30% of the queue to automate. Password resets, simple license adds, the obvious stuff. The agent does it well.

But the easy 30% wasn't the expensive 30%. Your L1 was already handling those in two minutes each. The 30% that was actually killing your margin (the long-tail multi-step requests where a tech has to flip between Syncro, M365, AD, and a vendor portal) is untouched. After 90 days, the agent is saving ~3 minutes per ticket on the wrong tickets. Margin doesn't move. Owner asks why.

Mitigation: before you build, pull 90 days of ticket data and sort by total tech-minutes per category, not by ticket count. Automate the categories at the top of that list. Sometimes that's still password resets. Often it isn't.
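
A rough sketch of that ranking, assuming you can export tickets as records with a category and the tech time logged against each. The field names `category` and `tech_minutes` and the sample rows are invented for illustration; they are not from any real export.

```python
from collections import defaultdict


def rank_categories(tickets):
    """Rank categories by total tech-minutes, highest first."""
    totals = defaultdict(lambda: {"minutes": 0, "count": 0})
    for t in tickets:
        bucket = totals[t["category"]]
        bucket["minutes"] += t["tech_minutes"]
        bucket["count"] += 1
    return sorted(totals.items(), key=lambda kv: kv[1]["minutes"], reverse=True)


# Made-up sample: three quick resets versus one long multi-system request.
sample = [
    {"category": "password_reset", "tech_minutes": 2},
    {"category": "password_reset", "tech_minutes": 2},
    {"category": "password_reset", "tech_minutes": 2},
    {"category": "multi_system_request", "tech_minutes": 45},
]

for category, stats in rank_categories(sample):
    print(category, stats["minutes"], stats["count"])
```

In this toy data, password resets win on ticket count (3 to 1) and lose badly on total minutes (6 to 45), which is exactly the gap the count-based view hides.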

3. The "scope creep killed it" failure

The agent works for password resets. The team adds DL changes. Works. Adds M365 onboarding. Works. Adds offboarding. Now it's a system with eight specialist agents, none of which were designed together, all of which share state in subtly wrong ways.

When something breaks, no one can tell which agent caused it. Logs are scattered across services. The team rolls back to manual for everything because the cost of debugging exceeds the cost of just doing the work.

Mitigation: design for the second specialist before you ship the first. Shared dispatcher, shared logging surface, shared escalation pathway. Otherwise you're not building a system, you're stapling chatbots together.
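
One way to picture "shared dispatcher, shared logging surface, shared escalation pathway" is a single entry point that every specialist registers with. This is a sketch under assumptions, not a reference design: the `Dispatcher` class, the handler signature, and the ticket fields are all hypothetical.

```python
import logging
from typing import Callable, Dict

# One logger for every specialist, so a failure can be traced in one place.
logger = logging.getLogger("helpdesk.dispatcher")


class Dispatcher:
    def __init__(self, escalate: Callable[[dict, str], None]):
        self._handlers: Dict[str, Callable[[dict], str]] = {}
        self._escalate = escalate  # single escalation path to a human

    def register(self, category: str, handler: Callable[[dict], str]) -> None:
        self._handlers[category] = handler

    def dispatch(self, ticket: dict) -> str:
        category = ticket.get("category", "unknown")
        handler = self._handlers.get(category)
        if handler is None:
            logger.info("ticket=%s category=%s no specialist", ticket.get("id"), category)
            self._escalate(ticket, "no specialist registered")
            return "escalated"
        try:
            result = handler(ticket)
        except Exception as exc:
            # Log which specialist failed, then hand the ticket to a human.
            logger.exception("ticket=%s category=%s failed", ticket.get("id"), category)
            self._escalate(ticket, f"{category} specialist failed: {exc}")
            return "escalated"
        logger.info("ticket=%s category=%s result=%s", ticket.get("id"), category, result)
        return result
```

The second specialist then costs one `register` call, and every ticket, whether handled, unmatched, or failed, leaves a log line that names the category responsible.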

The pattern under the pattern

All three failures share something: the project optimized for the demo, not the operating year. Demo-day metrics are accuracy on a curated set of tickets. Operating-year metrics are: can your L1 still trust this thing on day 240, when the original developer is gone and a new client tenant just got onboarded?

If the answer is "we'd have to rebuild it," the project failed even if it works today.

Which of these have you watched die in the wild?
