A support chatbot that knows when to say 'I don't know'

A customer asks your support bot whether they can return an item after thirty days. The bot answers in a warm, complete sentence: yes, returns are accepted within sixty days, no questions asked. It sounds exactly like a policy. It reads like every other answer the bot has given. The only problem is that no such policy exists — your window is fourteen days, and the bot has just promised a stranger something you now have to either honour or walk back. Nobody typed that rule. The model produced it because it was asked a question and answering fluently is the one thing it always does.

That single failure is the whole subject of this piece, because it's not a bug you patch. It's a property of how the system is wired when you wire it the obvious way. Understand why it happens, and you understand what the actual product is — and why the chat box was never the hard part.

Why the confident wrong answer happens

A language model completes text. Give it a question, and it returns the most plausible continuation — the answer that sounds like a correct answer for that kind of question. When it has seen the real policy, that plausible continuation is usually right. When it hasn't, the model doesn't stop. It produces a continuation anyway, and a made-up return window reads precisely as confident as a true one, because to the model they're the same kind of sentence. There's no internal flag that separates "I know this" from "this is what an answer here tends to look like."

So the danger isn't that the model is unreliable in general. It's that on the exact questions where it's wrong, it gives you no signal that it's wrong. The failure is silent and it's well-dressed. Build a support bot by handing the model a chat box and your customers' questions, and you haven't built a support product — you've built a confident-sounding answer generator pointed at people who will act on what it says.

The two disciplines that turn that into a product

Everything that makes this safe is two pieces of engineering, and neither of them is the model.

The first is grounding the answers in your real documents. Instead of asking the model to answer from whatever it absorbed in training, you give it the relevant passages from your material — your policies, your help articles, your product docs — and ask it to answer from those and only those. The technical name for this is retrieval-augmented generation, RAG, and in plain terms it works like this: when a question comes in, you first search your own document store for the parts most likely to contain the answer, then hand those passages to the model along with the question, with an instruction to base its reply on what you just gave it. The search is the load-bearing step: you're not trusting the model to know your refund policy, you're finding the policy yourself and putting it in front of the model to phrase. The bot speaks from your corpus, not its guesses.
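As a rough illustration of the "hand those passages to the model" step, here is a minimal Python sketch. The `Passage` type, the `build_grounded_prompt` function, and the prompt wording are all assumptions made for this example, not a prescribed format, and the sample passage is invented data that echoes the fourteen-day window from the opening.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical field: title of the document the text came from
    text: str


def build_grounded_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt that asks the model to answer only from the passages."""
    numbered = "\n\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the customer's question using ONLY the passages below.\n"
        "If the passages do not contain the answer, reply: I don't know.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


# Sample data for illustration only.
prompt = build_grounded_prompt(
    "Can I return an item after thirty days?",
    [Passage("Returns", "Returns are accepted within fourteen days.")],
)
print(prompt)
```

The instruction alone does not guarantee the model obeys it, which is why the checks described later still matter.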

That search usually runs over a vector index — a way of storing your documents so you can retrieve them by meaning rather than by exact keyword, so that a customer asking about "sending something back" still finds the page titled "Returns." The index is yours, it's built from your content, and it's the thing a general-purpose bot has never seen. We'll come back to why that matters for whether this is a business at all.
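To make "retrieve by meaning" concrete, here is a toy sketch of a similarity search. The three-number vectors are hand-written stand-ins for real embeddings, which an embedding model would normally produce; the page titles, the `search` function, and the numbers are all assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical index: page title -> hand-written vector standing in for an embedding.
INDEX = {
    "Returns": [0.9, 0.1, 0.0],
    "Shipping times": [0.1, 0.9, 0.1],
    "Account settings": [0.0, 0.1, 0.9],
}


def search(query_vector: list[float], index: dict, top_k: int = 2) -> list[tuple]:
    """Return the top_k (score, title) pairs, best match first."""
    scored = [(cosine(query_vector, vec), title) for title, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Pretend this vector is the embedding of "sending something back".
results = search([0.8, 0.2, 0.1], INDEX)
print(results[0][1])  # the closest page title
```

The query shares no keyword with the "Returns" title, yet that page ranks first because its vector points in nearly the same direction. A production index would use a dedicated vector store rather than a linear scan.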

The second discipline is the harder one to build because it runs against the grain of the model: letting the bot say "I don't know" honestly. Grounding gets the right passages in front of the model when they exist. But sometimes the customer asks about something your documents simply don't cover, and a grounded bot will still try to stitch an answer out of whatever was nearest. The product has to do the opposite — recognise when the retrieved material doesn't actually answer the question and decline, in plain words, rather than improvise. "I don't have that in our documentation — let me get you to a person" is the correct output, and getting the system to prefer that over a fluent guess is most of the engineering. A bot that admits the gap keeps your trust. A bot that fills every gap with confident text spends it, one wrong refund window at a time.
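One simple way to approximate "recognise when the retrieved material doesn't answer the question" is a score threshold on the retrieved passages. The sketch below assumes retrieval returns `(score, text)` pairs. The `MIN_SCORE` value, the function names, and the threshold approach itself are illustrative assumptions. A similarity cut-off is only a proxy for "this passage answers the question" and would need tuning against real conversations.

```python
DECLINE_MESSAGE = (
    "I don't have that in our documentation. Let me get you to a person."
)
MIN_SCORE = 0.75  # assumed cut-off; tune it against real questions


def usable_passages(scored_passages, min_score=MIN_SCORE):
    """Keep only passages whose retrieval score clears the threshold."""
    return [text for score, text in scored_passages if score >= min_score]


def respond(scored_passages, answer_from):
    """Decline in plain words when nothing relevant was retrieved."""
    usable = usable_passages(scored_passages)
    if not usable:
        return {"status": "declined", "reply": DECLINE_MESSAGE}
    # answer_from is a placeholder for the grounded model call.
    return {"status": "answered", "reply": answer_from(usable)}


weak = respond([(0.21, "Shipping takes two days.")], answer_from=lambda p: p[0])
strong = respond(
    [(0.93, "Returns are accepted within fourteen days.")],
    answer_from=lambda p: p[0],
)
print(weak["status"], strong["status"])
```

The important design choice is that the decline path never reaches the model, so there is nothing fluent to improvise with.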

Validate once — don't retry until it sounds right

There's a tempting shortcut that quietly makes things worse. When an answer fails a check — off-topic, or not grounded in the retrieved passages — the reflex is to send it back to the model and try again until something passes. That feels like robustness; it's the opposite. A retry loop hides the failure rate behind whichever attempt happened to look acceptable, so you never learn how often the bot was about to invent something. The discipline is to validate once, and when it fails, log it and surface it rather than rolling the dice until a plausible reply comes back. A logged "couldn't answer this from the docs" is something you can act on — a missing document, or a common question you didn't expect. A failure papered over by a retry is the same information, deleted.
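The validate-once rule can be sketched as a single check with a logged failure and no loop. The word-overlap test in `is_grounded` is a deliberately crude stand-in chosen for this example, not a recommended groundedness check, and the function names and the 0.8 threshold are assumptions.

```python
import logging

logger = logging.getLogger("support_bot")


def content_words(text: str) -> list[str]:
    """Lower-case words longer than three letters, punctuation stripped."""
    cleaned = (w.strip(".,!?;:'\"") for w in text.lower().split())
    return [w for w in cleaned if len(w) > 3]


def is_grounded(answer: str, passages: list[str], min_overlap: float = 0.8) -> bool:
    """Crude check: most content words in the answer appear in the passages."""
    source = set(content_words(" ".join(passages)))
    words = content_words(answer)
    if not words:
        return False
    hits = sum(1 for w in words if w in source)
    return hits / len(words) >= min_overlap


def validate_once(question, answer, passages, failures):
    """Check the answer exactly once; on failure, record it and return None."""
    if is_grounded(answer, passages):
        return answer
    logger.warning("ungrounded answer for question: %r", question)
    failures.append({"question": question, "answer": answer})
    return None  # the caller escalates; there is no retry


failures = []
docs = ["Returns are accepted within fourteen days."]
ok = validate_once("Return window?", "Returns are accepted within fourteen days.", docs, failures)
bad = validate_once(
    "Return window?",
    "Yes, returns are accepted within sixty days, no questions asked.",
    docs,
    failures,
)
```

Here the invented sixty-day answer fails the check and lands in `failures`, where it can be counted and reviewed, instead of being silently replaced by a later attempt.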

The parts that aren't the model

Follow one question through the system: it arrives, you search your document index for the passages most likely to answer it, you ask the model to answer from those, you check the answer is grounded, and you either send it or hand the conversation to a human. Exactly one of those steps is the model call. The rest is search, validation, escalation, and record-keeping.
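The flow above can be sketched as one function whose steps are passed in as plain callables, which makes it visible that only one of them is the model. All names here (`handle_question`, `search`, `ask_model`, `is_grounded`, `log`) are hypothetical, and the lambdas in the usage are stubs rather than real search or model clients.

```python
HANDOFF_MESSAGE = "Passing this conversation to a human."


def handle_question(question, search, ask_model, is_grounded, log):
    """Run one question through search, model, check, and send-or-handoff."""
    passages = search(question)              # search your own index
    answer = ask_model(question, passages)   # the only model call
    grounded = bool(passages) and is_grounded(answer, passages)
    log({"question": question, "answer": answer, "grounded": grounded})
    if grounded:
        return ("send", answer)
    return ("handoff", HANDOFF_MESSAGE)      # escalate instead of improvising


records = []
sent = handle_question(
    "Can I return an item after thirty days?",
    search=lambda q: ["Returns are accepted within fourteen days."],
    ask_model=lambda q, passages: passages[0],
    is_grounded=lambda answer, passages: answer in passages,
    log=records.append,
)
escalated = handle_question(
    "Do you ship to the moon?",
    search=lambda q: [],
    ask_model=lambda q, passages: "Yes, within sixty days.",
    is_grounded=lambda answer, passages: answer in passages,
    log=records.append,
)
print(sent[0], escalated[0])
```

Because every step is injected, the search, the check, and the log can each be tested and replaced without touching the model call.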

And that rest is mostly already built, if you start from a foundation rather than a blank repository:

Accounts and auth — so a conversation belongs to a customer, and an operator signs in to read it.
Billing — it's a subscription product; the bot is the feature, the subscription is the business.
A content store and the index over it — the documents the bot answers from, kept somewhere you control and search.
Background jobs — re-indexing when your docs change, so the bot answers from this week's policy, not last quarter's.
A log and an admin — where every answer, and every honest "I don't know," is recorded, so you can see what the bot said and why a question went unanswered.
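For the background re-indexing job in that list, one common approach is to fingerprint each document and re-index only what changed. The sketch below assumes documents are held as an id-to-text mapping and that the index remembers the hash it last saw; both structures and the function names are hypothetical, and the sample texts are invented.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_needing_reindex(documents: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose text changed since indexing."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed.get(doc_id) != fingerprint(text)
    )


# What the index saw last quarter (illustrative data).
indexed = {
    "returns": fingerprint("Returns are accepted within thirty days."),
    "shipping": fingerprint("Orders ship in two days."),
}
# What the content store holds this week.
documents = {
    "returns": "Returns are accepted within fourteen days.",
    "shipping": "Orders ship in two days.",
    "warranty": "Warranty claims need a receipt.",
}
stale = docs_needing_reindex(documents, indexed)
print(stale)
```

A fuller job would also drop index entries for documents that were deleted, which this sketch leaves out.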

That content store, that logging, and the validate-once rule are not hypothetical here. They run in production under CompanyGraph today — a content system holding the source material a public site is built from, a logger that records what happened, and the same discipline of checking an output once and logging the failure instead of retrying blind. CompanyGraph does not run a public support bot; it is checkable evidence that the parts a support bot leans on are real and operated, not that the bot itself has shipped. The genuinely new work for this product is the retrieval over your specific documents and the tuning of when to decline. The rest is wiring into a base that already carries it.

The line between yours and theirs

The model stays rented, and that's the right call — training one is a different business entirely, and the capability note on generating text with a model you don't own walks through exactly what to keep on your own side when you do. For a support bot the answer is sharp: the model is theirs, but the document index, the retrieval, the grounding rules, the decision about when to say "I don't know," and the full log of what was asked and answered are all yours. If the only record of your bot's behaviour lives in a vendor dashboard, you can't tune it, can't audit a wrong answer, and can't move when a better or cheaper model appears. Keep the index and the logs at home and the model becomes a part you can swap.
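One way to keep the model swappable is to hide it behind a small interface and keep the log outside it. In this sketch, `TextModel`, `StubModel`, and `SupportBot` are hypothetical names; a real vendor client would sit behind the same single `complete` method, and the list stands in for a log store you control.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything that can turn a prompt into text can play this role."""

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in for a rented model; a vendor client would wrap its own API here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] reply"


class SupportBot:
    def __init__(self, model: TextModel, log: list) -> None:
        self.model = model
        self.log = log  # lives on your side, not in a vendor dashboard

    def answer(self, prompt: str) -> str:
        reply = self.model.complete(prompt)
        self.log.append({"model": type(self.model).__name__, "prompt": prompt, "reply": reply})
        return reply


history = []
bot = SupportBot(StubModel("vendor-a"), history)
first = bot.answer("Return window?")
bot.model = StubModel("vendor-b")  # swap the rented part; the log stays put
second = bot.answer("Return window?")
```

After the swap, `history` still holds both exchanges, so answers from the old and new model can be compared from your own records.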

The hard part

The technical risks above are solvable, and the work is real but bounded. The unbounded problem is the same one that decides most of these ideas: distribution.

"AI support chatbot" is a crowded category. Nearly every help-desk and support platform already ships one, and a founder evaluating yours is comparing it against tools their team may already pay for. As a horizontal product — a general bot for any company's support — this drowns, and not because the engineering is worse. It drowns because the buyer has no reason to prefer it, and you'd be competing on a feature everyone already has.

The wedge that works is narrow and specific. Pick a domain where the answers live in a body of documents a general bot has never ingested — a regulated niche, a technical product with deep internal manuals, a field where the right answer is buried in a corpus only insiders hold. Two things line up there. The grounding becomes a real advantage, because your index contains knowledge no general model absorbed. And the honest "I don't know" stops being a nice-to-have and becomes the selling point, because in a domain where a wrong answer is expensive — a misquoted regulation, a misstated medical or legal or financial detail — a bot that reliably declines what it can't ground is worth more than one that's right most of the time and confidently wrong the rest. The narrower and more consequential the domain, the more both disciplines pay off.

The verdict

This is a real SaaS in a narrow niche, and not much of one as a horizontal product. The model call is an afternoon; the grounding and the honest refusal are the product, and they're worth building. But aimed at everyone, you're one more entry in a category full of incumbents. Aimed at a specific domain where you can ground answers no general bot can reach, and where an honest "I don't know" is itself the feature, it has a reason to exist.

A foundation is what makes that bet cheap to place. With accounts, billing, the content store, jobs, and logging already running, the new work is the retrieval over your documents and the tuning of when to decline — so a pointed version, built for one consequential domain, can reach real users without first rebuilding the boring half. The honest question was never "can I wire a model to a chat box." You can, today. It's whether your domain is narrow and costly enough that grounded answers and an honest silence are worth paying for. The only way to learn that is to put the pointed version in front of the people who'd feel the cost.

The content store, jobs, billing, and logging behind this — the parts that turn a model call into a support product you can stand behind — are the foundation. If you have a pointed version of an idea like this, that's exactly what one workflow is meant to prove.
