Search your own content — own the documents the index is built from

In the support-chatbot build idea, the part that did the real work wasn't the model. It was the index — the thing that found the right passage for the model to answer from. The model was rented and swappable; the index was the asset that made the answers yours. This note is about that index on its own, and the single ownership decision it forces: do you own it, or rent search as a service?

What the thing actually is

Start with the capability, plainly, because the ownership question only makes sense once the shape is clear. A search index over your content is a way of storing your documents so you can retrieve them by meaning rather than exact words — so a customer asking about "sending something back" still lands on the page titled "Returns." The modern form is a vector index: each document is turned into a list of numbers, an embedding, that places it by meaning, and an incoming question is matched against those numbers to pull the closest passages.
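To make "matched against those numbers" concrete, here is a minimal sketch in Python. It assumes cosine similarity as the closeness measure, which is one common choice rather than the only one, and the three-number vectors and page titles are hand-made stand-ins for what a real embedding model would return.

```python
import math

def cosine(a, b):
    """Cosine similarity: near 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical hand-made "embeddings"; a real model returns hundreds of numbers per text.
INDEX = {
    "Returns": [0.9, 0.1, 0.0],
    "Shipping times": [0.1, 0.9, 0.1],
    "Gift cards": [0.0, 0.2, 0.9],
}

def closest(query_vector, index, k=1):
    """Return the k page titles whose vectors sit closest to the query vector."""
    ranked = sorted(index, key=lambda title: cosine(query_vector, index[title]), reverse=True)
    return ranked[:k]

# Pretend this is the embedding of "how do I send something back?"
question_vector = [0.8, 0.2, 0.1]
print(closest(question_vector, INDEX))  # ['Returns']
```

The question shares no words with the page title; it lands on "Returns" only because the two vectors point in nearly the same direction.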

There are two ways to have one. You can rent it whole — hand a managed search service your documents, let it host the index, and query its API. Or you can own it — keep the embeddings and the source documents in your own database and run the retrieval yourself. The two feel similar on the first day and diverge sharply later, and which way you lean is the actual subject here.

The ownership question turns on one fact

What must stay yours? The answer follows from a single fact about this capability: the index is built from your content. It is the one component in the whole pipeline a general provider has never seen — your policies, your manuals, your accumulated corpus. The model is generic; the embedding maths is generic; the index is not, because it's made of the thing only you have.

So three pieces are yours by nature. The content is yours. The decision of what goes into the index — what's authoritative, what's stale, what should never be retrieved — is a product judgment, not a setting. And the retrieval logic — what counts as a match, how many passages you pull, how you rank them when several are close — is product behaviour you'll tune against real questions. None of that is infrastructure you'd want a vendor making decisions about on your behalf.
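One way to keep those judgments visibly yours is to write them down as code you control rather than as settings in someone else's dashboard. The sketch below is illustrative only: the field names, the 0.75 threshold, and the small bonus for authoritative pages are all assumptions, not recommended values.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    doc_id: str
    score: float                  # similarity to the question, higher is closer
    authoritative: bool = False   # editorial judgment about the source
    stale: bool = False           # superseded content
    never_retrieve: bool = False  # drafts, internal notes

@dataclass(frozen=True)
class RetrievalRules:
    min_score: float = 0.75        # what counts as a match (assumed value)
    top_k: int = 2                 # how many passages to pull
    authority_bonus: float = 0.05  # how to rank when several are close

def select(candidates: List[Candidate], rules: RetrievalRules) -> List[str]:
    """Apply the product's own rules to raw similarity results."""
    eligible = [
        c for c in candidates
        if not c.stale and not c.never_retrieve and c.score >= rules.min_score
    ]

    def rank(c: Candidate) -> float:
        return c.score + (rules.authority_bonus if c.authoritative else 0.0)

    ranked = sorted(eligible, key=rank, reverse=True)
    return [c.doc_id for c in ranked[: rules.top_k]]

# Hypothetical raw results for one question.
candidates = [
    Candidate("returns-policy", 0.82, authoritative=True),
    Candidate("old-returns-page", 0.90, stale=True),
    Candidate("forum-thread", 0.84),
    Candidate("internal-draft", 0.95, never_retrieve=True),
    Candidate("shipping-faq", 0.40, authoritative=True),
]
print(select(candidates, RetrievalRules()))  # ['returns-policy', 'forum-thread']
```

The two highest raw scores never reach the answer, because one is stale and the other was never meant to be retrieved. That filtering is the product judgment, and it holds regardless of where the vectors are hosted.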

Where to put the seam

Split the capability where the seam genuinely is, not where the marketing puts it. Turning a piece of text into an embedding is a stateless call — you send text, you get back numbers — and that's a commodity, the same kind of rented intelligence as generation. Rent it freely, and swap providers when a better or cheaper embedding model appears.
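A sketch of what that seam could look like in code, assuming nothing about any real provider's API: the `Embedder` interface and the `HashEmbedder` stand-in are hypothetical, and the stand-in produces numbers from a hash, so they carry no meaning at all. The point is only the shape: text in, numbers out, and the model name recorded next to every vector.

```python
import hashlib
from typing import List, Protocol

class Embedder(Protocol):
    """The rented side of the seam: a stateless text-to-numbers call."""
    model_name: str

    def embed(self, text: str) -> List[float]: ...

class HashEmbedder:
    """Stand-in for a provider. Deterministic, but the numbers are meaningless."""

    def __init__(self, model_name: str, dims: int):
        self.model_name = model_name
        self.dims = dims  # at most 32, the length of a SHA-256 digest

    def embed(self, text: str) -> List[float]:
        digest = hashlib.sha256((self.model_name + text).encode("utf-8")).digest()
        return [byte / 255 for byte in digest[: self.dims]]

def embed_passage(embedder: Embedder, text: str) -> dict:
    """Record which model produced the vector, so a later swap is detectable."""
    return {"model": embedder.model_name, "vector": embedder.embed(text)}

# Swapping providers is a one-line change because nothing else knows who they are.
provider_a = HashEmbedder("hypothetical-model-a", dims=4)
provider_b = HashEmbedder("hypothetical-model-b", dims=8)
record_a = embed_passage(provider_a, "How to send an item back")
record_b = embed_passage(provider_b, "How to send an item back")
```

Because the call holds no state, the same text always yields the same vector from the same model, and nothing of yours lives on the provider's side between calls.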

On the other side of the seam sit three things that are not interchangeable, and most of the confusion about "owning your search" comes from lumping them together. The source documents are the must-own: they're your content, everything else can be rebuilt from them, and they can't be rebuilt from anything else. The retrieval rules (what counts as a match, how many passages you pull, how you rank them) are product judgment, and stay yours too. The stored vectors are the part you can most safely rent: a managed vector store hosting the embeddings is a real job worth paying for, and self-hosting it is the stronger version of ownership, not the price of admission. So the question that actually decides the design isn't "should I run my own vector database". It's narrower: is your only copy of what you know, and the ability to rebuild the index from it, sitting somewhere you can't easily leave?

What breaks when it's hacked in

The failure mode is the one every capability in this set shares: the provider becomes the only record. Here it has a specific and expensive shape. Embedding models improve, and vectors produced by different models can't be compared with one another, so the day you move to a better one, every vector in your index has to be recomputed, which you can only do if you still hold the source documents. Upload your corpus, keep only the vendor's index, discard the originals, and you've quietly lost the ability to re-index when models change, to audit why a wrong passage was retrieved, or to move when the pricing does. The rule that prevents all of it is small: keep the source documents at home. The index is always rebuildable from the documents; it is never rebuildable from nothing.
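The rebuild path can be sketched in a few lines, assuming the source documents live in a store you control and each stored vector is tagged with the model that produced it. Everything here is hypothetical: the document texts, the model names, and the two toy `embed` functions, which stand in for an old and a new embedding model.

```python
def rebuild_index(documents, embed, model_name):
    """Recompute every vector from the source text. Needs only the documents."""
    return {
        doc_id: {"model": model_name, "vector": embed(text)}
        for doc_id, text in documents.items()
    }

def needs_rebuild(index, documents, model_name):
    """True if any vector came from another model, or the document set has drifted."""
    if set(index) != set(documents):
        return True
    return any(entry["model"] != model_name for entry in index.values())

# The part that stays at home: the source text itself (made-up examples).
DOCUMENTS = {
    "returns": "How to send an item back.",
    "shipping": "When orders leave the warehouse.",
}

# Toy stand-ins for two embedding models with different output shapes.
def old_embed(text):
    return [float(len(text))]

def new_embed(text):
    return [float(len(text)), float(text.count(" "))]

index = rebuild_index(DOCUMENTS, old_embed, "old-model")

# The day the model changes, the old vectors are flagged and recomputed.
if needs_rebuild(index, DOCUMENTS, "new-model"):
    index = rebuild_index(DOCUMENTS, new_embed, "new-model")
```

Note what `rebuild_index` takes as input: the documents and an embedding call, never the old index. If the originals were discarded after upload, there is nothing to pass in.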

This is where the foundation's experience is real but partial, and worth marking honestly rather than overstating. CompanyGraph runs a content system that holds the source material a public site is built from — the documents are owned and at home, which is exactly the half that survives a provider change. The retrieve-by-meaning layer over them is a capability the foundation has exercised in parts, not a public search product shipped at scale — so take this as a note written from owning the content, not from operating a large vector search in production. The half that's lived is the half the ownership argument turns on.

Where it shows up

It shows up anywhere a product answers from its own material. The support chatbot is the sharpest case: there the index was the product and the model was the rented part, and getting that backwards is how a defensible product turns into a thin wrapper. The same split runs one layer up at the model itself: the question of what has to stay yours when the model is someone else's is the identical ownership argument applied to generation.

The answer, then, is the same shape every time, with the load-bearing part named precisely. Rent the embedding model; rent the vector store too if it earns its keep — but own the source documents and the rebuild path, because that's what keeps the index yours to remake on your own schedule rather than yours to rent back one query at a time. Owning the running index outright is the stronger version; owning what it's built from is the version you can't skip. Search over your own content sits too close to the product to hand the content away whole. Keeping it is part of the foundation for the same reason keeping the billing record and the file index is: the capability is theirs to provide, but the meaning is yours to keep.

