What one capability unlocks
How do you trust the text an OCR API pulls from a document?
OCR returns text and a confidence. The confidence is the feature.
A document-extraction API gives you two things back, and most of the value is in the one nobody designs for. You send it a scanned invoice, a receipt, a form, and it returns the text it read — the vendor name, the line items, the total. Alongside each piece of text it also returns a number: how sure it was. The temptation is to take the text and ignore the number, because the text is what you wanted and the number is just metadata. That reading is backwards. The provider has already solved the hard problem of turning pixels into characters. The number is the part it handed back to you to decide about, and what you do with the low end of it is the whole product.
What the capability gives you
You're renting reading. AWS Textract, Azure AI Document Intelligence, and Google Document AI all take an image or a PDF of a document a human filled in or printed, and they return structured text: not just a blob of characters but, where the document has shape, the fields, tables, and key-value pairs, each with a confidence score. Some services report that score as a fraction between zero and one and others as a percentage (Textract uses 0 to 100), so normalize it before you compare anything. That is a genuinely hard capability and it is correctly rented; training a document model is a different company. On a clean, well-lit, typed document the text comes back right and the confidence comes back high, and it feels finished.
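As a rough sketch of what "text plus a number" looks like once it is in your own code, the block below defines a provider-neutral field record and a helper that puts confidences on one scale. The names `ExtractedField` and `normalize_confidence`, the field layout, and the sample values are assumptions made for illustration; each provider's real response is shaped differently and would need its own mapping into something like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """One value read from a document, in a hypothetical provider-neutral shape."""
    name: str          # e.g. "invoice_total"
    text: str          # the characters the provider read
    confidence: float  # normalized to the range 0.0 to 1.0
    page: int          # 1-based page number in the source document
    bbox: tuple[float, float, float, float]  # left, top, width, height as page fractions


def normalize_confidence(raw: float, scale_max: float) -> float:
    """Map a provider score onto 0.0-1.0 and clamp anything out of range."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    return min(1.0, max(0.0, raw / scale_max))


# A score of 55 on a 0-100 scale and 0.55 on a 0-1 scale mean the same thing.
total = ExtractedField(
    name="invoice_total",
    text="1,380.00",
    confidence=normalize_confidence(55.0, scale_max=100.0),
    page=1,
    bbox=(0.62, 0.81, 0.20, 0.03),
)
```

Keeping the page and region on the same record as the text and the confidence means nothing downstream can receive a value without also receiving its doubt and its origin.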
It feels finished because the demo is always a clean document. Real documents are photographed at an angle, faxed twice, handwritten in a hurry, smudged where a thumb sat, printed in a font the model half-knows. On those, the text still comes back — but some of it comes back wrong, and the model usually knows it might be wrong, and says so in the number. The confidence isn't decoration on top of the answer. It's the model being honest about which parts of its answer to trust, and a product that throws that honesty away has thrown away the one thing standing between it and a confident mistake.
What stays yours, and what doesn't
The model stays external — the OCR engine, the layout detection, the character recognition are theirs and rightly rented. Three things have to stay yours, and together they're the difference between "we read the document" and "we can stand behind what we read":
- The confidence threshold. Your policy, in your code, for what counts as sure enough. Above the line, accept the value automatically; below it, route the field to a person. Where that line sits is a product decision — a marketing form and a financial filing tolerate very different amounts of doubt — and it belongs to the side that knows the cost of being wrong: you.
- The human-correction record. When a person fixes a low-confidence field, that correction is evidence — proof that a named human looked at this value and verified it. Stored, it's the difference between a number a machine guessed and a number someone is accountable for. Discarded, every value looks equally trustworthy, which is to say none of them are.
- The binding back to the source. Each extracted value has to stay tied to the page and the region it came from, so a reviewer can see the value next to the pixels it was read from and the original is always one click away. A number with no path back to the image it came from can't be checked, only believed.
What you leave with the provider is everything upstream of those: the recognition itself, the model versions, the improvements that arrive on their schedule without asking you. Rent the reading. Keep the deciding.
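The three things that stay yours can be sketched in a few dozen lines. This is a minimal illustration, not a finished design: the class names, the per-field threshold values, the reviewer identifier, and the document id are all assumptions chosen for the example, and a real system would persist these records rather than hold them in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceRegion:
    """The binding back to the source: which document, page, and region."""
    document_id: str
    page: int
    bbox: tuple[float, float, float, float]  # left, top, width, height as page fractions


@dataclass(frozen=True)
class Correction:
    """The human-correction record: who changed what, and when."""
    reviewer: str
    original_text: str
    corrected_text: str
    corrected_at: datetime


@dataclass
class FieldValue:
    name: str
    text: str
    confidence: float  # assumed already normalized to 0.0-1.0
    source: SourceRegion
    correction: Optional[Correction] = None

    @property
    def current_text(self) -> str:
        """The verified text if a person corrected it, else the machine's reading."""
        return self.correction.corrected_text if self.correction else self.text


# The confidence threshold: a product policy, so it lives in your code.
# These numbers are placeholders, not recommendations.
THRESHOLDS = {"invoice_total": 0.98, "vendor_name": 0.90}
DEFAULT_THRESHOLD = 0.95


def needs_review(field: FieldValue) -> bool:
    """True when the field is below its threshold and no person has verified it."""
    if field.correction is not None:
        return False
    return field.confidence < THRESHOLDS.get(field.name, DEFAULT_THRESHOLD)


def record_correction(field: FieldValue, reviewer: str, corrected_text: str) -> None:
    """Attach a correction while keeping the machine's original reading."""
    field.correction = Correction(
        reviewer=reviewer,
        original_text=field.text,
        corrected_text=corrected_text,
        corrected_at=datetime.now(timezone.utc),
    )


field = FieldValue(
    name="invoice_total",
    text="1,380.00",
    confidence=0.55,
    source=SourceRegion(document_id="inv-0001", page=1, bbox=(0.62, 0.81, 0.20, 0.03)),
)
if needs_review(field):
    record_correction(field, reviewer="a.reviewer", corrected_text="1,880.00")
```

The correction keeps the machine's original reading next to the human's, and the source region never changes, so anyone questioning the figure later can see what was read, what it was changed to, who changed it, and where on the page it came from.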
What breaks when it's hacked in
The failure mode is short and expensive: treat the API as a function that returns the total, drop the confidence on the floor, and post whatever comes back. It works on every clean document, which is exactly why it ships. Then a photographed invoice comes through with an 8 the model read as a 3 at confidence 0.55 — the model flagged its own doubt, and nobody was listening. The total posts automatically. No threshold caught it because there was no threshold. No human saw it because there was no queue. No correction record exists because nothing was ever corrected, and no binding to the source survives, so when the figure is finally questioned there's no page to go back to and no trail of who approved it. One misread digit became a booked transaction with no one accountable, and it got there through a number the provider supplied specifically to stop it.
The fix isn't a better model. A better model still returns a confidence, and still gets the smudged digit wrong sometimes. The fix is owning what happens below the line. The foundation this rides on is the same one a real product already needs: a background job, because extraction runs on documents in batches and shouldn't block a request; somewhere durable for the source files to live so the binding back to the page survives; a log of every extraction and its confidences; and an admin queue where the below-threshold fields wait for a person and the corrections are recorded. Those modules aren't hypothetical. CompanyGraph runs a background batch runner, a logging service, and an operator admin in production today, and that is the part you can verify. A correction queue has the same shape as an admin the studio has already built and operates. What hasn't been built is the product described here: an extraction product with a confidence policy and a correction record wired through it. The foundation is evidence the parts exist, not a claim this idea has been done.
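To make the shape of that foundation concrete, here is a small sketch of the batch step: log every extraction, accept what clears the threshold, and queue the rest for a person. It is an illustration under stated assumptions, not how CompanyGraph is built. The `Extraction` record, the `run_batch` function, the 0.95 threshold, and the in-memory `deque` standing in for an admin queue are all hypothetical, and the provider call is left out entirely: the batch is passed in as data.

```python
import logging
from collections import deque
from dataclasses import dataclass

logger = logging.getLogger("extraction")


@dataclass(frozen=True)
class Extraction:
    document_id: str
    field: str
    text: str
    confidence: float  # assumed already normalized to 0.0-1.0


def run_batch(extractions, threshold, review_queue):
    """Log every extraction, accept the sure ones, queue the rest for a person.

    Returns the list of extractions accepted automatically.
    """
    accepted = []
    for item in extractions:
        # Every reading is logged with its confidence, accepted or not.
        logger.info(
            "doc=%s field=%s text=%r confidence=%.2f",
            item.document_id, item.field, item.text, item.confidence,
        )
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            review_queue.append(item)  # waits for a human; nothing is posted
    return accepted


# The misread total from the failure story: an 8 read as a 3 at confidence 0.55.
review_queue = deque()
batch = [
    Extraction("inv-0001", "vendor_name", "Acme Supply", 0.99),
    Extraction("inv-0001", "invoice_total", "1,380.00", 0.55),
]
accepted = run_batch(batch, threshold=0.95, review_queue=review_queue)
```

With the threshold in place the vendor name is accepted and the doubtful total waits in the queue instead of posting, which is the whole difference between the failure above and the fix.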
Where it shows up in a build idea
This capability is a quiet cousin to two ideas in this set, and the family resemblance is the point. It's the same machine-defers-to-human shape as the moderation build idea where a queue is the product, not the classifier: in both, the external model returns a judgement and its own uncertainty, and the product is the policy for the cases it isn't sure about, plus the record of who decided. And it sits at the front of the report-generator build idea — documents in, branded documents out: extraction is how unstructured documents become the structured data those reports are built from, and a report built on numbers nobody verified inherits every misread digit underneath it.
The verdict is plain and it isn't deflating: extraction is real and worth renting, but the API is the part that's already done. Own the confidence threshold, own the human-correction record, own the binding back to the source page — and treat blind trust in a number that can be wrong as the failure mode it is. The text is the easy part. The number, and what you do below it, is yours.
Owning the threshold and the correction record this way is part of the foundation — the background jobs, logging, and operator admin a correction queue rides on run in production on CompanyGraph today. If you have a document-extraction idea, the next step is to be honest about where your confidence line sits and what you record below it.