Monitoring is easy. Knowing when to wake someone is the product.

It is 2:14am and the phone goes off for the third night running. The on-call engineer reads it half-awake: a sensor on a freezer in a depot two timezones away briefly reported a temperature it has reported a hundred times before, and cleared itself before anyone could look. Nothing was wrong. They acknowledge it, drop the phone, and lie there annoyed. The fourth time it happens, they swipe it away without reading. By the end of the week, the alert app is something they have learned to ignore — which means the night the freezer actually fails, the page lands in a channel they have already tuned out.

That scene is the whole subject of this piece, and it inverts what a monitoring demo shows you. A demo is a dashboard lighting up — tiles going green and red, a live graph, a satisfying flood of telemetry. It looks like the product. It is the easy twenty percent. The product is the opposite of the demo: not the lighting-up, but the restraint. Waking the right person at the right moment, and — the harder, quieter half — staying silent every other moment, so that when a page does arrive it still carries weight.

The asymmetry that decides everything

Two failures are possible here, and they are not symmetric, which is the thing to hold onto before any code is written.

A false page erodes trust. Each one that turns out to be nothing teaches the person on call to take the next one a little less seriously, until the app is noise they swipe away. It is a slow failure, and it is invisible in any test where you are the only user, because you would never ignore your own demo.

A missed page ends the contract. The one event that mattered — the freezer that actually failed, the pump that actually stopped, the gateway that actually went dark — did not reach a human in time, and the cost of that landed on your customer. That is the failure they remember, and it is the one they cancel over.

So the product is not "send alerts." It is "be trusted enough that the one page that matters gets read, and quiet enough the rest of the time that it stays trusted." Everything technical below is in service of that one asymmetric goal. A clever dashboard does not help with it. Reliable, restrained delivery is the entire game.

The capability chain

Strip it to what actually has to happen, end to end, from a device to a sleeping human:

A device reports a reading → the reading is ingested and stored → a rule is evaluated against it → a decision is made about whether this is worth waking someone → the page is delivered through a channel that will actually reach them → and the fact that you paged, and whether it was acknowledged, is recorded. Six steps. Exactly one of them, the rule comparison, is the part that demos. The other five are reporting, ingestion and storage, judgment, delivery, and memory, and they are where the product lives, because they are what the 2am scene turns on.

Notice where the weight sits. The reading crossing a line is trivial and identical for every customer. Deciding whether a given crossing is worth a human's sleep, getting the page to land, and proving it landed — that is the work, and none of it is in the quickstart.
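To make the gap between the comparison and the judgment concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Reading` type, the function names, and the choice of "the breach must persist for a set duration" as the judgment rule. A real policy would weigh more than duration, but even this much is enough to silence the sensor that blips and clears itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reading:
    # Hypothetical shape of one stored reading.
    device_id: str
    taken_at: datetime
    value: float  # e.g. freezer temperature in degrees Celsius


def crosses_line(reading: Reading, threshold: float) -> bool:
    """The part that demos: one comparison, identical for every customer."""
    return reading.value > threshold


def worth_waking_someone(
    readings: list[Reading], threshold: float, sustain: timedelta
) -> bool:
    """Page only if the latest unbroken run of breaches has lasted `sustain`.

    Assumes `readings` belong to one device and are sorted oldest first.
    """
    breach_started = None
    for reading in readings:
        if crosses_line(reading, threshold):
            if breach_started is None:
                breach_started = reading.taken_at
        else:
            breach_started = None  # the sensor cleared itself, so start over
    if breach_started is None:
        return False
    return readings[-1].taken_at - breach_started >= sustain
```

With a threshold of -15 and a ten-minute `sustain`, a single warm reading that clears a minute later returns `False`, while ten straight minutes of warm readings returns `True`. The trade-off is deliberate: a longer `sustain` means fewer false pages but a later true one.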

What it rides on

Named as the modules a real version needs:

Accounts and auth — so a device belongs to a customer, a fleet belongs to an account, and a page goes to the right on-call person and no one else.
A device cloud — AWS IoT or Azure IoT to handle the connections, the device identities, and the firehose of telemetry coming off the fleet. This is rented, and rightly so.
A time-series store — the readings have to land somewhere built for a high volume of timestamped points, so you can evaluate a rule against the latest value and show the history behind it.
Background jobs — a scheduled runner that evaluates rules, rolls up summaries, and drives the escalation logic on a cadence, rather than blocking on every incoming reading.
Delivery with a record — email for the low-urgency, and a voice or SMS channel for the page that has to wake someone, each leaving a record of what was sent and whether it was acknowledged.
A log and an admin — so an operator can answer "why did this page fire" and, far more often, "why didn't this one," without opening a database at 3am.

Four pieces of this are not speculative. Background jobs, logging, and an admin run in production today underneath CompanyGraph: a scheduled batch runner working through a hundred-thousand-stock market dataset, a logger every subsystem ships errors to, and an operator console where a stuck job is something you can see. Webhooks, the inbound seam a device cloud or a delivery provider calls back into, are the same shape CompanyGraph already uses to take callbacks from its providers. That is the honest extent of the claim: the foundation carries the scheduling, the logging, the operability, and the webhook plumbing. It has never watched a fleet of physical devices, and that part is new.

What you own, and what you don't

The decision that keeps this product yours is drawn early, and it is the usual one: rent the hard external job, own the judgment and the record.

The device cloud is rented. Maintaining the connections to thousands of devices, the firmware-facing protocols, the device identity and certificate handling — that is a specialist's domain, and AWS or Azure IoT exists to carry exactly it. You are not going to out-engineer their ingestion layer, and you should not try. The raw transport of bytes off a fleet stays external.

What must be yours is everything from the reading onward. The rules each customer sets. The escalation policy: who gets paged first, who gets it if the first person does not acknowledge, and how long that takes. The suppression state that stops a flapping sensor from paging ten times. And the delivery record: who was paged, through which channel, at what time, and whether they acknowledged. That last one is not optional, because the question you will be asked after an incident is not "did the API return 200" but "did a human actually get woken, and when." Owning the consent to contact a person and the proof that you reached them is its own discipline, the one covered in the capability note on owning consent and the delivery record when you page a human. If your only account of who you paged lives in a provider's dashboard, you cannot answer the question that matters, and answering it is the product.
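As a rough illustration of what owning the escalation policy and the delivery record can look like, here is a minimal Python sketch. The `EscalationStep` and `DeliveryRecord` types and the `steps_due` function are hypothetical names invented for this example, not part of any provider's API, and the sketch deliberately leaves out the actual sending.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    contact: str      # who gets paged at this step
    channel: str      # e.g. "sms" or "voice"
    after: timedelta  # delay since the alert fired, if still unacknowledged


@dataclass
class DeliveryRecord:
    # Your own record of a page, kept outside the provider's dashboard.
    alert_id: str
    contact: str
    channel: str
    sent_at: datetime
    acknowledged_at: Optional[datetime] = None


def steps_due(
    policy: list[EscalationStep],
    fired_at: datetime,
    now: datetime,
    records: list[DeliveryRecord],
) -> list[EscalationStep]:
    """Return the steps that should be paged now and have not been yet."""
    if any(r.acknowledged_at is not None for r in records):
        return []  # a human has it, so stop escalating
    already_paged = {(r.contact, r.channel) for r in records}
    return [
        step
        for step in policy
        if now - fired_at >= step.after
        and (step.contact, step.channel) not in already_paged
    ]
```

A scheduled job could call `steps_due` on every open alert, send each returned step, and append a `DeliveryRecord` per send. Because the records are yours, "who was paged, through which channel, and when did they acknowledge" becomes a query rather than a support ticket to a provider.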

The hard part

This is the section not to bury, because for this idea the hard part is genuinely hard — harder than the crowded-distribution problem most build ideas end on. There is real technical risk here, and it is not an afternoon.

Ingestion at scale. Growing a fleet from a hundred devices to a hundred thousand is not a change you can reason about by intuition. The volume of incoming readings, the cost of storing every timestamped point, the rate at which rules have to be evaluated: all of it climbs with the fleet, and a design that was comfortable at small scale can fall over at large scale in ways that only show up under real load. This is the part where "I wired it in an afternoon" meets a wall, and the wall is real.
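A back-of-envelope calculation shows how quickly the numbers move. The figures below are assumptions chosen only for illustration (one reading per device per minute, fifty bytes per stored point); real fleets report at very different rates, and the `daily_load` helper is a hypothetical name.

```python
SECONDS_PER_DAY = 86_400


def daily_load(devices: int, interval_s: int = 60, bytes_per_point: int = 50) -> dict:
    """Estimate ingest rate and raw storage for a fleet.

    Assumes every device reports once per `interval_s` seconds and every
    point costs `bytes_per_point` bytes before compression or indexes.
    """
    readings_per_day = devices * SECONDS_PER_DAY // interval_s
    return {
        "readings_per_second": devices / interval_s,
        "readings_per_day": readings_per_day,
        "stored_gb_per_day": readings_per_day * bytes_per_point / 1e9,
    }


small = daily_load(100)      # 144,000 readings per day
large = daily_load(100_000)  # 144,000,000 readings per day
```

Under these assumptions the small fleet produces under two readings a second, which almost any design can absorb. The large fleet produces well over a thousand a second and about 7.2 GB of raw points a day, and every one of those readings is also a rule evaluation.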

Alert-storm suppression — the load-bearing one. When something goes wrong in a connected system, it rarely goes wrong quietly. A network partition does not take down one device; it takes down a region, and ten thousand devices all report unreachable in the same minute. A power event trips a hundred sensors at once. The naive system pages for every one of them, and now the single page that matters — the one device whose failure is the actual problem — is buried under ten thousand that are merely symptoms of the same root cause. So the work is the suppression: grouping related alerts so a regional outage is one page and not ten thousand, recognising a flapping sensor and holding it rather than re-paging on every twitch, distinguishing the cause from its symptoms, and escalating only what a human actually needs to act on right now. That logic is the product. It is unglamorous, it is where most of the build goes, and it is invisible in a demo because a demo never has a storm.
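Two of those suppression moves, grouping a storm and holding a flapping sensor, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: alerts are grouped purely by a shared region and kind within a time window, and flapping is handled with a plain per-device cooldown. The `Alert`, `group_storm`, and `FlapHold` names are hypothetical, and the sketch makes no attempt at the harder job of telling a cause from its symptoms.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    device_id: str
    region: str
    kind: str  # e.g. "unreachable" or "over_temperature"
    raised_at: datetime


def group_storm(alerts: list[Alert], window: timedelta) -> list[list[Alert]]:
    """Collapse alerts sharing a region and kind into one group per burst.

    A new group starts only when the gap since the previous alert with the
    same region and kind exceeds `window`.
    """
    by_key = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        by_key[(alert.region, alert.kind)].append(alert)

    groups = []
    for same_key in by_key.values():
        current = [same_key[0]]
        for alert in same_key[1:]:
            if alert.raised_at - current[-1].raised_at <= window:
                current.append(alert)
            else:
                groups.append(current)
                current = [alert]
        groups.append(current)
    return groups


class FlapHold:
    """Suppress re-paging a device that already paged within `cooldown`."""

    def __init__(self, cooldown: timedelta) -> None:
        self.cooldown = cooldown
        self._last_paged: dict[str, datetime] = {}

    def should_page(self, device_id: str, now: datetime) -> bool:
        last = self._last_paged.get(device_id)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the hold, so stay silent
        self._last_paged[device_id] = now
        return True
```

Paging once per group rather than once per alert turns a regional outage into a single page that can say "a thousand devices unreachable in this region." In production the `FlapHold` state would need to live in a shared store rather than in process memory, since it has to survive restarts and be visible to every worker.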

Underneath both sits the asymmetry from the top: reliable delivery beats a clever dashboard, every time. A beautiful visualization that occasionally fails to page is worth less than a plain text message that always arrives. The engineering attention belongs on the path that has to work at 2am during the worst hour the system will ever have — not on the tiles that look good at noon.

The verdict

This is a real SaaS, with real budgets behind it — operations teams pay for confidence that they will be told when something breaks and left alone when it does not. But it carries genuine technical risk, and the honest verdict is that this is not a weekend project and not an afternoon. The easy part — read a value, compare it to a line — is the part everyone can build. The product is the part the 2am scene exposes: the suppression that keeps the page trusted, the escalation that gets it to a human who is awake, the record that proves you did, and an ingestion path that holds up when the fleet is large and the night is bad.

That is also why having the boring parts already in place changes the math. The scheduling, the logging, the operability, the webhook seam — when those are already carried, support is run from an admin rather than a database, and the work goes where it should: into the suppression logic and the delivery path that were always going to be the hard, differentiating part, rather than into rebuilding scaffolding that was never the point. A foundation does not make the hard part easy. It clears everything that was getting in the way of the hard part, so your attention lands on the one thing this product is actually about.

There is a near-twin worth reading next, because it is the same discipline on a different stream of numbers. Watching a market-data feed and deciding when a move is worth telling someone about is the same restraint problem. The piece on building alerts on a data feed without teaching people to ignore them works through it in detail, grounded in a product Smallbox actually runs. Devices or prices, the lesson is identical: the value was never the reading. It was knowing when to break the silence.

This rides on the foundation — the scheduled jobs, the logging, the webhook seam, and the admin that turn a stream of device readings into a page a human can trust. If you have a pointed version of an idea like this, that's exactly what one workflow is meant to prove.

