How we run sub-minute uptime checks from 6 regions
A look under the hood at SiteChecker's monitoring infrastructure — the trade-offs, the costs, and why we picked the stack we did.
When we set out to build SiteChecker, we had one non-negotiable: every monitor should run from at least 6 geographically distinct regions, every 60 seconds, without bankrupting us.
This post walks through how we got there.
The architecture in one diagram
(Diagram lives next to this MDX file as ./architecture.png; see how images-in-subfolder posts work below.)
The high-level flow is:
- A scheduler (one per region) reads the list of due monitors from MongoDB.
- Each check runs in a small worker that posts the result back to the API.
- The API computes status transitions and fans out alerts via the notifications service.
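To make the last step concrete, here is a minimal TypeScript sketch of how per-region results might be turned into a status transition. Everything in it is an assumption for illustration: the CheckResult shape, the function names, and especially the majority rule (a monitor counts as down only when more than half of the reporting regions fail). The post does not describe the rule SiteChecker actually uses.

```typescript
// Hypothetical types; not SiteChecker's actual schema.
type Status = "up" | "down";

interface CheckResult {
  monitorId: string;
  region: string;
  ok: boolean;
}

// Assumed rule: a monitor is "down" only when more than half of the
// reporting regions failed. The real rule may differ.
function aggregateStatus(results: CheckResult[]): Status {
  const failures = results.filter((r) => !r.ok).length;
  return failures * 2 > results.length ? "down" : "up";
}

// Returns the transition to alert on, or null when the status is unchanged.
function transition(
  previous: Status,
  results: CheckResult[]
): { from: Status; to: Status } | null {
  const current = aggregateStatus(results);
  return current === previous ? null : { from: previous, to: current };
}
```

Under this assumed rule, a failure seen from a single region out of six produces no transition, so no alert would be sent for it.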
Why MongoDB for time-series
MongoDB’s time-series collections turned out to be a near-perfect fit:
- Native bucketing keeps storage tight (~80% smaller than a naive schema).
- Secondary indexes on monitorId make per-monitor queries fast.
- TTL indexes give us automatic data retention with zero ops.
We benchmarked it head-to-head against Timescale, and MongoDB’s operational simplicity won out.
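For readers who haven't set up a time-series collection before, the features listed above map onto a small amount of configuration. The sketch below is not SiteChecker's real setup: the collection name checks, the field names ts and meta, and the 30-day retention are all assumptions. Note that on a time-series collection, retention is typically set with the collection-level expireAfterSeconds option. The objects are plain data so the snippet runs without a database; the driver calls that would consume them are shown in comments.

```typescript
// Hypothetical names: the "checks" collection, the "ts" and "meta" fields,
// and the 30-day retention are illustrative assumptions.
const RETENTION_DAYS = 30;

// Options in the shape accepted by the MongoDB Node.js driver's
// db.createCollection(name, options).
const checksCollectionOptions = {
  timeseries: {
    timeField: "ts", // timestamp of each check result
    metaField: "meta", // per-series metadata, e.g. { monitorId, region }
    granularity: "minutes" as const,
  },
  // Documents are removed automatically once they are older than this.
  expireAfterSeconds: RETENTION_DAYS * 24 * 60 * 60,
};

// Secondary index spec for per-monitor queries, newest first.
const perMonitorIndex = { "meta.monitorId": 1, ts: -1 };

// With a connected driver `db`, these would be applied roughly as:
//   await db.createCollection("checks", checksCollectionOptions);
//   await db.collection("checks").createIndex(perMonitorIndex);
```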
Multi-region scheduling — the gotcha
The hard part of multi-region monitoring isn’t running the checks. It’s making sure every region runs the check at roughly the same wall-clock minute, so you don’t get a confusing flapping pattern when a site goes down in just one region.
We solved it with a deterministic schedule: each monitor’s “minute slot” is
derived from hash(monitorId) % 60. Every region picks up that same minute,
independently, with no coordination needed.
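A minimal TypeScript sketch of that idea follows. The post does not say which hash function is used, so FNV-1a (32-bit) is an assumed stand-in, and slotFor and dueMonitors are hypothetical names. The sketch returns the 0..59 slot value and leaves it to the scheduler to decide how that value maps onto the clock.

```typescript
// FNV-1a (32-bit) is an assumed choice of hash; the post does not say
// which one SiteChecker uses. Any stable hash works, as long as every
// region runs the same one. This version hashes UTF-16 code units,
// which matches byte-wise FNV-1a for ASCII IDs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Slot in the range 0..59, derived only from the monitor ID.
function slotFor(monitorId: string): number {
  return fnv1a(monitorId) % 60;
}

// Hypothetical per-region scheduler tick: pick the monitors whose slot
// matches the current slot. No region name or shared state is involved,
// so every region selects the same set.
function dueMonitors(monitorIds: string[], currentSlot: number): string[] {
  return monitorIds.filter((id) => slotFor(id) === currentSlot);
}
```

Because slotFor depends only on the monitor ID, two schedulers in different regions that call dueMonitors with the same slot get the same list, provided their clocks agree. A side effect of hashing is that monitors spread roughly evenly across the 60 slots instead of all firing at once.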
What’s next
We’re working on:
- 30-second resolution for the Business tier.
- Synthetic user flows — full Playwright scripts instead of just HTTP checks.
- A public status page for SiteChecker itself.
Until next time 👋