19 February 2026
When do costs influence system and architectural decisions?

Have you ever discussed cost during a system design interview?
System design interviews are great at testing whether you know how to scale a system, but they rarely ask the question that matters most in practice, especially at an early-stage or bootstrapped startup: how much does this cost?
This blind spot isn’t accidental. Interview culture rewards demonstrating knowledge of distributed systems, replication strategies, and fault tolerance. Business judgment, the kind that keeps a company alive while you’re still figuring out product-market fit, rarely gets points.
Here’s a concrete example from something I’m building right now.
My service requires running long workflows asynchronously. I had four options on the table:
DBOS — excellent developer experience, deep workflow visibility, robust failure handling.
Temporal — same strengths as DBOS, battle-tested at scale.
BullMQ + Redis — a solid option, but it adds a new service to manage, and in my opinion failure handling and retries aren’t as smooth.
pg-boss — runs on top of Postgres, minimal visibility out of the box, retries are on you.
DBOS and Temporal are genuinely impressive. The developer experience is excellent, you get deep observability into every workflow, and failure handling is largely taken care of. But they’re expensive, and more importantly, the integration is deep. If you ever need to move away from them, you’ll pay a significant price in migration effort and engineering time.
BullMQ with Redis was a reasonable middle ground, but it would mean introducing Redis as a new component — an extra dependency, an extra point of failure, and an extra thing to maintain.
pg-boss wasn’t my first instinct. Visibility is limited, and I’d be responsible for handling retries myself. But it runs on Postgres, which I’m already using as my primary data store, meaning no new infrastructure to provision, monitor, or pay for.
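“Retries are on you” concretely means owning the retry-with-backoff logic yourself. A minimal sketch of the kind of wrapper I mean, in plain Node — the names (`withRetries`, `backoffDelayMs`) and parameters are illustrative, not part of the pg-boss API:

```javascript
// Exponential backoff with a cap: 1s, 2s, 4s, ... up to maxDelayMs.
function backoffDelayMs(attempt, baseMs = 1000, maxDelayMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxDelayMs);
}

// Simple sleep helper so the retry loop can wait between attempts.
function sleepMs(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `fn`, retrying up to `maxAttempts` times with backoff on failure.
// You would call this inside your job handler; the `sleep` parameter is
// injectable so the policy stays testable.
async function withRetries(fn, { maxAttempts = 3, sleep = sleepMs } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

It’s maybe thirty lines of code, but it is code I now own, which is exactly the tradeoff being made here.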
So I chose pg-boss, not because it’s the best tool in isolation, but because it’s the right tool for where I am right now.
That said, it’s worth being clear-eyed about the tradeoff. Limited observability isn’t a minor inconvenience; early-stage is exactly when you’re moving fast and can least afford to spend hours debugging a silent workflow failure. I judged that risk acceptable because my workflows are straightforward and their failures are recoverable, but that calculation deserves to be made explicitly rather than ignored.
The other side of the equation is optionality. If the business grows and the complexity justifies it, migrating to Temporal or DBOS becomes a reasonable investment. But that’s a future problem, funded by the runway I preserved by not over-engineering too early.
Choosing boring, cheap infrastructure that fits what you already have isn’t a compromise. At the right stage, it’s the most important engineering decision you can make.