
We Run Cron Jobs Inside the Bun Server Process (And You Should Too)


Every Monday at 6 AM, the server sends reminder emails for upcoming mentoring sessions.

Every hour, it checks for expired session requests and cleans them up.

Every night at midnight, it purges stale authentication tokens.

Every Friday, it generates weekly earnings reports for mentors.

These jobs run inside the same Bun process that serves the API. No crontab. No Bull. No Agenda. No separate worker process. No Redis queue. Just setInterval with a few hundred lines of scheduling logic.

And before you tell me this is a terrible idea — hear me out. Because the conventional wisdom that says "never run scheduled tasks in your web server" is based on assumptions that don't apply to our architecture.


I – The Conventional Wisdom (And Why It's Usually Right)

The standard advice is clear: don't run cron jobs in your web server process.

The reasoning is sound. In a typical deployment, you have multiple instances of your server behind a load balancer. If each instance runs the same scheduled job, you get duplicate execution. Send one reminder email? No — you sent four, because you have four server instances.

There's also the resource argument. A long-running job (like generating reports or processing bulk emails) consumes CPU and memory that could be serving HTTP requests. If the job spikes and the server slows down, your users experience degraded performance because a background task is hogging resources.

And the reliability argument. If the server crashes mid-job, the job is lost. No retry. No recovery. The weekly report just... doesn't happen.

These are legitimate concerns. For a horizontally-scaled deployment with multiple server instances, external job runners are the right answer.

But we don't have multiple instances.


II – The One-Server Reality

The mentoring platform runs on a single Hetzner VPS. One server. One Elysia.js process. One Bun runtime.

There's no load balancer distributing traffic to multiple instances. There's no horizontal scaling. There's no replica set. It's a monolith — in the best sense of the word — running on a single machine with Coolify and Traefik handling reverse proxying.

In a single-instance deployment, the duplicate execution problem vanishes. There's only one process. It runs the job once.

The resource contention problem is real but manageable. Our scheduled jobs are lightweight — sending a few emails, running a few database queries, generating a small report. None of them take more than a few seconds. None of them consume significant memory. The API serves requests without noticing.

The crash recovery problem is the one genuine concern. If the server crashes during a job, that job doesn't complete. But our jobs are idempotent — running them twice produces the same result as running them once. If the server crashes during the reminder email job and restarts, the next scheduled run will pick up any emails that weren't sent. No data corruption. No duplicate sends.

For our specific deployment — single instance, lightweight jobs, idempotent design — in-process scheduling is simpler, cheaper, and more reliable than any external system.


III – What We're Scheduling

Let me catalog the jobs that run inside the Bun process.

Session reminders. One hour before a scheduled mentoring session, both the mentor and mentee receive an email reminder with session details and a Zoom link. This job runs every 15 minutes, queries for sessions starting within the next 75 minutes that haven't had reminders sent, and dispatches the emails.

Expired request cleanup. When a mentee requests a session with a mentor, the mentor has 48 hours to accept or decline. After that, the request expires automatically. This job runs every hour, finds pending requests older than 48 hours, marks them as expired, and notifies the mentee.

Stale token purge. Magic link tokens, email verification tokens, and password reset tokens have a TTL. After expiration, they're useless but still consume database space. This job runs nightly, deleting all tokens past their expiration time.

Weekly mentor reports. Every Friday, mentors receive an email summary of their week: sessions completed, hours billed, earnings, new mentee connections, and upcoming sessions. This job runs once per week.

Session auto-end. If a session has been running for more than 3 hours (our maximum session length), something went wrong — someone forgot to end it. This job runs every 30 minutes, finds sessions that have exceeded the maximum duration, and ends them automatically with appropriate billing.

Health check pings. Every 5 minutes, the scheduler logs a heartbeat to the database. This serves two purposes: it confirms the scheduler is running, and it provides a time series that monitoring can alert on if heartbeats stop.

Six jobs. Each one is simple. Together, they keep the platform running smoothly without human intervention.


IV – The Scheduler: More Than setInterval

You might think in-process scheduling means writing six setInterval calls and calling it a day.

It's not that simple.

setInterval has a dirty secret: it drifts. JavaScript's event loop is single-threaded. If a long-running synchronous operation blocks the event loop, setInterval callbacks get delayed. Over hours and days, the drift accumulates.

A job scheduled for "every 60 minutes" might run at 60 minutes, then 61 minutes later, then 59 minutes later, then 62 minutes later. After a week, your "hourly" job is running at unpredictable times.

For most of our jobs, minor drift doesn't matter. Whether the token purge runs at midnight or 12:03 AM is irrelevant.

But for session reminders, drift is dangerous. If the reminder job drifts late enough to miss the 60-minute window before a session starts, the reminder doesn't send. The mentee doesn't get their Zoom link email. They miss the session. That's a real-world consequence.

Our scheduler uses drift correction. Instead of scheduling the next run with a fixed interval, it calculates the exact timestamp for the next run and uses setTimeout with the precise delta. After each run, it recalculates the next timestamp based on the intended schedule, not the actual completion time.

This means if a job was supposed to run at 10:00:00 but actually ran at 10:00:03 (because the event loop was busy), the next run is scheduled for 11:00:00 — not 11:00:03. The drift corrects itself on every cycle.
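
To make the mechanism concrete, here's a minimal sketch of drift correction in TypeScript. The function name and log messages are illustrative rather than lifted from our codebase, but the core trick is the same: anchor every run to an intended timestamp and compute each setTimeout delay fresh.

    // Drift-corrected scheduling sketch. Names are illustrative.
    type Job = () => Promise<void>;

    function scheduleAligned(name: string, intervalMs: number, job: Job): void {
      const epoch = Date.now(); // the schedule is anchored here, not to each run's completion time

      const planNext = (): void => {
        const now = Date.now();
        // Next intended run: the first interval boundary after "now", measured from the epoch,
        // not from whenever the last run happened to finish.
        const elapsed = now - epoch;
        const nextAt = epoch + (Math.floor(elapsed / intervalMs) + 1) * intervalMs;

        setTimeout(async () => {
          try {
            await job();
          } catch (err) {
            console.error(`[scheduler] ${name} failed`, err);
          }
          // Recompute from the intended schedule, so a run that fired 3s late
          // doesn't push every future run 3s later.
          planNext();
        }, nextAt - now);
      };

      planNext();
    }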


V – Job Deduplication: The Overlapping Run Problem

What happens if a job takes longer than its interval?

If the token purge is scheduled every 60 minutes and one run takes 90 minutes (because the database is under heavy load), the next run starts while the first one is still going. Now you have two instances of the same job running concurrently.

For idempotent jobs, this is wasteful but not catastrophic. For non-idempotent jobs, it's a bug.

Our scheduler enforces single-execution per job type.

Each job has a running flag. Before a job starts, the scheduler checks the flag. If it's set, the new run is skipped and a warning is logged. When the job completes (successfully or with an error), the flag is cleared.

This is the simplest possible deduplication — a boolean per job. It works because we're single-process. There's no distributed coordination needed. No Redis locks. No database advisory locks. Just a JavaScript variable.
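
In code, the guard is about as small as it sounds. A sketch, with the shared map of flags and the names being mine rather than our actual module:

    // Per-job overlap guard: one boolean per job, no locks. Illustrative names.
    const running = new Map<string, boolean>();

    async function runExclusive(name: string, job: () => Promise<void>): Promise<void> {
      if (running.get(name)) {
        console.warn(`[scheduler] ${name} is still running, skipping this cycle`);
        return;
      }
      running.set(name, true);
      try {
        await job();
      } finally {
        running.set(name, false); // cleared on success or failure
      }
    }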

The skipped-run warning is important. If a job consistently overlaps with its next scheduled run, that's a signal that either the interval is too short or the job is too slow. The warning makes this visible in logs so we can investigate.

In six months of production, we've seen exactly two overlap warnings — both during database maintenance windows when queries were temporarily slow. Normal operation has never triggered an overlap.


VI – Error Isolation: The Nuclear Containment Pattern

This is the most critical design decision in the entire scheduler.

A failed job must never crash the server.

If the email sending service is down and the reminder job throws an unhandled exception, the Elysia server must keep running. If a database query in the token purge job fails with a connection error, the API must keep serving requests.

Our approach: every job runs inside what I call a nuclear containment wrapper. The wrapper catches every possible error — synchronous throws, rejected promises, and even errors in the error handler itself.

The containment wrapper does four things:

  1. Catches the error
  2. Logs it with full context (job name, run timestamp, error message, stack trace)
  3. Records the failure in the health check table (so monitoring can alert)
  4. Ensures the job is rescheduled for its next run (failure doesn't kill the schedule)
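
A sketch of the wrapper, with a hypothetical recordJobFailure standing in for the real health-check-table write:

    // "Nuclear containment" sketch: no error escapes a job run into the server process.
    async function contained(name: string, job: () => Promise<void>): Promise<void> {
      try {
        await job();
      } catch (err) {
        try {
          const message = err instanceof Error ? err.message : String(err);
          const stack = err instanceof Error ? err.stack : undefined;
          console.error(`[scheduler] ${name} failed at ${new Date().toISOString()}: ${message}`, stack);
          await recordJobFailure(name, message); // health-check table, so monitoring can alert
        } catch (loggingErr) {
          // Even the error handler must not throw.
          console.error(`[scheduler] could not record failure for ${name}`, loggingErr);
        }
      }
      // No retry, no re-throw. The caller reschedules the next run regardless of the outcome.
    }

    // Hypothetical stand-in for the real database write.
    async function recordJobFailure(job: string, message: string): Promise<void> { /* insert a failure row */ }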

What the containment wrapper does NOT do: retry. If a job fails, it fails. The next scheduled run will try again. We deliberately chose not to implement immediate retries for two reasons.

First, if a job fails because of a transient issue (email service blip, database connection drop), the issue likely resolved itself by the next scheduled run. An immediate retry would probably fail too, and exponential backoff for a job that runs every 15 minutes is pointless.

Second, retry logic adds complexity that we'd need to test and maintain. The scheduling intervals are already short enough that "wait for the next run" is a perfectly acceptable retry strategy.

The exception is the session reminder job. For reminders, missing a send window has real consequences. If the reminder job fails, it runs a single immediate retry after 60 seconds. If that fails too, it logs a critical alert. This is the only job with retry logic, because it's the only job where "wait for the next run" isn't acceptable.


VII – The Email Sending Pattern

Four of our six jobs send emails. The email sending pattern is standardized across all of them.

Emails are not sent inline during the job. The job prepares a list of emails to send — recipient, template, data — and hands them to an email dispatch function. The dispatch function processes them sequentially with a small delay between each send.

Why sequentially with a delay? Rate limiting. Our SMTP provider, like most providers, throttles senders who fire hundreds of emails simultaneously. Sending one email every 200ms keeps us well within rate limits and avoids triggering spam filters.

Why not a queue? Because queues require infrastructure. A Redis queue, a database-backed queue, a file-based queue — each adds a moving part. For our volume (50-200 emails per day), sequential processing with a delay is sufficient. The entire daily email volume processes in under a minute.

The dispatch function is also responsible for recording the send in the database. Each email gets a log entry: recipient, template, send time, and status (sent, failed, bounced). This audit trail is essential for debugging "I didn't get the email" support requests.
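
The whole pattern fits in one short function. Here's a sketch; EmailTask, sendEmail, and logEmail are stand-ins for our real template types, SMTP call, and audit-log write:

    // Sequential dispatch: one email every 200ms, every send logged.
    interface EmailTask {
      to: string;
      template: string;
      data: Record<string, unknown>;
    }

    const SEND_DELAY_MS = 200;

    async function dispatchEmails(tasks: EmailTask[]): Promise<void> {
      for (const task of tasks) {
        let status: "sent" | "failed" = "sent";
        try {
          await sendEmail(task); // SMTP call
        } catch (err) {
          status = "failed";
          console.error(`[email] send to ${task.to} failed`, err);
        }
        await logEmail(task, status); // the audit trail behind "I didn't get the email" tickets
        await new Promise((resolve) => setTimeout(resolve, SEND_DELAY_MS)); // stay under provider rate limits
      }
    }

    // Placeholder integrations; the real versions talk to SMTP and the database.
    async function sendEmail(task: EmailTask): Promise<void> { /* ... */ }
    async function logEmail(task: EmailTask, status: "sent" | "failed"): Promise<void> { /* ... */ }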

Bounce handling is passive. We don't process bounces in real-time. The SMTP provider records bounces, and we review them weekly. If an email bounces repeatedly, we flag the user's email address and stop sending to it. This is manual because bounces are rare (under 0.5% of sends) and the complexity of automated bounce handling isn't justified by the volume.


VIII – The Timing Strategy: When Jobs Run

Not all jobs can run at the same time. If six jobs fire simultaneously, the server briefly spikes in CPU and memory. Even though each job is lightweight individually, the aggregate impact is noticeable.

We stagger job execution.

The session reminder job runs at minutes :00, :15, :30, :45. The expired request cleanup runs at minute :10. The stale token purge runs at 00:05 (five minutes past midnight). The weekly report runs Fridays at 06:00. The session auto-end runs at minutes :20 and :50. The health check runs at minutes :05, :10, :15, :20, :25, etc.

Aside from the five-minute heartbeat, which only writes a single row, no two jobs start at the same minute. This is a simple rule that eliminates resource contention without any complex scheduling logic.

The stagger pattern also helps with debugging. If the logs show a spike at :10, we know it's the request cleanup job. If the spike is at :00, it's the reminder job. The timing fingerprint identifies the job.
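
If it helps to see the whole fingerprint in one place, here's the stagger written out as data. The shape is illustrative; our scheduler takes these values as plain constants rather than a config file:

    const STAGGER = {
      sessionReminders:      { every: "15m", minuteOffsets: [0, 15, 30, 45] },
      expiredRequestCleanup: { every: "1h",  minuteOffsets: [10] },
      staleTokenPurge:       { dailyAt: "00:05" },
      weeklyMentorReports:   { weeklyAt: "Friday 06:00" },
      sessionAutoEnd:        { every: "30m", minuteOffsets: [20, 50] },
      healthHeartbeat:       { every: "5m" },
    } as const;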


IX – Timezone Handling: The Silent Killer

Here's a bug that takes three months to discover.

Your server runs in UTC. Your job schedules are in UTC. Your users are in 40 timezones. When should you send a session reminder?

The answer is not "one hour before the session in UTC." The answer is "one hour before the session in the mentee's local timezone."

Wait — that doesn't make sense. A session starts at a specific instant in time, regardless of timezone. One hour before that instant is the same instant everywhere in the world.

True. But when you check matters.

If the session starts at 14:00 UTC and you check for reminders at 13:00 UTC, the reminder fires. But if the reminder job drifts to 13:02, and your query says "sessions starting in the next 58-60 minutes" (because you're trying to avoid duplicates), you might miss it.

Our approach: generous windows with deduplication.

The reminder job checks for sessions starting in the next 75 minutes that haven't had a reminder sent. The 75-minute window (instead of 60) gives a 15-minute buffer for drift, restarts, and query timing. The "hasn't had a reminder sent" check prevents duplicates if the job runs multiple times within the window.

The deduplication is the key. The window can be generous because duplicates are impossible. And a generous window means drift, restarts, and edge cases can't cause missed reminders.
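
The window-plus-dedup check boils down to a small pure function. A sketch with illustrative types; our real version does this filtering in the database query, but the logic is the same:

    // Wide window + dedup as a pure function. Types and names are illustrative.
    interface UpcomingSession {
      id: string;
      startsAt: Date;
      reminderSentAt: Date | null;
    }

    // 75 minutes: the 60-minute reminder target plus a 15-minute buffer for drift and restarts.
    const REMINDER_WINDOW_MS = 75 * 60 * 1000;

    function sessionsNeedingReminder(sessions: UpcomingSession[], now: Date): UpcomingSession[] {
      return sessions.filter((s) => {
        const msUntilStart = s.startsAt.getTime() - now.getTime();
        return msUntilStart > 0 && msUntilStart <= REMINDER_WINDOW_MS && s.reminderSentAt === null;
      });
    }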

This pattern — wide windows with idempotent execution — applies to all our scheduled jobs, not just reminders. It's the fundamental principle that makes in-process scheduling reliable.


Want to simplify your scheduling architecture?

The mentoring platform at mentoring.oakoliver.com runs all its scheduled tasks — reminders, cleanup, reports, and more — inside a single Bun process. No external job runner. No message queue. No Redis. If you're building on Bun and want to discuss architecture decisions around scheduling, background jobs, or operational simplicity, book a session at mentoring.oakoliver.com.

Or explore the full Oak Oliver engineering blog at oakoliver.com.


X – Monitoring: How We Know Jobs Are Running

An in-process scheduler has an observability challenge: if the scheduler dies (because the server crashed), there's nothing left running to report that it died.

We solve this with external monitoring.

The health check job writes a heartbeat to the database every 5 minutes. An external monitoring service (a simple cron job on a separate machine, ironically) queries the database every 10 minutes and checks for a recent heartbeat. If the most recent heartbeat is older than 15 minutes, it sends an alert.

This is the only external cron in our system. And its only job is to monitor the internal scheduler. If the external cron fails, we notice manually because the monitoring dashboard goes stale. It's a simple, two-layer watchdog system.
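
The watchdog itself is a few lines. A sketch, with fetchLatestHeartbeat and sendAlert standing in for the real database query and alerting hook:

    // Runs on a separate machine via plain cron, every 10 minutes.
    const MAX_HEARTBEAT_AGE_MS = 15 * 60 * 1000;

    async function checkScheduler(): Promise<void> {
      const lastHeartbeat = await fetchLatestHeartbeat();
      const ageMs = Date.now() - lastHeartbeat.getTime();
      if (ageMs > MAX_HEARTBEAT_AGE_MS) {
        await sendAlert(`Scheduler heartbeat is ${Math.round(ageMs / 60_000)} minutes old`);
      }
    }

    // Placeholders; the real versions read the heartbeat table and page us.
    async function fetchLatestHeartbeat(): Promise<Date> { return new Date(); /* newest heartbeat row */ }
    async function sendAlert(message: string): Promise<void> { /* notify */ }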

Each job also logs its execution time, result (success/failure), and any notable metrics (number of emails sent, number of tokens purged, number of sessions auto-ended). These logs feed into our dashboard where we can see:

  • Last successful run time for each job
  • Average execution duration over the past week
  • Failure count over the past 24 hours
  • Next scheduled run time

The dashboard makes the scheduler visible. Without it, the scheduler is a silent background process that you only think about when something goes wrong. With it, the scheduler is a first-class citizen of the application that you can monitor, debug, and optimize.


XI – Graceful Shutdown: The Often-Forgotten Edge Case

When the server shuts down — for a deployment, a restart, or a crash recovery — running jobs need to complete gracefully.

Our shutdown handler:

  1. Stops scheduling new job runs (disables all timers)
  2. Waits for currently running jobs to complete (with a 30-second timeout)
  3. If jobs don't complete within 30 seconds, logs a warning and proceeds with shutdown

Step 2 is the critical one. Without it, a job that's halfway through sending emails would be killed mid-send. The next scheduled run would re-process the same batch without knowing which emails were already sent. With idempotent design and the "reminder already sent" flag, this wouldn't cause duplicates — but it would waste time re-processing.

The 30-second timeout is a safety valve. If a job is genuinely stuck (deadlocked database query, unresponsive SMTP server), we can't wait forever. The timeout lets the shutdown proceed, and the stuck job is logged for investigation.

In Bun, the process listens for SIGTERM and SIGINT signals. Coolify sends SIGTERM during deployments. The shutdown handler catches it, runs the graceful shutdown sequence, and then exits cleanly.
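
A sketch of the handler. The timers array and runningJobs set stand in for the scheduler's real internal state:

    const timers: Array<ReturnType<typeof setTimeout>> = [];
    const runningJobs = new Set<Promise<void>>();

    async function shutdown(signal: string): Promise<void> {
      console.log(`[scheduler] received ${signal}, draining jobs`);

      // 1. Stop scheduling new runs.
      for (const timer of timers) clearTimeout(timer);

      // 2. Wait for in-flight jobs, but no longer than 30 seconds.
      const drained = Promise.all([...runningJobs]).then(() => "drained" as const);
      const timedOut = new Promise<"timeout">((resolve) => setTimeout(() => resolve("timeout"), 30_000));
      const outcome = await Promise.race([drained, timedOut]);

      // 3. Warn if something was still stuck, then exit either way.
      if (outcome === "timeout") {
        console.warn("[scheduler] jobs still running after 30s, shutting down anyway");
      }
      process.exit(0);
    }

    process.on("SIGTERM", () => void shutdown("SIGTERM"));
    process.on("SIGINT", () => void shutdown("SIGINT"));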


XII – The Anti-Pattern: Database-Backed Job Queues

Let me address the solution most developers would reach for instead.

A database-backed job queue (like Agenda for MongoDB, or a custom queue table in PostgreSQL) stores jobs as rows. A worker process polls the table for due jobs, locks them, processes them, and marks them as complete.

For our use case, this is over-engineering.

A database queue solves the distributed coordination problem — multiple workers can pull from the same queue without duplicate execution. But we don't have multiple workers. We have one process.

A database queue provides persistence — if the server crashes, pending jobs survive in the database and get picked up after restart. But our jobs are time-triggered, not event-triggered. There's nothing to "pick up" — the next scheduled run will execute the job.

A database queue adds observability — you can query the job table to see what's pending, what's running, what failed. But our in-process scheduler already provides this via logs and the health dashboard.

Every benefit of a database queue is already solved by simpler means in our architecture. And the queue adds real costs: a polling interval (latency), database load (queries every few seconds), lock contention (row locking), and an entire subsystem to maintain.

I'm not saying database queues are bad. For event-driven jobs (process this upload, send this webhook, handle this payment callback), queues are the right tool. But for time-driven jobs on a single-instance server, setInterval with drift correction is simpler and equally reliable.


XIII – The Redis Queue Temptation

The other common suggestion: use Redis with Bull or BullMQ.

Redis is fast. Bull has great DX. BullMQ has a beautiful dashboard. The temptation is strong.

But Redis is another service to run, monitor, and maintain.

Our VPS runs 6 services via Coolify. Adding Redis means 7 services. Redis needs memory allocation, persistence configuration, connection pool management, and its own monitoring. If Redis goes down, the job queue stops — and we need alerts for that too.

For a system processing millions of jobs per day, Redis is worth the operational overhead. For a system processing 50-200 jobs per day, it's a burden with no corresponding benefit.

The operational complexity of your infrastructure should scale with your actual workload, not with your imagined future workload. If we ever reach the scale where in-process scheduling isn't sufficient — thousands of jobs per minute, multiple server instances, complex dependency chains between jobs — we'll migrate to a proper queue. But we'll add that complexity when it's needed, not before.


XIV – Testing the Scheduler

Scheduled jobs are notoriously hard to test. You can't wait 15 minutes for a real interval to fire in a test suite.

We test at three levels.

Level 1: Job logic in isolation. Each job's core logic is a pure function. The reminder job's logic takes a list of upcoming sessions and returns a list of emails to send. We test this function with various inputs — sessions with reminders already sent, sessions too far in the future, sessions in the past. No scheduler involved.
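
For example, a Level 1 test of the reminder window filter sketched back in section IX might look like this with bun:test. The import path and fixture data are made up for illustration:

    import { describe, expect, test } from "bun:test";
    import { sessionsNeedingReminder } from "./reminders"; // hypothetical module housing the pure filter

    describe("sessionsNeedingReminder", () => {
      const now = new Date("2025-01-06T10:00:00Z");

      test("includes a session starting in 50 minutes with no reminder sent", () => {
        const due = sessionsNeedingReminder(
          [{ id: "a", startsAt: new Date("2025-01-06T10:50:00Z"), reminderSentAt: null }],
          now,
        );
        expect(due.map((s) => s.id)).toEqual(["a"]);
      });

      test("skips a session whose reminder was already sent", () => {
        const due = sessionsNeedingReminder(
          [{ id: "b", startsAt: new Date("2025-01-06T10:50:00Z"), reminderSentAt: new Date("2025-01-06T09:50:00Z") }],
          now,
        );
        expect(due).toEqual([]);
      });

      test("skips a session more than 75 minutes out", () => {
        const due = sessionsNeedingReminder(
          [{ id: "c", startsAt: new Date("2025-01-06T11:30:00Z"), reminderSentAt: null }],
          now,
        );
        expect(due).toEqual([]);
      });
    });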

Level 2: Scheduler mechanics. The drift correction, deduplication, and error isolation are tested with fake timers. We advance time programmatically, trigger job runs, and verify that the scheduler behaves correctly — runs at the right times, skips overlapping runs, catches errors without crashing.

Level 3: Integration tests. A test starts the scheduler with accelerated intervals (seconds instead of minutes), lets it run through several cycles, and verifies the end results — emails in the test mailbox, expired requests in the database, purged tokens gone.

Level 1 tests run on every commit. They're fast (milliseconds) and catch logic bugs.

Level 2 tests run on every PR. They're medium-speed (a few seconds) and catch scheduling bugs.

Level 3 tests run nightly. They're slow (30-60 seconds) and catch integration bugs.

This testing pyramid gives us confidence that the scheduler works correctly without making the test suite slow.


XV – What We'd Add If We Needed To Scale

If the mentoring platform grows to the point where in-process scheduling isn't sufficient, here's the migration path we'd follow.

Step 1: Extract job logic into separate modules. Our jobs are already pure functions, so this is mostly reorganization.

Step 2: Add a thin queue interface. Instead of the scheduler calling job functions directly, it pushes job invocations onto a queue. The queue initially is an in-memory array (functionally identical to the current system).

Step 3: Replace the in-memory queue with Redis. Now jobs are dispatched to Redis. A separate worker process (or multiple workers) pulls from Redis and executes jobs.

Step 4: Move to multiple server instances. The scheduler runs on a single designated instance (or uses a distributed lock). Workers run on all instances.

Each step is incremental. We don't need to rewrite the scheduling system. We just add a layer of indirection at each step. The job logic never changes — only the dispatch mechanism.
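
Step 2 is small enough to sketch now. The interface below is illustrative; swapping InMemoryQueue for a Redis-backed implementation in Step 3 changes the dispatch mechanism without touching the job logic:

    interface JobQueue {
      enqueue(jobName: string, payload?: unknown): Promise<void>;
    }

    // Step 2: an in-memory "queue" that drains immediately. Functionally identical to
    // today's direct calls, but the scheduler now only talks to the JobQueue interface.
    class InMemoryQueue implements JobQueue {
      private pending: Array<{ jobName: string; payload?: unknown }> = [];

      async enqueue(jobName: string, payload?: unknown): Promise<void> {
        this.pending.push({ jobName, payload });
        while (this.pending.length > 0) {
          const next = this.pending.shift()!;
          await runJob(next.jobName, next.payload);
        }
      }
    }

    // Hypothetical dispatcher mapping job names to the existing job functions.
    async function runJob(jobName: string, payload?: unknown): Promise<void> { /* look up and execute */ }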

This is the benefit of starting simple. The migration path from "setInterval in a monolith" to "Redis-backed distributed queue" is a series of small, low-risk changes. The migration path from "Redis-backed distributed queue" to "setInterval in a monolith" is... well, nobody's doing that. But simplifying is always an option.


XVI – Lessons from Production

Six months of running in-process scheduling on a production server. Here's what we learned.

Idempotency is the foundation. Every other design decision builds on it. Drift correction, deduplication, wide windows, graceful shutdown — all of these are safety nets. Idempotency is the trampoline. If a job runs twice, nothing bad happens. This single property eliminates more bugs than all the safety nets combined.

Logging is your debugging UI. Without a fancy dashboard (we added ours in month three), the only way to know what the scheduler was doing was reading logs. Structured, timestamped, searchable logs are non-negotiable for background processes.

Generous windows beat precise timing. A job that runs "sometime between 6:00 and 6:15" and checks for work to do is more reliable than a job that must run at exactly 6:00. Precision creates fragility. Generosity creates resilience.

The health heartbeat is the most important job. It's the simplest job — write a timestamp to the database. But it's the one that tells you everything else is working. If the heartbeat stops, something is fundamentally wrong. Everything else is a detail.

Server restarts are the stress test. We deploy several times a week. Each deployment triggers a shutdown and restart. The graceful shutdown sequence runs every time. The scheduler reinitializes every time. Jobs that were due during the restart window fire on the next cycle. Six months of deployments, zero missed jobs.


XVII – The Philosophical Argument

There's a deeper principle at work here that goes beyond scheduling.

The simplest architecture that solves the problem is the best architecture.

Not the most elegant. Not the most scalable. Not the most impressive on a system design interview whiteboard. The simplest.

In-process scheduling is "boring." It doesn't use a message broker. It doesn't have a retry queue with exponential backoff and dead letter routing. It doesn't scale horizontally. It doesn't have a web dashboard with real-time job monitoring (well, it does now, but it didn't for three months and nothing caught fire).

But it works. It's been running in production for six months. It's sent thousands of reminder emails, expired hundreds of stale requests, purged millions of dead tokens, and generated dozens of mentor reports. Zero missed jobs. Zero duplicate sends. Zero 3 AM pages.

The "proper" way to do this — Redis queue, separate worker process, distributed locking — would also work. But it would have taken longer to build, longer to debug, and introduced three more services that could fail.

Simplicity isn't a compromise. It's a feature.


XVIII – The Question That Guides Every Decision

When I'm tempted to add infrastructure — a queue, a cache, a separate service, a coordination layer — I ask myself one question:

What failure am I preventing that has actually happened, or will definitely happen at my current scale?

Not "might happen if we 10x." Not "could theoretically happen under unusual conditions." Has it happened, or will it definitely happen?

If the answer is "it might happen someday," I don't add the infrastructure. I add monitoring so I'll know if it starts happening. And I design the system so migration is easy when the time comes.

If the answer is "it's happening right now," I add the infrastructure immediately.

This approach means we build less, operate less, and maintain less. And when we do add complexity, it's in response to a real problem — not a hypothetical one.

So here's my question for you:

How many services in your infrastructure exist to solve problems you've never actually had?

If the answer is more than zero, each one is a liability disguised as a precaution.

– Antonio

"Simplicity is the ultimate sophistication."