Handling Worker Crashes with Heartbeat-Based Leases

Introduction

In a typical worker-based system, workers process jobs. Crazy, I know.

Before processing a job, however, a worker must determine whether that job is available to be processed.

This is usually done by assigning each job a status:
– pending
– running
– done
– failed

The basic workflow is simple:
● If a job is pending, a worker claims it and changes its status to running.
● If a job is already running or done, other workers leave it alone.
● If a job is failed, the system triggers whatever retry mechanism is in place.

That is the easy part.

The Real Problem: Workers Crash

In a perfect world, processing a job would look like this:

Step 1: Claim the job
Step 2: Process the job
Step 3: Mark the job as done or failed

Unfortunately, we do not live in a perfect world.

A worker may crash during step 2.

The job remains marked as running, even though no worker is actually processing it anymore.

Because the job still appears to be running, no other worker will pick it up.

The job is now stuck forever.

There is another failure scenario.

The worker may finish processing the job successfully, but fail while saving the final result during step 3.

Once again, the job remains stuck in the running state.

Why Not Use One Transaction?

At first glance, the obvious solution is to wrap the entire operation in a single database transaction.

That usually does not work.

Processing a job may:
● take several minutes or hours
● call external APIs
● send messages to other systems
● run on a separate machine
● depend on services outside your database

Holding a database transaction open during all of that work would be inefficient and unsafe.

The claim operation and the final update usually need to happen in separate transactions.

The First Solution: Add a Lease

A common solution is to give each claimed job a lease.

When a worker claims a job, it sets a lease_expires_at timestamp.

If the worker does not finish the job before that timestamp, the system assumes that the worker may have crashed.

Another worker can then reclaim the job.

A recovery query might look like this:

SELECT *
FROM jobs
WHERE status = &#39;running&#39;
AND lease_expires_at &lt;= NOW()
LIMIT 1;

In plain English: Give me one job that is still marked as running, but whose lease has expired.

When Fixed Leases Work Well

A fixed lease works well when the expected processing time is predictable.

For example, consider a short database operation:

Start a transaction
Check the available inventory
Reserve the required items
Create the order record
Commit the transaction

These operations should normally complete within milliseconds.

If the job is still running after several seconds, it is reasonable to assume that something went wrong.

The Problem with Fixed Lease Durations

Fixed leases become more difficult when processing time is unpredictable.

A job may take:

5 seconds
2 minutes
30 minutes
3 hours

In this situation, choosing the right lease duration becomes difficult.

If the Lease Is Too Short

A healthy worker may still be processing the job when the lease expires.

Another worker may reclaim the same job.

Now both workers are processing it at the same time.

This can cause duplicate work.

If the Lease Is Too Long

A crashed worker may leave a job stuck for a long time.

The system will not reclaim the job until the lease expires.

Recovery becomes unnecessarily slow.

The Better Solution: Heartbeat-Based Leases

For jobs with unpredictable durations, use a heartbeat-based lease.

When a worker claims a job, it starts a goroutine that runs for as long as the worker is processing that job.

The goroutine periodically extends the lease expiry time.

As long as the worker remains healthy:

Worker is alive
↓

Heartbeat goroutine is alive

↓

Lease keeps getting extended

↓

Job remains safely assigned to the worker

If the worker crashes:

Worker crashes
↓
Heartbeat goroutine stops
↓
Lease is no longer extended
↓
Lease expires
↓
Another worker can reclaim the job

Go Example

go func(ctx context.Context, jobID uuid.UUID) {
  ticker := time.NewTicker(20 * time.Second)
  defer ticker.Stop()
  for {
  select {
  case &lt;-ctx.Done():
    return
  
   case &lt;-ticker.C:
    if err := svc.JobItemProcessor.RefreshClaimHeartbeat(ctx, jobID); err != nil {
      return
    }
   }
  }
}(ctx, claimed.Item.ID.Bytes)

The heartbeat runs every 20 seconds.

Each heartbeat extends the lease into the future.

SQL Example: Refreshing the Lease

The heartbeat method might run a query like this:

UPDATE jobs
SET lease_expires_at = NOW() + INTERVAL '60 seconds'
WHERE id = $1
AND status = 'running';

This means: Keep this job assigned to the current worker for another 60 seconds.

SQL Example: Finding Expired Jobs

A separate recovery process can periodically search for expired leases:

SELECT * FROM jobs
WHERE status = 'running'
AND lease_expires_at <= NOW()
LIMIT 1;

This means: Find a job that still appears to be running, but has stopped sending heartbeats.

Choosing the Timing Values

The lease duration must be longer than the heartbeat interval.

For example:

Heartbeat interval: 20 seconds
Lease duration: 60 seconds

This gives the system enough time to tolerate temporary delays.
It also allows crashed jobs to be recovered reasonably quickly.

Conclusion

Here is a short conclusion for the article:

Conclusion

Managing unpredictable background jobs requires a system that can gracefully handle worker crashes without causing unnecessary delays or duplicate work. While fixed-duration leases fail when processing times vary widely, heartbeat-based leases provide a resilient solution. By allowing healthy workers to periodically extend their own leases, the system ensures active jobs remain safely assigned while ensuring that crashed workers are quickly detected and their jobs automatically recovered.

This article was written by Ahmad Adel . Ahmad is a freelance writer and also a backend developer.

Handling Worker Crashes with Heartbeat-Based Leases

Introduction

The Real Problem: Workers Crash

Why Not Use One Transaction?

The First Solution: Add a Lease

When Fixed Leases Work Well

The Problem with Fixed Lease Durations

If the Lease Is Too Short

If the Lease Is Too Long

The Better Solution: Heartbeat-Based Leases

Go Example

SQL Example: Refreshing the Lease

SQL Example: Finding Expired Jobs

Choosing the Timing Values

Conclusion

Conclusion

Related Articles

Process: A Tale of Code in Motion

Stack vs. Heap

Welcome to our Chatbox