Introduction
In a typical worker-based system, workers process jobs. Crazy, I know.
Before processing a job, however, a worker must determine whether that job is available to be processed.
This is usually done by assigning each job a status:
– pending
– running
– done
– failed
The basic workflow is simple:
● If a job is pending, a worker claims it and changes its status to running.
● If a job is already running or done, other workers leave it alone.
● If a job has failed, the system triggers whatever retry mechanism is in place.
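The workflow above can be sketched as a small in-memory model. The names Status, Job, and TryClaim are hypothetical and chosen only for illustration. The sketch ignores concurrency: in a database-backed system the check and the status change would have to happen atomically, for example in a single UPDATE with a WHERE status = 'pending' condition.

```go
package main

import "fmt"

// Status mirrors the four job states listed above.
type Status string

const (
	Pending Status = "pending"
	Running Status = "running"
	Done    Status = "done"
	Failed  Status = "failed"
)

// Job is a hypothetical in-memory stand-in for a row in a jobs table.
type Job struct {
	ID     int
	Status Status
}

// TryClaim moves a pending job to running and reports whether the claim
// succeeded. Jobs in any other state are left untouched.
func TryClaim(j *Job) bool {
	if j.Status != Pending {
		return false
	}
	j.Status = Running
	return true
}

func main() {
	j := &Job{ID: 1, Status: Pending}
	fmt.Println(TryClaim(j), j.Status) // true running: the first worker claims the job
	fmt.Println(TryClaim(j), j.Status) // false running: a second worker leaves it alone
}
```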
That is the easy part.
The Real Problem: Workers Crash
In a perfect world, processing a job would look like this:
Step 1: Claim the job
Step 2: Process the job
Step 3: Mark the job as done or failed
Unfortunately, we do not live in a perfect world.
A worker may crash during step 2.
The job remains marked as running, even though no worker is actually processing it anymore.
Because the job still appears to be running, no other worker will pick it up.
The job is now stuck forever.
There is another failure scenario.
The worker may finish processing the job successfully, but fail while saving the final result during step 3.
Once again, the job remains stuck in the running state.
Why Not Use One Transaction?
At first glance, the obvious solution is to wrap the entire operation in a single database transaction.
That usually does not work.
Processing a job may:
● take several minutes or hours
● call external APIs
● send messages to other systems
● run on a separate machine
● depend on services outside your database
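Holding a database transaction open during all of that work would be inefficient and unsafe: it ties up a connection and any locks it holds for the whole run, and rolling back the transaction cannot undo calls that have already reached external systems.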
The claim operation and the final update usually need to happen in separate transactions.
The First Solution: Add a Lease
A common solution is to give each claimed job a lease.
When a worker claims a job, it sets a lease_expires_at timestamp.
If the worker does not finish the job before that timestamp, the system assumes that the worker may have crashed.
Another worker can then reclaim the job.
A recovery query might look like this:
SELECT *
FROM jobs
WHERE status = 'running'
  AND lease_expires_at <= NOW()
LIMIT 1;
In plain English: Give me one job that is still marked as running, but whose lease has expired.
When Fixed Leases Work Well
A fixed lease works well when the expected processing time is predictable.
For example, consider a short database operation:
- Start a transaction
- Check the available inventory
- Reserve the required items
- Create the order record
- Commit the transaction
These operations should normally complete within milliseconds.
If the job is still running after several seconds, it is reasonable to assume that something went wrong.
The Problem with Fixed Lease Durations
Fixed leases become more difficult when processing time is unpredictable.
A job may take:
● 5 seconds
● 2 minutes
● 30 minutes
● 3 hours
In this situation, choosing the right lease duration becomes difficult.
If the Lease Is Too Short
A healthy worker may still be processing the job when the lease expires.
Another worker may reclaim the same job.
Now both workers are processing it at the same time.
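This causes duplicate work and, if the job has side effects such as calling an external API or sending a message, duplicate side effects as well.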
If the Lease Is Too Long
A crashed worker may leave a job stuck for a long time.
The system will not reclaim the job until the lease expires.
Recovery becomes unnecessarily slow.
The Better Solution: Heartbeat-Based Leases
For jobs with unpredictable durations, use a heartbeat-based lease.
When a worker claims a job, it starts a goroutine that runs for as long as the worker is processing that job.
The goroutine periodically extends the lease expiry time.
As long as the worker remains healthy:
Worker is alive
↓
Heartbeat goroutine is alive
↓
Lease keeps getting extended
↓
Job remains safely assigned to the worker
If the worker crashes:
Worker crashes
↓
Heartbeat goroutine stops
↓
Lease is no longer extended
↓
Lease expires
↓
Another worker can reclaim the job
Go Example
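// ctx should be cancelled when the job finishes, so the heartbeat stops with it.
go func(ctx context.Context, jobID uuid.UUID) {
	ticker := time.NewTicker(20 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// The job finished or the worker is shutting down.
			return
		case <-ticker.C:
			// Push lease_expires_at further into the future.
			if err := svc.JobItemProcessor.RefreshClaimHeartbeat(ctx, jobID); err != nil {
				// The lease could not be extended. Stop heartbeating; the job
				// becomes reclaimable once the current lease expires.
				return
			}
		}
	}
}(ctx, claimed.Item.ID.Bytes)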
The heartbeat runs every 20 seconds.
Each heartbeat extends the lease into the future.
SQL Example: Refreshing the Lease
The heartbeat method might run a query like this:
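UPDATE jobs
SET lease_expires_at = NOW() + INTERVAL '60 seconds'
WHERE id = $1
  AND status = 'running';
This means: Keep this job leased for another 60 seconds. As written, the query matches only on the job ID and status, so it does not check that the caller is still the worker holding the lease.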
SQL Example: Finding Expired Jobs
A separate recovery process can periodically search for expired leases:
SELECT *
FROM jobs
WHERE status = 'running'
  AND lease_expires_at <= NOW()
LIMIT 1;
This means: Find a job that still appears to be running, but has stopped sending heartbeats.
Choosing the Timing Values
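The lease duration must be longer than the heartbeat interval, and in practice a few times longer, so that one slow or failed heartbeat does not cause a healthy worker to lose its job.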
For example:
Heartbeat interval: 20 seconds
Lease duration: 60 seconds
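With these values, a worker can miss one heartbeat and still renew before the lease runs out, which absorbs temporary delays.
A crashed worker's job becomes reclaimable no later than 60 seconds after its last successful heartbeat.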
Conclusion
Managing background jobs with unpredictable durations requires a system that handles worker crashes gracefully, without unnecessary delays or duplicate work. Fixed-duration leases force a trade-off between the two when processing times vary widely; heartbeat-based leases avoid it. Because healthy workers periodically extend their own leases, active jobs stay safely assigned, while crashed workers are detected quickly and their jobs are recovered automatically.
This article was written by Ahmad Adel. Ahmad is a freelance writer and also a backend developer.