The Cursor + Lease Pattern

Distributed systems fail.

Processes crash. Containers restart. Nodes disappear. Machines get patched. Pods get evicted. Networks split and heal.

None of that is exceptional. It is the normal operating condition of a distributed system.

Now imagine you are sending an emergency notification to millions of users in a region. The job starts. It processes hundreds of thousands of records. Everything looks healthy.

Then the worker crashes halfway through.

What happens next?

Do you start from the beginning and risk sending duplicates?
Do you try to guess where you left off?
Do you hope you didn’t skip anyone?

Large-scale batch work forces you to confront a simple requirement: progress must survive failure.

This is where the Cursor + Lease pattern becomes essential.

The Real Constraint: Progress Cannot Live in Memory

Suppose you need to notify every user in a region.

Loading all users into memory is not realistic. Processing them synchronously in an API request is not safe. Restarting from the beginning on every crash becomes increasingly expensive and error-prone as the dataset grows.

What you actually need is more fundamental:

A way to record progress.
A way to ensure only one worker owns the job at a time.
A way to resume safely after failure.
A way to tolerate duplicates when necessary.

The Cursor + Lease pattern addresses those needs directly.

The Cursor: Remember Where You Were

The first part of the pattern is the cursor.

Instead of treating a dataset as a single monolithic query, you process it in ordered batches and persist your position after each batch.

Rather than:

SELECT * FROM users WHERE region = 'North';

you page through deterministically:

SELECT user_id
FROM users
WHERE region = 'North'
  AND user_id > @lastProcessedId
ORDER BY user_id
LIMIT 10000;

After each batch, you store lastProcessedId in a durable table. That value becomes your checkpoint.

If the worker crashes, it does not need to rediscover where it left off. The cursor already knows.

The next worker simply resumes from the stored position.

The dataset becomes a path you walk incrementally instead of a cliff you must climb in one attempt.
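
As a rough illustration, here is a minimal Python sketch of that walk, using an in-memory SQLite database in place of the real store. The trimmed-down table layouts, the job ID 'job-1', and the function names next_batch and process_one_batch are assumptions made for the example, not part of any particular library.

```python
import sqlite3


def next_batch(conn, region, after_user_id, limit):
    """Keyset page: the next `limit` user IDs strictly after the cursor."""
    rows = conn.execute(
        "SELECT user_id FROM users "
        "WHERE region = ? AND user_id > ? "
        "ORDER BY user_id LIMIT ?",
        (region, after_user_id, limit),
    ).fetchall()
    return [row[0] for row in rows]


def process_one_batch(conn, job_id, limit, handle):
    """Read the stored cursor, handle one batch, then persist the new cursor.

    Returns False once there is nothing left to process.
    """
    region, cursor = conn.execute(
        "SELECT region, cursor_user_id FROM fanout_jobs WHERE job_id = ?",
        (job_id,),
    ).fetchone()
    batch = next_batch(conn, region, cursor, limit)
    if not batch:
        return False
    handle(batch)
    # The checkpoint lives in the database, not in this process.
    conn.execute(
        "UPDATE fanout_jobs SET cursor_user_id = ? WHERE job_id = ?",
        (batch[-1], job_id),
    )
    conn.commit()
    return True


# Hypothetical data: 25 users in 'North' plus one in 'South'.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE fanout_jobs (
        job_id TEXT PRIMARY KEY,
        region TEXT NOT NULL,
        cursor_user_id INTEGER NOT NULL DEFAULT 0
    );
    """
)
conn.executemany(
    "INSERT INTO users (user_id, region) VALUES (?, ?)",
    [(i, "North") for i in range(1, 26)] + [(100, "South")],
)
conn.execute("INSERT INTO fanout_jobs (job_id, region) VALUES ('job-1', 'North')")
conn.commit()

seen = []
process_one_batch(conn, "job-1", 10, seen.extend)  # first worker handles one batch
# Pretend the worker dies here. A replacement needs only the job row to continue.
while process_one_batch(conn, "job-1", 10, seen.extend):
    pass
```

Nothing in the second loop depends on variables from the first call. Every call re-reads the cursor from the job row, which is what makes the handoff to a new worker possible.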

The Lease: Ownership That Can Expire

Progress tracking alone is not enough. In distributed systems, multiple workers may exist. A scheduler might spin up additional instances. A container orchestrator might restart a pod.

You need to ensure that only one worker processes a given job at a time.

Instead of a permanent lock, the pattern uses a lease: a time-bound claim on the job.

A worker acquires a job and sets a locked_until timestamp in the future. While it continues working, it periodically renews that lease.

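If the worker crashes, it stops renewing, and the lease expires once locked_until has passed. Another worker can then see that the job is no longer actively owned and take over. An expired lease can also mean the original worker is merely slow rather than dead, so writes to the job row should check that the writer still holds the lease.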

Ownership becomes renewable rather than permanent. The system heals itself when a worker disappears.
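
One way to express acquisition and renewal is as conditional UPDATE statements, so that the database decides who wins. The sketch below is a simplified illustration in Python with in-memory SQLite. The 30-second lease length, the function names try_acquire and renew, and the use of plain numbers as timestamps are all assumptions for the example; a real system would more likely compare against the database clock.

```python
import sqlite3

LEASE_SECONDS = 30  # assumed lease length, chosen only for the example


def try_acquire(conn, job_id, worker, now):
    """Claim the job if nobody holds it or the previous lease has expired."""
    cur = conn.execute(
        "UPDATE fanout_jobs SET lease_owner = ?, locked_until = ? "
        "WHERE job_id = ? AND (locked_until IS NULL OR locked_until <= ?)",
        (worker, now + LEASE_SECONDS, job_id, now),
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one row changed means we own the lease


def renew(conn, job_id, worker, now):
    """Extend the lease, but only if this worker still holds a live one."""
    cur = conn.execute(
        "UPDATE fanout_jobs SET locked_until = ? "
        "WHERE job_id = ? AND lease_owner = ? AND locked_until > ?",
        (now + LEASE_SECONDS, job_id, worker, now),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fanout_jobs ("
    "job_id TEXT PRIMARY KEY, locked_until REAL, lease_owner TEXT)"
)
conn.execute("INSERT INTO fanout_jobs (job_id) VALUES ('job-1')")
conn.commit()

a_acquired = try_acquire(conn, "job-1", "worker-a", now=0)       # free job: claimed
b_blocked = try_acquire(conn, "job-1", "worker-b", now=10)       # lease still live
a_renewed = renew(conn, "job-1", "worker-a", now=20)             # pushed out to t=50
b_still_blocked = try_acquire(conn, "job-1", "worker-b", now=40)
# worker-a crashes and stops renewing; its lease runs out at t=50.
b_took_over = try_acquire(conn, "job-1", "worker-b", now=51)
a_late_renew = renew(conn, "job-1", "worker-a", now=52)          # no longer the owner
```

Because each claim is a single conditional statement, two workers racing for the same job cannot both see a row count of one.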

A Simple Schema

A table to support this might look like:

CREATE TABLE fanout_jobs (
    job_id UUID PRIMARY KEY,
    notification_id UUID NOT NULL,
    region VARCHAR(50) NOT NULL,
    cursor_user_id BIGINT NOT NULL DEFAULT 0,
    status VARCHAR(20) NOT NULL DEFAULT 'Pending',
    locked_until TIMESTAMP NULL,
    lease_owner VARCHAR(100) NULL,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

The cursor_user_id tracks progress.
The locked_until and lease_owner fields manage ownership.
Together, they allow a job to move forward safely across failures.

How a Worker Thinks

A worker loop often follows a structure like this:

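// The helper methods (TryAcquireJob, GetNextBatch, and so on) stand in for
// your own data access layer. users.Any() and users.Last() need System.Linq.
while (true)
{
    // Claim a job whose lease is free or expired; stop when none is available.
    var job = TryAcquireJob();

    if (job == null)
        break;

    while (true)
    {
        // Keyset page: the next users with an ID greater than the stored cursor.
        var users = GetNextBatch(
            region: job.Region,
            afterUserId: job.CursorUserId,
            limit: 10000
        );

        if (!users.Any())
        {
            MarkJobComplete(job);
            break;
        }

        // The batch is enqueued before the cursor advances, so a crash
        // between these two steps re-enqueues the batch on resume.
        EnqueueDeliveryBatch(job.NotificationId, users);

        // Persist progress so a successor can resume from here.
        job.CursorUserId = users.Last().UserId;
        UpdateCursor(job);

        // Extend locked_until to show that this worker is still alive.
        RenewLease(job);
    }
}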

The key operations are simple:

Acquire a lease.
Process a batch.
Persist the cursor.
Renew the lease.
Repeat.

Every batch commits progress. Every lease renewal proves liveness.
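
One possible refinement, not something the loop above requires, is to fold the cursor update and the lease renewal into a single statement that succeeds only while the caller still owns the lease. The Python sketch below uses in-memory SQLite to show the idea. The function name checkpoint, the 30-second lease, and the numeric timestamps are assumptions for illustration.

```python
import sqlite3

LEASE_SECONDS = 30  # assumed lease length for the example


def checkpoint(conn, job_id, worker, new_cursor, now):
    """Advance the cursor and renew the lease in one guarded statement.

    Returns False if this worker no longer holds a live lease, in which
    case it should stop working on the job.
    """
    cur = conn.execute(
        "UPDATE fanout_jobs SET cursor_user_id = ?, locked_until = ? "
        "WHERE job_id = ? AND lease_owner = ? AND locked_until > ?",
        (new_cursor, now + LEASE_SECONDS, job_id, worker, now),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fanout_jobs ("
    "job_id TEXT PRIMARY KEY, "
    "cursor_user_id INTEGER NOT NULL DEFAULT 0, "
    "locked_until REAL, "
    "lease_owner TEXT)"
)
# worker-a already holds the lease until t=30.
conn.execute("INSERT INTO fanout_jobs VALUES ('job-1', 0, 30, 'worker-a')")
conn.commit()

ok_a = checkpoint(conn, "job-1", "worker-a", 10000, now=5)

# worker-a stalls past its lease; worker-b takes the job over.
# (Acquisition is written out directly here to keep the sketch short.)
conn.execute(
    "UPDATE fanout_jobs SET lease_owner = 'worker-b', locked_until = 70 "
    "WHERE job_id = 'job-1'"
)
conn.commit()
ok_b = checkpoint(conn, "job-1", "worker-b", 20000, now=45)

# worker-a wakes up and tries to write an older position: rejected.
stale_a = checkpoint(conn, "job-1", "worker-a", 15000, now=46)
final_cursor = conn.execute(
    "SELECT cursor_user_id FROM fanout_jobs WHERE job_id = 'job-1'"
).fetchone()[0]
```

The guard keeps a slow worker from moving the cursor backwards after someone else has taken over. It does not recall anything that worker already enqueued, which is one more reason delivery should tolerate repeats.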

A Crash in the Middle

Consider a concrete scenario:

A worker processes users up to ID 500,000. It updates the cursor. It renews the lease. Then the container is killed.

When the lease expires, another worker checks the job table. It sees the stored cursor at 500,000 and resumes with the first user whose ID is greater than 500,000.

The system continues forward.

There is no restart from zero. There is no need to infer progress from logs. The state required to resume already exists.

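Had the container died after enqueuing a batch but before updating the cursor, that batch would be enqueued a second time on resume. If delivery operations are idempotent, the repeat is harmless.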

Why a Queue Alone Is Not Enough

Message queues are excellent at distributing work. They decouple producers and consumers. They provide buffering and retry semantics.

They do not inherently track where you are inside a large dataset. They do not paginate a table for you. They do not provide checkpointing across millions of rows.

Cursor + Lease complements queues. It handles the durable traversal of data. The queue handles delivery.

Each solves a different part of the problem.

Tradeoffs and Assumptions

The pattern assumes an ordered dataset with a stable sort key. It requires durable storage for job state. It introduces some additional coordination logic.

It also requires thinking about idempotency. A worker might crash after enqueuing a batch but before updating the cursor, in which case the next worker enqueues the same batch again. Consumers should therefore assume at-least-once semantics and deduplicate wherever a repeat would cause harm.

These costs are the price of resilience at scale.
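
To make the idempotency requirement concrete, here is one way a delivery consumer might deduplicate, sketched in Python with in-memory SQLite. The deliveries table, its composite key on notification and user, and the function name deliver_once are assumptions for the example rather than a prescribed design.

```python
import sqlite3


def deliver_once(conn, notification_id, user_id, send):
    """Send only if this (notification, user) pair has not been recorded yet."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO deliveries (notification_id, user_id) "
        "VALUES (?, ?)",
        (notification_id, user_id),
    )
    conn.commit()
    if cur.rowcount == 1:  # the row was new, so this is the first attempt
        send(user_id)
        return True
    return False  # already recorded: skip the repeat


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deliveries ("
    "notification_id TEXT NOT NULL, "
    "user_id INTEGER NOT NULL, "
    "PRIMARY KEY (notification_id, user_id))"
)

sent = []

# The first worker enqueues users 1-3, then crashes before saving its cursor.
for user_id in [1, 2, 3]:
    deliver_once(conn, "notif-1", user_id, sent.append)

# Its successor resumes from an older cursor and replays part of that batch.
for user_id in [2, 3, 4]:
    deliver_once(conn, "notif-1", user_id, sent.append)
```

This version records the delivery before sending, so a crash between the two steps would suppress that one send. Reversing the order trades that risk for a possible duplicate. Which side to favor depends on whether a missed notification or a repeated one is worse.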

Where It Fits

The pattern is particularly useful for:

Fanout jobs across large datasets
Data migrations
Rebuilding projections
Background indexing
Reconciliation tasks
Bulk notification systems

For small datasets or single-message processing flows, the overhead may not be justified. Stream processors with built-in checkpointing can provide similar guarantees automatically.

As always, complexity should match the problem.

A Broader Principle

Cursor + Lease reflects a larger truth about distributed systems:

Durable state outlives processes.

In-memory progress vanishes on crash. Durable progress does not. Restartability is more valuable than uninterrupted uptime.

Crashes will happen. Nodes will disappear. Deployments will interrupt work.

The goal is not to eliminate those events.

The goal is to make them harmless.

Cursor + Lease provides a simple, practical way to do exactly that.