Why does Celery crash when I pass GeoJSON directly in the task payload?

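Redis and RabbitMQ accept messages far larger than a typical task payload, so the failure is rarely a clean size-limit error. Large GeoJSON or WKB blobs sit in broker memory for every queued and unacknowledged message and are copied again into each worker that receives them, which leads to broker memory exhaustion, evicted keys, or dropped connections. Always pass a URI to cloud or network storage and let the worker fetch the data directly.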

How do I prevent duplicate validation reports when a task retries?

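Generate a deterministic task ID from a SHA-256 hash of the dataset URI, rule version, and target CRS, and pass it as Celery's task_id. The result backend stores the output under that ID, and retries of the same task keep the same ID, so they overwrite one result entry instead of producing a second report. Celery does not deduplicate submissions on its own, though: a second apply_async with the same ID enqueues and runs the task again. To return the cached report without re-executing, have the dispatcher check AsyncResult(task_id).state first and skip the dispatch when it is already SUCCESS.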

When should topology validation move to a separate queue?

Separate topology validation into its own queue as soon as it starts delaying attribute or schema checks. Topology rules (network connectivity, sliver polygon detection, parcel adjacency) require full-dataset context, high memory, and long timeouts. Sharing a queue with fast checks causes head-of-line blocking and degrades overall pipeline throughput.

Designing Async Validation Queues with Celery

You are building a spatial data pipeline where validation must not block ingestion — geometry complexity varies wildly, topology checks can run for minutes, and a compliance failure on one dataset should never halt the queue for the next. This page describes how to configure Celery to meet those requirements: routing tasks by computational weight, serializing oversized spatial payloads safely, and enforcing strict idempotency so compliance checks can be retried without producing duplicate reports. For the broader architectural context of where this fits, see Asynchronous Validation Workflows.

Prerequisites

Python 3.11+
Celery 5.3+ (pip install "celery[redis]>=5.3") — earlier versions lack broker_connection_retry_on_startup
Redis 7+ or RabbitMQ 3.12+ as the message broker
GeoPandas 0.14+ and Shapely 2.0+ — Shapely 2.x STRtree is required for thread-safe spatial indexing inside workers
pyproj 3.6+ for coordinate reference system (CRS) transformation with strict datum grids
A cloud or network storage path reachable by all workers (S3, GCS, or NFS mount) — workers must be able to fetch dataset files by URI without going through the ingestion service

Gotchas up front:

Redis maxmemory-policy must be set to noeviction for the result backend DB. Evicting result keys mid-task causes phantom task failures that are difficult to diagnose.
Celery’s default worker_prefetch_multiplier=4 is wrong for spatial workloads. Set it to 1 — topology tasks take variable and sometimes extreme amounts of time, so prefetching causes workers to hoard tasks they cannot yet start.
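acks_late depends on how the broker handles unacknowledged messages. RabbitMQ requeues them natively when a worker's connection drops. The Redis transport emulates acknowledgement with a visibility timeout (1 hour by default): a task still running after that timeout is redelivered to another worker, so set broker_transport_options={"visibility_timeout": ...} above your longest task_time_limit.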

Queue Topology and Spatial Routing

Spatial validation introduces constraints that generic extract-transform-load pipelines rarely encounter. Geometries frequently exceed default message broker size limits, CRS normalization must happen before rule execution, and topology rules often require full-dataset context rather than row-by-row processing.

The solution is explicit routing based on computational weight:

fast queue: Lightweight attribute checks, schema validation, bounding-box verification. Low memory footprint, high concurrency (--concurrency=16).
heavy queue: CRS transformation, geometry simplification, coordinate precision normalization. Moderate memory, bounded concurrency (--concurrency=4).
topology queue: Network connectivity, sliver polygon detection, parcel adjacency and overlap rules. High memory, dedicated worker pools, longer task_soft_time_limit and task_time_limit.
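
The three-queue split above can also be expressed as a Celery router function instead of a static dict. The following is a minimal sketch, not the pipeline's actual configuration: the task names match those used later in celery_app.py, while NAMESPACE, UnroutedTaskError, and route_by_weight are hypothetical names introduced for illustration. It assumes every pipeline task lives under one module prefix.

```python
# Queue assignment mirroring the fast / heavy / topology split described above.
QUEUE_BY_TASK = {
    "spatial_validator.tasks.validate_attributes": "fast",
    "spatial_validator.tasks.transform_crs": "heavy",
    "spatial_validator.tasks.validate_topology": "topology",
}
NAMESPACE = "spatial_validator.tasks."  # assumption: all pipeline tasks share this prefix


class UnroutedTaskError(LookupError):
    """Hypothetical error: a pipeline task has no queue assigned."""


def route_by_weight(name, args, kwargs, options, task=None, **kw):
    """Celery router: return {"queue": ...}, or None to defer to default routing."""
    queue = QUEUE_BY_TASK.get(name)
    if queue is not None:
        return {"queue": queue}
    if name.startswith(NAMESPACE):
        # Workers started with --queues=fast/heavy/topology never consume the
        # default queue, so an unrouted pipeline task would wait forever.
        raise UnroutedTaskError(f"no queue configured for task {name!r}")
    return None  # not a pipeline task: fall through to Celery's default routing
```

It would be registered with app.conf.task_routes = (route_by_weight,). The design choice here is to fail at dispatch time when a new pipeline task is added without a queue, rather than letting it land on a queue that no worker is bound to.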

Always serialize spatial payloads as file URIs (S3, GCS, or mounted NFS paths) or immutable dataset fingerprints. Passing raw GeoJSON or WKB through Redis or RabbitMQ causes broker fragmentation and out-of-memory crashes at scale.

Step-by-Step Procedure

Step 1 — Install and configure the Celery application

# celery_app.py
from celery import Celery

app = Celery(
    "spatial_validator",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
    include=["tasks"],   # import tasks.py at worker start so its tasks are registered
)

app.conf.update(
    task_routes={
        "spatial_validator.tasks.validate_attributes": {"queue": "fast"},
        "spatial_validator.tasks.transform_crs":       {"queue": "heavy"},
        "spatial_validator.tasks.validate_topology":   {"queue": "topology"},
    },
    task_acks_late=True,
    task_reject_on_worker_lost=True,
    worker_prefetch_multiplier=1,          # never prefetch on variable-length spatial tasks
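    broker_connection_retry_on_startup=True,
    # Redis only: must exceed the longest time_limit (4200 s for topology tasks),
    # otherwise acks_late tasks that are still running are redelivered.
    broker_transport_options={"visibility_timeout": 4500},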
    task_soft_time_limit=1800,             # 30 min soft limit (topology queue override below)
    task_time_limit=2100,                  # 35 min hard kill
)

Verification: Start a worker against the fast queue and run celery -A celery_app inspect active_queues. Confirm the worker reports fast only.

celery -A celery_app worker --queues=fast --concurrency=16 --loglevel=info

Step 2 — Serialize payloads as URIs, not geometry blobs

Before writing any task, establish the payload contract. The rule engine described in Building Rule Engines with GeoPandas should hand off dataset references, not in-memory GeoDataFrames:

# payload contract — passed as JSON through the broker
payload = {
    "dataset_uri": "s3://spatial-qc-landing/parcels_2024.gpkg",
    "rule_version": "v2.1.0",
    "target_crs": "EPSG:4326",   # EPSG (European Petroleum Survey Group) code 4326 = WGS 84
}

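Verify the serialized payload stays under 10 KB. In a test, measure len(json.dumps(payload).encode("utf-8")). Do not use sys.getsizeof here: it reports the in-memory size of the Python string object, not the number of bytes sent to the broker.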

Step 3 — Implement the topology validation task with idempotency

# tasks.py
import hashlib
import logging
from typing import Any

from celery_app import app   # reuse the configured app; a fresh Celery() would have no broker or routes

logger = logging.getLogger(__name__)


@app.task(
    bind=True,
    name="spatial_validator.tasks.validate_topology",
    acks_late=True,
    reject_on_worker_lost=True,
    autoretry_for=(OSError, TimeoutError),   # transient I/O failures only
    retry_backoff=True,
    retry_backoff_max=600,
    retry_jitter=True,
    max_retries=5,
    queue="topology",
    soft_time_limit=3600,
    time_limit=4200,
)
def validate_topology(
    self,
    dataset_uri: str,
    rule_version: str,
    target_crs: str = "EPSG:4326",
) -> dict[str, Any]:
    """
    Validate spatial topology for a dataset at dataset_uri.
    Returns a compliance report dict with error counts and a deterministic fingerprint.
    """
    # Step 3a: Deterministic task ID for idempotency
    fingerprint = hashlib.sha256(
        f"{dataset_uri}:{rule_version}:{target_crs}".encode()
    ).hexdigest()

    # Callers should dispatch with: validate_topology.apply_async(..., task_id=fingerprint)
    # Celery does not skip execution for a reused task_id by itself; the dispatcher
    # must check the result backend before enqueuing to avoid a re-run.

    # Step 3b: Report progress so monitoring dashboards stay current
    self.update_state(
        state="PROGRESS",
        meta={"phase": "loading", "current": 0, "total": None},
    )

    # Step 3c: Fetch dataset from durable storage (production: use fsspec or boto3)
    logger.info("Fetching %s (CRS target: %s)", dataset_uri, target_crs)
    # import fsspec; ds = fsspec.open(dataset_uri, "rb")

    # Step 3d: Execute topology checks in spatial chunks
    # Apply CRS normalization before rule execution — see guidance at
    # /core-spatial-qc-fundamentals-standards/coordinate-reference-system-precision-standards/
    total_features = 100_000        # replace with metadata read from the file
    chunk_size = 5_000
    errors: list[dict] = []

    for chunk_start in range(0, total_features, chunk_size):
        current = min(chunk_start + chunk_size, total_features)
        self.update_state(
            state="PROGRESS",
            meta={
                "phase": "topology_check",
                "current": current,
                "total": total_features,
            },
        )
        # Production: use shapely.is_valid, shapely.make_valid, STRtree intersection checks
        # Append any failures to errors as {"feature_id": ..., "rule": ..., "message": ...}

    # Step 3e: Write report
    report = {
        "status": "passed" if not errors else "failed",
        "dataset_uri": dataset_uri,
        "rule_version": rule_version,
        "target_crs": target_crs,
        "error_count": len(errors),
        "errors": errors[:100],     # cap broker return payload
        "fingerprint": fingerprint,
    }
    logger.info("Topology validation complete: %s (%d errors)", report["status"], report["error_count"])
    return report

Verification: Dispatch a test task and poll the result:

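import hashlib
from tasks import validate_topology

uri, version, crs = "s3://bucket/test.gpkg", "v2.1.0", "EPSG:4326"
fingerprint = hashlib.sha256(f"{uri}:{version}:{crs}".encode()).hexdigest()
result = validate_topology.apply_async(
    kwargs={"dataset_uri": uri, "rule_version": version, "target_crs": crs},
    task_id=fingerprint,
)
print(result.get(timeout=30))   # expect {"status": "passed", "error_count": 0, ...}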

Step 4 — Dispatch with deterministic IDs from the ingestion service

# ingestion_service.py
import hashlib
from tasks import validate_topology

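def submit_validation(dataset_uri: str, rule_version: str, target_crs: str = "EPSG:4326") -> str:
    fingerprint = hashlib.sha256(
        f"{dataset_uri}:{rule_version}:{target_crs}".encode()
    ).hexdigest()

    # Celery does not deduplicate on task_id, so check the result backend first.
    # Unknown IDs report PENDING, and so does a task that is queued but not yet
    # started, so two submissions in quick succession can still both be enqueued.
    existing = validate_topology.AsyncResult(fingerprint)
    if existing.state in {"SUCCESS", "STARTED", "PROGRESS", "RETRY"}:
        return existing.id     # finished or in flight: reuse it, do not re-run

    result = validate_topology.apply_async(
        kwargs={
            "dataset_uri": dataset_uri,
            "rule_version": rule_version,
            "target_crs": target_crs,
        },
        task_id=fingerprint,   # idempotency key
    )
    return result.id           # return to caller for polling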

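Verification: Submit the same dataset_uri + rule_version twice and confirm both calls return the same task ID. Celery does not deduplicate on task_id by itself: calling apply_async again with an existing ID enqueues and re-executes the task. The second submission returns the stored SUCCESS result without re-running only if the dispatcher checks AsyncResult(fingerprint).state before calling apply_async.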

Step 5 — Start workers with explicit queue bindings

# fast workers: high concurrency, no memory ceiling needed
celery -A celery_app worker --queues=fast --concurrency=16 \
       --max-tasks-per-child=500 --loglevel=warning

# heavy workers: bounded concurrency, moderate memory
celery -A celery_app worker --queues=heavy --concurrency=4 \
       --max-tasks-per-child=100 --loglevel=warning

# topology workers: two processes per worker, each recycled every 50 tasks to prevent leaks
celery -A celery_app worker --queues=topology --concurrency=2 \
       --max-tasks-per-child=50 --loglevel=info

Verification: Open Flower (installed separately with pip install flower, then started with celery -A celery_app flower) and confirm three distinct worker pools are registered against separate queues.

Interpreting Results

The report dict returned by each task maps directly to fix strategies:

status	error_count	Next action
`passed`	0	Promote dataset to production catalog
`failed`	1–50	Route to automated repair (e.g. `ST_MakeValid`, `buffer(0)`); see Categorizing and Prioritizing Spatial Errors
`failed`	>50	Flag for manual QA; errors likely indicate a systematic source problem
Task state `RETRY`	—	Transient failure (I/O, broker disconnect); check worker logs for `OSError` or `TimeoutError`
Task state `FAILURE`	—	Non-retryable error; inspect `result.traceback` and check geometry validity before resubmitting
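
The report rows of the table above can be turned into a small triage function. This is a sketch only: next_action and the returned action labels are hypothetical names, and the threshold of 50 is taken from the table. It reads error_count rather than len(report["errors"]), because the task caps the errors list at 100 entries.

```python
from typing import Any

AUTO_REPAIR_MAX_ERRORS = 50  # threshold from the table above


def next_action(report: dict[str, Any]) -> str:
    """Map a validation report to a follow-up action label."""
    status = report["status"]
    count = report["error_count"]
    if status == "passed" and count == 0:
        return "promote"       # promote dataset to the production catalog
    if status == "failed" and 1 <= count <= AUTO_REPAIR_MAX_ERRORS:
        return "auto_repair"   # few errors: try automated repair
    if status == "failed" and count > AUTO_REPAIR_MAX_ERRORS:
        return "manual_qa"     # many errors: likely a systematic source problem
    # Anything else (e.g. "passed" with errors) means the report is inconsistent.
    raise ValueError(f"inconsistent report: status={status!r}, error_count={count!r}")
```

The RETRY and FAILURE rows describe Celery task states rather than report contents, so they would be handled where the AsyncResult is polled, not in this function.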

The fingerprint field in every report lets you correlate a result to its exact source file + rule version combination without querying raw broker logs.

Gotchas and Edge Cases

Oversized payloads fail in ways that are hard to diagnose. Redis rejects a message only when it exceeds proto-max-bulk-len (512 MB by default), so in practice broker memory runs out long before any size limit triggers. Enforce a hard 64 KB limit on task payloads at dispatch time and fail loudly if it is exceeded.
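
One way to enforce that limit is a guard called just before apply_async. The sketch below is illustrative: check_payload_size and PayloadTooLargeError are hypothetical names, and it measures only the JSON-encoded keyword arguments. Celery wraps them in its own message envelope, so the bytes on the wire will be somewhat larger.

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024  # hard limit suggested above


class PayloadTooLargeError(ValueError):
    """Hypothetical error raised when task kwargs exceed the size budget."""


def check_payload_size(payload: dict, limit: int = MAX_PAYLOAD_BYTES) -> int:
    """Return the JSON-encoded size in bytes, or raise if it exceeds limit."""
    encoded = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    if len(encoded) > limit:
        raise PayloadTooLargeError(
            f"task payload is {len(encoded)} bytes (limit {limit}); "
            "pass a dataset URI instead of inline geometry"
        )
    return len(encoded)
```

Calling check_payload_size(kwargs) in the ingestion service turns an accidental inline GeoJSON blob into an immediate, descriptive exception instead of a broker memory problem later.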

acks_late keeps messages in the broker for the whole task run. With late acknowledgement, a message stays in the broker until the task finishes (successfully or not) instead of being removed when a worker picks it up. Under high load, this can roughly double broker memory consumption. Size your Redis maxmemory to accommodate active-task payloads plus the result backend.

CRS mismatches cause silent topology failures. Topology rules operating on geometries in mixed CRS return geometrically wrong results rather than errors. Always enforce coordinate reference system normalization as the first step inside each worker before any spatial predicate runs.
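
A minimal sketch of that first step, assuming the dataset has been loaded into a GeoPandas GeoDataFrame (or any object exposing the same crs attribute and to_crs method). normalize_crs is a hypothetical helper name. It refuses datasets with no declared CRS rather than guessing one, since a wrong guess produces exactly the silent errors described above.

```python
def normalize_crs(gdf, target_crs: str = "EPSG:4326"):
    """Return gdf in target_crs; raise if the dataset declares no CRS."""
    if gdf.crs is None:
        # Non-retryable: no amount of retrying will add missing metadata.
        raise ValueError("dataset has no declared CRS; refusing to guess")
    if gdf.crs == target_crs:
        return gdf  # already in the target CRS, nothing to do
    return gdf.to_crs(target_crs)  # reproject before any spatial predicate runs
```

Raising ValueError here keeps the failure out of the task's autoretry_for tuple, so a dataset with missing CRS metadata fails once instead of being retried.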

max_tasks_per_child is not optional for topology workers. Shapely 2.x and GDAL (Geospatial Data Abstraction Library) maintain C-level allocations that Python’s garbage collector cannot reclaim. Topology checks on large polygon datasets will OOM a worker process that runs indefinitely. Set --max-tasks-per-child=50 as a baseline and tune downward if RSS memory grows above your container limit.
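
To decide how far to tune that value, it helps to log each worker process's peak memory at the end of a task. The following sketch uses only the standard library and assumes a Unix-like host (the resource module is not available on Windows). peak_rss_mib is a hypothetical helper name.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Because this is a high-water mark, a value that keeps climbing from task to task within one child process suggests retained native allocations, which is the signal to lower --max-tasks-per-child.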

Retrying data errors wastes retries on unfixable inputs. Only raise retryable exceptions (OSError, TimeoutError) for transient failures. Raise a non-retryable exception (e.g. ValueError) for invalid geometry inputs — Celery will move the task to FAILURE state immediately without consuming all five retry slots.
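
One subtlety is that OSError has subclasses that are not transient: a missing file or a permissions problem will fail the same way on every retry. The sketch below shows one possible classification. InvalidDatasetError, PERMANENT_OS, and is_retryable are hypothetical names, and which errors count as permanent is an assumption to adapt to your storage layer.

```python
TRANSIENT = (OSError, TimeoutError)  # same tuple as autoretry_for in the task above
PERMANENT_OS = (FileNotFoundError, PermissionError)  # OSError subclasses a retry will not fix


class InvalidDatasetError(ValueError):
    """Hypothetical error for inputs that are wrong, not temporarily unavailable."""


def is_retryable(exc: BaseException) -> bool:
    """True only for failures that may succeed on a later attempt."""
    if isinstance(exc, PERMANENT_OS):
        return False
    return isinstance(exc, TRANSIENT)
```

Inside the task, a permanent OSError could be caught and re-raised as InvalidDatasetError (with raise ... from exc) so that it no longer matches autoretry_for and the task moves straight to FAILURE.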

When to Escalate

Move beyond this Celery pattern when:

Feature count exceeds 5 million per dataset. At this scale, single-worker topology checks exceed practical timeout budgets. Move to a spatially partitioned approach using batch processing for large spatial datasets, where datasets are split into spatial tiles and validated in parallel before a boundary-merge aggregation step.
Cross-dataset topology rules are required. Rules such as “no parcel boundary may cross a municipal boundary from a separate dataset” require joined data context that cannot be expressed in a single-dataset Celery task. Escalate to a PostGIS-backed orchestration approach or a Dagster/Prefect asset graph.
Validation latency SLAs drop below 30 seconds. Celery’s broker round-trip and worker startup overhead makes sub-30-second P99 latency difficult to achieve reliably. Consider in-process validation with a synchronous rule engine for latency-critical paths.
Audit trail depth is required by regulation. Celery’s result backend retains task outputs but not intermediate state. Compliance frameworks that require full execution lineage (which features were checked, in what order, by which rule version) need an OpenLineage-compatible orchestrator on top of or instead of raw Celery.

Related:

Asynchronous Validation Workflows — parent guide covering the full five-stage async pipeline pattern
Building Rule Engines with GeoPandas — define spatial predicates that plug directly into Celery tasks
Batch Processing Large Spatial Datasets — spatial tiling and chunked I/O strategies for datasets that outgrow single workers
