Can I run cross-partition spatial joins inside map_partitions?

No. Dask does not support cross-partition spatial joins inside map_partitions — the function executes on isolated partition DataFrames with no visibility into neighbouring partitions. Use dask_geopandas.sjoin() at the graph level for cross-partition joins, or defer to PostGIS for complex topology checks that require global context.

How many partitions should I use?

Target 100–500 MB per partition after spatial sorting. Too few partitions leave workers idle; too many inflate scheduler overhead and serialization costs. For a 50 GB dataset, that target works out to roughly 100–500 partitions; with 4 workers and 16 GB RAM each, start near the low end of that range to keep scheduler overhead small.

What happens if a partition returns no errors?

validate_partition() returns an empty DataFrame with the declared schema columns. Dask concatenates it with other partition results during compute(), so the final error log remains structurally consistent regardless of which partitions are clean.

Scaling GeoPandas Validation with Dask

You have a spatial dataset that has outgrown single-process validation: loading the full GeoDataFrame into memory triggers an out-of-memory (OOM) crash, or a serial geometry scan takes hours. This page walks through converting that workflow into a partitioned, lazy-evaluation pipeline using dask_geopandas, so geometry validity, attribute domain, and area-consistency checks run in parallel across worker nodes and return a consolidated error log. This is a specific execution path within the broader Batch Processing Large Spatial Datasets workflow.

Prerequisites

Before converting a single-process pipeline to Dask, verify the following:

dask-geopandas 0.3+ — earlier releases lack stable map_partitions metadata inference for geometry columns.
geopandas 0.14+ and shapely 2.0+ — Shapely 2 switches to vectorised GEOS operations that are significantly faster inside per-partition loops.
pyproj 3.5+ — required for reliable coordinate reference system (CRS) authority lookups; mismatched pyproj versions between workers cause silent CRS comparison failures.
dask[distributed] 2024.1+ — for the Client scheduler, dashboard, and memory management improvements.
Dataset stored in GeoParquet or partitioned Parquet — columnar formats allow Dask to read only the row groups it needs. Monolithic GeoJSON or Shapefile sources must be pre-converted; feeding them directly to dask_geopandas.read_file() loads the entire file on a single worker.
All partitions share the same projected CRS: geographic coordinate systems (degrees) produce meaningless area and distance values. Reproject to a metric CRS before partitioning, preferably a local UTM zone or an equal-area projection. EPSG:3857 is metric but inflates areas away from the equator, so avoid it for area checks.
A stable Python environment across all workers — if using a distributed cluster, every node must have identical package versions and GEOS/GDAL shared-library builds. A mismatch between worker GEOS versions causes geometry serialisation failures during task graph execution.

Step-by-Step Procedure

Step 1: Initialise the Dask client

from dask.distributed import Client

client = Client(
    n_workers=4,
    threads_per_worker=2,
    memory_limit="4GB",   # per worker
)
print(client.dashboard_link)  # verify dashboard is accessible

Open the dashboard link in a browser and confirm four workers appear with the expected memory ceilings. If any worker shows “saturated” before the first task, reduce threads_per_worker to 1. Shapely 2's vectorised operations release the GIL, but the Python-level loops that collect error records do not, so a second thread mostly adds memory pressure rather than throughput.

Step 2: Load and partition the dataset

import dask_geopandas as dgpd

dask_gdf = dgpd.read_parquet("s3://bucket/large_spatial_dataset.parquet")

# Rebalance: aim for 100–500 MB per partition
dask_gdf = dask_gdf.repartition(npartitions=32)

# Verify partition count and estimated sizes
print(dask_gdf.npartitions)
print(dask_gdf.memory_usage(deep=True).compute().sum() / dask_gdf.npartitions / 1e6, "MB avg")
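The 100–500 MB target can be turned into a partition count with simple arithmetic. The helper below is a sketch only: estimate_npartitions and its 256 MB default are hypothetical choices, not part of dask-geopandas. It takes a total size in bytes, such as the sum returned by memory_usage, and rounds up.

```python
import math


def estimate_npartitions(total_bytes: int, target_mb: int = 256) -> int:
    """Return how many partitions are needed so each holds about target_mb."""
    if total_bytes <= 0:
        return 1
    target_bytes = target_mb * 1_000_000  # decimal megabytes
    return max(1, math.ceil(total_bytes / target_bytes))


# Example: a dataset that occupies 12 GB in memory, at the default 256 MB target
print(estimate_npartitions(12_000_000_000))  # 47
```

The result is a starting point; the measured average partition size remains the figure to check after repartitioning.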

If your source is not already spatially sorted, apply a Hilbert-curve sort before repartitioning. Spatially coherent partitions minimise wasted geometry bounding-box checks inside each worker — features in the same geographic area land in the same partition, avoiding redundant index lookups.

# Optional: sort by Hilbert index to improve spatial locality
dask_gdf = dask_gdf.sort_values(by="hilbert_distance")  # pre-computed column
dask_gdf = dask_gdf.repartition(npartitions=32)
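The snippet above assumes the hilbert_distance column already exists. To illustrate what such a value represents, the sketch below maps integer grid cells to their position along a Hilbert curve using the classic bit-manipulation algorithm. The function name hilbert_index and the 4×4 grid are assumptions made for illustration; real coordinates would first be scaled from the dataset bounds onto the grid. dask-geopandas also provides its own hilbert_distance() method, which is likely the more practical route for a real pipeline.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map cell (x, y) on a 2**order by 2**order grid to its distance along the Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the correct orientation
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Sort the cells of a 4x4 grid by their Hilbert index
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda cell: hilbert_index(2, *cell))
print(ordered[:4])  # [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Consecutive cells in the sorted order are always direct neighbours on the grid, which is the locality property that keeps nearby features in the same partition.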

Step 3: Write the stateless validation function

The function must accept a single GeoDataFrame and return a DataFrame of error records. It must not reference any state outside the partition — no global accumulators, no cross-partition lookups.

import geopandas as gpd
import pandas as pd

def validate_partition(gdf: gpd.GeoDataFrame) -> pd.DataFrame:
    """Stateless geometry and attribute checks for a single Dask partition."""
    errors = []

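    # 1. Geometry validity: self-intersections, unclosed rings, degenerate edges.
    #    is_valid is False for missing geometries, so exclude them here; rule 2 reports them.
    invalid_mask = ~gdf.geometry.is_valid & gdf.geometry.notna()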
    for idx in gdf.index[invalid_mask]:
        errors.append({
            "id": idx,
            "rule": "invalid_geometry",
            "detail": "Self-intersection or topological defect",
        })

    # 2. Empty or null geometries
    empty_mask = gdf.geometry.is_empty | gdf.geometry.isna()
    for idx in gdf.index[empty_mask]:
        errors.append({"id": idx, "rule": "empty_or_null_geometry", "detail": "Missing or empty geometry"})

    # 3. Attribute domain validation (vectorised, no iterrows)
    if "status" in gdf.columns:
        valid_statuses = {"active", "inactive", "pending", "archived"}
        bad_status = ~gdf["status"].isin(valid_statuses)
        for idx in gdf.index[bad_status]:
            errors.append({
                "id": idx,
                "rule": "invalid_status",
                "detail": f"Unexpected value: {gdf.at[idx, 'status']}",
            })

    # 4. Area consistency — requires a projected CRS
    if "expected_area" in gdf.columns and gdf.crs is not None and not gdf.crs.is_geographic:
        calc_area = gdf.geometry.area
        area_diff = (calc_area - gdf["expected_area"]).abs()
        threshold = gdf["expected_area"] * 0.05   # 5 % tolerance
        mismatch_mask = area_diff > threshold
        for idx in gdf.index[mismatch_mask]:
            errors.append({
                "id": idx,
                "rule": "area_mismatch",
                "detail": f"Deviation: {area_diff.at[idx]:.2f} units²",
            })

    return pd.DataFrame(errors, columns=["id", "rule", "detail"])
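The masks in validate_partition are vectorised, but each flagged row is still appended in a Python loop. On partitions with many defects, one possible alternative is to build the error rows directly from the mask. This is a sketch assuming pandas only; mask_to_errors and ERROR_COLUMNS are hypothetical names, and the sample frame stands in for a partition.

```python
import pandas as pd

ERROR_COLUMNS = ["id", "rule", "detail"]


def mask_to_errors(frame: pd.DataFrame, mask: pd.Series, rule: str, detail) -> pd.DataFrame:
    """Build error rows for every row where mask is True, without a Python loop.

    detail may be a single string or a Series aligned with frame.
    """
    keep = mask.to_numpy(dtype=bool)
    ids = frame.index.to_numpy()[keep]
    if isinstance(detail, pd.Series):
        detail = detail.to_numpy()[keep]
    return pd.DataFrame({"id": ids, "rule": rule, "detail": detail}, columns=ERROR_COLUMNS)


# Stand-in for one partition's attribute table
df = pd.DataFrame({"status": ["active", "retired", "pending", "unknown"]}, index=[10, 11, 12, 13])
bad = ~df["status"].isin({"active", "inactive", "pending", "archived"})
errors = mask_to_errors(df, bad, "invalid_status", "Unexpected value: " + df["status"].astype(str))
print(errors)
```

The per-rule frames can then be combined with pd.concat; a clean partition still yields an empty frame with the same three columns.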

Step 4: Apply the function lazily across all partitions

validation_results = dask_gdf.map_partitions(
    validate_partition,
    meta={"id": "object", "rule": "object", "detail": "object"},
)
# Nothing executes yet — Dask builds a task graph

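The meta argument declares the output schema so Dask can build the graph without running the function. With the default enforce_metadata=True, Dask checks each partition's column names against meta at compute time and raises an error on a mismatch rather than silently dropping rows. Dtype mismatches are not guaranteed to raise, so keep the returned dtypes consistent inside the function.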
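A partition function can also normalise its own output so that it always matches the declared schema, whatever the partition contained. The following is a minimal sketch assuming pandas only; ERROR_META and conform_to_meta are hypothetical names, not Dask APIs. An empty DataFrame with the right dtypes, like ERROR_META here, can be passed as meta in place of the dict.

```python
import pandas as pd

# Empty frame that carries the expected column order and dtypes
ERROR_META = pd.DataFrame({
    "id": pd.Series(dtype="object"),
    "rule": pd.Series(dtype="object"),
    "detail": pd.Series(dtype="object"),
})


def conform_to_meta(result: pd.DataFrame, meta: pd.DataFrame = ERROR_META) -> pd.DataFrame:
    """Reorder columns and cast dtypes to match meta; raise if the column sets differ."""
    expected = list(meta.columns)
    if set(result.columns) != set(expected):
        raise ValueError(f"expected columns {expected}, got {list(result.columns)}")
    return result[expected].astype(meta.dtypes.to_dict())


# Integer ids and shuffled columns are brought in line with the declared schema
raw = pd.DataFrame({"detail": ["Missing or empty geometry"], "id": [42], "rule": ["empty_or_null_geometry"]})
fixed = conform_to_meta(raw)
print(fixed.dtypes.to_dict())
```

Calling a helper like this on the last line of the validation function keeps integer and string ids from producing different dtypes in different partitions.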

Step 5: Trigger execution and export

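from pathlib import Path

final_errors = validation_results.compute()   # triggers the full task graph
print(f"{len(final_errors)} validation errors found")

Path("output").mkdir(parents=True, exist_ok=True)   # to_parquet does not create missing directories
final_errors.to_parquet("output/validation_errors.parquet", index=False)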

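Verification: len(final_errors) should equal the sum of the per-partition row counts, which validation_results.map_partitions(len).compute() returns. If the output Parquet is empty for a dataset you know has defects, confirm that the checks are actually running. During debugging, wrap the validation call in a try/except so that an exception in any partition surfaces as a runtime_error row instead of aborting the run, and pass the wrapper to map_partitions in place of validate_partition:

def validate_partition_safe(gdf):
    """Debugging wrapper: report an exception as an error row instead of raising."""
    try:
        return validate_partition(gdf)
    except Exception as exc:
        return pd.DataFrame([{"id": "PARTITION_ERROR", "rule": "runtime_error", "detail": str(exc)}],
                            columns=["id", "rule", "detail"])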

Interpreting Results

The output Parquet contains three columns: id (the original feature index), rule (a machine-readable error code), and detail (a human-readable description). Common patterns and their remediation paths:

Rule code	Typical cause	Fix strategy
`invalid_geometry`	Self-intersections or unclosed rings introduced during digitising or format conversion	Apply `shapely.make_valid()` or `buffer(0)` to affected features; re-run validation to confirm zero `invalid_geometry` rows
`empty_or_null_geometry`	NULL geometry field from upstream joins, failed coordinate parsing, or format-specific null encoding	Trace back to the source row using `id`; either populate the geometry or mark the record for exclusion
`invalid_status`	Attribute value falls outside the declared domain (typo, stale lookup table, encoding mismatch)	Correct the source value; update the domain list in the rule engine if the domain itself has changed
`area_mismatch`	Unit mismatch between stored `expected_area` and the projected CRS (for example, square feet versus square metres), incorrect CRS metadata on a source tile, or genuine boundary change	Confirm all partitions report `gdf.crs.is_geographic == False` and that the CRS units match those of `expected_area`; if both are correct, flag the feature for manual boundary review
`runtime_error`	Corrupted row triggered an unhandled exception inside the partition	Filter on the `id` value `PARTITION_ERROR`, read the exception message in `detail`, then inspect the suspect source rows directly with `geopandas.read_parquet()`

Group errors by rule to identify systemic issues versus isolated outliers:

errors = pd.read_parquet("output/validation_errors.parquet")
print(errors.groupby("rule").size().sort_values(ascending=False))

A high count of invalid_geometry spread uniformly across the dataset usually indicates a format-conversion problem in the ingestion step; fix the source, not individual features. A cluster of area_mismatch errors confined to a narrow range of feature IDs suggests a CRS or unit inconsistency in a subset of source tiles.

Gotchas & Edge Cases

Cross-partition spatial joins crash or produce wrong results. map_partitions executes each function on an isolated GeoDataFrame with no knowledge of neighbouring partitions. Calling sjoin() inside validate_partition against a reference layer will succeed only if that reference layer fits in memory and is broadcast to each worker. For cross-partition containment or adjacency checks, use dask_geopandas.sjoin() at the graph level before calling map_partitions, or defer to asynchronous validation workflows that route inter-feature checks to a PostGIS step.

Unprojected CRS returns degrees² for area checks without raising. GeoPandas does not raise an error when you call .area on a geographic coordinate system; it emits only a UserWarning, which is easy to miss in worker logs, and returns a value in squared degrees. Always guard with not gdf.crs.is_geographic and confirm projected units match expected_area units stored in your attribute table. For CRS authority and precision requirements, refer to the coordinate reference system precision standards guidance.
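To see why squared degrees cannot stand in for area, the sketch below computes the ground area of a one-degree cell at two latitudes. It assumes a spherical Earth with a mean radius of 6371 km, an approximation chosen only for illustration, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def degree_cell_area_km2(lat_south: float, lon_width_deg: float = 1.0, lat_height_deg: float = 1.0) -> float:
    """Area in km² of a cell bounded by meridians and parallels on a sphere."""
    phi1 = math.radians(lat_south)
    phi2 = math.radians(lat_south + lat_height_deg)
    dlam = math.radians(lon_width_deg)
    return EARTH_RADIUS_KM ** 2 * dlam * (math.sin(phi2) - math.sin(phi1))


equator = degree_cell_area_km2(0.0)
north = degree_cell_area_km2(60.0)
print(f"{equator:.0f} km² at the equator, {north:.0f} km² at 60°N")
```

Both cells measure exactly one squared degree in a geographic CRS, yet the northern cell covers roughly half the ground, so a fixed tolerance against expected_area cannot work in degrees.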

Missing columns cause KeyError on some partitions but not others. If a source dataset has schema drift across partitions (e.g., a column absent from older tiles), direct access like gdf["status"] raises a KeyError on the affected partition. Unless the exception is caught, that single failure aborts the whole compute() call; if a broad try/except catches it, the partition's real findings are replaced by one runtime_error row. Use if "status" in gdf.columns: guards or gdf.get("status") with a fallback.
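One way to express the fallback is a small accessor that returns an all-missing column when the name is absent, so downstream checks can run unchanged on old and new tiles. This is a sketch assuming pandas only; column_or_default is a hypothetical helper and the two frames stand in for partitions with different schemas.

```python
import pandas as pd


def column_or_default(frame: pd.DataFrame, name: str, default=pd.NA) -> pd.Series:
    """Return frame[name] if present, else a same-length Series filled with default."""
    if name in frame.columns:
        return frame[name]
    return pd.Series(default, index=frame.index, name=name, dtype="object")


old_tile = pd.DataFrame({"name": ["a", "b"]})                     # no status column
new_tile = pd.DataFrame({"name": ["c"], "status": ["active"]})   # current schema

for tile in (old_tile, new_tile):
    status = column_or_default(tile, "status")
    print(int(status.isna().sum()), "rows without a status")
```

Whether a missing column should count as a validation error or be skipped is a policy decision; the helper only makes the absence explicit instead of raising.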

Large string serialisation adds network overhead. Returning full WKT geometry strings in the error detail field forces Dask to serialise and transfer large payloads over the network during compute(). Store only the feature id and a short description; retrieve the full geometry on demand from the source Parquet using the id as a lookup key in a separate debug step.

Partition skew causes worker memory spikes. If spatial sorting concentrates dense urban geometries in a few partitions, those workers exhaust their memory limit while others sit idle. Monitor partition sizes via the dashboard’s “Worker Memory” tab. If the largest partition is more than about 3x the typical size, increase npartitions or re-sort using a finer spatial tile key (e.g., H3 resolution 8 instead of resolution 6) so that dense areas are split across more partitions.
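The text does not define how skew is measured; one reasonable reading, assumed here, is the ratio of the largest partition to the median partition. The sketch below applies that definition to a list of per-partition row counts, which could come from something like map_partitions(len).compute(). The function name and sample numbers are illustrative.

```python
from statistics import median


def partition_skew(sizes: list[int]) -> float:
    """Ratio of the largest partition to the median non-empty partition."""
    non_empty = [size for size in sizes if size > 0]
    if not non_empty:
        return 1.0
    return max(non_empty) / median(non_empty)


# Four ordinary partitions and one dense urban partition
row_counts = [1200, 1100, 1300, 5200, 1250]
print(f"skew = {partition_skew(row_counts):.1f}x")  # skew = 4.2x
```

Row counts are only a proxy for memory, since complex polygons weigh far more than points, so byte sizes are the better input when they are available.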

When to Escalate

Move beyond this Dask approach when:

Feature count exceeds 10 million per source file. At that scale, Dask scheduler overhead and Python-level geometry operations become the bottleneck. Migrate the geometry-validity and topology checks to PostGIS using ST_IsValid, ST_MakeValid, and server-side spatial indexes — PostGIS can process billions of features with indexed scans that Dask cannot match.
Validation rules require cross-feature topology. Shared-boundary checks, sliver detection between adjacent polygons, or road-network connectivity analysis require global context that map_partitions cannot provide. Route these rules to a PostGIS batch job or Apache Sedona cluster where full spatial indexing is available across the dataset.
Pipeline latency requirements drop below minutes. Dask is designed for throughput, not low-latency streaming. If your data velocity requires validation within seconds of ingest, switch to an event-driven architecture — see designing async validation queues with Celery for a queue-based pattern that decouples ingest from validation without the Dask scheduler’s cold-start overhead.
Errors require automated remediation at scale. validate_partition returns error records but does not fix them. For automated repair workflows (geometry snapping, ring closure, attribute imputation), the categorizing and prioritizing spatial errors framework provides the severity classification model needed to route errors to the correct remediation handler before writing back to the source dataset.

Related:

Batch Processing Large Spatial Datasets — partitioning strategies, storage formats, and the full validation workflow this page implements
Building Rule Engines with GeoPandas — centralising and versioning validation predicates before distributing them to Dask workers
Designing Async Validation Queues with Celery — queue-based alternative for low-latency or event-driven validation
