Batch Processing Large Spatial Datasets

Automated spatial quality control breaks down when dataset volumes exceed single-threaded memory capacity or the capabilities of desktop GIS tools. Batch processing transforms validation from a manual, feature-by-feature inspection into a systematic, repeatable engineering workflow that GIS analysts, QA engineers, and compliance officers can rely on for terabytes of vector and raster data. This page explains how to structure that workflow — from environment setup and data partitioning through parallel rule execution and deterministic error reporting — as the execution layer of a broader Validation Pipeline Architecture.

The core challenge is balancing computational efficiency against geometric precision. Spatial operations — topology verification, self-intersection detection, coordinate reference system (CRS) normalization — are expensive because they depend on GEOS-backed graph algorithms and coordinate transformations. Applied naively to millions of features, they trigger memory exhaustion, I/O bottlenecks, and inconsistent error records. A well-structured batch workflow avoids these outcomes by enforcing partitioning boundaries, parallelizing rule evaluation, and standardizing output serialization.

Prerequisites

Before implementing a batch validation workflow, confirm each item in this checklist. Missing any one of them is a common source of hard-to-diagnose failures at scale.

Python 3.11+ with a managed environment (conda, uv, or virtualenv). Use at least geopandas 0.14, shapely 2.0, pyproj 3.6, pyarrow 15, and dask 2024.2, and pin the exact versions you test against in a lock file. Spatial libraries depend on compiled GDAL/OGR and GEOS C extensions; environment isolation prevents ABI conflicts across worker nodes.
GDAL 3.4+ installed at the OS level. Confirm with gdal-config --version. Format-specific tuning (driver options, spatial index creation) is documented in the GDAL Vector Driver reference.
Storage format: source datasets in GeoParquet (preferred for predicate pushdown and spatial indexing) or OGC GeoPackage (GeoPackage 1.3+). Avoid monolithic uncompressed GeoJSON or Shapefile for datasets above 1 GB — they force full-file scans and lack row-level spatial indexes.
CRS alignment readiness: all input layers must share a documented authority-backed CRS (EPSG code preferred). If inputs arrive in mixed CRS, plan a normalization step before any metric-based rule executes. See CRS Precision Standards for detail on authority lookups and tolerance thresholds.
Validation rule manifest: a machine-readable YAML or JSON file that lists each rule — geometry type, OGC predicate (is_valid, is_simple, is_ring), attribute constraint (null tolerance, domain values, numeric range), and severity (blocker, warning, informational). Version-control this manifest alongside your pipeline code.
Hardware baseline: minimum 32 GB RAM, NVMe-backed scratch storage, and a multi-core CPU (8+ cores). For datasets exceeding 50 GB per run, allocate a distributed worker pool or cloud compute with high-throughput network storage.
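
The version floors in this checklist can be checked mechanically before a run starts. The following is a minimal sketch, assuming the minimums listed above; MINIMUM_VERSIONS, meets_minimum, and check_environment are illustrative names, not part of any library.

```python
from importlib import metadata
import re

# Minimum versions copied from the checklist above (illustrative constant name)
MINIMUM_VERSIONS = {
    "geopandas": "0.14",
    "shapely": "2.0",
    "pyproj": "3.6",
    "pyarrow": "15",
    "dask": "2024.2",
}

def _numeric_parts(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.0.4' -> (2, 0, 4)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically so that '0.9' sorts below '0.14'."""
    a, b = _numeric_parts(installed), _numeric_parts(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_environment(minimums=MINIMUM_VERSIONS, get_version=metadata.version) -> list[str]:
    """Return one human-readable problem per missing or outdated package."""
    problems = []
    for package, minimum in minimums.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (need {minimum}+)")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package}: found {installed}, need {minimum}+")
    return problems

if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running this on every worker node at startup surfaces version drift before any tile is processed. It checks Python package versions only; the OS-level GDAL version still needs gdal-config --version.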

Conceptual Foundation

Why Naïve Iteration Fails at Scale

A single-threaded Python loop that opens a GeoJSON file and reads shape.is_valid (a property in Shapely, not a method) on each feature works for tens of thousands of records. Above roughly one million features, three failure modes emerge simultaneously:

Memory exhaustion: Python’s object model creates one shapely.Geometry instance per feature. At ~2 KB per complex polygon, ten million features consume 20 GB in Python objects alone — before any topology computation.
I/O serialization: reading an unindexed flat file requires scanning every byte, even when only 5% of features fall in the spatial extent you care about.
Non-deterministic errors: without explicit partition boundaries, retry logic re-processes an unknown subset of data, making error reports non-reproducible.
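
The memory arithmetic above can be turned into a quick planning estimate. This is a rough sketch, assuming the ~2 KB average per feature quoted above and decimal gigabytes; the function names and the 50% RAM budget are illustrative assumptions, and real per-feature size should be measured on a sample.

```python
def estimate_geometry_memory_gb(feature_count: int, bytes_per_feature: int = 2_000) -> float:
    """Approximate memory held by geometry objects alone, in decimal GB."""
    return feature_count * bytes_per_feature / 1e9

def max_features_per_tile(
    worker_ram_gb: float,
    bytes_per_feature: int = 2_000,
    ram_fraction: float = 0.5,
) -> int:
    """Largest feature count whose geometry fits in the given share of worker RAM."""
    budget_bytes = worker_ram_gb * 1e9 * ram_fraction
    return int(budget_bytes // bytes_per_feature)

print(estimate_geometry_memory_gb(10_000_000))  # 20.0
print(max_features_per_tile(32))                # 8000000
```

The estimate ignores attribute columns and the temporary structures that topology checks allocate, so treat it as an upper bound on tile size rather than a target.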

The Partition-Validate-Aggregate Pattern

The batch pattern separates these concerns into three stages. The partitioner reads spatial index metadata (row group statistics in GeoParquet, or an R-tree index in GeoPackage) and emits tiles that each fit comfortably in worker RAM. Each worker is a stateless process that applies the full rule manifest to its tile and emits a structured error stream; it never modifies source data. The aggregator merges error streams, deduplicates cross-boundary features, and produces a single deterministic report.

This design satisfies two formal properties that matter for compliance reporting: idempotency (re-running a worker on the same tile produces the same output) and monotonicity (adding more partitions cannot cause errors to disappear).
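
Both properties can be tested directly on worker output. The sketch below assumes each violation is a plain dict with feature_id and rule_id keys, as in the error stream described above; violation_digest, is_idempotent, and is_monotonic are hypothetical helper names.

```python
import hashlib
import json

def violation_digest(violations: list[dict]) -> str:
    """Hash a violation list independently of row order and dict key order."""
    canonical = sorted(json.dumps(v, sort_keys=True, default=str) for v in violations)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()

def is_idempotent(worker, tile, runs: int = 2) -> bool:
    """Re-run the worker on the same tile and confirm every run hashes the same."""
    digests = {violation_digest(worker(tile)) for _ in range(runs)}
    return len(digests) == 1

def is_monotonic(report_before: list[dict], report_after: list[dict]) -> bool:
    """Every (feature_id, rule_id) reported before repartitioning must still be reported."""
    def key(v: dict) -> tuple:
        return (str(v["feature_id"]), v["rule_id"])
    return {key(v) for v in report_before} <= {key(v) for v in report_after}
```

A check like is_idempotent fits naturally into a test suite, while is_monotonic is more useful when comparing reports from two partitioning schemes over the same source data.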

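The Open Geospatial Consortium (OGC) Simple Features specification defines the geometric validity criteria that underpin every predicate in the rule manifest. The geometry model is set out in Part 1, Common Architecture (OGC 06-103r4), and the SQL option (OGC 06-104r4) builds on it. Broadly, a polygon is “valid” if its rings are closed and simple, do not cross one another, and enclose a connected interior. Ring orientation is a separate convention: GEOS-backed checks such as Shapely's is_valid do not test it, so add an explicit orientation rule if your data standard requires counterclockwise exterior rings. Rule authoring should anchor directly to the specification rather than library defaults, which vary between GEOS versions.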

Step-by-Step Implementation

Step 1 — Configure the Environment and Rule Manifest

Create a reproducible environment and define your rule manifest before writing any pipeline code.

# environment: python 3.11, geopandas 0.14, shapely 2.0, pyarrow 15, dask 2024.2
# Install: pip install geopandas==0.14.4 shapely==2.0.4 pyarrow==15.0.2 dask==2024.2.1 pyyaml
# (pyyaml provides the yaml module imported below)

import yaml
from pathlib import Path

RULE_MANIFEST_PATH = Path("rules/spatial_rules_v2.yaml")

# Minimal rule manifest schema
EXAMPLE_MANIFEST = """
version: "2.0"
rules:
  - id: GEOM_001
    predicate: is_valid
    severity: blocker
    message: "Geometry is topologically invalid (OGC SFS §6.1.11)"
  - id: GEOM_002
    predicate: is_simple
    severity: warning
    message: "Geometry has self-intersections or repeated points"
  - id: ATTR_001
    field: feature_id
    check: not_null
    severity: blocker
    message: "Required attribute 'feature_id' is null"
  - id: ATTR_002
    field: area_m2
    check: range
    min: 0.01
    severity: warning
    message: "Area is below minimum threshold"
"""

rules = yaml.safe_load(EXAMPLE_MANIFEST)
print(f"Loaded {len(rules['rules'])} rules from manifest v{rules['version']}")
# Expected: Loaded 4 rules from manifest v2.0
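
Because workers trust the manifest, it helps to lint it before any tile is scheduled. The following is a minimal sketch, assuming the schema used in the example manifest above (id, severity, and either predicate or check plus field); lint_manifest and ALLOWED_SEVERITIES are illustrative names.

```python
ALLOWED_SEVERITIES = {"blocker", "warning", "informational"}

def lint_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest looks usable."""
    problems = []
    seen_ids = set()
    for position, rule in enumerate(manifest.get("rules", [])):
        rule_id = rule.get("id", f"<rule #{position}>")
        if "id" not in rule:
            problems.append(f"{rule_id}: missing 'id'")
        elif rule_id in seen_ids:
            problems.append(f"{rule_id}: duplicate id")
        seen_ids.add(rule_id)

        if rule.get("severity") not in ALLOWED_SEVERITIES:
            problems.append(f"{rule_id}: unknown severity {rule.get('severity')!r}")

        # A rule is either a geometry predicate or an attribute check, never both
        has_predicate = "predicate" in rule
        has_check = "check" in rule
        if has_predicate == has_check:
            problems.append(f"{rule_id}: needs exactly one of 'predicate' or 'check'")
        if has_check and "field" not in rule:
            problems.append(f"{rule_id}: attribute check needs a 'field'")
    return problems
```

Calling lint_manifest(rules) on the dict loaded above and aborting on a non-empty result keeps a malformed rule from being silently skipped by every worker.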

Step 2 — Partition the Dataset into Spatial Tiles

Align partition boundaries to the data’s spatial distribution rather than arbitrary row counts. GeoParquet row groups already encode bounding-box statistics; read them without loading geometry to plan tiles cheaply.

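import json

import pyarrow.parquet as pq
import geopandas as gpd
from shapely.geometry import box

def get_parquet_tile_bounds(parquet_path: str) -> list[dict]:
    """Plan one tile per row group from Parquet footer metadata, without loading geometry."""
    metadata = pq.ParquetFile(parquet_path).metadata
    tiles = []
    for i in range(metadata.num_row_groups):
        rg = metadata.row_group(i)
        # Only the footer is read here. Per-row-group bounding boxes are available from
        # column statistics only if the file was written with a bbox covering column.
        tiles.append({
            "row_group": i,
            "num_rows": rg.num_rows,
        })
    return tiles

def load_tile(parquet_path: str, row_group: int, clip_box: tuple | None = None) -> gpd.GeoDataFrame:
    """Load a single row group with optional spatial pre-filter."""
    pf = pq.ParquetFile(parquet_path)
    # GeoParquet keeps the geometry column name and CRS in the file-level "geo" metadata
    geo_meta = json.loads(pf.schema_arrow.metadata[b"geo"])
    geom_col = geo_meta["primary_column"]
    # Per the GeoParquet spec, a missing "crs" key means OGC:CRS84; an explicit null means unknown
    crs = geo_meta["columns"][geom_col].get("crs", "OGC:CRS84")

    df = pf.read_row_group(row_group).to_pandas()
    # Geometry is stored as WKB (the default GeoParquet encoding); decode it explicitly
    df[geom_col] = gpd.GeoSeries.from_wkb(df[geom_col], crs=crs)
    gdf = gpd.GeoDataFrame(df, geometry=geom_col)
    if clip_box:
        gdf = gdf[gdf.geometry.intersects(box(*clip_box))]
    return gdf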

# Verification: confirm tile row counts sum to total
tiles = get_parquet_tile_bounds("datasets/parcels_2024.parquet")
total_rows = sum(t["num_rows"] for t in tiles)
print(f"Tiles: {len(tiles)}, total rows: {total_rows:,}")
# Expected: Tiles: 12, total rows: 4,821,053

For datasets that arrive as GeoPackage or Shapefile, use a bounding-box grid or H3 index instead:

import h3  # written against the h3-py 3.x API; h3-py 4.x replaces polyfill with polygon_to_cells

def bbox_to_h3_tiles(minx: float, miny: float, maxx: float, maxy: float, resolution: int = 4) -> list[str]:
    """Return the H3 cells at the given resolution whose centers fall inside the bounding box."""
    # h3.polyfill_geojson expects a GeoJSON polygon with [lng, lat] coordinate order
    geojson_poly = {
        "type": "Polygon",
        "coordinates": [[
            [minx, miny], [maxx, miny],
            [maxx, maxy], [minx, maxy], [minx, miny]
        ]]
    }
    # Sort the returned set so tile order is stable across runs
    return sorted(h3.polyfill_geojson(geojson_poly, resolution))

# H3 does not support polygon edges spanning more than 180 degrees of longitude, so tile a
# regional extent rather than the whole globe. The extent below is illustrative.
tiles = bbox_to_h3_tiles(-10.0, 35.0, 30.0, 60.0, resolution=3)
print(f"H3 resolution-3 tile count: {len(tiles)}")
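
The bounding-box grid alternative mentioned above needs no extra dependency. This is a minimal sketch, assuming a square cell size expressed in the units of the layer's CRS; bbox_grid_tiles and the tile_id format are hypothetical.

```python
import math

def bbox_grid_tiles(
    minx: float, miny: float, maxx: float, maxy: float, cell_size: float
) -> list[dict]:
    """Cover the extent with a regular grid; edge cells are clipped to the extent."""
    cols = max(1, math.ceil((maxx - minx) / cell_size))
    rows = max(1, math.ceil((maxy - miny) / cell_size))
    tiles = []
    # Row-major order gives every run the same tile sequence
    for row in range(rows):
        for col in range(cols):
            x0 = minx + col * cell_size
            y0 = miny + row * cell_size
            tiles.append({
                "tile_id": f"R{row:04d}C{col:04d}",
                "bounds": (x0, y0, min(x0 + cell_size, maxx), min(y0 + cell_size, maxy)),
            })
    return tiles

for tile in bbox_grid_tiles(0, 0, 12, 5, cell_size=5):
    print(tile["tile_id"], tile["bounds"])
```

Each bounds tuple is in (minx, miny, maxx, maxy) order, so it can be passed as a bounding-box filter when reading a GeoPackage layer or as the clip_box argument of a tile loader. A regular grid ignores feature density, so dense urban cells may still need to be subdivided.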

Step 3 — Execute Validation Rules in Parallel

The rule engine described in Building Rule Engines with GeoPandas can prototype predicates for a single tile. For batch execution, wrap it in a concurrent.futures or Dask task graph.

import concurrent.futures
import pandas as pd
import geopandas as gpd
from shapely.validation import explain_validity

def validate_tile(tile_path: str, tile_id: str, rules: list[dict]) -> pd.DataFrame:
    """
    Run all rules against one tile. Returns a DataFrame of violations.
    Each row: tile_id, feature_id, rule_id, severity, message, explain, wkt_snippet.
    """
    gdf = gpd.read_parquet(tile_path)
    has_id = "feature_id" in gdf.columns
    violations = []

    def record(idx, rule: dict, explain: str, geom=None) -> None:
        """Append one violation, keyed by feature_id when present and by row index otherwise."""
        fid = gdf.at[idx, "feature_id"] if has_id else None
        violations.append({
            "tile_id": tile_id,
            "feature_id": str(idx) if fid is None or pd.isna(fid) else str(fid),
            "rule_id": rule["id"],
            "severity": rule["severity"],
            "message": rule["message"],
            "explain": explain,
            "wkt_snippet": geom.wkt[:200] if geom is not None else "",
        })

    for rule in rules:
        if rule.get("predicate") == "is_valid":
            mask = ~gdf.geometry.is_valid
            for idx in gdf[mask].index:
                geom = gdf.geometry.at[idx]
                explain = explain_validity(geom) if geom is not None else "Geometry is missing"
                record(idx, rule, explain, geom)

        elif rule.get("predicate") == "is_simple":
            # explain_validity describes validity, not simplicity, so state the finding directly
            mask = ~gdf.geometry.is_simple
            for idx in gdf[mask].index:
                record(idx, rule, "Geometry is not simple", gdf.geometry.at[idx])

        elif rule.get("check") == "not_null" and rule["field"] in gdf.columns:
            field = rule["field"]
            for idx in gdf[gdf[field].isna()].index:
                record(idx, rule, f"Field '{field}' is null")

        elif rule.get("check") == "range" and rule["field"] in gdf.columns:
            # The manifest declares range checks (e.g. ATTR_002); nulls are left to not_null rules
            field = rule["field"]
            values = pd.to_numeric(gdf[field], errors="coerce")
            mask = pd.Series(False, index=gdf.index)
            if "min" in rule:
                mask |= values < rule["min"]
            if "max" in rule:
                mask |= values > rule["max"]
            for idx in gdf[mask].index:
                record(idx, rule, f"Field '{field}' value {values.at[idx]} is out of range")

    return pd.DataFrame(violations)

# Run across tiles in parallel
tile_jobs = [("tiles/tile_0001.parquet", "T0001"), ("tiles/tile_0002.parquet", "T0002")]
rules_list = rules["rules"]

all_violations = []
with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
    futures = {
        executor.submit(validate_tile, path, tid, rules_list): tid
        for path, tid in tile_jobs
    }
    for future in concurrent.futures.as_completed(futures):
        tile_id = futures[future]
        try:
            df = future.result()
            all_violations.append(df)
            print(f"[{tile_id}] {len(df)} violations")
        except Exception as exc:
            print(f"[{tile_id}] FAILED: {exc}")

violations_df = pd.concat(all_violations, ignore_index=True) if all_violations else pd.DataFrame()
print(f"Total violations before deduplication: {len(violations_df):,}")
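
A tile that fails must not be mistaken for a tile with zero violations. The sketch below shows one way to retry failed tiles and keep an explicit failure list; run_tiles and its parameters are hypothetical, and it defaults to threads only so the example is easy to run. For CPU-bound validation a ProcessPoolExecutor can be passed as executor_cls, provided the worker is a picklable top-level function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_tiles(tile_jobs, worker, executor_cls=ThreadPoolExecutor, max_workers=4, max_attempts=2):
    """
    Run worker(path, tile_id) for every job, retrying failures.
    Returns (results, failed): results maps tile_id -> worker output,
    failed maps tile_id -> the last error message.
    """
    results, failed = {}, {}
    pending = list(tile_jobs)
    for attempt in range(1, max_attempts + 1):
        if not pending:
            break
        retry = []
        with executor_cls(max_workers=max_workers) as executor:
            futures = {
                executor.submit(worker, path, tile_id): (path, tile_id)
                for path, tile_id in pending
            }
            for future in as_completed(futures):
                path, tile_id = futures[future]
                try:
                    results[tile_id] = future.result()
                    failed.pop(tile_id, None)  # a successful retry clears the earlier failure
                except Exception as exc:
                    failed[tile_id] = f"attempt {attempt}: {exc}"
                    retry.append((path, tile_id))
        pending = retry
    return results, failed
```

If failed is non-empty after the final attempt, the run should be marked incomplete rather than reported as clean. A worker that needs extra arguments, such as the rule list, can be wrapped with functools.partial before being passed in.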

Step 4 — Aggregate and Sort the Error Report

Sorting deterministically by severity, tile, feature, and rule ensures the same report across reruns, which is essential for compliance diffing.

def build_error_report(violations: pd.DataFrame) -> pd.DataFrame:
    """
    Apply the severity sort order, deduplicate cross-boundary features,
    and produce a reproducible error report.
    """
    if violations.empty:
        return violations

    SEVERITY_ORDER = {"blocker": 0, "warning": 1, "informational": 2}
    violations = violations.copy()
    violations["_severity_rank"] = violations["severity"].map(SEVERITY_ORDER).fillna(9)
    # Compare identifiers as strings so mixed int/str ids sort without a TypeError
    violations["feature_id"] = violations["feature_id"].astype(str)

    # Sort before deduplicating: workers finish in arbitrary order, so the incoming row
    # order differs between runs and would otherwise decide which duplicate survives.
    ordered = violations.sort_values(["_severity_rank", "tile_id", "feature_id", "rule_id"])

    # Deduplicate: a feature may appear in two tiles near a boundary
    report = (
        ordered
        .drop_duplicates(subset=["feature_id", "rule_id"])
        .drop(columns=["_severity_rank"])
        .reset_index(drop=True)
    )

    return report

report = build_error_report(violations_df)
print(report[["feature_id", "rule_id", "severity", "explain"]].head(10).to_string())

# Write structured JSON report
report.to_json("reports/validation_run_20260623.ndjson", orient="records", lines=True)
print(f"Report written: {len(report):,} violations")
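
Compliance diffing between two runs reduces to a set comparison on the deduplication key. This is a minimal sketch, assuming both reports were written as newline-delimited JSON as above; load_ndjson and diff_reports are illustrative names.

```python
import json

def load_ndjson(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def diff_reports(previous: list[dict], current: list[dict]) -> dict:
    """Classify (feature_id, rule_id) pairs as new, resolved, or unchanged between runs."""
    def key(v: dict) -> tuple:
        return (str(v["feature_id"]), v["rule_id"])

    prev_keys = {key(v) for v in previous}
    curr_keys = {key(v) for v in current}
    return {
        "new": sorted(curr_keys - prev_keys),
        "resolved": sorted(prev_keys - curr_keys),
        "unchanged": sorted(prev_keys & curr_keys),
    }
```

The diff is only meaningful when both runs used the same rule manifest version, so it is worth storing that version alongside each report.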

When validation jobs run for hours, decouple aggregation from reporting by routing results to an asynchronous validation workflow so downstream consumers can subscribe to error streams without blocking the primary validation thread.

Step 5 — Route to Clean Output or Quarantine

Write valid features to the production directory using an atomic rename pattern. Route invalid features to a tagged quarantine layer.

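import hashlib
import os
import tempfile

def write_atomic(gdf: gpd.GeoDataFrame, target_path: str) -> str:
    """Write to a temp file, verify it reads back, then atomically replace the target.

    Returns the SHA-256 hex digest of the written file.
    """
    dir_name = os.path.dirname(target_path) or "."
    # Create the temp file in the target directory so the final rename stays on one filesystem
    with tempfile.NamedTemporaryFile(dir=dir_name, suffix=".parquet", delete=False) as tmp:
        tmp_path = tmp.name

    try:
        gdf.to_parquet(tmp_path, index=False)

        # Verify the file is readable and complete
        if len(gpd.read_parquet(tmp_path)) != len(gdf):
            raise IOError("Row count mismatch after write")

        # Hash in 1 MiB chunks so large outputs are not read into memory at once
        digest = hashlib.sha256()
        with open(tmp_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)

        # os.replace is atomic and overwrites an existing target on every platform
        os.replace(tmp_path, target_path)
    except Exception:
        # Never leave a partial temp file behind
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return digest.hexdigest()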

def route_features(
    source_gdf: gpd.GeoDataFrame,
    violations: pd.DataFrame,
    clean_path: str,
    quarantine_path: str,
    validation_timestamp: str,
    rule_version: str,
) -> dict:
    """Split source features into clean and quarantine sets, then write both."""
    # Violations carry feature_id as a string; compare and join on the same type.
    # Features with a null feature_id are reported under their row index and will
    # not match here, so route them separately.
    violations = violations.assign(feature_id=violations["feature_id"].astype(str))
    invalid_ids = set(violations.loc[violations["severity"] == "blocker", "feature_id"])

    is_invalid = source_gdf["feature_id"].astype(str).isin(invalid_ids)
    clean = source_gdf[~is_invalid].copy()
    quarantine = source_gdf[is_invalid].copy()
    quarantine["feature_id"] = quarantine["feature_id"].astype(str)
    # Count features now: the merge below fans out to one row per violation
    quarantined_features = len(quarantine)

    # Stamp metadata
    for gdf in (clean, quarantine):
        gdf["_validated_at"] = validation_timestamp
        gdf["_rule_version"] = rule_version

    quarantine = quarantine.merge(
        violations[["feature_id", "rule_id", "severity", "explain"]],
        on="feature_id", how="left"
    )

    clean_checksum = write_atomic(clean, clean_path)
    quarantine_checksum = write_atomic(quarantine, quarantine_path)

    return {
        "clean_features": len(clean),
        "quarantined_features": quarantined_features,
        "clean_checksum": clean_checksum,
        "quarantine_checksum": quarantine_checksum,
    }
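
The summary returned by the routing step is most useful when it is persisted next to the outputs it describes. The following sketch writes it as a JSON sidecar using the same write-then-rename pattern; write_run_summary and the sidecar layout are assumptions for illustration.

```python
import json
import os
import tempfile

def write_run_summary(summary: dict, target_path: str) -> None:
    """Write a JSON sidecar atomically so readers never see a half-written file."""
    dir_name = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".json.tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            # sort_keys keeps the file byte-identical for identical summaries
            json.dump(summary, f, sort_keys=True, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, target_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Downstream consumers can then recompute the SHA-256 of each output file and compare it with the checksum recorded in the sidecar before trusting the data.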

Common Failure Modes & Fixes

Symptom	Root Cause	Remediation
`MemoryError` in worker process	Tile too large; geometry-heavy polygons	Reduce max tile size to 500 MB; increase number of tiles
`TopologicalError` (Shapely 1.x) or `GEOSException: TopologyException` (Shapely 2.x)	Invalid geometry passed to overlay operation	Pre-screen with `gdf.geometry.is_valid`; repair with `gdf.geometry.make_valid()`, or `gdf.geometry.buffer(0)` for polygon-only layers
Duplicate violations for same feature	Feature straddles tile boundary	Add `drop_duplicates(subset=["feature_id", "rule_id"])` in aggregator
Silent CRS mismatch — wrong distances	Layers joined in mixed CRS	Assert `gdf.crs.equals(target_crs)` at tile load; reproject before any metric predicate
Checksum mismatch on atomic write	Disk full during write	Check `df -h` on scratch volume; stream to object storage if local disk is constrained
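`explain_validity` returns `"Valid Geometry"` for a visually broken shape	The defect lies outside OGC validity (for example ring orientation, which GEOS does not check), or the installed GEOS version handles an edge case differently	Add an explicit rule for the convention (e.g. `shapely.geometry.polygon.orient` for ring direction); upgrade GEOS to 3.11+; cross-check with PostGIS 3.4 `ST_IsValid`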
Non-reproducible error counts across reruns	Non-deterministic partition assignment	Sort input rows by spatial index (H3 or geohash) before tiling
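
The last remediation in the table, sorting by a spatial key before tiling, can be sketched without extra dependencies using a geohash. The example assumes geographic coordinates in degrees and rows that carry a representative point (lat, lon) plus a feature_id; geohash_encode and sort_for_tiling are hypothetical helper names.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Standard geohash: interleave longitude and latitude bisection bits, 5 bits per character."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def sort_for_tiling(rows: list[dict], precision: int = 8) -> list[dict]:
    """Order rows by geohash, breaking ties on feature_id so the order is total."""
    return sorted(
        rows,
        key=lambda r: (geohash_encode(r["lat"], r["lon"], precision), str(r["feature_id"])),
    )
```

Because the sort key depends only on each row's own coordinates and identifier, the resulting order, and therefore the tile assignment, is the same no matter how the input was shuffled.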

Performance & Scale Considerations

Indexing Strategy

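Before tiling, build a spatial index on the source dataset. For GeoPackage, run the following through GDAL (for example with ogrinfo -sql), since CreateSpatialIndex is a SQL function provided by the GDAL GeoPackage driver rather than by plain SQLite: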

-- Create an R-tree spatial index (GeoPackage uses SQLite RTree extension)
SELECT CreateSpatialIndex('parcels', 'geom');

-- Verify the index exists
SELECT * FROM sqlite_master WHERE type = 'table' AND name LIKE 'rtree_%';

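For GeoParquet, pass write_covering_bbox=True to GeoDataFrame.to_parquet (available from GeoPandas 1.0, which is newer than the 0.14 minimum listed in the prerequisites). This writes a per-row bounding-box covering column whose row-group statistics let readers skip row groups by extent without decoding geometry.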

Chunk Size Guidance

Dataset type	Recommended tile size	Rationale
Point cloud (>10M points)	2 GB per tile	Minimal per-feature geometry overhead
Line network (road/utility)	1 GB per tile	Moderate coordinate density
Dense polygon cadastre	500 MB per tile	High coordinate counts per feature
Multi-part polygon with holes	250 MB per tile	Memory amplification during topology checks
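
The table translates directly into a tile-count estimate for a given dataset. This is a rough sketch, assuming decimal units (1 GB = 10^9 bytes) and on-disk size as a proxy for in-memory size; TILE_SIZE_BYTES and plan_tile_count are illustrative names.

```python
import math

# Recommended tile sizes from the table above, in bytes
TILE_SIZE_BYTES = {
    "point_cloud": 2_000_000_000,
    "line_network": 1_000_000_000,
    "dense_polygon": 500_000_000,
    "multipart_polygon_with_holes": 250_000_000,
}

def plan_tile_count(dataset_bytes: int, dataset_type: str) -> int:
    """Number of tiles needed so that no tile exceeds the recommended size."""
    return max(1, math.ceil(dataset_bytes / TILE_SIZE_BYTES[dataset_type]))

print(plan_tile_count(50_000_000_000, "dense_polygon"))  # 100
```

Compressed formats such as GeoParquet expand when loaded, so the count should be revised upward after profiling peak memory on a sample tile.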

When to Move to Distributed Processing

Migrate from single-node ProcessPoolExecutor to Scaling GeoPandas Validation with Dask when any of the following apply:

A single tile no longer fits in 50–70% of available worker RAM even after size reduction
End-to-end wall-clock time for a full dataset run exceeds your daily SLA window
You need parallel I/O from multiple object-storage buckets or cloud regions
The rule manifest includes cross-partition spatial joins (e.g., containment checks against a reference polygon layer)

For datasets exceeding 500 GB or requiring sub-hourly validation cycles, Apache Sedona (formerly GeoSpark) offers native spatial partitioning on top of Apache Spark and avoids the Python GIL overhead that limits Dask-GeoPandas on CPU-bound geometry operations.

Memory Management

import gc
import psutil

def log_memory(label: str) -> None:
    proc = psutil.Process()
    rss_gb = proc.memory_info().rss / 1e9
    print(f"[{label}] RSS: {rss_gb:.2f} GB")

# Force garbage collection between tiles to reclaim shapely geometry objects
def process_tile_with_gc(tile_path: str, tile_id: str, rules: list) -> pd.DataFrame:
    log_memory(f"before {tile_id}")
    result = validate_tile(tile_path, tile_id, rules)
    gc.collect()
    log_memory(f"after {tile_id}")
    return result

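Set a per-worker memory limit in Dask (memory_limit="12GB") and leave spilling enabled. Spill-to-disk is governed by the distributed.worker.memory.target and distributed.worker.memory.spill configuration fractions rather than by a single on/off flag, and it helps keep workers below the threshold at which they are paused or restarted for exceeding their limit.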

Integration with the Validation Pipeline

Batch processing occupies the execution stage of the DAG defined in Validation Pipeline Architecture. It sits between the schema validation/ingestion stage (which rejects structurally malformed payloads before any geometry is loaded) and the error routing stage (which classifies results and dispatches remediation actions).

The rule manifest consumed by the batch workers is the same artifact produced by the Building Rule Engines with GeoPandas authoring workflow — allowing QA teams to toggle or add checks without touching the partitioner or aggregator code. This separation of concerns is the key architectural property that makes the pipeline testable: the rule engine and the batch executor can be validated independently.

Error records produced by the aggregator feed directly into the Categorizing and Prioritizing Spatial Errors stage, where blocker-severity violations trigger automated repair or rejection workflows, while informational records are archived for trend analysis.

For long-running batch jobs, the output error stream should be decoupled from the batch executor using the pattern described in Asynchronous Validation Workflows, which allows downstream consumers — data stewards, compliance dashboards, remediation queues — to subscribe to results without polling the batch executor or waiting for the full run to complete.

Geometry validity checks and OGC topology rules underpin the specific predicates used in the validate_tile function — reviewing those pages ensures your rule manifest references the correct OGC SFS clause for each check.

Frequently Asked Questions

What chunk size should I use when partitioning large spatial datasets?

Target 500 MB to 2 GB per partition, adjusted for geometry complexity. Simple point or line layers tolerate larger chunks; polygon datasets with dense coordinate rings should use smaller partitions (closer to 500 MB) to avoid memory spikes during topology checks. Profile worker peak RSS on a representative 10% sample before committing to a production tile size.

When should I switch from single-node GeoPandas to a distributed framework?

Move to Dask-GeoPandas or Apache Sedona when a single partition no longer fits in 50–70% of available RAM, when wall-clock time for a full dataset run exceeds your SLA, or when you need true parallel I/O across multiple object-storage buckets. The Scaling GeoPandas Validation with Dask guide walks through the migration step by step.

How do I handle features that span partition boundaries?

Apply a spatial buffer to each tile boundary equal to the maximum geometry diameter you expect, so cross-boundary features appear in both adjacent partitions. After all workers complete, deduplicate results by (feature_id, rule_id) in the aggregator before writing the final error report. Do not try to assign cross-boundary features to exactly one tile at partition time — the geometry needed to make that decision may itself be invalid.
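
The buffering step can be sketched with plain bounding-box arithmetic. The example assumes tiles and features are described by (minx, miny, maxx, maxy) tuples in the same CRS and that the buffer distance is in that CRS's units; the function names are hypothetical.

```python
def buffered_bounds(bounds: tuple, buffer_distance: float) -> tuple:
    """Expand a (minx, miny, maxx, maxy) box outward by buffer_distance on every side."""
    minx, miny, maxx, maxy = bounds
    return (
        minx - buffer_distance,
        miny - buffer_distance,
        maxx + buffer_distance,
        maxy + buffer_distance,
    )

def boxes_intersect(a: tuple, b: tuple) -> bool:
    """True when two boxes share at least a boundary point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def tiles_for_feature(feature_bbox: tuple, tiles: dict, buffer_distance: float) -> list[str]:
    """Ids of every tile whose buffered bounds touch the feature's bounding box."""
    return sorted(
        tile_id
        for tile_id, bounds in tiles.items()
        if boxes_intersect(buffered_bounds(bounds, buffer_distance), feature_bbox)
    )

tiles = {"T0001": (0, 0, 10, 10), "T0002": (10, 0, 20, 10)}
print(tiles_for_feature((9, 4, 11, 6), tiles, buffer_distance=1))  # ['T0001', 'T0002']
```

A feature assigned to two tiles this way produces duplicate violations by design, which is why the aggregator's deduplication on (feature_id, rule_id) is a required step rather than an optimization.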

Can I run batch spatial validation inside a CI pipeline?

Yes. Package the validation environment in a Docker image pinning GDAL 3.4+, GEOS 3.11+, and your Python dependencies, then trigger it as a GitHub Actions job or Prefect flow on every data push. Keep the dataset fixture small for CI (a representative 10 000-feature sample) and reserve full production runs for scheduled nightly jobs. This pattern also validates that your rule manifest is syntactically correct before it reaches production.

Related

Scaling GeoPandas Validation with Dask — distribute tile-level validation across a Dask cluster
Asynchronous Validation Workflows — decouple batch execution from error consumption
Building Rule Engines with GeoPandas — author and version-control the rule manifest
Categorizing and Prioritizing Spatial Errors — classify and route the output of this pipeline
