Classifying Topology Errors by Severity
Classifying topology errors by severity requires mapping geometric violations to operational impact thresholds rather than treating all spatial anomalies equally. A robust classification system separates pipeline-blocking violations from cosmetic artifacts, ensuring that data engineering resources target the highest-risk geometry first. Critical errors violate hard constraints—such as overlapping exclusive boundaries, self-intersecting polygons, or invalid ring orientations—that immediately break downstream routing, rendering, or compliance reporting. Major errors degrade analytical precision but allow processing to continue with documented caveats, while minor errors flag stylistic or density anomalies that warrant batch cleanup during maintenance windows.
Severity Tiers & Impact Mapping
Effective classification starts with a deterministic scoring matrix aligned with your organization’s data contracts. When designing a framework for Categorizing and Prioritizing Spatial Errors, map each topology rule to a severity tier using three operational dimensions: impact tolerance, downstream failure modes, and remediation urgency.
| Severity | Spatial Violation Type | Tolerance Handling | Downstream Impact |
|---|---|---|---|
| Critical (P0) | Self-intersections, duplicate geometries, inverted rings, overlapping exclusive zones | Zero tolerance | Pipeline halts; compliance audit fails; routing/join operations crash |
| Major (P1) | Gaps below a configurable threshold, slivers, minor overshoots/undershoots, dangling nodes | Buffer-based tolerance (e.g., 0.5–2 m) | Analytical bias introduced; requires QA sign-off or automated repair before production |
| Minor (P2) | Excessive vertex density, non-planar edges in non-critical layers, attribute-geometry mismatches | Metadata-only logging | Performance degradation over time; scheduled for batch optimization |
Compliance officers and data stewards should anchor these tiers to a recognized quality framework such as ISO 19157 (Geographic information: Data quality), which treats logical consistency, including topological consistency, as a measurable data quality element. GIS analysts can translate these tiers into automated gate checks that route records to quarantine, repair queues, or direct publication.
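As a rough illustration of such a gate check, the sketch below stores a tier matrix as JSON and routes each record by its most severe violation. The rule names, the version field, and the destination labels (quarantine, repair_queue, publish) are assumptions made for this example, not part of any standard or library.

```python
import json

# Hypothetical rule-to-tier matrix; in practice this would live in a versioned file.
RULES_JSON = """
{
  "version": "1",
  "tiers": {
    "self_intersection": "P0",
    "exclusive_zone_overlap": "P0",
    "sliver": "P1",
    "dangling_node": "P1",
    "high_vertex_density": "P2"
  }
}
"""

# Assumed destinations: P2 records are published and only logged.
DESTINATIONS = {"P0": "quarantine", "P1": "repair_queue", "P2": "publish"}
SEVERITY_ORDER = ["P0", "P1", "P2"]  # most to least severe


def route_records(violations, all_row_ids, rules):
    """Return {row_id: destination} based on each row's most severe violation.

    violations: iterable of (row_id, rule_name) pairs.
    Rows without any violation are published directly.
    """
    tiers = rules["tiers"]
    worst = {}
    for row_id, rule_name in violations:
        tier = tiers[rule_name]  # an unknown rule raises KeyError on purpose
        current = worst.get(row_id)
        if current is None or SEVERITY_ORDER.index(tier) < SEVERITY_ORDER.index(current):
            worst[row_id] = tier
    return {
        row_id: DESTINATIONS[worst[row_id]] if row_id in worst else "publish"
        for row_id in all_row_ids
    }


rules = json.loads(RULES_JSON)
violations = [
    (1, "sliver"),
    (1, "self_intersection"),
    (2, "high_vertex_density"),
    (3, "dangling_node"),
]
routes = route_records(violations, all_row_ids=[1, 2, 3, 4], rules=rules)
print(routes)
```

Row 1 carries both a P1 and a P0 violation, so the P0 tier wins and the record is quarantined. Keeping the matrix in data rather than in code also makes tier changes reviewable.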
Automated Classification Implementation
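The following Python implementation uses geopandas and shapely to classify topology errors by severity in a vector dataset. It runs one representative check per tier: geometry validity (P0), pairwise overlap area against a configurable threshold (P1), and vertex count against a configurable limit (P2). The output is a structured report suitable for CI/CD or ETL pipelines. Gap, sliver, and dangling-node detection are not covered and would need additional checks.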
import geopandas as gpd
import pandas as pd
from shapely.validation import make_valid

REPORT_COLUMNS = ["row_index", "severity", "violation_type", "details"]


def count_vertices(geom) -> int:
    """Count coordinates across exterior rings, holes, and multi-part members."""
    if geom is None or geom.is_empty:
        return 0
    if hasattr(geom, "geoms"):  # Multi* geometries and GeometryCollection
        return sum(count_vertices(part) for part in geom.geoms)
    if geom.geom_type == "Polygon":
        return len(geom.exterior.coords) + sum(len(ring.coords) for ring in geom.interiors)
    return len(geom.coords)  # Point, LineString, LinearRing


def classify_topology_errors(
    gdf: gpd.GeoDataFrame,
    overlap_threshold: float = 1.0,
    max_vertices: int = 5000
) -> pd.DataFrame:
    """
    Classifies topology errors by severity (P0, P1, P2) for a GeoDataFrame.
    Returns a report DataFrame with row indices, severity, and violation details.
    overlap_threshold is an area in squared CRS units, so use a projected CRS.
    """
    if gdf.empty:
        return pd.DataFrame(columns=REPORT_COLUMNS)

    reports = []

    # P0: Invalid geometries (self-intersections, crossing rings, etc.).
    # OGC validity does not cover ring orientation, so is_valid will not flag it.
    invalid_mask = ~gdf.geometry.is_valid
    for idx, geom in gdf.geometry[invalid_mask].items():
        try:
            fixed = make_valid(geom)
            reports.append({
                "row_index": idx,
                "severity": "P0",
                "violation_type": "Invalid Geometry",
                "details": f"Original: {geom.geom_type}, Fixed: {fixed.geom_type}"
            })
        except Exception as e:
            reports.append({
                "row_index": idx,
                "severity": "P0",
                "violation_type": "Unrepairable Geometry",
                "details": str(e)
            })

    # P1: Overlaps between valid geometries. Invalid ones are skipped because
    # they are already reported as P0 and can make intersection() raise.
    # Overlaps are reported as P1 here; promote them to P0 for layers whose
    # boundaries must be exclusive. Gap detection is not implemented.
    valid = gdf.geometry[~invalid_mask]
    if len(valid) > 1:
        # The spatial index returns positional candidate pairs (array input to
        # sindex.query needs a recent geopandas release).
        left_pos, right_pos = valid.sindex.query(valid, predicate="intersects")
        for i, j in zip(left_pos, right_pos):
            if i >= j:  # skip self-matches and mirrored (j, i) pairs
                continue
            area = valid.iloc[i].intersection(valid.iloc[j]).area
            if area > overlap_threshold:
                reports.append({
                    "row_index": valid.index[i],
                    "severity": "P1",
                    "violation_type": "Overlap",
                    "details": f"Overlaps with index {valid.index[j]} ({area:.2f} sq units)"
                })

    # P2: Excessive vertex density (performance degradation)
    vertex_counts = gdf.geometry.apply(count_vertices)
    for idx, n_vertices in vertex_counts[vertex_counts > max_vertices].items():
        reports.append({
            "row_index": idx,
            "severity": "P2",
            "violation_type": "High Vertex Density",
            "details": f"{n_vertices} vertices (threshold: {max_vertices})"
        })

    # Passing columns keeps the header even when no violations are found.
    return pd.DataFrame(reports, columns=REPORT_COLUMNS)


# Example usage:
# gdf = gpd.read_file("input_boundaries.gpkg")
# gdf = gdf.to_crs(gdf.estimate_utm_crs())  # project so thresholds are in meters
# report = classify_topology_errors(gdf, overlap_threshold=0.5, max_vertices=3000)
# report.to_csv("topology_severity_report.csv", index=False)
Integrating into Validation Workflows
Embedding severity classification directly into your Validation Pipeline Architecture ensures that topology checks run before data reaches production lakes or feature stores. In CI/CD environments, configure the script as a pre-commit hook or GitHub Actions step that fails the build on any P0 violation. P1 errors should trigger automated repair routines (e.g., a zero-width buffer via shapely's buffer(0), or grid snapping via shapely.set_precision or PostGIS ST_SnapToGrid) followed by a manual QA review. P2 violations can be aggregated into weekly optimization tickets for the data engineering team.
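One way to wire the P0 gate into a CI step is a small script that reads the report CSV and returns a non-zero exit code when any P0 row is present. The sketch below assumes the report has the severity column written by the classification script; the function name ci_exit_code is illustrative, and an in-memory string stands in for the real file.

```python
import csv
import io
from collections import Counter


def ci_exit_code(report_file) -> int:
    """Return 1 if the report contains any P0 row, otherwise 0."""
    counts = Counter(row["severity"] for row in csv.DictReader(report_file))
    for tier in ("P0", "P1", "P2"):
        print(f"{tier}: {counts.get(tier, 0)} violation(s)")
    return 1 if counts.get("P0", 0) > 0 else 0


# In-memory stand-in for topology_severity_report.csv
sample = io.StringIO(
    "row_index,severity,violation_type,details\n"
    "4,P0,Invalid Geometry,example\n"
    "7,P2,High Vertex Density,example\n"
)
exit_code = ci_exit_code(sample)

# In a real CI step, open the CSV with newline="" and pass the result
# of ci_exit_code to sys.exit so the build fails on P0 violations.
```

Returning the code from a function rather than exiting inside it keeps the gate easy to unit test.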
For enterprise deployments, route classification outputs to a centralized observability dashboard. Track metrics like p0_violation_rate, mean_time_to_repair, and sliver_polygon_frequency. These KPIs directly inform data contract updates and help teams adjust tolerance thresholds as coordinate reference systems (CRS) or source data quality evolve.
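As a sketch of how the first of these metrics might be derived, the snippet below computes p0_violation_rate as the share of input rows carrying at least one P0 violation. That definition is an assumption, and some teams may prefer a per-violation rate. mean_time_to_repair is left out because it needs repair timestamps that the severity report does not contain.

```python
def p0_violation_rate(report_rows, total_rows: int) -> float:
    """Fraction of input rows with at least one P0 violation (assumed definition)."""
    if total_rows <= 0:
        return 0.0
    # A set avoids double counting rows that have several P0 entries.
    p0_rows = {r["row_index"] for r in report_rows if r["severity"] == "P0"}
    return len(p0_rows) / total_rows


report_rows = [
    {"row_index": 0, "severity": "P0"},
    {"row_index": 0, "severity": "P1"},
    {"row_index": 3, "severity": "P0"},
    {"row_index": 5, "severity": "P2"},
]
rate = p0_violation_rate(report_rows, total_rows=8)
print(f"p0_violation_rate = {rate:.3f}")
```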
Tolerance Calibration & CRS Best Practices
- Define Tolerances in Native CRS Units: Project geometries to an appropriate metric CRS (e.g., a UTM zone, or a State Plane zone defined in meters rather than feet) before applying meter-based tolerances. In unprojected WGS84 a tolerance value is interpreted in degrees, and the ground length of a degree of longitude shrinks toward the poles, so the same value covers different distances at different latitudes.
- Reference Authoritative Specs: The OGC Simple Features specification defines the geometry model and the validity rules for each geometry type, which makes it a sound baseline for P0 checks. Tolerance values themselves should come from your data contracts.
- Log, Don’t Drop: Never silently discard invalid geometries during ingestion. Quarantine them with full provenance metadata so source systems can be corrected upstream.
- Version Your Rules: Topology thresholds drift as business requirements change. Store classification rules in configuration files (YAML/JSON) rather than hardcoding them, enabling audit trails and A/B testing of tolerance values.
- Validate at Multiple Stages: Run lightweight P0 checks at ingestion, comprehensive P1 validation during transformation, and full P2 audits during archival. This staged approach prevents compute bottlenecks while maintaining data integrity.
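The latitude effect behind the first bullet can be checked with simple arithmetic. The sketch below uses a spherical approximation (about 111,320 m per degree of longitude at the equator, scaled by the cosine of latitude), an assumption that is adequate for illustration but not for survey-grade work. The function names are hypothetical.

```python
import math

METERS_PER_DEGREE_AT_EQUATOR = 111_320.0  # spherical approximation


def meters_per_degree_longitude(latitude_deg: float) -> float:
    """Approximate ground length of one degree of longitude at a latitude."""
    return METERS_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude_deg))


def degrees_for_meters(tolerance_m: float, latitude_deg: float) -> float:
    """East-west tolerance in degrees that corresponds to tolerance_m meters."""
    return tolerance_m / meters_per_degree_longitude(latitude_deg)


for lat in (0, 45, 60):
    print(
        f"lat {lat:>2}: 1 degree of longitude ~ {meters_per_degree_longitude(lat):,.0f} m; "
        f"1 m ~ {degrees_for_meters(1.0, lat):.2e} degrees"
    )
```

At 60 degrees latitude a degree of longitude is roughly half its equatorial length, so a fixed tolerance expressed in degrees covers about half the east-west ground distance it does at the equator. Projecting to a metric CRS first removes this dependence.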