020 — Rule-based QC

This tutorial applies the multi-stage QC pipeline to the unified monthly data from notebook 010: physical-bounds checks, IQR outliers, breakpoint detection (ruptures), drift detection, sensor-exclusion masks, and combined flag synthesis.

This notebook requires real data. The bundled synthetic sample ships as post-QC parquet, so it skips this stage. The synthetic generator (scripts/make_sample_data.py) injects realistic edge cases, such as NaN bursts, drift segments, and out-of-bounds (OOB) spikes, that exercise the QC code paths when this notebook is run against real data.

Future work will thin this notebook further by adding palmwtc.pipeline.step_qc_full, a wrapper around the joblib-parallel process_variable_qc loop.

from palmwtc.config import DataPaths
from palmwtc.qc import (
    QCProcessor,
    apply_iqr_flags,
    apply_physical_bounds_flags,
    detect_breakpoints_ruptures,
)

paths = DataPaths.resolve()
print(paths.describe())
DataPaths (source=sample (bundled synthetic), site=libz):
  raw_dir       = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/synthetic
  processed_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/Data/Integrated_QC_Data
  exports_dir   = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/exports
  config_dir    = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/config
  extras        = <none>

QC components (preview)

The full notebook iterates over each variable in config/variable_config.json and runs:

result = QCProcessor(paths).run("CO2_C1")

which under the hood calls (in order):

  1. apply_physical_bounds_flags — out-of-range values

  2. apply_iqr_flags — IQR outliers per rolling window

  3. apply_rate_of_change_flags — implausible jumps

  4. apply_persistence_flags — stuck-sensor detection

  5. apply_battery_proxy_flags — datalogger health proxy

  6. apply_sensor_exclusion_flags — manual + auto exclusion windows

  7. combine_qc_flags — synthesis into one composite flag

  8. detect_breakpoints_ruptures — step-change detection

  9. detect_drift_windstats — slow-baseline-drift detection
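As a rough illustration of the idea behind steps 1 to 3 and step 7 above, the sketch below implements bounds, rolling-IQR, and rate-of-change checks with plain pandas and merges them into one composite flag. It is not the palmwtc implementation: the function names, the 0/1/2 flag convention, the thresholds, and the "worst flag wins" combination rule are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical flag convention for this sketch only: 0 = good, 1 = suspect, 2 = bad.
GOOD, SUSPECT, BAD = 0, 1, 2


def flag_physical_bounds(s: pd.Series, lower: float, upper: float) -> pd.Series:
    """Mark values outside [lower, upper], and missing values, as BAD."""
    out_of_range = (s < lower) | (s > upper) | s.isna()
    return pd.Series(np.where(out_of_range, BAD, GOOD), index=s.index)


def flag_rolling_iqr(s: pd.Series, window: int = 24, k: float = 1.5) -> pd.Series:
    """Mark values outside the rolling [Q1 - k*IQR, Q3 + k*IQR] fence as SUSPECT."""
    roll = s.rolling(window, center=True, min_periods=window // 2)
    q1, q3 = roll.quantile(0.25), roll.quantile(0.75)
    iqr = q3 - q1
    outlier = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return pd.Series(np.where(outlier, SUSPECT, GOOD), index=s.index)


def flag_rate_of_change(s: pd.Series, max_step: float) -> pd.Series:
    """Mark points whose absolute change from the previous sample exceeds max_step."""
    jump = s.diff().abs() > max_step
    return pd.Series(np.where(jump, SUSPECT, GOOD), index=s.index)


def combine_flags(*flags: pd.Series) -> pd.Series:
    """Composite flag: the worst (largest) individual flag at each timestamp."""
    return pd.concat(flags, axis=1).max(axis=1).astype(int)


# Made-up CO2-like series with three injected problems.
values = 400.0 + 2.0 * np.sin(np.arange(48) / 3.0)
values[10] = 5000.0   # out-of-bounds spike
values[20] = np.nan   # missing sample
values[30] = 450.0    # in range, but far from its neighbours
series = pd.Series(values)

composite = combine_flags(
    flag_physical_bounds(series, lower=300.0, upper=1000.0),
    flag_rolling_iqr(series, window=24, k=1.5),
    flag_rate_of_change(series, max_step=20.0),
)
print(composite[composite > 0])
```

Taking the maximum keeps the synthesis simple: a point that fails any hard check stays bad regardless of what the softer checks report. The real combine_qc_flags may weigh its inputs differently.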

Joblib parallelism (process_variable_qc) keeps the loop fast over many variables.
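The fan-out pattern can be sketched with joblib's Parallel and delayed. This is a minimal sketch, not the actual process_variable_qc: the qc_one_variable helper, the variable names other than CO2_C1, the data, and the bounds are all hypothetical, and the thread-based backend is chosen here only to keep the example lightweight.

```python
from joblib import Parallel, delayed

# Hypothetical per-variable samples and bounds, made up for this sketch.
DATA = {
    "CO2_C1": [410.0, 415.2, 9999.0, 412.1],
    "CO2_C2": [405.3, -50.0, 407.7, 9999.0],
    "H2O_C1": [12.1, 12.4, 12.3, 12.2],
}
BOUNDS = {
    "CO2_C1": (300.0, 1000.0),
    "CO2_C2": (300.0, 1000.0),
    "H2O_C1": (0.0, 60.0),
}


def qc_one_variable(name: str) -> tuple[str, int]:
    """Stand-in for per-variable QC: count samples outside the variable's bounds."""
    lower, upper = BOUNDS[name]
    n_bad = sum(1 for v in DATA[name] if not (lower <= v <= upper))
    return name, n_bad


# Each variable is independent, so the loop parallelises trivially.
# Parallel returns results in the same order as the input iterable.
results = dict(
    Parallel(n_jobs=2, prefer="threads")(
        delayed(qc_one_variable)(name) for name in DATA
    )
)
print(results)
```

Because no variable depends on another variable's flags, the work splits cleanly per variable, which is what makes this kind of loop a good fit for joblib.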