010 — Core data integration (raw → unified monthly)

This tutorial covers the data ingestion stage: read raw chamber TOA5 files, climate station outputs, and soil-sensor readings, then produce a unified monthly dataset that downstream notebooks consume.
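As a rough illustration of what reading a raw TOA5 file involves, the sketch below parses the four-line TOA5 header (environment line, column names, units, processing type) with the standard library. It is not the palmwtc reader: `read_toa5` is a hypothetical helper, and the sample station, table, and column names are invented for the example.

```python
import csv
import io


def read_toa5(handle):
    """Parse a TOA5 text stream into (environment, units, rows).

    Hypothetical helper for illustration only; not part of palmwtc.
    """
    reader = csv.reader(handle)
    environment = next(reader)  # line 1: format tag, station name, logger info, table name
    if environment[0] != "TOA5":
        raise ValueError(f"not a TOA5 file: first field is {environment[0]!r}")
    names = next(reader)        # line 2: column names
    units = next(reader)        # line 3: unit for each column
    next(reader)                # line 4: processing type (Smp, Avg, ...), unused here
    # Remaining lines are data records; values are kept as strings.
    rows = [dict(zip(names, record)) for record in reader if record]
    return environment, dict(zip(names, units)), rows


# Invented sample content, shaped like a TOA5 file.
SAMPLE_TOA5 = "\n".join([
    '"TOA5","demo_station","CR1000X","0000","OS","prog.CR1X","0","Table1"',
    '"TIMESTAMP","RECORD","CO2","H2O"',
    '"TS","RN","ppm","mmol/mol"',
    '"","","Smp","Smp"',
    '"2024-01-01 00:00:00",0,412.5,18.2',
    '"2024-01-01 00:00:30",1,413.1,18.3',
])

environment, units, rows = read_toa5(io.StringIO(SAMPLE_TOA5))
print(environment[1], units["CO2"], len(rows))
```

Values stay as strings here so that type conversion and missing-value handling remain explicit choices for the caller.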

Requires real data. The bundled synthetic sample skips this stage (it ships post-QC parquet directly). Set PALMWTC_DATA_DIR to point at a directory of raw chamber files to follow along. See docs/quickstart.md for the expected directory layout.
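Before running the cells below, it can help to confirm that PALMWTC_DATA_DIR is set and points at an existing directory. The following is a minimal sketch using only the standard library; `raw_data_root` is a hypothetical helper, not a palmwtc function, and it makes no assumption about how palmwtc itself reads the variable.

```python
import os
from pathlib import Path


def raw_data_root(env=os.environ):
    """Return the directory named by PALMWTC_DATA_DIR, or None if unset or missing.

    Hypothetical helper for illustration only; not part of palmwtc.
    """
    value = env.get("PALMWTC_DATA_DIR")
    if not value:
        return None
    root = Path(value).expanduser()
    return root if root.is_dir() else None


root = raw_data_root()
if root is None:
    print("PALMWTC_DATA_DIR is unset or not a directory.")
else:
    print(f"Raw data root: {root}")
```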

This notebook is a thin scaffold showing the raw-data entry points; the full ingest pipeline is exposed through the palmwtc.pipeline.step_ingest function.

from palmwtc.config import DataPaths
from palmwtc.io import find_latest_qc_file, get_cloud_sensor_dirs

paths = DataPaths.resolve()
print(paths.describe())
DataPaths (source=sample (bundled synthetic), site=libz):
  raw_dir       = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/synthetic
  processed_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/Data/Integrated_QC_Data
  exports_dir   = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/exports
  config_dir    = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/config
  extras        = <none>
# Example: locate raw chamber + climate + soil dirs (only meaningful with real data).
# `get_cloud_sensor_dirs(chamber_base)` walks a Google-Drive-mounted root
# and returns {kind: [dirs]} for chamber_1/, chamber_2/, climate/, soil_sensor/.
# The bundled synthetic sample doesn't ship raw chamber dirs, so this is a no-op
# under zero-config.

try:
    sensor_dirs = get_cloud_sensor_dirs(paths.raw_dir)
    print("Sensor directories found:")
    for kind, dirs in sensor_dirs.items():
        print(f"  {kind}: {len(dirs)} directories")
except (FileNotFoundError, TypeError) as e:
    print(f"[bundled synthetic sample]  get_cloud_sensor_dirs not applicable: {type(e).__name__}")
    print("Set PALMWTC_DATA_DIR to a real chamber root to follow this path end-to-end.")
  Cloud chamber_1: 0 directories found
  Cloud chamber_2: 0 directories found
  Cloud climate: 0 directories found
  Cloud soil_sensor: 0 directories found
Sensor directories found:
  chamber_1: 0 directories
  chamber_2: 0 directories
  climate: 0 directories
  soil_sensor: 0 directories
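To make the {kind: [dirs]} return shape concrete, here is a standard-library sketch of a directory walk that produces the same kind of mapping. It is not the implementation of get_cloud_sensor_dirs: `find_sensor_dirs` is a hypothetical helper, and matching on exact directory names is an assumption made for the example.

```python
from pathlib import Path

# Kinds taken from the comment above; exact-name matching is an assumption.
KINDS = ("chamber_1", "chamber_2", "climate", "soil_sensor")


def find_sensor_dirs(root):
    """Walk `root` and return {kind: [matching directories]}.

    Hypothetical helper for illustration only; not part of palmwtc.
    """
    root = Path(root)
    found = {kind: [] for kind in KINDS}
    if not root.is_dir():
        return found  # missing root: every kind maps to an empty list
    for path in sorted(root.rglob("*")):
        if path.is_dir() and path.name in found:
            found[path.name].append(path)
    return found
```

Returning an empty list per kind, rather than raising, mirrors the output above, where each kind reports 0 directories under the bundled sample.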

The full ingest path calls palmwtc.io.load_from_multiple_dirs to read every monthly chunk, then integrate_temp_humidity_c2 to fuse climate + chamber temperature/humidity, then export_monthly to write the unified monthly parquet that notebook 020 reads.
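The three stages of the ingest path (load chunks, fuse climate with chamber readings, split by month) can be sketched with plain Python records. This is a simplified stand-in, not the palmwtc implementation: the function names, field names, and sample values are all hypothetical, rows are fused only on exactly matching timestamps, and the sketch stops at grouping rather than writing parquet.

```python
from collections import defaultdict
from datetime import datetime


def load_chunks(chunks):
    """Concatenate per-chunk record lists and sort them by timestamp."""
    records = [row for chunk in chunks for row in chunk]
    return sorted(records, key=lambda row: row["timestamp"])


def fuse_on_timestamp(chamber, climate):
    """Attach climate fields to chamber rows that share an exact timestamp."""
    by_time = {row["timestamp"]: row for row in climate}
    fused = []
    for row in chamber:
        match = by_time.get(row["timestamp"], {})
        fused.append({**match, **row})  # chamber values win on key clashes
    return fused


def split_by_month(records):
    """Group records under 'YYYY-MM' keys, one group per monthly output."""
    months = defaultdict(list)
    for row in records:
        months[row["timestamp"].strftime("%Y-%m")].append(row)
    return dict(months)


# Invented sample records; chunks are deliberately out of order.
chamber_chunks = [
    [{"timestamp": datetime(2024, 2, 1, 0, 0), "chamber_temp_c": 27.9}],
    [{"timestamp": datetime(2024, 1, 31, 23, 30), "chamber_temp_c": 28.4}],
]
climate = [
    {"timestamp": datetime(2024, 1, 31, 23, 30), "air_temp_c": 26.1, "rh_pct": 88.0},
]

monthly = split_by_month(fuse_on_timestamp(load_chunks(chamber_chunks), climate))
for month, rows in sorted(monthly.items()):
    print(month, len(rows))
```

Chamber rows with no matching climate record pass through unchanged, so gaps in the climate series stay visible downstream instead of being silently dropped.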