010 - Core data integration (raw → unified monthly)
This tutorial covers the data ingestion stage: read raw chamber TOA5 files, climate station outputs, and soil-sensor readings, then produce a unified monthly dataset that downstream notebooks consume.
Requires real data. The bundled synthetic sample skips this stage (it ships post-QC parquet directly). Set
PALMWTC_DATA_DIR to point at a directory of raw chamber files to follow along. See docs/quickstart.md for the expected directory layout.
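One way to set and sanity-check the variable from Python is sketched below. This is a minimal illustration, not palmwtc code: the helper name `data_dir_from_env` is hypothetical, the temporary directory only stands in for a real raw-data root, and how palmwtc itself reads PALMWTC_DATA_DIR is not shown in this notebook.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def data_dir_from_env(var: str = "PALMWTC_DATA_DIR") -> Optional[Path]:
    """Return the directory named by the environment variable, or None if unset.

    Hypothetical helper for illustration; not part of palmwtc.
    """
    value = os.environ.get(var)
    if not value:
        return None
    path = Path(value).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"{var} points at a missing directory: {path}")
    return path


# Placeholder directory standing in for a real raw-data root.
with tempfile.TemporaryDirectory() as tmp:
    os.environ["PALMWTC_DATA_DIR"] = tmp
    print(data_dir_from_env())
    del os.environ["PALMWTC_DATA_DIR"]
```

Failing early on a missing directory keeps a mistyped path from being confused with an empty data root.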
This notebook is a thin scaffold showing the raw-data entry points;
the full ingest pipeline is exposed through the
palmwtc.pipeline.step_ingest function.
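For orientation, a TOA5 file is comma-separated text with four header lines (logger metadata, column names, units, processing type) followed by data rows. The sketch below reads one such stream with the standard library only. `read_toa5` is a hypothetical helper, not the palmwtc reader, and the sample columns (CO2, AirTC) and station values are made up.

```python
import csv
import io
from datetime import datetime


def read_toa5(handle):
    """Parse a TOA5 stream into (metadata, units, rows). Illustrative only."""
    reader = csv.reader(handle)
    metadata = next(reader)                 # line 1: "TOA5", station name, logger model, ...
    names = next(reader)                    # line 2: column names
    units = dict(zip(names, next(reader)))  # line 3: units per column
    next(reader)                            # line 4: processing type (Smp, Avg, ...), unused here
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        row = dict(zip(names, record))
        # Assumes whole-second timestamps; values other than TIMESTAMP stay as strings.
        row["TIMESTAMP"] = datetime.strptime(row["TIMESTAMP"], "%Y-%m-%d %H:%M:%S")
        rows.append(row)
    return metadata, units, rows


# Made-up sample content in TOA5 layout.
sample = io.StringIO(
    '"TOA5","chamber_1","CR1000X","0000","OS","prog.CR1X","0","Table1"\n'
    '"TIMESTAMP","RECORD","CO2","AirTC"\n'
    '"TS","RN","ppm","Deg C"\n'
    '"","","Smp","Avg"\n'
    '"2024-01-31 23:59:00",1,412.5,26.1\n'
    '"2024-02-01 00:00:00",2,413.0,26.0\n'
)
metadata, units, rows = read_toa5(sample)
print(metadata[1], units["CO2"], len(rows), rows[0]["TIMESTAMP"])
```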
from palmwtc.config import DataPaths
from palmwtc.io import find_latest_qc_file, get_cloud_sensor_dirs
paths = DataPaths.resolve()
print(paths.describe())
DataPaths (source=sample (bundled synthetic), site=libz):
raw_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/synthetic
processed_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/Data/Integrated_QC_Data
exports_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/exports
config_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/config
extras = <none>
# Example: locate raw chamber + climate + soil dirs (only meaningful with real data).
# `get_cloud_sensor_dirs(chamber_base)` walks a Google-Drive-mounted root
# and returns {kind: [dirs]} for chamber_1/, chamber_2/, climate/, soil_sensor/.
# The bundled synthetic sample doesn't ship raw chamber dirs, so this is a no-op
# under zero-config.
try:
    sensor_dirs = get_cloud_sensor_dirs(paths.raw_dir)
    print("Sensor directories found:")
    for kind, dirs in sensor_dirs.items():
        print(f" {kind}: {len(dirs)} directories")
except (FileNotFoundError, TypeError) as e:
    print(f"[bundled synthetic sample] get_cloud_sensor_dirs not applicable: {type(e).__name__}")
    print("Set PALMWTC_DATA_DIR to a real chamber root to follow this path end-to-end.")
Cloud chamber_1: 0 directories found
Cloud chamber_2: 0 directories found
Cloud climate: 0 directories found
Cloud soil_sensor: 0 directories found
Sensor directories found:
chamber_1: 0 directories
chamber_2: 0 directories
climate: 0 directories
soil_sensor: 0 directories
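The zero counts above are what a root without raw sensor directories produces. As a rough picture of what such a lookup does, the sketch below groups the immediate subdirectories of a root by sensor kind using only pathlib. It is not the palmwtc implementation: `find_sensor_dirs`, the prefix-matching rule, and the single-level scan are assumptions for illustration.

```python
import tempfile
from pathlib import Path

KINDS = ("chamber_1", "chamber_2", "climate", "soil_sensor")


def find_sensor_dirs(root):
    """Map each sensor kind to the subdirectories of root whose name starts with it."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(root)
    found = {kind: [] for kind in KINDS}
    for child in sorted(root.iterdir()):
        if not child.is_dir():
            continue
        for kind in KINDS:
            if child.name.startswith(kind):
                found[kind].append(child)
                break  # each directory is counted under one kind only
    return found


# Throwaway directory tree standing in for a real chamber root.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chamber_1", "chamber_2_2024", "climate"):
        (Path(tmp) / name).mkdir()
    demo = find_sensor_dirs(tmp)
    counts = {kind: len(dirs) for kind, dirs in demo.items()}
    print(counts)
```

Returning an empty list for a missing kind, rather than omitting the key, keeps the result shape the same whether or not data is present.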
The full ingest path calls palmwtc.io.load_from_multiple_dirs to
read every monthly chunk, then integrate_temp_humidity_c2 to fuse climate
and chamber temperature/humidity, then
export_monthly to write the unified monthly parquet that notebook 020 reads.
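The load, fuse, and export steps can be pictured with a small standard-library sketch: concatenate records from several sources, attach climate readings by timestamp, and split the result by calendar month. The function names, the record layout, and the exact-timestamp join are all assumptions for illustration. The palmwtc functions named above may work differently, and their signatures are not shown here.

```python
from collections import defaultdict
from datetime import datetime


def load_chunks(chunks):
    """Concatenate several lists of records and sort them by timestamp."""
    merged = [row for chunk in chunks for row in chunk]
    return sorted(merged, key=lambda row: row["time"])


def fuse_climate(chamber_rows, climate_rows):
    """Attach climate temperature/humidity to chamber rows with the same timestamp."""
    by_time = {row["time"]: row for row in climate_rows}
    fused = []
    for row in chamber_rows:
        climate = by_time.get(row["time"], {})  # no match leaves both fields as None
        fused.append({**row, "air_temp": climate.get("air_temp"), "rh": climate.get("rh")})
    return fused


def split_by_month(rows):
    """Group rows under a 'YYYY-MM' key, one group per monthly output."""
    months = defaultdict(list)
    for row in rows:
        months[row["time"].strftime("%Y-%m")].append(row)
    return dict(months)


# Made-up records: two chamber chunks and one climate reading.
jan = [{"time": datetime(2024, 1, 31, 23, 0), "co2": 412.5}]
feb = [{"time": datetime(2024, 2, 1, 0, 0), "co2": 413.0}]
climate = [{"time": datetime(2024, 2, 1, 0, 0), "air_temp": 26.0, "rh": 81.0}]

monthly = split_by_month(fuse_climate(load_chunks([feb, jan]), climate))
for month, rows in monthly.items():
    print(month, rows)
```

A real pipeline would then write each monthly group to a parquet file; that step is left out here to keep the sketch free of third-party dependencies.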