000 — End-to-end on synthetic data

This notebook walks through the complete palmwtc pipeline using the bundled synthetic sample — no field data needed. Each section shows one step of the pipeline and explains what it is doing scientifically, not just technically.

The system these tools were built for: two automated whole-tree flux chambers around individual oil palm trees, each measuring CO₂ and H₂O concentration every 30 seconds with a LI-COR LI-850 analyser. The pipeline converts those raw concentration readings into calibrated net CO₂ exchange values — and then checks whether those values are consistent with what the plant physiology literature says oil palms should be doing.

For the same pipeline applied to real chamber data, see the sibling notebook 001_End_to_End_LIBZ.ipynb. That notebook requires a LIBZ-style dataset on disk, which is not bundled with the package.

After this notebook, the per-stage tutorials (010–035) each cover one step in much more depth.

import pandas as pd
import matplotlib
matplotlib.use("Agg")      # headless rendering — safe in all environments

from palmwtc.config import DataPaths

paths = DataPaths.resolve()
print(paths.describe())
DataPaths (source=sample (bundled synthetic), site=libz):
  raw_dir       = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/synthetic
  processed_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/Data/Integrated_QC_Data
  exports_dir   = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/exports
  config_dir    = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/config
  extras        = <none>

Step 1: Load the synthetic sample

palmwtc ships a one-week synthetic dataset (30-second cadence, 2 chambers) so every step in this notebook runs immediately after pip install palmwtc — no config file needed. The data was generated with a realistic diurnal CO₂ curve plus injected noise, spikes, and a small drift segment to exercise the QC rules. It is not real plantation data.

The main file is a single Parquet file with one row per 30-second interval. Columns follow the pattern CO2_C1, H2O_C1 (chamber 1) and CO2_C2, H2O_C2 (chamber 2), plus vapour pressure, temperature, relative humidity, atmospheric pressure, and battery voltage for each chamber. Four *_qc_flag columns (one per CO₂ and H₂O series) are already present in the file.

df = pd.read_parquet(paths.raw_dir / "QC_Flagged_Data_synthetic.parquet")

print(f"Rows: {df.shape[0]:,}  (7 days × 2,880 rows/day at a 30 s cadence)")
print(f"Columns: {df.shape[1]}")
print()
df.head()
Rows: 20,160  (7 days × 2,880 rows/day at a 30 s cadence)
Columns: 19
TIMESTAMP CO2_C1 H2O_C1 VaporPressure_1_C1 Temp_1_C1 RH_1_C1 AtmosphericPressure_1_C1 Batt_volt_Min_C1 CO2_C2 H2O_C2 VaporPressure_1_C2 Temp_1_C2 RH_1_C2 AtmosphericPressure_1_C2 Batt_volt_Min_C2 CO2_C1_qc_flag CO2_C2_qc_flag H2O_C1_qc_flag H2O_C2_qc_flag
0 2026-03-01 00:00:00 401.461633 17.982137 18.173353 28.976117 63.236286 1010.633622 12.558523 401.452462 18.051855 18.211326 29.000160 59.757655 1008.834020 12.627172 0 0 0 0
1 2026-03-01 00:00:30 401.923752 18.280408 18.471429 28.974637 61.745905 1010.449459 12.587892 402.898333 17.919181 18.125826 28.955349 65.300657 1011.532051 12.568139 0 0 0 0
2 2026-03-01 00:01:00 403.618192 18.975835 19.174394 29.145903 63.704736 1010.463786 12.613104 402.257632 18.595620 18.801026 28.523473 61.698780 1011.045901 12.464016 0 0 0 0
3 2026-03-01 00:01:30 404.694238 18.849399 19.037098 28.280104 60.321632 1009.957820 12.591674 404.495244 18.533279 18.744764 29.334173 61.476407 1011.411142 12.683626 0 0 0 0
4 2026-03-01 00:02:00 404.516029 19.348686 19.587934 28.998899 61.840512 1012.365055 12.569800 405.152910 19.280729 19.514835 29.212498 61.600773 1012.141949 12.654404 0 0 0 0
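
The row count follows directly from the sampling cadence. As a quick sanity check, the sketch below rebuilds an equivalent timestamp index with pandas instead of reading the bundled file. The start date is taken from the first row of the output above; the variable names are illustrative and not part of palmwtc.

```python
import pandas as pd

CADENCE = pd.Timedelta(seconds=30)

# 86,400 s per day divided by the 30 s cadence.
rows_per_day = int(pd.Timedelta(days=1) / CADENCE)
expected_rows = 7 * rows_per_day

# Rebuild an index with the same start and cadence as the sample shown above.
index = pd.date_range("2026-03-01 00:00:00", periods=expected_rows, freq=CADENCE)

print(f"rows per day : {rows_per_day:,}")
print(f"rows per week: {expected_rows:,}")
print(f"first / last : {index[0]} / {index[-1]}")
```

Assuming TIMESTAMP is a regular column, as the header row above suggests, the same idea applied to the loaded frame would be `df["TIMESTAMP"].diff().value_counts()`, which should show a single 30-second step if there are no gaps.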

Step 2: Quality control

Before any flux calculation, every raw CO₂ reading needs to be checked for problems. QCProcessor runs several rule-based tests in sequence:

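  • Physical bounds — any reading below 300 ppm or above 600 ppm falls outside the hard limits configured for this chamber setup and is flagged bad (2). These limits are much narrower than the LI-850's own measurement range; they bound what is plausible inside the chambers.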

  • IQR outliers — readings more than 1.5 × IQR away from the local median are flagged suspect (1). This catches momentary spikes that pass the hard bounds but are still implausible.

  • Rate-of-change — if the CO₂ concentration jumps by more than 50 ppm in one 30-second step, that step is flagged.

  • Persistence (stuck sensor) — if the value stays identical for 5 or more consecutive steps, the sensor is probably frozen; those rows are flagged bad.

The output flag values are 0 (good), 1 (suspect), and 2 (bad). Only flag-0 rows are carried forward into the flux calculation.

from palmwtc.qc import QCProcessor

co2_config = {
    "co2": {
        "columns": ["CO2_C1"],
        "hard": [300, 600],          # ppm — absolute physical limits
        "soft": [350, 550],          # ppm — expected operating range
        "rate_of_change": {"limit": 50},   # max ppm per 30-s step
        "persistence": {"window": 5},      # flag if stuck for 5+ steps
    }
}

qc = QCProcessor(df=df, config_dict=co2_config)
result = qc.process_variable("CO2_C1", random_seed=42)
flagged_df = qc.get_processed_dataframe()

summary = result["summary"]
print(f"Total points : {summary['total_points']:,}")
print(f"Good  (0)    : {summary['flag_0_count']:,}  ({summary['flag_0_percent']:.1f} %)")
print(f"Suspect (1)  : {summary['flag_1_count']:,}  ({summary['flag_1_percent']:.2f} %)")
print(f"Bad (2)      : {summary['flag_2_count']:,}  ({summary['flag_2_percent']:.2f} %)")
Total points : 20,160
Good  (0)    : 20,155  (100.0 %)
Suspect (1)  : 1  (0.00 %)
Bad (2)      : 4  (0.02 %)
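
To make the four rules concrete, here is a minimal pandas sketch of the same ideas on a 20-point toy series. It is not the QCProcessor implementation: the function name toy_qc_flags is hypothetical, the IQR test uses the global median and IQR rather than a local window, and treating a rate-of-change violation as suspect (1) is an assumption, since the text above does not state which flag that rule assigns.

```python
import pandas as pd


def toy_qc_flags(co2, hard=(300.0, 600.0), roc_limit=50.0, stuck_steps=5, iqr_k=1.5):
    """Return 0/1/2 flags for a CO2 series (illustrative only)."""
    flags = pd.Series(0, index=co2.index, dtype="int64")

    # Suspect (1): far from the median relative to the IQR.
    # Simplification: global median and IQR instead of a local window.
    iqr = co2.quantile(0.75) - co2.quantile(0.25)
    flags[(co2 - co2.median()).abs() > iqr_k * iqr] = 1

    # Suspect (1, assumed): jump larger than roc_limit between samples.
    flags[co2.diff().abs() > roc_limit] = 1

    # Bad (2): value identical for stuck_steps or more consecutive samples.
    run_id = (co2 != co2.shift()).cumsum()
    run_length = co2.groupby(run_id).transform("size")
    flags[run_length >= stuck_steps] = 2

    # Bad (2): outside the hard bounds. Applied last so it overrides suspect.
    flags[(co2 < hard[0]) | (co2 > hard[1])] = 2
    return flags


# Toy series: one hard-bound spike (650), one stuck run (5 x 407),
# and one moderate outlier (430).
co2 = pd.Series(
    [400, 401, 402, 403, 650, 404, 405, 406, 407, 407,
     407, 407, 407, 408, 409, 410, 430, 412, 413, 414],
    dtype=float,
)
flags = toy_qc_flags(co2)
print(pd.DataFrame({"co2": co2, "flag": flags}))
```

In this toy run the 650 ppm spike and the five-sample stuck run are flagged bad, while the 430 ppm outlier and the sample right after the spike (a large step back down) are flagged suspect. Applying the bad rules last means a point that trips both a suspect and a bad rule ends up as bad.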

Step 3: Flux calculation

Inside the chamber, CO₂ concentration rises (or falls) during each closed measurement cycle — typically over about 5 minutes. The slope of that rise (in ppm per second) is converted to an absolute gas exchange rate (in µmol m⁻² s⁻¹) using the ideal gas law:

flux = slope × (P / RT) × V / A

where P is atmospheric pressure, R is the gas constant, T is air temperature inside the chamber, V is the chamber volume, and A is the enclosed ground area. A negative flux means CO₂ is going into the tree (photosynthesis); a positive flux means CO₂ is leaving the tree (respiration).
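
As a worked example of the formula, the sketch below converts a slope in ppm s⁻¹ (equivalently µmol mol⁻¹ s⁻¹) into µmol m⁻² s⁻¹. The function name slope_to_flux and the chamber volume and ground area are hypothetical values chosen for illustration; they are not the dimensions of the real chambers or the palmwtc defaults.

```python
R_GAS = 8.314462618  # J mol^-1 K^-1, universal gas constant


def slope_to_flux(slope_ppm_s, pressure_pa, temp_c, volume_m3, area_m2):
    """Convert a CO2 slope (ppm s^-1) to a flux (umol m^-2 s^-1)."""
    # P / (R T): moles of air per cubic metre inside the chamber.
    molar_density = pressure_pa / (R_GAS * (temp_c + 273.15))
    # ppm is umol CO2 per mol air, so slope * mol m^-3 * m^3 / m^2
    # yields umol m^-2 s^-1.
    return slope_ppm_s * molar_density * volume_m3 / area_m2


# Hypothetical inputs: 101 kPa, 29 degC, 60 m^3 chamber over 10 m^2 of ground.
flux = slope_to_flux(
    slope_ppm_s=-0.01,
    pressure_pa=101_000.0,
    temp_c=29.0,
    volume_m3=60.0,
    area_m2=10.0,
)
print(f"flux = {flux:.2f} umol m^-2 s^-1")
```

With these inputs the molar density of air is about 40 mol m⁻³, so a slope of −0.01 ppm s⁻¹ corresponds to an uptake of roughly 2.4 µmol m⁻² s⁻¹. The sign of the flux always follows the sign of the slope.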

prepare_chamber_data selects and cleans the chamber-1 columns from the flagged DataFrame. calculate_flux_cycles then finds every closed-chamber period, fits a linear regression to the CO₂ ramp, and returns one row per cycle with the flux, fit quality metrics (R², NRMSE, SNR), and a per-cycle QC flag.

The output column is called flux_date in the raw cycles DataFrame. We rename it to flux_datetime here because WindowSelector and run_science_validation both expect that name.

from palmwtc.flux import prepare_chamber_data, calculate_flux_cycles

# Select chamber-1 columns, remove flagged rows, apply WPL correction.
chamber_df = prepare_chamber_data(flagged_df, "C1", require_h2o_for_wpl=False)

# Fit one linear regression per closed-chamber cycle.
cycles = calculate_flux_cycles(chamber_df, "Chamber 1", use_multiprocessing=False)

# Rename flux_date → flux_datetime so WindowSelector + validation both work.
cycles = cycles.rename(columns={"flux_date": "flux_datetime"})

print(f"{len(cycles)} cycles extracted")
cycles[["cycle_id", "flux_datetime", "flux_absolute", "flux_slope", "r2", "qc_flag"]].head()
2 cycles extracted
cycle_id flux_datetime flux_absolute flux_slope r2 qc_flag
0 1 2026-03-01 00:00:00 -2.048241 -0.008361 0.510316 0
1 2 2026-03-03 12:45:00 0.044259 0.000181 0.167873 1
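
The per-cycle fit can be illustrated on its own. The sketch below fits a straight line to a synthetic five-minute CO₂ ramp with numpy.polyfit and computes R² by hand. It is a simplified stand-in for what calculate_flux_cycles does per cycle, not its actual implementation; the function name fit_cycle and the toy ramp are assumptions.

```python
import numpy as np


def fit_cycle(seconds, co2_ppm):
    """Least-squares line through one closed-chamber ramp."""
    slope, intercept = np.polyfit(seconds, co2_ppm, 1)
    fitted = slope * seconds + intercept
    ss_res = np.sum((co2_ppm - fitted) ** 2)          # unexplained variance
    ss_tot = np.sum((co2_ppm - co2_ppm.mean()) ** 2)  # total variance
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2


# Toy cycle: 10 samples at 30 s spacing, CO2 falling by 0.05 ppm/s plus noise.
rng = np.random.default_rng(0)
t = np.arange(0.0, 300.0, 30.0)
co2 = 410.0 - 0.05 * t + rng.normal(0.0, 0.2, t.size)

slope, intercept, r2 = fit_cycle(t, co2)
print(f"slope = {slope:.4f} ppm/s, R2 = {r2:.3f}")
```

A clean ramp like this one gives an R² close to 1. The two cycles in the output above have R² values of 0.51 and 0.17, which indicates that their ramps are noisy relative to the fitted slope.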

Step 4: Calibration window selection

A calibration window is a consecutive span of high-quality days whose cycle data is suitable for training the XPalm digital-twin model. Not every day qualifies — a window requires enough cycles with high statistical confidence, no sensor drift, and reasonable agreement between the two chambers.

WindowSelector scores every cycle across five components:

| Component | What it measures |
|---|---|
| score_regression | R², p-value, and NRMSE of the linear fit |
| score_robustness | Consistency across sub-windows of the cycle |
| score_sensor_qc | Fraction of good-flagged raw points inside the cycle |
| score_drift | Absence of systematic slope drift across the day |
| score_cross_chamber | Agreement between chamber 1 and chamber 2 |

The individual scores combine into a single cycle_confidence (0–1). A window qualifies only when enough consecutive high-confidence cycles are available (minimum 5 days by default). The one-week synthetic sample is too short for this: it yields only two cycles, and neither reaches a confidence of 0.65, so no window qualifies. That is the expected scientific response, not an error.

from palmwtc.windows import WindowSelector

ws = WindowSelector(cycles)
ws.score_cycles()
ws.identify_windows()
ws.summary()

print()
print(f"Cycles with confidence ≥ 0.65: {(ws.cycles_df['cycle_confidence'] >= 0.65).sum()}")
print()
ws.cycles_df[["cycle_id", "flux_datetime", "cycle_confidence",
              "score_regression", "score_sensor_qc"]].head()
=== WindowSelector summary ===
  Total cycles loaded : 2
  Confidence mean     : 0.493
  Cycles ≥ 0.65       : 0 (0.0%)

Cycles with confidence ≥ 0.65: 0
cycle_id flux_datetime cycle_confidence score_regression score_sensor_qc
0 1 2026-03-01 00:00:00 0.4583 0.0 0.5
1 2 2026-03-03 12:45:00 0.5278 0.0 0.5
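
The "consecutive days" requirement can be sketched independently of WindowSelector. The example below assumes we already have one boolean per calendar day saying whether that day had enough high-confidence cycles, with no missing days in the index, and finds runs of at least five such days. The function find_windows and the toy series are hypothetical; how palmwtc derives the per-day decision is not shown here.

```python
import pandas as pd


def find_windows(daily_ok, min_days=5):
    """Return (first_day, last_day) for each run of >= min_days good days."""
    daily_ok = daily_ok.sort_index()
    # A new run starts wherever the good/bad state changes.
    state = daily_ok.astype(int)
    run_id = (state.diff() != 0).cumsum()

    windows = []
    for _, run in daily_ok.groupby(run_id):
        if bool(run.iloc[0]) and len(run) >= min_days:
            windows.append((run.index[0], run.index[-1]))
    return windows


# Toy record: 2 good days, 1 bad day, 6 good days, 1 bad day.
days = pd.date_range("2026-03-01", periods=10, freq="D")
daily_ok = pd.Series(
    [True, True, False, True, True, True, True, True, True, False],
    index=days,
)
windows = find_windows(daily_ok)
for start, end in windows:
    print(f"window: {start.date()} to {end.date()}")
```

Only the six-day run qualifies; the initial two-day run is too short. With the synthetic sample no day would be marked good, so the list of windows would be empty.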

Step 5: Science validation

run_science_validation checks whether the flux data is consistent with published oil-palm physiology. It runs four independent tests:

  1. Light-response curve (Amax) — the maximum photosynthetic rate at light saturation should fall between 15 and 35 µmol m⁻² s⁻¹ for oil palm at the whole-tree scale (Lamade & Bouillet, 2005). A fitted A_max outside this range suggests a systematic measurement error or an unusual physiological state.

  2. Temperature response (Q10) — nighttime respiration should roughly double for every 10 °C increase in temperature (Q10 ≈ 2). The acceptable range for tropical canopies is 1.5–3.5. A Q10 outside that range suggests sensor noise or a confounding effect (e.g., water stress, phenological transition).

  3. Water-use efficiency vs VPD — stomatal conductance and WUE should follow the Medlyn et al. (2011) unified stomatal optimality model. As VPD rises, stomata close and WUE increases — any data where WUE is flat or falls with VPD indicates a measurement problem.

  4. Inter-chamber agreement — the two chambers around different oil palms are not identical, but mean fluxes during the same weather conditions should agree within 30% (relative difference). A larger divergence points to a calibration offset or a blockage in one chamber.
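
Two of these checks are simple enough to sketch directly. The example below estimates Q10 from a log-linear fit of respiration against temperature and computes a relative difference between two chamber means. Both helpers are hypothetical: the exponential model R = a·exp(bT) with Q10 = exp(10b) is a common convention, and the symmetric definition of relative difference is an assumption, so neither should be read as the formulas run_science_validation uses internally.

```python
import numpy as np


def estimate_q10(temp_c, respiration):
    """Fit ln(R) = ln(a) + b*T and return Q10 = exp(10*b)."""
    b, _ln_a = np.polyfit(temp_c, np.log(respiration), 1)
    return float(np.exp(10.0 * b))


def relative_difference(mean_a, mean_b):
    """Absolute difference divided by the mean magnitude of the two values."""
    return abs(mean_a - mean_b) / ((abs(mean_a) + abs(mean_b)) / 2.0)


# Toy night-time respiration that exactly doubles per 10 degC.
temp_c = np.linspace(22.0, 30.0, 9)
respiration = 2.0 * 2.0 ** ((temp_c - 25.0) / 10.0)

q10 = estimate_q10(temp_c, respiration)
q10_ok = 1.5 <= q10 <= 3.5          # acceptance range quoted in test 2

# Toy chamber means (umol m^-2 s^-1) under the same conditions.
rel_diff = relative_difference(10.0, 12.0)
chambers_ok = rel_diff <= 0.30      # 30 % limit quoted in test 4

print(f"Q10 = {q10:.2f} (pass: {q10_ok})")
print(f"relative difference = {rel_diff:.1%} (pass: {chambers_ok})")
```

Because the toy respiration data is noise-free, the fit recovers a Q10 of 2. Real night-time data needs a wide enough temperature range for the slope to be well constrained, which is one reason short records return N/A.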

The validator requires Global_Radiation, h2o_slope, and vpd_kPa columns for tests 1–3. In a full pipeline run those come from the weather station and H₂O flux calculation. Here we add NaN placeholders so the validator runs and shows the scorecard structure — all tests correctly return N/A.

from palmwtc.validation import run_science_validation

# Columns required by the validator but not produced by calculate_flux_cycles
# alone.  In a real run these come from the weather station and H2O flux step.
cycles["Global_Radiation"] = float("nan")   # W m⁻²  — PAR / shortwave incoming
cycles["h2o_slope"] = float("nan")          # mmol mol⁻¹ s⁻¹; H2O concentration slope
cycles["co2_slope"] = cycles["flux_slope"]  # ppm s⁻¹; alias for the CO2 concentration slope
cycles["vpd_kPa"] = float("nan")            # kPa  — vapour pressure deficit

report = run_science_validation(cycles)
scorecard = report["scorecard"]

print(f"Tests passed         : {scorecard['n_pass']}")
print(f"Borderline           : {scorecard['n_borderline']}")
print(f"Failed               : {scorecard['n_fail']}")
print(f"Insufficient data (N/A): {scorecard['n_na']}")
print()
print("N/A is the correct result on the one-week synthetic sample.")
print("Run against ≥2 weeks of real data with radiation + H2O columns")
print("to obtain PASS/FAIL scores.")
Tests passed         : 0
Borderline           : 0
Failed               : 0
Insufficient data (N/A): 7

N/A is the correct result on the one-week synthetic sample.
Run against ≥2 weeks of real data with radiation + H2O columns
to obtain PASS/FAIL scores.

Step 6: Flux heatmap

The flux heatmap is the first visual sanity check. It shows mean CO₂ flux by hour of day (y-axis) and by month (x-axis). Blue cells indicate net CO₂ uptake (photosynthesis during the day); red/warm cells indicate net CO₂ release (respiration at night). A clear diurnal pattern — negative fluxes in the middle of the day, near-zero or slightly positive at night — is the primary visual confirmation that the chambers are capturing a real biological signal.

plot_flux_heatmap reads the flux_date column, so we add it back as an alias of flux_datetime before calling the function. The synthetic sample spans only one week, so the x-axis will show a single month column. Because only two cycles were extracted in Step 3, most hourly cells will be empty here; the diurnal pattern described above emerges once many more cycles are available.

from palmwtc.viz import set_style, plot_flux_heatmap

set_style()   # apply the palmwtc matplotlib theme

# plot_flux_heatmap reads "flux_date"; add it back as an alias of flux_datetime.
cycles["flux_date"] = cycles["flux_datetime"]

fig = plot_flux_heatmap(cycles)
fig.savefig("/tmp/flux_heatmap_tutorial.png", dpi=100, bbox_inches="tight")
print("Heatmap saved to /tmp/flux_heatmap_tutorial.png")
Heatmap saved to /tmp/flux_heatmap_tutorial.png
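
The hour-by-month grid behind such a heatmap can be built with a pandas pivot table. The sketch below does this on a toy week of hourly fluxes with an idealised diurnal shape. It illustrates the aggregation only and is not the code inside plot_flux_heatmap; the toy data and the helper column names hour and month are assumptions.

```python
import numpy as np
import pandas as pd

# Toy record: one flux value per hour for one week.
times = pd.date_range("2026-03-01", periods=7 * 24, freq=pd.Timedelta(hours=1))
hours = times.hour.to_numpy()

# Idealised diurnal cycle: uptake (negative) from 06:00 to 18:00,
# a small release (positive) at night.
daytime = (hours >= 6) & (hours < 18)
flux = np.where(daytime, -10.0 * np.sin(np.pi * (hours - 6) / 12.0), 1.0)

toy = pd.DataFrame({"flux_date": times, "flux_absolute": flux})
toy["hour"] = toy["flux_date"].dt.hour
toy["month"] = toy["flux_date"].dt.strftime("%Y-%m")

# Mean flux per (hour of day, month): rows are the y-axis, columns the x-axis.
grid = toy.pivot_table(
    index="hour", columns="month", values="flux_absolute", aggfunc="mean"
)
print(grid.round(2))
```

The resulting grid has 24 rows and a single month column, with the most negative value at midday and positive values at night. A diverging colour map centred on zero would then give the blue and red cells described above.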

Where to go next

This notebook gave a single pass through the pipeline. The thirteen per-stage tutorials go deeper into each step:

  • Data preparation

  • Quality control

  • Flux calculation and windows

  • Validation and auditing

The API reference documents every public function in palmwtc.*.

The bundled synthetic sample is sufficient to learn the pipeline. For full analyses, the real LIBZ data will be made available through a forthcoming Zenodo DOI.

References

Lamade, E., & Bouillet, J.-P. (2005). Carbon storage and global change: the role of oil palm. OCL — Oilseeds & fats, Crops and Lipids, 12(2), 154–160. https://doi.org/10.1051/ocl.2005.0154 (Source for the Amax range 15–35 µmol m⁻² s⁻¹ used in the light-response validation test.)

Medlyn, B. E., Duursma, R. A., Eamus, D., Ellsworth, D. S., Prentice, I. C., Barton, C. V. M., Crous, K. Y., De Angelis, P., Freeman, M., & Wingate, L. (2011). Reconciling the optimal and empirical approaches to modelling stomatal conductance. Global Change Biology, 17(6), 2134–2144. https://doi.org/10.1111/j.1365-2486.2010.02375.x (Source for the Medlyn g₁ formulation used in the WUE validation test.)