000 — End-to-end on synthetic data
This notebook walks through the complete palmwtc pipeline using the bundled synthetic sample — no field data needed. Each section shows one step of the pipeline and explains what it is doing scientifically, not just technically.
The system these tools were built for: two automated whole-tree flux chambers around individual oil palm trees, each measuring CO₂ and H₂O concentration every 30 seconds with a LI-COR LI-850 analyser. The pipeline converts those raw concentration readings into calibrated net CO₂ exchange values — and then checks whether those values are consistent with what the plant physiology literature says oil palms should be doing.
For the same pipeline applied to real chamber data, see the sibling notebook 001_End_to_End_LIBZ.ipynb — that one requires a LIBZ-style dataset on disk and is not bundled with the package.
After this notebook, the per-stage tutorials (010–035) each cover one step in much more depth.
import pandas as pd
import matplotlib
matplotlib.use("Agg") # headless rendering — safe in all environments
from palmwtc.config import DataPaths
paths = DataPaths.resolve()
print(paths.describe())
DataPaths (source=sample (bundled synthetic), site=libz):
raw_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/synthetic
processed_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/Data/Integrated_QC_Data
exports_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/exports
config_dir = /home/runner/work/palmwtc/palmwtc/src/palmwtc/data/sample/config
extras = <none>
Step 1: Load the synthetic sample
palmwtc ships a one-week synthetic dataset (30-second cadence, 2 chambers)
so every step in this notebook runs immediately after pip install palmwtc
— no config file needed. The data was generated with a realistic diurnal CO₂
curve plus injected noise, spikes, and a small drift segment to exercise the
QC rules. It is not real plantation data.
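To make that description concrete, here is a minimal sketch of how a series of this kind could be generated. It is not the generator used to build the bundled sample: the function name `make_synthetic_co2`, the baseline, the amplitude, the noise level, and the spike size are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def make_synthetic_co2(days=7, step_s=30, seed=0):
    """Hypothetical generator: diurnal CO2 curve + noise + one spike."""
    rng = np.random.default_rng(seed)
    n = days * 24 * 3600 // step_s
    index = pd.date_range("2026-03-01", periods=n, freq=pd.Timedelta(seconds=step_s))
    hours = (index.hour + index.minute / 60 + index.second / 3600).to_numpy()
    # Assumed shape: CO2 highest around midnight, lowest around midday.
    diurnal = 420.0 + 20.0 * np.cos(2 * np.pi * hours / 24)
    co2 = diurnal + rng.normal(0.0, 1.0, size=n)
    co2[n // 2] += 300.0  # one injected spike for the QC rules to catch
    return pd.DataFrame({"TIMESTAMP": index, "CO2_C1": co2})


sample = make_synthetic_co2()
print(len(sample))  # 7 days × 2,880 rows/day = 20,160 rows
```

The fixed seed keeps the series reproducible, which matters when the same data is used to test QC thresholds.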
The main file is a single parquet with one row per 30-second interval.
Columns follow the pattern CO2_C1, H2O_C1 (chamber 1) and CO2_C2,
H2O_C2 (chamber 2), plus temperature, humidity, atmospheric pressure, and
battery voltage for each chamber.
df = pd.read_parquet(paths.raw_dir / "QC_Flagged_Data_synthetic.parquet")
print(f"Rows: {df.shape[0]:,} (7 days × 2,880 rows/day at 30-s cadence)")
print(f"Columns: {df.shape[1]}")
print()
df.head()
Rows: 20,160 (7 days × 2,880 rows/day at 30-s cadence)
Columns: 19
| | TIMESTAMP | CO2_C1 | H2O_C1 | VaporPressure_1_C1 | Temp_1_C1 | RH_1_C1 | AtmosphericPressure_1_C1 | Batt_volt_Min_C1 | CO2_C2 | H2O_C2 | VaporPressure_1_C2 | Temp_1_C2 | RH_1_C2 | AtmosphericPressure_1_C2 | Batt_volt_Min_C2 | CO2_C1_qc_flag | CO2_C2_qc_flag | H2O_C1_qc_flag | H2O_C2_qc_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2026-03-01 00:00:00 | 401.461633 | 17.982137 | 18.173353 | 28.976117 | 63.236286 | 1010.633622 | 12.558523 | 401.452462 | 18.051855 | 18.211326 | 29.000160 | 59.757655 | 1008.834020 | 12.627172 | 0 | 0 | 0 | 0 |
| 1 | 2026-03-01 00:00:30 | 401.923752 | 18.280408 | 18.471429 | 28.974637 | 61.745905 | 1010.449459 | 12.587892 | 402.898333 | 17.919181 | 18.125826 | 28.955349 | 65.300657 | 1011.532051 | 12.568139 | 0 | 0 | 0 | 0 |
| 2 | 2026-03-01 00:01:00 | 403.618192 | 18.975835 | 19.174394 | 29.145903 | 63.704736 | 1010.463786 | 12.613104 | 402.257632 | 18.595620 | 18.801026 | 28.523473 | 61.698780 | 1011.045901 | 12.464016 | 0 | 0 | 0 | 0 |
| 3 | 2026-03-01 00:01:30 | 404.694238 | 18.849399 | 19.037098 | 28.280104 | 60.321632 | 1009.957820 | 12.591674 | 404.495244 | 18.533279 | 18.744764 | 29.334173 | 61.476407 | 1011.411142 | 12.683626 | 0 | 0 | 0 | 0 |
| 4 | 2026-03-01 00:02:00 | 404.516029 | 19.348686 | 19.587934 | 28.998899 | 61.840512 | 1012.365055 | 12.569800 | 405.152910 | 19.280729 | 19.514835 | 29.212498 | 61.600773 | 1012.141949 | 12.654404 | 0 | 0 | 0 | 0 |
Step 2: Quality control
Before any flux calculation, every raw CO₂ reading needs to be checked for
problems. QCProcessor runs several rule-based tests in sequence:
Physical bounds — any reading below 300 ppm or above 600 ppm falls outside the hard limits configured for chamber CO₂ (the range treated as physically plausible here, not the analyser's full measurement range) and is flagged bad (2).
IQR outliers — readings more than 1.5 × IQR away from the local median are flagged suspect (1). This catches momentary spikes that pass the hard bounds but are still implausible.
Rate-of-change — if the CO₂ concentration jumps by more than 50 ppm in one 30-second step, that step is flagged.
Persistence (stuck sensor) — if the value stays identical for 5 or more consecutive steps, the sensor is probably frozen; those rows are flagged bad.
The output flag values are 0 (good), 1 (suspect), and 2 (bad). Only flag-0 rows are carried forward into the flux calculation.
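The rules above can be sketched in a few lines of pandas. This is not `QCProcessor`'s implementation: `flag_co2` is a hypothetical helper, the IQR test is left out for brevity, and treating a rate-of-change violation as suspect (1) is an assumption, since the text does not say which flag that rule assigns.

```python
import pandas as pd


def flag_co2(series, hard=(300, 600), roc_limit=50, stuck_steps=5):
    """Hypothetical rule-based flagger: 0 = good, 1 = suspect, 2 = bad."""
    flags = pd.Series(0, index=series.index, dtype="int64")

    # Rate of change: jump larger than roc_limit between consecutive steps.
    # Flagging these as suspect (1) is an assumption of this sketch.
    flags[series.diff().abs() > roc_limit] = 1

    # Persistence: identical value for stuck_steps or more consecutive rows.
    run_id = (series != series.shift()).cumsum()
    run_length = series.groupby(run_id).transform("size")
    flags[run_length >= stuck_steps] = 2

    # Physical bounds: outside the hard limits is always bad.
    flags[(series < hard[0]) | (series > hard[1])] = 2
    return flags


co2 = pd.Series([400.0, 401.0, 700.0, 402.0, 410.0, 410.0, 410.0, 410.0, 410.0, 411.0])
print(flag_co2(co2).tolist())  # [0, 0, 2, 1, 2, 2, 2, 2, 2, 0]
```

Applying the rules from least to most severe means a row that trips several tests keeps the worst flag.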
from palmwtc.qc import QCProcessor
co2_config = {
"co2": {
"columns": ["CO2_C1"],
"hard": [300, 600], # ppm — absolute physical limits
"soft": [350, 550], # ppm — expected operating range
"rate_of_change": {"limit": 50}, # max ppm per 30-s step
"persistence": {"window": 5}, # flag if stuck for 5+ steps
}
}
qc = QCProcessor(df=df, config_dict=co2_config)
result = qc.process_variable("CO2_C1", random_seed=42)
flagged_df = qc.get_processed_dataframe()
summary = result["summary"]
print(f"Total points : {summary['total_points']:,}")
print(f"Good (0) : {summary['flag_0_count']:,} ({summary['flag_0_percent']:.1f} %)")
print(f"Suspect (1) : {summary['flag_1_count']:,} ({summary['flag_1_percent']:.2f} %)")
print(f"Bad (2) : {summary['flag_2_count']:,} ({summary['flag_2_percent']:.2f} %)")
Total points : 20,160
Good (0) : 20,155 (100.0 %)
Suspect (1) : 1 (0.00 %)
Bad (2) : 4 (0.02 %)
Step 3: Flux calculation
Inside the chamber, CO₂ concentration rises (or falls) during each closed measurement cycle — typically over about 5 minutes. The slope of that rise (in ppm per second) is converted to an absolute gas exchange rate (in µmol m⁻² s⁻¹) using the ideal gas law:
flux = slope × (P / RT) × V / A
where P is atmospheric pressure, R is the gas constant, T is air temperature inside the chamber, V is the chamber volume, and A is the enclosed ground area. A negative flux means CO₂ is going into the tree (photosynthesis); a positive flux means CO₂ is leaving the tree (respiration).
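As a worked example of that formula, the sketch below converts a slope into a flux. The function `slope_to_flux` and the chamber dimensions are illustrative assumptions, not palmwtc's values or API. Pressure is expected in pascals; the sample's AtmosphericPressure columns sit near 1010, which looks like hPa and would then need multiplying by 100, though that unit is itself an assumption.

```python
R_GAS = 8.314462618  # J mol⁻¹ K⁻¹, universal gas constant


def slope_to_flux(slope_ppm_s, pressure_pa, temp_c, volume_m3, area_m2):
    """Convert a CO2 slope (ppm s⁻¹) to a flux (µmol m⁻² s⁻¹)."""
    molar_density = pressure_pa / (R_GAS * (temp_c + 273.15))  # mol air per m³
    # ppm is µmol CO2 per mol air, so slope × mol m⁻³ × m gives µmol m⁻² s⁻¹.
    return slope_ppm_s * molar_density * volume_m3 / area_m2


# Illustrative inputs only: 24 m³ chamber over 4 m² of ground, 101 kPa, 29 °C.
flux = slope_to_flux(-0.008, 101_000.0, 29.0, 24.0, 4.0)
print(round(flux, 2))  # -1.93
```

The negative result follows the sign convention above: a falling concentration means net uptake by the tree.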
prepare_chamber_data selects and cleans the chamber-1 columns from the
flagged DataFrame. calculate_flux_cycles then finds every closed-chamber
period, fits a linear regression to the CO₂ ramp, and returns one row per
cycle with the flux, fit quality metrics (R², NRMSE, SNR), and a per-cycle
QC flag.
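The per-cycle fit can be illustrated with an ordinary least-squares line. This is a minimal sketch assuming a plain linear regression over one closed-chamber period; `fit_cycle` is a hypothetical helper and does not reproduce the NRMSE, SNR, or QC logic of `calculate_flux_cycles`.

```python
import numpy as np


def fit_cycle(seconds, co2_ppm):
    """Least-squares line through one cycle; returns slope (ppm s⁻¹) and R²."""
    seconds = np.asarray(seconds, dtype=float)
    co2_ppm = np.asarray(co2_ppm, dtype=float)
    slope, intercept = np.polyfit(seconds, co2_ppm, deg=1)
    residuals = co2_ppm - (slope * seconds + intercept)
    ss_res = float(np.sum(residuals**2))
    ss_tot = float(np.sum((co2_ppm - co2_ppm.mean()) ** 2))
    return slope, 1.0 - ss_res / ss_tot


# Ten readings at 30-s spacing with a perfectly linear drawdown of 0.01 ppm s⁻¹.
t = np.arange(10) * 30.0
slope, r2 = fit_cycle(t, 410.0 - 0.01 * t)
print(round(slope, 4), round(r2, 4))
```

On noise-free input the fit recovers the slope exactly and R² is 1; real cycles such as the two shown below score much lower.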
The output column is called flux_date in the raw cycles DataFrame.
We rename it to flux_datetime here because WindowSelector and
run_science_validation both expect that name.
from palmwtc.flux import prepare_chamber_data, calculate_flux_cycles
# Select chamber-1 columns, remove flagged rows, apply the WPL (Webb-Pearman-Leuning) dilution correction.
chamber_df = prepare_chamber_data(flagged_df, "C1", require_h2o_for_wpl=False)
# Fit one linear regression per closed-chamber cycle.
cycles = calculate_flux_cycles(chamber_df, "Chamber 1", use_multiprocessing=False)
# Rename flux_date → flux_datetime so WindowSelector + validation both work.
cycles = cycles.rename(columns={"flux_date": "flux_datetime"})
print(f"{len(cycles)} cycles extracted")
cycles[["cycle_id", "flux_datetime", "flux_absolute", "flux_slope", "r2", "qc_flag"]].head()
2 cycles extracted
| | cycle_id | flux_datetime | flux_absolute | flux_slope | r2 | qc_flag |
|---|---|---|---|---|---|---|
| 0 | 1 | 2026-03-01 00:00:00 | -2.048241 | -0.008361 | 0.510316 | 0 |
| 1 | 2 | 2026-03-03 12:45:00 | 0.044259 | 0.000181 | 0.167873 | 1 |
Step 4: Calibration window selection
A calibration window is a consecutive span of high-quality days whose cycle data is suitable for training the XPalm digital-twin model. Not every day qualifies — a window requires enough cycles with high statistical confidence, no sensor drift, and reasonable agreement between the two chambers.
WindowSelector scores every cycle across five components:
| Component | What it measures |
|---|---|
| | R², p-value, and NRMSE of the linear fit |
| | Consistency across sub-windows of the cycle |
| | Fraction of good-flagged raw points inside the cycle |
| | Absence of systematic slope drift across the day |
| | Agreement between chamber 1 and chamber 2 |
The individual scores combine into a single cycle_confidence (0–1). A window
qualifies only when enough consecutive high-confidence cycles are available
(minimum 5 days by default). The synthetic sample yields only two cycles, and
neither reaches the 0.65 confidence threshold, so no window qualifies. That is
the expected scientific response, not an error.
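The two ideas in this step, combining component scores and requiring a run of consecutive good days, can be sketched as follows. This is not `WindowSelector`'s algorithm: the component names in `scores`, the equal weighting, and the helper `find_windows` are all assumptions made for illustration.

```python
from datetime import date, timedelta


def find_windows(good_days, min_days=5):
    """Return (start, end) pairs for runs of at least min_days consecutive dates."""
    windows, run = [], []
    for day in sorted(good_days):
        if run and day - run[-1] != timedelta(days=1):
            if len(run) >= min_days:
                windows.append((run[0], run[-1]))
            run = []
        run.append(day)
    if len(run) >= min_days:
        windows.append((run[0], run[-1]))
    return windows


# Hypothetical component scores for one cycle, combined with equal weights.
scores = {"regression": 0.9, "robustness": 0.8, "sensor_qc": 1.0, "drift": 0.7, "cross_chamber": 0.6}
confidence = sum(scores.values()) / len(scores)
print(round(confidence, 2))  # 0.8

# Days 1-6 qualify, day 7 does not, days 8-9 qualify again.
good = [date(2026, 3, d) for d in (1, 2, 3, 4, 5, 6, 8, 9)]
print(find_windows(good))
```

Only the six-day run from 1 to 6 March is returned; the two-day run after the gap is too short to count.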
from palmwtc.windows import WindowSelector
ws = WindowSelector(cycles)
ws.score_cycles()
ws.identify_windows()
ws.summary()
print()
print(f"Cycles with confidence ≥ 0.65: {(ws.cycles_df['cycle_confidence'] >= 0.65).sum()}")
print()
ws.cycles_df[["cycle_id", "flux_datetime", "cycle_confidence",
"score_regression", "score_sensor_qc"]].head()
=== WindowSelector summary ===
Total cycles loaded : 2
Confidence mean : 0.493
Cycles ≥ 0.65 : 0 (0.0%)
Cycles with confidence ≥ 0.65: 0
| | cycle_id | flux_datetime | cycle_confidence | score_regression | score_sensor_qc |
|---|---|---|---|---|---|
| 0 | 1 | 2026-03-01 00:00:00 | 0.4583 | 0.0 | 0.5 |
| 1 | 2 | 2026-03-03 12:45:00 | 0.5278 | 0.0 | 0.5 |
Step 5: Science validation
run_science_validation checks whether the flux data is consistent with
published oil-palm physiology. It runs four independent tests:
Light-response curve (Amax) — the maximum photosynthetic rate at light saturation should fall between 15 and 35 µmol m⁻² s⁻¹ for oil palm at the whole-tree scale (Lamade & Bouillet, 2005). A fitted Amax outside this range suggests a systematic measurement error or an unusual physiological state.
Temperature response (Q10) — nighttime respiration should roughly double for every 10 °C increase in temperature (Q10 ≈ 2). The acceptable range for tropical canopies is 1.5–3.5. A Q10 outside that range suggests sensor noise or a confounding effect (e.g., water stress, phenological transition).
Water-use efficiency vs VPD — stomatal conductance and WUE should follow the Medlyn et al. (2011) unified stomatal optimality model. As VPD rises, stomata close and WUE increases — any data where WUE is flat or falls with VPD indicates a measurement problem.
Inter-chamber agreement — the two chambers around different oil palms are not identical, but mean fluxes during the same weather conditions should agree within 30% (relative difference). A larger divergence points to a calibration offset or a blockage in one chamber.
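Two of these checks reduce to short formulas, sketched here with made-up numbers. The helpers `q10_from_pair` and `relative_difference` are hypothetical; in particular, dividing by the mean magnitude of the two fluxes is an assumed definition of "relative difference" and may differ from what `run_science_validation` uses.

```python
def q10_from_pair(resp_cold, resp_warm, temp_cold_c, temp_warm_c):
    """Q10 = (R_warm / R_cold) ** (10 / (T_warm - T_cold))."""
    return (resp_warm / resp_cold) ** (10.0 / (temp_warm_c - temp_cold_c))


def relative_difference(flux_a, flux_b):
    """Absolute difference relative to the mean magnitude of the two fluxes."""
    return abs(flux_a - flux_b) / ((abs(flux_a) + abs(flux_b)) / 2.0)


# Illustrative numbers only: respiration of 2.0 at 24 °C and 2.83 at 29 °C.
q10 = q10_from_pair(2.0, 2.83, 24.0, 29.0)
print(round(q10, 2), 1.5 <= q10 <= 3.5)  # 2.0 True

# Two chambers with mean midday fluxes of -10 and -12 µmol m⁻² s⁻¹.
print(round(relative_difference(-10.0, -12.0), 3))  # 0.182
```

Here respiration rises by about 41 % over 5 °C, which compounds to a doubling over 10 °C, and the chambers differ by about 18 %, inside the 30 % tolerance.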
The validator requires Global_Radiation, h2o_slope, and vpd_kPa columns
for tests 1–3. In a full pipeline run those come from the weather station and
H₂O flux calculation. Here we add NaN placeholders so the validator runs and
shows the scorecard structure — all tests correctly return N/A.
from palmwtc.validation import run_science_validation
# Columns required by the validator but not produced by calculate_flux_cycles
# alone. In a real run these come from the weather station and H2O flux step.
cycles["Global_Radiation"] = float("nan") # W m⁻², incoming shortwave radiation
cycles["h2o_slope"] = float("nan") # H2O concentration slope from the H2O flux step
cycles["co2_slope"] = cycles["flux_slope"] # ppm s⁻¹, alias for the fitted CO2 slope
cycles["vpd_kPa"] = float("nan") # kPa, vapour pressure deficit
report = run_science_validation(cycles)
scorecard = report["scorecard"]
print(f"Tests passed : {scorecard['n_pass']}")
print(f"Borderline : {scorecard['n_borderline']}")
print(f"Failed : {scorecard['n_fail']}")
print(f"Insufficient data (N/A): {scorecard['n_na']}")
print()
print("N/A is the correct result on the one-week synthetic sample.")
print("Run against ≥2 weeks of real data with radiation + H2O columns")
print("to obtain PASS/FAIL scores.")
Tests passed : 0
Borderline : 0
Failed : 0
Insufficient data (N/A): 7
N/A is the correct result on the one-week synthetic sample.
Run against ≥2 weeks of real data with radiation + H2O columns
to obtain PASS/FAIL scores.
Step 6: Flux heatmap
The flux heatmap is the first visual sanity check. It shows mean CO₂ flux by hour of day (y-axis) and by month (x-axis). Blue cells indicate net CO₂ uptake (photosynthesis during the day); red/warm cells indicate net CO₂ release (respiration at night). A clear diurnal pattern — negative fluxes in the middle of the day, near-zero or slightly positive at night — is the primary visual confirmation that the chambers are capturing a real biological signal.
plot_flux_heatmap reads the flux_date column, so we add it back as an
alias of flux_datetime before calling the function. The synthetic sample
spans only one week, so the x-axis will show a single month column. With only
two cycles extracted in Step 3, most hour cells stay empty; the diurnal pattern
becomes visible once many cycles per day are available.
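The aggregation behind a heatmap of this kind is a mean over an hour-by-month grid. The sketch below shows that step on four made-up cycles; it assumes a plain `pivot_table` and says nothing about how `plot_flux_heatmap` is implemented internally.

```python
import pandas as pd

# Four hypothetical cycles; the column names mirror the cycles DataFrame above.
demo = pd.DataFrame(
    {
        "flux_date": pd.to_datetime(
            ["2026-03-01 00:10", "2026-03-01 12:10", "2026-03-02 00:20", "2026-03-02 12:20"]
        ),
        "flux_absolute": [1.0, -8.0, 2.0, -10.0],
    }
)
demo["hour"] = demo["flux_date"].dt.hour
demo["month"] = demo["flux_date"].dt.month

# Rows are hours of day, columns are months, cells are mean flux.
grid = demo.pivot_table(index="hour", columns="month", values="flux_absolute", aggfunc="mean")
print(grid)
```

The midnight row averages to a small positive flux and the midday row to a negative one, which is the diurnal contrast the heatmap is meant to reveal.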
from palmwtc.viz import set_style, plot_flux_heatmap
set_style() # apply the palmwtc matplotlib theme
# plot_flux_heatmap reads "flux_date"; add it back as an alias of flux_datetime.
cycles["flux_date"] = cycles["flux_datetime"]
fig = plot_flux_heatmap(cycles)
fig.savefig("/tmp/flux_heatmap_tutorial.png", dpi=100, bbox_inches="tight")
print("Heatmap saved to /tmp/flux_heatmap_tutorial.png")
Heatmap saved to /tmp/flux_heatmap_tutorial.png
Where to go next
This notebook gave a single pass through the pipeline. The thirteen per-stage tutorials go deeper into each step:
Data preparation
010 — Core data integration: read raw LI-850 TOA5 files, fuse climate station data, and produce the unified parquet that feeds QC.
011 — Weather station vs chamber: compare temperature and humidity inside the chamber against the open-air weather station to catch microclimate artifacts.
Quality control
020 — Rule-based QC: full multi-variable QC with physical bounds, IQR outliers, breakpoint detection, and drift checks.
022 — ML-enhanced QC: add IsolationForest contextual outlier detection on top of the rule-based flags.
023 — Field alert report: generate an email-able HTML report for field operators when sensor problems appear.
025 — Cross-chamber bias: check whether the two chambers have a systematic offset that would bias calibration.
026 — CO₂/H₂O segmented bias: detect time-localised drift in the CO₂ or H₂O signal.
Flux calculation and windows
030 — Flux cycle calculation: cycle identification, slope fitting, WPL correction, and quality scoring in detail.
031 — Window selection (reference): methodology behind the calibration-window algorithm.
032 — Window selection (production): selecting windows on a real multi-week dataset.
Validation and auditing
033 — Science validation: ecophysiology validation scorecard with a full real-data example.
034 — QC and window audit: pre-calibration sanity check — how many cycles survive QC, how many windows qualify.
035 — QC threshold sensitivity: sweep QC thresholds to understand the operating-point trade-off.
The API reference documents every public function in
palmwtc.*.
The bundled synthetic sample is sufficient to learn the pipeline. For serious analyses, real LIBZ data is planned for release under a Zenodo DOI.
References
Lamade, E., & Bouillet, J.-P. (2005). Carbon storage and global change: the role of oil palm. OCL — Oilseeds & fats, Crops and Lipids, 12(2), 154–160. https://doi.org/10.1051/ocl.2005.0154 (Source for the Amax range 15–35 µmol m⁻² s⁻¹ used in the light-response validation test.)
Medlyn, B. E., Duursma, R. A., Eamus, D., Ellsworth, D. S., Prentice, I. C., Barton, C. V. M., Crous, K. Y., De Angelis, P., Freeman, M., & Wingate, L. (2011). Reconciling the optimal and empirical approaches to modelling stomatal conductance. Global Change Biology, 17(6), 2134–2144. https://doi.org/10.1111/j.1365-2486.2010.02375.x (Source for the Medlyn g₁ formulation used in the WUE validation test.)