# SmartCane Processing Pipeline - Complete Script Overview
## Pipeline Execution Order
## Complete Pipeline Mermaid Diagram
```mermaid
%% Complete Pipeline
graph TD
%% ===== INPUTS =====
API["🔑 Planet API<br/>Credentials"]
GeoJSON["🗺️ pivot.geojson<br/>(Field Boundaries)"]
HarvestIn["📊 harvest.xlsx<br/>(from Stage 23)"]
%% ===== STAGE 00: DOWNLOAD =====
Stage00["<b>Stage 00: Python</b><br/>00_download_8band_pu_optimized.py"]
Out00["📦 merged_tif/<br/>YYYY-MM-DD.tif<br/>(4-band or 8-band)<br/>(configurable)"]
%% ===== STAGE 10: OPTIONAL TILING =====
Stage10["<b>Stage 10: R</b><br/>10_create_per_field_tiffs.R<br/>(Per-field extraction)"]
Out10["📦 daily_tiles_split/per_field/<br/>YYYY-MM-DD/*.tif<br/>(one per field)"]
%% ===== STAGE 20: CI EXTRACTION =====
Stage20["<b>Stage 20: R</b><br/>20_ci_extraction.R"]
Out20a["📦 combined_CI_data.rds<br/>(wide: fields × dates)"]
Out20b["📦 daily RDS files<br/>(per-date stats)"]
%% ===== STAGE 21: RDS → CSV =====
Stage21["<b>Stage 21: R</b><br/>21_convert_ci_rds_to_csv.R"]
Out21["📦 ci_data_for_python.csv<br/>(long format + DOY)"]
%% ===== STAGE 22: BASELINE HARVEST =====
Stage22["<b>Stage 22: Python</b><br/>22_harvest_baseline_prediction.py<br/>(RUN ONCE)"]
Out22["📦 harvest_production_export.xlsx<br/>(baseline predictions)"]
%% ===== STAGE 23: HARVEST FORMAT =====
Stage23["<b>Stage 23: Python</b><br/>23_convert_harvest_format.py"]
Out23["📦 harvest.xlsx<br/>(standard format)<br/>→ Feeds back to Stage 80"]
%% ===== STAGE 30: GROWTH MODEL =====
Stage30["<b>Stage 30: R</b><br/>30_interpolate_growth_model.R"]
Out30["📦 All_pivots_Cumulative_CI...<br/>_quadrant_year_v2.rds<br/>(interpolated daily)"]
%% ===== STAGE 31: WEEKLY HARVEST =====
Stage31["<b>Stage 31: Python</b><br/>31_harvest_imminent_weekly.py<br/>(Weekly)"]
Out31["📦 harvest_imminent_weekly.csv<br/>(probabilities)"]
%% ===== STAGE 40: MOSAIC =====
Stage40["<b>Stage 40: R</b><br/>40_mosaic_creation.R"]
Out40["📦 weekly_mosaic/<br/>week_WW_YYYY.tif<br/>(5-band composite)"]
%% ===== STAGE 80: KPI =====
Stage80["<b>Stage 80: R</b><br/>80_calculate_kpis.R"]
Out80a["📦 field_analysis_week{WW}.xlsx"]
Out80b["📦 kpi_summary_tables_week{WW}.rds"]
%% ===== STAGE 90: REPORT =====
Stage90["<b>Stage 90: R/RMarkdown</b><br/>90_CI_report_with_kpis_simple.Rmd"]
Out90["📦 SmartCane_Report_week{WW}_{YYYY}.docx<br/>(FINAL OUTPUT)"]
%% ===== CONNECTIONS: INPUTS TO STAGE 00 =====
API --> Stage00
GeoJSON --> Stage00
%% ===== STAGE 00 → 10 OR 20 =====
Stage00 --> Out00
Out00 --> Stage10
Out00 --> Stage20
%% ===== STAGE 10 → 20 =====
Stage10 --> Out10
Out10 --> Stage20
%% ===== STAGE 20 → 21, 30, 40 =====
GeoJSON --> Stage20
Stage20 --> Out20a
Stage20 --> Out20b
Out20a --> Stage21
Out20a --> Stage30
Out00 --> Stage40
%% ===== STAGE 21 → 22, 31 =====
Stage21 --> Out21
Out21 --> Stage22
Out21 --> Stage31
%% ===== STAGE 22 → 23 =====
Stage22 --> Out22
Out22 --> Stage23
%% ===== STAGE 23 → 80 & FEEDBACK =====
Stage23 --> Out23
Out23 -.->|"Feeds back<br/>(Season context)"| Stage80
%% ===== STAGE 30 → 80 =====
Stage30 --> Out30
Out30 --> Stage80
%% ===== STAGE 31 (PARALLEL) =====
Stage31 --> Out31
%% ===== STAGE 40 → 80, 90 =====
Stage40 --> Out40
Out40 --> Stage80
Out40 --> Stage90
%% ===== STAGE 80 → 90 =====
Stage80 --> Out80a
Stage80 --> Out80b
Out80a --> Stage90
Out80b --> Stage90
%% ===== STAGE 90 FINAL =====
Stage90 --> Out90
%% ===== ADDITIONAL INPUTS =====
HarvestIn --> Stage30
HarvestIn --> Stage80
GeoJSON --> Stage30
GeoJSON --> Stage40
GeoJSON --> Stage80
%% ===== STYLING =====
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef pyStage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef rStage fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef output fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef finalOutput fill:#ffebee,stroke:#c62828,stroke-width:3px
class API,GeoJSON,HarvestIn input
class Stage00,Stage22,Stage23,Stage31 pyStage
class Stage10,Stage20,Stage21,Stage30,Stage40,Stage80,Stage90 rStage
class Out00,Out10,Out20a,Out20b,Out21,Out22,Out30,Out31,Out40,Out80a,Out80b output
class Out23,Out90 finalOutput
```
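The arrows in the diagram imply an execution order. The sketch below derives one valid order with Python's standard `graphlib`. The `DEPENDS_ON` mapping is a hand transcription of the diagram (outputs collapsed into the stages that produce them) and `execution_order` is a hypothetical helper. Neither is part of the pipeline code.

```python
from graphlib import TopologicalSorter

# Stage -> stages whose outputs it consumes, transcribed by hand from the diagram.
# Assumption: Stage 10 is optional; its edge only orders it before Stage 20 when it runs.
DEPENDS_ON = {
    "00": set(),
    "10": {"00"},
    "20": {"00", "10"},
    "21": {"20"},
    "22": {"21"},
    "23": {"22"},
    "30": {"20", "23"},        # harvest.xlsx (from Stage 23) is an optional input
    "31": {"21"},
    "40": {"00"},
    "80": {"23", "30", "40"},
    "90": {"40", "80"},
}


def execution_order(depends_on):
    """Return one valid run order in which every stage follows its inputs."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(" -> ".join(order))
```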
---
## Detailed Stage Descriptions
```
Stage 00: PYTHON - Download Satellite Data
  └─ 00_download_8band_pu_optimized.py
     INPUT:  Planet API credentials, field boundaries (pivot.geojson), date range
     OUTPUT: laravel_app/storage/app/{project}/merged_tif/{YYYY-MM-DD}.tif (4-band or 8-band)
     RUN FREQUENCY: Daily or as-needed
     NOTES:  Download script configures band count; consolidates to single merged_tif/ folder

Stage 10: R - Create Per-Field Daily Tiles
  └─ 10_create_per_field_tiffs.R
     INPUT:  Daily GeoTIFFs from merged_tif/
             Field boundaries (pivot.geojson)
     OUTPUT: laravel_app/storage/app/{project}/daily_tiles_split/per_field/{YYYY-MM-DD}/*.tif
     RUN FREQUENCY: Optional - per-field extraction for efficient memory use
     NOTES:  Creates one GeoTIFF per field per day

Stage 20: R - Extract Canopy Index (CI) from Daily Imagery
  └─ 20_ci_extraction_per_field.R
     INPUT:  Daily GeoTIFFs (merged_tif/ or daily_tiles_split/per_field/)
             Field boundaries (pivot.geojson)
     OUTPUT: RDS files:
             - laravel_app/storage/app/{project}/Data/extracted_ci/daily_vals/extracted_{YYYY-MM-DD}_{suffix}.rds
             - laravel_app/storage/app/{project}/Data/extracted_ci/cumulative_vals/combined_CI_data.rds (wide format)
     RUN FREQUENCY: Daily or on-demand
     COMMAND: Rscript 20_ci_extraction_per_field.R [end_date] [offset] [project_dir] [data_source]
     EXAMPLE: Rscript 20_ci_extraction_per_field.R 2026-01-02 7 angata merged_tif
     NOTES:  Auto-detects per-field tiles if daily_tiles_split/per_field/ exists; outputs cumulative CI (fields × dates)

Stage 21: R - Convert CI RDS to CSV for Python Harvest Detection
  └─ 21_convert_ci_rds_to_csv.R
     INPUT:  combined_CI_data.rds (from Stage 20)
     OUTPUT: laravel_app/storage/app/{project}/Data/extracted_ci/ci_data_for_python/ci_data_for_python.csv
     RUN FREQUENCY: After Stage 20
     COMMAND: Rscript 21_convert_ci_rds_to_csv.R [project_dir]
     EXAMPLE: Rscript 21_convert_ci_rds_to_csv.R angata
     NOTES:  Converts wide RDS (fields × dates) to long CSV; interpolates missing dates; adds DOY column

Stage 22: PYTHON - Baseline Harvest Prediction (LSTM Model 307)
  └─ 22_harvest_baseline_prediction.py
     INPUT:  ci_data_for_python.csv (complete historical CI data)
     OUTPUT: laravel_app/storage/app/{project}/Data/HarvestData/harvest_production_export.xlsx
     RUN FREQUENCY: ONCE - establishes ground truth baseline for all fields
     COMMAND: python 22_harvest_baseline_prediction.py [project_name]
     EXAMPLE: python 22_harvest_baseline_prediction.py angata
     NOTES:  Two-step detection (Phase 1: growing window, Phase 2: ±40 day argmax refinement)
             Tuned parameters: threshold=0.3, consecutive_days=2
             Uses LSTM Model 307 dual output heads (imminent + detected)

Stage 23: PYTHON - Convert Harvest Format to Standard Structure
  └─ 23_convert_harvest_format.py
     INPUT:  harvest_production_export.xlsx (from Stage 22)
             CI data date range (determines season_start for first season)
     OUTPUT: laravel_app/storage/app/{project}/Data/harvest.xlsx (standard format)
     RUN FREQUENCY: After Stage 22
     COMMAND: python 23_convert_harvest_format.py [project_name]
     EXAMPLE: python 23_convert_harvest_format.py angata
     NOTES:  Converts to standard harvest.xlsx format with columns:
             field, sub_field, year, season, season_start, season_end, age, sub_area, tonnage_ha
             Season format: "Data{year} : {field}"
             Only includes completed seasons (with season_end filled)

Stage 30: R - Growth Model Interpolation (Smooth CI Time Series)
  └─ 30_interpolate_growth_model.R
     INPUT:  combined_CI_data.rds (from Stage 20)
             harvest.xlsx (optional, for seasonal context)
     OUTPUT: laravel_app/storage/app/{project}/Data/extracted_ci/cumulative_vals/
             All_pivots_Cumulative_CI_quadrant_year_v2.rds
     RUN FREQUENCY: Weekly or after CI extraction updates
     COMMAND: Rscript 30_interpolate_growth_model.R [project_dir]
     EXAMPLE: Rscript 30_interpolate_growth_model.R angata
     NOTES:  Linear interpolation across gaps; calculates daily change and cumulative CI
             Outputs long-format data (Date, DOY, field, value, season, etc.)

Stage 31: PYTHON - Weekly Harvest Monitoring (Real-Time Alerts)
  └─ 31_harvest_imminent_weekly.py
     INPUT:  ci_data_for_python.csv (recent CI data, last ~300 days)
             harvest_production_export.xlsx (optional baseline reference)
     OUTPUT: laravel_app/storage/app/{project}/Data/HarvestData/harvest_imminent_weekly.csv
     RUN FREQUENCY: Weekly or daily for operational alerts
     COMMAND: python 31_harvest_imminent_weekly.py [project_name]
     EXAMPLE: python 31_harvest_imminent_weekly.py angata
     NOTES:  Single-run inference on recent data; outputs probabilities (imminent_prob, detected_prob)
             Used for real-time decision support; compared against baseline from Stage 22

Stage 40: R - Create Weekly 5-Band Mosaics
  └─ 40_mosaic_creation_per_field.R
     INPUT:  Daily GeoTIFFs (merged_tif/ or daily_tiles_split/per_field/)
             Field boundaries (pivot.geojson)
     OUTPUT: laravel_app/storage/app/{project}/weekly_mosaic/week_{WW}_{YYYY}.tif
     RUN FREQUENCY: Weekly
     COMMAND: Rscript 40_mosaic_creation_per_field.R [end_date] [offset] [project_dir]
     EXAMPLE: Rscript 40_mosaic_creation_per_field.R 2026-01-14 7 angata
     NOTES:  Composites daily images using MAX function; 5 bands (R, G, B, NIR, CI)
             Automatically selects images with acceptable cloud coverage
             Output uses ISO week numbering (week_WW_YYYY)

Stage 80: R - Calculate KPIs & Per-Field Analysis
  └─ 80_calculate_kpis.R
     INPUT:  Weekly mosaic (from Stage 40)
             Growth model data (from Stage 30)
             Field boundaries (pivot.geojson)
             Harvest data (harvest.xlsx)
     OUTPUT: laravel_app/storage/app/{project}/reports/
             - {project}_field_analysis_week{WW}.xlsx
             - {project}_kpi_summary_tables_week{WW}.rds
     RUN FREQUENCY: Weekly
     COMMAND: Rscript 80_calculate_kpis.R [end_date] [project_dir] [offset_days]
     EXAMPLE: Rscript 80_calculate_kpis.R 2026-01-14 angata 7
     NOTES:  Parallel processing for 1000+ fields; calculates:
             - Per-field uniformity (CV), phase assignment, growth trends
             - Status triggers (germination, rapid growth, disease, harvest imminence)
             - Farm-level KPI metrics (6 high-level indicators)
             TEST_MODE=TRUE uses only recent weeks for development

Stage 90: R (RMarkdown) - Generate Executive Report (Word Document)
  └─ 90_CI_report_with_kpis_simple.Rmd
     INPUT:  Weekly mosaic (from Stage 40)
             KPI summary data (from Stage 80)
             Field analysis (from Stage 80)
             Field boundaries & harvest data (for context)
     OUTPUT: laravel_app/storage/app/{project}/reports/
             SmartCane_Report_week{WW}_{YYYY}.docx (PRIMARY OUTPUT)
             SmartCane_Report_week{WW}_{YYYY}.html (optional)
     RUN FREQUENCY: Weekly
     RENDERING: R/RMarkdown with officer + flextable packages
     NOTES:  Executive summary with KPI overview, phase distribution, status triggers
             Field-by-field detail pages with CI metrics and interpretation guides
             Automatic unit conversion (hectares ↔ acres)
```
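Stage 21's wide-to-long conversion can be illustrated with a small standard-library sketch. It assumes the wide table is available as a nested dictionary and that DOY means calendar day of year. The column names `field`, `Date`, `DOY`, and `value` are taken from the Stage 30 notes and may not match the real CSV header. Gap interpolation is left out here, so missing values are simply skipped.

```python
from datetime import date


def wide_to_long(wide):
    """Flatten {field: {ISO date string: CI or None}} into long-format rows.

    Assumption: DOY is the calendar day of year (1-366).
    """
    rows = []
    for field, series in sorted(wide.items()):
        for day_str in sorted(series):
            value = series[day_str]
            if value is None:
                continue  # the real stage interpolates gaps; this sketch skips them
            day = date.fromisoformat(day_str)
            rows.append({
                "field": field,
                "Date": day_str,
                "DOY": day.timetuple().tm_yday,
                "value": value,
            })
    return rows


# Hypothetical sample data, not taken from a real project.
sample = {
    "F1": {"2026-01-02": 1.5, "2026-01-01": 1.2},
    "F2": {"2026-02-01": None, "2026-03-01": 2.0},
}
for row in wide_to_long(sample):
    print(row)
```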
---
## Data Storage & Persistence
All data persists to the file system. The analysis performs no database writes; the database is read only for metadata.
```
laravel_app/storage/app/{project}/
├── Data/
│   ├── pivot.geojson  # Field boundaries (read-only input)
│   ├── harvest.xlsx  # Season dates & yield (standard format from Stage 23)
│   ├── vrt/  # Virtual raster files (daily VRTs from Stage 20)
│   │   └── YYYY-MM-DD.vrt
│   ├── extracted_ci/
│   │   ├── ci_data_for_python/
│   │   │   └── ci_data_for_python.csv  # CSV for Python (from Stage 21)
│   │   ├── daily_vals/
│   │   │   └── extracted_YYYY-MM-DD_{suffix}.rds  # Daily field CI stats (from Stage 20)
│   │   └── cumulative_vals/
│   │       ├── combined_CI_data.rds  # Cumulative CI, wide format (from Stage 20)
│   │       └── All_pivots_Cumulative_CI_quadrant_year_v2.rds  # Interpolated daily (from Stage 30)
│   └── HarvestData/
│       ├── harvest_production_export.xlsx  # Baseline harvest predictions (from Stage 22)
│       └── harvest_imminent_weekly.csv  # Weekly monitoring output (from Stage 31)
├── merged_tif/  # Raw satellite imagery (Stage 00 output)
│   └── YYYY-MM-DD.tif  # 4-band or 8-band (configurable via download script)
├── daily_tiles_split/  # (Optional) Per-field tile processing (Stage 10 output)
│   └── per_field/
│       └── YYYY-MM-DD/  # Date-specific folder
│           └── {FIELD}_YYYY-MM-DD.tif  # One per-field GeoTIFF per day
├── weekly_mosaic/  # Weekly composite mosaics (Stage 40 output)
│   └── week_WW_YYYY.tif  # 5 bands: R, G, B, NIR, CI (composite)
└── reports/  # Analysis outputs & reports (Stage 80, 90 outputs)
    ├── SmartCane_Report_week{WW}_{YYYY}.docx  # FINAL REPORT (Stage 90)
    ├── SmartCane_Report_week{WW}_{YYYY}.html  # Alternative format
    ├── {project}_field_analysis_week{WW}.xlsx  # Field-by-field data (Stage 80)
    ├── {project}_kpi_summary_tables_week{WW}.rds  # Summary RDS (Stage 80)
    └── kpis/
        └── week_WW_YYYY/  # Week-specific KPI folder
```
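The naming patterns in the tree can be made concrete with a short sketch that builds the daily and weekly file paths for a date. The helpers `daily_tif_path` and `weekly_mosaic_path` are hypothetical. The sketch also assumes the week label pairs the ISO week with the ISO year of the given date; the source only states that ISO week numbering is used, not which date determines the label.

```python
from datetime import date
from pathlib import PurePosixPath

STORAGE_ROOT = PurePosixPath("laravel_app/storage/app")


def daily_tif_path(project, day):
    """Path of the merged daily GeoTIFF for one date (Stage 00 output)."""
    return STORAGE_ROOT / project / "merged_tif" / f"{day.isoformat()}.tif"


def weekly_mosaic_path(project, day):
    """Path of the weekly mosaic covering `day` (Stage 40 output).

    Assumption: the label uses the ISO week and ISO year of `day`.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return STORAGE_ROOT / project / "weekly_mosaic" / f"week_{iso_week:02d}_{iso_year}.tif"


print(daily_tif_path("angata", date(2026, 1, 8)))
print(weekly_mosaic_path("angata", date(2026, 1, 8)))
# ISO weeks can cross calendar years: this December date belongs to ISO year 2026.
print(weekly_mosaic_path("angata", date(2025, 12, 29)))
```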
---
## Key File Formats
| Format | Stage | Purpose | Example |
|--------|-------|---------|---------|
| `.tif` (GeoTIFF) | 00, 10, 40 | Geospatial raster imagery | `2026-01-14.tif` (4-band), `week_02_2026.tif` (5-band) |
| `.vrt` (Virtual Raster) | 20 | Virtual pointer to TIFFs | `2026-01-14.vrt` |
| `.rds` (R Binary) | 20, 21, 30, 80 | R serialized data objects | `combined_CI_data.rds`, `All_pivots_Cumulative_CI_quadrant_year_v2.rds` |
| `.csv` (Comma-Separated) | 21, 31 | Tabular data for Python | `ci_data_for_python.csv`, `harvest_imminent_weekly.csv` |
| `.xlsx` (Excel) | 22, 23, 80 | Tabular reports & harvest data | `harvest.xlsx`, `harvest_production_export.xlsx`, field analysis |
| `.docx` (Word) | 90 | Executive report (final output) | `SmartCane_Report_week02_2026.docx` |
| `.json` | 10 | Tiling metadata | `tiling_config.json` |
| `.geojson` | Input | Field boundaries (read-only) | `pivot.geojson` |
---
## Script Dependencies & Utility Files
```
parameters_project.R
  ├─ Loaded by: 20_ci_extraction.R, 30_interpolate_growth_model.R,
  │             40_mosaic_creation.R, 80_calculate_kpis.R, 90_CI_report_with_kpis_simple.Rmd
  └─ Purpose: Initializes project config (paths, field boundaries, harvest data)

harvest_date_pred_utils.py
  ├─ Used by: 22_harvest_baseline_prediction.py, 23_convert_harvest_format.py, 31_harvest_imminent_weekly.py
  └─ Purpose: LSTM model loading, feature extraction, two-step harvest detection

20_ci_extraction_utils.R
  ├─ Used by: 20_ci_extraction.R
  └─ Purpose: CI calculation, field masking, RDS I/O, tile detection

30_growth_model_utils.R
  ├─ Used by: 30_interpolate_growth_model.R
  └─ Purpose: Linear interpolation, daily metrics, seasonal grouping

40_mosaic_creation_utils.R, 40_mosaic_creation_tile_utils.R
  ├─ Used by: 40_mosaic_creation.R
  └─ Purpose: Weekly composite creation, cloud assessment, raster masking

kpi_utils.R
  ├─ Used by: 80_calculate_kpis.R
  └─ Purpose: Per-field statistics, phase assignment, trigger detection

report_utils.R
  ├─ Used by: 90_CI_report_with_kpis_simple.Rmd
  └─ Purpose: Report building, table formatting, Word document generation
```
---
## Command-Line Execution Examples
### Daily/Weekly Workflow
```bash
# Stage 00: Download today's satellite data
cd python_app
python 00_download_8band_pu_optimized.py angata --cleanup
# Stage 20: Extract CI from daily imagery (last 7 days)
cd ../r_app
Rscript 20_ci_extraction_per_field.R 2026-01-14 7 angata merged_tif
# Stage 21: Convert CI to CSV for harvest detection
Rscript 21_convert_ci_rds_to_csv.R angata
# Stage 31: Weekly harvest monitoring (real-time alerts)
cd ../python_app
python 31_harvest_imminent_weekly.py angata
# Back to R for mosaic and KPIs
cd ../r_app
Rscript 40_mosaic_creation_per_field.R 2026-01-14 7 angata
Rscript 80_calculate_kpis.R 2026-01-14 angata 7
# Stage 90: Generate report
Rscript -e "rmarkdown::render('90_CI_report_with_kpis_simple.Rmd')"
```
### One-Time Setup (Baseline Harvest Detection)
```bash
# Only run ONCE to establish baseline
cd python_app
python 22_harvest_baseline_prediction.py angata
# Convert to standard format
python 23_convert_harvest_format.py angata
```
---
## Processing Notes
### CI Extraction (Stage 20)
- Calculates CI = (NIR - Green) / (NIR + Green)
- Supports both 4-band and 8-band imagery with auto-detection
- Handles cloud masking via UDM band (8-band) or manual thresholding (4-band)
- Outputs cumulative RDS in wide format (fields × dates) for fast lookups
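
The CI formula above can be sketched per pixel in plain Python. The production code is in R; `canopy_index` and `field_mean_ci` are hypothetical names, and averaging valid pixels into one value per field is an assumption about how the per-field statistic is formed.

```python
def canopy_index(nir, green):
    """CI = (NIR - Green) / (NIR + Green); None when the denominator is zero."""
    total = nir + green
    if total == 0:
        return None  # masked or empty pixel: no meaningful ratio
    return (nir - green) / total


def field_mean_ci(nir_pixels, green_pixels):
    """Mean CI over the valid pixels of one field (assumed aggregation)."""
    values = [canopy_index(n, g) for n, g in zip(nir_pixels, green_pixels)]
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical reflectance values for three pixels; the middle one is masked out.
print(field_mean_ci([0.6, 0.0, 0.3], [0.2, 0.0, 0.1]))
```

Returning `None` for a zero denominator keeps masked pixels out of the field average rather than treating them as a CI of zero.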
### Growth Model (Stage 30)
- Linear interpolation across missing dates
- Maintains seasonal context for agricultural lifecycle tracking
- Outputs long-format data for trend analysis
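
A minimal sketch of the growth-model step, assuming one field's observations arrive as a date-to-CI mapping: it fills every missing day by linear interpolation, then derives daily change and a running cumulative CI. The function names and the `daily_change` and `cumulative_CI` keys are illustrative, not the actual column names of the Stage 30 output.

```python
from datetime import date, timedelta


def interpolate_daily(observations):
    """Linearly interpolate {date: CI} to one value per day between first and last date."""
    days = sorted(observations)
    if not days:
        return []
    series = []
    for left, right in zip(days, days[1:]):
        span = (right - left).days
        for step in range(span):
            frac = step / span
            value = observations[left] + frac * (observations[right] - observations[left])
            series.append((left + timedelta(days=step), value))
    series.append((days[-1], observations[days[-1]]))
    return series


def add_daily_metrics(series):
    """Attach day-over-day change and running cumulative CI to each daily value."""
    rows, cumulative, previous = [], 0.0, None
    for day, value in series:
        change = 0.0 if previous is None else value - previous
        cumulative += value
        rows.append({"Date": day, "value": value,
                     "daily_change": change, "cumulative_CI": cumulative})
        previous = value
    return rows


# Hypothetical observations four days apart.
obs = {date(2026, 1, 1): 1.0, date(2026, 1, 5): 3.0}
for row in add_daily_metrics(interpolate_daily(obs)):
    print(row)
```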
### Harvest Detection (Stages 22 & 31)
- **Model 307**: Unidirectional LSTM with dual output heads
  - Imminent Head: Probability that the field will be harvestable within the next 28 days
  - Detected Head: Probability of an immediate harvest event
- **Stage 22 (Baseline)**: Two-step detection on complete historical data
  - Phase 1: Growing window expansion (real-time simulation)
  - Phase 2: ±40 day refinement (argmax harvest signal)
- **Stage 31 (Weekly)**: Single-run inference on recent data (~300 days)
  - Compares against baseline for anomaly detection
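
The two-step idea can be sketched on a list of daily probabilities that has already been computed. This is not the code in `harvest_date_pred_utils.py`: it skips the LSTM and the growing-window inference entirely. It assumes a single probability series feeds both phases, and it assumes the trigger uses a strict `>` comparison. Only `threshold=0.3`, `consecutive_days=2`, and the ±40 day window come from the stage notes.

```python
def first_trigger(probs, threshold=0.3, consecutive_days=2):
    """Index where probs first stays above threshold for consecutive_days, else None."""
    run = 0
    for i, p in enumerate(probs):
        run = run + 1 if p > threshold else 0
        if run >= consecutive_days:
            return i - consecutive_days + 1
    return None


def refine_by_argmax(probs, center, window=40):
    """Index of the highest probability within +/- window days of center."""
    lo = max(0, center - window)
    hi = min(len(probs), center + window + 1)
    return max(range(lo, hi), key=lambda i: probs[i])


def detect_harvest(probs):
    """Phase 1 finds a rough trigger day; Phase 2 refines it to the local peak."""
    start = first_trigger(probs)
    if start is None:
        return None
    return refine_by_argmax(probs, start)


# Hypothetical daily probabilities for one field.
daily_probs = [0.1, 0.4, 0.1, 0.35, 0.5, 0.9, 0.2]
print(detect_harvest(daily_probs))
```

The isolated spike on day 1 does not trigger because it lasts only one day. The run starting on day 3 does, and the refinement then moves the estimate to the peak on day 5.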
### KPI Calculation (Stage 80)
- **Per-field metrics**: Uniformity (CV), phase, growth trends, 4-week trends
- **Status triggers**: Germination, rapid growth, slow growth, non-uniform, weed pressure, harvest imminence
- **Farm-level KPIs**: 6 high-level indicators for executive summary
- **Parallel processing**: 1000+ fields processed in under 5 minutes
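
The uniformity metric can be illustrated with a short sketch of the coefficient of variation (standard deviation divided by mean) over a field's CI pixel values. Using the population standard deviation is an assumption, as is the function name; the source does not say which estimator `kpi_utils.R` uses or which thresholds mark a field as non-uniform.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean; None when the mean is zero.

    Assumption: population standard deviation (pstdev), not the sample estimator.
    """
    avg = mean(values)
    if avg == 0:
        return None  # CV is undefined for a zero mean
    return pstdev(values) / avg


# Hypothetical CI pixel values: a uniform field and a patchy one.
print(coefficient_of_variation([2.0, 2.0, 2.0]))
print(coefficient_of_variation([1.0, 3.0]))
```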
---
## Future Enhancements
- **Real-Time Monitoring**: Daily harvest probability updates integrated into web dashboard
- **SAR Integration**: Radar satellite data (Sentinel-1) for all-weather monitoring
- **IoT Sensors**: Ground-based soil moisture and weather integration
- **Advanced Yield Models**: Enhanced harvest forecasting with satellite + ground truth
- **Automated Alerts**: WhatsApp/email dispatch of critical agricultural advice