updated sc-91

2026-01-29 17:26:03 +01:00 · 2026-01-29 17:26:03 +01:00 · d1f352f21c
parent 4445f72e6f
commit d1f352f21c
5 changed files with 1523 additions and 541 deletions
--- a/CODE_REVIEW_FINDINGS.md
+++ b/CODE_REVIEW_FINDINGS.md
@ -0,0 +1,751 @@
+# SmartCane Pipeline Code Review
+## Efficiency, Cleanup, and Architecture Analysis
+
+**Date**: January 29, 2026  
+**Scope**: `run_full_pipeline.R` + all called scripts (10, 20, 21, 30, 31, 40, 80, 90, 91) + utility files  
+**Status**: Comprehensive review completed
+
+---
+
+## EXECUTIVE SUMMARY
+
+Your pipeline is **well-structured and intentional**, but has accumulated significant technical debt through development iterations. The main issues are:
+
+1. **🔴 HIGH IMPACT**: **3 separate mosaic mode detection functions** doing identical work
+2. **🔴 HIGH IMPACT**: **Week/year calculations duplicated 10+ times** across 6+ files
+3. **🟡 MEDIUM IMPACT**: **40+ debug statements** cluttering output
+4. **🟡 MEDIUM IMPACT**: **File existence checks repeated** in multiple places (especially KPI checks)
+5. **🟢 LOW IMPACT**: Minor redundancy in command construction, but manageable
+
+**Estimated cleanup effort**: 2-3 hours for core refactoring; significant code quality gains.
+
+**Workflow clarity issue**: The split between `merged_tif` vs `merged_tif_8b` and `weekly_mosaic` vs `weekly_tile_max` is **not clearly documented**. This should be clarified.
+
+---
+
+## 1. DUPLICATED FUNCTIONS & LOGIC
+
+### 1.1 Mosaic Mode Detection (CRITICAL REDUNDANCY)
+
+**Problem**: Three identical implementations of `detect_mosaic_mode()`:
+
+| Location | Function Name | Lines | Issue |
+|----------|---------------|-------|-------|
+| `run_full_pipeline.R` | `detect_mosaic_mode_early()` | ~20 lines | Detects tiled vs single-file |
+| `run_full_pipeline.R` | `detect_mosaic_mode_simple()` | ~20 lines | Detects tiled vs single-file (duplicate) |
+| `parameters_project.R` | `detect_mosaic_mode()` | ~30 lines | Detects tiled vs single-file (different signature) |
+
+**Impact**: If you change the detection logic, you must update 3 places. Bug risk is high.
+
+**Solution**: Create **single canonical function in `parameters_project.R`**:
+```r
+# SINGLE SOURCE OF TRUTH
+detect_mosaic_mode <- function(project_dir) {
+  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
+  if (dir.exists(weekly_tile_max)) {
+    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
+    if (length(grep("^\\d+x\\d+$", subfolders)) > 0) return("tiled")
+  }
+  
+  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
+  if (dir.exists(weekly_mosaic) && 
+      length(list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")) > 0) {
+    return("single-file")
+  }
+  
+  return("unknown")
+}
+```
+
+Then replace all three calls in `run_full_pipeline.R` with this single function.
+
+---
+
+### 1.2 Week/Year Calculations (CRITICAL REDUNDANCY)
+
+**Problem**: The pattern `week_num <- as.numeric(format(..., "%V"))` + `year_num <- as.numeric(format(..., "%G"))` appears **13+ times** across multiple files.
+
+**Locations**:
+- `run_full_pipeline.R`: Lines 82, 126-127, 229-230, 630, 793-794 (5 times)
+- `80_calculate_kpis.R`: Lines 323-324 (1 time)
+- `80_weekly_stats_utils.R`: Lines 829-830 (1 time)
+- `kpi_utils.R`: Line 45 (1 time)
+- `80_kpi_utils.R`: Lines 177-178 (1 time)
+- Plus inline in sprintf statements: ~10+ additional times
+
+**Impact**: 
+- High maintenance burden
+- Risk of inconsistency (%V vs %Y confusion noted at line 82 in `run_full_pipeline.R`)
+- Code verbosity
+
+**Solution**: Create **utility function in `parameters_project.R`**:
+```r
+get_iso_week_year <- function(date) {
+  list(
+    week = as.numeric(format(date, "%V")),
+    year = as.numeric(format(date, "%G"))  # ISO year, not calendar year
+  )
+}
+
+# Usage:
+wwy <- get_iso_week_year(end_date)
+cat(sprintf("Week %02d/%d\n", wwy$week, wwy$year))
+```
+
+**Also add convenience function**:
+```r
+format_week_year <- function(date, separator = "_") {
+  wwy <- get_iso_week_year(date)
+  sprintf("week_%02d%s%d", wwy$week, separator, wwy$year)
+}
+
+# Usage: format_week_year(end_date)  # "week_02_2026"
+```
+
+---
+
+### 1.3 File Path Construction (MEDIUM REDUNDANCY)
+
+**Problem**: Repeated patterns like:
+```r
+file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
+file.path("laravel_app", "storage", "app", project_dir, "reports", "kpis", kpi_subdir)
+```
+
+**Solution**: Centralize in `parameters_project.R`:
+```r
+# Project-agnostic path builders
+get_project_storage_path <- function(project_dir, subdir = NULL) {
+  base <- file.path("laravel_app", "storage", "app", project_dir)
+  if (!is.null(subdir)) file.path(base, subdir) else base
+}
+
+get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
+  if (mosaic_mode == "auto") mosaic_mode <- detect_mosaic_mode(project_dir)
+  if (mosaic_mode == "tiled") {
+    get_project_storage_path(project_dir, "weekly_tile_max/5x5")
+  } else {
+    get_project_storage_path(project_dir, "weekly_mosaic")
+  }
+}
+
+get_kpi_dir <- function(project_dir, client_type) {
+  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
+  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
+}
+```
+
+---
+
+## 2. DEBUG STATEMENTS & LOGGING CLUTTER
+
+### 2.1 Excessive Debug Output
+
+The pipeline prints **40+ debug statements** that pollute the terminal output. Examples:
+
+**In `run_full_pipeline.R`**:
+```r
+Line 82:   cat(sprintf("       Running week: %02d / %d\n", ...))  # Note: %d (calendar year) should be %G
+Line 218:  cat(sprintf("[KPI_DIR_CREATED] Created directory: %s\n", ...))
+Line 223:  cat(sprintf("[KPI_DIR_EXISTS] %s\n", ...))
+Line 224:  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", ...))
+Line 225:  cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
+Line 240:  cat(sprintf("[KPI_DEBUG_W%02d_%d] Pattern: '%s' | Found: %d files\n", ...))
+Line 630:  cat("DEBUG: Running command:", cmd, "\n")
+Line 630 in Script 31 execution - prints full conda command
+```
+
+**In `80_calculate_kpis.R`**:
+```
+Line 323:  message(paste("Calculating statistics for all fields - Week", week_num, year))
+Line 417:  # Plus many more ...
+```
+
+**Impact**: 
+- Makes output hard to scan for real issues
+- Test developers skip important messages
+- Production logs become noise
+
+**Solution**: Replace with **structured logging** (3 levels):
+
+```r
+# Add to parameters_project.R
+smartcane_log <- function(message, level = "INFO") {
+  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
+  prefix <- sprintf("[%s] %s", level, timestamp)
+  cat(sprintf("%s | %s\n", prefix, message))
+}
+
+smartcane_debug <- function(message) {
+  if (Sys.getenv("SMARTCANE_DEBUG") == "TRUE") {
+    smartcane_log(message, level = "DEBUG")
+  }
+}
+
+smartcane_warn <- function(message) {
+  smartcane_log(message, level = "WARN")
+}
+```
+
+**Usage**:
+```r
+# Keep important messages
+smartcane_log(sprintf("Downloaded %d dates, %d failed", download_count, download_failed))
+
+# Hide debug clutter (only show if DEBUG=TRUE)
+smartcane_debug(sprintf("KPI directory exists: %s", kpi_dir))
+
+# Warnings stay visible
+smartcane_warn("Some downloads failed, but continuing pipeline")
+```
+
+---
+
+### 2.2 Redundant Status Checks in KPI Section
+
+**Lines 218-270 in `run_full_pipeline.R`**: The KPI requirement check has **deeply nested debug statements**.
+
+```r
+if (dir.exists(kpi_dir)) {
+  cat(sprintf("[KPI_DIR_EXISTS] %s\n", kpi_dir))
+  all_kpi_files <- list.files(kpi_dir)
+  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", length(all_kpi_files)))
+  if (length(all_kpi_files) > 0) {
+    cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
+  }
+} else {
+  cat(sprintf("[KPI_DIR_MISSING] Directory does not exist: %s\n", kpi_dir))
+}
+```
+
+**Solution**: Simplify to:
+```r
+if (!dir.exists(kpi_dir)) {
+  dir.create(kpi_dir, recursive = TRUE, showWarnings = FALSE)
+}
+
+all_kpi_files <- list.files(kpi_dir)
+smartcane_debug(sprintf("KPI directory: %d files found", length(all_kpi_files)))
+```
+
+---
+
+## 3. DOUBLE CALCULATIONS & INEFFICIENCIES
+
+### 3.1 KPI Existence Check (Calculated Twice)
+
+**Problem**: KPI existence is checked **twice** in `run_full_pipeline.R`:
+
+1. **First check (Lines 228-270)**: Initial KPI requirement check that calculates `kpis_needed` dataframe
+2. **Second check (Lines 786-810)**: Verification after Script 80 runs (almost identical logic)
+
+Both loops do:
+```r
+for (weeks_back in 0:(reporting_weeks_needed - 1)) {
+  check_date <- end_date - (weeks_back * 7)
+  week_num <- as.numeric(format(check_date, "%V"))
+  year_num <- as.numeric(format(check_date, "%G"))
+  
+  week_pattern <- sprintf("week%02d_%d", week_num, year_num)
+  kpi_files_this_week <- list.files(kpi_dir, pattern = week_pattern)
+  
+  has_kpis <- length(kpi_files_this_week) > 0
+  # ... same logic again
+}
+```
+
+**Impact**: Slower pipeline execution, code duplication
+
+**Solution**: Create **reusable function in utility file**:
+```r
+check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
+  kpi_dir <- get_kpi_dir(project_dir, client_type)
+  
+  kpis_needed <- data.frame()
+  for (weeks_back in 0:(reporting_weeks_needed - 1)) {
+    check_date <- end_date - (weeks_back * 7)
+    wwy <- get_iso_week_year(check_date)
+    
+    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
+    has_kpis <- any(grepl(week_pattern, list.files(kpi_dir)))
+    
+    kpis_needed <- rbind(kpis_needed, data.frame(
+      week = wwy$week,
+      year = wwy$year,
+      date = check_date,
+      has_kpis = has_kpis
+    ))
+  }
+  
+  return(list(
+    kpis_df = kpis_needed,
+    missing_count = sum(!kpis_needed$has_kpis),
+    all_complete = all(kpis_needed$has_kpis)
+  ))
+}
+
+# Then in run_full_pipeline.R:
+initial_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
+
+# ... after Script 80 runs:
+final_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
+if (final_kpi_check$all_complete) {
+  smartcane_log("✓ All KPIs available")
+}
+```
+
+---
+
+### 3.2 Mosaic Mode Detection (Called 3+ Times per Run)
+
+**Current code**:
+- Line 99-117: `detect_mosaic_mode_early()` called once
+- Line 301-324: `detect_mosaic_mode_simple()` called again
+- Result: **Same detection logic runs twice unnecessarily**
+
+**Solution**: Call once, store result:
+```r
+mosaic_mode <- detect_mosaic_mode(project_dir)  # Once at top
+
+# Then reuse throughout:
+if (mosaic_mode == "tiled") { ... }
+else if (mosaic_mode == "single-file") { ... }
+```
+
+---
+
+### 3.3 Missing Weeks Calculation Inefficiency
+
+**Lines 126-170**: The loop builds `weeks_needed` dataframe, then **immediately** iterates again to find which ones are missing.
+
+**Current code**:
+```r
+# First: build all weeks
+weeks_needed <- data.frame()
+for (weeks_back in 0:(reporting_weeks_needed - 1)) {
+  # ... build weeks_needed
+}
+
+# Then: check which are missing (loop again)
+missing_weeks <- data.frame()
+for (i in 1:nrow(weeks_needed)) {
+  # ... check each week
+}
+```
+
+**Solution**: Combine into **single loop**:
+```r
+weeks_needed <- data.frame()
+missing_weeks <- data.frame()
+earliest_missing_date <- end_date
+
+for (weeks_back in 0:(reporting_weeks_needed - 1)) {
+  check_date <- end_date - (weeks_back * 7)
+  wwy <- get_iso_week_year(check_date)
+  
+  # Add to weeks_needed
+  weeks_needed <- rbind(weeks_needed, data.frame(
+    week = wwy$week, year = wwy$year, date = check_date
+  ))
+  
+  # Check if missing, add to missing_weeks if so
+  week_pattern <- sprintf("week_%02d_%d", wwy$week, wwy$year)
+  mosaic_dir <- get_mosaic_dir(project_dir, mosaic_mode)
+  
+  if (length(list.files(mosaic_dir, pattern = week_pattern)) == 0) {
+    missing_weeks <- rbind(missing_weeks, data.frame(
+      week = wwy$week, year = wwy$year, week_end_date = check_date
+    ))
+    if (check_date - 6 < earliest_missing_date) {
+      earliest_missing_date <- check_date - 6
+    }
+  }
+}
+```
+
+---
+
+### 3.4 Data Source Detection Logic
+
+**Lines 58-84**: The `data_source_used` detection is overly complex:
+
+```r
+data_source_used <- "merged_tif_8b"  # Default
+if (dir.exists(merged_tif_path)) {
+  tif_files <- list.files(merged_tif_path, pattern = "\\.tif$")
+  if (length(tif_files) > 0) {
+    data_source_used <- "merged_tif"
+    # ...
+  } else if (dir.exists(merged_tif_8b_path)) {
+    tif_files_8b <- list.files(merged_tif_8b_path, pattern = "\\.tif$")
+    # ...
+  }
+} else if (dir.exists(merged_tif_8b_path)) {
+  # ...
+}
+```
+
+**Issues**:
+- Multiple nested conditions doing the same check
+- `tif_files` and `tif_files_8b` are listed but only counts checked (not used later)
+- Logic could be cleaner
+
+**Solution**: Create utility function:
+```r
+detect_data_source <- function(project_dir, preferred = "auto") {
+  storage_dir <- get_project_storage_path(project_dir)
+  
+  for (source in c("merged_tif", "merged_tif_8b")) {
+    source_dir <- file.path(storage_dir, source)
+    if (dir.exists(source_dir)) {
+      tifs <- list.files(source_dir, pattern = "\\.tif$")
+      if (length(tifs) > 0) return(source)
+    }
+  }
+  
+  smartcane_warn("No data source found - defaulting to merged_tif_8b")
+  return("merged_tif_8b")
+}
+```
+
+---
+
+## 4. WORKFLOW CLARITY ISSUES
+
+### 4.1 TIFF Data Format Confusion
+
+**Problem**: Why are there TWO different TIFF folders?
+
+- `merged_tif`: 4-band data (RGB + NIR)
+- `merged_tif_8b`: 8-band data (appears to include UDM cloud masking from Planet)
+
+**Currently in code**:
+```r
+data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
+```
+
+**Issues**:
+- Hard-coded per project, not based on what's actually available
+- Not documented **why** angata uses 8-band
+- Unclear what the 8-band data adds (cloud masking? extra bands?)
+- Scripts handle both, but it's not clear when to use which
+
+**Recommendation**:
+1. **Document in `parameters_project.R`** what each data source contains:
+```r
+DATA_SOURCE_FORMATS <- list(
+  "merged_tif" = list(
+    bands = 4,
+    description = "4-band PlanetScope: Red, Green, Blue, NIR",
+    projects = c("aura", "chemba", "xinavane"),
+    note = "Standard format from Planet API"
+  ),
+  "merged_tif_8b" = list(
+    bands = 8,
+    description = "8-band PlanetScope with UDM: RGB+NIR + 4-band cloud mask",
+    projects = c("angata"),
+    note = "Enhanced with cloud confidence from UDM2 (Unusable Data Mask)"
+  )
+)
+```
+
+2. **Update hard-coded assignment** to be data-driven:
+```r
+# OLD: data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
+# NEW: detect what's actually available
+data_source <- detect_data_source(project_dir)
+```
+
+---
+
+### 4.2 Mosaic Storage Format Confusion
+
+**Problem**: Why are there TWO different mosaic storage styles?
+
+- `weekly_mosaic/`: Single TIF file per week (monolithic)
+- `weekly_tile_max/5x5/`: Tiled TIFFs per week (25+ files per week)
+
+**Currently in code**:
+- Detected automatically via `detect_mosaic_mode()`
+- But **no documentation** on when/why each is used
+
+**Recommendation**:
+1. **Document the trade-offs in `parameters_project.R`**:
+```r
+MOSAIC_MODES <- list(
+  "single-file" = list(
+    description = "One TIF per week",
+    storage_path = "weekly_mosaic/",
+    files_per_week = 1,
+    pros = c("Simpler file management", "Easier to load full mosaic"),
+    cons = c("Slower for field-specific analysis", "Large file I/O"),
+    suitable_for = c("agronomic_support", "dashboard visualization")
+  ),
+  "tiled" = list(
+    description = "5×5 grid of tiles per week",
+    storage_path = "weekly_tile_max/5x5/",
+    files_per_week = 25,
+    pros = c("Parallel field processing", "Faster per-field queries", "Scalable to 1000+ fields"),
+    cons = c("More file management", "Requires tile_grid metadata"),
+    suitable_for = c("cane_supply", "large-scale operations")
+  )
+)
+```
+
+2. **Document why angata uses tiled, aura uses single-file**:
+   - Is it a function of field count? (Angata = cane_supply, large fields → tiled)
+   - Is it historical? (Legacy decision?)
+   - Should new projects choose based on client type?
+
+---
+
+### 4.3 Client Type Mapping Clarity
+
+**Current structure** in `parameters_project.R`:
+
+```r
+CLIENT_TYPE_MAP <- list(
+  "angata" = "cane_supply",
+  "aura" = "agronomic_support",
+  "chemba" = "cane_supply",
+  "xinavane" = "cane_supply",
+  "esa" = "cane_supply"
+)
+```
+
+**Issues**:
+- Not clear **why** aura is agronomic_support while angata/chemba are cane_supply
+- No documentation of what each client type needs
+- Scripts branch heavily on `skip_cane_supply_only` logic
+
+**Recommendation**: 
+Add metadata to explain the distinction:
+
+```r
+CLIENT_TYPES <- list(
+  "cane_supply" = list(
+    description = "Sugar mill supply chain optimization",
+    requires_harvest_prediction = TRUE,  # Script 31
+    requires_phase_assignment = TRUE,     # Based on planting date
+    per_field_detail = TRUE,              # Script 91 Excel report
+    data_sources = c("merged_tif", "merged_tif_8b"),
+    mosaic_mode = "tiled",
+    projects = c("angata", "chemba", "xinavane", "esa")
+  ),
+  "agronomic_support" = list(
+    description = "Farm-level decision support for agronomists",
+    requires_harvest_prediction = FALSE,
+    requires_phase_assignment = FALSE,
+    per_field_detail = FALSE,
+    farm_level_kpis = TRUE,               # Script 90 Word report
+    data_sources = c("merged_tif"),
+    mosaic_mode = "single-file",
+    projects = c("aura")
+  )
+)
+```
+
+---
+
+## 5. COMMAND CONSTRUCTION REDUNDANCY
+
+### 5.1 Rscript Path Repetition
+
+**Problem**: The Rscript path is repeated 5 times:
+
+```r
+Line 519:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
+Line 676:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
+Line 685:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
+```
+
+**Solution**: Define once in `parameters_project.R`:
+```r
+RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"
+
+# Usage:
+cmd <- sprintf('"%s" --vanilla r_app/20_ci_extraction.R ...', RSCRIPT_PATH)
+```
+
+---
+
+## 6. SPECIFIC LINE-BY-LINE ISSUES
+
+### 6.1 Line 82 Bug: Wrong Format Code
+
+```r
+cat(sprintf("       Running week: %02d / %d\n", 
+            as.numeric(format(end_date, "%V")), 
+            as.numeric(format(end_date, "%Y"))))  # ❌ Should be %G, not %Y
+```
+
+**Issue**: Uses calendar year `%Y` instead of ISO week year `%G`. On dates like 2025-12-30 (week 1 of 2026), this will print "Week 01 / 2025" (confusing).
+
+**Fix**:
+```r
+wwy <- get_iso_week_year(end_date)
+cat(sprintf("       Running week: %02d / %d\n", wwy$week, wwy$year))
+```
+
+---
+
+### 6.2 Line 630 Debug Statement
+
+```r
+cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
+cat("DEBUG: Running command:", cmd, "\n")  # ❌ Prints full conda command
+```
+
+**Solution**: Use `smartcane_debug()` function:
+```r
+cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
+smartcane_debug(sprintf("Running Python 31: %s", cmd))
+```
+
+---
+
+### 6.3 Lines 719-723: Verbose Script 31 Verification
+
+```r
+# Check for THIS WEEK's specific file
+current_week <- as.numeric(format(end_date, "%V"))
+current_year <- as.numeric(format(end_date, "%Y"))
+expected_file <- file.path(...)
+```
+
+**Issue**: Calculates week twice (already done earlier). Also uses `%Y` (should be `%G`).
+
+**Solution**: Reuse earlier `wwy` calculation or create helper.
+
+---
+
+## 7. REFACTORING ROADMAP
+
+### Phase 1: Foundation (1 hour)
+- [ ] Consolidate `detect_mosaic_mode()` into single function in `parameters_project.R`
+- [ ] Create `get_iso_week_year()` and `format_week_year()` utilities
+- [ ] Create `get_project_storage_path()`, `get_mosaic_dir()`, `get_kpi_dir()` helpers
+- [ ] Add logging functions (`smartcane_log()`, `smartcane_debug()`, `smartcane_warn()`)
+
+### Phase 2: Deduplication (1 hour)
+- [ ] Replace all 13+ week_num/year_num calculations with `get_iso_week_year()`
+- [ ] Replace all 3 `detect_mosaic_mode_*()` calls with single function
+- [ ] Combine duplicate KPI checks into `check_kpi_completeness()` function
+- [ ] Fix line 82 and 630 format bugs
+
+### Phase 3: Cleanup (1 hour)
+- [ ] Remove all debug statements (40+), replace with `smartcane_debug()`
+- [ ] Simplify nested conditions in data_source detection
+- [ ] Combine missing weeks detection into single loop
+- [ ] Extract Rscript path to constant
+
+### Phase 4: Documentation (30 min)
+- [ ] Add comments explaining `merged_tif` vs `merged_tif_8b` trade-offs
+- [ ] Document `single-file` vs `tiled` mosaic modes and when to use each
+- [ ] Clarify client type mapping in `CLIENT_TYPE_MAP`
+- [ ] Add inline comments for non-obvious logic
+
+---
+
+## 8. ARCHITECTURE & WORKFLOW RECOMMENDATIONS
+
+### 8.1 Clear Data Flow Diagram
+
+Add to `r_app/system_architecture/system_architecture.md`:
+
+```
+INPUT SOURCES:
+  ├── Planet API 4-band or 8-band imagery
+  ├── Field boundaries (pivot.geojson)
+  └── Harvest data (harvest.xlsx, optional for cane_supply)
+
+STORAGE TIERS:
+  ├── Tier 1: Raw data (merged_tif/ or merged_tif_8b/)
+  ├── Tier 2: Daily tiles (daily_tiles_split/{grid_size}/{dates}/)
+  ├── Tier 3: Extracted CI (Data/extracted_ci/daily_vals/*.rds)
+  ├── Tier 4: Weekly mosaics (weekly_mosaic/ OR weekly_tile_max/5x5/)
+  └── Tier 5: KPI outputs (reports/kpis/{field_level|field_analysis}/)
+
+DECISION POINTS:
+  └─ Client type (cane_supply vs agronomic_support)
+     ├─ Drives script selection (Scripts 21, 22, 23, 31, 90/91)
+     ├─ Drives data source (merged_tif_8b for cane_supply, merged_tif for agronomic)
+     ├─ Drives mosaic mode (tiled for cane_supply, single-file for agronomic)
+     └─ Drives KPI subdirectory (field_analysis vs field_level)
+```
+
+### 8.2 .sh Scripts Alignment
+
+You mention `.sh` scripts in the online environment. If they're **not calling the R pipeline**, there's a **split responsibility** issue:
+
+**Question**: Are the `.sh` scripts:
+- (A) Independent duplicates of the R pipeline logic? (BAD - maintenance nightmare)
+- (B) Wrappers calling the R pipeline? (GOOD - single source of truth)
+- (C) Different workflow for online vs local? (RED FLAG - they diverge)
+
+**Recommendation**: If using `.sh` for production, ensure they **call the same R scripts** (`run_full_pipeline.R`). Example:
+
+```bash
+#!/bin/bash
+# Wrapper that ensures R pipeline is called
+cd /path/to/smartcane
+& "C:\Program Files\R\R-4.4.3\bin\x64\Rscript.exe" r_app/run_full_pipeline.R
+```
+
+---
+
+## 9. SUMMARY TABLE: Issues by Severity
+
+| Issue | Type | Impact | Effort | Priority |
+|-------|------|--------|--------|----------|
+| 3 mosaic detection functions | Duplication | HIGH | 30 min | P0 |
+| 13+ week/year calculations | Duplication | HIGH | 1 hour | P0 |
+| 40+ debug statements | Clutter | MEDIUM | 1 hour | P1 |
+| KPI check run twice | Inefficiency | LOW | 30 min | P2 |
+| Line 82: %Y should be %G | Bug | LOW | 5 min | P2 |
+| Data source confusion | Documentation | MEDIUM | 30 min | P1 |
+| Mosaic mode confusion | Documentation | MEDIUM | 30 min | P1 |
+| Client type mapping | Documentation | MEDIUM | 30 min | P1 |
+| Data source detection complexity | Code style | LOW | 15 min | P3 |
+
+---
+
+## 10. RECOMMENDED NEXT STEPS
+
+1. **Review this report** with your team to align on priorities
+2. **Create Linear issues** for each phase of refactoring
+3. **Start with Phase 1** (foundation utilities) - builds confidence for Phase 2
+4. **Test thoroughly** after each phase - the pipeline is complex and easy to break
+5. **Update `.sh` scripts** if they duplicate R logic
+6. **Document data flow** in `system_architecture/system_architecture.md`
+
+---
+
+## Questions for Clarification
+
+Before implementing, please clarify:
+
+1. **Data source split**: Why does angata use `merged_tif_8b` (8-band with cloud mask) while aura uses `merged_tif` (4-band)? Is this:
+   - A function of client need (cane_supply requires cloud masking)?
+   - Historical (legacy decision for angata)?
+   - Should new projects choose based on availability?
+
+2. **Mosaic mode split**: Why tiled for angata but single-file for aura? Should this be:
+   - Hard-coded per project?
+   - Based on field count/client type?
+   - Auto-detected from first run?
+
+3. **Production vs local**: Are the `.sh` scripts in the online environment:
+   - Calling this same R pipeline?
+   - Duplicating logic independently?
+   - A different workflow entirely?
+
+4. **Client type growth**: Are there other client types planned beyond `cane_supply` and `agronomic_support`? (e.g., extension_service?)
+
+---
+
+**Report prepared**: January 29, 2026  
+**Total code reviewed**: ~2,500 lines across 10 files  
+**Estimated refactoring time**: 3-4 hours  
+**Estimated maintenance savings**: 5-10 hours/month (fewer bugs, easier updates)
+
--- a/r_app/40_mosaic_creation.R
+++ b/r_app/40_mosaic_creation.R
@ -188,7 +188,7 @@ main <- function() {
  if (!exists("use_tile_mosaic")) {
    # Fallback detection if flag not set (shouldn't happen)
    merged_final_dir <- file.path(laravel_storage, "merged_final_tif")
-    tile_detection <- detect_mosaic_mode(merged_final_dir)
+    tile_detection <- detect_tile_structure_from_merged_final(merged_final_dir)
    use_tile_mosaic <- tile_detection$has_tiles
  }
  
--- a/r_app/40_mosaic_creation_utils.R
+++ b/r_app/40_mosaic_creation_utils.R
@ -3,12 +3,12 @@
 # Utility functions for creating weekly mosaics from daily satellite imagery.
 # These functions support cloud cover assessment, date handling, and mosaic creation.

-#' Detect whether a project uses tile-based or single-file mosaic approach
+#' Detect whether a project uses tile-based or single-file mosaic approach (utility version)
 #'
 #' @param merged_final_tif_dir Directory containing merged_final_tif files
 #' @return List with has_tiles (logical), detected_tiles (vector), total_files (count)
 #'
-detect_mosaic_mode <- function(merged_final_tif_dir) {
+detect_tile_structure_from_files <- function(merged_final_tif_dir) {
  # Check if directory exists
  if (!dir.exists(merged_final_tif_dir)) {
    return(list(has_tiles = FALSE, detected_tiles = character(), total_files = 0))
--- a/r_app/parameters_project.R
+++ b/r_app/parameters_project.R
@ -114,7 +114,7 @@ get_client_kpi_config <- function(client_type) {

 # 3. Smart detection for tile-based vs single-file mosaic approach
 # ----------------------------------------------------------------
-detect_mosaic_mode <- function(merged_final_tif_dir, daily_tiles_split_dir = NULL) {
+detect_tile_structure_from_merged_final <- function(merged_final_tif_dir, daily_tiles_split_dir = NULL) {
  # PRIORITY 1: Check for tiling_config.json metadata file from script 10
  # This is the most reliable source since script 10 explicitly records its decision
  
@ -223,7 +223,7 @@ setup_project_directories <- function(project_dir, data_source = "merged_tif_8b"
  merged_final_dir <- here(laravel_storage_dir, "merged_final_tif")
  daily_tiles_split_dir <- here(laravel_storage_dir, "daily_tiles_split")
  
-  tile_detection <- detect_mosaic_mode(
+  tile_detection <- detect_tile_structure_from_merged_final(
    merged_final_tif_dir = merged_final_dir,
    daily_tiles_split_dir = daily_tiles_split_dir
  )
@ -498,6 +498,279 @@ setup_logging <- function(log_dir) {
  ))
 }

+# 8. HELPER FUNCTIONS FOR COMMON CALCULATIONS
+# -----------------------------------------------
+# Centralized functions to reduce duplication across scripts
+
+# Get ISO week and year from a date
+get_iso_week <- function(date) {
+  as.numeric(format(date, "%V"))
+}
+
+get_iso_year <- function(date) {
+  as.numeric(format(date, "%G"))
+}
+
+# Get both ISO week and year as a list
+get_iso_week_year <- function(date) {
+  list(
+    week = as.numeric(format(date, "%V")),
+    year = as.numeric(format(date, "%G"))
+  )
+}
+
+# Format week/year into a readable label
+format_week_label <- function(date, separator = "_") {
+  wwy <- get_iso_week_year(date)
+  sprintf("week%02d%s%d", wwy$week, separator, wwy$year)
+}
+
+# Auto-detect mosaic mode (tiled vs single-file)
+# Returns: "tiled", "single-file", or "unknown"
+detect_mosaic_mode <- function(project_dir) {
+  # Check for tile-based approach: weekly_tile_max/{grid_size}/week_*.tif
+  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
+  if (dir.exists(weekly_tile_max)) {
+    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
+    grid_patterns <- grep("^\\d+x\\d+$", subfolders, value = TRUE)
+    if (length(grid_patterns) > 0) {
+      return("tiled")
+    }
+  }
+  
+  # Check for single-file approach: weekly_mosaic/week_*.tif
+  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
+  if (dir.exists(weekly_mosaic)) {
+    files <- list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")
+    if (length(files) > 0) {
+      return("single-file")
+    }
+  }
+  
+  return("unknown")
+}
+
+# Auto-detect grid size from tile directory structure
+# Returns: e.g., "5x5", "10x10", or "unknown"
+detect_grid_size <- function(project_dir) {
+  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
+  if (dir.exists(weekly_tile_max)) {
+    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
+    grid_patterns <- grep("^\\d+x\\d+$", subfolders, value = TRUE)
+    if (length(grid_patterns) > 0) {
+      return(grid_patterns[1])  # Return first match (usually only one)
+    }
+  }
+  return("unknown")
+}
+
+# Build storage paths consistently across all scripts
+get_project_storage_path <- function(project_dir, subdir = NULL) {
+  base <- file.path("laravel_app", "storage", "app", project_dir)
+  if (!is.null(subdir)) file.path(base, subdir) else base
+}
+
+get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
+  if (mosaic_mode == "auto") {
+    mosaic_mode <- detect_mosaic_mode(project_dir)
+  }
+  
+  if (mosaic_mode == "tiled") {
+    grid_size <- detect_grid_size(project_dir)
+    if (grid_size != "unknown") {
+      get_project_storage_path(project_dir, file.path("weekly_tile_max", grid_size))
+    } else {
+      get_project_storage_path(project_dir, "weekly_tile_max/5x5")  # Fallback default
+    }
+  } else {
+    get_project_storage_path(project_dir, "weekly_mosaic")
+  }
+}
+
+get_kpi_dir <- function(project_dir, client_type) {
+  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
+  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
+}
+
+# Logging functions for clean output
+smartcane_log <- function(message, level = "INFO", verbose = TRUE) {
+  if (!verbose) return(invisible(NULL))
+  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
+  prefix <- sprintf("[%s]", level)
+  cat(sprintf("%s %s\n", prefix, message))
+}
+
+smartcane_debug <- function(message, verbose = FALSE) {
+  if (!verbose && Sys.getenv("SMARTCANE_DEBUG") != "TRUE") {
+    return(invisible(NULL))
+  }
+  smartcane_log(message, level = "DEBUG", verbose = TRUE)
+}
+
+smartcane_warn <- function(message) {
+  smartcane_log(message, level = "WARN", verbose = TRUE)
+}
+
+# ============================================================================
+# PHASE 3 & 4: OPTIMIZATION & DOCUMENTATION
+# ============================================================================
+
+# System Constants
+# ----------------
+# Define once, use everywhere
+
+RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"
+# Used in run_full_pipeline.R for calling R scripts via system()
+
+# Data Source Documentation
+# ---------------------------
+# Explains the two satellite data formats and when to use each
+#
+# SmartCane uses PlanetScope imagery from Planet Labs API in two formats:
+#
+# 1. merged_tif (4-band):
+#    - Standard format: Red, Green, Blue, Near-Infrared
+#    - Size: ~150-200 MB per date
+#    - Use case: Agronomic support, general crop health monitoring
+#    - Projects: aura, xinavane
+#    - Cloud handling: Basic cloud masking from Planet metadata
+#
+# 2. merged_tif_8b (8-band with cloud confidence):
+#    - Enhanced format: 4-band imagery + 4-band UDM2 cloud mask
+#    - UDM2 bands: Clear, Snow, Shadow, Light Haze
+#    - Size: ~250-350 MB per date
+#    - Use case: Harvest prediction, supply chain optimization
+#    - Projects: angata, chemba, esa (cane_supply clients)
+#    - Cloud handling: Per-pixel cloud confidence from Planet UDM2
+#    - Why: Cane supply chains need precise confidence to predict harvest dates
+#           (don't want to predict based on cloudy data)
+#
+# The system auto-detects which is available via detect_data_source()
+
+# Mosaic Mode Documentation
+# --------------------------
+# SmartCane supports two ways to store and process weekly mosaics:
+#
+# 1. Single-file mosaic ("single-file"):
+#    - One GeoTIFF per week: weekly_mosaic/week_02_2026.tif
+#    - 5 bands per file: R, G, B, NIR, CI (Canopy Index)
+#    - Size: ~300-500 MB per week
+#    - Pros: Simpler file management, easier full-field visualization
+#    - Cons: Slower for field-specific queries, requires loading full raster
+#    - Best for: Agronomic support (aura) with <100 fields
+#    - Script 04 output: 5-band single-file mosaic
+#
+# 2. Tiled mosaic ("tiled"):
+#    - Grid of tiles per week: weekly_tile_max/5x5/week_02_2026_{TT}.tif
+#    - Example: 25 files (5×5 grid) × 5 bands = 125 individual tiffs
+#    - Size: ~15-20 MB per tile, organized in folders
+#    - Pros: Parallel processing, fast field lookups, scales to 1000+ fields
+#    - Cons: More file I/O, requires tile-to-field mapping metadata
+#    - Best for: Cane supply (angata, chemba) with 500+ fields
+#    - Script 04 output: Per-tile tiff files in weekly_tile_max/{grid}/
+#    - Tile assignment: Field boundaries mapped to grid coordinates
+#
+# The system auto-detects which is available via detect_mosaic_mode()
+
+# Client Type Documentation
+# --------------------------
+# SmartCane runs different analysis pipelines based on client_type:
+#
+# CLIENT_TYPE: cane_supply
+#   Purpose: Optimize sugar mill supply chain (harvest scheduling)
+#   Scripts run: 20 (CI), 21 (RDS to CSV), 30 (Growth), 31 (Harvest pred), 40 (Mosaic), 80 (KPI), 91 (Excel)
+#   Outputs:
+#     - Per-field analysis: field status, growth phase, harvest readiness
+#     - Excel reports (Script 91): Detailed metrics for logistics planning
+#     - KPI directory: reports/kpis/field_analysis/ (one RDS per week)
+#   Harvest data: Required (harvest.xlsx - planting dates for phase assignment)
+#   Data source: merged_tif_8b (uses cloud confidence for confidence)
+#   Mosaic mode: tiled (scales to 500+ fields)
+#   Projects: angata, chemba, xinavane, esa
+#
+# CLIENT_TYPE: agronomic_support
+#   Purpose: Provide weekly crop health insights to agronomists
+#   Scripts run: 80 (KPI), 90 (Word report)
+#   Outputs:
+#     - Farm-level KPI summaries (no per-field breakdown)
+#     - Word reports (Script 90): Charts and trends for agronomist decision support
+#     - KPI directory: reports/kpis/field_level/ (one RDS per week)
+#   Harvest data: Not used
+#   Data source: merged_tif (simpler, smaller)
+#   Mosaic mode: single-file (100-200 fields)
+#   Projects: aura
+#
+
+# Detect data source (merged_tif vs merged_tif_8b) based on availability
+# Returns the first available source; defaults to merged_tif_8b if neither exists
+detect_data_source <- function(project_dir) {
+  storage_dir <- get_project_storage_path(project_dir)
+  
+  # Preferred order: check merged_tif first, fall back to merged_tif_8b
+  for (source in c("merged_tif", "merged_tif_8b")) {
+    source_dir <- file.path(storage_dir, source)
+    if (dir.exists(source_dir)) {
+      tifs <- list.files(source_dir, pattern = "\\.tif$")
+      if (length(tifs) > 0) {
+        smartcane_log(sprintf("Detected data source: %s (%d TIF files)", source, length(tifs)))
+        return(source)
+      }
+    }
+  }
+  
+  smartcane_warn(sprintf("No data source found for %s - defaulting to merged_tif_8b", project_dir))
+  return("merged_tif_8b")
+}
+
+# Check KPI completeness for a reporting period
+# Returns: List with kpis_df (data.frame), missing_count, and all_complete (boolean)
+# This replaces duplicate KPI checking logic in run_full_pipeline.R (lines ~228-270, ~786-810)
+check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
+  kpi_dir <- get_kpi_dir(project_dir, client_type)
+  
+  kpis_needed <- data.frame()
+  
+  for (weeks_back in 0:(reporting_weeks_needed - 1)) {
+    check_date <- end_date - (weeks_back * 7)
+    wwy <- get_iso_week_year(check_date)
+    
+    # Build week pattern and check if it exists
+    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
+    files_this_week <- list.files(kpi_dir, pattern = week_pattern)
+    has_kpis <- length(files_this_week) > 0
+    
+    # Track missing weeks
+    kpis_needed <- rbind(kpis_needed, data.frame(
+      week = wwy$week,
+      year = wwy$year,
+      date = check_date,
+      has_kpis = has_kpis,
+      pattern = week_pattern,
+      file_count = length(files_this_week)
+    ))
+    
+    # Debug logging
+    smartcane_debug(sprintf(
+      "Week %02d/%d (%s): %s (%d files)",
+      wwy$week, wwy$year, format(check_date, "%Y-%m-%d"),
+      if (has_kpis) "✓ FOUND" else "✗ MISSING",
+      length(files_this_week)
+    ))
+  }
+  
+  # Summary statistics
+  missing_count <- sum(!kpis_needed$has_kpis)
+  all_complete <- missing_count == 0
+  
+  return(list(
+    kpis_df = kpis_needed,
+    kpi_dir = kpi_dir,
+    missing_count = missing_count,
+    missing_weeks = kpis_needed[!kpis_needed$has_kpis, ],
+    all_complete = all_complete
+  ))
+}
+
 # 9. Initialize the project
 # ----------------------
 # Export project directories and settings
--- a/r_app/run_full_pipeline.R
+++ b/r_app/run_full_pipeline.R