updated sc-91

2026-01-29 17:26:03 +01:00 · 2026-01-29 17:26:03 +01:00 · d1f352f21c
parent 4445f72e6f
commit d1f352f21c
5 changed files with 1523 additions and 541 deletions
--- a/CODE_REVIEW_FINDINGS.md
+++ b/CODE_REVIEW_FINDINGS.md
@ -0,0 +1,751 @@
 # SmartCane Pipeline Code Review
 ## Efficiency, Cleanup, and Architecture Analysis
 **Date**: January 29, 2026  
 **Scope**: `run_full_pipeline.R` + all called scripts (10, 20, 21, 30, 31, 40, 80, 90, 91) + utility files  
 **Status**: Comprehensive review completed
 ---
 ## EXECUTIVE SUMMARY
 Your pipeline is **well-structured and intentional**, but has accumulated significant technical debt through development iterations. The main issues are:
 1. **🔴 HIGH IMPACT**: **3 separate mosaic mode detection functions** doing identical work
 2. **🔴 HIGH IMPACT**: **Week/year calculations duplicated 13+ times** across 5+ files
 3. **🟡 MEDIUM IMPACT**: **40+ debug statements** cluttering output
 4. **🟡 MEDIUM IMPACT**: **File existence checks repeated** in multiple places (especially KPI checks)
 5. **🟢 LOW IMPACT**: Minor redundancy in command construction, but manageable
 **Estimated cleanup effort**: 2-3 hours for core refactoring; significant code quality gains.
 **Workflow clarity issue**: The split between `merged_tif` vs `merged_tif_8b` and `weekly_mosaic` vs `weekly_tile_max` is **not clearly documented**. This should be clarified.
 ---
 ## 1. DUPLICATED FUNCTIONS & LOGIC
 ### 1.1 Mosaic Mode Detection (CRITICAL REDUNDANCY)
 **Problem**: Three implementations of the same tiled-vs-single-file detection, two of them near-identical copies:
 | Location | Function Name | Lines | Issue |
 |----------|---------------|-------|-------|
 | `run_full_pipeline.R` | `detect_mosaic_mode_early()` | ~20 lines | Detects tiled vs single-file |
 | `run_full_pipeline.R` | `detect_mosaic_mode_simple()` | ~20 lines | Detects tiled vs single-file (duplicate) |
 | `parameters_project.R` | `detect_mosaic_mode()` | ~30 lines | Detects tiled vs single-file (different signature) |
 **Impact**: If you change the detection logic, you must update 3 places. Bug risk is high.
 **Solution**: Create **single canonical function in `parameters_project.R`**:
 ```r
 # SINGLE SOURCE OF TRUTH
 detect_mosaic_mode <- function(project_dir) {
  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
  if (dir.exists(weekly_tile_max)) {
    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
    if (length(grep("^\\d+x\\d+$", subfolders)) > 0) return("tiled")
  }
  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
  if (dir.exists(weekly_mosaic) && 
      length(list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")) > 0) {
    return("single-file")
  }
  return("unknown")
 }
 ```
 Then replace the calls to `detect_mosaic_mode_early()` and `detect_mosaic_mode_simple()` in `run_full_pipeline.R` with this single function, and remove both local copies.
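 The grid-folder test is the part of this logic most likely to drift between copies, so it may be worth pinning down in isolation. The sketch below is illustrative only: `is_grid_folder()` is a hypothetical helper that does not exist in the codebase, and it assumes grid folders are always named `<rows>x<cols>`.
 ```r
 # Hypothetical helper: TRUE for folder names shaped like "5x5" or "10x10"
 is_grid_folder <- function(folder_names) {
   grepl("^\\d+x\\d+$", folder_names)
 }
 # Folder names made up for illustration
 candidates <- c("5x5", "10x10", "tmp", "5x5_old", "x5")
 print(candidates[is_grid_folder(candidates)])
 ```
 Only `5x5` and `10x10` pass. Names with a suffix or a missing dimension are ignored because the pattern is anchored at both ends.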
 ---
 ### 1.2 Week/Year Calculations (CRITICAL REDUNDANCY)
 **Problem**: The pattern `week_num <- as.numeric(format(..., "%V"))` + `year_num <- as.numeric(format(..., "%G"))` appears **13+ times** across multiple files.
 **Locations**:
 - `run_full_pipeline.R`: Lines 82, 126-127, 229-230, 630, 793-794 (5 times)
 - `80_calculate_kpis.R`: Lines 323-324 (1 time)
 - `80_weekly_stats_utils.R`: Lines 829-830 (1 time)
 - `kpi_utils.R`: Line 45 (1 time)
 - `80_kpi_utils.R`: Lines 177-178 (1 time)
 - Plus inline in sprintf statements: ~10+ additional times
 **Impact**: 
 - High maintenance burden
 - Risk of inconsistency (`%G` vs `%Y` confusion noted at line 82 in `run_full_pipeline.R`)
 - Code verbosity
 **Solution**: Create **utility function in `parameters_project.R`**:
 ```r
 get_iso_week_year <- function(date) {
  list(
    week = as.numeric(format(date, "%V")),
    year = as.numeric(format(date, "%G"))  # ISO year, not calendar year
  )
 }
 # Usage:
 wwy <- get_iso_week_year(end_date)
 cat(sprintf("Week %02d/%d\n", wwy$week, wwy$year))
 ```
 **Also add convenience function**:
 ```r
 format_week_year <- function(date, separator = "_") {
  wwy <- get_iso_week_year(date)
  sprintf("week_%02d%s%d", wwy$week, separator, wwy$year)
 }
 # Usage: format_week_year(end_date)  # "week_02_2026"
 ```
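 A quick way to see why the ISO year matters is to evaluate the helper on a date near a year boundary. The snippet below is a minimal sketch that repeats the proposed `get_iso_week_year()` so it runs on its own. The date is chosen purely for illustration.
 ```r
 get_iso_week_year <- function(date) {
   list(
     week = as.numeric(format(date, "%V")),
     year = as.numeric(format(date, "%G"))  # ISO year, not calendar year
   )
 }
 # 2025-12-30 is a Tuesday; ISO week 1 of 2026 starts on Monday 2025-12-29
 boundary <- as.Date("2025-12-30")
 wwy <- get_iso_week_year(boundary)
 calendar_year <- as.numeric(format(boundary, "%Y"))
 cat(sprintf("ISO: week %02d / %d, calendar year: %d\n",
             wwy$week, wwy$year, calendar_year))
 # prints: ISO: week 01 / 2026, calendar year: 2025
 ```
 Pairing `%V` with `%Y` on this date would label the week as week 01 of 2025, which is the mismatch the helper is meant to rule out.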
 ---
 ### 1.3 File Path Construction (MEDIUM REDUNDANCY)
 **Problem**: Repeated patterns like:
 ```r
 file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
 file.path("laravel_app", "storage", "app", project_dir, "reports", "kpis", kpi_subdir)
 ```
 **Solution**: Centralize in `parameters_project.R`:
 ```r
 # Project-agnostic path builders
 get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
 }
 get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
  if (mosaic_mode == "auto") mosaic_mode <- detect_mosaic_mode(project_dir)
  if (mosaic_mode == "tiled") {
    get_project_storage_path(project_dir, "weekly_tile_max/5x5")
  } else {
    get_project_storage_path(project_dir, "weekly_mosaic")
  }
 }
 get_kpi_dir <- function(project_dir, client_type) {
  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
 }
 ```
 ---
 ## 2. DEBUG STATEMENTS & LOGGING CLUTTER
 ### 2.1 Excessive Debug Output
 The pipeline prints **40+ debug statements** that pollute the terminal output. Examples:
 **In `run_full_pipeline.R`**:
 ```r
 Line 82:   cat(sprintf("       Running week: %02d / %d\n", ...))  # Note: %Y (calendar year) should be %G
 Line 218:  cat(sprintf("[KPI_DIR_CREATED] Created directory: %s\n", ...))
 Line 223:  cat(sprintf("[KPI_DIR_EXISTS] %s\n", ...))
 Line 224:  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", ...))
 Line 225:  cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
 Line 240:  cat(sprintf("[KPI_DEBUG_W%02d_%d] Pattern: '%s' | Found: %d files\n", ...))
 Line 630:  cat("DEBUG: Running command:", cmd, "\n")  # Script 31 execution: prints full conda command
 ```
 **In `80_calculate_kpis.R`**:
 ```
 Line 323:  message(paste("Calculating statistics for all fields - Week", week_num, year))
 Line 417:  # Plus many more ...
 ```
 **Impact**: 
 - Makes output hard to scan for real issues
 - Test developers skip important messages
 - Production logs become noise
 **Solution**: Replace with **structured logging** (3 levels):
 ```r
 # Add to parameters_project.R
 smartcane_log <- function(message, level = "INFO") {
  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  prefix <- sprintf("[%s] %s", level, timestamp)
  cat(sprintf("%s | %s\n", prefix, message))
 }
 smartcane_debug <- function(message) {
  if (Sys.getenv("SMARTCANE_DEBUG") == "TRUE") {
    smartcane_log(message, level = "DEBUG")
  }
 }
 smartcane_warn <- function(message) {
  smartcane_log(message, level = "WARN")
 }
 ```
 **Usage**:
 ```r
 # Keep important messages
 smartcane_log(sprintf("Downloaded %d dates, %d failed", download_count, download_failed))
 # Hide debug clutter (only show if DEBUG=TRUE)
 smartcane_debug(sprintf("KPI directory exists: %s", kpi_dir))
 # Warnings stay visible
 smartcane_warn("Some downloads failed, but continuing pipeline")
 ```
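 To check that the debug toggle behaves as intended, the functions can be exercised with the environment variable switched off and on. This is a minimal sketch that repeats simplified versions of the proposed functions so it is self-contained. It uses `capture.output()` only to count emitted lines.
 ```r
 smartcane_log <- function(message, level = "INFO") {
   timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
   cat(sprintf("[%s] %s | %s\n", level, timestamp, message))
 }
 smartcane_debug <- function(message) {
   if (Sys.getenv("SMARTCANE_DEBUG") == "TRUE") {
     smartcane_log(message, level = "DEBUG")
   }
 }
 # Debug off: nothing is printed
 Sys.setenv(SMARTCANE_DEBUG = "FALSE")
 quiet <- capture.output(smartcane_debug("KPI directory exists"))
 # Debug on: one line is printed
 Sys.setenv(SMARTCANE_DEBUG = "TRUE")
 noisy <- capture.output(smartcane_debug("KPI directory exists"))
 Sys.unsetenv("SMARTCANE_DEBUG")
 cat(sprintf("debug off: %d line(s), debug on: %d line(s)\n",
             length(quiet), length(noisy)))
 # prints: debug off: 0 line(s), debug on: 1 line(s)
 ```
 Because the check is a strict comparison against `"TRUE"`, values such as `"true"` or `"1"` leave debug output disabled. That may or may not be the desired behaviour.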
 ---
 ### 2.2 Redundant Status Checks in KPI Section
 **Lines 218-270 in `run_full_pipeline.R`**: The KPI requirement check has **deeply nested debug statements**.
 ```r
 if (dir.exists(kpi_dir)) {
  cat(sprintf("[KPI_DIR_EXISTS] %s\n", kpi_dir))
  all_kpi_files <- list.files(kpi_dir)
  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", length(all_kpi_files)))
  if (length(all_kpi_files) > 0) {
    cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
  }
 } else {
  cat(sprintf("[KPI_DIR_MISSING] Directory does not exist: %s\n", kpi_dir))
 }
 ```
 **Solution**: Simplify to:
 ```r
 if (!dir.exists(kpi_dir)) {
  dir.create(kpi_dir, recursive = TRUE, showWarnings = FALSE)
 }
 all_kpi_files <- list.files(kpi_dir)
 smartcane_debug(sprintf("KPI directory: %d files found", length(all_kpi_files)))
 ```
 ---
 ## 3. DOUBLE CALCULATIONS & INEFFICIENCIES
 ### 3.1 KPI Existence Check (Calculated Twice)
 **Problem**: KPI existence is checked **twice** in `run_full_pipeline.R`:
 1. **First check (Lines 228-270)**: Initial KPI requirement check that calculates `kpis_needed` dataframe
 2. **Second check (Lines 786-810)**: Verification after Script 80 runs (almost identical logic)
 Both loops do:
 ```r
 for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  week_num <- as.numeric(format(check_date, "%V"))
  year_num <- as.numeric(format(check_date, "%G"))
  week_pattern <- sprintf("week%02d_%d", week_num, year_num)
  kpi_files_this_week <- list.files(kpi_dir, pattern = week_pattern)
  has_kpis <- length(kpi_files_this_week) > 0
  # ... same logic again
 }
 ```
 **Impact**: Slower pipeline execution, code duplication
 **Solution**: Create **reusable function in utility file**:
 ```r
 check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
  kpi_dir <- get_kpi_dir(project_dir, client_type)
  kpis_needed <- data.frame()
  for (weeks_back in 0:(reporting_weeks_needed - 1)) {
    check_date <- end_date - (weeks_back * 7)
    wwy <- get_iso_week_year(check_date)
    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
    has_kpis <- any(grepl(week_pattern, list.files(kpi_dir)))
    kpis_needed <- rbind(kpis_needed, data.frame(
      week = wwy$week,
      year = wwy$year,
      date = check_date,
      has_kpis = has_kpis
    ))
  }
  return(list(
    kpis_df = kpis_needed,
    missing_count = sum(!kpis_needed$has_kpis),
    all_complete = all(kpis_needed$has_kpis)
  ))
 }
 # Then in run_full_pipeline.R:
 initial_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
 # ... after Script 80 runs:
 final_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
 if (final_kpi_check$all_complete) {
  smartcane_log("✓ All KPIs available")
 }
 ```
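 Since `rbind()` inside a loop copies the data frame on every iteration, a vectorised variant may be preferable. The sketch below is one possible shape, not the pipeline's implementation: `check_kpi_weeks()` is a hypothetical name, it takes the KPI directory directly instead of resolving it, and the demo file names are invented.
 ```r
 check_kpi_weeks <- function(kpi_dir, end_date, n_weeks) {
   # One date per reporting week, newest first
   check_dates <- end_date - 7 * (seq_len(n_weeks) - 1)
   weeks <- as.integer(format(check_dates, "%V"))
   years <- as.integer(format(check_dates, "%G"))
   patterns <- sprintf("week%02d_%d", weeks, years)
   files <- list.files(kpi_dir)
   # A week counts as covered if any file name contains its pattern
   has_kpis <- vapply(
     patterns,
     function(p) any(grepl(p, files, fixed = TRUE)),
     logical(1),
     USE.NAMES = FALSE
   )
   data.frame(week = weeks, year = years, date = check_dates, has_kpis = has_kpis)
 }
 # Demo against a throwaway directory with made-up file names
 demo_dir <- file.path(tempdir(), "kpi_demo")
 dir.create(demo_dir, showWarnings = FALSE)
 invisible(file.create(file.path(demo_dir, c("kpi_week05_2026.rds", "kpi_week03_2026.rds"))))
 result <- check_kpi_weeks(demo_dir, as.Date("2026-01-29"), n_weeks = 3)
 print(result)
 ```
 In the demo, weeks 5 and 3 of 2026 are reported as present and week 4 as missing. `missing_count` and `all_complete` can be derived from the `has_kpis` column as in the function above.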
 ---
 ### 3.2 Mosaic Mode Detection (Called Twice per Run)
 **Current code**:
 - Line 99-117: `detect_mosaic_mode_early()` called once
 - Line 301-324: `detect_mosaic_mode_simple()` called again
 - Result: **Same detection logic runs twice unnecessarily**
 **Solution**: Call once, store result:
 ```r
 mosaic_mode <- detect_mosaic_mode(project_dir)  # Once at top
 # Then reuse throughout:
 if (mosaic_mode == "tiled") {
   # ... tiled branch
 } else if (mosaic_mode == "single-file") {
   # ... single-file branch
 }
 ```
 ---
 ### 3.3 Missing Weeks Calculation Inefficiency
 **Lines 126-170**: The loop builds `weeks_needed` dataframe, then **immediately** iterates again to find which ones are missing.
 **Current code**:
 ```r
 # First: build all weeks
 weeks_needed <- data.frame()
 for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  # ... build weeks_needed
 }
 # Then: check which are missing (loop again)
 missing_weeks <- data.frame()
 for (i in 1:nrow(weeks_needed)) {
  # ... check each week
 }
 ```
 **Solution**: Combine into **single loop**:
 ```r
 weeks_needed <- data.frame()
 missing_weeks <- data.frame()
 earliest_missing_date <- end_date
 for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  wwy <- get_iso_week_year(check_date)
  # Add to weeks_needed
  weeks_needed <- rbind(weeks_needed, data.frame(
    week = wwy$week, year = wwy$year, date = check_date
  ))
  # Check if missing, add to missing_weeks if so
  week_pattern <- sprintf("week_%02d_%d", wwy$week, wwy$year)
  mosaic_dir <- get_mosaic_dir(project_dir, mosaic_mode)
  if (length(list.files(mosaic_dir, pattern = week_pattern)) == 0) {
    missing_weeks <- rbind(missing_weeks, data.frame(
      week = wwy$week, year = wwy$year, week_end_date = check_date
    ))
    if (check_date - 6 < earliest_missing_date) {
      earliest_missing_date <- check_date - 6
    }
  }
 }
 ```
 ---
 ### 3.4 Data Source Detection Logic
 **Lines 58-84**: The `data_source_used` detection is overly complex:
 ```r
 data_source_used <- "merged_tif_8b"  # Default
 if (dir.exists(merged_tif_path)) {
  tif_files <- list.files(merged_tif_path, pattern = "\\.tif$")
  if (length(tif_files) > 0) {
    data_source_used <- "merged_tif"
    # ...
  } else if (dir.exists(merged_tif_8b_path)) {
    tif_files_8b <- list.files(merged_tif_8b_path, pattern = "\\.tif$")
    # ...
  }
 } else if (dir.exists(merged_tif_8b_path)) {
  # ...
 }
 ```
 **Issues**:
 - Multiple nested conditions doing the same check
 - `tif_files` and `tif_files_8b` are listed but only counts checked (not used later)
 - Logic could be cleaner
 **Solution**: Create utility function:
 ```r
 detect_data_source <- function(project_dir, preferred = "auto") {
  storage_dir <- get_project_storage_path(project_dir)
  for (source in c("merged_tif", "merged_tif_8b")) {
    source_dir <- file.path(storage_dir, source)
    if (dir.exists(source_dir)) {
      tifs <- list.files(source_dir, pattern = "\\.tif$")
      if (length(tifs) > 0) return(source)
    }
  }
  smartcane_warn("No data source found - defaulting to merged_tif_8b")
  return("merged_tif_8b")
 }
 ```
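 The detection order can be verified without touching project data by pointing a variant of the function at a temporary directory. The sketch below makes two assumptions of its own: `detect_data_source_in()` is a hypothetical name that takes the storage directory directly, and it returns `NA` when nothing is found so the caller decides on the default.
 ```r
 detect_data_source_in <- function(storage_dir,
                                   candidates = c("merged_tif", "merged_tif_8b")) {
   for (candidate in candidates) {
     # list.files() returns character(0) for a missing directory
     tifs <- list.files(file.path(storage_dir, candidate), pattern = "\\.tif$")
     if (length(tifs) > 0) return(candidate)
   }
   NA_character_
 }
 # Demo layout: empty merged_tif/, one file in merged_tif_8b/ (file name is made up)
 storage <- file.path(tempdir(), "source_demo")
 dir.create(file.path(storage, "merged_tif"), recursive = TRUE, showWarnings = FALSE)
 dir.create(file.path(storage, "merged_tif_8b"), showWarnings = FALSE)
 invisible(file.create(file.path(storage, "merged_tif_8b", "2026-01-29.tif")))
 found <- detect_data_source_in(storage)
 print(found)
 ```
 An empty `merged_tif/` folder is skipped in favour of `merged_tif_8b/`, which is the case the nested conditions in the current code handle with a separate branch.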
 ---
 ## 4. WORKFLOW CLARITY ISSUES
 ### 4.1 TIFF Data Format Confusion
 **Problem**: Why are there TWO different TIFF folders?
 - `merged_tif`: 4-band data (RGB + NIR)
 - `merged_tif_8b`: 8-band data (appears to include UDM cloud masking from Planet)
 **Currently in code**:
 ```r
 data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
 ```
 **Issues**:
 - Hard-coded per project, not based on what's actually available
 - Not documented **why** angata uses 8-band
 - Unclear what the 8-band data adds (cloud masking? extra bands?)
 - Scripts handle both, but it's not clear when to use which
 **Recommendation**:
 1. **Document in `parameters_project.R`** what each data source contains:
 ```r
 DATA_SOURCE_FORMATS <- list(
  "merged_tif" = list(
    bands = 4,
    description = "4-band PlanetScope: Red, Green, Blue, NIR",
    projects = c("aura", "chemba", "xinavane"),
    note = "Standard format from Planet API"
  ),
  "merged_tif_8b" = list(
    bands = 8,
    description = "8-band PlanetScope with UDM: RGB+NIR + 4-band cloud mask",
    projects = c("angata"),
    note = "Enhanced with cloud confidence from UDM2 (Unusable Data Mask)"
  )
 )
 ```
 2. **Update hard-coded assignment** to be data-driven:
 ```r
 # OLD: data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
 # NEW: detect what's actually available
 data_source <- detect_data_source(project_dir)
 ```
 ---
 ### 4.2 Mosaic Storage Format Confusion
 **Problem**: Why are there TWO different mosaic storage styles?
 - `weekly_mosaic/`: Single TIF file per week (monolithic)
 - `weekly_tile_max/5x5/`: Tiled TIFFs per week (25+ files per week)
 **Currently in code**:
 - Detected automatically via `detect_mosaic_mode()`
 - But **no documentation** on when/why each is used
 **Recommendation**:
 1. **Document the trade-offs in `parameters_project.R`**:
 ```r
 MOSAIC_MODES <- list(
  "single-file" = list(
    description = "One TIF per week",
    storage_path = "weekly_mosaic/",
    files_per_week = 1,
    pros = c("Simpler file management", "Easier to load full mosaic"),
    cons = c("Slower for field-specific analysis", "Large file I/O"),
    suitable_for = c("agronomic_support", "dashboard visualization")
  ),
  "tiled" = list(
    description = "5×5 grid of tiles per week",
    storage_path = "weekly_tile_max/5x5/",
    files_per_week = 25,
    pros = c("Parallel field processing", "Faster per-field queries", "Scalable to 1000+ fields"),
    cons = c("More file management", "Requires tile_grid metadata"),
    suitable_for = c("cane_supply", "large-scale operations")
  )
 )
 ```
 2. **Document why angata uses tiled, aura uses single-file**:
   - Is it a function of field count? (Angata = cane_supply, large fields → tiled)
   - Is it historical? (Legacy decision?)
   - Should new projects choose based on client type?
 ---
 ### 4.3 Client Type Mapping Clarity
 **Current structure** in `parameters_project.R`:
 ```r
 CLIENT_TYPE_MAP <- list(
  "angata" = "cane_supply",
  "aura" = "agronomic_support",
  "chemba" = "cane_supply",
  "xinavane" = "cane_supply",
  "esa" = "cane_supply"
 )
 ```
 **Issues**:
 - Not clear **why** aura is agronomic_support while angata/chemba are cane_supply
 - No documentation of what each client type needs
 - Scripts branch heavily on `skip_cane_supply_only` logic
 **Recommendation**: 
 Add metadata to explain the distinction:
 ```r
 CLIENT_TYPES <- list(
  "cane_supply" = list(
    description = "Sugar mill supply chain optimization",
    requires_harvest_prediction = TRUE,  # Script 31
    requires_phase_assignment = TRUE,     # Based on planting date
    per_field_detail = TRUE,              # Script 91 Excel report
    data_sources = c("merged_tif", "merged_tif_8b"),
    mosaic_mode = "tiled",
    projects = c("angata", "chemba", "xinavane", "esa")
  ),
  "agronomic_support" = list(
    description = "Farm-level decision support for agronomists",
    requires_harvest_prediction = FALSE,
    requires_phase_assignment = FALSE,
    per_field_detail = FALSE,
    farm_level_kpis = TRUE,               # Script 90 Word report
    data_sources = c("merged_tif"),
    mosaic_mode = "single-file",
    projects = c("aura")
  )
 )
 ```
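 Once the metadata exists, scripts could branch on it rather than on project names or `skip_cane_supply_only` flags. The following is a sketch under that assumption. `get_client_setting()` is a hypothetical accessor, and the two lists are trimmed copies of the structures above so the block runs on its own.
 ```r
 CLIENT_TYPES <- list(
   cane_supply = list(requires_harvest_prediction = TRUE, mosaic_mode = "tiled"),
   agronomic_support = list(requires_harvest_prediction = FALSE, mosaic_mode = "single-file")
 )
 CLIENT_TYPE_MAP <- list(angata = "cane_supply", aura = "agronomic_support")
 # Hypothetical accessor: fails loudly on unknown projects or settings
 get_client_setting <- function(project_dir, setting) {
   client_type <- CLIENT_TYPE_MAP[[project_dir]]
   if (is.null(client_type)) stop("Unknown project: ", project_dir)
   value <- CLIENT_TYPES[[client_type]][[setting]]
   if (is.null(value)) stop("Unknown setting: ", setting)
   value
 }
 if (get_client_setting("angata", "requires_harvest_prediction")) {
   cat("angata: run Script 31\n")
 }
 cat("aura mosaic mode:", get_client_setting("aura", "mosaic_mode"), "\n")
 ```
 Stopping on an unknown project or setting is a design choice for this sketch. It surfaces typos early instead of silently falling through to a default branch.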
 ---
 ## 5. COMMAND CONSTRUCTION REDUNDANCY
 ### 5.1 Rscript Path Repetition
 **Problem**: The Rscript path is repeated 5 times, for example:
 ```r
 Line 519:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
 Line 676:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
 Line 685:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
 ```
 **Solution**: Define once in `parameters_project.R`:
 ```r
 RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"
 # Usage:
 cmd <- sprintf('"%s" --vanilla r_app/20_ci_extraction.R ...', RSCRIPT_PATH)
 ```
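 A hard-coded Windows path will not resolve in a Linux environment, so one option is to derive the path from the running R installation. This is a sketch rather than a drop-in replacement: `build_rscript_cmd()` is a hypothetical helper, and the `"angata"` argument is only a placeholder because the real script arguments are not shown above.
 ```r
 # Locate Rscript next to the R binary that is currently running
 rscript_name <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
 RSCRIPT_PATH <- file.path(R.home("bin"), rscript_name)
 # Hypothetical helper: quote every part so spaces in paths survive the shell
 build_rscript_cmd <- function(script, args = character()) {
   paste(c(shQuote(RSCRIPT_PATH), "--vanilla", shQuote(script), shQuote(args)),
         collapse = " ")
 }
 cmd <- build_rscript_cmd("r_app/20_ci_extraction.R", args = "angata")
 cat(cmd, "\n")
 ```
 This ties child scripts to the same R version as the parent process, which may or may not be what is wanted if several R versions are installed.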
 ---
 ## 6. SPECIFIC LINE-BY-LINE ISSUES
 ### 6.1 Line 82 Bug: Wrong Format Code
 ```r
 cat(sprintf("       Running week: %02d / %d\n", 
            as.numeric(format(end_date, "%V")), 
            as.numeric(format(end_date, "%Y"))))  # ❌ Should be %G, not %Y
 ```
 **Issue**: Uses calendar year `%Y` instead of ISO week year `%G`. On dates like 2025-12-30 (ISO week 1 of 2026), this prints `Running week: 01 / 2025`, pairing the week number with the wrong year.
 **Fix**:
 ```r
 wwy <- get_iso_week_year(end_date)
 cat(sprintf("       Running week: %02d / %d\n", wwy$week, wwy$year))
 ```
 ---
 ### 6.2 Line 630 Debug Statement
 ```r
 cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
 cat("DEBUG: Running command:", cmd, "\n")  # ❌ Prints full conda command
 ```
 **Solution**: Use `smartcane_debug()` function:
 ```r
 cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
 smartcane_debug(sprintf("Running Python 31: %s", cmd))
 ```
 ---
 ### 6.3 Lines 719-723: Verbose Script 31 Verification
 ```r
 # Check for THIS WEEK's specific file
 current_week <- as.numeric(format(end_date, "%V"))
 current_year <- as.numeric(format(end_date, "%Y"))
 expected_file <- file.path(...)
 ```
 **Issue**: Calculates week twice (already done earlier). Also uses `%Y` (should be `%G`).
 **Solution**: Reuse earlier `wwy` calculation or create helper.
 ---
 ## 7. REFACTORING ROADMAP
 ### Phase 1: Foundation (1 hour)
 - [ ] Consolidate `detect_mosaic_mode()` into single function in `parameters_project.R`
 - [ ] Create `get_iso_week_year()` and `format_week_year()` utilities
 - [ ] Create `get_project_storage_path()`, `get_mosaic_dir()`, `get_kpi_dir()` helpers
 - [ ] Add logging functions (`smartcane_log()`, `smartcane_debug()`, `smartcane_warn()`)
 ### Phase 2: Deduplication (1 hour)
 - [ ] Replace all 13+ week_num/year_num calculations with `get_iso_week_year()`
 - [ ] Replace all 3 `detect_mosaic_mode_*()` calls with single function
 - [ ] Combine duplicate KPI checks into `check_kpi_completeness()` function
 - [ ] Fix the `%Y` format bugs at line 82 and lines 719-723
 ### Phase 3: Cleanup (1 hour)
 - [ ] Remove all debug statements (40+), replace with `smartcane_debug()`
 - [ ] Simplify nested conditions in data_source detection
 - [ ] Combine missing weeks detection into single loop
 - [ ] Extract Rscript path to constant
 ### Phase 4: Documentation (30 min)
 - [ ] Add comments explaining `merged_tif` vs `merged_tif_8b` trade-offs
 - [ ] Document `single-file` vs `tiled` mosaic modes and when to use each
 - [ ] Clarify client type mapping in `CLIENT_TYPE_MAP`
 - [ ] Add inline comments for non-obvious logic
 ---
 ## 8. ARCHITECTURE & WORKFLOW RECOMMENDATIONS
 ### 8.1 Clear Data Flow Diagram
 Add to `r_app/system_architecture/system_architecture.md`:
 ```
 INPUT SOURCES:
  ├── Planet API 4-band or 8-band imagery
  ├── Field boundaries (pivot.geojson)
  └── Harvest data (harvest.xlsx, optional for cane_supply)
 STORAGE TIERS:
  ├── Tier 1: Raw data (merged_tif/ or merged_tif_8b/)
  ├── Tier 2: Daily tiles (daily_tiles_split/{grid_size}/{dates}/)
  ├── Tier 3: Extracted CI (Data/extracted_ci/daily_vals/*.rds)
  ├── Tier 4: Weekly mosaics (weekly_mosaic/ OR weekly_tile_max/5x5/)
  └── Tier 5: KPI outputs (reports/kpis/{field_level|field_analysis}/)
 DECISION POINTS:
  └─ Client type (cane_supply vs agronomic_support)
     ├─ Drives script selection (Scripts 21, 22, 23, 31, 90/91)
     ├─ Drives data source (merged_tif_8b for cane_supply, merged_tif for agronomic)
     ├─ Drives mosaic mode (tiled for cane_supply, single-file for agronomic)
     └─ Drives KPI subdirectory (field_analysis vs field_level)
 ```
 ### 8.2 .sh Scripts Alignment
 You mention `.sh` scripts in the online environment. If they're **not calling the R pipeline**, there's a **split responsibility** issue:
 **Question**: Are the `.sh` scripts:
 - (A) Independent duplicates of the R pipeline logic? (BAD - maintenance nightmare)
 - (B) Wrappers calling the R pipeline? (GOOD - single source of truth)
 - (C) Different workflow for online vs local? (RED FLAG - they diverge)
 **Recommendation**: If using `.sh` for production, ensure they **call the same R scripts** (`run_full_pipeline.R`). Example:
 ```bash
 #!/bin/bash
 # Wrapper that ensures the R pipeline is called (assumes Rscript is on PATH)
 cd /path/to/smartcane || exit 1
 Rscript r_app/run_full_pipeline.R
 ```
 ---
 ## 9. SUMMARY TABLE: Issues by Severity
 | Issue | Type | Impact | Effort | Priority |
 |-------|------|--------|--------|----------|
 | 3 mosaic detection functions | Duplication | HIGH | 30 min | P0 |
 | 13+ week/year calculations | Duplication | HIGH | 1 hour | P0 |
 | 40+ debug statements | Clutter | MEDIUM | 1 hour | P1 |
 | KPI check run twice | Inefficiency | LOW | 30 min | P2 |
 | Line 82: %Y should be %G | Bug | LOW | 5 min | P2 |
 | Data source confusion | Documentation | MEDIUM | 30 min | P1 |
 | Mosaic mode confusion | Documentation | MEDIUM | 30 min | P1 |
 | Client type mapping | Documentation | MEDIUM | 30 min | P1 |
 | Data source detection complexity | Code style | LOW | 15 min | P3 |
 ---
 ## 10. RECOMMENDED NEXT STEPS
 1. **Review this report** with your team to align on priorities
 2. **Create Linear issues** for each phase of refactoring
 3. **Start with Phase 1** (foundation utilities) - builds confidence for Phase 2
 4. **Test thoroughly** after each phase - the pipeline is complex and easy to break
 5. **Update `.sh` scripts** if they duplicate R logic
 6. **Document data flow** in `system_architecture/system_architecture.md`
 ---
 ## Questions for Clarification
 Before implementing, please clarify:
 1. **Data source split**: Why does angata use `merged_tif_8b` (8-band with cloud mask) while aura uses `merged_tif` (4-band)? Is this:
   - A function of client need (cane_supply requires cloud masking)?
   - Historical (legacy decision for angata)?
   - Should new projects choose based on availability?
 2. **Mosaic mode split**: Why tiled for angata but single-file for aura? Should this be:
   - Hard-coded per project?
   - Based on field count/client type?
   - Auto-detected from first run?
 3. **Production vs local**: Are the `.sh` scripts in the online environment:
   - Calling this same R pipeline?
   - Duplicating logic independently?
   - A different workflow entirely?
 4. **Client type growth**: Are there other client types planned beyond `cane_supply` and `agronomic_support`? (e.g., extension_service?)
 ---
 **Report prepared**: January 29, 2026  
 **Total code reviewed**: ~2,500 lines across 10 files  
 **Estimated refactoring time**: 3-4 hours  
 **Estimated maintenance savings**: 5-10 hours/month (fewer bugs, easier updates)
--- a/r_app/40_mosaic_creation.R
+++ b/r_app/40_mosaic_creation.R
@ -188,7 +188,7 @@ main <- function() {
  if (!exists("use_tile_mosaic")) {
    # Fallback detection if flag not set (shouldn't happen)
    merged_final_dir <- file.path(laravel_storage, "merged_final_tif")
-    tile_detection <- detect_mosaic_mode(merged_final_dir)
+    tile_detection <- detect_tile_structure_from_merged_final(merged_final_dir)
    use_tile_mosaic <- tile_detection$has_tiles
  }
--- a/r_app/40_mosaic_creation_utils.R
+++ b/r_app/40_mosaic_creation_utils.R
@ -3,12 +3,12 @@
 # Utility functions for creating weekly mosaics from daily satellite imagery.
 # These functions support cloud cover assessment, date handling, and mosaic creation.
-#' Detect whether a project uses tile-based or single-file mosaic approach
+#' Detect whether a project uses tile-based or single-file mosaic approach (utility version)
 #'
 #' @param merged_final_tif_dir Directory containing merged_final_tif files
 #' @return List with has_tiles (logical), detected_tiles (vector), total_files (count)
 #'
-detect_mosaic_mode <- function(merged_final_tif_dir) {
+detect_tile_structure_from_files <- function(merged_final_tif_dir) {
  # Check if directory exists
  if (!dir.exists(merged_final_tif_dir)) {
    return(list(has_tiles = FALSE, detected_tiles = character(), total_files = 0))
--- a/r_app/parameters_project.R
+++ b/r_app/parameters_project.R
@ -114,7 +114,7 @@ get_client_kpi_config <- function(client_type) {
 # 3. Smart detection for tile-based vs single-file mosaic approach
 # ----------------------------------------------------------------
-detect_mosaic_mode <- function(merged_final_tif_dir, daily_tiles_split_dir = NULL) {
+detect_tile_structure_from_merged_final <- function(merged_final_tif_dir, daily_tiles_split_dir = NULL) {
  # PRIORITY 1: Check for tiling_config.json metadata file from script 10
  # This is the most reliable source since script 10 explicitly records its decision
@ -223,7 +223,7 @@ setup_project_directories <- function(project_dir, data_source = "merged_tif_8b"
  merged_final_dir <- here(laravel_storage_dir, "merged_final_tif")
  daily_tiles_split_dir <- here(laravel_storage_dir, "daily_tiles_split")
-  tile_detection <- detect_mosaic_mode(
+  tile_detection <- detect_tile_structure_from_merged_final(
    merged_final_tif_dir = merged_final_dir,
    daily_tiles_split_dir = daily_tiles_split_dir
  )
@ -498,6 +498,279 @@ setup_logging <- function(log_dir) {
  ))
 }
 # 8. HELPER FUNCTIONS FOR COMMON CALCULATIONS
 # -----------------------------------------------
 # Centralized functions to reduce duplication across scripts
 # Get ISO week and year from a date
 get_iso_week <- function(date) {
  as.numeric(format(date, "%V"))
 }
 get_iso_year <- function(date) {
  as.numeric(format(date, "%G"))
 }
 # Get both ISO week and year as a list
 get_iso_week_year <- function(date) {
  list(
    week = as.numeric(format(date, "%V")),
    year = as.numeric(format(date, "%G"))
  )
 }
 # Format week/year into a readable label
 format_week_label <- function(date, separator = "_") {
  wwy <- get_iso_week_year(date)
  sprintf("week%02d%s%d", wwy$week, separator, wwy$year)
 }
 # Auto-detect mosaic mode (tiled vs single-file)
 # Returns: "tiled", "single-file", or "unknown"
 detect_mosaic_mode <- function(project_dir) {
  # Check for tile-based approach: weekly_tile_max/{grid_size}/week_*.tif
  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
  if (dir.exists(weekly_tile_max)) {
    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
    grid_patterns <- grep("^\\d+x\\d+$", subfolders, value = TRUE)
    if (length(grid_patterns) > 0) {
      return("tiled")
    }
  }
  # Check for single-file approach: weekly_mosaic/week_*.tif
  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
  if (dir.exists(weekly_mosaic)) {
    files <- list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")
    if (length(files) > 0) {
      return("single-file")
    }
  }
  return("unknown")
 }
 # Auto-detect grid size from tile directory structure
 # Returns: e.g., "5x5", "10x10", or "unknown"
 detect_grid_size <- function(project_dir) {
  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
  if (dir.exists(weekly_tile_max)) {
    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
    grid_patterns <- grep("^\\d+x\\d+$", subfolders, value = TRUE)
    if (length(grid_patterns) > 0) {
      return(grid_patterns[1])  # Return first match (usually only one)
    }
  }
  return("unknown")
 }
 # Build storage paths consistently across all scripts
 get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
 }
 get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
  if (mosaic_mode == "auto") {
    mosaic_mode <- detect_mosaic_mode(project_dir)
  }
  if (mosaic_mode == "tiled") {
    grid_size <- detect_grid_size(project_dir)
    if (grid_size != "unknown") {
      get_project_storage_path(project_dir, file.path("weekly_tile_max", grid_size))
    } else {
      get_project_storage_path(project_dir, "weekly_tile_max/5x5")  # Fallback default
    }
  } else {
    get_project_storage_path(project_dir, "weekly_mosaic")
  }
 }
 get_kpi_dir <- function(project_dir, client_type) {
  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
 }
 # Logging functions for clean output
 smartcane_log <- function(message, level = "INFO", verbose = TRUE) {
  if (!verbose) return(invisible(NULL))
  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  prefix <- sprintf("[%s]", level)
  cat(sprintf("%s %s\n", prefix, message))
 }
 smartcane_debug <- function(message, verbose = FALSE) {
  if (!verbose && Sys.getenv("SMARTCANE_DEBUG") != "TRUE") {
    return(invisible(NULL))
  }
  smartcane_log(message, level = "DEBUG", verbose = TRUE)
 }
 smartcane_warn <- function(message) {
  smartcane_log(message, level = "WARN", verbose = TRUE)
 }
 # ============================================================================
 # PHASE 3 & 4: OPTIMIZATION & DOCUMENTATION
 # ============================================================================
 # System Constants
 # ----------------
 # Define once, use everywhere
 RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"
 # Used in run_full_pipeline.R for calling R scripts via system()
 # Data Source Documentation
 # ---------------------------
 # Explains the two satellite data formats and when to use each
 #
 # SmartCane uses PlanetScope imagery from Planet Labs API in two formats:
 #
 # 1. merged_tif (4-band):
 #    - Standard format: Red, Green, Blue, Near-Infrared
 #    - Size: ~150-200 MB per date
 #    - Use case: Agronomic support, general crop health monitoring
 #    - Projects: aura, xinavane
 #    - Cloud handling: Basic cloud masking from Planet metadata
 #
 # 2. merged_tif_8b (8-band with cloud confidence):
 #    - Enhanced format: 4-band imagery + 4-band UDM2 cloud mask
 #    - UDM2 bands: Clear, Snow, Shadow, Light Haze
 #    - Size: ~250-350 MB per date
 #    - Use case: Harvest prediction, supply chain optimization
 #    - Projects: angata, chemba, esa (cane_supply clients)
 #    - Cloud handling: Per-pixel cloud confidence from Planet UDM2
 #    - Why: Harvest-date prediction needs per-pixel cloud confidence so that
 #           cloudy observations can be excluded before predicting
 #
 # The system auto-detects which is available via detect_data_source()
 # Mosaic Mode Documentation
 # --------------------------
 # SmartCane supports two ways to store and process weekly mosaics:
 #
 # 1. Single-file mosaic ("single-file"):
 #    - One GeoTIFF per week: weekly_mosaic/week_02_2026.tif
 #    - 5 bands per file: R, G, B, NIR, CI (Canopy Index)
 #    - Size: ~300-500 MB per week
 #    - Pros: Simpler file management, easier full-field visualization
 #    - Cons: Slower for field-specific queries, requires loading full raster
 #    - Best for: Agronomic support (aura) with <100 fields
 #    - Script 40 output: 5-band single-file mosaic
 #
 # 2. Tiled mosaic ("tiled"):
 #    - Grid of tiles per week: weekly_tile_max/5x5/week_02_2026_{TT}.tif
 #    - Example: 25 files (5×5 grid), each tile holding the same 5 bands
 #    - Size: ~15-20 MB per tile, organized in folders
 #    - Pros: Parallel processing, fast field lookups, scales to 1000+ fields
 #    - Cons: More file I/O, requires tile-to-field mapping metadata
 #    - Best for: Cane supply (angata, chemba) with 500+ fields
 #    - Script 40 output: Per-tile tiff files in weekly_tile_max/{grid}/
 #    - Tile assignment: Field boundaries mapped to grid coordinates
 #
 # The system auto-detects which is available via detect_mosaic_mode()
 # Client Type Documentation
 # --------------------------
 # SmartCane runs different analysis pipelines based on client_type:
 #
 # CLIENT_TYPE: cane_supply
 #   Purpose: Optimize sugar mill supply chain (harvest scheduling)
 #   Scripts run: 20 (CI), 21 (RDS to CSV), 30 (Growth), 31 (Harvest pred), 40 (Mosaic), 80 (KPI), 91 (Excel)
 #   Outputs:
 #     - Per-field analysis: field status, growth phase, harvest readiness
 #     - Excel reports (Script 91): Detailed metrics for logistics planning
 #     - KPI directory: reports/kpis/field_analysis/ (one RDS per week)
 #   Harvest data: Required (harvest.xlsx - planting dates for phase assignment)
 #   Data source: merged_tif_8b (per-pixel cloud confidence keeps cloudy data out of predictions)
 #   Mosaic mode: tiled (scales to 500+ fields)
 #   Projects: angata, chemba, xinavane, esa
 #
 # CLIENT_TYPE: agronomic_support
 #   Purpose: Provide weekly crop health insights to agronomists
 #   Scripts run: 80 (KPI), 90 (Word report)
 #   Outputs:
 #     - Farm-level KPI summaries (no per-field breakdown)
 #     - Word reports (Script 90): Charts and trends for agronomist decision support
 #     - KPI directory: reports/kpis/field_level/ (one RDS per week)
 #   Harvest data: Not used
 #   Data source: merged_tif (simpler, smaller)
 #   Mosaic mode: single-file (100-200 fields)
 #   Projects: aura
 #
 # Detect data source (merged_tif vs merged_tif_8b) based on availability
 # Returns the first available source; defaults to merged_tif_8b if neither exists
 detect_data_source <- function(project_dir) {
  storage_dir <- get_project_storage_path(project_dir)
  # Preferred order: check merged_tif first, fall back to merged_tif_8b
  for (source in c("merged_tif", "merged_tif_8b")) {
    source_dir <- file.path(storage_dir, source)
    if (dir.exists(source_dir)) {
      tifs <- list.files(source_dir, pattern = "\\.tif$")
      if (length(tifs) > 0) {
        smartcane_log(sprintf("Detected data source: %s (%d TIF files)", source, length(tifs)))
        return(source)
      }
    }
  }
  smartcane_warn(sprintf("No data source found for %s - defaulting to merged_tif_8b", project_dir))
  return("merged_tif_8b")
 }
 # Check KPI completeness for a reporting period
 # Returns: List with kpis_df (data.frame), kpi_dir, missing_count, missing_weeks, and all_complete (logical)
 # This replaces duplicate KPI checking logic in run_full_pipeline.R (lines ~228-270, ~786-810)
 check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
  kpi_dir <- get_kpi_dir(project_dir, client_type)
  kpis_needed <- data.frame()
  # seq_len() yields an empty loop when reporting_weeks_needed is 0 (0:-1 would iterate twice)
  for (weeks_back in seq_len(reporting_weeks_needed) - 1) {
    check_date <- end_date - (weeks_back * 7)
    wwy <- get_iso_week_year(check_date)
    # Build week pattern and check if it exists
    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
    files_this_week <- list.files(kpi_dir, pattern = week_pattern)
    has_kpis <- length(files_this_week) > 0
    # Record the status of every week (missing weeks are filtered out below)
    kpis_needed <- rbind(kpis_needed, data.frame(
      week = wwy$week,
      year = wwy$year,
      date = check_date,
      has_kpis = has_kpis,
      pattern = week_pattern,
      file_count = length(files_this_week)
    ))
    # Debug logging
    smartcane_debug(sprintf(
      "Week %02d/%d (%s): %s (%d files)",
      wwy$week, wwy$year, format(check_date, "%Y-%m-%d"),
      if (has_kpis) "✓ FOUND" else "✗ MISSING",
      length(files_this_week)
    ))
  }
  # Summary statistics
  missing_count <- sum(!kpis_needed$has_kpis)
  all_complete <- missing_count == 0
  return(list(
    kpis_df = kpis_needed,
    kpi_dir = kpi_dir,
    missing_count = missing_count,
    missing_weeks = kpis_needed[!kpis_needed$has_kpis, ],
    all_complete = all_complete
  ))
 }
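 # Illustrative sketch (not part of the reviewed code): how the week patterns
 # matched by check_kpi_completeness() could be derived. It assumes that
 # get_iso_week_year() follows ISO 8601 week numbering; example_iso_week_year()
 # and example_week_patterns() are hypothetical names used only in this sketch.
 example_iso_week_year <- function(date) {
   # %V = ISO 8601 week number, %G = ISO week-based year (can differ from %Y)
   list(
     week = as.integer(format(date, "%V")),
     year = as.integer(format(date, "%G"))
   )
 }
 example_week_patterns <- function(end_date, n_weeks) {
   # One pattern per week, stepping back 7 days at a time from end_date
   vapply(seq_len(n_weeks) - 1, function(weeks_back) {
     wwy <- example_iso_week_year(end_date - weeks_back * 7)
     sprintf("week%02d_%d", wwy$week, wwy$year)
   }, character(1))
 }
 # Usage: example_week_patterns(as.Date("2026-01-08"), 3) spans the year
 # boundary, so the oldest pattern belongs to ISO year 2025. Pairing %V with
 # %G (rather than %Y) is what keeps week and year consistent in that case.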
 # 9. Initialize the project
 # ----------------------
 # Export project directories and settings
--- a/r_app/run_full_pipeline.R
+++ b/r_app/run_full_pipeline.R