
SmartCane Pipeline Code Review

Efficiency, Cleanup, and Architecture Analysis

Date: January 29, 2026
Scope: run_full_pipeline.R + all called scripts (10, 20, 21, 30, 31, 40, 80, 90, 91) + utility files
Status: Comprehensive review completed


EXECUTIVE SUMMARY

Your pipeline is well-structured and intentional, but has accumulated significant technical debt through development iterations. The main issues are:

  1. 🔴 HIGH IMPACT: 3 separate mosaic mode detection functions doing identical work
  2. 🔴 HIGH IMPACT: Week/year calculations duplicated 13+ times across 6+ files
  3. 🟡 MEDIUM IMPACT: 40+ debug statements cluttering output
  4. 🟡 MEDIUM IMPACT: File existence checks repeated in multiple places (especially KPI checks)
  5. 🟢 LOW IMPACT: Minor redundancy in command construction, but manageable

Estimated cleanup effort: 2-3 hours for core refactoring; significant code quality gains.

Workflow clarity issue: The split between merged_tif vs merged_tif_8b and weekly_mosaic vs weekly_tile_max is not clearly documented. This should be clarified.


1. DUPLICATED FUNCTIONS & LOGIC

1.1 Mosaic Mode Detection (CRITICAL REDUNDANCY)

Problem: Three identical implementations of detect_mosaic_mode():

| Location | Function Name | Lines | Issue |
|----------|---------------|-------|-------|
| run_full_pipeline.R | detect_mosaic_mode_early() | ~20 lines | Detects tiled vs single-file |
| run_full_pipeline.R | detect_mosaic_mode_simple() | ~20 lines | Detects tiled vs single-file (duplicate) |
| parameters_project.R | detect_mosaic_mode() | ~30 lines | Detects tiled vs single-file (different signature) |

Impact: If you change the detection logic, you must update 3 places. Bug risk is high.

Solution: Create single canonical function in parameters_project.R:

# SINGLE SOURCE OF TRUTH
detect_mosaic_mode <- function(project_dir) {
  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
  if (dir.exists(weekly_tile_max)) {
    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
    if (length(grep("^\\d+x\\d+$", subfolders)) > 0) return("tiled")
  }
  
  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
  if (dir.exists(weekly_mosaic) && 
      length(list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")) > 0) {
    return("single-file")
  }
  
  return("unknown")
}

Then replace all three calls in run_full_pipeline.R with this single function.


1.2 Week/Year Calculations (CRITICAL REDUNDANCY)

Problem: The paired pattern week_num <- as.numeric(format(..., "%V")) / year_num <- as.numeric(format(..., "%G")) appears 13+ times across multiple files.

Locations:

  • run_full_pipeline.R: Lines 82, 126-127, 229-230, 630, 793-794 (5 times)
  • 80_calculate_kpis.R: Lines 323-324 (1 time)
  • 80_weekly_stats_utils.R: Lines 829-830 (1 time)
  • kpi_utils.R: Line 45 (1 time)
  • 80_kpi_utils.R: Lines 177-178 (1 time)
  • Plus inline in sprintf statements: ~10+ additional times

Impact:

  • High maintenance burden
  • Risk of inconsistency (%V vs %Y confusion noted at line 82 in run_full_pipeline.R)
  • Code verbosity

Solution: Create utility function in parameters_project.R:

get_iso_week_year <- function(date) {
  list(
    week = as.numeric(format(date, "%V")),
    year = as.numeric(format(date, "%G"))  # ISO year, not calendar year
  )
}

# Usage:
wwy <- get_iso_week_year(end_date)
cat(sprintf("Week %02d/%d\n", wwy$week, wwy$year))

Also add convenience function:

format_week_year <- function(date, separator = "_") {
  wwy <- get_iso_week_year(date)
  sprintf("week_%02d%s%d", wwy$week, separator, wwy$year)
}

# Usage: format_week_year(end_date)  # "week_02_2026"

1.3 File Path Construction (MEDIUM REDUNDANCY)

Problem: Repeated patterns like:

file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
file.path("laravel_app", "storage", "app", project_dir, "reports", "kpis", kpi_subdir)

Solution: Centralize in parameters_project.R:

# Project-agnostic path builders
get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
}

get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
  if (mosaic_mode == "auto") mosaic_mode <- detect_mosaic_mode(project_dir)
  if (mosaic_mode == "tiled") {
    get_project_storage_path(project_dir, "weekly_tile_max/5x5")
  } else {
    get_project_storage_path(project_dir, "weekly_mosaic")
  }
}

get_kpi_dir <- function(project_dir, client_type) {
  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
}
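A quick check of the helpers above (the project name "angata" is used purely for illustration; file.path always joins with forward slashes, so the same strings work cross-platform):

```r
# Assumes the helper functions above are sourced
get_project_storage_path("angata", "weekly_mosaic")
# "laravel_app/storage/app/angata/weekly_mosaic"

get_mosaic_dir("angata", mosaic_mode = "tiled")   # explicit mode skips detection
# "laravel_app/storage/app/angata/weekly_tile_max/5x5"

get_kpi_dir("angata", "cane_supply")
# "laravel_app/storage/app/angata/reports/kpis/field_analysis"
```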

2. DEBUG STATEMENTS & LOGGING CLUTTER

2.1 Excessive Debug Output

The pipeline prints 40+ debug statements that pollute the terminal output. Examples:

In run_full_pipeline.R:

Line 82:   cat(sprintf("       Running week: %02d / %d\n", ...))  # Bug: year taken with %Y (calendar) instead of %G (ISO)
Line 218:  cat(sprintf("[KPI_DIR_CREATED] Created directory: %s\n", ...))
Line 223:  cat(sprintf("[KPI_DIR_EXISTS] %s\n", ...))
Line 224:  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", ...))
Line 225:  cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
Line 240:  cat(sprintf("[KPI_DEBUG_W%02d_%d] Pattern: '%s' | Found: %d files\n", ...))
Line 630:  cat("DEBUG: Running command:", cmd, "\n")  # Prints the full conda command for Script 31

In 80_calculate_kpis.R:

Line 323:  message(paste("Calculating statistics for all fields - Week", week_num, year))
Lines 417+: many more similar message() calls

Impact:

  • Makes output hard to scan for real issues
  • Test developers skip important messages
  • Production logs become noise

Solution: Replace with structured logging (3 levels):

# Add to parameters_project.R
smartcane_log <- function(message, level = "INFO") {
  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  prefix <- sprintf("[%s] %s", level, timestamp)
  cat(sprintf("%s | %s\n", prefix, message))
}

smartcane_debug <- function(message) {
  if (Sys.getenv("SMARTCANE_DEBUG") == "TRUE") {
    smartcane_log(message, level = "DEBUG")
  }
}

smartcane_warn <- function(message) {
  smartcane_log(message, level = "WARN")
}

Usage:

# Keep important messages
smartcane_log(sprintf("Downloaded %d dates, %d failed", download_count, download_failed))

# Hide debug clutter (only show if DEBUG=TRUE)
smartcane_debug(sprintf("KPI directory exists: %s", kpi_dir))

# Warnings stay visible
smartcane_warn("Some downloads failed, but continuing pipeline")
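Debug visibility then becomes a runtime switch rather than a code change. For example, using the SMARTCANE_DEBUG environment variable from the sketch above:

```r
Sys.setenv(SMARTCANE_DEBUG = "TRUE")
smartcane_debug("KPI directory scan details")  # printed with DEBUG prefix

Sys.unsetenv("SMARTCANE_DEBUG")
smartcane_debug("KPI directory scan details")  # silently suppressed
```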

2.2 Redundant Status Checks in KPI Section

Lines 218-270 in run_full_pipeline.R: The KPI requirement check has deeply nested debug statements.

if (dir.exists(kpi_dir)) {
  cat(sprintf("[KPI_DIR_EXISTS] %s\n", kpi_dir))
  all_kpi_files <- list.files(kpi_dir)
  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", length(all_kpi_files)))
  if (length(all_kpi_files) > 0) {
    cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
  }
} else {
  cat(sprintf("[KPI_DIR_MISSING] Directory does not exist: %s\n", kpi_dir))
}

Solution: Simplify to:

if (!dir.exists(kpi_dir)) {
  dir.create(kpi_dir, recursive = TRUE, showWarnings = FALSE)
}

all_kpi_files <- list.files(kpi_dir)
smartcane_debug(sprintf("KPI directory: %d files found", length(all_kpi_files)))

3. DOUBLE CALCULATIONS & INEFFICIENCIES

3.1 KPI Existence Check (Calculated Twice)

Problem: KPI existence is checked twice in run_full_pipeline.R:

  1. First check (Lines 228-270): Initial KPI requirement check that calculates kpis_needed dataframe
  2. Second check (Lines 786-810): Verification after Script 80 runs (almost identical logic)

Both loops do:

for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  week_num <- as.numeric(format(check_date, "%V"))
  year_num <- as.numeric(format(check_date, "%G"))
  
  week_pattern <- sprintf("week%02d_%d", week_num, year_num)
  kpi_files_this_week <- list.files(kpi_dir, pattern = week_pattern)
  
  has_kpis <- length(kpi_files_this_week) > 0
  # ... same logic again
}

Impact: Slower pipeline execution, code duplication

Solution: Create reusable function in utility file:

check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
  kpi_dir <- get_kpi_dir(project_dir, client_type)
  
  kpis_needed <- data.frame()
  for (weeks_back in 0:(reporting_weeks_needed - 1)) {
    check_date <- end_date - (weeks_back * 7)
    wwy <- get_iso_week_year(check_date)
    
    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
    has_kpis <- any(grepl(week_pattern, list.files(kpi_dir)))
    
    kpis_needed <- rbind(kpis_needed, data.frame(
      week = wwy$week,
      year = wwy$year,
      date = check_date,
      has_kpis = has_kpis
    ))
  }
  
  return(list(
    kpis_df = kpis_needed,
    missing_count = sum(!kpis_needed$has_kpis),
    all_complete = all(kpis_needed$has_kpis)
  ))
}

# Then in run_full_pipeline.R:
initial_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)

# ... after Script 80 runs:
final_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
if (final_kpi_check$all_complete) {
  smartcane_log("✓ All KPIs available")
}

3.2 Mosaic Mode Detection (Called Multiple Times per Run)

Current code:

  • Line 99-117: detect_mosaic_mode_early() called once
  • Line 301-324: detect_mosaic_mode_simple() called again
  • Result: Same detection logic runs multiple times per pipeline run, unnecessarily

Solution: Call once, store result:

mosaic_mode <- detect_mosaic_mode(project_dir)  # Once at top

# Then reuse throughout:
if (mosaic_mode == "tiled") { ... }
else if (mosaic_mode == "single-file") { ... }

3.3 Missing Weeks Calculation Inefficiency

Lines 126-170: The loop builds weeks_needed dataframe, then immediately iterates again to find which ones are missing.

Current code:

# First: build all weeks
weeks_needed <- data.frame()
for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  # ... build weeks_needed
}

# Then: check which are missing (loop again)
missing_weeks <- data.frame()
for (i in 1:nrow(weeks_needed)) {
  # ... check each week
}

Solution: Combine into single loop:

weeks_needed <- data.frame()
missing_weeks <- data.frame()
earliest_missing_date <- end_date
mosaic_dir <- get_mosaic_dir(project_dir, mosaic_mode)  # Resolve once, outside the loop

for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  wwy <- get_iso_week_year(check_date)
  
  # Add to weeks_needed
  weeks_needed <- rbind(weeks_needed, data.frame(
    week = wwy$week, year = wwy$year, date = check_date
  ))
  
  # Check if missing, add to missing_weeks if so
  week_pattern <- sprintf("week_%02d_%d", wwy$week, wwy$year)
  if (length(list.files(mosaic_dir, pattern = week_pattern)) == 0) {
    missing_weeks <- rbind(missing_weeks, data.frame(
      week = wwy$week, year = wwy$year, week_end_date = check_date
    ))
    if (check_date - 6 < earliest_missing_date) {
      earliest_missing_date <- check_date - 6
    }
  }
}

3.4 Data Source Detection Logic

Lines 58-84: The data_source_used detection is overly complex:

data_source_used <- "merged_tif_8b"  # Default
if (dir.exists(merged_tif_path)) {
  tif_files <- list.files(merged_tif_path, pattern = "\\.tif$")
  if (length(tif_files) > 0) {
    data_source_used <- "merged_tif"
    # ...
  } else if (dir.exists(merged_tif_8b_path)) {
    tif_files_8b <- list.files(merged_tif_8b_path, pattern = "\\.tif$")
    # ...
  }
} else if (dir.exists(merged_tif_8b_path)) {
  # ...
}

Issues:

  • Multiple nested conditions doing the same check
  • tif_files and tif_files_8b are built, but only their lengths are checked; the file lists themselves are never used
  • Logic could be cleaner

Solution: Create utility function:

detect_data_source <- function(project_dir) {
  storage_dir <- get_project_storage_path(project_dir)
  
  for (source in c("merged_tif", "merged_tif_8b")) {
    source_dir <- file.path(storage_dir, source)
    if (dir.exists(source_dir)) {
      tifs <- list.files(source_dir, pattern = "\\.tif$")
      if (length(tifs) > 0) return(source)
    }
  }
  
  smartcane_warn("No data source found - defaulting to merged_tif_8b")
  return("merged_tif_8b")
}

4. WORKFLOW CLARITY ISSUES

4.1 TIFF Data Format Confusion

Problem: Why are there TWO different TIFF folders?

  • merged_tif: 4-band data (RGB + NIR)
  • merged_tif_8b: 8-band data (appears to include UDM cloud masking from Planet)

Currently in code:

data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"

Issues:

  • Hard-coded per project, not based on what's actually available
  • Not documented why angata uses 8-band
  • Unclear what the 8-band data adds (cloud masking? extra bands?)
  • Scripts handle both, but it's not clear when to use which

Recommendation:

  1. Document in parameters_project.R what each data source contains:
DATA_SOURCE_FORMATS <- list(
  "merged_tif" = list(
    bands = 4,
    description = "4-band PlanetScope: Red, Green, Blue, NIR",
    projects = c("aura", "chemba", "xinavane"),
    note = "Standard format from Planet API"
  ),
  "merged_tif_8b" = list(
    bands = 8,
    description = "8-band PlanetScope with UDM: RGB+NIR + 4-band cloud mask",
    projects = c("angata"),
    note = "Enhanced with cloud confidence from UDM2 (Unusable Data Mask)"
  )
)
  2. Update hard-coded assignment to be data-driven:
# OLD: data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
# NEW: detect what's actually available
data_source <- detect_data_source(project_dir)

4.2 Mosaic Storage Format Confusion

Problem: Why are there TWO different mosaic storage styles?

  • weekly_mosaic/: Single TIF file per week (monolithic)
  • weekly_tile_max/5x5/: Tiled TIFFs per week (25+ files per week)

Currently in code:

  • Detected automatically via detect_mosaic_mode()
  • But no documentation on when/why each is used

Recommendation:

  1. Document the trade-offs in parameters_project.R:
MOSAIC_MODES <- list(
  "single-file" = list(
    description = "One TIF per week",
    storage_path = "weekly_mosaic/",
    files_per_week = 1,
    pros = c("Simpler file management", "Easier to load full mosaic"),
    cons = c("Slower for field-specific analysis", "Large file I/O"),
    suitable_for = c("agronomic_support", "dashboard visualization")
  ),
  "tiled" = list(
    description = "5×5 grid of tiles per week",
    storage_path = "weekly_tile_max/5x5/",
    files_per_week = 25,
    pros = c("Parallel field processing", "Faster per-field queries", "Scalable to 1000+ fields"),
    cons = c("More file management", "Requires tile_grid metadata"),
    suitable_for = c("cane_supply", "large-scale operations")
  )
)
  2. Document why angata uses tiled, aura uses single-file:
    • Is it a function of field count? (Angata = cane_supply, large fields → tiled)
    • Is it historical? (Legacy decision?)
    • Should new projects choose based on client type?
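If the team concludes that client type should drive the choice, the rule could live next to the client type metadata. A hypothetical default that encodes today's observed pattern, pending confirmation of the questions above:

```r
# Hypothetical: default mosaic mode derived from client type.
# Encodes the currently observed pattern (cane_supply -> tiled,
# agronomic_support -> single-file); confirm with the team before adopting.
default_mosaic_mode <- function(client_type) {
  if (client_type == "cane_supply") "tiled" else "single-file"
}

# default_mosaic_mode("cane_supply")        # "tiled"
# default_mosaic_mode("agronomic_support")  # "single-file"
```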

4.3 Client Type Mapping Clarity

Current structure in parameters_project.R:

CLIENT_TYPE_MAP <- list(
  "angata" = "cane_supply",
  "aura" = "agronomic_support",
  "chemba" = "cane_supply",
  "xinavane" = "cane_supply",
  "esa" = "cane_supply"
)

Issues:

  • Not clear why aura is agronomic_support while angata/chemba are cane_supply
  • No documentation of what each client type needs
  • Scripts branch heavily on skip_cane_supply_only logic

Recommendation: Add metadata to explain the distinction:

CLIENT_TYPES <- list(
  "cane_supply" = list(
    description = "Sugar mill supply chain optimization",
    requires_harvest_prediction = TRUE,  # Script 31
    requires_phase_assignment = TRUE,     # Based on planting date
    per_field_detail = TRUE,              # Script 91 Excel report
    data_sources = c("merged_tif", "merged_tif_8b"),
    mosaic_mode = "tiled",
    projects = c("angata", "chemba", "xinavane", "esa")
  ),
  "agronomic_support" = list(
    description = "Farm-level decision support for agronomists",
    requires_harvest_prediction = FALSE,
    requires_phase_assignment = FALSE,
    per_field_detail = FALSE,
    farm_level_kpis = TRUE,               # Script 90 Word report
    data_sources = c("merged_tif"),
    mosaic_mode = "single-file",
    projects = c("aura")
  )
)
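With this metadata in place, the flat CLIENT_TYPE_MAP lookup could be replaced by a small accessor so the projects vectors become the single source of truth. A sketch, assuming the CLIENT_TYPES list above is sourced (get_client_config is a hypothetical name):

```r
# Hypothetical accessor: resolve a project's client type and config
# by searching the projects vectors in CLIENT_TYPES
get_client_config <- function(project_dir) {
  for (type_name in names(CLIENT_TYPES)) {
    cfg <- CLIENT_TYPES[[type_name]]
    if (project_dir %in% cfg$projects) {
      cfg$client_type <- type_name
      return(cfg)
    }
  }
  stop(sprintf("No client type configured for project '%s'", project_dir))
}

# Usage:
# cfg <- get_client_config("aura")
# cfg$client_type                  # "agronomic_support"
# cfg$requires_harvest_prediction  # FALSE
```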

5. COMMAND CONSTRUCTION REDUNDANCY

5.1 Rscript Path Repetition

Problem: The Rscript path is hard-coded 5 times, including:

Line 519:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
Line 676:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
Line 685:  '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'

Solution: Define once in parameters_project.R:

RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"

# Usage:
cmd <- sprintf('"%s" --vanilla r_app/20_ci_extraction.R ...', RSCRIPT_PATH)

6. SPECIFIC LINE-BY-LINE ISSUES

6.1 Line 82 Bug: Wrong Format Code

cat(sprintf("       Running week: %02d / %d\n", 
            as.numeric(format(end_date, "%V")), 
            as.numeric(format(end_date, "%Y"))))  # ❌ Should be %G, not %Y

Issue: Uses calendar year %Y instead of ISO week year %G. On dates like 2025-12-30 (week 1 of 2026), this will print "Week 01 / 2025" (confusing).
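A minimal reproduction at the R console, using the same date (note: %V and %G support depends on the platform's strftime on older R versions):

```r
d <- as.Date("2025-12-30")   # Tuesday of ISO week 01, 2026
format(d, "%V")  # "01"   ISO week number
format(d, "%G")  # "2026" ISO week-based year
format(d, "%Y")  # "2025" calendar year -- the mismatched pairing
```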

Fix:

wwy <- get_iso_week_year(end_date)
cat(sprintf("       Running week: %02d / %d\n", wwy$week, wwy$year))

6.2 Line 630 Debug Statement

cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
cat("DEBUG: Running command:", cmd, "\n")  # ❌ Prints full conda command

Solution: Use smartcane_debug() function:

cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
smartcane_debug(sprintf("Running Python 31: %s", cmd))

6.3 Lines 719-723: Verbose Script 31 Verification

# Check for THIS WEEK's specific file
current_week <- as.numeric(format(end_date, "%V"))
current_year <- as.numeric(format(end_date, "%Y"))
expected_file <- file.path(...)

Issue: Calculates week twice (already done earlier). Also uses %Y (should be %G).

Solution: Reuse earlier wwy calculation or create helper.


7. REFACTORING ROADMAP

Phase 1: Foundation (1 hour)

  • Consolidate detect_mosaic_mode() into single function in parameters_project.R
  • Create get_iso_week_year() and format_week_year() utilities
  • Create get_project_storage_path(), get_mosaic_dir(), get_kpi_dir() helpers
  • Add logging functions (smartcane_log(), smartcane_debug(), smartcane_warn())

Phase 2: Deduplication (1 hour)

  • Replace all 13+ week_num/year_num calculations with get_iso_week_year()
  • Replace all 3 detect_mosaic_mode_*() calls with single function
  • Combine duplicate KPI checks into check_kpi_completeness() function
  • Fix line 82 and 630 format bugs

Phase 3: Cleanup (1 hour)

  • Remove all debug statements (40+), replace with smartcane_debug()
  • Simplify nested conditions in data_source detection
  • Combine missing weeks detection into single loop
  • Extract Rscript path to constant

Phase 4: Documentation (30 min)

  • Add comments explaining merged_tif vs merged_tif_8b trade-offs
  • Document single-file vs tiled mosaic modes and when to use each
  • Clarify client type mapping in CLIENT_TYPE_MAP
  • Add inline comments for non-obvious logic

8. ARCHITECTURE & WORKFLOW RECOMMENDATIONS

8.1 Clear Data Flow Diagram

Add to r_app/system_architecture/system_architecture.md:

INPUT SOURCES:
  ├── Planet API 4-band or 8-band imagery
  ├── Field boundaries (pivot.geojson)
  └── Harvest data (harvest.xlsx, optional for cane_supply)

STORAGE TIERS:
  ├── Tier 1: Raw data (merged_tif/ or merged_tif_8b/)
  ├── Tier 2: Daily tiles (daily_tiles_split/{grid_size}/{dates}/)
  ├── Tier 3: Extracted CI (Data/extracted_ci/daily_vals/*.rds)
  ├── Tier 4: Weekly mosaics (weekly_mosaic/ OR weekly_tile_max/5x5/)
  └── Tier 5: KPI outputs (reports/kpis/{field_level|field_analysis}/)

DECISION POINTS:
  └─ Client type (cane_supply vs agronomic_support)
     ├─ Drives script selection (Scripts 21, 22, 23, 31, 90/91)
     ├─ Drives data source (currently hard-coded: merged_tif_8b for angata, merged_tif for other projects)
     ├─ Drives mosaic mode (tiled for cane_supply, single-file for agronomic)
     └─ Drives KPI subdirectory (field_analysis vs field_level)

8.2 .sh Scripts Alignment

You mention .sh scripts in the online environment. If they're not calling the R pipeline, there's a split responsibility issue:

Question: Are the .sh scripts:

  • (A) Independent duplicates of the R pipeline logic? (BAD - maintenance nightmare)
  • (B) Wrappers calling the R pipeline? (GOOD - single source of truth)
  • (C) Different workflow for online vs local? (RED FLAG - they diverge)

Recommendation: If using .sh for production, ensure they call the same R scripts (run_full_pipeline.R). Example:

#!/bin/bash
# Wrapper that ensures the same R pipeline is called (assumes Rscript is on PATH)
cd /path/to/smartcane
Rscript r_app/run_full_pipeline.R

9. SUMMARY TABLE: Issues by Severity

| Issue | Type | Impact | Effort | Priority |
|-------|------|--------|--------|----------|
| 3 mosaic detection functions | Duplication | HIGH | 30 min | P0 |
| 13+ week/year calculations | Duplication | HIGH | 1 hour | P0 |
| 40+ debug statements | Clutter | MEDIUM | 1 hour | P1 |
| KPI check run twice | Inefficiency | LOW | 30 min | P2 |
| Line 82: %Y should be %G | Bug | LOW | 5 min | P2 |
| Data source confusion | Documentation | MEDIUM | 30 min | P1 |
| Mosaic mode confusion | Documentation | MEDIUM | 30 min | P1 |
| Client type mapping | Documentation | MEDIUM | 30 min | P1 |
| Data source detection complexity | Code style | LOW | 15 min | P3 |

NEXT STEPS

  1. Review this report with your team to align on priorities
  2. Create Linear issues for each phase of refactoring
  3. Start with Phase 1 (foundation utilities) - builds confidence for Phase 2
  4. Test thoroughly after each phase - the pipeline is complex and easy to break
  5. Update .sh scripts if they duplicate R logic
  6. Document data flow in system_architecture/system_architecture.md

Questions for Clarification

Before implementing, please clarify:

  1. Data source split: Why does angata use merged_tif_8b (8-band with cloud mask) while aura uses merged_tif (4-band)? Is this:

    • A function of client need (cane_supply requires cloud masking)?
    • Historical (legacy decision for angata)?
    • Should new projects choose based on availability?
  2. Mosaic mode split: Why tiled for angata but single-file for aura? Should this be:

    • Hard-coded per project?
    • Based on field count/client type?
    • Auto-detected from first run?
  3. Production vs local: Are the .sh scripts in the online environment:

    • Calling this same R pipeline?
    • Duplicating logic independently?
    • A different workflow entirely?
  4. Client type growth: Are there other client types planned beyond cane_supply and agronomic_support? (e.g., extension_service?)


Report prepared: January 29, 2026
Total code reviewed: ~2,500 lines across 10 files
Estimated refactoring time: 3-4 hours
Estimated maintenance savings: 5-10 hours/month (fewer bugs, easier updates)