
# SmartCane Pipeline Code Review
## Efficiency, Cleanup, and Architecture Analysis
**Date**: January 29, 2026
**Scope**: `run_full_pipeline.R` + all called scripts (10, 20, 21, 30, 31, 40, 80, 90, 91) + utility files
**Status**: Comprehensive review completed
---
## EXECUTIVE SUMMARY
Your pipeline is **well-structured and intentional**, but has accumulated significant technical debt through development iterations. The main issues are:
1. **🔴 HIGH IMPACT**: **3 separate mosaic mode detection functions** doing identical work
2. **🔴 HIGH IMPACT**: **Week/year calculations duplicated 13+ times** across 6+ files
3. **🟡 MEDIUM IMPACT**: **40+ debug statements** cluttering output
4. **🟡 MEDIUM IMPACT**: **File existence checks repeated** in multiple places (especially KPI checks)
5. **🟢 LOW IMPACT**: Minor redundancy in command construction, but manageable
**Estimated cleanup effort**: 2-3 hours for core refactoring; significant code quality gains.
**Workflow clarity issue**: The split between `merged_tif` vs `merged_tif_8b` and `weekly_mosaic` vs `weekly_tile_max` is **not clearly documented**. This should be clarified.
---
## 1. DUPLICATED FUNCTIONS & LOGIC
### 1.1 Mosaic Mode Detection (CRITICAL REDUNDANCY)
**Problem**: Three identical implementations of `detect_mosaic_mode()`:
| Location | Function Name | Lines | Issue |
|----------|---------------|-------|-------|
| `run_full_pipeline.R` | `detect_mosaic_mode_early()` | ~20 lines | Detects tiled vs single-file |
| `run_full_pipeline.R` | `detect_mosaic_mode_simple()` | ~20 lines | Detects tiled vs single-file (duplicate) |
| `parameters_project.R` | `detect_mosaic_mode()` | ~30 lines | Detects tiled vs single-file (different signature) |
**Impact**: If you change the detection logic, you must update 3 places. Bug risk is high.
**Solution**: Create **single canonical function in `parameters_project.R`**:
```r
# SINGLE SOURCE OF TRUTH
detect_mosaic_mode <- function(project_dir) {
  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
  if (dir.exists(weekly_tile_max)) {
    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
    if (length(grep("^\\d+x\\d+$", subfolders)) > 0) return("tiled")
  }

  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
  if (dir.exists(weekly_mosaic) &&
      length(list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")) > 0) {
    return("single-file")
  }

  return("unknown")
}
```
Then replace all three calls in `run_full_pipeline.R` with this single function.
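The tile-grid regex above matches grid folder names like `5x5` or `10x10` and nothing else; a quick self-contained check (the subfolder names here are made up for illustration):

```r
# Hypothetical subfolder listing - only NxM grid names should match
subfolders <- c("5x5", "10x10", "archive", "5x5_old")
grep("^\\d+x\\d+$", subfolders, value = TRUE)
# "5x5" "10x10"
```

The `^` and `$` anchors matter: without them, `"5x5_old"` would also match and a stale folder could flip the pipeline into tiled mode.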
---
### 1.2 Week/Year Calculations (CRITICAL REDUNDANCY)
**Problem**: The pattern `week_num <- as.numeric(format(..., "%V"))` + `year_num <- as.numeric(format(..., "%G"))` appears **13+ times** across multiple files.
**Locations**:
- `run_full_pipeline.R`: Lines 82, 126-127, 229-230, 630, 793-794 (5 times)
- `80_calculate_kpis.R`: Lines 323-324 (1 time)
- `80_weekly_stats_utils.R`: Lines 829-830 (1 time)
- `kpi_utils.R`: Line 45 (1 time)
- `80_kpi_utils.R`: Lines 177-178 (1 time)
- Plus inline in sprintf statements: ~10+ additional times
**Impact**:
- High maintenance burden
- Risk of inconsistency (%V vs %Y confusion noted at line 82 in `run_full_pipeline.R`)
- Code verbosity
**Solution**: Create **utility function in `parameters_project.R`**:
```r
get_iso_week_year <- function(date) {
  list(
    week = as.numeric(format(date, "%V")),
    year = as.numeric(format(date, "%G"))  # ISO year, not calendar year
  )
}

# Usage:
wwy <- get_iso_week_year(end_date)
cat(sprintf("Week %02d/%d\n", wwy$week, wwy$year))
```
**Also add a convenience function**:
```r
format_week_year <- function(date, separator = "_") {
  wwy <- get_iso_week_year(date)
  sprintf("week_%02d%s%d", wwy$week, separator, wwy$year)
}

# Usage: format_week_year(end_date)  # "week_02_2026"
```
---
### 1.3 File Path Construction (MEDIUM REDUNDANCY)
**Problem**: Repeated patterns like:
```r
file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
file.path("laravel_app", "storage", "app", project_dir, "reports", "kpis", kpi_subdir)
```
**Solution**: Centralize in `parameters_project.R`:
```r
# Project-agnostic path builders
get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
}

get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
  if (mosaic_mode == "auto") mosaic_mode <- detect_mosaic_mode(project_dir)
  if (mosaic_mode == "tiled") {
    get_project_storage_path(project_dir, "weekly_tile_max/5x5")
  } else {
    get_project_storage_path(project_dir, "weekly_mosaic")
  }
}

get_kpi_dir <- function(project_dir, client_type) {
  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
}
```
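A quick usage sketch of the base path builder (the project name `angata` is used for illustration; note that `file.path()` joins with `/` even on Windows, which R accepts):

```r
get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
}

get_project_storage_path("angata")
# "laravel_app/storage/app/angata"
get_project_storage_path("angata", "weekly_mosaic")
# "laravel_app/storage/app/angata/weekly_mosaic"
```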
---
## 2. DEBUG STATEMENTS & LOGGING CLUTTER
### 2.1 Excessive Debug Output
The pipeline prints **40+ debug statements** that pollute the terminal output. Examples:
**In `run_full_pipeline.R`**:
```r
Line 82:  cat(sprintf("  Running week: %02d / %d\n", ...))  # Note: format() uses %Y (calendar year); should be %G
Line 218: cat(sprintf("[KPI_DIR_CREATED] Created directory: %s\n", ...))
Line 223: cat(sprintf("[KPI_DIR_EXISTS] %s\n", ...))
Line 224: cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", ...))
Line 225: cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
Line 240: cat(sprintf("[KPI_DEBUG_W%02d_%d] Pattern: '%s' | Found: %d files\n", ...))
Line 630: cat("DEBUG: Running command:", cmd, "\n")  # Prints the full conda command for Script 31
```
**In `80_calculate_kpis.R`**:
```
Line 323: message(paste("Calculating statistics for all fields - Week", week_num, year))
Line 417: # Plus many more ...
```
**Impact**:
- Makes output hard to scan for real issues
- Developers learn to tune out the output and miss important messages
- Production logs become noise
**Solution**: Replace with **structured logging** (3 levels):
```r
# Add to parameters_project.R
smartcane_log <- function(message, level = "INFO") {
  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  prefix <- sprintf("[%s] %s", level, timestamp)
  cat(sprintf("%s | %s\n", prefix, message))
}

smartcane_debug <- function(message) {
  if (Sys.getenv("SMARTCANE_DEBUG") == "TRUE") {
    smartcane_log(message, level = "DEBUG")
  }
}

smartcane_warn <- function(message) {
  smartcane_log(message, level = "WARN")
}
```
**Usage**:
```r
# Keep important messages
smartcane_log(sprintf("Downloaded %d dates, %d failed", download_count, download_failed))
# Hide debug clutter (only show if DEBUG=TRUE)
smartcane_debug(sprintf("KPI directory exists: %s", kpi_dir))
# Warnings stay visible
smartcane_warn("Some downloads failed, but continuing pipeline")
```
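Turning debug output on would then be a single environment variable (a sketch; `SMARTCANE_DEBUG` is the name proposed above, not an existing setting):

```r
# In .Renviron, a cron wrapper, or at the top of a session:
Sys.setenv(SMARTCANE_DEBUG = "TRUE")

# From now on smartcane_debug() messages are printed;
# unset (or set to anything else) to silence them again:
Sys.unsetenv("SMARTCANE_DEBUG")
```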
---
### 2.2 Redundant Status Checks in KPI Section
**Lines 218-270 in `run_full_pipeline.R`**: The KPI requirement check has **deeply nested debug statements**.
```r
if (dir.exists(kpi_dir)) {
  cat(sprintf("[KPI_DIR_EXISTS] %s\n", kpi_dir))
  all_kpi_files <- list.files(kpi_dir)
  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", length(all_kpi_files)))
  if (length(all_kpi_files) > 0) {
    cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
  }
} else {
  cat(sprintf("[KPI_DIR_MISSING] Directory does not exist: %s\n", kpi_dir))
}
```
**Solution**: Simplify to:
```r
if (!dir.exists(kpi_dir)) {
  dir.create(kpi_dir, recursive = TRUE, showWarnings = FALSE)
}
all_kpi_files <- list.files(kpi_dir)
smartcane_debug(sprintf("KPI directory: %d files found", length(all_kpi_files)))
```
---
## 3. DOUBLE CALCULATIONS & INEFFICIENCIES
### 3.1 KPI Existence Check (Calculated Twice)
**Problem**: KPI existence is checked **twice** in `run_full_pipeline.R`:
1. **First check (Lines 228-270)**: Initial KPI requirement check that calculates `kpis_needed` dataframe
2. **Second check (Lines 786-810)**: Verification after Script 80 runs (almost identical logic)
Both loops do:
```r
for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  week_num <- as.numeric(format(check_date, "%V"))
  year_num <- as.numeric(format(check_date, "%G"))
  week_pattern <- sprintf("week%02d_%d", week_num, year_num)
  kpi_files_this_week <- list.files(kpi_dir, pattern = week_pattern)
  has_kpis <- length(kpi_files_this_week) > 0
  # ... same logic again
}
```
**Impact**: Slower pipeline execution, code duplication
**Solution**: Create **reusable function in utility file**:
```r
check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
  kpi_dir <- get_kpi_dir(project_dir, client_type)
  kpis_needed <- data.frame()

  for (weeks_back in 0:(reporting_weeks_needed - 1)) {
    check_date <- end_date - (weeks_back * 7)
    wwy <- get_iso_week_year(check_date)
    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
    has_kpis <- any(grepl(week_pattern, list.files(kpi_dir)))

    kpis_needed <- rbind(kpis_needed, data.frame(
      week = wwy$week,
      year = wwy$year,
      date = check_date,
      has_kpis = has_kpis
    ))
  }

  return(list(
    kpis_df = kpis_needed,
    missing_count = sum(!kpis_needed$has_kpis),
    all_complete = all(kpis_needed$has_kpis)
  ))
}

# Then in run_full_pipeline.R:
initial_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)

# ... after Script 80 runs:
final_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
if (final_kpi_check$all_complete) {
  smartcane_log("✓ All KPIs available")
}
```
---
### 3.2 Mosaic Mode Detection (Called Twice per Run)
**Current code**:
- Line 99-117: `detect_mosaic_mode_early()` called once
- Line 301-324: `detect_mosaic_mode_simple()` called again
- Result: **Same detection logic runs twice unnecessarily**
**Solution**: Call once, store result:
```r
mosaic_mode <- detect_mosaic_mode(project_dir)  # Once at top

# Then reuse throughout (note: in R, `else` must follow `}` on the same line):
if (mosaic_mode == "tiled") {
  ...
} else if (mosaic_mode == "single-file") {
  ...
}
```
---
### 3.3 Missing Weeks Calculation Inefficiency
**Lines 126-170**: The loop builds `weeks_needed` dataframe, then **immediately** iterates again to find which ones are missing.
**Current code**:
```r
# First: build all weeks
weeks_needed <- data.frame()
for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  # ... build weeks_needed
}

# Then: check which are missing (loop again)
missing_weeks <- data.frame()
for (i in 1:nrow(weeks_needed)) {
  # ... check each week
}
```
**Solution**: Combine into **single loop**:
```r
weeks_needed <- data.frame()
missing_weeks <- data.frame()
earliest_missing_date <- end_date
mosaic_dir <- get_mosaic_dir(project_dir, mosaic_mode)  # Hoisted: constant across iterations

for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  wwy <- get_iso_week_year(check_date)

  # Add to weeks_needed
  weeks_needed <- rbind(weeks_needed, data.frame(
    week = wwy$week, year = wwy$year, date = check_date
  ))

  # Check if missing, add to missing_weeks if so
  week_pattern <- sprintf("week_%02d_%d", wwy$week, wwy$year)
  if (length(list.files(mosaic_dir, pattern = week_pattern)) == 0) {
    missing_weeks <- rbind(missing_weeks, data.frame(
      week = wwy$week, year = wwy$year, week_end_date = check_date
    ))
    if (check_date - 6 < earliest_missing_date) {
      earliest_missing_date <- check_date - 6
    }
  }
}
```
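If profiling ever shows this loop matters, the same result can be computed with one directory listing and no `rbind` inside a loop (growing a data frame with `rbind` is quadratic). A self-contained sketch; `end_date`, `reporting_weeks_needed`, and `mosaic_dir` carry illustrative placeholder values here:

```r
end_date <- as.Date("2026-01-25")   # illustrative
reporting_weeks_needed <- 4         # illustrative
mosaic_dir <- tempdir()             # illustrative; normally get_mosaic_dir(...)

# Vectorized week/year construction - one data.frame call, no loop
check_dates <- end_date - (0:(reporting_weeks_needed - 1)) * 7
weeks_needed <- data.frame(
  week = as.numeric(format(check_dates, "%V")),
  year = as.numeric(format(check_dates, "%G")),
  date = check_dates
)

# One directory listing, reused for every week
existing <- list.files(mosaic_dir)
patterns <- sprintf("week_%02d_%d", weeks_needed$week, weeks_needed$year)
weeks_needed$has_mosaic <- vapply(
  patterns, function(p) any(grepl(p, existing, fixed = TRUE)), logical(1)
)
missing_weeks <- weeks_needed[!weeks_needed$has_mosaic, ]
```

Besides the asymptotics, calling `list.files()` once instead of once per week avoids repeated directory I/O, which dominates on network storage.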
---
### 3.4 Data Source Detection Logic
**Lines 58-84**: The `data_source_used` detection is overly complex:
```r
data_source_used <- "merged_tif_8b"  # Default
if (dir.exists(merged_tif_path)) {
  tif_files <- list.files(merged_tif_path, pattern = "\\.tif$")
  if (length(tif_files) > 0) {
    data_source_used <- "merged_tif"
    # ...
  } else if (dir.exists(merged_tif_8b_path)) {
    tif_files_8b <- list.files(merged_tif_8b_path, pattern = "\\.tif$")
    # ...
  }
} else if (dir.exists(merged_tif_8b_path)) {
  # ...
}
```
**Issues**:
- Multiple nested conditions doing the same check
- `tif_files` and `tif_files_8b` are listed, but only their lengths are checked; the file lists themselves are never used
- Logic could be cleaner
**Solution**: Create utility function:
```r
detect_data_source <- function(project_dir) {
  storage_dir <- get_project_storage_path(project_dir)
  for (source in c("merged_tif", "merged_tif_8b")) {
    source_dir <- file.path(storage_dir, source)
    if (dir.exists(source_dir)) {
      tifs <- list.files(source_dir, pattern = "\\.tif$")
      if (length(tifs) > 0) return(source)
    }
  }
  smartcane_warn("No data source found - defaulting to merged_tif_8b")
  return("merged_tif_8b")
}
```
---
## 4. WORKFLOW CLARITY ISSUES
### 4.1 TIFF Data Format Confusion
**Problem**: Why are there TWO different TIFF folders?
- `merged_tif`: 4-band data (RGB + NIR)
- `merged_tif_8b`: 8-band data (appears to include UDM cloud masking from Planet)
**Currently in code**:
```r
data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
```
**Issues**:
- Hard-coded per project, not based on what's actually available
- Not documented **why** angata uses 8-band
- Unclear what the 8-band data adds (cloud masking? extra bands?)
- Scripts handle both, but it's not clear when to use which
**Recommendation**:
1. **Document in `parameters_project.R`** what each data source contains:
```r
DATA_SOURCE_FORMATS <- list(
  "merged_tif" = list(
    bands = 4,
    description = "4-band PlanetScope: Red, Green, Blue, NIR",
    projects = c("aura", "chemba", "xinavane"),
    note = "Standard format from Planet API"
  ),
  "merged_tif_8b" = list(
    bands = 8,
    description = "8-band PlanetScope with UDM: RGB+NIR + 4-band cloud mask",
    projects = c("angata"),
    note = "Enhanced with cloud confidence from UDM2 (Unusable Data Mask)"
  )
)
```
2. **Update hard-coded assignment** to be data-driven:
```r
# OLD: data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
# NEW: detect what's actually available
data_source <- detect_data_source(project_dir)
```
---
### 4.2 Mosaic Storage Format Confusion
**Problem**: Why are there TWO different mosaic storage styles?
- `weekly_mosaic/`: Single TIF file per week (monolithic)
- `weekly_tile_max/5x5/`: Tiled TIFFs per week (25+ files per week)
**Currently in code**:
- Detected automatically via `detect_mosaic_mode()`
- But **no documentation** on when/why each is used
**Recommendation**:
1. **Document the trade-offs in `parameters_project.R`**:
```r
MOSAIC_MODES <- list(
  "single-file" = list(
    description = "One TIF per week",
    storage_path = "weekly_mosaic/",
    files_per_week = 1,
    pros = c("Simpler file management", "Easier to load full mosaic"),
    cons = c("Slower for field-specific analysis", "Large file I/O"),
    suitable_for = c("agronomic_support", "dashboard visualization")
  ),
  "tiled" = list(
    description = "5×5 grid of tiles per week",
    storage_path = "weekly_tile_max/5x5/",
    files_per_week = 25,
    pros = c("Parallel field processing", "Faster per-field queries", "Scalable to 1000+ fields"),
    cons = c("More file management", "Requires tile_grid metadata"),
    suitable_for = c("cane_supply", "large-scale operations")
  )
)
```
2. **Document why angata uses tiled, aura uses single-file**:
- Is it a function of field count? (Angata = cane_supply, large fields → tiled)
- Is it historical? (Legacy decision?)
- Should new projects choose based on client type?
---
### 4.3 Client Type Mapping Clarity
**Current structure** in `parameters_project.R`:
```r
CLIENT_TYPE_MAP <- list(
  "angata"   = "cane_supply",
  "aura"     = "agronomic_support",
  "chemba"   = "cane_supply",
  "xinavane" = "cane_supply",
  "esa"      = "cane_supply"
)
```
**Issues**:
- Not clear **why** aura is agronomic_support while angata/chemba are cane_supply
- No documentation of what each client type needs
- Scripts branch heavily on `skip_cane_supply_only` logic
**Recommendation**:
Add metadata to explain the distinction:
```r
CLIENT_TYPES <- list(
  "cane_supply" = list(
    description = "Sugar mill supply chain optimization",
    requires_harvest_prediction = TRUE,  # Script 31
    requires_phase_assignment = TRUE,    # Based on planting date
    per_field_detail = TRUE,             # Script 91 Excel report
    data_sources = c("merged_tif", "merged_tif_8b"),
    mosaic_mode = "tiled",
    projects = c("angata", "chemba", "xinavane", "esa")
  ),
  "agronomic_support" = list(
    description = "Farm-level decision support for agronomists",
    requires_harvest_prediction = FALSE,
    requires_phase_assignment = FALSE,
    per_field_detail = FALSE,
    farm_level_kpis = TRUE,              # Script 90 Word report
    data_sources = c("merged_tif"),
    mosaic_mode = "single-file",
    projects = c("aura")
  )
)
```
---
## 5. COMMAND CONSTRUCTION REDUNDANCY
### 5.1 Rscript Path Repetition
**Problem**: The Rscript path is repeated 5 times:
```r
Line 519: '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
Line 676: '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
Line 685: '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
```
**Solution**: Define once in `parameters_project.R`:
```r
RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"
# Usage:
cmd <- sprintf('"%s" --vanilla r_app/20_ci_extraction.R ...', RSCRIPT_PATH)
```
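A more portable alternative (a sketch; it assumes the child processes should run under the same R installation as the pipeline itself) derives the path from the running session instead of hard-coding the version:

```r
# R.home("bin") points at the bin directory of the running R installation,
# so upgrading R no longer requires editing the pipeline
rscript_name <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
RSCRIPT_PATH <- file.path(R.home("bin"), rscript_name)
```

This also removes the Windows-only assumption baked into the hard-coded path, which matters if the same scripts ever run in the Linux online environment discussed in Section 8.2.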
---
## 6. SPECIFIC LINE-BY-LINE ISSUES
### 6.1 Line 82 Bug: Wrong Format Code
```r
cat(sprintf("  Running week: %02d / %d\n",
            as.numeric(format(end_date, "%V")),
            as.numeric(format(end_date, "%Y"))))  # ❌ Should be %G, not %Y
```
**Issue**: Uses calendar year `%Y` instead of ISO week year `%G`. On dates like 2025-12-30 (week 1 of 2026), this will print "Week 01 / 2025" (confusing).
**Fix**:
```r
wwy <- get_iso_week_year(end_date)
cat(sprintf(" Running week: %02d / %d\n", wwy$week, wwy$year))
```
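The discrepancy is easy to demonstrate (self-contained; `%V` and `%G` are supported by R's own `format.Date` on all platforms):

```r
d <- as.Date("2025-12-30")  # a Tuesday in ISO week 1 of 2026
format(d, "%V")  # "01"   - ISO week number
format(d, "%Y")  # "2025" - calendar year: the misleading combination
format(d, "%G")  # "2026" - ISO week-based year: matches the week number
```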
---
### 6.2 Line 630 Debug Statement
```r
cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
cat("DEBUG: Running command:", cmd, "\n") # ❌ Prints full conda command
```
**Solution**: Use `smartcane_debug()` function:
```r
cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
smartcane_debug(sprintf("Running Python 31: %s", cmd))
```
---
### 6.3 Lines 719-723: Verbose Script 31 Verification
```r
# Check for THIS WEEK's specific file
current_week <- as.numeric(format(end_date, "%V"))
current_year <- as.numeric(format(end_date, "%Y"))
expected_file <- file.path(...)
```
**Issue**: Calculates week twice (already done earlier). Also uses `%Y` (should be `%G`).
**Solution**: Reuse earlier `wwy` calculation or create helper.
---
## 7. REFACTORING ROADMAP
### Phase 1: Foundation (1 hour)
- [ ] Consolidate `detect_mosaic_mode()` into single function in `parameters_project.R`
- [ ] Create `get_iso_week_year()` and `format_week_year()` utilities
- [ ] Create `get_project_storage_path()`, `get_mosaic_dir()`, `get_kpi_dir()` helpers
- [ ] Add logging functions (`smartcane_log()`, `smartcane_debug()`, `smartcane_warn()`)
### Phase 2: Deduplication (1 hour)
- [ ] Replace all 13+ week_num/year_num calculations with `get_iso_week_year()`
- [ ] Replace all 3 `detect_mosaic_mode_*()` calls with single function
- [ ] Combine duplicate KPI checks into `check_kpi_completeness()` function
- [ ] Fix line 82 and 630 format bugs
### Phase 3: Cleanup (1 hour)
- [ ] Remove all debug statements (40+), replace with `smartcane_debug()`
- [ ] Simplify nested conditions in data_source detection
- [ ] Combine missing weeks detection into single loop
- [ ] Extract Rscript path to constant
### Phase 4: Documentation (30 min)
- [ ] Add comments explaining `merged_tif` vs `merged_tif_8b` trade-offs
- [ ] Document `single-file` vs `tiled` mosaic modes and when to use each
- [ ] Clarify client type mapping in `CLIENT_TYPE_MAP`
- [ ] Add inline comments for non-obvious logic
---
## 8. ARCHITECTURE & WORKFLOW RECOMMENDATIONS
### 8.1 Clear Data Flow Diagram
Add to `r_app/system_architecture/system_architecture.md`:
```
INPUT SOURCES:
├── Planet API 4-band or 8-band imagery
├── Field boundaries (pivot.geojson)
└── Harvest data (harvest.xlsx, optional for cane_supply)
STORAGE TIERS:
├── Tier 1: Raw data (merged_tif/ or merged_tif_8b/)
├── Tier 2: Daily tiles (daily_tiles_split/{grid_size}/{dates}/)
├── Tier 3: Extracted CI (Data/extracted_ci/daily_vals/*.rds)
├── Tier 4: Weekly mosaics (weekly_mosaic/ OR weekly_tile_max/5x5/)
└── Tier 5: KPI outputs (reports/kpis/{field_level|field_analysis}/)
DECISION POINTS:
└─ Client type (cane_supply vs agronomic_support)
├─ Drives script selection (Scripts 21, 22, 23, 31, 90/91)
├─ Drives data source (merged_tif_8b for cane_supply, merged_tif for agronomic)
├─ Drives mosaic mode (tiled for cane_supply, single-file for agronomic)
└─ Drives KPI subdirectory (field_analysis vs field_level)
```
### 8.2 .sh Scripts Alignment
You mention `.sh` scripts in the online environment. If they're **not calling the R pipeline**, there's a **split responsibility** issue:
**Question**: Are the `.sh` scripts:
- (A) Independent duplicates of the R pipeline logic? (BAD - maintenance nightmare)
- (B) Wrappers calling the R pipeline? (GOOD - single source of truth)
- (C) Different workflow for online vs local? (RED FLAG - they diverge)
**Recommendation**: If using `.sh` for production, ensure they **call the same R scripts** (`run_full_pipeline.R`). Example:
```bash
#!/bin/bash
# Wrapper that keeps the R pipeline as the single source of truth
set -euo pipefail
cd /path/to/smartcane
Rscript r_app/run_full_pipeline.R "$@"
```
---
## 9. SUMMARY TABLE: Issues by Severity
| Issue | Type | Impact | Effort | Priority |
|-------|------|--------|--------|----------|
| 3 mosaic detection functions | Duplication | HIGH | 30 min | P0 |
| 13+ week/year calculations | Duplication | HIGH | 1 hour | P0 |
| 40+ debug statements | Clutter | MEDIUM | 1 hour | P1 |
| KPI check run twice | Inefficiency | LOW | 30 min | P2 |
| Line 82: %Y should be %G | Bug | LOW | 5 min | P2 |
| Data source confusion | Documentation | MEDIUM | 30 min | P1 |
| Mosaic mode confusion | Documentation | MEDIUM | 30 min | P1 |
| Client type mapping | Documentation | MEDIUM | 30 min | P1 |
| Data source detection complexity | Code style | LOW | 15 min | P3 |
---
## 10. RECOMMENDED NEXT STEPS
1. **Review this report** with your team to align on priorities
2. **Create Linear issues** for each phase of refactoring
3. **Start with Phase 1** (foundation utilities) - builds confidence for Phase 2
4. **Test thoroughly** after each phase - the pipeline is complex and easy to break
5. **Update `.sh` scripts** if they duplicate R logic
6. **Document data flow** in `system_architecture/system_architecture.md`
---
## Questions for Clarification
Before implementing, please clarify:
1. **Data source split**: Why does angata use `merged_tif_8b` (8-band with cloud mask) while aura uses `merged_tif` (4-band)? Is this:
- A function of client need (cane_supply requires cloud masking)?
- Historical (legacy decision for angata)?
- Should new projects choose based on availability?
2. **Mosaic mode split**: Why tiled for angata but single-file for aura? Should this be:
- Hard-coded per project?
- Based on field count/client type?
- Auto-detected from first run?
3. **Production vs local**: Are the `.sh` scripts in the online environment:
- Calling this same R pipeline?
- Duplicating logic independently?
- A different workflow entirely?
4. **Client type growth**: Are there other client types planned beyond `cane_supply` and `agronomic_support`? (e.g., extension_service?)
---
**Report prepared**: January 29, 2026
**Total code reviewed**: ~2,500 lines across 10 files
**Estimated refactoring time**: 3-4 hours
**Estimated maintenance savings**: 5-10 hours/month (fewer bugs, easier updates)