
# SmartCane Pipeline Code Review
## Efficiency, Cleanup, and Architecture Analysis
**Date**: January 29, 2026
**Scope**: `run_full_pipeline.R` + all called scripts (10, 20, 21, 30, 31, 40, 80, 90, 91) + utility files
**Status**: Comprehensive review completed
---
## EXECUTIVE SUMMARY
Your pipeline is **well-structured and intentional**, but has accumulated significant technical debt through development iterations. The main issues are:
1. **🔴 HIGH IMPACT**: **3 separate mosaic mode detection functions** doing identical work
2. **🔴 HIGH IMPACT**: **Week/year calculations duplicated 13+ times** across 6+ files
3. **🟡 MEDIUM IMPACT**: **40+ debug statements** cluttering output
4. **🟡 MEDIUM IMPACT**: **File existence checks repeated** in multiple places (especially KPI checks)
5. **🟢 LOW IMPACT**: Minor redundancy in command construction, but manageable
**Estimated cleanup effort**: 2-3 hours for core refactoring; significant code quality gains.
**Workflow clarity issue**: The split between `merged_tif` vs `merged_tif_8b` and `weekly_mosaic` vs `weekly_tile_max` is **not clearly documented**. This should be clarified.
---
## 1. DUPLICATED FUNCTIONS & LOGIC
### 1.1 Mosaic Mode Detection (CRITICAL REDUNDANCY)
**Problem**: Three identical implementations of `detect_mosaic_mode()`:
| Location | Function Name | Lines | Issue |
|----------|---------------|-------|-------|
| `run_full_pipeline.R` | `detect_mosaic_mode_early()` | ~20 lines | Detects tiled vs single-file |
| `run_full_pipeline.R` | `detect_mosaic_mode_simple()` | ~20 lines | Detects tiled vs single-file (duplicate) |
| `parameters_project.R` | `detect_mosaic_mode()` | ~30 lines | Detects tiled vs single-file (different signature) |
**Impact**: If you change the detection logic, you must update 3 places. Bug risk is high.
**Solution**: Create **single canonical function in `parameters_project.R`**:
```r
# SINGLE SOURCE OF TRUTH
detect_mosaic_mode <- function(project_dir) {
  weekly_tile_max <- file.path("laravel_app", "storage", "app", project_dir, "weekly_tile_max")
  if (dir.exists(weekly_tile_max)) {
    subfolders <- list.dirs(weekly_tile_max, full.names = FALSE, recursive = FALSE)
    if (length(grep("^\\d+x\\d+$", subfolders)) > 0) return("tiled")
  }

  weekly_mosaic <- file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
  if (dir.exists(weekly_mosaic) &&
      length(list.files(weekly_mosaic, pattern = "^week_.*\\.tif$")) > 0) {
    return("single-file")
  }

  return("unknown")
}
```
Then replace all three calls in `run_full_pipeline.R` with this single function.
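The tile-grid regex above matches grid folder names like `5x5` or `10x10` and nothing else; a quick self-contained check (the subfolder names here are made up for illustration):

```r
# Hypothetical subfolder listing - only NxM grid names should match
subfolders <- c("5x5", "10x10", "archive", "5x5_old")
grep("^\\d+x\\d+$", subfolders, value = TRUE)
# "5x5" "10x10"
```

The `^` and `$` anchors matter: without them, `"5x5_old"` would also match and a stale folder could flip the pipeline into tiled mode.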
---
### 1.2 Week/Year Calculations (CRITICAL REDUNDANCY)
**Problem**: The pattern `week_num <- as.numeric(format(..., "%V"))` + `year_num <- as.numeric(format(..., "%G"))` appears **13+ times** across multiple files.
**Locations**:
- `run_full_pipeline.R`: Lines 82, 126-127, 229-230, 630, 793-794 (5 times)
- `80_calculate_kpis.R`: Lines 323-324 (1 time)
- `80_weekly_stats_utils.R`: Lines 829-830 (1 time)
- `kpi_utils.R`: Line 45 (1 time)
- `80_kpi_utils.R`: Lines 177-178 (1 time)
- Plus inline in sprintf statements: ~10+ additional times
**Impact**:
- High maintenance burden
- Risk of inconsistency (%V vs %Y confusion noted at line 82 in `run_full_pipeline.R`)
- Code verbosity
**Solution**: Create **utility function in `parameters_project.R`**:
```r
get_iso_week_year <- function(date) {
  list(
    week = as.numeric(format(date, "%V")),
    year = as.numeric(format(date, "%G"))  # ISO year, not calendar year
  )
}

# Usage:
wwy <- get_iso_week_year(end_date)
cat(sprintf("Week %02d/%d\n", wwy$week, wwy$year))
```
**Also add a convenience function**:
```r
format_week_year <- function(date, separator = "_") {
  wwy <- get_iso_week_year(date)
  sprintf("week_%02d%s%d", wwy$week, separator, wwy$year)
}

# Usage: format_week_year(end_date)  # "week_02_2026"
```
---
### 1.3 File Path Construction (MEDIUM REDUNDANCY)
**Problem**: Repeated patterns like:
```r
file.path("laravel_app", "storage", "app", project_dir, "weekly_mosaic")
file.path("laravel_app", "storage", "app", project_dir, "reports", "kpis", kpi_subdir)
```
**Solution**: Centralize in `parameters_project.R`:
```r
# Project-agnostic path builders
get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
}

get_mosaic_dir <- function(project_dir, mosaic_mode = "auto") {
  if (mosaic_mode == "auto") mosaic_mode <- detect_mosaic_mode(project_dir)
  if (mosaic_mode == "tiled") {
    get_project_storage_path(project_dir, "weekly_tile_max/5x5")
  } else {
    get_project_storage_path(project_dir, "weekly_mosaic")
  }
}

get_kpi_dir <- function(project_dir, client_type) {
  subdir <- if (client_type == "agronomic_support") "field_level" else "field_analysis"
  get_project_storage_path(project_dir, file.path("reports", "kpis", subdir))
}
```
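A quick usage sketch of the base path builder (the project name `angata` is used for illustration; note that `file.path()` joins with `/` even on Windows, which R accepts):

```r
get_project_storage_path <- function(project_dir, subdir = NULL) {
  base <- file.path("laravel_app", "storage", "app", project_dir)
  if (!is.null(subdir)) file.path(base, subdir) else base
}

get_project_storage_path("angata")
# "laravel_app/storage/app/angata"
get_project_storage_path("angata", "weekly_mosaic")
# "laravel_app/storage/app/angata/weekly_mosaic"
```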
---
## 2. DEBUG STATEMENTS & LOGGING CLUTTER
### 2.1 Excessive Debug Output
The pipeline prints **40+ debug statements** that pollute the terminal output. Examples:
**In `run_full_pipeline.R`**:
```r
Line 82:  cat(sprintf("  Running week: %02d / %d\n", ...))  # Note: format() uses %Y (calendar year); should be %G
Line 218: cat(sprintf("[KPI_DIR_CREATED] Created directory: %s\n", ...))
Line 223: cat(sprintf("[KPI_DIR_EXISTS] %s\n", ...))
Line 224: cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", ...))
Line 225: cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
Line 240: cat(sprintf("[KPI_DEBUG_W%02d_%d] Pattern: '%s' | Found: %d files\n", ...))
Line 630: cat("DEBUG: Running command:", cmd, "\n")  # Prints the full conda command for Script 31
```
**In `80_calculate_kpis.R`**:
```
Line 323: message(paste("Calculating statistics for all fields - Week", week_num, year))
Line 417: # Plus many more ...
```
**Impact**:
- Makes output hard to scan for real issues
- Developers learn to tune out the output and miss important messages
- Production logs become noise
**Solution**: Replace with **structured logging** (3 levels):
```r
# Add to parameters_project.R
smartcane_log <- function(message, level = "INFO") {
  timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  prefix <- sprintf("[%s] %s", level, timestamp)
  cat(sprintf("%s | %s\n", prefix, message))
}

smartcane_debug <- function(message) {
  if (Sys.getenv("SMARTCANE_DEBUG") == "TRUE") {
    smartcane_log(message, level = "DEBUG")
  }
}

smartcane_warn <- function(message) {
  smartcane_log(message, level = "WARN")
}
```
**Usage**:
```r
# Keep important messages
smartcane_log(sprintf("Downloaded %d dates, %d failed", download_count, download_failed))
# Hide debug clutter (only show if DEBUG=TRUE)
smartcane_debug(sprintf("KPI directory exists: %s", kpi_dir))
# Warnings stay visible
smartcane_warn("Some downloads failed, but continuing pipeline")
```
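Turning debug output on would then be a single environment variable (a sketch; `SMARTCANE_DEBUG` is the name proposed above, not an existing setting):

```r
# In .Renviron, a cron wrapper, or at the top of a session:
Sys.setenv(SMARTCANE_DEBUG = "TRUE")

# From now on smartcane_debug() messages are printed;
# unset (or set to anything else) to silence them again:
Sys.unsetenv("SMARTCANE_DEBUG")
```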
---
### 2.2 Redundant Status Checks in KPI Section
**Lines 218-270 in `run_full_pipeline.R`**: The KPI requirement check has **deeply nested debug statements**.
```r
if (dir.exists(kpi_dir)) {
  cat(sprintf("[KPI_DIR_EXISTS] %s\n", kpi_dir))
  all_kpi_files <- list.files(kpi_dir)
  cat(sprintf("[KPI_DEBUG] Total files in directory: %d\n", length(all_kpi_files)))
  if (length(all_kpi_files) > 0) {
    cat(sprintf("[KPI_DEBUG] Sample files: %s\n", ...))
  }
} else {
  cat(sprintf("[KPI_DIR_MISSING] Directory does not exist: %s\n", kpi_dir))
}
```
**Solution**: Simplify to:
```r
if (!dir.exists(kpi_dir)) {
  dir.create(kpi_dir, recursive = TRUE, showWarnings = FALSE)
}
all_kpi_files <- list.files(kpi_dir)
smartcane_debug(sprintf("KPI directory: %d files found", length(all_kpi_files)))
```
---
## 3. DOUBLE CALCULATIONS & INEFFICIENCIES
### 3.1 KPI Existence Check (Calculated Twice)
**Problem**: KPI existence is checked **twice** in `run_full_pipeline.R`:
1. **First check (Lines 228-270)**: Initial KPI requirement check that calculates `kpis_needed` dataframe
2. **Second check (Lines 786-810)**: Verification after Script 80 runs (almost identical logic)
Both loops do:
```r
for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  week_num <- as.numeric(format(check_date, "%V"))
  year_num <- as.numeric(format(check_date, "%G"))
  week_pattern <- sprintf("week%02d_%d", week_num, year_num)
  kpi_files_this_week <- list.files(kpi_dir, pattern = week_pattern)
  has_kpis <- length(kpi_files_this_week) > 0
  # ... same logic again
}
```
**Impact**: Slower pipeline execution, code duplication
**Solution**: Create **reusable function in utility file**:
```r
check_kpi_completeness <- function(project_dir, client_type, end_date, reporting_weeks_needed) {
  kpi_dir <- get_kpi_dir(project_dir, client_type)
  kpis_needed <- data.frame()

  for (weeks_back in 0:(reporting_weeks_needed - 1)) {
    check_date <- end_date - (weeks_back * 7)
    wwy <- get_iso_week_year(check_date)
    week_pattern <- sprintf("week%02d_%d", wwy$week, wwy$year)
    has_kpis <- any(grepl(week_pattern, list.files(kpi_dir)))

    kpis_needed <- rbind(kpis_needed, data.frame(
      week = wwy$week,
      year = wwy$year,
      date = check_date,
      has_kpis = has_kpis
    ))
  }

  return(list(
    kpis_df = kpis_needed,
    missing_count = sum(!kpis_needed$has_kpis),
    all_complete = all(kpis_needed$has_kpis)
  ))
}

# Then in run_full_pipeline.R:
initial_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)

# ... after Script 80 runs:
final_kpi_check <- check_kpi_completeness(project_dir, client_type, end_date, reporting_weeks_needed)
if (final_kpi_check$all_complete) {
  smartcane_log("✓ All KPIs available")
}
```
---
### 3.2 Mosaic Mode Detection (Called Twice per Run)
**Current code**:
- Line 99-117: `detect_mosaic_mode_early()` called once
- Line 301-324: `detect_mosaic_mode_simple()` called again
- Result: **Same detection logic runs twice unnecessarily**
**Solution**: Call once, store result:
```r
mosaic_mode <- detect_mosaic_mode(project_dir)  # Once at top

# Then reuse throughout (note: in R, `else` must follow `}` on the same line):
if (mosaic_mode == "tiled") {
  ...
} else if (mosaic_mode == "single-file") {
  ...
}
```
---
### 3.3 Missing Weeks Calculation Inefficiency
**Lines 126-170**: The loop builds `weeks_needed` dataframe, then **immediately** iterates again to find which ones are missing.
**Current code**:
```r
# First: build all weeks
weeks_needed <- data.frame()
for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  # ... build weeks_needed
}

# Then: check which are missing (loop again)
missing_weeks <- data.frame()
for (i in 1:nrow(weeks_needed)) {
  # ... check each week
}
```
**Solution**: Combine into **single loop**:
```r
weeks_needed <- data.frame()
missing_weeks <- data.frame()
earliest_missing_date <- end_date
mosaic_dir <- get_mosaic_dir(project_dir, mosaic_mode)  # Hoisted: constant across iterations

for (weeks_back in 0:(reporting_weeks_needed - 1)) {
  check_date <- end_date - (weeks_back * 7)
  wwy <- get_iso_week_year(check_date)

  # Add to weeks_needed
  weeks_needed <- rbind(weeks_needed, data.frame(
    week = wwy$week, year = wwy$year, date = check_date
  ))

  # Check if missing, add to missing_weeks if so
  week_pattern <- sprintf("week_%02d_%d", wwy$week, wwy$year)
  if (length(list.files(mosaic_dir, pattern = week_pattern)) == 0) {
    missing_weeks <- rbind(missing_weeks, data.frame(
      week = wwy$week, year = wwy$year, week_end_date = check_date
    ))
    if (check_date - 6 < earliest_missing_date) {
      earliest_missing_date <- check_date - 6
    }
  }
}
```
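If profiling ever shows this loop matters, the same result can be computed with one directory listing and no `rbind` inside a loop (growing a data frame with `rbind` is quadratic). A self-contained sketch; `end_date`, `reporting_weeks_needed`, and `mosaic_dir` carry illustrative placeholder values here:

```r
end_date <- as.Date("2026-01-25")   # illustrative
reporting_weeks_needed <- 4         # illustrative
mosaic_dir <- tempdir()             # illustrative; normally get_mosaic_dir(...)

# Vectorized week/year construction - one data.frame call, no loop
check_dates <- end_date - (0:(reporting_weeks_needed - 1)) * 7
weeks_needed <- data.frame(
  week = as.numeric(format(check_dates, "%V")),
  year = as.numeric(format(check_dates, "%G")),
  date = check_dates
)

# One directory listing, reused for every week
existing <- list.files(mosaic_dir)
patterns <- sprintf("week_%02d_%d", weeks_needed$week, weeks_needed$year)
weeks_needed$has_mosaic <- vapply(
  patterns, function(p) any(grepl(p, existing, fixed = TRUE)), logical(1)
)
missing_weeks <- weeks_needed[!weeks_needed$has_mosaic, ]
```

Besides the asymptotics, calling `list.files()` once instead of once per week avoids repeated directory I/O, which dominates on network storage.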
---
### 3.4 Data Source Detection Logic
**Lines 58-84**: The `data_source_used` detection is overly complex:
```r
data_source_used <- "merged_tif_8b"  # Default
if (dir.exists(merged_tif_path)) {
  tif_files <- list.files(merged_tif_path, pattern = "\\.tif$")
  if (length(tif_files) > 0) {
    data_source_used <- "merged_tif"
    # ...
  } else if (dir.exists(merged_tif_8b_path)) {
    tif_files_8b <- list.files(merged_tif_8b_path, pattern = "\\.tif$")
    # ...
  }
} else if (dir.exists(merged_tif_8b_path)) {
  # ...
}
```
**Issues**:
- Multiple nested conditions doing the same check
- `tif_files` and `tif_files_8b` are listed, but only their lengths are checked; the file lists themselves are never used
- Logic could be cleaner
**Solution**: Create utility function:
```r
detect_data_source <- function(project_dir) {
  storage_dir <- get_project_storage_path(project_dir)
  for (source in c("merged_tif", "merged_tif_8b")) {
    source_dir <- file.path(storage_dir, source)
    if (dir.exists(source_dir)) {
      tifs <- list.files(source_dir, pattern = "\\.tif$")
      if (length(tifs) > 0) return(source)
    }
  }
  smartcane_warn("No data source found - defaulting to merged_tif_8b")
  return("merged_tif_8b")
}
```
---
## 4. WORKFLOW CLARITY ISSUES
### 4.1 TIFF Data Format Confusion
**Problem**: Why are there TWO different TIFF folders?
- `merged_tif`: 4-band data (RGB + NIR)
- `merged_tif_8b`: 8-band data (appears to include UDM cloud masking from Planet)
**Currently in code**:
```r
data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
```
**Issues**:
- Hard-coded per project, not based on what's actually available
- Not documented **why** angata uses 8-band
- Unclear what the 8-band data adds (cloud masking? extra bands?)
- Scripts handle both, but it's not clear when to use which
**Recommendation**:
1. **Document in `parameters_project.R`** what each data source contains:
```r
DATA_SOURCE_FORMATS <- list(
  "merged_tif" = list(
    bands = 4,
    description = "4-band PlanetScope: Red, Green, Blue, NIR",
    projects = c("aura", "chemba", "xinavane"),
    note = "Standard format from Planet API"
  ),
  "merged_tif_8b" = list(
    bands = 8,
    description = "8-band PlanetScope with UDM: RGB+NIR + 4-band cloud mask",
    projects = c("angata"),
    note = "Enhanced with cloud confidence from UDM2 (Unusable Data Mask)"
  )
)
```
2. **Update hard-coded assignment** to be data-driven:
```r
# OLD: data_source <- if (project_dir == "angata") "merged_tif_8b" else "merged_tif"
# NEW: detect what's actually available
data_source <- detect_data_source(project_dir)
```
---
### 4.2 Mosaic Storage Format Confusion
**Problem**: Why are there TWO different mosaic storage styles?
- `weekly_mosaic/`: Single TIF file per week (monolithic)
- `weekly_tile_max/5x5/`: Tiled TIFFs per week (25+ files per week)
**Currently in code**:
- Detected automatically via `detect_mosaic_mode()`
- But **no documentation** on when/why each is used
**Recommendation**:
1. **Document the trade-offs in `parameters_project.R`**:
```r
MOSAIC_MODES <- list(
  "single-file" = list(
    description = "One TIF per week",
    storage_path = "weekly_mosaic/",
    files_per_week = 1,
    pros = c("Simpler file management", "Easier to load full mosaic"),
    cons = c("Slower for field-specific analysis", "Large file I/O"),
    suitable_for = c("agronomic_support", "dashboard visualization")
  ),
  "tiled" = list(
    description = "5×5 grid of tiles per week",
    storage_path = "weekly_tile_max/5x5/",
    files_per_week = 25,
    pros = c("Parallel field processing", "Faster per-field queries", "Scalable to 1000+ fields"),
    cons = c("More file management", "Requires tile_grid metadata"),
    suitable_for = c("cane_supply", "large-scale operations")
  )
)
```
2. **Document why angata uses tiled, aura uses single-file**:
- Is it a function of field count? (Angata = cane_supply, large fields → tiled)
- Is it historical? (Legacy decision?)
- Should new projects choose based on client type?
---
### 4.3 Client Type Mapping Clarity
**Current structure** in `parameters_project.R`:
```r
CLIENT_TYPE_MAP <- list(
  "angata"   = "cane_supply",
  "aura"     = "agronomic_support",
  "chemba"   = "cane_supply",
  "xinavane" = "cane_supply",
  "esa"      = "cane_supply"
)
```
**Issues**:
- Not clear **why** aura is agronomic_support while angata/chemba are cane_supply
- No documentation of what each client type needs
- Scripts branch heavily on `skip_cane_supply_only` logic
**Recommendation**:
Add metadata to explain the distinction:
```r
CLIENT_TYPES <- list(
  "cane_supply" = list(
    description = "Sugar mill supply chain optimization",
    requires_harvest_prediction = TRUE,  # Script 31
    requires_phase_assignment = TRUE,    # Based on planting date
    per_field_detail = TRUE,             # Script 91 Excel report
    data_sources = c("merged_tif", "merged_tif_8b"),
    mosaic_mode = "tiled",
    projects = c("angata", "chemba", "xinavane", "esa")
  ),
  "agronomic_support" = list(
    description = "Farm-level decision support for agronomists",
    requires_harvest_prediction = FALSE,
    requires_phase_assignment = FALSE,
    per_field_detail = FALSE,
    farm_level_kpis = TRUE,              # Script 90 Word report
    data_sources = c("merged_tif"),
    mosaic_mode = "single-file",
    projects = c("aura")
  )
)
```
---
## 5. COMMAND CONSTRUCTION REDUNDANCY
### 5.1 Rscript Path Repetition
**Problem**: The Rscript path is repeated 5 times:
```r
Line 519: '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
Line 676: '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
Line 685: '"C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"'
```
**Solution**: Define once in `parameters_project.R`:
```r
RSCRIPT_PATH <- "C:\\Program Files\\R\\R-4.4.3\\bin\\x64\\Rscript.exe"
# Usage:
cmd <- sprintf('"%s" --vanilla r_app/20_ci_extraction.R ...', RSCRIPT_PATH)
```
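A more portable alternative (a sketch; it assumes the child processes should run under the same R installation as the pipeline itself) derives the path from the running session instead of hard-coding the version:

```r
# R.home("bin") points at the bin directory of the running R installation,
# so upgrading R no longer requires editing the pipeline
rscript_name <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
RSCRIPT_PATH <- file.path(R.home("bin"), rscript_name)
```

This also removes the Windows-only assumption baked into the hard-coded path, which matters if the same scripts ever run in the Linux online environment discussed in Section 8.2.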
---
## 6. SPECIFIC LINE-BY-LINE ISSUES
### 6.1 Line 82 Bug: Wrong Format Code
```r
cat(sprintf("  Running week: %02d / %d\n",
            as.numeric(format(end_date, "%V")),
            as.numeric(format(end_date, "%Y"))))  # ❌ Should be %G, not %Y
```
**Issue**: Uses calendar year `%Y` instead of ISO week year `%G`. On dates like 2025-12-30 (week 1 of 2026), this will print "Week 01 / 2025" (confusing).
**Fix**:
```r
wwy <- get_iso_week_year(end_date)
cat(sprintf(" Running week: %02d / %d\n", wwy$week, wwy$year))
```
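The discrepancy is easy to demonstrate (self-contained; `%V` and `%G` are supported by R's own `format.Date` on all platforms):

```r
d <- as.Date("2025-12-30")  # a Tuesday in ISO week 1 of 2026
format(d, "%V")  # "01"   - ISO week number
format(d, "%Y")  # "2025" - calendar year: the misleading combination
format(d, "%G")  # "2026" - ISO week-based year: matches the week number
```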
---
### 6.2 Line 630 Debug Statement
```r
cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
cat("DEBUG: Running command:", cmd, "\n") # ❌ Prints full conda command
```
**Solution**: Use `smartcane_debug()` function:
```r
cmd <- sprintf('conda run -n pytorch_gpu python python_app/31_harvest_imminent_weekly.py %s', project_dir)
smartcane_debug(sprintf("Running Python 31: %s", cmd))
```
---
### 6.3 Lines 719-723: Verbose Script 31 Verification
```r
# Check for THIS WEEK's specific file
current_week <- as.numeric(format(end_date, "%V"))
current_year <- as.numeric(format(end_date, "%Y"))
expected_file <- file.path(...)
```
**Issue**: Calculates week twice (already done earlier). Also uses `%Y` (should be `%G`).
**Solution**: Reuse earlier `wwy` calculation or create helper.
---
## 7. REFACTORING ROADMAP
### Phase 1: Foundation (1 hour)
- [ ] Consolidate `detect_mosaic_mode()` into single function in `parameters_project.R`
- [ ] Create `get_iso_week_year()` and `format_week_year()` utilities
- [ ] Create `get_project_storage_path()`, `get_mosaic_dir()`, `get_kpi_dir()` helpers
- [ ] Add logging functions (`smartcane_log()`, `smartcane_debug()`, `smartcane_warn()`)
### Phase 2: Deduplication (1 hour)
- [ ] Replace all 13+ week_num/year_num calculations with `get_iso_week_year()`
- [ ] Replace all 3 `detect_mosaic_mode_*()` calls with single function
- [ ] Combine duplicate KPI checks into `check_kpi_completeness()` function
- [ ] Fix line 82 and 630 format bugs
### Phase 3: Cleanup (1 hour)
- [ ] Remove all debug statements (40+), replace with `smartcane_debug()`
- [ ] Simplify nested conditions in data_source detection
- [ ] Combine missing weeks detection into single loop
- [ ] Extract Rscript path to constant
### Phase 4: Documentation (30 min)
- [ ] Add comments explaining `merged_tif` vs `merged_tif_8b` trade-offs
- [ ] Document `single-file` vs `tiled` mosaic modes and when to use each
- [ ] Clarify client type mapping in `CLIENT_TYPE_MAP`
- [ ] Add inline comments for non-obvious logic
---
## 8. ARCHITECTURE & WORKFLOW RECOMMENDATIONS
### 8.1 Clear Data Flow Diagram
Add to `r_app/system_architecture/system_architecture.md`:
```
INPUT SOURCES:
├── Planet API 4-band or 8-band imagery
├── Field boundaries (pivot.geojson)
└── Harvest data (harvest.xlsx, optional for cane_supply)
STORAGE TIERS:
├── Tier 1: Raw data (merged_tif/ or merged_tif_8b/)
├── Tier 2: Daily tiles (daily_tiles_split/{grid_size}/{dates}/)
├── Tier 3: Extracted CI (Data/extracted_ci/daily_vals/*.rds)
├── Tier 4: Weekly mosaics (weekly_mosaic/ OR weekly_tile_max/5x5/)
└── Tier 5: KPI outputs (reports/kpis/{field_level|field_analysis}/)
DECISION POINTS:
└─ Client type (cane_supply vs agronomic_support)
├─ Drives script selection (Scripts 21, 22, 23, 31, 90/91)
├─ Drives data source (merged_tif_8b for cane_supply, merged_tif for agronomic)
├─ Drives mosaic mode (tiled for cane_supply, single-file for agronomic)
└─ Drives KPI subdirectory (field_analysis vs field_level)
```
### 8.2 .sh Scripts Alignment
You mention `.sh` scripts in the online environment. If they're **not calling the R pipeline**, there's a **split responsibility** issue:
**Question**: Are the `.sh` scripts:
- (A) Independent duplicates of the R pipeline logic? (BAD - maintenance nightmare)
- (B) Wrappers calling the R pipeline? (GOOD - single source of truth)
- (C) Different workflow for online vs local? (RED FLAG - they diverge)
**Recommendation**: If using `.sh` for production, ensure they **call the same R scripts** (`run_full_pipeline.R`). Example:
```bash
#!/bin/bash
# Wrapper that keeps the R pipeline as the single source of truth
set -euo pipefail
cd /path/to/smartcane
Rscript r_app/run_full_pipeline.R "$@"
```
---
## 9. SUMMARY TABLE: Issues by Severity
| Issue | Type | Impact | Effort | Priority |
|-------|------|--------|--------|----------|
| 3 mosaic detection functions | Duplication | HIGH | 30 min | P0 |
| 13+ week/year calculations | Duplication | HIGH | 1 hour | P0 |
| 40+ debug statements | Clutter | MEDIUM | 1 hour | P1 |
| KPI check run twice | Inefficiency | LOW | 30 min | P2 |
| Line 82: %Y should be %G | Bug | LOW | 5 min | P2 |
| Data source confusion | Documentation | MEDIUM | 30 min | P1 |
| Mosaic mode confusion | Documentation | MEDIUM | 30 min | P1 |
| Client type mapping | Documentation | MEDIUM | 30 min | P1 |
| Data source detection complexity | Code style | LOW | 15 min | P3 |
---
## 10. RECOMMENDED NEXT STEPS
1. **Review this report** with your team to align on priorities
2. **Create Linear issues** for each phase of refactoring
3. **Start with Phase 1** (foundation utilities) - builds confidence for Phase 2
4. **Test thoroughly** after each phase - the pipeline is complex and easy to break
5. **Update `.sh` scripts** if they duplicate R logic
6. **Document data flow** in `system_architecture/system_architecture.md`
---
## Questions for Clarification
Before implementing, please clarify:
1. **Data source split**: Why does angata use `merged_tif_8b` (8-band with cloud mask) while aura uses `merged_tif` (4-band)? Is this:
- A function of client need (cane_supply requires cloud masking)?
- Historical (legacy decision for angata)?
- Should new projects choose based on availability?
2. **Mosaic mode split**: Why tiled for angata but single-file for aura? Should this be:
- Hard-coded per project?
- Based on field count/client type?
- Auto-detected from first run?
3. **Production vs local**: Are the `.sh` scripts in the online environment:
- Calling this same R pipeline?
- Duplicating logic independently?
- A different workflow entirely?
4. **Client type growth**: Are there other client types planned beyond `cane_supply` and `agronomic_support`? (e.g., extension_service?)
---
**Report prepared**: January 29, 2026
**Total code reviewed**: ~2,500 lines across 10 files
**Estimated refactoring time**: 3-4 hours
**Estimated maintenance savings**: 5-10 hours/month (fewer bugs, easier updates)