SmartCane/python_app/harvest_detection_experiments/experiment_framework/README.md

Harvest Detection Experiment Framework

Systematic experimentation framework for harvest detection using LSTM/GRU models with comprehensive feature engineering and automated result tracking.

Overview

This framework enables systematic, reproducible experiments for optimizing harvest detection models. It separates concerns:

  • Configuration (YAML files) - Define experiments without touching code
  • Execution (Python scripts) - Automated training, evaluation, comparison
  • Results (Organized folders) - All metrics, models, and plots saved automatically

Quick Start

1. Run a Single Experiment

cd experiment_framework
python run_experiment.py --exp exp_001

This will:

  • Load data from lstm_complete_data.csv
  • Extract features defined in config/experiments.yaml
  • Train with 5-fold cross-validation
  • Evaluate on held-out test set
  • Save all results to results/001_trends_only/

2. Run Multiple Experiments (Batch)

python run_experiment.py --exp exp_001,exp_002,exp_003

Runs experiments 001, 002, and 003 sequentially.
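
The actual argument handling in run_experiment.py may differ; as a minimal sketch of how a comma-separated `--exp` value could be split into individual experiment IDs (function name hypothetical):

```python
import argparse

def parse_exp_ids(raw):
    """Split a comma-separated --exp value into individual experiment IDs."""
    return [exp_id.strip() for exp_id in raw.split(",") if exp_id.strip()]

parser = argparse.ArgumentParser(description="Run one or more experiments")
parser.add_argument("--exp", required=True,
                    help="Experiment ID(s), e.g. exp_001 or exp_001,exp_002")

# Passing args explicitly here so the sketch runs standalone
args = parser.parse_args(["--exp", "exp_001,exp_002,exp_003"])

for exp_id in parse_exp_ids(args.exp):
    print(exp_id)  # each experiment would be run here, in order
```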

3. Compare All Results

python analyze_results.py --experiments all --rank-by imminent_auc

This generates:

  • results/comparison_table.csv - Sortable metrics table
  • results/comparison_imminent_auc.png - Bar chart of AUC scores
  • results/comparison_all_metrics.png - Multi-metric comparison

4. Find Top Performers

python analyze_results.py --rank-by imminent_auc --top 3

Shows the top 3 experiments ranked by imminent AUC.

Project Structure

experiment_framework/
├── config/
│   └── experiments.yaml          # All experiment configurations
├── src/
│   ├── data_loader.py            # Data loading & preprocessing
│   ├── feature_engineering.py    # 25-feature extraction system
│   ├── models.py                 # LSTM/GRU architectures
│   ├── training.py               # K-fold CV training engine
│   └── evaluation.py             # Metrics & visualization
├── run_experiment.py             # Main execution script
├── analyze_results.py            # Comparison dashboard
└── results/                      # Auto-generated results
    ├── 001_trends_only/
    │   ├── config.json           # Exact config used
    │   ├── model.pt              # Trained weights
    │   ├── metrics.json          # All metrics
    │   ├── training_curves.png   # Loss curves
    │   ├── roc_curves.png        # ROC plots
    │   └── confusion_matrices.png
    └── comparison/
        ├── comparison_table.csv
        └── comparison_*.png

Phase 1 Experiments (Feature Selection)

Goal: Identify which feature types improve harvest detection most.

| Exp ID | Features                  | Count | Purpose                 |
|--------|---------------------------|-------|-------------------------|
| 001    | CI, 7d_MA, 14d_MA, 21d_MA | 4     | Baseline (trends only)  |
| 002    | 001 + velocities          | 7     | Add rate of change      |
| 003    | 002 + accelerations       | 10    | Add momentum            |
| 004    | 001 + mins                | 7     | Add structural lows     |
| 005    | 001 + maxs                | 7     | Add structural highs    |
| 006    | 001 + ranges              | 7     | Add volatility          |
| 007    | 001 + stds                | 7     | Add noise indicators    |
| 008    | 001 + CVs                 | 7     | Add relative stability  |
| 009    | Trends + vel + mins + std | 13    | Combined best features  |
| 010    | All 25 features           | 25    | Full feature set        |

All experiments use:

  • Model: LSTM, hidden_size=128, num_layers=1, dropout=0.5
  • Window: 28-1 days before harvest
  • Training: 5-fold CV, 150 epochs, early stopping (patience=20)
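
The early-stopping behaviour above (patience=20) can be sketched as follows; the actual loop in src/training.py may differ, and `val_losses` here stands in for one validation loss per epoch:

```python
def train_with_early_stopping(val_losses, patience=20, max_epochs=150):
    """Return the epoch at which training stops.

    Stops when validation loss has not improved for `patience`
    consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch  # early stop: no improvement for `patience` epochs
    return min(len(val_losses), max_epochs)  # ran to completion
```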

Feature Engineering System

25 Total Features (All Causal/Operational)

Tier 1: State (4)

  • CI_raw, 7d_MA, 14d_MA, 21d_MA

Tier 2: Velocity (3)

  • 7d_velocity, 14d_velocity, 21d_velocity

Tier 3: Acceleration (3)

  • 7d_acceleration, 14d_acceleration, 21d_acceleration

Tier 4: Structural (9)

  • Min: 7d_min, 14d_min, 21d_min
  • Max: 7d_max, 14d_max, 21d_max
  • Range: 7d_range, 14d_range, 21d_range

Tier 5: Stability (6)

  • Std: 7d_std, 14d_std, 21d_std
  • CV: 7d_CV, 14d_CV, 21d_CV

All features use backward-looking rolling windows (causal) for operational deployment.
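
To illustrate what "backward-looking" means here, a dependency-free sketch of a causal rolling mean mirroring the `min_periods=1` semantics of the pandas calls in src/feature_engineering.py (helper name hypothetical):

```python
def causal_rolling_mean(values, window):
    """Backward-looking rolling mean: each output uses only the current
    value and up to `window - 1` earlier ones (min_periods=1 semantics),
    so no future information leaks into the feature."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window + 1):i + 1]  # current + prior days only
        out.append(sum(past) / len(past))
    return out

# Example: a 3-day moving average over a short CI series
ci = [10, 12, 11, 15]
print(causal_rolling_mean(ci, window=3))
```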

Output Metrics

Cross-Validation (K-Fold)

  • Imminent AUC (mean ± std across folds)
  • Detected AUC (mean ± std across folds)

Test Set (Held-Out 15%)

  • Imminent: AUC, F1, Precision, Recall
  • Detected: AUC, F1, Precision, Recall
  • Total predictions (timesteps)
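
The threshold metrics above follow the standard definitions; a minimal sketch for reference (the framework itself presumably computes these via scikit-learn):

```python
def binary_prf(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = harvest event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```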

Visualizations Per Experiment

  • Training/validation loss curves (all folds)
  • ROC curves (imminent + detected)
  • Confusion matrices (imminent + detected)

Customization

Add New Experiment

Edit config/experiments.yaml:

exp_011:
  name: "011_my_custom_experiment"
  description: "Testing something new"
  features:
    - CI_raw
    - 7d_MA
    - 7d_velocity
  model:
    type: LSTM  # or GRU
    hidden_size: 256
    num_layers: 2
    dropout: 0.6
  training:
    imminent_days_before: 30
    imminent_days_before_end: 1
    k_folds: 5
    num_epochs: 200
    # ... other params

Then run:

python run_experiment.py --exp exp_011

Add New Feature

Edit src/feature_engineering.py, add to compute_feature():

elif feature_name == '30d_MA':
    return ci_series.rolling(window=30, min_periods=1, center=False).mean().values

Then use in experiment config.

Workflow Recommendations

1. Feature Selection (Phase 1)

# Run all Phase 1 experiments
python run_experiment.py --exp exp_001,exp_002,exp_003,exp_004,exp_005,exp_006,exp_007,exp_008,exp_009,exp_010

# Compare results
python analyze_results.py --experiments all --rank-by imminent_auc

Expected Time: ~30-60 minutes per experiment on GPU (5-fold CV × 150 epochs)

2. Identify Best Features

# Show top 3
python analyze_results.py --rank-by imminent_auc --top 3

Decision: Choose feature set with highest test AUC that generalizes well (CV AUC ≈ test AUC).
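
The "generalizes well" check can be automated; a sketch that ranks experiments by test AUC and flags a large CV-vs-test gap (metric key names and the numbers below are illustrative assumptions, not the framework's actual schema):

```python
def rank_experiments(results, gap_threshold=0.05):
    """Sort experiments by test imminent AUC (descending) and flag those
    whose CV AUC exceeds test AUC by more than `gap_threshold` - a sign
    of overfitting. `results` maps experiment name -> metrics dict."""
    ranked = sorted(results.items(),
                    key=lambda kv: kv[1]["test_imminent_auc"], reverse=True)
    return [(name, m["test_imminent_auc"],
             m["cv_imminent_auc"] - m["test_imminent_auc"] > gap_threshold)
            for name, m in ranked]

# Illustrative placeholder values, not real results
results = {
    "001_trends_only": {"cv_imminent_auc": 0.82, "test_imminent_auc": 0.80},
    "010_all_features": {"cv_imminent_auc": 0.91, "test_imminent_auc": 0.78},
}
for name, auc, overfit in rank_experiments(results):
    print(name, auc, "possible overfit" if overfit else "ok")
```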

3. Model Architecture Optimization (Phase 2)

Once the best features are identified, test different architectures:

  • Vary hidden_size: 64, 128, 256
  • Vary num_layers: 1, 2
  • Try GRU vs LSTM

4. Hyperparameter Tuning (Phase 3)

Fine-tune best model:

  • Dropout: 0.3, 0.5, 0.7
  • Learning rate: 0.0005, 0.001, 0.002
  • Window length: 21-1, 28-1, 35-1
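
The Phase 3 sweep above amounts to a small grid; a sketch generating one candidate config per combination (window lengths expressed via the `imminent_days_before` fields from the YAML example earlier; the `learning_rate` key is an assumption):

```python
from itertools import product

dropouts = [0.3, 0.5, 0.7]
learning_rates = [0.0005, 0.001, 0.002]
# (imminent_days_before, imminent_days_before_end), i.e. 21-1, 28-1, 35-1
windows = [(21, 1), (28, 1), (35, 1)]

grid = [
    {"dropout": d, "learning_rate": lr,
     "imminent_days_before": w[0], "imminent_days_before_end": w[1]}
    for d, lr, w in product(dropouts, learning_rates, windows)
]
print(len(grid))  # 3 x 3 x 3 = 27 candidate configs
```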

Tips

  • Always compare CV AUC vs test AUC - a large gap signals overfitting
  • Start with the baseline (exp_001) - it establishes minimum performance
  • Change one thing at a time - isolate the impact of features vs model vs hyperparameters
  • Check confusion matrices - understand failure modes (false positives vs false negatives)
  • Monitor training curves - early stopping firing means the model converged; long plateaus suggest more capacity is needed

Troubleshooting

CUDA out of memory:

python run_experiment.py --exp exp_001 --device cpu

Experiment not found: Check the exact experiment ID in config/experiments.yaml (IDs are case-sensitive)

Import errors: Ensure you run the scripts from the experiment_framework/ directory

Next Steps

After Phase 1 completes:

  1. Identify best feature set
  2. Configure Phase 2 experiments (model architecture) in experiments.yaml
  3. Run Phase 2, compare results
  4. Select final model for production

Requirements

  • Python 3.8+
  • PyTorch 1.10+
  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn
  • pyyaml

Install:

pip install torch scikit-learn pandas numpy matplotlib seaborn pyyaml