# Harvest Detection Experiment Framework

Systematic experimentation framework for harvest detection using LSTM/GRU models with comprehensive feature engineering and automated result tracking.

## Overview

This framework enables **systematic, reproducible experiments** for optimizing harvest detection models. It separates concerns:

- **Configuration** (YAML files) - Define experiments without touching code
- **Execution** (Python scripts) - Automated training, evaluation, comparison
- **Results** (Organized folders) - All metrics, models, and plots saved automatically

## Quick Start

### 1. Run a Single Experiment

```bash
cd experiment_framework
python run_experiment.py --exp exp_001
```

This will:

- Load data from `lstm_complete_data.csv`
- Extract features defined in `config/experiments.yaml`
- Train with 5-fold cross-validation
- Evaluate on the held-out test set
- Save all results to `results/001_trends_only/`

### 2. Run Multiple Experiments (Batch)

```bash
python run_experiment.py --exp exp_001,exp_002,exp_003
```

Runs experiments 001, 002, and 003 sequentially.

### 3. Compare All Results

```bash
python analyze_results.py --experiments all --rank-by imminent_auc
```

This generates:

- `results/comparison_table.csv` - Sortable metrics table
- `results/comparison_imminent_auc.png` - Bar chart of AUC scores
- `results/comparison_all_metrics.png` - Multi-metric comparison

### 4. Find Top Performers

```bash
python analyze_results.py --rank-by imminent_auc --top 3
```

Shows the top 3 experiments ranked by imminent AUC.
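The `--exp` flag above accepts either a single experiment ID or a comma-separated batch. As a minimal sketch of how such a CLI could split and dispatch that argument — the helper names here are illustrative, not the framework's actual internals:

```python
import argparse


def parse_exp_ids(exp_arg):
    """Split a comma-separated --exp value into individual experiment IDs."""
    return [e.strip() for e in exp_arg.split(",") if e.strip()]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run harvest detection experiments")
    parser.add_argument("--exp", required=True,
                        help="Experiment ID(s), e.g. exp_001 or exp_001,exp_002")
    parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
    args = parser.parse_args(argv)

    # Run each requested experiment sequentially, as the batch mode describes.
    for exp_id in parse_exp_ids(args.exp):
        print(f"Running {exp_id} on {args.device}")  # real script would invoke training here


if __name__ == "__main__":
    main()
```

Whitespace around the commas is tolerated, so `--exp "exp_001, exp_002"` behaves the same as the unspaced form.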
## Project Structure

```
experiment_framework/
├── config/
│   └── experiments.yaml        # All experiment configurations
├── src/
│   ├── data_loader.py          # Data loading & preprocessing
│   ├── feature_engineering.py  # 25-feature extraction system
│   ├── models.py               # LSTM/GRU architectures
│   ├── training.py             # K-fold CV training engine
│   └── evaluation.py           # Metrics & visualization
├── run_experiment.py           # Main execution script
├── analyze_results.py          # Comparison dashboard
└── results/                    # Auto-generated results
    ├── 001_trends_only/
    │   ├── config.json             # Exact config used
    │   ├── model.pt                # Trained weights
    │   ├── metrics.json            # All metrics
    │   ├── training_curves.png     # Loss curves
    │   ├── roc_curves.png          # ROC plots
    │   └── confusion_matrices.png
    └── comparison/
        ├── comparison_table.csv
        └── comparison_*.png
```

## Phase 1 Experiments (Feature Selection)

**Goal:** Identify which feature types improve harvest detection most.

| Exp ID | Features | Count | Purpose |
|--------|----------|-------|---------|
| **001** | CI, 7d_MA, 14d_MA, 21d_MA | 4 | Baseline (trends only) |
| **002** | 001 + velocities | 7 | Add rate of change |
| **003** | 002 + accelerations | 10 | Add momentum |
| **004** | 001 + mins | 7 | Add structural lows |
| **005** | 001 + maxs | 7 | Add structural highs |
| **006** | 001 + ranges | 7 | Add volatility |
| **007** | 001 + stds | 7 | Add noise indicators |
| **008** | 001 + CVs | 7 | Add relative stability |
| **009** | Trends + vel + mins + std | 13 | Combined best features |
| **010** | All 25 features | 25 | Full feature set |

**All experiments use:**

- Model: LSTM, hidden_size=128, num_layers=1, dropout=0.5
- Window: 28-1 days before harvest
- Training: 5-fold CV, 150 epochs, early stopping (patience=20)

## Feature Engineering System

### 25 Total Features (All Causal/Operational)

**Tier 1: State (4)**
- `CI_raw`, `7d_MA`, `14d_MA`, `21d_MA`

**Tier 2: Velocity (3)**
- `7d_velocity`, `14d_velocity`, `21d_velocity`

**Tier 3: Acceleration (3)**
- `7d_acceleration`, `14d_acceleration`, `21d_acceleration`

**Tier 4: Structural (9)**
- Min: `7d_min`, `14d_min`, `21d_min`
- Max: `7d_max`, `14d_max`, `21d_max`
- Range: `7d_range`, `14d_range`, `21d_range`

**Tier 5: Stability (6)**
- Std: `7d_std`, `14d_std`, `21d_std`
- CV: `7d_CV`, `14d_CV`, `21d_CV`

All features use **backward-looking rolling windows** (causal), so they can be computed in operational deployment without future data.

## Output Metrics

### Cross-Validation (K-Fold)
- Imminent AUC (mean ± std across folds)
- Detected AUC (mean ± std across folds)

### Test Set (Held-Out 15%)
- **Imminent:** AUC, F1, Precision, Recall
- **Detected:** AUC, F1, Precision, Recall
- Total predictions (timesteps)

### Visualizations Per Experiment
- Training/validation loss curves (all folds)
- ROC curves (imminent + detected)
- Confusion matrices (imminent + detected)

## Customization

### Add a New Experiment

Edit `config/experiments.yaml`:

```yaml
exp_011:
  name: "011_my_custom_experiment"
  description: "Testing something new"
  features:
    - CI_raw
    - 7d_MA
    - 7d_velocity
  model:
    type: LSTM  # or GRU
    hidden_size: 256
    num_layers: 2
    dropout: 0.6
  training:
    imminent_days_before: 30
    imminent_days_before_end: 1
    k_folds: 5
    num_epochs: 200
    # ... other params
```

Then run:

```bash
python run_experiment.py --exp exp_011
```

### Add a New Feature

Edit `src/feature_engineering.py` and add a branch to `compute_feature()`:

```python
elif feature_name == '30d_MA':
    return ci_series.rolling(window=30, min_periods=1, center=False).mean().values
```

Then reference the new feature name in an experiment config.

## Workflow Recommendations

### 1. Feature Selection (Phase 1)

```bash
# Run all Phase 1 experiments
python run_experiment.py --exp exp_001,exp_002,exp_003,exp_004,exp_005,exp_006,exp_007,exp_008,exp_009,exp_010

# Compare results
python analyze_results.py --experiments all --rank-by imminent_auc
```

**Expected time:** ~30-60 minutes per experiment on GPU (5-fold CV × 150 epochs)

### 2. Identify the Best Features

```bash
# Show top 3
python analyze_results.py --rank-by imminent_auc --top 3
```

**Decision:** Choose the feature set with the highest test AUC that also generalizes well (CV AUC ≈ test AUC).

### 3. Model Architecture Optimization (Phase 2)

Once the best features are identified, test different architectures:

- Vary `hidden_size`: 64, 128, 256
- Vary `num_layers`: 1, 2
- Try `GRU` vs `LSTM`

### 4. Hyperparameter Tuning (Phase 3)

Fine-tune the best model:

- Dropout: 0.3, 0.5, 0.7
- Learning rate: 0.0005, 0.001, 0.002
- Window length: 21-1, 28-1, 35-1

## Tips

✅ **Always compare CV AUC vs test AUC** - A large gap indicates overfitting
✅ **Start with the baseline (exp_001)** - Establishes minimum performance
✅ **Change one thing at a time** - Isolate the impact of features vs model vs hyperparameters
✅ **Check confusion matrices** - Understand failure modes (false positives vs false negatives)
✅ **Monitor training curves** - Early stopping means the model converged; a long plateau suggests it needs more capacity

## Troubleshooting

**CUDA out of memory:**

```bash
python run_experiment.py --exp exp_001 --device cpu
```

**Experiment not found:** Check the exact name in `config/experiments.yaml` (case-sensitive).

**Import errors:** Ensure you're running from the `experiment_framework/` directory.

## Next Steps

After Phase 1 completes:

1. Identify the best feature set
2. Configure Phase 2 experiments (model architecture) in `experiments.yaml`
3. Run Phase 2 and compare results
4. Select the final model for production

## Requirements

- Python 3.8+
- PyTorch 1.10+
- scikit-learn
- pandas
- numpy
- matplotlib
- seaborn
- pyyaml

Install:

```bash
pip install torch scikit-learn pandas numpy matplotlib seaborn pyyaml
```
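As a concrete illustration of the backward-looking (causal) rolling features in the Feature Engineering System section, here is a minimal pandas sketch covering one window length and a subset of the tiers. The exact velocity and CV definitions here are assumptions for illustration only; the framework's real implementations live in `src/feature_engineering.py`.

```python
import numpy as np
import pandas as pd


def compute_causal_features(ci, window=7):
    """Backward-looking rolling features for one window length.

    Illustrative sketch only: all statistics use a trailing window,
    so each row depends only on past and current values (causal).
    """
    roll = ci.rolling(window=window, min_periods=1)  # trailing window: causal
    ma = roll.mean()
    std = roll.std().fillna(0.0)  # std is NaN with a single sample; treat as 0
    return pd.DataFrame({
        f"{window}d_MA": ma,
        f"{window}d_velocity": ci.diff(window).fillna(0.0) / window,  # avg rate of change
        f"{window}d_min": roll.min(),
        f"{window}d_max": roll.max(),
        f"{window}d_range": roll.max() - roll.min(),
        f"{window}d_std": std,
        f"{window}d_CV": std / ma.replace(0, np.nan),  # relative stability
    })
```

Because `min_periods=1`, early rows are computed from however many observations exist so far rather than emitted as NaN, matching the `min_periods=1` usage shown in the Add a New Feature example.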