# Harvest Detection Experiment Framework

Systematic experimentation framework for harvest detection using LSTM/GRU models, with comprehensive feature engineering and automated result tracking.
## Overview

This framework enables systematic, reproducible experiments for optimizing harvest detection models. It separates concerns:

- **Configuration** (YAML files) - define experiments without touching code
- **Execution** (Python scripts) - automated training, evaluation, and comparison
- **Results** (organized folders) - all metrics, models, and plots saved automatically
## Quick Start

### 1. Run a Single Experiment

```bash
cd experiment_framework
python run_experiment.py --exp exp_001
```
This will:

- Load data from `lstm_complete_data.csv`
- Extract the features defined in `config/experiments.yaml`
- Train with 5-fold cross-validation
- Evaluate on the held-out test set
- Save all results to `results/001_trends_only/`
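The split scheme behind these steps (a held-out test set plus 5-fold CV on the remainder, per the "Output Metrics" section) can be sketched with scikit-learn; the array shapes and variable names here are illustrative, not the framework's actual API:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Dummy stand-ins for the real windowed sequences:
# (samples, window length in days, feature count)
np.random.seed(0)
X = np.random.rand(100, 28, 4)
y = np.random.randint(0, 2, 100)

# Held-out test set (15%), stratified on the label
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

# 5-fold CV on the remaining development data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(kf.split(X_dev)):
    X_tr, X_va = X_dev[tr_idx], X_dev[va_idx]
    # ... train on X_tr, validate on X_va ...
    print(f"fold {fold}: train={len(tr_idx)} val={len(va_idx)}")
```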
### 2. Run Multiple Experiments (Batch)

```bash
python run_experiment.py --exp exp_001,exp_002,exp_003
```

Runs experiments 001, 002, and 003 sequentially.
### 3. Compare All Results

```bash
python analyze_results.py --experiments all --rank-by imminent_auc
```

This generates:

- `results/comparison_table.csv` - sortable metrics table
- `results/comparison_imminent_auc.png` - bar chart of AUC scores
- `results/comparison_all_metrics.png` - multi-metric comparison
### 4. Find Top Performers

```bash
python analyze_results.py --rank-by imminent_auc --top 3
```

Shows the top 3 experiments ranked by imminent AUC.
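The ranking is the same operation you could run yourself on the comparison table with pandas; the column names below are assumptions about the CSV's layout, and the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical contents of results/comparison_table.csv
df = pd.DataFrame({
    "experiment": ["001_trends_only", "002_velocities", "009_combined"],
    "imminent_auc": [0.81, 0.85, 0.88],
    "detected_auc": [0.79, 0.83, 0.86],
})

# Equivalent of `--rank-by imminent_auc --top 3`
top3 = df.sort_values("imminent_auc", ascending=False).head(3)
print(top3[["experiment", "imminent_auc"]].to_string(index=False))
```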
## Project Structure

```
experiment_framework/
├── config/
│   └── experiments.yaml          # All experiment configurations
├── src/
│   ├── data_loader.py            # Data loading & preprocessing
│   ├── feature_engineering.py    # 25-feature extraction system
│   ├── models.py                 # LSTM/GRU architectures
│   ├── training.py               # K-fold CV training engine
│   └── evaluation.py             # Metrics & visualization
├── run_experiment.py             # Main execution script
├── analyze_results.py            # Comparison dashboard
└── results/                      # Auto-generated results
    ├── 001_trends_only/
    │   ├── config.json           # Exact config used
    │   ├── model.pt              # Trained weights
    │   ├── metrics.json          # All metrics
    │   ├── training_curves.png   # Loss curves
    │   ├── roc_curves.png        # ROC plots
    │   └── confusion_matrices.png
    └── comparison/
        ├── comparison_table.csv
        └── comparison_*.png
```
## Phase 1 Experiments (Feature Selection)

**Goal:** Identify which feature types improve harvest detection most.
| Exp ID | Features | Count | Purpose |
|---|---|---|---|
| 001 | CI, 7d_MA, 14d_MA, 21d_MA | 4 | Baseline (trends only) |
| 002 | 001 + velocities | 7 | Add rate of change |
| 003 | 002 + accelerations | 10 | Add momentum |
| 004 | 001 + mins | 7 | Add structural lows |
| 005 | 001 + maxs | 7 | Add structural highs |
| 006 | 001 + ranges | 7 | Add volatility |
| 007 | 001 + stds | 7 | Add noise indicators |
| 008 | 001 + CVs | 7 | Add relative stability |
| 009 | Trends + vel + mins + std | 13 | Combined best features |
| 010 | All 25 features | 25 | Full feature set |
All experiments use:

- **Model:** LSTM, `hidden_size=128`, `num_layers=1`, `dropout=0.5`
- **Window:** 28-1 days before harvest
- **Training:** 5-fold CV, 150 epochs, early stopping (patience=20)
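The early-stopping behaviour referenced above (stop if validation loss has not improved for 20 epochs, capped at 150) can be sketched as follows; `train_one_epoch` and `validate` are hypothetical stand-ins for the framework's training loop in `src/training.py`:

```python
def fit(train_one_epoch, validate, num_epochs=150, patience=20):
    """Train with early stopping on validation loss (sketch)."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(num_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss

# Simulated validation losses: improves for 3 epochs, then plateaus,
# so training stops 20 epochs after the best epoch.
losses = iter([1.0, 0.9, 0.8] + [0.85] * 147)
best_epoch, best_loss = fit(lambda: None, lambda: next(losses))
print(best_epoch, best_loss)
```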
## Feature Engineering System

### 25 Total Features (All Causal/Operational)

**Tier 1: State (4)**
`CI_raw`, `7d_MA`, `14d_MA`, `21d_MA`

**Tier 2: Velocity (3)**
`7d_velocity`, `14d_velocity`, `21d_velocity`

**Tier 3: Acceleration (3)**
`7d_acceleration`, `14d_acceleration`, `21d_acceleration`

**Tier 4: Structural (9)**
- Min: `7d_min`, `14d_min`, `21d_min`
- Max: `7d_max`, `14d_max`, `21d_max`
- Range: `7d_range`, `14d_range`, `21d_range`

**Tier 5: Stability (6)**
- Std: `7d_std`, `14d_std`, `21d_std`
- CV: `7d_CV`, `14d_CV`, `21d_CV`
All features use backward-looking rolling windows (causal) for operational deployment.
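The backward-looking windows can be sketched with pandas, which only aggregates over current and past rows by default, so no future information leaks in. One window length is shown for brevity; the exact velocity/acceleration definitions (differences of the moving average) are assumptions, as are the variable names:

```python
import pandas as pd

# `ci` stands in for the raw crop-index series loaded by data_loader.py
ci = pd.Series([0.2, 0.3, 0.5, 0.6, 0.55, 0.4, 0.35, 0.3])

feats = pd.DataFrame({"CI_raw": ci})
w = 7  # the framework also uses 14- and 21-day windows
roll = ci.rolling(window=w, min_periods=1)           # causal: past rows only
feats[f"{w}d_MA"] = roll.mean()                      # Tier 1: state
feats[f"{w}d_min"] = roll.min()                      # Tier 4: structural low
feats[f"{w}d_max"] = roll.max()                      # Tier 4: structural high
feats[f"{w}d_range"] = feats[f"{w}d_max"] - feats[f"{w}d_min"]
feats[f"{w}d_std"] = roll.std()                      # Tier 5: noise
feats[f"{w}d_CV"] = feats[f"{w}d_std"] / feats[f"{w}d_MA"]
feats[f"{w}d_velocity"] = feats[f"{w}d_MA"].diff()            # Tier 2 (assumed)
feats[f"{w}d_acceleration"] = feats[f"{w}d_velocity"].diff()  # Tier 3 (assumed)
print(feats.round(3))
```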
## Output Metrics

### Cross-Validation (K-Fold)

- Imminent AUC (mean ± std across folds)
- Detected AUC (mean ± std across folds)

### Test Set (Held-Out 15%)

- Imminent: AUC, F1, Precision, Recall
- Detected: AUC, F1, Precision, Recall
- Total predictions (timesteps)

### Visualizations Per Experiment

- Training/validation loss curves (all folds)
- ROC curves (imminent + detected)
- Confusion matrices (imminent + detected)
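The per-task test metrics above map directly onto scikit-learn calls; the labels and probabilities below are toy values, and the 0.5 decision threshold is an assumption:

```python
import numpy as np
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score)

# Toy per-timestep labels and model probabilities for one task
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6])
y_pred = (y_prob >= 0.5).astype(int)  # assumed threshold

metrics = {
    "auc": roc_auc_score(y_true, y_prob),   # threshold-free
    "f1": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}
print(metrics)
```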
## Customization

### Add a New Experiment

Edit `config/experiments.yaml`:

```yaml
exp_011:
  name: "011_my_custom_experiment"
  description: "Testing something new"
  features:
    - CI_raw
    - 7d_MA
    - 7d_velocity
  model:
    type: LSTM  # or GRU
    hidden_size: 256
    num_layers: 2
    dropout: 0.6
  training:
    imminent_days_before: 30
    imminent_days_before_end: 1
    k_folds: 5
    num_epochs: 200
    # ... other params
```
Then run:

```bash
python run_experiment.py --exp exp_011
```
### Add a New Feature

Edit `src/feature_engineering.py` and add a branch to `compute_feature()`:

```python
elif feature_name == '30d_MA':
    return ci_series.rolling(window=30, min_periods=1, center=False).mean().values
```

Then reference the new feature name in an experiment config.
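For context, the surrounding dispatch presumably looks something like the following; this is a hypothetical sketch of the function's shape, not the actual code in `src/feature_engineering.py`:

```python
import pandas as pd

def compute_feature(feature_name, ci_series):
    """Hypothetical dispatch shape showing where a new branch slots in."""
    if feature_name == "CI_raw":
        return ci_series.values
    elif feature_name == "7d_MA":
        return ci_series.rolling(window=7, min_periods=1).mean().values
    elif feature_name == "30d_MA":  # the new branch from the example above
        return ci_series.rolling(window=30, min_periods=1).mean().values
    raise ValueError(f"Unknown feature: {feature_name}")

ci = pd.Series([0.2, 0.4, 0.6])
print(compute_feature("7d_MA", ci))
```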
## Workflow Recommendations

### 1. Feature Selection (Phase 1)

```bash
# Run all Phase 1 experiments
python run_experiment.py --exp exp_001,exp_002,exp_003,exp_004,exp_005,exp_006,exp_007,exp_008,exp_009,exp_010

# Compare results
python analyze_results.py --experiments all --rank-by imminent_auc
```

**Expected time:** ~30-60 minutes per experiment on GPU (5-fold CV × 150 epochs)
### 2. Identify the Best Features

```bash
# Show top 3
python analyze_results.py --rank-by imminent_auc --top 3
```

**Decision:** Choose the feature set with the highest test AUC that also generalizes well (CV AUC ≈ test AUC).
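That generalization check can be made explicit as a simple gap test; the 0.03 threshold below is an arbitrary assumption, not a value from the framework:

```python
def generalizes_well(cv_auc, test_auc, max_gap=0.03):
    """Flag feature sets whose CV and test AUC diverge (threshold assumed)."""
    return abs(cv_auc - test_auc) <= max_gap

print(generalizes_well(0.88, 0.87))  # small gap -> likely fine
print(generalizes_well(0.92, 0.80))  # large gap -> likely overfitting
```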
### 3. Model Architecture Optimization (Phase 2)

Once the best features are identified, test different architectures:

- Vary `hidden_size`: 64, 128, 256
- Vary `num_layers`: 1, 2
- Try `GRU` vs `LSTM`
### 4. Hyperparameter Tuning (Phase 3)

Fine-tune the best model:

- Dropout: 0.3, 0.5, 0.7
- Learning rate: 0.0005, 0.001, 0.002
- Window length: 21-1, 28-1, 35-1
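Enumerating this Phase 3 grid is a one-liner with `itertools.product`; each combination would become its own entry in `config/experiments.yaml` (the variable names are illustrative):

```python
from itertools import product

dropouts = [0.3, 0.5, 0.7]
learning_rates = [0.0005, 0.001, 0.002]
windows = ["21-1", "28-1", "35-1"]

grid = list(product(dropouts, learning_rates, windows))
print(len(grid))  # 3 x 3 x 3 = 27 combinations
for dropout, lr, window in grid[:2]:
    print(dropout, lr, window)
```

Running all 27 at ~30-60 minutes each is substantial; in practice you would likely tune one axis at a time, as the tips below suggest.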
## Tips

- ✅ **Always compare CV AUC vs. test AUC** - a large gap indicates overfitting
- ✅ **Start with the baseline (exp_001)** - it establishes minimum performance
- ✅ **Change one thing at a time** - isolate the impact of features vs. model vs. hyperparameters
- ✅ **Check the confusion matrices** - understand failure modes (false positives vs. false negatives)
- ✅ **Monitor the training curves** - early stopping means the model converged; long plateaus suggest it needs more capacity
## Troubleshooting

**CUDA out of memory:**

```bash
python run_experiment.py --exp exp_001 --device cpu
```

**Experiment not found:** check the exact name in `config/experiments.yaml` (names are case-sensitive).

**Import errors:** make sure you are running from the `experiment_framework/` directory.
## Next Steps

After Phase 1 completes:

1. Identify the best feature set
2. Configure Phase 2 experiments (model architecture) in `experiments.yaml`
3. Run Phase 2 and compare results
4. Select the final model for production
## Requirements

- Python 3.8+
- PyTorch 1.10+
- scikit-learn
- pandas
- numpy
- matplotlib
- seaborn
- pyyaml

Install:

```bash
pip install torch scikit-learn pandas numpy matplotlib seaborn pyyaml
```