# Harvest Detection Experiment Framework

Systematic experimentation framework for harvest detection using LSTM/GRU models with comprehensive feature engineering and automated result tracking.

## Overview

This framework enables **systematic, reproducible experiments** for optimizing harvest detection models. It separates concerns:

- **Configuration** (YAML files) - Define experiments without touching code
- **Execution** (Python scripts) - Automated training, evaluation, comparison
- **Results** (Organized folders) - All metrics, models, and plots saved automatically

## Quick Start

### 1. Run a Single Experiment

```bash
cd experiment_framework
python run_experiment.py --exp exp_001
```

This will:

- Load data from `lstm_complete_data.csv`
- Extract features defined in `config/experiments.yaml`
- Train with 5-fold cross-validation
- Evaluate on the held-out test set
- Save all results to `results/001_trends_only/`

### 2. Run Multiple Experiments (Batch)

```bash
python run_experiment.py --exp exp_001,exp_002,exp_003
```

Runs experiments 001, 002, and 003 sequentially.

### 3. Compare All Results

```bash
python analyze_results.py --experiments all --rank-by imminent_auc
```

This generates:

- `results/comparison_table.csv` - Sortable metrics table
- `results/comparison_imminent_auc.png` - Bar chart of AUC scores
- `results/comparison_all_metrics.png` - Multi-metric comparison

### 4. Find Top Performers

```bash
python analyze_results.py --rank-by imminent_auc --top 3
```

Shows the top 3 experiments ranked by imminent AUC.
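The `--exp` flag above accepts either a single experiment ID or a comma-separated batch. As a minimal sketch of how such a CLI could split and dispatch that argument — the helper names here are illustrative, not the framework's actual internals:

```python
import argparse


def parse_exp_ids(exp_arg):
    """Split a comma-separated --exp value into individual experiment IDs."""
    return [e.strip() for e in exp_arg.split(",") if e.strip()]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run harvest detection experiments")
    parser.add_argument("--exp", required=True,
                        help="Experiment ID(s), e.g. exp_001 or exp_001,exp_002")
    parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
    args = parser.parse_args(argv)

    # Run each requested experiment sequentially, as the batch mode describes.
    for exp_id in parse_exp_ids(args.exp):
        print(f"Running {exp_id} on {args.device}")  # real script would invoke training here


if __name__ == "__main__":
    main()
```

Whitespace around the commas is tolerated, so `--exp "exp_001, exp_002"` behaves the same as the unspaced form.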
## Project Structure

```
experiment_framework/
├── config/
│   └── experiments.yaml        # All experiment configurations
├── src/
│   ├── data_loader.py          # Data loading & preprocessing
│   ├── feature_engineering.py  # 25-feature extraction system
│   ├── models.py               # LSTM/GRU architectures
│   ├── training.py             # K-fold CV training engine
│   └── evaluation.py           # Metrics & visualization
├── run_experiment.py           # Main execution script
├── analyze_results.py          # Comparison dashboard
└── results/                    # Auto-generated results
    ├── 001_trends_only/
    │   ├── config.json             # Exact config used
    │   ├── model.pt                # Trained weights
    │   ├── metrics.json            # All metrics
    │   ├── training_curves.png     # Loss curves
    │   ├── roc_curves.png          # ROC plots
    │   └── confusion_matrices.png
    └── comparison/
        ├── comparison_table.csv
        └── comparison_*.png
```

## Phase 1 Experiments (Feature Selection)

**Goal:** Identify which feature types improve harvest detection most.

| Exp ID | Features | Count | Purpose |
|--------|----------|-------|---------|
| **001** | CI, 7d_MA, 14d_MA, 21d_MA | 4 | Baseline (trends only) |
| **002** | 001 + velocities | 7 | Add rate of change |
| **003** | 002 + accelerations | 10 | Add momentum |
| **004** | 001 + mins | 7 | Add structural lows |
| **005** | 001 + maxs | 7 | Add structural highs |
| **006** | 001 + ranges | 7 | Add volatility |
| **007** | 001 + stds | 7 | Add noise indicators |
| **008** | 001 + CVs | 7 | Add relative stability |
| **009** | Trends + vel + mins + std | 13 | Combined best features |
| **010** | All 25 features | 25 | Full feature set |

**All experiments use:**

- Model: LSTM, hidden_size=128, num_layers=1, dropout=0.5
- Window: 28-1 days before harvest
- Training: 5-fold CV, 150 epochs, early stopping (patience=20)

## Feature Engineering System

### 25 Total Features (All Causal/Operational)

**Tier 1: State (4)**
- `CI_raw`, `7d_MA`, `14d_MA`, `21d_MA`

**Tier 2: Velocity (3)**
- `7d_velocity`, `14d_velocity`, `21d_velocity`

**Tier 3: Acceleration (3)**
- `7d_acceleration`, `14d_acceleration`, `21d_acceleration`

**Tier 4: Structural (9)**
- Min: `7d_min`, `14d_min`, `21d_min`
- Max: `7d_max`, `14d_max`, `21d_max`
- Range: `7d_range`, `14d_range`, `21d_range`

**Tier 5: Stability (6)**
- Std: `7d_std`, `14d_std`, `21d_std`
- CV: `7d_CV`, `14d_CV`, `21d_CV`

All features use **backward-looking rolling windows** (causal), so they can be computed in operational deployment without future data.

## Output Metrics

### Cross-Validation (K-Fold)
- Imminent AUC (mean ± std across folds)
- Detected AUC (mean ± std across folds)

### Test Set (Held-Out 15%)
- **Imminent:** AUC, F1, Precision, Recall
- **Detected:** AUC, F1, Precision, Recall
- Total predictions (timesteps)

### Visualizations Per Experiment
- Training/validation loss curves (all folds)
- ROC curves (imminent + detected)
- Confusion matrices (imminent + detected)

## Customization

### Add a New Experiment

Edit `config/experiments.yaml`:

```yaml
exp_011:
  name: "011_my_custom_experiment"
  description: "Testing something new"
  features:
    - CI_raw
    - 7d_MA
    - 7d_velocity
  model:
    type: LSTM  # or GRU
    hidden_size: 256
    num_layers: 2
    dropout: 0.6
  training:
    imminent_days_before: 30
    imminent_days_before_end: 1
    k_folds: 5
    num_epochs: 200
    # ... other params
```

Then run:

```bash
python run_experiment.py --exp exp_011
```

### Add a New Feature

Edit `src/feature_engineering.py` and add a branch to `compute_feature()`:

```python
elif feature_name == '30d_MA':
    return ci_series.rolling(window=30, min_periods=1, center=False).mean().values
```

Then reference the new feature name in an experiment config.

## Workflow Recommendations

### 1. Feature Selection (Phase 1)

```bash
# Run all Phase 1 experiments
python run_experiment.py --exp exp_001,exp_002,exp_003,exp_004,exp_005,exp_006,exp_007,exp_008,exp_009,exp_010

# Compare results
python analyze_results.py --experiments all --rank-by imminent_auc
```

**Expected time:** ~30-60 minutes per experiment on GPU (5-fold CV × 150 epochs)

### 2. Identify the Best Features

```bash
# Show top 3
python analyze_results.py --rank-by imminent_auc --top 3
```

**Decision:** Choose the feature set with the highest test AUC that also generalizes well (CV AUC ≈ test AUC).

### 3. Model Architecture Optimization (Phase 2)

Once the best features are identified, test different architectures:

- Vary `hidden_size`: 64, 128, 256
- Vary `num_layers`: 1, 2
- Try `GRU` vs `LSTM`

### 4. Hyperparameter Tuning (Phase 3)

Fine-tune the best model:

- Dropout: 0.3, 0.5, 0.7
- Learning rate: 0.0005, 0.001, 0.002
- Window length: 21-1, 28-1, 35-1

## Tips

✅ **Always compare CV AUC vs test AUC** - A large gap indicates overfitting
✅ **Start with the baseline (exp_001)** - Establishes minimum performance
✅ **Change one thing at a time** - Isolate the impact of features vs model vs hyperparameters
✅ **Check confusion matrices** - Understand failure modes (false positives vs false negatives)
✅ **Monitor training curves** - Early stopping means the model converged; a long plateau suggests it needs more capacity

## Troubleshooting

**CUDA out of memory:**

```bash
python run_experiment.py --exp exp_001 --device cpu
```

**Experiment not found:** Check the exact name in `config/experiments.yaml` (case-sensitive).

**Import errors:** Ensure you're running from the `experiment_framework/` directory.

## Next Steps

After Phase 1 completes:

1. Identify the best feature set
2. Configure Phase 2 experiments (model architecture) in `experiments.yaml`
3. Run Phase 2 and compare results
4. Select the final model for production

## Requirements

- Python 3.8+
- PyTorch 1.10+
- scikit-learn
- pandas
- numpy
- matplotlib
- seaborn
- pyyaml

Install:

```bash
pip install torch scikit-learn pandas numpy matplotlib seaborn pyyaml
```
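As a concrete illustration of the backward-looking (causal) rolling features in the Feature Engineering System section, here is a minimal pandas sketch covering one window length and a subset of the tiers. The exact velocity and CV definitions here are assumptions for illustration only; the framework's real implementations live in `src/feature_engineering.py`.

```python
import numpy as np
import pandas as pd


def compute_causal_features(ci, window=7):
    """Backward-looking rolling features for one window length.

    Illustrative sketch only: all statistics use a trailing window,
    so each row depends only on past and current values (causal).
    """
    roll = ci.rolling(window=window, min_periods=1)  # trailing window: causal
    ma = roll.mean()
    std = roll.std().fillna(0.0)  # std is NaN with a single sample; treat as 0
    return pd.DataFrame({
        f"{window}d_MA": ma,
        f"{window}d_velocity": ci.diff(window).fillna(0.0) / window,  # avg rate of change
        f"{window}d_min": roll.min(),
        f"{window}d_max": roll.max(),
        f"{window}d_range": roll.max() - roll.min(),
        f"{window}d_std": std,
        f"{window}d_CV": std / ma.replace(0, np.nan),  # relative stability
    })
```

Because `min_periods=1`, early rows are computed from however many observations exist so far rather than emitted as NaN, matching the `min_periods=1` usage shown in the Add a New Feature example.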