Merge branch 'code-improvements' into review_perField_code
21
.gitignore
vendored
|
|
@ -9,6 +9,14 @@
|
||||||
__pycache__/
|
__pycache__/
|
||||||
*.pyc
|
*.pyc
|
||||||
*.pyo
|
*.pyo
|
||||||
|
*.py[cod]
|
||||||
|
*.py[cod]
|
||||||
|
myenv/myenv/
|
||||||
|
pip-wheel-metadata/
|
||||||
|
dist/
|
||||||
|
*.egg-info/
|
||||||
|
*.egg/*.egg/
|
||||||
|
.ipynb_checkpoints
|
||||||
|
|
||||||
# R Output Files
|
# R Output Files
|
||||||
*.Rout
|
*.Rout
|
||||||
|
|
@ -16,6 +24,7 @@ __pycache__/
|
||||||
*.RData
|
*.RData
|
||||||
*.Rdata
|
*.Rdata
|
||||||
.Rproj.user
|
.Rproj.user
|
||||||
|
.Rprofile
|
||||||
Rplots.pdf
|
Rplots.pdf
|
||||||
*.pdf
|
*.pdf
|
||||||
|
|
||||||
|
|
@ -48,15 +57,25 @@ reports/
|
||||||
# Experiment Outputs (temporary plots, analysis artifacts)
|
# Experiment Outputs (temporary plots, analysis artifacts)
|
||||||
python_app/harvest_detection_experiments/*/plots/
|
python_app/harvest_detection_experiments/*/plots/
|
||||||
python_app/harvest_detection_experiments/*/*.ipynb_checkpoints/
|
python_app/harvest_detection_experiments/*/*.ipynb_checkpoints/
|
||||||
|
CI_report_dashboard_planet_files/
|
||||||
|
|
||||||
# Cache Files
|
# Cache Files
|
||||||
rosm.cache/
|
rosm.cache/
|
||||||
*.cache
|
renv/cache/
|
||||||
|
|
||||||
# Logs
|
# Logs
|
||||||
*.log
|
*.log
|
||||||
package_manager.log
|
package_manager.log
|
||||||
|
|
||||||
|
# Temporary Files
|
||||||
|
*.tmp
|
||||||
|
*.save
|
||||||
|
*.bak
|
||||||
|
*.swp
|
||||||
|
*.swo
|
||||||
|
package_manager.log
|
||||||
|
experiments_and_old_code_python_r/
|
||||||
|
|
||||||
# Laravel Storage (contains user data and outputs)
|
# Laravel Storage (contains user data and outputs)
|
||||||
laravel_app/storage/app/*/Data/
|
laravel_app/storage/app/*/Data/
|
||||||
laravel_app/storage/app/*/reports/
|
laravel_app/storage/app/*/reports/
|
||||||
|
|
|
||||||
|
|
@ -5,9 +5,12 @@ laravel_app/
|
||||||
data_validation_tool/
|
data_validation_tool/
|
||||||
python_app/harvest_detection_experiments/
|
python_app/harvest_detection_experiments/
|
||||||
python_app/experiments/
|
python_app/experiments/
|
||||||
|
r_app/old_scripts/
|
||||||
|
r_app/experiments/
|
||||||
phase2_refinement/
|
phase2_refinement/
|
||||||
webapps/
|
webapps/
|
||||||
tools/
|
tools/
|
||||||
|
old_sh/
|
||||||
output/
|
output/
|
||||||
renv/
|
renv/
|
||||||
*.py
|
*.py
|
||||||
|
|
|
||||||
967
IMPLEMENTATION_GUIDE.md
Normal file
|
|
@ -0,0 +1,967 @@
|
||||||
|
# SmartCane Code-Improvements Merge Guide (Complete)
|
||||||
|
|
||||||
|
**Target Repo**: https://bitbucket.org/sobitnl/smartcane/src/master/
|
||||||
|
**Source Branch**: `code-improvements` (Timon's experimental area)
|
||||||
|
**Status**: Ready for merge (all changes, including optional enhancements)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📋 SCOPE: CORE + OPTIONAL FEATURES
|
||||||
|
|
||||||
|
### 🔴 CORE (Required for merge)
|
||||||
|
- Database migration (client_type column)
|
||||||
|
- File path changes (merged_final_tif → field_tiles_CI)
|
||||||
|
- Laravel form/job/mailing updates
|
||||||
|
- Shell script wrappers (5 files)
|
||||||
|
- Python package files (2 files)
|
||||||
|
|
||||||
|
### 🟡 OPTIONAL (Post-merge enhancements)
|
||||||
|
- Country-based project organization (MZ/UG/TZ folders)
|
||||||
|
- Download scheduling (staggered 00:01 per-project)
|
||||||
|
- Project search feature
|
||||||
|
- Harvest prediction setup docs
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🚀 FULL IMPLEMENTATION CHECKLIST
|
||||||
|
|
||||||
|
```
|
||||||
|
═══════════════════════════════════════════════════════════════
|
||||||
|
PHASE 1: CORE MERGE (Required)
|
||||||
|
═══════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
[1.1] Database Migration (10 min)
|
||||||
|
✓ Create: database/migrations/YYYY_MM_DD_add_client_type_to_projects_table.php
|
||||||
|
✓ Run: php artisan migrate
|
||||||
|
|
||||||
|
[1.2] Laravel Model Changes (20 min)
|
||||||
|
✓ Edit: laravel_app/app/Models/Project.php
|
||||||
|
- Add 'client_type' to $fillable
|
||||||
|
- Update getMergedTiffList() — path change
|
||||||
|
- Update startDownload() — path change
|
||||||
|
- Update getTifsAsZip() — path change
|
||||||
|
- Add getLatestKpiFile() method
|
||||||
|
|
||||||
|
[1.3] Laravel Job Changes (15 min)
|
||||||
|
✓ Edit: laravel_app/app/Jobs/ProjectDownloadTiffJob.php
|
||||||
|
- Change path in handleForDate()
|
||||||
|
✓ Edit: laravel_app/app/Jobs/ProjectMosiacGeneratorJob.php
|
||||||
|
- Replace command array
|
||||||
|
- Improve error handling
|
||||||
|
|
||||||
|
[1.4] Laravel Forms (15 min)
|
||||||
|
✓ Edit: laravel_app/app/Livewire/Projects/ProjectManager.php
|
||||||
|
- Add client_type to form
|
||||||
|
✓ Edit: Project form Blade template
|
||||||
|
- Add <select> for client_type
|
||||||
|
✓ Edit: laravel_app/app/Livewire/Forms/MailingForm.php
|
||||||
|
- Auto-attach KPI Excel for cane_supply
|
||||||
|
|
||||||
|
[1.5] Shell Script Wrappers (15 min)
|
||||||
|
✓ Create: 10_create_per_field_tiffs.sh
|
||||||
|
✓ Create: 21_convert_ci_rds_to_csv.sh
|
||||||
|
✓ Create: 22_harvest_baseline_prediction.sh
|
||||||
|
✓ Create: 23_convert_harvest_format.sh
|
||||||
|
✓ Create: 31_harvest_imminent_weekly.sh
|
||||||
|
|
||||||
|
[1.6] Python Package Files (5 min)
|
||||||
|
✓ Create: python_app/requirements_harvest.txt
|
||||||
|
✓ Create: python_app/environment_pytorch.yml
|
||||||
|
|
||||||
|
[1.7] Testing Core (20 min)
|
||||||
|
✓ Run migration ✓ Download test ✓ Mosaic test ✓ Mail test
|
||||||
|
|
||||||
|
═══════════════════════════════════════════════════════════════
|
||||||
|
PHASE 2: ENHANCEMENTS
|
||||||
|
═══════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
[2.1] Country Organization (45 min)
|
||||||
|
- Add DB migration (country, country_code columns)
|
||||||
|
- Update Project.php fillable
|
||||||
|
- Add ProjectManager form fields
|
||||||
|
- Add Blade country selector + auto-populate
|
||||||
|
- Add country filtering to ProjectList
|
||||||
|
- Create "Add New Country" feature
|
||||||
|
|
||||||
|
[2.2] Download Scheduling (30 min)
|
||||||
|
- Option A: Windows Task Scheduler
|
||||||
|
- Option B: Linux cron
|
||||||
|
- Option C: Laravel Task Scheduler
|
||||||
|
|
||||||
|
[2.3] Project Search (30 min)
|
||||||
|
- Add search/filter to ProjectList.php
|
||||||
|
- Add Blade input fields
|
||||||
|
|
||||||
|
[2.4] Harvest Prediction Setup (20 min)
|
||||||
|
- Document conda env setup
|
||||||
|
- Document script execution
|
||||||
|
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔴 PHASE 1: CORE MERGE
|
||||||
|
|
||||||
|
### STEP 1: Create & Run Database Migration
|
||||||
|
|
||||||
|
**File**: `database/migrations/2024_02_19_000000_add_client_type_to_projects_table.php`
|
||||||
|
|
||||||
|
```php
|
||||||
|
<?php
|
||||||
|
|
||||||
|
use Illuminate\Database\Migrations\Migration;
|
||||||
|
use Illuminate\Database\Schema\Blueprint;
|
||||||
|
use Illuminate\Support\Facades\Schema;
|
||||||
|
|
||||||
|
return new class extends Migration
|
||||||
|
{
|
||||||
|
public function up(): void
|
||||||
|
{
|
||||||
|
Schema::table('projects', function (Blueprint $table) {
|
||||||
|
$table->enum('client_type', ['agronomic_support', 'cane_supply'])
|
||||||
|
->default('agronomic_support')
|
||||||
|
->after('download_path');
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
public function down(): void
|
||||||
|
{
|
||||||
|
Schema::table('projects', function (Blueprint $table) {
|
||||||
|
$table->dropColumn('client_type');
|
||||||
|
});
|
||||||
|
}
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Run migration:**
|
||||||
|
```bash
|
||||||
|
php artisan migrate
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 2: Edit `laravel_app/app/Models/Project.php`
|
||||||
|
|
||||||
|
**Edit 2.1: Add to `$fillable` array**
|
||||||
|
|
||||||
|
Find and add `'client_type'` after `'download_path'`:
|
||||||
|
|
||||||
|
```php
|
||||||
|
protected $fillable = [
|
||||||
|
'name',
|
||||||
|
'download_path',
|
||||||
|
'client_type', // ADD THIS
|
||||||
|
'mail_template',
|
||||||
|
'mail_subject',
|
||||||
|
'mail_frequency',
|
||||||
|
'mail_day',
|
||||||
|
'mail_scheduled',
|
||||||
|
'pivot_json_path',
|
||||||
|
'span_json_path',
|
||||||
|
'harvest_json_path',
|
||||||
|
'min_harvest_date',
|
||||||
|
'borders',
|
||||||
|
];
|
||||||
|
```
|
||||||
|
|
||||||
|
**Edit 2.2: Update `getMergedTiffList()` method (around line 232)**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public function getMergedTiffList()
|
||||||
|
{
|
||||||
|
return collect(Storage::files($this->download_path.'/field_tiles_CI')) // CHANGED
|
||||||
|
->filter(fn($file) => Str::endsWith($file, '.tif'))
|
||||||
|
->sortByDesc(function ($file) {
|
||||||
|
$parts = explode('_', str_replace('.tif', '', $file));
|
||||||
|
$date = $parts[1];
|
||||||
|
return $date;
|
||||||
|
})
|
||||||
|
->values();
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Edit 2.3: Update `startDownload()` method (around line 265)**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public function startDownload(Carbon $date)
|
||||||
|
{
|
||||||
|
$downloadRequest = $this->downloads()->updateOrCreate(
|
||||||
|
[
|
||||||
|
'project_id' => $this->id,
|
||||||
|
'name' => sprintf('%s.tif', $date->format('Y-m-d')),
|
||||||
|
],
|
||||||
|
[
|
||||||
|
'path' => sprintf('%s/%s/%s.tif', $this->download_path, 'field_tiles_CI', $date->format('Y-m-d')), // CHANGED
|
||||||
|
]
|
||||||
|
);
|
||||||
|
ProjectDownloadTiffJob::dispatch($downloadRequest, $date);
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Edit 2.4: Update `getTifsAsZip()` method (around line 489)**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public function getTifsAsZip()
|
||||||
|
{
|
||||||
|
return collect(Storage::files($this->download_path . '/field_tiles_CI')) // CHANGED
|
||||||
|
->filter(fn($file) => Str::endsWith($file, '.tif'))
|
||||||
|
->values();
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Edit 2.5: Add new method `getLatestKpiFile()`**
|
||||||
|
|
||||||
|
Add this method at the end of the Project class:
|
||||||
|
|
||||||
|
```php
|
||||||
|
public function getLatestKpiFile(): ?string
|
||||||
|
{
|
||||||
|
$kpiPath = $this->download_path . '/reports/kpis/';
|
||||||
|
|
||||||
|
try {
|
||||||
|
$files = Storage::files($kpiPath);
|
||||||
|
} catch (\Exception $e) {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
return collect($files)
|
||||||
|
->filter(fn($f) => Str::endsWith($f, '.xlsx'))
|
||||||
|
->sortByDesc(fn($f) => Storage::lastModified($f))
|
||||||
|
->first();
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 3: Edit `laravel_app/app/Jobs/ProjectDownloadTiffJob.php`
|
||||||
|
|
||||||
|
**Around line 73 in `handleForDate()` method:**
|
||||||
|
|
||||||
|
Change:
|
||||||
|
```php
|
||||||
|
$path = $project->download_path . '/merged_final_tif/' . $filename;
|
||||||
|
```
|
||||||
|
|
||||||
|
To:
|
||||||
|
```php
|
||||||
|
$path = $project->download_path . '/field_tiles_CI/' . $filename;
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 4: Edit `laravel_app/app/Jobs/ProjectMosiacGeneratorJob.php`
|
||||||
|
|
||||||
|
**Lines 50-70, replace the `$command` array initialization:**
|
||||||
|
|
||||||
|
OLD:
|
||||||
|
```php
|
||||||
|
$command = [
|
||||||
|
sprintf('%sbuild_mosaic.sh', $projectFolder),
|
||||||
|
sprintf('--end_date=%s', $this->mosaic->end_date->format('Y-m-d')),
|
||||||
|
sprintf('--offset=%s', $this->mosaic->offset),
|
||||||
|
sprintf('--data_dir=%s', $this->mosaic->project->download_path),
|
||||||
|
sprintf('--file_name_tif=%s', basename($this->mosaic->path)),
|
||||||
|
];
|
||||||
|
```
|
||||||
|
|
||||||
|
NEW:
|
||||||
|
```php
|
||||||
|
$command = [
|
||||||
|
sprintf('%s40_mosaic_creation_per_field.sh', $projectFolder),
|
||||||
|
sprintf('--project=%s', $project->name),
|
||||||
|
sprintf('--end_date=%s', $this->mosaic->end_date->format('Y-m-d')),
|
||||||
|
sprintf('--offset=%s', $this->mosaic->offset),
|
||||||
|
];
|
||||||
|
```
|
||||||
|
|
||||||
|
**Also improve exception handling (around line 65-75):**
|
||||||
|
|
||||||
|
```php
|
||||||
|
try {
|
||||||
|
$process = ProcessNew::timeout(300)
|
||||||
|
->env(['PATH' => $currentPath.':/usr/local/Cellar/pandoc/3.1.8/bin/pandoc'])
|
||||||
|
->start($command, function (string $type, string $output) use ($project) {
|
||||||
|
ProjectLogger::log($project, $output);
|
||||||
|
$this->throwIfOutputContainsError($output);
|
||||||
|
});
|
||||||
|
$results = $process->wait();
|
||||||
|
if ($results->successful()) {
|
||||||
|
$this->mosaic->setStatusSuccess();
|
||||||
|
}
|
||||||
|
} catch (\RuntimeException|ProcessTimedOutException|ProcessFailedException $e) {
|
||||||
|
ProjectLogger::log($project, "MOSAIC JOB ERROR: " . $e->getMessage());
|
||||||
|
$this->mosaic->setStatusFailed();
|
||||||
|
throw $e;
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 5: Edit `laravel_app/app/Livewire/Projects/ProjectManager.php`
|
||||||
|
|
||||||
|
**In `createProject()` method:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public function createProject()
|
||||||
|
{
|
||||||
|
$projectIdentifier = $this->formData['id'] ?? null;
|
||||||
|
Validator::make(
|
||||||
|
['name' => $this->formData['name']],
|
||||||
|
['name' => ['required', Rule::unique('projects')->ignore($projectIdentifier), 'string', 'max:255']]
|
||||||
|
)->validate();
|
||||||
|
|
||||||
|
$project = Project::create([
|
||||||
|
'name' => $this->formData['name'],
|
||||||
|
'download_path' => $this->makeValidDirectoryName($this->formData['name']),
|
||||||
|
'client_type' => $this->formData['client_type'] ?? 'agronomic_support', // ADD THIS
|
||||||
|
]);
|
||||||
|
return redirect()->route('project.show', [$project->name, 'settings']);
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**In `resetFormData()` method:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
private function resetFormData()
|
||||||
|
{
|
||||||
|
$this->formData = [
|
||||||
|
'name' => '',
|
||||||
|
'client_type' => 'agronomic_support', // ADD THIS
|
||||||
|
'mail_template' => '',
|
||||||
|
'mail_subject' => '',
|
||||||
|
'mail_frequency' => '',
|
||||||
|
'mail_day' => '',
|
||||||
|
// ... rest of fields
|
||||||
|
];
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 6: Edit Project Form Blade Template
|
||||||
|
|
||||||
|
**Find the project create/edit form and add this field:**
|
||||||
|
|
||||||
|
```blade
|
||||||
|
<div class="form-group mb-3">
|
||||||
|
<label for="client_type" class="form-label">Client Type <span class="text-danger">*</span></label>
|
||||||
|
<select wire:model="formData.client_type" id="client_type" class="form-control" required>
|
||||||
|
<option value="">-- Select Client Type --</option>
|
||||||
|
<option value="agronomic_support">Agronomic Support</option>
|
||||||
|
<option value="cane_supply">Cane Supply</option>
|
||||||
|
</select>
|
||||||
|
@error('formData.client_type')
|
||||||
|
<span class="text-danger small">{{ $message }}</span>
|
||||||
|
@enderror
|
||||||
|
</div>
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 7: Edit `laravel_app/app/Livewire/Forms/MailingForm.php`
|
||||||
|
|
||||||
|
**In `saveAndSendMailing()` static method:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public static function saveAndSendMailing($report, $subject, $message, $recipients) {
|
||||||
|
if ($report->documentExists()) {
|
||||||
|
$mailing = $report->project->mailings()->create([
|
||||||
|
'subject' => $subject,
|
||||||
|
'message' => $message,
|
||||||
|
'report_id' => $report->id,
|
||||||
|
]);
|
||||||
|
|
||||||
|
// Attach main report
|
||||||
|
$mailing->attachments()->create([
|
||||||
|
'name' => $report->name,
|
||||||
|
'path' => $report->path,
|
||||||
|
]);
|
||||||
|
|
||||||
|
// Auto-attach KPI Excel for cane_supply projects
|
||||||
|
if ($report->project->client_type === 'cane_supply') {
|
||||||
|
$kpiFile = $report->project->getLatestKpiFile();
|
||||||
|
if ($kpiFile) {
|
||||||
|
$mailing->attachments()->create([
|
||||||
|
'name' => 'KPI Summary',
|
||||||
|
'path' => $kpiFile,
|
||||||
|
]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
$mailing->recipients()->createMany($recipients);
|
||||||
|
Mail::to($mailing->recipients()->pluck('email')->toArray())
|
||||||
|
->send(new \App\Mail\ReportMailer($mailing, $report));
|
||||||
|
} else {
|
||||||
|
self::sendReportNotFoundNotificationToAdmin($report);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 8: CREATE 5 Shell Script Wrappers
|
||||||
|
|
||||||
|
**File: `10_create_per_field_tiffs.sh`**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Wrapper for R script 10: Create per-field TIFFs
|
||||||
|
# Usage: ./10_create_per_field_tiffs.sh --project=angata
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
PROJECT=""
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--project=*) PROJECT="${1#*=}" ;;
|
||||||
|
--*) ;; # Ignore other args
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
[ -z "$PROJECT" ] && { echo "ERROR: --project required"; exit 1; }
|
||||||
|
|
||||||
|
cd "$(dirname "$0")/r_app"
|
||||||
|
Rscript -e "PROJECT='$PROJECT'; source('parameters_project.R'); source('10_create_per_field_tiffs.R')"
|
||||||
|
```
|
||||||
|
|
||||||
|
**File: `21_convert_ci_rds_to_csv.sh`**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Wrapper for R script 21: Convert CI RDS to CSV
|
||||||
|
# Usage: ./21_convert_ci_rds_to_csv.sh --project=angata
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
PROJECT=""
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--project=*) PROJECT="${1#*=}" ;;
|
||||||
|
--*) ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
[ -z "$PROJECT" ] && { echo "ERROR: --project required"; exit 1; }
|
||||||
|
|
||||||
|
cd "$(dirname "$0")/r_app"
|
||||||
|
Rscript -e "PROJECT='$PROJECT'; source('parameters_project.R'); source('21_convert_ci_rds_to_csv.R')"
|
||||||
|
```
|
||||||
|
|
||||||
|
**File: `22_harvest_baseline_prediction.sh`**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Wrapper for Python script 22: Harvest baseline prediction
|
||||||
|
# Usage: ./22_harvest_baseline_prediction.sh --project=angata
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
PROJECT=""
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--project=*) PROJECT="${1#*=}" ;;
|
||||||
|
--*) ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
[ -z "$PROJECT" ] && { echo "ERROR: --project required"; exit 1; }
|
||||||
|
|
||||||
|
cd "$(dirname "$0")/python_app"
|
||||||
|
|
||||||
|
if command -v conda &> /dev/null; then
|
||||||
|
conda run -n pytorch_gpu python 22_harvest_baseline_prediction.py "$PROJECT" 2>&1 || \
|
||||||
|
conda run -n pytorch_cpu python 22_harvest_baseline_prediction.py "$PROJECT" 2>&1
|
||||||
|
else
|
||||||
|
python 22_harvest_baseline_prediction.py "$PROJECT"
|
||||||
|
fi
|
||||||
|
```
|
||||||
|
|
||||||
|
**File: `23_convert_harvest_format.sh`**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Wrapper for Python script 23: Convert harvest format
|
||||||
|
# Usage: ./23_convert_harvest_format.sh --project=angata
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
PROJECT=""
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--project=*) PROJECT="${1#*=}" ;;
|
||||||
|
--*) ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
[ -z "$PROJECT" ] && { echo "ERROR: --project required"; exit 1; }
|
||||||
|
|
||||||
|
cd "$(dirname "$0")/python_app"
|
||||||
|
|
||||||
|
if command -v conda &> /dev/null; then
|
||||||
|
conda run -n pytorch_gpu python 23_convert_harvest_format.py "$PROJECT" 2>&1 || \
|
||||||
|
conda run -n pytorch_cpu python 23_convert_harvest_format.py "$PROJECT" 2>&1
|
||||||
|
else
|
||||||
|
python 23_convert_harvest_format.py "$PROJECT"
|
||||||
|
fi
|
||||||
|
```
|
||||||
|
|
||||||
|
**File: `31_harvest_imminent_weekly.sh`**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Wrapper for Python script 31: Harvest imminent weekly
|
||||||
|
# Usage: ./31_harvest_imminent_weekly.sh --project=angata
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
PROJECT=""
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--project=*) PROJECT="${1#*=}" ;;
|
||||||
|
--*) ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
[ -z "$PROJECT" ] && { echo "ERROR: --project required"; exit 1; }
|
||||||
|
|
||||||
|
cd "$(dirname "$0")/python_app"
|
||||||
|
|
||||||
|
if command -v conda &> /dev/null; then
|
||||||
|
conda run -n pytorch_gpu python 31_harvest_imminent_weekly.py "$PROJECT" 2>&1 || \
|
||||||
|
conda run -n pytorch_cpu python 31_harvest_imminent_weekly.py "$PROJECT" 2>&1
|
||||||
|
else
|
||||||
|
python 31_harvest_imminent_weekly.py "$PROJECT"
|
||||||
|
fi
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 9: CREATE 2 Python Package Files
|
||||||
|
|
||||||
|
**File: `python_app/requirements_harvest.txt`**
|
||||||
|
|
||||||
|
```
|
||||||
|
torch>=2.0.0
|
||||||
|
pandas>=1.5.0
|
||||||
|
numpy>=1.23.0
|
||||||
|
scikit-learn>=1.3.0
|
||||||
|
GDAL>=3.7.0
|
||||||
|
sentinelhub>=3.9.0
|
||||||
|
shapely>=2.0.0
|
||||||
|
pyproj>=3.4.0
|
||||||
|
```
|
||||||
|
|
||||||
|
**File: `python_app/environment_pytorch.yml`**
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
name: pytorch_gpu
|
||||||
|
channels:
|
||||||
|
- pytorch
|
||||||
|
- nvidia
|
||||||
|
- conda-forge
|
||||||
|
dependencies:
|
||||||
|
- python=3.10
|
||||||
|
- pytorch::pytorch
|
||||||
|
- pytorch::torchvision
|
||||||
|
- pytorch::torchaudio
|
||||||
|
- pytorch::pytorch-cuda=11.8
|
||||||
|
- gdal>=3.7.0
|
||||||
|
- pip
|
||||||
|
- pip:
|
||||||
|
- sentinelhub>=3.9.0
|
||||||
|
- shapely>=2.0.0
|
||||||
|
- pyproj>=3.4.0
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### STEP 10: CORE TESTING CHECKLIST
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Migration
|
||||||
|
php artisan migrate
|
||||||
|
# ✅ Expected: No errors, client_type column added
|
||||||
|
|
||||||
|
# 2. Download test
|
||||||
|
# Go to Laravel UI → Create project with client_type=agronomic_support
|
||||||
|
# → Download Manager → Add image → Download
|
||||||
|
# Expected: File in laravel_app/storage/app/{project}/field_tiles_CI/
|
||||||
|
|
||||||
|
# 3. Mosaic test
|
||||||
|
# Go to Mosaic Manager → Create mosaic
|
||||||
|
# Check logs: grep "Unknown option" laravel.log
|
||||||
|
# Expected: No --data_dir errors, mosaic created
|
||||||
|
|
||||||
|
# 4. Mail test
|
||||||
|
# Create project with client_type=cane_supply
|
||||||
|
# Generate & send report
|
||||||
|
# Expected: Email has 2 attachments (report + KPI Excel)
|
||||||
|
|
||||||
|
# 5. Shell wrapper test
|
||||||
|
./10_create_per_field_tiffs.sh --project=angata
|
||||||
|
# Expected: R script executes without error
|
||||||
|
```
|
||||||
|
|
||||||
|
✅ **CORE MERGE COMPLETE**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🟡 PHASE 2: ENHANCEMENTS (Post-Merge)
|
||||||
|
|
||||||
|
### OPTIONAL 1: Country-Based Organization
|
||||||
|
|
||||||
|
**Why**: Organize projects by geographic location (MZ/UG/TZ folders)
|
||||||
|
|
||||||
|
**Create Migration**: `database/migrations/YYYY_MM_DD_add_country_to_projects_table.php`
|
||||||
|
|
||||||
|
```php
|
||||||
|
<?php
|
||||||
|
|
||||||
|
use Illuminate\Database\Migrations\Migration;
|
||||||
|
use Illuminate\Database\Schema\Blueprint;
|
||||||
|
use Illuminate\Support\Facades\Schema;
|
||||||
|
use Illuminate\Support\Facades\DB;
|
||||||
|
|
||||||
|
return new class extends Migration
|
||||||
|
{
|
||||||
|
public function up(): void
|
||||||
|
{
|
||||||
|
Schema::table('projects', function (Blueprint $table) {
|
||||||
|
$table->string('country')->default('Mozambique')->after('name');
|
||||||
|
$table->string('country_code', 2)->default('MZ')->after('country');
|
||||||
|
});
|
||||||
|
|
||||||
|
// Update existing projects
|
||||||
|
DB::table('projects')->where('name', 'angata')->update(['country' => 'Mozambique', 'country_code' => 'MZ']);
|
||||||
|
DB::table('projects')->where('name', 'aura')->update(['country' => 'Mozambique', 'country_code' => 'MZ']);
|
||||||
|
DB::table('projects')->where('name', 'chemba')->update(['country' => 'Mozambique', 'country_code' => 'MZ']);
|
||||||
|
DB::table('projects')->where('name', 'xinavane')->update(['country' => 'Tanzania', 'country_code' => 'TZ']);
|
||||||
|
DB::table('projects')->where('name', 'esa')->update(['country' => 'Kenya', 'country_code' => 'KQ']);
|
||||||
|
DB::table('projects')->where('name', 'simba')->update(['country' => 'Uganda', 'country_code' => 'UG']);
|
||||||
|
DB::table('projects')->where('name', 'john')->update(['country' => 'Uganda', 'country_code' => 'UG']);
|
||||||
|
DB::table('projects')->where('name', 'huss')->update(['country' => 'Tanzania', 'country_code' => 'TZ']);
|
||||||
|
}
|
||||||
|
|
||||||
|
public function down(): void
|
||||||
|
{
|
||||||
|
Schema::table('projects', function (Blueprint $table) {
|
||||||
|
$table->dropColumn(['country', 'country_code']);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
**Update Project.php `$fillable`:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
protected $fillable = [
|
||||||
|
'name',
|
||||||
|
'country',
|
||||||
|
'country_code',
|
||||||
|
'download_path',
|
||||||
|
'client_type',
|
||||||
|
// ... rest
|
||||||
|
];
|
||||||
|
```
|
||||||
|
|
||||||
|
**Update ProjectManager.php:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public array $countries = [
|
||||||
|
'MZ' => 'Mozambique',
|
||||||
|
'TZ' => 'Tanzania',
|
||||||
|
'UG' => 'Uganda',
|
||||||
|
'KQ' => 'Kenya',
|
||||||
|
'SA' => 'South Africa',
|
||||||
|
'ZW' => 'Zimbabwe',
|
||||||
|
'BR' => 'Brazil',
|
||||||
|
'MX' => 'Mexico',
|
||||||
|
'IN' => 'India',
|
||||||
|
];
|
||||||
|
|
||||||
|
public function createProject()
|
||||||
|
{
|
||||||
|
$projectIdentifier = $this->formData['id'] ?? null;
|
||||||
|
Validator::make(
|
||||||
|
['name' => $this->formData['name']],
|
||||||
|
['name' => ['required', Rule::unique('projects')->ignore($projectIdentifier)]]
|
||||||
|
)->validate();
|
||||||
|
|
||||||
|
$projectPath = $this->formData['country_code'] . '/' . $this->formData['name'];
|
||||||
|
Storage::makeDirectory($projectPath, recursive: true);
|
||||||
|
|
||||||
|
$project = Project::create([
|
||||||
|
'name' => $this->formData['name'],
|
||||||
|
'country' => $this->formData['country'],
|
||||||
|
'country_code' => $this->formData['country_code'],
|
||||||
|
'download_path' => $projectPath,
|
||||||
|
'client_type' => $this->formData['client_type'] ?? 'agronomic_support',
|
||||||
|
]);
|
||||||
|
return redirect()->route('project.show', [$project->name, 'settings']);
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Add to Blade template:**
|
||||||
|
|
||||||
|
```blade
|
||||||
|
<div class="form-group mb-3">
|
||||||
|
<label for="country" class="form-label">Country</label>
|
||||||
|
<select wire:model="formData.country" id="country" class="form-control" required>
|
||||||
|
@foreach($countries as $code => $name)
|
||||||
|
<option value="{{ $name }}" @selected($formData['country'] === $name)>
|
||||||
|
{{ $name }} ({{ $code }})
|
||||||
|
</option>
|
||||||
|
@endforeach
|
||||||
|
</select>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="form-group mb-3">
|
||||||
|
<label for="country_code" class="form-label">Country Code (Auto-populated)</label>
|
||||||
|
<input type="text" wire:model="formData.country_code" id="country_code" readonly class="form-control" />
|
||||||
|
</div>
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### OPTIONAL 2: Download Scheduling
|
||||||
|
|
||||||
|
**Why**: Avoid API rate limits by staggering downloads per project at 00:01
|
||||||
|
|
||||||
|
#### Option A: Linux Cron (If server is Linux)
|
||||||
|
|
||||||
|
**Add to `/etc/cron.d/smartcane_downloads`:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Stagger downloads by 10 minutes per project
|
||||||
|
1 0 * * * /usr/bin/python /home/user/smartcane/python_app/00_download_8band_pu_optimized.py angata 2>&1 | logger
|
||||||
|
15 0 * * * /usr/bin/python /home/user/smartcane/python_app/00_download_8band_pu_optimized.py chemba 2>&1 | logger
|
||||||
|
25 0 * * * /usr/bin/python /home/user/smartcane/python_app/00_download_8band_pu_optimized.py xinavane 2>&1 | logger
|
||||||
|
35 0 * * * /usr/bin/python /home/user/smartcane/python_app/00_download_8band_pu_optimized.py esa 2>&1 | logger
|
||||||
|
45 0 * * * /usr/bin/python /home/user/smartcane/python_app/00_download_8band_pu_optimized.py simba 2>&1 | logger
|
||||||
|
0 1 * * * /usr/bin/python /home/user/smartcane/python_app/00_download_8band_pu_optimized.py aura 2>&1 | logger
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Option B: Windows Task Scheduler (If server is Windows)
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# Create task for each project
|
||||||
|
$taskName = "SmartCane-Download-angata"
|
||||||
|
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -WindowStyle Hidden -Command python C:\smartcane\python_app\00_download_8band_pu_optimized.py angata"
|
||||||
|
$trigger = New-ScheduledTaskTrigger -Daily -At 00:01
|
||||||
|
Register-ScheduledTask -TaskName $taskName -Action $action -Trigger $trigger -RunLevel Highest
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Option C: Laravel Task Scheduler
|
||||||
|
|
||||||
|
**Add to `laravel_app/app/Console/Kernel.php`:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
protected function schedule(Schedule $schedule)
|
||||||
|
{
|
||||||
|
$schedule->exec('python python_app/00_download_8band_pu_optimized.py angata')
|
||||||
|
->dailyAt('00:01');
|
||||||
|
$schedule->exec('python python_app/00_download_8band_pu_optimized.py chemba')
|
||||||
|
->dailyAt('00:15');
|
||||||
|
$schedule->exec('python python_app/00_download_8band_pu_optimized.py xinavane')
|
||||||
|
->dailyAt('00:25');
|
||||||
|
$schedule->exec('python python_app/00_download_8band_pu_optimized.py esa')
|
||||||
|
->dailyAt('00:35');
|
||||||
|
$schedule->exec('python python_app/00_download_8band_pu_optimized.py simba')
|
||||||
|
->dailyAt('00:45');
|
||||||
|
$schedule->exec('python python_app/00_download_8band_pu_optimized.py aura')
|
||||||
|
->dailyAt('01:00');
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### OPTIONAL 3: Project Search Feature
|
||||||
|
|
||||||
|
**Why**: Find projects quickly if there are many
|
||||||
|
|
||||||
|
**Add to `laravel_app/app/Livewire/Projects/ProjectList.php`:**
|
||||||
|
|
||||||
|
```php
|
||||||
|
public string $searchQuery = '';
|
||||||
|
|
||||||
|
public function getProjectsProperty()
|
||||||
|
{
|
||||||
|
$query = Project::query();
|
||||||
|
|
||||||
|
if (!empty($this->searchQuery)) {
|
||||||
|
$query->where('name', 'like', '%' . $this->searchQuery . '%')
|
||||||
|
->orWhere('download_path', 'like', '%' . $this->searchQuery . '%');
|
||||||
|
}
|
||||||
|
|
||||||
|
return $query->orderBy('name')->paginate(15);
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Add to Blade template:**
|
||||||
|
|
||||||
|
```blade
|
||||||
|
<div class="search-bar mb-4">
|
||||||
|
<input type="text"
|
||||||
|
wire:model.live="searchQuery"
|
||||||
|
placeholder="Search projects..."
|
||||||
|
class="form-control" />
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="projects-list">
|
||||||
|
@forelse($this->projects as $project)
|
||||||
|
<div class="project-card">
|
||||||
|
<h3>{{ $project->name }}</h3>
|
||||||
|
<p>{{ $project->download_path }}</p>
|
||||||
|
<span class="badge badge-info">{{ $project->client_type }}</span>
|
||||||
|
</div>
|
||||||
|
@empty
|
||||||
|
<p>No projects found</p>
|
||||||
|
@endforelse
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{{ $this->projects->links() }}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### OPTIONAL 4: Harvest Date Prediction Setup
|
||||||
|
|
||||||
|
**Why**: Enable harvest Date forecasting for cane_supply projects
|
||||||
|
|
||||||
|
**Create conda environment:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
conda env create -f python_app/environment_pytorch.yml
|
||||||
|
|
||||||
|
# Activate
|
||||||
|
conda activate pytorch_gpu
|
||||||
|
|
||||||
|
# Or CPU-only if no GPU
|
||||||
|
conda create -n pytorch_cpu python=3.10 pytorch::pytorch torchvision torchaudio -c pytorch
|
||||||
|
conda activate pytorch_cpu
|
||||||
|
```
|
||||||
|
|
||||||
|
**Run baseline prediction (once):**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python python_app/22_harvest_baseline_prediction.py angata
|
||||||
|
python python_app/23_convert_harvest_format.py angata
|
||||||
|
```
|
||||||
|
|
||||||
|
**Schedule weekly prediction:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Add to cron (Linux)
|
||||||
|
0 23 * * 0 conda run -n pytorch_gpu python /home/user/smartcane/python_app/31_harvest_imminent_weekly.py angata 2>&1 | logger
|
||||||
|
|
||||||
|
# Or Task Scheduler (Windows)
|
||||||
|
# Similar to download scheduling above, but Sunday 23:00
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 📊 Summary: What Gets Changed
|
||||||
|
|
||||||
|
| Category | Files Modified | Changes Required |
|
||||||
|
|----------|---|---|
|
||||||
|
| Database | migrations/ | 1 file: add client_type column |
|
||||||
|
| Models | Project.php | 5 edits: fillable, 3 methods, 1 new method |
|
||||||
|
| Jobs | 2 files | ProjectDownloadTiffJob (1 line), ProjectMosiacGeneratorJob (full array) |
|
||||||
|
| Forms | 3 files | ProjectManager.php, Blade template, MailingForm.php |
|
||||||
|
| Scripts | 5 files created | Shell wrappers (R/Python) |
|
||||||
|
| Python | 2 files created | requirements_harvest.txt, environment_pytorch.yml |
|
||||||
|
|
||||||
|
**TOTAL**: 6 files created + 6 files modified + 1 template modified = **13 changes**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ✅ FINAL VERIFICATION
|
||||||
|
|
||||||
|
After ALL changes (core + optionals), test:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Migration worked
|
||||||
|
php artisan migrate
|
||||||
|
|
||||||
|
# 2. Download saves to correct path
|
||||||
|
# → Download image → check laravel_app/storage/app/{project}/field_tiles_CI/
|
||||||
|
|
||||||
|
# 3. Mosaic runs without errors
|
||||||
|
# → Create mosaic → check logs for no --data_dir errors
|
||||||
|
|
||||||
|
# 4. Mail has 2 attachments for cane_supply
|
||||||
|
# → Send report for cane_supply project → verify report + KPI Excel
|
||||||
|
|
||||||
|
# 5. Shell wrappers work
|
||||||
|
./10_create_per_field_tiffs.sh --project=angata
|
||||||
|
# → Should execute R script successfully
|
||||||
|
|
||||||
|
# 6. Search works (if implemented)
|
||||||
|
# → Search for project by name on Projects page
|
||||||
|
|
||||||
|
# 7. Country filter works (if implemented)
|
||||||
|
# → Filter projects by country code
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🌍 POST-MERGE: Data Recreation Strategy
|
||||||
|
|
||||||
|
After merge is live, existing projects need new directory structure.
|
||||||
|
|
||||||
|
### Option A: Delete & Redownload (Small projects)
|
||||||
|
**Projects**: aura, chemba, xinavane, esa, simba
|
||||||
|
|
||||||
|
```
|
||||||
|
1. Backup project folder (optional)
|
||||||
|
2. Delete project from Laravel UI
|
||||||
|
3. Recreate with new client_type selector
|
||||||
|
4. Redownload 2-3 years of data (~50-150 GB per project)
|
||||||
|
5. Run pipeline normally
|
||||||
|
```
|
||||||
|
|
||||||
|
### Option B: Preserve merged_tif (Large projects)
|
||||||
|
**Projects**: angata
|
||||||
|
|
||||||
|
```
|
||||||
|
1. Backup merged_tif/ folder externally
|
||||||
|
2. Delete all other folders in project
|
||||||
|
3. Keep only: merged_tif/
|
||||||
|
4. Run Scripts 10-80 on existing data
|
||||||
|
→ Regenerates field_tiles_CI/, reports/, etc.
|
||||||
|
5. No need to redownload
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🔴 CORE vs OPTIONAL Quick List
|
||||||
|
|
||||||
|
**MUST DO** (for merge):
|
||||||
|
- ✅ Database migration
|
||||||
|
- ✅ Project.php edits (4 path changes + 1 new method)
|
||||||
|
- ✅ Job edits (ProjectDownloadTiffJob, ProjectMosiacGeneratorJob)
|
||||||
|
- ✅ Form edits (ProjectManager, MailingForm, Blade template)
|
||||||
|
- ✅ 5 shell wrappers
|
||||||
|
- ✅ 2 Python files
|
||||||
|
|
||||||
|
**NICE-TO-HAVE** (post-merge):
|
||||||
|
- 🟡 Country organization (adds ~45 min)
|
||||||
|
- 🟡 Download scheduling (adds ~30 min)
|
||||||
|
- 🟡 Project search (adds ~30 min)
|
||||||
|
- 🟡 Harvest prediction setup (adds ~20 min)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Ready to implement everything?** All code is copy-paste ready above.
|
||||||
|
|
||||||
|
|
@ -1,2 +0,0 @@
|
||||||
install.packages(c("googledrive", "here", "tidyverse", "lubridate", "readxl", "googlesheets4", "here", "sf", "tidyverse", "lubridate", "terra", "exactextractr")
|
|
||||||
install.packages("packages/CIprep_0.1.4.tar.gz",repos=NULL, type="source")
|
|
||||||
|
|
@ -48,11 +48,23 @@ public static function saveAndSendMailing($report, $subject, $message, $recipien
|
||||||
'report_id' => $report->id,
|
'report_id' => $report->id,
|
||||||
]);
|
]);
|
||||||
|
|
||||||
|
// Attach main report
|
||||||
$mailing->attachments()->create([
|
$mailing->attachments()->create([
|
||||||
'name' => $report->name,
|
'name' => $report->name,
|
||||||
'path' => $report->path,
|
'path' => $report->path,
|
||||||
]);
|
]);
|
||||||
|
|
||||||
|
// For cane_supply projects, also attach latest KPI Excel file
|
||||||
|
if ($report->project->client_type === 'cane_supply') {
|
||||||
|
$kpiFile = $report->project->getLatestKpiFile();
|
||||||
|
if ($kpiFile) {
|
||||||
|
$mailing->attachments()->create([
|
||||||
|
'name' => 'KPI Data - ' . basename($kpiFile),
|
||||||
|
'path' => $kpiFile,
|
||||||
|
]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
$mailing->recipients()->createMany($recipients);
|
$mailing->recipients()->createMany($recipients);
|
||||||
Mail::to($mailing->recipients()->pluck('email')->toArray())
|
Mail::to($mailing->recipients()->pluck('email')->toArray())
|
||||||
->send(new \App\Mail\ReportMailer($mailing, $report));
|
->send(new \App\Mail\ReportMailer($mailing, $report));
|
||||||
|
|
|
||||||
|
|
@ -251,6 +251,19 @@ class="w-5 h-5 text-red-400 dark:text-red-200">
|
||||||
class="relative w-11 h-6 bg-gray-200 peer-focus:outline-none peer-focus:ring-4 peer-focus:ring-indigo-300 dark:peer-focus:ring-indigo-800 rounded-full peer dark:bg-gray-700 peer-checked:after:translate-x-full rtl:peer-checked:after:-translate-x-full peer-checked:after:border-white after:content-[''] after:absolute after:top-[2px] after:start-[2px] after:bg-white after:border-gray-300 after:border after:rounded-full after:h-5 after:w-5 after:transition-all dark:border-gray-600 peer-checked:bg-indigo-600"></div>
|
class="relative w-11 h-6 bg-gray-200 peer-focus:outline-none peer-focus:ring-4 peer-focus:ring-indigo-300 dark:peer-focus:ring-indigo-800 rounded-full peer dark:bg-gray-700 peer-checked:after:translate-x-full rtl:peer-checked:after:-translate-x-full peer-checked:after:border-white after:content-[''] after:absolute after:top-[2px] after:start-[2px] after:bg-white after:border-gray-300 after:border after:rounded-full after:h-5 after:w-5 after:transition-all dark:border-gray-600 peer-checked:bg-indigo-600"></div>
|
||||||
</label>
|
</label>
|
||||||
</div>
|
</div>
|
||||||
|
<div class="mb-2">
|
||||||
|
<x-label for="client_type" value="{{ __('Client Type') }}"/>
|
||||||
|
<select id="client_type" class="border-gray-300 dark:border-gray-700 dark:bg-gray-900 dark:text-gray-300 focus:border-indigo-500 dark:focus:border-indigo-600 focus:ring-indigo-500 dark:focus:ring-indigo-600 rounded-md shadow-sm mt-1 block w-full"
|
||||||
|
wire:model="formData.client_type"
|
||||||
|
@if($this->formData['id'] ?? false) disabled @endif>
|
||||||
|
<option value="agronomic_support">{{ __('Agronomic Support') }}</option>
|
||||||
|
<option value="cane_supply">{{ __('Cane Supply') }}</option>
|
||||||
|
</select>
|
||||||
|
@if($this->formData['id'] ?? false)
|
||||||
|
<p class="text-sm text-gray-500 mt-1">{{ __('Client type cannot be changed after project creation') }}</p>
|
||||||
|
@endif
|
||||||
|
<x-input-error for="client_type" class="mt-2"/>
|
||||||
|
</div>
|
||||||
</form>
|
</form>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
|
||||||
|
|
@ -1,111 +0,0 @@
|
||||||
[INFO] 2025-06-24 14:49:29 - SmartCane Project - Package Manager Started
|
|
||||||
[INFO] 2025-06-24 14:49:29 - Working directory: C:/Users/timon/Resilience BV/4020 SCane ESA DEMO - Documenten/General/4020 SCDEMO Team/4020 TechnicalData/WP3/smartcane
|
|
||||||
[INFO] 2025-06-24 14:49:29 - Checking renv initialization...
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ renv already initialized
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ renv already active
|
|
||||||
[INFO] 2025-06-24 14:49:29 -
|
|
||||||
=== INITIAL STATE ===
|
|
||||||
[INFO] 2025-06-24 14:49:29 - === PACKAGE REPORT ===
|
|
||||||
[INFO] 2025-06-24 14:49:29 - dplyr | Required: >= 1.1.4 | Installed: 1.1.4 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - here | Required: >= 1.0.1 | Installed: 1.0.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - lubridate | Required: >= 1.9.4 | Installed: 1.9.4 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - readr | Required: >= 2.1.5 | Installed: 2.1.5 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - readxl | Required: >= 1.4.5 | Installed: 1.4.5 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - stringr | Required: >= 1.5.1 | Installed: 1.5.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - tidyr | Required: >= 1.3.1 | Installed: 1.3.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - purrr | Required: >= 1.0.2 | Installed: 1.0.2 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - magrittr | Required: >= 2.0.0 | Installed: 2.0.3 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - exactextractr | Required: >= 0.10.0 | Installed: 0.10.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - raster | Required: >= 3.6.32 | Installed: 3.6.32 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - sf | Required: >= 1.0.19 | Installed: 1.0.19 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - terra | Required: >= 1.8.43 | Installed: 1.8.43 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ggplot2 | Required: >= 3.5.1 | Installed: 3.5.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - tmap | Required: >= 4.0 | Installed: 4.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - gridExtra | Required: >= 2.3 | Installed: 2.3 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - knitr | Required: >= 1.50 | Installed: 1.50 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - rmarkdown | Required: >= 2.21.0 | Installed: 2.29 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - tidyverse | Required: >= 2.0.0 | Installed: 2.0.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - caret | Required: >= 7.0.1 | Installed: 7.0.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - CAST | Required: >= 1.0.3 | Installed: 1.0.3 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - randomForest | Required: >= 4.7.1.2 | Installed: 4.7.1.2 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - rsample | Required: >= 1.3.0 | Installed: 1.3.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - furrr | Required: >= 0.3.1 | Installed: 0.3.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - future | Required: >= 1.40.0 | Installed: 1.40.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - progressr | Required: >= 0.15.1 | Installed: 0.15.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - reshape2 | Required: >= 1.4.4 | Installed: 1.4.4 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - zoo | Required: >= 1.8.13 | Installed: 1.8.13 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:29 - === END PACKAGE REPORT ===
|
|
||||||
[INFO] 2025-06-24 14:49:29 -
|
|
||||||
=== PACKAGE INSTALLATION/UPDATES ===
|
|
||||||
[INFO] 2025-06-24 14:49:29 - === PACKAGE MANAGEMENT STARTED ===
|
|
||||||
[INFO] 2025-06-24 14:49:29 - R version: R version 4.4.2 (2024-10-31 ucrt)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ dplyr version 1.1.4 meets requirement (>= 1.1.4)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ here version 1.0.1 meets requirement (>= 1.0.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ lubridate version 1.9.4 meets requirement (>= 1.9.4)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ readr version 2.1.5 meets requirement (>= 2.1.5)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ readxl version 1.4.5 meets requirement (>= 1.4.5)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ stringr version 1.5.1 meets requirement (>= 1.5.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ tidyr version 1.3.1 meets requirement (>= 1.3.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ purrr version 1.0.2 meets requirement (>= 1.0.2)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ magrittr version 2.0.3 meets requirement (>= 2.0.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ exactextractr version 0.10.0 meets requirement (>= 0.10.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ raster version 3.6.32 meets requirement (>= 3.6.32)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ sf version 1.0.19 meets requirement (>= 1.0.19)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ terra version 1.8.43 meets requirement (>= 1.8.43)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ ggplot2 version 3.5.1 meets requirement (>= 3.5.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ tmap version 4.0 meets requirement (>= 4.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ gridExtra version 2.3 meets requirement (>= 2.3)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ knitr version 1.50 meets requirement (>= 1.50)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ rmarkdown version 2.29 meets requirement (>= 2.21.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ tidyverse version 2.0.0 meets requirement (>= 2.0.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ caret version 7.0.1 meets requirement (>= 7.0.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ CAST version 1.0.3 meets requirement (>= 1.0.3)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ randomForest version 4.7.1.2 meets requirement (>= 4.7.1.2)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ rsample version 1.3.0 meets requirement (>= 1.3.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ furrr version 0.3.1 meets requirement (>= 0.3.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ future version 1.40.0 meets requirement (>= 1.40.0)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ progressr version 0.15.1 meets requirement (>= 0.15.1)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ reshape2 version 1.4.4 meets requirement (>= 1.4.4)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - ✓ zoo version 1.8.13 meets requirement (>= 1.8.13)
|
|
||||||
[INFO] 2025-06-24 14:49:29 - Package management complete: 28 success, 0 failures
|
|
||||||
[INFO] 2025-06-24 14:49:29 - Updating renv lockfile...
|
|
||||||
[ERROR] 2025-06-24 14:49:33 - ✗ Failed to update lockfile: aborting snapshot due to pre-flight validation failure
|
|
||||||
[INFO] 2025-06-24 14:49:33 -
|
|
||||||
=== FINAL STATE ===
|
|
||||||
[INFO] 2025-06-24 14:49:33 - === PACKAGE REPORT ===
|
|
||||||
[INFO] 2025-06-24 14:49:33 - dplyr | Required: >= 1.1.4 | Installed: 1.1.4 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - here | Required: >= 1.0.1 | Installed: 1.0.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - lubridate | Required: >= 1.9.4 | Installed: 1.9.4 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - readr | Required: >= 2.1.5 | Installed: 2.1.5 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - readxl | Required: >= 1.4.5 | Installed: 1.4.5 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - stringr | Required: >= 1.5.1 | Installed: 1.5.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - tidyr | Required: >= 1.3.1 | Installed: 1.3.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - purrr | Required: >= 1.0.2 | Installed: 1.0.2 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - magrittr | Required: >= 2.0.0 | Installed: 2.0.3 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - exactextractr | Required: >= 0.10.0 | Installed: 0.10.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - raster | Required: >= 3.6.32 | Installed: 3.6.32 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - sf | Required: >= 1.0.19 | Installed: 1.0.19 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - terra | Required: >= 1.8.43 | Installed: 1.8.43 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - ggplot2 | Required: >= 3.5.1 | Installed: 3.5.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - tmap | Required: >= 4.0 | Installed: 4.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - gridExtra | Required: >= 2.3 | Installed: 2.3 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - knitr | Required: >= 1.50 | Installed: 1.50 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - rmarkdown | Required: >= 2.21.0 | Installed: 2.29 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - tidyverse | Required: >= 2.0.0 | Installed: 2.0.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - caret | Required: >= 7.0.1 | Installed: 7.0.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - CAST | Required: >= 1.0.3 | Installed: 1.0.3 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - randomForest | Required: >= 4.7.1.2 | Installed: 4.7.1.2 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - rsample | Required: >= 1.3.0 | Installed: 1.3.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - furrr | Required: >= 0.3.1 | Installed: 0.3.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - future | Required: >= 1.40.0 | Installed: 1.40.0 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - progressr | Required: >= 0.15.1 | Installed: 0.15.1 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - reshape2 | Required: >= 1.4.4 | Installed: 1.4.4 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - zoo | Required: >= 1.8.13 | Installed: 1.8.13 | ✅ OK
|
|
||||||
[INFO] 2025-06-24 14:49:33 - === END PACKAGE REPORT ===
|
|
||||||
[INFO] 2025-06-24 14:49:33 - Package management completed in 7.72 seconds
|
|
||||||
[INFO] 2025-06-24 14:49:33 - Log saved to: C:/Users/timon/Resilience BV/4020 SCane ESA DEMO - Documenten/General/4020 SCDEMO Team/4020 TechnicalData/WP3/smartcane/package_manager.log
|
|
||||||
[SUCCESS] 2025-06-24 14:49:33 - 🎉 All packages successfully managed!
|
|
||||||
[INFO] 2025-06-24 14:49:33 - 📋 Next steps:
|
|
||||||
[INFO] 2025-06-24 14:49:33 - 1. Test your R scripts to ensure everything works
|
|
||||||
[INFO] 2025-06-24 14:49:33 - 2. Commit renv.lock to version control
|
|
||||||
[INFO] 2025-06-24 14:49:33 - 3. Share this script with your team
|
|
||||||
45
python_app/.gitignore
vendored
|
|
@ -1,45 +0,0 @@
|
||||||
# .gitignore
|
|
||||||
|
|
||||||
# Ignore the virtual environment directory
|
|
||||||
myenv/
|
|
||||||
|
|
||||||
# Byte-compiled / optimized files
|
|
||||||
__pycache__/
|
|
||||||
*.py[cod]
|
|
||||||
*$py.class
|
|
||||||
|
|
||||||
# Ignore all .pyc files in any directory
|
|
||||||
**/*.pyc
|
|
||||||
|
|
||||||
# Ignore pip wheel metadata
|
|
||||||
pip-wheel-metadata/
|
|
||||||
|
|
||||||
# Ignore the `dist/` directory used for package distributions
|
|
||||||
dist/
|
|
||||||
|
|
||||||
# Ignore Python egg metadata, regenerated from source files by setuptools
|
|
||||||
*.egg-info/
|
|
||||||
*.egg-info/*
|
|
||||||
*.egg/
|
|
||||||
|
|
||||||
# Ignore Jupyter Notebook
|
|
||||||
.ipynb_checkpoints
|
|
||||||
|
|
||||||
# Ignore VSCode directory
|
|
||||||
.vscode/
|
|
||||||
|
|
||||||
# Ignore JetBrains IDEs directory
|
|
||||||
.idea/
|
|
||||||
|
|
||||||
# Ignore MacOS directory
|
|
||||||
.DS_Store
|
|
||||||
|
|
||||||
# Ignore other unwanted file
|
|
||||||
*.log
|
|
||||||
*.bak
|
|
||||||
*.swp
|
|
||||||
*.swo
|
|
||||||
|
|
||||||
*.png
|
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -1,86 +0,0 @@
|
||||||
# Sentinel-1 SAR Download for Aura Fields
|
|
||||||
|
|
||||||
This folder contains scripts to download and preprocess Sentinel-1 SAR data for crop monitoring.
|
|
||||||
|
|
||||||
## Quick Start
|
|
||||||
|
|
||||||
### 1. Setup Environment
|
|
||||||
```powershell
|
|
||||||
# Navigate to the python_scripts directory
|
|
||||||
cd "c:\Users\timon\Resilience BV\4020 SCane ESA DEMO - Documenten\General\4020 SCDEMO Team\4020 TechnicalData\WP3\smartcane\python_scripts"
|
|
||||||
|
|
||||||
# Run setup script
|
|
||||||
python setup_sar_environment.py
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Get SentinelHub Credentials
|
|
||||||
- Go to https://apps.sentinel-hub.com/
|
|
||||||
- Create account (free tier available)
|
|
||||||
- Get your Client ID and Client Secret
|
|
||||||
- The script will prompt for these when you first run it
|
|
||||||
|
|
||||||
### 3. Prepare Field Boundaries
|
|
||||||
- Make sure you have your field boundaries in GeoJSON format
|
|
||||||
- The script will look for files like:
|
|
||||||
- `pivot.geojson` (current directory)
|
|
||||||
- `pivot_20210625.geojson` (current directory)
|
|
||||||
- `data/aura/field_boundaries/aura_fields.geojson`
|
|
||||||
|
|
||||||
### 4. Download SAR Data
|
|
||||||
```powershell
|
|
||||||
python download_s1_aura.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## What the Script Does
|
|
||||||
|
|
||||||
1. **Downloads last 8 weeks** of Sentinel-1 data
|
|
||||||
2. **Downloads both VV and VH polarizations**
|
|
||||||
3. **Provides both linear and dB scales** for analysis
|
|
||||||
4. **Applies basic speckle filtering**
|
|
||||||
5. **Organizes by week** (week_XX_YYYY_BAND.tif format)
|
|
||||||
|
|
||||||
## Output Structure
|
|
||||||
```
|
|
||||||
data/aura/weekly_SAR_mosaic/
|
|
||||||
├── week_24_2025_VV.tif
|
|
||||||
├── week_24_2025_VH.tif
|
|
||||||
├── week_24_2025_VV_dB.tif
|
|
||||||
├── week_24_2025_VH_dB.tif
|
|
||||||
├── week_24_2025_VV_dB_filtered.tif
|
|
||||||
├── week_24_2025_VH_dB_filtered.tif
|
|
||||||
└── ... (for each week)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Files Created
|
|
||||||
|
|
||||||
- **`download_s1_aura.py`** - Main download script
|
|
||||||
- **`requirements_sar.txt`** - Python dependencies
|
|
||||||
- **`setup_sar_environment.py`** - Environment setup helper
|
|
||||||
- **`sar_download.log`** - Download log file
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Common Issues:
|
|
||||||
1. **Import errors**: Run `python setup_sar_environment.py` first
|
|
||||||
2. **Credential errors**: Make sure SentinelHub credentials are correct
|
|
||||||
3. **No data found**: Check if field boundaries are loaded correctly
|
|
||||||
4. **Large downloads**: SAR data can be large, ensure good internet connection
|
|
||||||
|
|
||||||
### Check Log File:
|
|
||||||
```powershell
|
|
||||||
Get-Content sar_download.log -Tail 20
|
|
||||||
```
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
After successful download:
|
|
||||||
1. Check the output files in `data/aura/weekly_SAR_mosaic/`
|
|
||||||
2. Move to R for analysis and visualization
|
|
||||||
3. Create SAR analysis scripts in R
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- **Free Tier Limits**: SentinelHub free tier has processing unit limits
|
|
||||||
- **Data Size**: Each weekly mosaic can be 50-200MB depending on area
|
|
||||||
- **Processing Time**: Downloads can take 5-15 minutes per week
|
|
||||||
- **Format**: All outputs are GeoTIFF with proper CRS information
|
|
||||||
|
|
@ -1,58 +0,0 @@
|
||||||
from osgeo import gdal
|
|
||||||
import numpy as np
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
print("="*70)
|
|
||||||
print("CHECKING INDIVIDUAL TILES")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
# Check individual tiles
|
|
||||||
base = Path(r"C:\Users\timon\Resilience BV\4020 SCane ESA DEMO - Documenten\General\4020 SCDEMO Team\4020 TechnicalData\WP3\smartcane_v2\smartcane\laravel_app\storage\app\aura\cloud_test_single_images\2025-10-17")
|
|
||||||
tiles = [x for x in base.iterdir() if x.is_dir()]
|
|
||||||
print(f"\nTotal tiles: {len(tiles)}")
|
|
||||||
|
|
||||||
good_tiles = 0
|
|
||||||
empty_tiles = 0
|
|
||||||
|
|
||||||
for t in tiles:
|
|
||||||
tif = t / 'response.tiff'
|
|
||||||
if tif.exists():
|
|
||||||
ds = gdal.Open(str(tif))
|
|
||||||
r = ds.GetRasterBand(1).ReadAsArray()
|
|
||||||
pct = (r > 0).sum() / r.size * 100
|
|
||||||
mean_val = r[r > 0].mean() if (r > 0).sum() > 0 else 0
|
|
||||||
|
|
||||||
if pct > 10:
|
|
||||||
good_tiles += 1
|
|
||||||
print(f" ✓ Tile {t.name[:8]}... : {pct:5.1f}% non-zero, mean={mean_val:.3f}")
|
|
||||||
elif pct > 0:
|
|
||||||
print(f" ~ Tile {t.name[:8]}... : {pct:5.1f}% non-zero (sparse)")
|
|
||||||
else:
|
|
||||||
empty_tiles += 1
|
|
||||||
|
|
||||||
print(f"\nSummary: {good_tiles} good tiles, {empty_tiles} completely empty tiles")
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("CHECKING MERGED TIF")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
tif_path = r"C:\Users\timon\Resilience BV\4020 SCane ESA DEMO - Documenten\General\4020 SCDEMO Team\4020 TechnicalData\WP3\smartcane_v2\smartcane\laravel_app\storage\app\aura\cloud_test_merged_tif\2025-10-17.tif"
|
|
||||||
|
|
||||||
ds = gdal.Open(tif_path)
|
|
||||||
print(f"\nFile: 2025-10-17.tif")
|
|
||||||
print(f"Size: {ds.RasterXSize} x {ds.RasterYSize}")
|
|
||||||
print(f"Bands: {ds.RasterCount}")
|
|
||||||
|
|
||||||
red = ds.GetRasterBand(1).ReadAsArray()
|
|
||||||
print(f"\nRed band:")
|
|
||||||
print(f" Non-zero pixels: {(red > 0).sum() / red.size * 100:.2f}%")
|
|
||||||
print(f" Mean (all): {red.mean():.6f}")
|
|
||||||
print(f" Mean (non-zero): {red[red > 0].mean():.4f}")
|
|
||||||
print(f" Max: {red.max():.4f}")
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("DIAGNOSIS")
|
|
||||||
print("="*70)
|
|
||||||
print("\nThe problem: Most tiles are EMPTY (outside Planet imagery footprint)")
|
|
||||||
print("When merged, empty tiles dominate, making the image appear almost black.")
|
|
||||||
print("\nSolution: Use tighter bounding boxes or single bbox for the actual fields.")
|
|
||||||
|
|
@ -1,725 +0,0 @@
|
||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "5ea10771",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"# Cloud Detection - Step 1: Identify Cloudy Images\n",
|
|
||||||
"\n",
|
|
||||||
"This notebook downloads Planet imagery for the **Aura** project (last 3 weeks) and helps identify which images contain clouds.\n",
|
|
||||||
"\n",
|
|
||||||
"**Workflow:**\n",
|
|
||||||
"1. Connect to SentinelHub\n",
|
|
||||||
"2. Define Aura project area\n",
|
|
||||||
"3. Download images from last 3 weeks\n",
|
|
||||||
"4. Generate quick-look visualizations\n",
|
|
||||||
"5. Identify cloudy images for testing with OmniCloudMask"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "4f43a8b9",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 1. Setup and Imports"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "1b300ebc",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Install required packages (uncomment if needed)\n",
|
|
||||||
"# !pip install sentinelhub\n",
|
|
||||||
"# !pip install geopandas matplotlib pillow\n",
|
|
||||||
"\n",
|
|
||||||
"import os\n",
|
|
||||||
"import json\n",
|
|
||||||
"import datetime\n",
|
|
||||||
"import numpy as np\n",
|
|
||||||
"import matplotlib.pyplot as plt\n",
|
|
||||||
"from pathlib import Path\n",
|
|
||||||
"from osgeo import gdal\n",
|
|
||||||
"\n",
|
|
||||||
"from sentinelhub import (\n",
|
|
||||||
" MimeType, CRS, BBox, SentinelHubRequest, SentinelHubDownloadClient,\n",
|
|
||||||
" DataCollection, bbox_to_dimensions, SHConfig, BBoxSplitter, Geometry, SentinelHubCatalog\n",
|
|
||||||
")\n",
|
|
||||||
"\n",
|
|
||||||
"import time\n",
|
|
||||||
"import shutil\n",
|
|
||||||
"import geopandas as gpd\n",
|
|
||||||
"from shapely.geometry import MultiLineString, MultiPolygon, Polygon\n",
|
|
||||||
"from PIL import Image"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "6b0d9534",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 2. Configure SentinelHub"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "72a2d6ca",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"config = SHConfig()\n",
|
|
||||||
"config.sh_client_id = '1a72d811-4f0e-4447-8282-df09608cff44'\n",
|
|
||||||
"config.sh_client_secret = 'FcBlRL29i9ZmTzhmKTv1etSMFs5PxSos'\n",
|
|
||||||
"\n",
|
|
||||||
"catalog = SentinelHubCatalog(config=config)\n",
|
|
||||||
"\n",
|
|
||||||
"# Define BYOC collection\n",
|
|
||||||
"collection_id = 'c691479f-358c-46b1-b0f0-e12b70a9856c'\n",
|
|
||||||
"byoc = DataCollection.define_byoc(\n",
|
|
||||||
" collection_id,\n",
|
|
||||||
" name='planet_data2',\n",
|
|
||||||
" is_timeless=True\n",
|
|
||||||
")\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"✓ SentinelHub configured\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "b43e776d",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 3. Define Project and Paths"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "595021b5",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"project = 'aura'\n",
|
|
||||||
"resolution = 3 # 3m resolution for Planet\n",
|
|
||||||
"\n",
|
|
||||||
"# Define paths\n",
|
|
||||||
"BASE_PATH = Path('../laravel_app/storage/app') / project\n",
|
|
||||||
"BASE_PATH_SINGLE_IMAGES = BASE_PATH / 'cloud_test_single_images'\n",
|
|
||||||
"folder_for_merged_tifs = BASE_PATH / 'cloud_test_merged_tif'\n",
|
|
||||||
"folder_for_virtual_raster = BASE_PATH / 'cloud_test_merged_virtual'\n",
|
|
||||||
"geojson_file = BASE_PATH / 'Data' / 'pivot.geojson'\n",
|
|
||||||
"\n",
|
|
||||||
"# Create folders if they don't exist\n",
|
|
||||||
"for folder in [BASE_PATH_SINGLE_IMAGES, folder_for_merged_tifs, folder_for_virtual_raster]:\n",
|
|
||||||
" folder.mkdir(parents=True, exist_ok=True)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Project: {project}\")\n",
|
|
||||||
"print(f\"Base path: {BASE_PATH}\")\n",
|
|
||||||
"print(f\"GeoJSON: {geojson_file}\")\n",
|
|
||||||
"print(f\"✓ Folders created/verified\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "ca46160a",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 4. Define Time Period (Last 3 Weeks)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "1e6d4013",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Calculate last 3 weeks (21 days)\n",
|
|
||||||
"end_date = datetime.date.today()\n",
|
|
||||||
"start_date = end_date - datetime.timedelta(days=21)\n",
|
|
||||||
"\n",
|
|
||||||
"# Generate daily slots\n",
|
|
||||||
"days_needed = 21\n",
|
|
||||||
"slots = [(start_date + datetime.timedelta(days=i)).strftime('%Y-%m-%d') for i in range(days_needed)]\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Date range: {start_date} to {end_date}\")\n",
|
|
||||||
"print(f\"Total days: {len(slots)}\")\n",
|
|
||||||
"print(f\"\\nFirst 5 dates: {slots[:5]}\")\n",
|
|
||||||
"print(f\"Last 5 dates: {slots[-5:]}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "df16c395",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 5. Load Field Boundaries and Create BBox Grid"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "cf88f697",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Load GeoJSON\n",
|
|
||||||
"geo_json = gpd.read_file(str(geojson_file))\n",
|
|
||||||
"print(f\"Loaded {len(geo_json)} field polygons\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Create geometries\n",
|
|
||||||
"geometries = [Geometry(geometry, crs=CRS.WGS84) for geometry in geo_json.geometry]\n",
|
|
||||||
"shapely_geometries = [geometry.geometry for geometry in geometries]\n",
|
|
||||||
"\n",
|
|
||||||
"# Get total bounds\n",
|
|
||||||
"from shapely.geometry import box\n",
|
|
||||||
"total_bounds = geo_json.total_bounds # [minx, miny, maxx, maxy]\n",
|
|
||||||
"print(f\"\\nTotal bounds: {total_bounds}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Calculate approximate image size for single bbox\n",
|
|
||||||
"single_bbox_test = BBox(bbox=tuple(total_bounds), crs=CRS.WGS84)\n",
|
|
||||||
"single_size = bbox_to_dimensions(single_bbox_test, resolution=resolution)\n",
|
|
||||||
"print(f\"Single bbox would create image of: {single_size[0]} x {single_size[1]} pixels\")\n",
|
|
||||||
"\n",
|
|
||||||
"# SentinelHub limit is 2500x2500 pixels\n",
|
|
||||||
"if single_size[0] > 2500 or single_size[1] > 2500:\n",
|
|
||||||
" print(f\"⚠️ Image too large for single download (max 2500x2500)\")\n",
|
|
||||||
" print(f\" Using 2x2 grid to split into smaller tiles...\")\n",
|
|
||||||
" \n",
|
|
||||||
" # Use BBoxSplitter with 2x2 grid\n",
|
|
||||||
" bbox_splitter = BBoxSplitter(\n",
|
|
||||||
" shapely_geometries, CRS.WGS84, (2, 2), reduce_bbox_sizes=True\n",
|
|
||||||
" )\n",
|
|
||||||
" bbox_list = bbox_splitter.get_bbox_list()\n",
|
|
||||||
" print(f\" Split into {len(bbox_list)} tiles\")\n",
|
|
||||||
"else:\n",
|
|
||||||
" print(f\"✓ Single bbox works - using 1 tile per date\")\n",
|
|
||||||
" bbox_list = [single_bbox_test]\n",
|
|
||||||
"\n",
|
|
||||||
"# Verify tile sizes\n",
|
|
||||||
"print(f\"\\nVerifying tile sizes:\")\n",
|
|
||||||
"for i, bbox in enumerate(bbox_list, 1):\n",
|
|
||||||
" size = bbox_to_dimensions(bbox, resolution=resolution)\n",
|
|
||||||
" status = \"✓\" if size[0] <= 2500 and size[1] <= 2500 else \"✗\"\n",
|
|
||||||
" print(f\" Tile {i}: {size[0]} x {size[1]} pixels {status}\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "f78964df",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 6. Check Image Availability"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "09c2fcc6",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 5.5. Visualize Download Grid (Optional)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "1e1a7660",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Visualize the download grid to ensure good coverage\n",
|
|
||||||
"fig, ax = plt.subplots(1, 1, figsize=(12, 12))\n",
|
|
||||||
"\n",
|
|
||||||
"# Plot field boundaries\n",
|
|
||||||
"geo_json.boundary.plot(ax=ax, color='green', linewidth=2, label='Fields')\n",
|
|
||||||
"\n",
|
|
||||||
"# Plot bboxes\n",
|
|
||||||
"for i, bbox in enumerate(bbox_list):\n",
|
|
||||||
" bbox_geom = box(bbox[0], bbox[1], bbox[2], bbox[3])\n",
|
|
||||||
" x, y = bbox_geom.exterior.xy\n",
|
|
||||||
" ax.plot(x, y, 'r--', linewidth=1, alpha=0.7)\n",
|
|
||||||
" # Add bbox number\n",
|
|
||||||
" centroid = bbox_geom.centroid\n",
|
|
||||||
" ax.text(centroid.x, centroid.y, str(i+1), fontsize=10, ha='center', \n",
|
|
||||||
" bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))\n",
|
|
||||||
"\n",
|
|
||||||
"ax.set_xlabel('Longitude')\n",
|
|
||||||
"ax.set_ylabel('Latitude')\n",
|
|
||||||
"ax.set_title('Download Grid (Red) vs Field Boundaries (Green)', fontsize=14, fontweight='bold')\n",
|
|
||||||
"ax.legend()\n",
|
|
||||||
"ax.grid(True, alpha=0.3)\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.show()\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"✓ Visualization complete - verify that red boxes cover green field boundaries\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "2fcded08",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def is_image_available(date):\n",
|
|
||||||
" \"\"\"Check if Planet images are available for a given date.\"\"\"\n",
|
|
||||||
" for bbox in bbox_list:\n",
|
|
||||||
" search_iterator = catalog.search(\n",
|
|
||||||
" collection=byoc,\n",
|
|
||||||
" bbox=bbox,\n",
|
|
||||||
" time=(date, date)\n",
|
|
||||||
" )\n",
|
|
||||||
" if len(list(search_iterator)) > 0:\n",
|
|
||||||
" return True\n",
|
|
||||||
" return False\n",
|
|
||||||
"\n",
|
|
||||||
"# Filter to available dates only\n",
|
|
||||||
"print(\"Checking image availability...\")\n",
|
|
||||||
"available_slots = [slot for slot in slots if is_image_available(slot)]\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n{'='*60}\")\n",
|
|
||||||
"print(f\"Total requested dates: {len(slots)}\")\n",
|
|
||||||
"print(f\"Available dates: {len(available_slots)}\")\n",
|
|
||||||
"print(f\"Excluded (no data): {len(slots) - len(available_slots)}\")\n",
|
|
||||||
"print(f\"{'='*60}\")\n",
|
|
||||||
"print(f\"\\nAvailable dates:\")\n",
|
|
||||||
"for slot in available_slots:\n",
|
|
||||||
" print(f\" - {slot}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "b67f5deb",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 7. Define Download Functions"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "26cd367f",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Evalscript to get RGB + NIR + UDM1 mask\n",
|
|
||||||
"# NOTE: Not specifying sampleType makes SentinelHub auto-convert 0-1 float to 0-255 byte\n",
|
|
||||||
"# This matches the production script behavior\n",
|
|
||||||
"evalscript_with_udm = \"\"\"\n",
|
|
||||||
" //VERSION=3\n",
|
|
||||||
"\n",
|
|
||||||
" function setup() {\n",
|
|
||||||
" return {\n",
|
|
||||||
" input: [{\n",
|
|
||||||
" bands: [\"red\", \"green\", \"blue\", \"nir\", \"udm1\"]\n",
|
|
||||||
" }],\n",
|
|
||||||
" output: {\n",
|
|
||||||
" bands: 5\n",
|
|
||||||
" // sampleType: \"FLOAT32\" - commented out to get 0-255 byte output like production\n",
|
|
||||||
" }\n",
|
|
||||||
" };\n",
|
|
||||||
" }\n",
|
|
||||||
"\n",
|
|
||||||
" function evaluatePixel(sample) {\n",
|
|
||||||
" // Return all bands including udm1 (last band)\n",
|
|
||||||
" return [\n",
|
|
||||||
" 2.5 * sample.red / 10000,\n",
|
|
||||||
" 2.5 * sample.green / 10000,\n",
|
|
||||||
" 2.5 * sample.blue / 10000,\n",
|
|
||||||
" 2.5 * sample.nir / 10000,\n",
|
|
||||||
" sample.udm1 // 0 = usable, 1 = unusable (clouds, shadows, etc.)\n",
|
|
||||||
" ];\n",
|
|
||||||
" }\n",
|
|
||||||
"\"\"\"\n",
|
|
||||||
"\n",
|
|
||||||
"def get_download_request(time_interval, bbox, size):\n",
|
|
||||||
" \"\"\"Create a SentinelHub request for a given date and bbox.\"\"\"\n",
|
|
||||||
" return SentinelHubRequest(\n",
|
|
||||||
" evalscript=evalscript_with_udm,\n",
|
|
||||||
" input_data=[\n",
|
|
||||||
" SentinelHubRequest.input_data(\n",
|
|
||||||
" data_collection=DataCollection.planet_data2,\n",
|
|
||||||
" time_interval=(time_interval, time_interval)\n",
|
|
||||||
" )\n",
|
|
||||||
" ],\n",
|
|
||||||
" responses=[\n",
|
|
||||||
" SentinelHubRequest.output_response('default', MimeType.TIFF)\n",
|
|
||||||
" ],\n",
|
|
||||||
" bbox=bbox,\n",
|
|
||||||
" size=size,\n",
|
|
||||||
" config=config,\n",
|
|
||||||
" data_folder=str(BASE_PATH_SINGLE_IMAGES / time_interval),\n",
|
|
||||||
" )\n",
|
|
||||||
"\n",
|
|
||||||
"def download_for_date_and_bbox(slot, bbox, size):\n",
|
|
||||||
" \"\"\"Download image for a specific date and bounding box.\"\"\"\n",
|
|
||||||
" list_of_requests = [get_download_request(slot, bbox, size)]\n",
|
|
||||||
" list_of_requests = [request.download_list[0] for request in list_of_requests]\n",
|
|
||||||
" \n",
|
|
||||||
" data = SentinelHubDownloadClient(config=config).download(list_of_requests, max_threads=5)\n",
|
|
||||||
" time.sleep(0.1)\n",
|
|
||||||
" return data\n",
|
|
||||||
"\n",
|
|
||||||
"def merge_tiles_for_date(slot):\n",
|
|
||||||
" \"\"\"Merge all tiles for a given date into one GeoTIFF.\"\"\"\n",
|
|
||||||
" # List downloaded tiles\n",
|
|
||||||
" file_list = [str(x / \"response.tiff\") for x in Path(BASE_PATH_SINGLE_IMAGES / slot).iterdir() if x.is_dir()]\n",
|
|
||||||
" \n",
|
|
||||||
" if not file_list:\n",
|
|
||||||
" print(f\" No tiles found for {slot}\")\n",
|
|
||||||
" return None\n",
|
|
||||||
" \n",
|
|
||||||
" vrt_path = str(folder_for_virtual_raster / f\"merged_{slot}.vrt\")\n",
|
|
||||||
" output_path = str(folder_for_merged_tifs / f\"{slot}.tif\")\n",
|
|
||||||
" \n",
|
|
||||||
" # Create virtual raster with proper options\n",
|
|
||||||
" vrt_options = gdal.BuildVRTOptions(\n",
|
|
||||||
" resolution='highest',\n",
|
|
||||||
" separate=False,\n",
|
|
||||||
" addAlpha=False\n",
|
|
||||||
" )\n",
|
|
||||||
" vrt = gdal.BuildVRT(vrt_path, file_list, options=vrt_options)\n",
|
|
||||||
" vrt = None # Close\n",
|
|
||||||
" \n",
|
|
||||||
" # Convert to GeoTIFF with proper options\n",
|
|
||||||
" # Use COMPRESS=LZW to save space, TILED for better performance\n",
|
|
||||||
" translate_options = gdal.TranslateOptions(\n",
|
|
||||||
" creationOptions=['COMPRESS=LZW', 'TILED=YES', 'BIGTIFF=IF_SAFER']\n",
|
|
||||||
" )\n",
|
|
||||||
" gdal.Translate(output_path, vrt_path, options=translate_options)\n",
|
|
||||||
" \n",
|
|
||||||
" return output_path\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"✓ Download functions defined\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e9f17ba8",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 8. Download Images"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "e66173ea",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"Starting download for {len(available_slots)} dates...\\n\")\n",
|
|
||||||
"\n",
|
|
||||||
"for i, slot in enumerate(available_slots, 1):\n",
|
|
||||||
" print(f\"[{i}/{len(available_slots)}] Downloading {slot}...\")\n",
|
|
||||||
" \n",
|
|
||||||
" for j, bbox in enumerate(bbox_list, 1):\n",
|
|
||||||
" bbox_obj = BBox(bbox=bbox, crs=CRS.WGS84)\n",
|
|
||||||
" size = bbox_to_dimensions(bbox_obj, resolution=resolution)\n",
|
|
||||||
" \n",
|
|
||||||
" try:\n",
|
|
||||||
" download_for_date_and_bbox(slot, bbox_obj, size)\n",
|
|
||||||
" print(f\" ✓ Tile {j}/{len(bbox_list)} downloaded\")\n",
|
|
||||||
" except Exception as e:\n",
|
|
||||||
" print(f\" ✗ Tile {j}/{len(bbox_list)} failed: {e}\")\n",
|
|
||||||
" \n",
|
|
||||||
" time.sleep(0.2)\n",
|
|
||||||
" \n",
|
|
||||||
" print()\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\n✓ All downloads complete!\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e4bec74c",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 9. Merge Tiles into Single Images"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "e9b270be",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(\"Merging tiles for each date...\\n\")\n",
|
|
||||||
"\n",
|
|
||||||
"merged_files = {}\n",
|
|
||||||
"for slot in available_slots:\n",
|
|
||||||
" print(f\"Merging {slot}...\")\n",
|
|
||||||
" output_path = merge_tiles_for_date(slot)\n",
|
|
||||||
" if output_path:\n",
|
|
||||||
" merged_files[slot] = output_path\n",
|
|
||||||
" print(f\" ✓ Saved to: {output_path}\")\n",
|
|
||||||
" else:\n",
|
|
||||||
" print(f\" ✗ Failed to merge\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n✓ Successfully merged {len(merged_files)} images\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "ec3f1a6d",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 10. Analyze Cloud Coverage Using UDM1"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "9f4047e5",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def analyze_cloud_coverage(tif_path):\n",
|
|
||||||
" \"\"\"Calculate cloud coverage percentage using UDM1 band (band 5).\"\"\"\n",
|
|
||||||
" ds = gdal.Open(tif_path)\n",
|
|
||||||
" if ds is None:\n",
|
|
||||||
" return None, None\n",
|
|
||||||
" \n",
|
|
||||||
" # Band 5 is UDM1 (0 = clear, 1 = cloudy/unusable)\n",
|
|
||||||
" udm_band = ds.GetRasterBand(5).ReadAsArray()\n",
|
|
||||||
" \n",
|
|
||||||
" total_pixels = udm_band.size\n",
|
|
||||||
" cloudy_pixels = np.sum(udm_band == 1)\n",
|
|
||||||
" cloud_percentage = (cloudy_pixels / total_pixels) * 100\n",
|
|
||||||
" \n",
|
|
||||||
" ds = None\n",
|
|
||||||
" return cloud_percentage, udm_band\n",
|
|
||||||
"\n",
|
|
||||||
"# Analyze all images\n",
|
|
||||||
"cloud_stats = {}\n",
|
|
||||||
"print(\"Analyzing cloud coverage...\\n\")\n",
|
|
||||||
"print(f\"{'Date':<12} {'Cloud %':<10} {'Status'}\")\n",
|
|
||||||
"print(\"-\" * 40)\n",
|
|
||||||
"\n",
|
|
||||||
"for date, path in sorted(merged_files.items()):\n",
|
|
||||||
" cloud_pct, _ = analyze_cloud_coverage(path)\n",
|
|
||||||
" if cloud_pct is not None:\n",
|
|
||||||
" cloud_stats[date] = cloud_pct\n",
|
|
||||||
" \n",
|
|
||||||
" # Categorize\n",
|
|
||||||
" if cloud_pct < 5:\n",
|
|
||||||
" status = \"☀️ Clear\"\n",
|
|
||||||
" elif cloud_pct < 20:\n",
|
|
||||||
" status = \"🌤️ Mostly clear\"\n",
|
|
||||||
" elif cloud_pct < 50:\n",
|
|
||||||
" status = \"⛅ Partly cloudy\"\n",
|
|
||||||
" else:\n",
|
|
||||||
" status = \"☁️ Very cloudy\"\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\"{date:<12} {cloud_pct:>6.2f}% {status}\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n✓ Analysis complete for {len(cloud_stats)} images\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "3d966858",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 11. Visualize Images with Cloud Coverage"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "f8b2b2fc",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def create_quicklook(tif_path, date, cloud_pct):\n",
|
|
||||||
" \"\"\"Create RGB quicklook with UDM1 overlay.\"\"\"\n",
|
|
||||||
" ds = gdal.Open(tif_path)\n",
|
|
||||||
" if ds is None:\n",
|
|
||||||
" return None\n",
|
|
||||||
" \n",
|
|
||||||
" # Read RGB bands (1=R, 2=G, 3=B)\n",
|
|
||||||
" red = ds.GetRasterBand(1).ReadAsArray()\n",
|
|
||||||
" green = ds.GetRasterBand(2).ReadAsArray()\n",
|
|
||||||
" blue = ds.GetRasterBand(3).ReadAsArray()\n",
|
|
||||||
" udm = ds.GetRasterBand(5).ReadAsArray()\n",
|
|
||||||
" \n",
|
|
||||||
" # Clip to 0-1 range\n",
|
|
||||||
" rgb = np.dstack([np.clip(red, 0, 1), np.clip(green, 0, 1), np.clip(blue, 0, 1)])\n",
|
|
||||||
" \n",
|
|
||||||
" # Create figure\n",
|
|
||||||
" fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n",
|
|
||||||
" \n",
|
|
||||||
" # RGB image\n",
|
|
||||||
" axes[0].imshow(rgb)\n",
|
|
||||||
" axes[0].set_title(f\"RGB - {date}\", fontsize=14, fontweight='bold')\n",
|
|
||||||
" axes[0].axis('off')\n",
|
|
||||||
" \n",
|
|
||||||
" # UDM1 mask (clouds in red)\n",
|
|
||||||
" cloud_overlay = rgb.copy()\n",
|
|
||||||
" cloud_overlay[udm == 1] = [1, 0, 0] # Red for clouds\n",
|
|
||||||
" axes[1].imshow(cloud_overlay)\n",
|
|
||||||
" axes[1].set_title(f\"Cloud Mask (UDM1) - {cloud_pct:.1f}% cloudy\", fontsize=14, fontweight='bold')\n",
|
|
||||||
" axes[1].axis('off')\n",
|
|
||||||
" \n",
|
|
||||||
" plt.tight_layout()\n",
|
|
||||||
" ds = None\n",
|
|
||||||
" return fig\n",
|
|
||||||
"\n",
|
|
||||||
"# Display images sorted by cloud coverage (most cloudy first)\n",
|
|
||||||
"sorted_by_clouds = sorted(cloud_stats.items(), key=lambda x: x[1], reverse=True)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"Generating visualizations...\\n\")\n",
|
|
||||||
"for date, cloud_pct in sorted_by_clouds[:5]: # Show top 5 cloudiest\n",
|
|
||||||
" if date in merged_files:\n",
|
|
||||||
" fig = create_quicklook(merged_files[date], date, cloud_pct)\n",
|
|
||||||
" if fig:\n",
|
|
||||||
" plt.show()\n",
|
|
||||||
" plt.close()\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"✓ Visualizations complete\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "94de1b4b",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 12. Select Candidate Images for OmniCloudMask Testing"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "4ae8c727",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Select images with moderate to high cloud coverage (20-70%)\n",
|
|
||||||
"# These are good candidates for testing cloud detection\n",
|
|
||||||
"test_candidates = [\n",
|
|
||||||
" (date, cloud_pct, merged_files[date]) \n",
|
|
||||||
" for date, cloud_pct in cloud_stats.items() \n",
|
|
||||||
" if 20 <= cloud_pct <= 70\n",
|
|
||||||
"]\n",
|
|
||||||
"\n",
|
|
||||||
"test_candidates.sort(key=lambda x: x[1], reverse=True)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\n\" + \"=\"*60)\n",
|
|
||||||
"print(\"RECOMMENDED IMAGES FOR OMNICLOUDMASK TESTING\")\n",
|
|
||||||
"print(\"=\"*60)\n",
|
|
||||||
"print(f\"\\n{'Rank':<6} {'Date':<12} {'Cloud %':<10} {'Path'}\")\n",
|
|
||||||
"print(\"-\" * 80)\n",
|
|
||||||
"\n",
|
|
||||||
"for i, (date, cloud_pct, path) in enumerate(test_candidates[:5], 1):\n",
|
|
||||||
" print(f\"{i:<6} {date:<12} {cloud_pct:>6.2f}% {path}\")\n",
|
|
||||||
"\n",
|
|
||||||
"if test_candidates:\n",
|
|
||||||
" print(f\"\\n✓ Top candidate: {test_candidates[0][0]} ({test_candidates[0][1]:.1f}% cloudy)\")\n",
|
|
||||||
" print(f\" Path: {test_candidates[0][2]}\")\n",
|
|
||||||
" print(\"\\n👉 Use this image in Step 2 (cloud_detection_step2_test_omnicloudmask.ipynb)\")\n",
|
|
||||||
"else:\n",
|
|
||||||
" print(\"\\n⚠️ No suitable cloudy images found in this period.\")\n",
|
|
||||||
" print(\" Try extending the date range or select any available image.\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "ea103951",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 13. Export Summary"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "b5c78310",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Save summary to JSON for Step 2\n",
|
|
||||||
"summary = {\n",
|
|
||||||
" \"project\": project,\n",
|
|
||||||
" \"date_range\": f\"{start_date} to {end_date}\",\n",
|
|
||||||
" \"total_dates\": len(slots),\n",
|
|
||||||
" \"available_dates\": len(available_slots),\n",
|
|
||||||
" \"cloud_statistics\": cloud_stats,\n",
|
|
||||||
" \"test_candidates\": [\n",
|
|
||||||
" {\"date\": date, \"cloud_percentage\": cloud_pct, \"path\": path}\n",
|
|
||||||
" for date, cloud_pct, path in test_candidates[:5]\n",
|
|
||||||
" ],\n",
|
|
||||||
" \"merged_files\": merged_files\n",
|
|
||||||
"}\n",
|
|
||||||
"\n",
|
|
||||||
"summary_path = BASE_PATH / 'cloud_detection_summary.json'\n",
|
|
||||||
"with open(summary_path, 'w') as f:\n",
|
|
||||||
" json.dump(summary, f, indent=2)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"✓ Summary saved to: {summary_path}\")\n",
|
|
||||||
"print(\"\\n\" + \"=\"*60)\n",
|
|
||||||
"print(\"NEXT STEP: Open cloud_detection_step2_test_omnicloudmask.ipynb\")\n",
|
|
||||||
"print(\"=\"*60)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "f6f6d142",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 14. Cleanup (Optional)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "88a775f8",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Uncomment to delete intermediate files (single tiles and virtual rasters)\n",
|
|
||||||
"# Keep merged GeoTIFFs for Step 2\n",
|
|
||||||
"\n",
|
|
||||||
"cleanup = False # Set to True to enable cleanup\n",
|
|
||||||
"\n",
|
|
||||||
"if cleanup:\n",
|
|
||||||
" folders_to_clean = [BASE_PATH_SINGLE_IMAGES, folder_for_virtual_raster]\n",
|
|
||||||
" \n",
|
|
||||||
" for folder in folders_to_clean:\n",
|
|
||||||
" if folder.exists():\n",
|
|
||||||
" shutil.rmtree(folder)\n",
|
|
||||||
" folder.mkdir()\n",
|
|
||||||
" print(f\"✓ Cleaned: {folder}\")\n",
|
|
||||||
" \n",
|
|
||||||
" print(\"\\n✓ Cleanup complete - merged GeoTIFFs preserved\")\n",
|
|
||||||
"else:\n",
|
|
||||||
" print(\"Cleanup disabled. Set cleanup=True to remove intermediate files.\")"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"kernelspec": {
|
|
||||||
"display_name": "base",
|
|
||||||
"language": "python",
|
|
||||||
"name": "python3"
|
|
||||||
},
|
|
||||||
"language_info": {
|
|
||||||
"codemirror_mode": {
|
|
||||||
"name": "ipython",
|
|
||||||
"version": 3
|
|
||||||
},
|
|
||||||
"file_extension": ".py",
|
|
||||||
"mimetype": "text/x-python",
|
|
||||||
"name": "python",
|
|
||||||
"nbconvert_exporter": "python",
|
|
||||||
"pygments_lexer": "ipython3",
|
|
||||||
"version": "3.12.3"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 5
|
|
||||||
}
|
|
||||||
|
|
@ -1,319 +0,0 @@
|
||||||
import os
|
|
||||||
import argparse
|
|
||||||
import numpy as np
|
|
||||||
from pathlib import Path
|
|
||||||
from osgeo import gdal
|
|
||||||
import rasterio as rio
|
|
||||||
from rasterio.enums import Resampling
|
|
||||||
from rasterio.warp import reproject
|
|
||||||
from osgeo import osr
|
|
||||||
|
|
||||||
# Attempt to import OmniCloudMask and set a flag
|
|
||||||
try:
|
|
||||||
from omnicloudmask import predict_from_array, load_multiband
|
|
||||||
HAS_OCM = True
|
|
||||||
except ImportError:
|
|
||||||
HAS_OCM = False
|
|
||||||
|
|
||||||
def calculate_utm_zone_and_hemisphere(longitude, latitude):
|
|
||||||
"""
|
|
||||||
Calculate the UTM zone and hemisphere based on longitude and latitude.
|
|
||||||
"""
|
|
||||||
utm_zone = int((longitude + 180) / 6) + 1
|
|
||||||
is_southern = latitude < 0
|
|
||||||
return utm_zone, is_southern
|
|
||||||
|
|
||||||
def reproject_to_projected_crs(input_path, output_path):
|
|
||||||
"""
|
|
||||||
Reprojects a raster to a projected coordinate system (e.g., UTM).
|
|
||||||
"""
|
|
||||||
input_ds = gdal.Open(str(input_path))
|
|
||||||
if not input_ds:
|
|
||||||
raise ValueError(f"Failed to open input raster: {input_path}")
|
|
||||||
|
|
||||||
# Get the source spatial reference
|
|
||||||
source_srs = osr.SpatialReference()
|
|
||||||
source_srs.ImportFromWkt(input_ds.GetProjection())
|
|
||||||
|
|
||||||
# Get the geographic coordinates of the image's center
|
|
||||||
geo_transform = input_ds.GetGeoTransform()
|
|
||||||
width = input_ds.RasterXSize
|
|
||||||
height = input_ds.RasterYSize
|
|
||||||
center_x = geo_transform[0] + (width / 2) * geo_transform[1]
|
|
||||||
center_y = geo_transform[3] + (height / 2) * geo_transform[5]
|
|
||||||
|
|
||||||
# Calculate the UTM zone and hemisphere dynamically
|
|
||||||
utm_zone, is_southern = calculate_utm_zone_and_hemisphere(center_x, center_y)
|
|
||||||
|
|
||||||
# Define the target spatial reference
|
|
||||||
target_srs = osr.SpatialReference()
|
|
||||||
target_srs.SetWellKnownGeogCS("WGS84")
|
|
||||||
target_srs.SetUTM(utm_zone, is_southern)
|
|
||||||
|
|
||||||
# Create the warp options
|
|
||||||
warp_options = gdal.WarpOptions(
|
|
||||||
dstSRS=target_srs.ExportToWkt(),
|
|
||||||
format="GTiff"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Perform the reprojection
|
|
||||||
gdal.Warp(str(output_path), input_ds, options=warp_options)
|
|
||||||
input_ds = None # Close the dataset
|
|
||||||
print(f"Reprojected raster saved to: {output_path}")
|
|
||||||
return output_path
|
|
||||||
|
|
||||||
def resample_image(input_path, output_path, resolution=(10, 10), resample_alg="bilinear"):
|
|
||||||
"""
|
|
||||||
Resamples a raster to a specified resolution using gdal.Translate.
|
|
||||||
"""
|
|
||||||
print(f"Resampling {input_path} to {resolution}m resolution -> {output_path}")
|
|
||||||
|
|
||||||
# Reproject the input image to a projected CRS
|
|
||||||
reprojected_path = str(Path(output_path).with_name(f"{Path(output_path).stem}_reprojected.tif"))
|
|
||||||
reproject_to_projected_crs(input_path, reprojected_path)
|
|
||||||
|
|
||||||
# Open the reprojected dataset
|
|
||||||
input_ds = gdal.Open(reprojected_path)
|
|
||||||
if not input_ds:
|
|
||||||
raise ValueError(f"Failed to open reprojected raster: {reprojected_path}")
|
|
||||||
|
|
||||||
# Perform the resampling
|
|
||||||
result = gdal.Translate(
|
|
||||||
str(output_path),
|
|
||||||
input_ds,
|
|
||||||
xRes=resolution[0],
|
|
||||||
yRes=resolution[1],
|
|
||||||
resampleAlg=resample_alg
|
|
||||||
)
|
|
||||||
input_ds = None # Explicitly dereference the GDAL dataset
|
|
||||||
if result is None:
|
|
||||||
raise ValueError(f"Failed to resample image to {output_path}")
|
|
||||||
print(f"Successfully resampled image saved to: {output_path}")
|
|
||||||
return output_path
|
|
||||||
|
|
||||||
def run_ocm_on_image(image_path_10m, ocm_output_dir, save_mask=True):
|
|
||||||
"""
|
|
||||||
Processes a 10m resolution image with OmniCloudMask.
|
|
||||||
Adapted from process_with_ocm in the notebook.
|
|
||||||
"""
|
|
||||||
if not HAS_OCM:
|
|
||||||
print("OmniCloudMask not available. Please install with: pip install omnicloudmask")
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
image_path_10m = Path(image_path_10m)
|
|
||||||
ocm_output_dir = Path(ocm_output_dir)
|
|
||||||
ocm_output_dir.mkdir(exist_ok=True, parents=True)
|
|
||||||
|
|
||||||
mask_10m_path = ocm_output_dir / f"{image_path_10m.stem}_ocm_mask_10m.tif"
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Open the image to check dimensions
|
|
||||||
with rio.open(image_path_10m) as src:
|
|
||||||
width, height = src.width, src.height
|
|
||||||
|
|
||||||
# Check if the image is too small for OmniCloudMask
|
|
||||||
if width < 50 or height < 50:
|
|
||||||
print(f"Warning: Image {image_path_10m} is too small for OmniCloudMask (width: {width}, height: {height}). Skipping.")
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
# PlanetScope 4-band images are typically [B,G,R,NIR]
|
|
||||||
# OCM expects [R,G,NIR] for its default model.
|
|
||||||
# Band numbers for load_multiband are 1-based.
|
|
||||||
# If original is B(1),G(2),R(3),NIR(4), then R=3, G=2, NIR=4
|
|
||||||
band_order = [3, 2, 4]
|
|
||||||
|
|
||||||
print(f"Loading 10m image for OCM: {image_path_10m}")
|
|
||||||
# load_multiband resamples if resample_res is different from source,
|
|
||||||
# but here image_path_10m is already 10m.
|
|
||||||
# We pass resample_res=None to use the image's own resolution.
|
|
||||||
rgn_data, profile = load_multiband(
|
|
||||||
input_path=str(image_path_10m),
|
|
||||||
resample_res=10, # Explicitly set target resolution for OCM
|
|
||||||
band_order=band_order
|
|
||||||
)
|
|
||||||
|
|
||||||
print("Applying OmniCloudMask...")
|
|
||||||
prediction = predict_from_array(rgn_data)
|
|
||||||
|
|
||||||
if save_mask:
|
|
||||||
profile.update(count=1, dtype='uint8')
|
|
||||||
with rio.open(mask_10m_path, 'w', **profile) as dst:
|
|
||||||
dst.write(prediction.astype('uint8'), 1)
|
|
||||||
print(f"Saved 10m OCM mask to: {mask_10m_path}")
|
|
||||||
|
|
||||||
# Summary (optional, can be removed for cleaner script output)
|
|
||||||
n_total = prediction.size
|
|
||||||
n_clear = np.sum(prediction == 0)
|
|
||||||
n_thick = np.sum(prediction == 1)
|
|
||||||
n_thin = np.sum(prediction == 2)
|
|
||||||
n_shadow = np.sum(prediction == 3)
|
|
||||||
print(f" OCM: Clear: {100*n_clear/n_total:.1f}%, Thick: {100*n_thick/n_total:.1f}%, Thin: {100*n_thin/n_total:.1f}%, Shadow: {100*n_shadow/n_total:.1f}%")
|
|
||||||
|
|
||||||
return str(mask_10m_path), profile
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error processing 10m image with OmniCloudMask: {str(e)}")
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
|
|
||||||
def upsample_mask_to_3m(mask_10m_path, target_3m_image_path, output_3m_mask_path):
|
|
||||||
"""
|
|
||||||
Upsamples a 10m OCM mask to match the 3m target image.
|
|
||||||
Adapted from upsample_mask_to_highres in the notebook.
|
|
||||||
"""
|
|
||||||
print(f"Upsampling 10m mask {mask_10m_path} to 3m, referencing {target_3m_image_path}")
|
|
||||||
with rio.open(mask_10m_path) as src_mask, rio.open(target_3m_image_path) as src_img_3m:
|
|
||||||
mask_data_10m = src_mask.read(1)
|
|
||||||
|
|
||||||
img_shape_3m = (src_img_3m.height, src_img_3m.width)
|
|
||||||
img_transform_3m = src_img_3m.transform
|
|
||||||
img_crs_3m = src_img_3m.crs
|
|
||||||
|
|
||||||
upsampled_mask_3m_data = np.zeros(img_shape_3m, dtype=mask_data_10m.dtype)
|
|
||||||
|
|
||||||
reproject(
|
|
||||||
source=mask_data_10m,
|
|
||||||
destination=upsampled_mask_3m_data,
|
|
||||||
src_transform=src_mask.transform,
|
|
||||||
src_crs=src_mask.crs,
|
|
||||||
dst_transform=img_transform_3m,
|
|
||||||
dst_crs=img_crs_3m,
|
|
||||||
resampling=Resampling.nearest
|
|
||||||
)
|
|
||||||
|
|
||||||
profile_3m_mask = src_img_3m.profile.copy()
|
|
||||||
profile_3m_mask.update({
|
|
||||||
'count': 1,
|
|
||||||
'dtype': upsampled_mask_3m_data.dtype
|
|
||||||
})
|
|
||||||
|
|
||||||
with rio.open(output_3m_mask_path, 'w', **profile_3m_mask) as dst:
|
|
||||||
dst.write(upsampled_mask_3m_data, 1)
|
|
||||||
print(f"Upsampled 3m OCM mask saved to: {output_3m_mask_path}")
|
|
||||||
return str(output_3m_mask_path)
|
|
||||||
|
|
||||||
|
|
||||||
def apply_3m_mask_to_3m_image(image_3m_path, mask_3m_path, final_masked_output_path):
|
|
||||||
"""
|
|
||||||
Applies an upsampled 3m OCM mask to the original 3m image.
|
|
||||||
Adapted from apply_upsampled_mask_to_highres in the notebook.
|
|
||||||
"""
|
|
||||||
print(f"Applying 3m mask {mask_3m_path} to 3m image {image_3m_path}")
|
|
||||||
image_3m_path = Path(image_3m_path)
|
|
||||||
mask_3m_path = Path(mask_3m_path)
|
|
||||||
final_masked_output_path = Path(final_masked_output_path)
|
|
||||||
final_masked_output_path.parent.mkdir(parents=True, exist_ok=True)
|
|
||||||
|
|
||||||
try:
|
|
||||||
with rio.open(image_3m_path) as src_img_3m, rio.open(mask_3m_path) as src_mask_3m:
|
|
||||||
img_data_3m = src_img_3m.read()
|
|
||||||
img_profile_3m = src_img_3m.profile.copy()
|
|
||||||
mask_data_3m = src_mask_3m.read(1)
|
|
||||||
|
|
||||||
if img_data_3m.shape[1:] != mask_data_3m.shape:
|
|
||||||
print(f"Warning: 3m image shape {img_data_3m.shape[1:]} and 3m mask shape {mask_data_3m.shape} do not match.")
|
|
||||||
# This should ideally not happen if upsampling was correct.
|
|
||||||
|
|
||||||
# OCM: 0=clear, 1=thick cloud, 2=thin cloud, 3=shadow
|
|
||||||
# We want to mask out (set to nodata) pixels where OCM is > 0
|
|
||||||
binary_mask = np.ones_like(mask_data_3m, dtype=np.uint8)
|
|
||||||
binary_mask[mask_data_3m > 0] = 0 # 0 for cloud/shadow, 1 for clear
|
|
||||||
|
|
||||||
masked_img_data_3m = img_data_3m.copy()
|
|
||||||
nodata_val = img_profile_3m.get('nodata', 0) # Use existing nodata or 0
|
|
||||||
|
|
||||||
for i in range(img_profile_3m['count']):
|
|
||||||
masked_img_data_3m[i][binary_mask == 0] = nodata_val
|
|
||||||
|
|
||||||
# Ensure dtype of profile matches data to be written
|
|
||||||
# If original image was float, but nodata is int (0), rasterio might complain
|
|
||||||
# It's safer to use the original image's dtype for the output.
|
|
||||||
img_profile_3m.update(dtype=img_data_3m.dtype)
|
|
||||||
|
|
||||||
with rio.open(final_masked_output_path, 'w', **img_profile_3m) as dst:
|
|
||||||
dst.write(masked_img_data_3m)
|
|
||||||
|
|
||||||
print(f"Final masked 3m image saved to: {final_masked_output_path}")
|
|
||||||
return str(final_masked_output_path)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error applying 3m mask to 3m image: {str(e)}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
parser = argparse.ArgumentParser(description="Process PlanetScope 3m imagery with OmniCloudMask.")
|
|
||||||
parser.add_argument("input_3m_image", type=str, help="Path to the input merged 3m PlanetScope GeoTIFF image.")
|
|
||||||
parser.add_argument("output_dir", type=str, help="Directory to save processed files (10m image, masks, final 3m masked image).")
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Resolve paths to absolute paths immediately
|
|
||||||
input_3m_path = Path(args.input_3m_image).resolve(strict=True)
|
|
||||||
# output_base_dir is the directory where outputs will be saved.
|
|
||||||
# It should exist when the script is called (created by the notebook).
|
|
||||||
output_base_dir = Path(args.output_dir).resolve(strict=True)
|
|
||||||
except FileNotFoundError as e:
|
|
||||||
print(f"Error: Path resolution failed. Input image or output base directory may not exist or is not accessible: {e}")
|
|
||||||
return
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error resolving paths: {e}")
|
|
||||||
return
|
|
||||||
|
|
||||||
# The check for input_3m_path.exists() is now covered by resolve(strict=True)
|
|
||||||
|
|
||||||
# Define intermediate and final file paths using absolute base paths
|
|
||||||
intermediate_dir = output_base_dir / "intermediate_ocm_files"
|
|
||||||
intermediate_dir.mkdir(parents=True, exist_ok=True)
|
|
||||||
|
|
||||||
image_10m_path = intermediate_dir / f"{input_3m_path.stem}_10m.tif"
|
|
||||||
# OCM mask (10m) will be saved inside run_ocm_on_image, in a subdir of intermediate_dir
|
|
||||||
ocm_mask_output_dir = intermediate_dir / "ocm_10m_mask_output"
|
|
||||||
|
|
||||||
# Upsampled OCM mask (3m)
|
|
||||||
mask_3m_upsampled_path = intermediate_dir / f"{input_3m_path.stem}_ocm_mask_3m_upsampled.tif"
|
|
||||||
|
|
||||||
# Final masked image (3m)
|
|
||||||
final_masked_3m_path = output_base_dir / f"{input_3m_path.stem}_ocm_masked_3m.tif"
|
|
||||||
|
|
||||||
print(f"--- Starting OCM processing for {input_3m_path.name} ---")
|
|
||||||
print(f"Input 3m image (absolute): {input_3m_path}")
|
|
||||||
print(f"Output base directory (absolute): {output_base_dir}")
|
|
||||||
print(f"Intermediate 10m image path: {image_10m_path}")
|
|
||||||
|
|
||||||
# 1. Resample 3m input to 10m for OCM
|
|
||||||
try:
|
|
||||||
resample_image(input_3m_path, image_10m_path, resolution=(10, 10))
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to resample to 10m: {e}")
|
|
||||||
return
|
|
||||||
|
|
||||||
# 2. Run OCM on the 10m image
|
|
||||||
mask_10m_generated_path, _ = run_ocm_on_image(image_10m_path, ocm_mask_output_dir)
|
|
||||||
if not mask_10m_generated_path:
|
|
||||||
print("OCM processing failed. Exiting.")
|
|
||||||
return
|
|
||||||
|
|
||||||
# 3. Upsample the 10m OCM mask to 3m
|
|
||||||
try:
|
|
||||||
upsample_mask_to_3m(mask_10m_generated_path, input_3m_path, mask_3m_upsampled_path)
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to upsample 10m OCM mask to 3m: {e}")
|
|
||||||
return
|
|
||||||
|
|
||||||
# 4. Apply the 3m upsampled mask to the original 3m image
|
|
||||||
try:
|
|
||||||
apply_3m_mask_to_3m_image(input_3m_path, mask_3m_upsampled_path, final_masked_3m_path)
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to apply 3m mask to 3m image: {e}")
|
|
||||||
return
|
|
||||||
|
|
||||||
print(f"--- Successfully completed OCM processing for {input_3m_path.name} ---")
|
|
||||||
print(f"Final 3m masked output: {final_masked_3m_path}")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
if not HAS_OCM:
|
|
||||||
print("OmniCloudMask library is not installed. Please install it to run this script.")
|
|
||||||
print("You can typically install it using: pip install omnicloudmask")
|
|
||||||
else:
|
|
||||||
main()
|
|
||||||
|
|
@ -1,269 +0,0 @@
|
||||||
"""
|
|
||||||
Simple OmniCloudMask test script for PlanetScope imagery
|
|
||||||
Based on: https://dpird-dma.github.io/blog/Cloud-Masking-for-PlanetScope-Imagery-Using-OmniCloudMask/
|
|
||||||
|
|
||||||
Tests OmniCloudMask on 2024-12-30 ESA image
|
|
||||||
"""
|
|
||||||
|
|
||||||
from omnicloudmask import predict_from_array, load_multiband
|
|
||||||
from functools import partial
|
|
||||||
from pathlib import Path
|
|
||||||
import rasterio as rio
|
|
||||||
import numpy as np
|
|
||||||
import geopandas as gpd
|
|
||||||
from rasterio.features import rasterize
|
|
||||||
from rasterio.transform import Affine
|
|
||||||
|
|
||||||
print("="*70)
|
|
||||||
print("OMNICLOUDMASK TEST - ESA PROJECT")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
|
|
||||||
# Configuration
|
|
||||||
project = 'esa'
|
|
||||||
test_date = '2024-12-03'
|
|
||||||
|
|
||||||
# Get absolute path to the project root (go up one level from python_app/)
|
|
||||||
project_root = Path(__file__).resolve().parent.parent
|
|
||||||
planetscope_image = project_root / "laravel_app" / "storage" / "app" / project / "cloud_test_merged_tif" / f"{test_date}.tif"
|
|
||||||
geojson_path = project_root / "laravel_app" / "storage" / "app" / project / "Data" / "pivot_2.geojson"
|
|
||||||
output_dir = project_root / "laravel_app" / "storage" / "app" / project / "omnicloudmask_results"
|
|
||||||
output_dir.mkdir(exist_ok=True, parents=True)
|
|
||||||
|
|
||||||
print(f"\nInput image: {planetscope_image}")
|
|
||||||
print(f"Field boundaries: {geojson_path}")
|
|
||||||
print(f"Output directory: {output_dir}")
|
|
||||||
|
|
||||||
# Check files exist
|
|
||||||
if not planetscope_image.exists():
|
|
||||||
print(f"\n❌ ERROR: Image not found: {planetscope_image}")
|
|
||||||
exit(1)
|
|
||||||
|
|
||||||
if not geojson_path.exists():
|
|
||||||
print(f"\n⚠️ WARNING: GeoJSON not found: {geojson_path}")
|
|
||||||
print(" Will process without field mask")
|
|
||||||
use_field_mask = False
|
|
||||||
else:
|
|
||||||
use_field_mask = True
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("STEP 1: Load PlanetScope Image")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
# First, check the image metadata
|
|
||||||
with rio.open(str(planetscope_image)) as src:
|
|
||||||
print(f"\nOriginal image info:")
|
|
||||||
print(f" Bands: {src.count}")
|
|
||||||
print(f" Size: {src.height} x {src.width}")
|
|
||||||
print(f" CRS: {src.crs}")
|
|
||||||
print(f" Bounds: {src.bounds}")
|
|
||||||
|
|
||||||
# PlanetScope 4-band order: Blue(1), Green(2), Red(3), NIR(4)
|
|
||||||
# OmniCloudMask needs: Red, Green, NIR
|
|
||||||
band_order = [3, 2, 4] # Red, Green, NIR
|
|
||||||
|
|
||||||
print(f"\nLoading bands in order: Red(3), Green(2), NIR(4)")
|
|
||||||
print(f"Note: Skipping resampling to preserve image data...")
|
|
||||||
|
|
||||||
# Load without resampling to avoid issues with EPSG:4326
|
|
||||||
try:
|
|
||||||
with rio.open(str(planetscope_image)) as src:
|
|
||||||
# Read the required bands (1-indexed for rasterio)
|
|
||||||
red = src.read(3)
|
|
||||||
green = src.read(2)
|
|
||||||
nir = src.read(4)
|
|
||||||
|
|
||||||
# Stack into array (bands, height, width)
|
|
||||||
rgn_data = np.stack([red, green, nir])
|
|
||||||
|
|
||||||
# Get profile for later use
|
|
||||||
profile = src.profile.copy()
|
|
||||||
profile.update(count=1) # We'll save single-band output
|
|
||||||
|
|
||||||
print(f"✓ Image loaded successfully")
|
|
||||||
print(f" Shape: {rgn_data.shape} (bands, height, width)")
|
|
||||||
print(f" Data type: {rgn_data.dtype}")
|
|
||||||
|
|
||||||
# Check if data is valid
|
|
||||||
if rgn_data.size == 0:
|
|
||||||
print(f"❌ ERROR: Image has no data!")
|
|
||||||
exit(1)
|
|
||||||
|
|
||||||
print(f" Value range: {rgn_data.min():.6f} to {rgn_data.max():.6f}")
|
|
||||||
|
|
||||||
# Check each band
|
|
||||||
print(f"\n Band statistics:")
|
|
||||||
print(f" Red (band 0): min={rgn_data[0].min():.6f}, max={rgn_data[0].max():.6f}, mean={rgn_data[0].mean():.6f}")
|
|
||||||
print(f" Green (band 1): min={rgn_data[1].min():.6f}, max={rgn_data[1].max():.6f}, mean={rgn_data[1].mean():.6f}")
|
|
||||||
print(f" NIR (band 2): min={rgn_data[2].min():.6f}, max={rgn_data[2].max():.6f}, mean={rgn_data[2].mean():.6f}")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"❌ ERROR loading image: {e}")
|
|
||||||
import traceback
|
|
||||||
traceback.print_exc()
|
|
||||||
exit(1)
|
|
||||||
|
|
||||||
# Optional: Apply field mask
|
|
||||||
if use_field_mask:
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("STEP 2: Apply Field Mask (Optional)")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Load field boundaries
|
|
||||||
fields_gdf = gpd.read_file(str(geojson_path))
|
|
||||||
print(f"✓ Loaded {len(fields_gdf)} field polygons")
|
|
||||||
|
|
||||||
# Create field mask
|
|
||||||
# profile['transform'] is already an Affine object from rasterio
|
|
||||||
transform = profile['transform']
|
|
||||||
field_mask = rasterize(
|
|
||||||
[(geom, 1) for geom in fields_gdf.geometry],
|
|
||||||
out_shape=(rgn_data.shape[1], rgn_data.shape[2]),
|
|
||||||
transform=transform,
|
|
||||||
fill=0,
|
|
||||||
dtype=np.uint8
|
|
||||||
)
|
|
||||||
|
|
||||||
field_pixels = np.sum(field_mask == 1)
|
|
||||||
total_pixels = field_mask.size
|
|
||||||
print(f"✓ Field mask created")
|
|
||||||
print(f" Field pixels: {field_pixels:,} ({field_pixels/total_pixels*100:.1f}%)")
|
|
||||||
print(f" Non-field pixels: {total_pixels - field_pixels:,}")
|
|
||||||
|
|
||||||
# Apply mask - set non-field pixels to 0
|
|
||||||
rgn_data_masked = rgn_data.copy()
|
|
||||||
for i in range(3): # For each band
|
|
||||||
rgn_data_masked[i][field_mask == 0] = 0
|
|
||||||
|
|
||||||
print(f"\n Masked data statistics (field pixels only):")
|
|
||||||
field_data = field_mask == 1
|
|
||||||
print(f" Red: {rgn_data_masked[0][field_data].min():.6f} to {rgn_data_masked[0][field_data].max():.6f} (mean: {rgn_data_masked[0][field_data].mean():.6f})")
|
|
||||||
print(f" Green: {rgn_data_masked[1][field_data].min():.6f} to {rgn_data_masked[1][field_data].max():.6f} (mean: {rgn_data_masked[1][field_data].mean():.6f})")
|
|
||||||
print(f" NIR: {rgn_data_masked[2][field_data].min():.6f} to {rgn_data_masked[2][field_data].max():.6f} (mean: {rgn_data_masked[2][field_data].mean():.6f})")
|
|
||||||
|
|
||||||
# Use masked data
|
|
||||||
rgn_data_to_process = rgn_data_masked
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"⚠️ WARNING: Could not apply field mask: {e}")
|
|
||||||
print(" Proceeding without field mask...")
|
|
||||||
use_field_mask = False
|
|
||||||
rgn_data_to_process = rgn_data
|
|
||||||
field_mask = None
|
|
||||||
else:
|
|
||||||
rgn_data_to_process = rgn_data
|
|
||||||
field_mask = None
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("STEP 3: Run OmniCloudMask")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
print(f"\nRunning OmniCloudMask inference...")
|
|
||||||
print(f"⏳ This may take a few minutes (especially on CPU)...")
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Generate cloud and shadow mask
|
|
||||||
prediction = predict_from_array(
|
|
||||||
rgn_data_to_process,
|
|
||||||
no_data_value=0 if use_field_mask else None,
|
|
||||||
apply_no_data_mask=use_field_mask
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"✓ OmniCloudMask inference complete!")
|
|
||||||
print(f" Prediction shape: {prediction.shape}")
|
|
||||||
print(f" Unique values: {np.unique(prediction)}")
|
|
||||||
print(f" 0 = Clear, 1 = Thick Cloud, 2 = Thin Cloud, 3 = Shadow")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"❌ ERROR during inference: {e}")
|
|
||||||
import traceback
|
|
||||||
traceback.print_exc()
|
|
||||||
exit(1)
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("STEP 4: Calculate Statistics")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
# Get classification from prediction (remove batch dimension if present)
|
|
||||||
if prediction.ndim == 3:
|
|
||||||
classification = prediction[0]
|
|
||||||
else:
|
|
||||||
classification = prediction
|
|
||||||
|
|
||||||
# Calculate statistics
|
|
||||||
if use_field_mask and field_mask is not None:
|
|
||||||
# Stats for field pixels only
|
|
||||||
field_pixels_mask = field_mask == 1
|
|
||||||
total_pixels = np.sum(field_pixels_mask)
|
|
||||||
|
|
||||||
clear_pixels = np.sum(classification[field_pixels_mask] == 0)
|
|
||||||
thick_cloud_pixels = np.sum(classification[field_pixels_mask] == 1)
|
|
||||||
thin_cloud_pixels = np.sum(classification[field_pixels_mask] == 2)
|
|
||||||
shadow_pixels = np.sum(classification[field_pixels_mask] == 3)
|
|
||||||
|
|
||||||
print(f"\n✅ Results for FIELD AREAS ONLY ({total_pixels:,} pixels):")
|
|
||||||
else:
|
|
||||||
# Stats for all pixels
|
|
||||||
total_pixels = classification.size
|
|
||||||
|
|
||||||
clear_pixels = np.sum(classification == 0)
|
|
||||||
thick_cloud_pixels = np.sum(classification == 1)
|
|
||||||
thin_cloud_pixels = np.sum(classification == 2)
|
|
||||||
shadow_pixels = np.sum(classification == 3)
|
|
||||||
|
|
||||||
print(f"\n✅ Results for ALL PIXELS ({total_pixels:,} pixels):")
|
|
||||||
|
|
||||||
print(f" Clear: {clear_pixels:>10,} ({clear_pixels/total_pixels*100:>5.1f}%)")
|
|
||||||
print(f" Thick Cloud: {thick_cloud_pixels:>10,} ({thick_cloud_pixels/total_pixels*100:>5.1f}%)")
|
|
||||||
print(f" Thin Cloud: {thin_cloud_pixels:>10,} ({thin_cloud_pixels/total_pixels*100:>5.1f}%)")
|
|
||||||
print(f" Shadow: {shadow_pixels:>10,} ({shadow_pixels/total_pixels*100:>5.1f}%)")
|
|
||||||
|
|
||||||
cloud_pixels = thick_cloud_pixels + thin_cloud_pixels
|
|
||||||
print(f"\n Total Clouds: {cloud_pixels:>9,} ({cloud_pixels/total_pixels*100:>5.1f}%)")
|
|
||||||
print(f" Total Unusable: {cloud_pixels + shadow_pixels:>7,} ({(cloud_pixels + shadow_pixels)/total_pixels*100:>5.1f}%)")
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("STEP 5: Save Results")
|
|
||||||
print("="*70)
|
|
||||||
|
|
||||||
# Save the cloud mask result
|
|
||||||
output_file = output_dir / f"omnicloudmask_{test_date}.tif"
|
|
||||||
|
|
||||||
try:
|
|
||||||
profile.update(count=1, dtype='uint8')
|
|
||||||
with rio.open(str(output_file), 'w', **profile) as dst:
|
|
||||||
dst.write(prediction.astype('uint8'))
|
|
||||||
|
|
||||||
print(f"✓ Cloud mask saved: {output_file}")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"❌ ERROR saving result: {e}")
|
|
||||||
import traceback
|
|
||||||
traceback.print_exc()
|
|
||||||
|
|
||||||
# Also save a human-readable summary
|
|
||||||
summary_file = output_dir / f"omnicloudmask_{test_date}_summary.txt"
|
|
||||||
with open(summary_file, 'w') as f:
|
|
||||||
f.write(f"OmniCloudMask Results for {test_date}\n")
|
|
||||||
f.write(f"="*50 + "\n\n")
|
|
||||||
f.write(f"Input: {planetscope_image}\n")
|
|
||||||
f.write(f"Field mask applied: {use_field_mask}\n\n")
|
|
||||||
f.write(f"Classification Results:\n")
|
|
||||||
f.write(f" Total pixels analyzed: {total_pixels:,}\n")
|
|
||||||
f.write(f" Clear: {clear_pixels:>10,} ({clear_pixels/total_pixels*100:>5.1f}%)\n")
|
|
||||||
f.write(f" Thick Cloud: {thick_cloud_pixels:>10,} ({thick_cloud_pixels/total_pixels*100:>5.1f}%)\n")
|
|
||||||
f.write(f" Thin Cloud: {thin_cloud_pixels:>10,} ({thin_cloud_pixels/total_pixels*100:>5.1f}%)\n")
|
|
||||||
f.write(f" Shadow: {shadow_pixels:>10,} ({shadow_pixels/total_pixels*100:>5.1f}%)\n")
|
|
||||||
f.write(f"\n Total Unusable: {cloud_pixels + shadow_pixels:>7,} ({(cloud_pixels + shadow_pixels)/total_pixels*100:>5.1f}%)\n")
|
|
||||||
|
|
||||||
print(f"✓ Summary saved: {summary_file}")
|
|
||||||
|
|
||||||
print("\n" + "="*70)
|
|
||||||
print("✅ COMPLETE!")
|
|
||||||
print("="*70)
|
|
||||||
print(f"\nOutputs:")
|
|
||||||
print(f" Cloud mask: {output_file}")
|
|
||||||
print(f" Summary: {summary_file}")
|
|
||||||
print(f"\nYou can open the cloud mask in QGIS or other GIS software.")
|
|
||||||
print(f"Values: 0=Clear, 1=Thick Cloud, 2=Thin Cloud, 3=Shadow")
|
|
||||||
|
|
@ -1,151 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Setup Script for SAR Download Environment
|
|
||||||
=========================================
|
|
||||||
|
|
||||||
This script helps set up the Python environment for SAR data download.
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
python setup_sar_environment.py
|
|
||||||
|
|
||||||
The script will:
|
|
||||||
1. Check Python version
|
|
||||||
2. Install required packages
|
|
||||||
3. Test SentinelHub connection
|
|
||||||
4. Create necessary directories
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import sys
|
|
||||||
import subprocess
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
def check_python_version():
|
|
||||||
"""Check if Python version is compatible"""
|
|
||||||
version = sys.version_info
|
|
||||||
if version.major != 3 or version.minor < 8:
|
|
||||||
print(f"Error: Python 3.8+ required, found {version.major}.{version.minor}")
|
|
||||||
return False
|
|
||||||
print(f"✓ Python {version.major}.{version.minor}.{version.micro} is compatible")
|
|
||||||
return True
|
|
||||||
|
|
||||||
def install_requirements():
|
|
||||||
"""Install required packages"""
|
|
||||||
requirements_file = "requirements_sar.txt"
|
|
||||||
|
|
||||||
if not os.path.exists(requirements_file):
|
|
||||||
print(f"Error: {requirements_file} not found")
|
|
||||||
return False
|
|
||||||
|
|
||||||
print("Installing required packages...")
|
|
||||||
try:
|
|
||||||
subprocess.check_call([
|
|
||||||
sys.executable, "-m", "pip", "install", "-r", requirements_file
|
|
||||||
])
|
|
||||||
print("✓ Packages installed successfully")
|
|
||||||
return True
|
|
||||||
except subprocess.CalledProcessError as e:
|
|
||||||
print(f"Error installing packages: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def create_directories():
|
|
||||||
"""Create necessary directory structure"""
|
|
||||||
directories = [
|
|
||||||
"data/aura/weekly_SAR_mosaic",
|
|
||||||
"data/aura/field_boundaries",
|
|
||||||
"output/sar_analysis"
|
|
||||||
]
|
|
||||||
|
|
||||||
for directory in directories:
|
|
||||||
Path(directory).mkdir(parents=True, exist_ok=True)
|
|
||||||
print(f"✓ Created directory: {directory}")
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
def test_imports():
|
|
||||||
"""Test if all required packages can be imported"""
|
|
||||||
packages = [
|
|
||||||
"sentinelhub",
|
|
||||||
"geopandas",
|
|
||||||
"rasterio",
|
|
||||||
"numpy",
|
|
||||||
"scipy"
|
|
||||||
]
|
|
||||||
|
|
||||||
print("Testing package imports...")
|
|
||||||
failed_imports = []
|
|
||||||
|
|
||||||
for package in packages:
|
|
||||||
try:
|
|
||||||
__import__(package)
|
|
||||||
print(f"✓ {package}")
|
|
||||||
except ImportError as e:
|
|
||||||
print(f"✗ {package}: {e}")
|
|
||||||
failed_imports.append(package)
|
|
||||||
|
|
||||||
if failed_imports:
|
|
||||||
print(f"\nFailed to import: {', '.join(failed_imports)}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
print("✓ All packages imported successfully")
|
|
||||||
return True
|
|
||||||
|
|
||||||
def check_sentinelhub_config():
|
|
||||||
"""Check SentinelHub configuration"""
|
|
||||||
try:
|
|
||||||
from sentinelhub import SHConfig
|
|
||||||
config = SHConfig()
|
|
||||||
|
|
||||||
print("\nSentinelHub Configuration Check:")
|
|
||||||
print(f"Instance ID: {'Set' if config.instance_id else 'Not set'}")
|
|
||||||
print(f"Client ID: {'Set' if config.sh_client_id else 'Not set'}")
|
|
||||||
print(f"Client Secret: {'Set' if config.sh_client_secret else 'Not set'}")
|
|
||||||
|
|
||||||
if not config.sh_client_id or not config.sh_client_secret:
|
|
||||||
print("\n⚠️ SentinelHub credentials not configured")
|
|
||||||
print("You'll need to set these up when running the download script")
|
|
||||||
print("Get credentials from: https://apps.sentinel-hub.com/")
|
|
||||||
else:
|
|
||||||
print("✓ SentinelHub credentials are configured")
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error checking SentinelHub config: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def main():
|
|
||||||
"""Main setup function"""
|
|
||||||
print("=== SAR Download Environment Setup ===\n")
|
|
||||||
|
|
||||||
# Check Python version
|
|
||||||
if not check_python_version():
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Install requirements
|
|
||||||
if not install_requirements():
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Create directories
|
|
||||||
if not create_directories():
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Test imports
|
|
||||||
if not test_imports():
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Check SentinelHub config
|
|
||||||
check_sentinelhub_config()
|
|
||||||
|
|
||||||
print("\n=== Setup Complete! ===")
|
|
||||||
print("\nNext steps:")
|
|
||||||
print("1. Get SentinelHub credentials from https://apps.sentinel-hub.com/")
|
|
||||||
print("2. Place your field boundaries file (geojson) in data/aura/field_boundaries/")
|
|
||||||
print("3. Run: python download_s1_aura.py")
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
success = main()
|
|
||||||
sys.exit(0 if success else 1)
|
|
||||||
|
|
@ -1,145 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Quick Test Script for SAR Download
|
|
||||||
==================================
|
|
||||||
|
|
||||||
This is a simplified test version to verify the setup works before running the full download.
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
python test_sar_download.py
|
|
||||||
|
|
||||||
This will:
|
|
||||||
1. Test SentinelHub connection
|
|
||||||
2. Load field boundaries
|
|
||||||
3. Download 1 week of SAR data for testing
|
|
||||||
4. Save to test directory
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import sys
|
|
||||||
from pathlib import Path
|
|
||||||
import logging
|
|
||||||
|
|
||||||
# Import our main downloader
|
|
||||||
from download_s1_aura import SARDownloader
|
|
||||||
|
|
||||||
# Configure logging
|
|
||||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
def test_connection():
|
|
||||||
"""Test SentinelHub connection and credentials"""
|
|
||||||
try:
|
|
||||||
from sentinelhub import SHConfig
|
|
||||||
|
|
||||||
config = SHConfig()
|
|
||||||
config.sh_client_id = '1a72d811-4f0e-4447-8282-df09608cff44'
|
|
||||||
config.sh_client_secret = 'FcBlRL29i9ZmTzhmKTv1etSMFs5PxSos'
|
|
||||||
|
|
||||||
logger.info("OK - SentinelHub credentials configured")
|
|
||||||
logger.info(f"OK - Client ID: {config.sh_client_id[:8]}...")
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"✗ Connection test failed: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def test_field_boundaries():
|
|
||||||
"""Test loading field boundaries"""
|
|
||||||
try:
|
|
||||||
import geopandas as gpd
|
|
||||||
|
|
||||||
# Try to load the pivot.geojson file
|
|
||||||
geojson_path = "pivot.geojson"
|
|
||||||
if not os.path.exists(geojson_path):
|
|
||||||
geojson_path = "../pivot.geojson"
|
|
||||||
|
|
||||||
if os.path.exists(geojson_path):
|
|
||||||
gdf = gpd.read_file(geojson_path)
|
|
||||||
bounds = gdf.total_bounds
|
|
||||||
|
|
||||||
logger.info(f"OK - Field boundaries loaded: {geojson_path}")
|
|
||||||
logger.info(f"OK - {len(gdf)} fields found")
|
|
||||||
logger.info(f"OK - Bounds: {bounds}")
|
|
||||||
|
|
||||||
return True, gdf
|
|
||||||
else:
|
|
||||||
logger.error("✗ Could not find pivot.geojson file")
|
|
||||||
return False, None
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"✗ Field boundary test failed: {e}")
|
|
||||||
return False, None
|
|
||||||
|
|
||||||
def test_quick_download():
|
|
||||||
"""Download 1 week of SAR data for testing"""
|
|
||||||
try:
|
|
||||||
# Create test output directory
|
|
||||||
test_dir = Path("test_sar_output")
|
|
||||||
test_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# Initialize downloader with test directory
|
|
||||||
downloader = SARDownloader(output_dir=test_dir)
|
|
||||||
|
|
||||||
# Load field boundaries
|
|
||||||
fields = downloader.load_field_boundaries()
|
|
||||||
|
|
||||||
# Download just 1 week of data (current week)
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
end_date = datetime.now()
|
|
||||||
start_date = end_date - timedelta(days=7)
|
|
||||||
|
|
||||||
logger.info(f"Testing download for: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
|
|
||||||
|
|
||||||
# Download 1 week
|
|
||||||
downloader.download_weekly_sar(start_date, end_date)
|
|
||||||
|
|
||||||
# Check if files were created
|
|
||||||
tif_files = list(test_dir.glob("*.tif"))
|
|
||||||
if tif_files:
|
|
||||||
logger.info(f"OK - Test download successful! {len(tif_files)} files created")
|
|
||||||
for f in tif_files:
|
|
||||||
logger.info(f" - {f.name}")
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
logger.warning("ERROR - No files downloaded - check SentinelHub quota/permissions")
|
|
||||||
return False
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"✗ Test download failed: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def main():
|
|
||||||
"""Run all tests"""
|
|
||||||
logger.info("=== SAR Download Test Suite ===\n")
|
|
||||||
|
|
||||||
# Test 1: Connection
|
|
||||||
logger.info("1. Testing SentinelHub connection...")
|
|
||||||
if not test_connection():
|
|
||||||
logger.error("Connection test failed - check credentials")
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Test 2: Field boundaries
|
|
||||||
logger.info("\n2. Testing field boundaries...")
|
|
||||||
success, fields = test_field_boundaries()
|
|
||||||
if not success:
|
|
||||||
logger.error("Field boundary test failed")
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Test 3: Quick download
|
|
||||||
logger.info("\n3. Testing SAR download (1 week)...")
|
|
||||||
if not test_quick_download():
|
|
||||||
logger.error("Download test failed")
|
|
||||||
return False
|
|
||||||
|
|
||||||
logger.info("\n=== All Tests Passed! ===")
|
|
||||||
logger.info("You can now run the full download script:")
|
|
||||||
logger.info("python download_s1_aura.py")
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
success = main()
|
|
||||||
sys.exit(0 if success else 1)
|
|
||||||
|
|
@ -1,998 +0,0 @@
|
||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "a42393ff",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 1: Setup & GPU"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 5,
|
|
||||||
"id": "bdcfdce8",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"SCRIPT 12: HARVEST DETECTION MODEL BUILDING\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"Using device: cuda\n",
|
|
||||||
"GPU: NVIDIA GeForce RTX 4070 Laptop GPU\n",
|
|
||||||
"Memory: 8.59 GB\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"import pandas as pd\n",
|
|
||||||
"import numpy as np\n",
|
|
||||||
"import matplotlib.pyplot as plt\n",
|
|
||||||
"import seaborn as sns\n",
|
|
||||||
"import torch\n",
|
|
||||||
"import torch.nn as nn\n",
|
|
||||||
"import torch.optim as optim\n",
|
|
||||||
"from torch.utils.data import DataLoader, Dataset\n",
|
|
||||||
"from sklearn.preprocessing import MinMaxScaler\n",
|
|
||||||
"from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve\n",
|
|
||||||
"import warnings\n",
|
|
||||||
"warnings.filterwarnings('ignore')\n",
|
|
||||||
"import pickle\n",
|
|
||||||
"import json\n",
|
|
||||||
"import os\n",
|
|
||||||
"from scipy import stats\n",
|
|
||||||
"\n",
|
|
||||||
"# Set seeds\n",
|
|
||||||
"np.random.seed(42)\n",
|
|
||||||
"torch.manual_seed(42)\n",
|
|
||||||
"\n",
|
|
||||||
"# Check GPU\n",
|
|
||||||
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"SCRIPT 12: HARVEST DETECTION MODEL BUILDING\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"print(f\"Using device: {device}\")\n",
|
|
||||||
"if torch.cuda.is_available():\n",
|
|
||||||
" print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
|
|
||||||
" print(f\"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "bdf3f895",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 2: Load Clean Data From Script 11"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 6,
|
|
||||||
"id": "3691dadd",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"LOADING CLEANED DATA FROM SCRIPT 11\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"\n",
|
|
||||||
"Loading:\n",
|
|
||||||
" lstm_train_data_cleaned.csv\n",
|
|
||||||
" lstm_test_data_cleaned.csv\n",
|
|
||||||
"\n",
|
|
||||||
"Loaded:\n",
|
|
||||||
" Train: (67998, 19)\n",
|
|
||||||
" Test: (4672, 19)\n",
|
|
||||||
"\n",
|
|
||||||
"CI column: 'fitdata_ma7'\n",
|
|
||||||
"Columns available: ['date', 'fitdata', 'field', 'sub_field', 'value', 'doy', 'model', 'season', 'subfield', 'ci_per_day', 'cumulative_ci', 'client', 'ci', 'fitdata_ma7', 'fitdata_ma14', 'model_season_id', 'is_spike', 'is_imminent', 'is_detected']\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"LOADING CLEANED DATA FROM SCRIPT 11\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"train_path = 'lstm_train_data_cleaned.csv'\n",
|
|
||||||
"test_path = 'lstm_test_data_cleaned.csv'\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nLoading:\")\n",
|
|
||||||
"print(f\" {train_path}\")\n",
|
|
||||||
"print(f\" {test_path}\")\n",
|
|
||||||
"\n",
|
|
||||||
"df_train = pd.read_csv(train_path, low_memory=False)\n",
|
|
||||||
"df_test = pd.read_csv(test_path, low_memory=False)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nLoaded:\")\n",
|
|
||||||
"print(f\" Train: {df_train.shape}\")\n",
|
|
||||||
"print(f\" Test: {df_test.shape}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Convert date\n",
|
|
||||||
"df_train['date'] = pd.to_datetime(df_train['date'])\n",
|
|
||||||
"df_test['date'] = pd.to_datetime(df_test['date'])\n",
|
|
||||||
"\n",
|
|
||||||
"# Detect CI column\n",
|
|
||||||
"if 'fitdata_ma7' in df_train.columns:\n",
|
|
||||||
" ci_column = 'fitdata_ma7'\n",
|
|
||||||
"elif 'fitdata' in df_train.columns:\n",
|
|
||||||
" ci_column = 'fitdata'\n",
|
|
||||||
"else:\n",
|
|
||||||
" ci_column = 'value'\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nCI column: '{ci_column}'\")\n",
|
|
||||||
"print(f\"Columns available: {list(df_train.columns)}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e07df306",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 3: Configuration"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 7,
|
|
||||||
"id": "7487a1d4",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"CONFIGURATION\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"\n",
|
|
||||||
"Client: ALL CLIENTS\n",
|
|
||||||
"Train/Val/Test split: (0.7, 0.15, 0.15)\n",
|
|
||||||
"\n",
|
|
||||||
"Harvest windows:\n",
|
|
||||||
" Imminent: 3-14d before harvest\n",
|
|
||||||
" Detected: 1-21d after harvest\n",
|
|
||||||
"\n",
|
|
||||||
"Model:\n",
|
|
||||||
" Hidden: 64, Layers: 1, Dropout: 0.5\n",
|
|
||||||
" Batch: 4, LR: 0.001, Epochs: 150\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"# Configuration - EDIT HERE for quick iteration\n",
|
|
||||||
"CLIENT_FILTER = None # None = all clients, or 'esa', 'chemba', etc.\n",
|
|
||||||
"TRAIN_VAL_TEST_SPLIT = (0.7, 0.15, 0.15) # Train, Val, Test\n",
|
|
||||||
"\n",
|
|
||||||
"# Harvest labeling windows (days)\n",
|
|
||||||
"IMMINENT_START = 14 # Start labeling 14 days before harvest\n",
|
|
||||||
"IMMINENT_END = 3 # Stop labeling 3 days before\n",
|
|
||||||
"DETECTED_START = 1 # Start labeling 1 day after harvest\n",
|
|
||||||
"DETECTED_END = 21 # Stop labeling 21 days after\n",
|
|
||||||
"\n",
|
|
||||||
"# Model hyperparameters\n",
|
|
||||||
"HIDDEN_SIZE = 64\n",
|
|
||||||
"NUM_LAYERS = 1\n",
|
|
||||||
"DROPOUT = 0.5\n",
|
|
||||||
"BATCH_SIZE = 4\n",
|
|
||||||
"LEARNING_RATE = 0.001\n",
|
|
||||||
"NUM_EPOCHS = 150\n",
|
|
||||||
"EARLY_STOPPING_PATIENCE = 20\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"CONFIGURATION\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"print(f\"\\nClient: {CLIENT_FILTER if CLIENT_FILTER else 'ALL CLIENTS'}\")\n",
|
|
||||||
"print(f\"Train/Val/Test split: {TRAIN_VAL_TEST_SPLIT}\")\n",
|
|
||||||
"print(f\"\\nHarvest windows:\")\n",
|
|
||||||
"print(f\" Imminent: {IMMINENT_END}-{IMMINENT_START}d before harvest\")\n",
|
|
||||||
"print(f\" Detected: {DETECTED_START}-{DETECTED_END}d after harvest\")\n",
|
|
||||||
"print(f\"\\nModel:\")\n",
|
|
||||||
"print(f\" Hidden: {HIDDEN_SIZE}, Layers: {NUM_LAYERS}, Dropout: {DROPOUT}\")\n",
|
|
||||||
"print(f\" Batch: {BATCH_SIZE}, LR: {LEARNING_RATE}, Epochs: {NUM_EPOCHS}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "08aa3ed8",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 4: Load Pre-Engineered Features from Script 11\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 8,
|
|
||||||
"id": "f9f789aa",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"LOADING PRE-ENGINEERED FEATURES FROM SCRIPT 11\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"\n",
|
|
||||||
"Loading pickle files...\n",
|
|
||||||
" ✓ train_sequences.pkl: 326 sequences\n",
|
|
||||||
" ✓ test_sequences.pkl: 18 sequences\n",
|
|
||||||
" ✓ X_train_norm.pkl: 326 normalized feature arrays\n",
|
|
||||||
" ✓ X_test_norm.pkl: 18 normalized feature arrays\n",
|
|
||||||
" ✓ feature_scalers.pkl: 7 scalers\n",
|
|
||||||
" ✓ feature_engineering_config.json loaded\n",
|
|
||||||
"\n",
|
|
||||||
"✓ Features ready:\n",
|
|
||||||
" Input size: 7D\n",
|
|
||||||
" Feature names: ['CI', '7d Velocity', '7d Acceleration', '14d MA', '14d Velocity', '7d Min', 'Is_Spike']\n",
|
|
||||||
" Train sequences: 326\n",
|
|
||||||
" Test sequences: 18\n",
|
|
||||||
" Imminent window: [14, 3] days\n",
|
|
||||||
" Detected window: [1, 40] days\n",
|
|
||||||
"\n",
|
|
||||||
"Feature verification:\n",
|
|
||||||
" X_train_norm[0] shape: (183, 7)\n",
|
|
||||||
" X_test_norm[0] shape: (161, 7)\n",
|
|
||||||
" Train sequence keys: ['field', 'model', 'ci', 'is_spike', 'is_imminent', 'is_detected', 'dates', 'length']\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"LOADING PRE-ENGINEERED FEATURES FROM SCRIPT 11\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Load pickles created by Script 11\n",
|
|
||||||
"print(f\"\\nLoading pickle files...\")\n",
|
|
||||||
"\n",
|
|
||||||
"train_sequences = pickle.load(open('train_sequences.pkl', 'rb'))\n",
|
|
||||||
"test_sequences = pickle.load(open('test_sequences.pkl', 'rb'))\n",
|
|
||||||
"print(f\" ✓ train_sequences.pkl: {len(train_sequences)} sequences\")\n",
|
|
||||||
"print(f\" ✓ test_sequences.pkl: {len(test_sequences)} sequences\")\n",
|
|
||||||
"\n",
|
|
||||||
"X_train_norm = pickle.load(open('X_train_norm.pkl', 'rb'))\n",
|
|
||||||
"X_test_norm = pickle.load(open('X_test_norm.pkl', 'rb'))\n",
|
|
||||||
"print(f\" ✓ X_train_norm.pkl: {len(X_train_norm)} normalized feature arrays\")\n",
|
|
||||||
"print(f\" ✓ X_test_norm.pkl: {len(X_test_norm)} normalized feature arrays\")\n",
|
|
||||||
"\n",
|
|
||||||
"feature_scalers = pickle.load(open('feature_scalers.pkl', 'rb'))\n",
|
|
||||||
"print(f\" ✓ feature_scalers.pkl: {len(feature_scalers)} scalers\")\n",
|
|
||||||
"\n",
|
|
||||||
"feature_config = json.load(open('feature_engineering_config.json', 'r'))\n",
|
|
||||||
"print(f\" ✓ feature_engineering_config.json loaded\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n✓ Features ready:\")\n",
|
|
||||||
"print(f\" Input size: {feature_config['input_size']}D\")\n",
|
|
||||||
"print(f\" Feature names: {feature_config['feature_names']}\")\n",
|
|
||||||
"print(f\" Train sequences: {len(train_sequences)}\")\n",
|
|
||||||
"print(f\" Test sequences: {len(test_sequences)}\")\n",
|
|
||||||
"print(f\" Imminent window: {feature_config['imminent_window']} days\")\n",
|
|
||||||
"print(f\" Detected window: {feature_config['detected_window']} days\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Verify feature dimensions\n",
|
|
||||||
"print(f\"\\nFeature verification:\")\n",
|
|
||||||
"print(f\" X_train_norm[0] shape: {X_train_norm[0].shape}\")\n",
|
|
||||||
"print(f\" X_test_norm[0] shape: {X_test_norm[0].shape}\")\n",
|
|
||||||
"print(f\" Train sequence keys: {list(train_sequences[0].keys())}\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 9,
|
|
||||||
"id": "377687c5",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"LOSS FUNCTION & OPTIMIZATION\n",
|
|
||||||
"================================================================================\n",
|
|
||||||
"\n",
|
|
||||||
"Class weights (capped at 8.0):\n",
|
|
||||||
" Imminent: 8.00x (raw: 17.96x)\n",
|
|
||||||
" Detected: 1.00x (raw: 1.00x)\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"ename": "NameError",
|
|
||||||
"evalue": "name 'model' is not defined",
|
|
||||||
"output_type": "error",
|
|
||||||
"traceback": [
|
|
||||||
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
|
||||||
"\u001b[31mNameError\u001b[39m Traceback (most recent call last)",
|
|
||||||
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 49\u001b[39m\n\u001b[32m 46\u001b[39m criterion_imminent = FocalBCELoss(weight_pos=weight_imminent, gamma=\u001b[32m2.0\u001b[39m)\n\u001b[32m 47\u001b[39m criterion_detected = FocalBCELoss(weight_pos=weight_detected, gamma=\u001b[32m2.0\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m49\u001b[39m optimizer = optim.Adam(\u001b[43mmodel\u001b[49m.parameters(), lr=LEARNING_RATE)\n\u001b[32m 51\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[33m\"\u001b[39m + \u001b[33m\"\u001b[39m\u001b[33m=\u001b[39m\u001b[33m\"\u001b[39m*\u001b[32m80\u001b[39m)\n\u001b[32m 52\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mFOCAL LOSS (Like Script 5)\u001b[39m\u001b[33m\"\u001b[39m)\n",
|
|
||||||
"\u001b[31mNameError\u001b[39m: name 'model' is not defined"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"LOSS FUNCTION & OPTIMIZATION\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Calculate class weights from all training data\n",
|
|
||||||
"y_train_imm_all = np.concatenate([s['is_imminent'] for s in train_sequences])\n",
|
|
||||||
"y_train_det_all = np.concatenate([s['is_detected'] for s in train_sequences])\n",
|
|
||||||
"\n",
|
|
||||||
"weight_imminent_raw = (1 - y_train_imm_all.mean()) / y_train_imm_all.mean() if y_train_imm_all.mean() > 0 else 1.0\n",
|
|
||||||
"weight_detected_raw = (1 - y_train_det_all.mean()) / y_train_det_all.mean() if y_train_det_all.mean() > 0 else 1.0\n",
|
|
||||||
"\n",
|
|
||||||
"# Cap weights at 8.0\n",
|
|
||||||
"weight_imminent = min(weight_imminent_raw, 8.0)\n",
|
|
||||||
"weight_detected = min(weight_detected_raw, 8.0)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nClass weights (capped at 8.0):\")\n",
|
|
||||||
"print(f\" Imminent: {weight_imminent:.2f}x (raw: {weight_imminent_raw:.2f}x)\")\n",
|
|
||||||
"print(f\" Detected: {weight_detected:.2f}x (raw: {weight_detected_raw:.2f}x)\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Focal Loss - like Script 5\n",
|
|
||||||
"class FocalBCELoss(nn.Module):\n",
|
|
||||||
" \"\"\"Focal loss for handling imbalanced binary classification.\"\"\"\n",
|
|
||||||
" def __init__(self, weight_pos=1.0, gamma=2.0):\n",
|
|
||||||
" super().__init__()\n",
|
|
||||||
" self.weight_pos = weight_pos\n",
|
|
||||||
" self.gamma = gamma\n",
|
|
||||||
" \n",
|
|
||||||
" def forward(self, pred, target, mask=None):\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" Args:\n",
|
|
||||||
" pred: (batch, seq_len) - predicted probabilities\n",
|
|
||||||
" target: (batch, seq_len) - target labels\n",
|
|
||||||
" mask: (batch, seq_len) - 1 for valid, 0 for padded\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" bce_loss = -(target * torch.log(pred + 1e-7) + (1 - target) * torch.log(1 - pred + 1e-7))\n",
|
|
||||||
" focal_weight = target * torch.pow(1 - pred, self.gamma) + (1 - target) * torch.pow(pred, self.gamma)\n",
|
|
||||||
" loss = self.weight_pos * target * focal_weight * torch.log(pred + 1e-7) + \\\n",
|
|
||||||
" (1 - target) * focal_weight * torch.log(1 - pred + 1e-7)\n",
|
|
||||||
" loss = -loss\n",
|
|
||||||
" \n",
|
|
||||||
" if mask is not None:\n",
|
|
||||||
" loss = loss * mask\n",
|
|
||||||
" \n",
|
|
||||||
" return loss.mean()\n",
|
|
||||||
"\n",
|
|
||||||
"criterion_imminent = FocalBCELoss(weight_pos=weight_imminent, gamma=2.0)\n",
|
|
||||||
"criterion_detected = FocalBCELoss(weight_pos=weight_detected, gamma=2.0)\n",
|
|
||||||
"\n",
|
|
||||||
"optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n\" + \"=\"*80)\n",
|
|
||||||
"print(\"FOCAL LOSS (Like Script 5)\")\n",
|
|
||||||
"print(\"=\"*80)\n",
|
|
||||||
"print(f\" Gamma: 2.0 (focus on hard examples)\")\n",
|
|
||||||
"print(f\" Per-timestep masking: enabled\")\n",
|
|
||||||
"print(f\" Optimizer: Adam (lr={LEARNING_RATE})\")\n",
|
|
||||||
"print(f\" Epochs: {NUM_EPOCHS}, Patience: {EARLY_STOPPING_PATIENCE}\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e50530c9",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 5: Extract Labels from Sequences\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "fab422c4",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"EXTRACTING LABELS FROM SEQUENCES\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Extract harvest labels for training\n",
|
|
||||||
"# Note: Labels come from Script 11's is_imminent/is_detected columns\n",
|
|
||||||
"train_labels_imm = []\n",
|
|
||||||
"train_labels_det = []\n",
|
|
||||||
"test_labels_imm = []\n",
|
|
||||||
"test_labels_det = []\n",
|
|
||||||
"\n",
|
|
||||||
"for seq in train_sequences:\n",
|
|
||||||
" # is_imminent and is_detected are in the sequence\n",
|
|
||||||
" # We'll extract them during batch loading\n",
|
|
||||||
" pass\n",
|
|
||||||
"\n",
|
|
||||||
"for seq in test_sequences:\n",
|
|
||||||
" pass\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n✓ Labels ready:\")\n",
|
|
||||||
"print(f\" Imminent: Days 14-3 before harvest (early warning)\")\n",
|
|
||||||
"print(f\" Detected: Days 1-40 after harvest (confirmation)\")\n",
|
|
||||||
"print(f\"\\n These were set in Script 11 and will be loaded during training\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Display sample sequence stats\n",
|
|
||||||
"print(f\"\\nSample sequences:\")\n",
|
|
||||||
"sample_seq = train_sequences[0]\n",
|
|
||||||
"print(f\" Field: {sample_seq['field']}\")\n",
|
|
||||||
"print(f\" Season: {sample_seq['model']}\")\n",
|
|
||||||
"print(f\" Length: {sample_seq['length']} days\")\n",
|
|
||||||
"print(f\" Date range: {sample_seq['dates'][0].date()} to {sample_seq['dates'][-1].date()}\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "82588f54",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 6: PyTorch DataLoader (Features Already Normalized)\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "deb3a62b",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"PREPARING DATALOADERS (Features Pre-Normalized in Script 11)\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Features are already normalized in Script 11\n",
|
|
||||||
"# X_train_norm and X_test_norm are ready to use\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nFeature statistics (already normalized [0,1]):\")\n",
|
|
||||||
"X_all = X_train_norm + X_test_norm\n",
|
|
||||||
"for feat_idx, name in enumerate(feature_config['feature_names']):\n",
|
|
||||||
" feat_data = np.concatenate([f[:, feat_idx] for f in X_all])\n",
|
|
||||||
" print(f\" {name:20s}: [{feat_data.min():.4f}, {feat_data.max():.4f}]\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "2e8e919a",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 7: PyTorch DataLoader"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "de08003a",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"PYTORCH DATASET & DATALOADER\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"class HarvestDataset(torch.utils.data.Dataset):\n",
|
|
||||||
" def __init__(self, X_sequences, sequences):\n",
|
|
||||||
" self.X = X_sequences\n",
|
|
||||||
" self.sequences = sequences\n",
|
|
||||||
" \n",
|
|
||||||
" def __len__(self):\n",
|
|
||||||
" return len(self.X)\n",
|
|
||||||
" \n",
|
|
||||||
" def __getitem__(self, idx):\n",
|
|
||||||
" X = self.X[idx]\n",
|
|
||||||
" seq = self.sequences[idx]\n",
|
|
||||||
" \n",
|
|
||||||
" if 'is_imminent' in seq:\n",
|
|
||||||
" y_imm = seq['is_imminent']\n",
|
|
||||||
" else:\n",
|
|
||||||
" y_imm = np.zeros(len(seq['ci']))\n",
|
|
||||||
" \n",
|
|
||||||
" if 'is_detected' in seq:\n",
|
|
||||||
" y_det = seq['is_detected']\n",
|
|
||||||
" else:\n",
|
|
||||||
" y_det = np.zeros(len(seq['ci']))\n",
|
|
||||||
" \n",
|
|
||||||
" return X, y_imm, y_det\n",
|
|
||||||
"\n",
|
|
||||||
"def collate_variable_length(batch):\n",
|
|
||||||
" \"\"\"Pad sequences to longest in batch.\"\"\"\n",
|
|
||||||
" X_list, y_imm_list, y_det_list = zip(*batch)\n",
|
|
||||||
" \n",
|
|
||||||
" max_len = max(len(x) for x in X_list)\n",
|
|
||||||
" \n",
|
|
||||||
" X_padded = []\n",
|
|
||||||
" y_imm_padded = []\n",
|
|
||||||
" y_det_padded = []\n",
|
|
||||||
" seq_lengths = []\n",
|
|
||||||
" \n",
|
|
||||||
" for x, y_imm, y_det in zip(X_list, y_imm_list, y_det_list):\n",
|
|
||||||
" seq_len = len(x)\n",
|
|
||||||
" seq_lengths.append(seq_len)\n",
|
|
||||||
" \n",
|
|
||||||
" x_padded = np.zeros((max_len, 7)) # 7 features (with spike)\n",
|
|
||||||
" x_padded[:seq_len] = x\n",
|
|
||||||
" X_padded.append(x_padded)\n",
|
|
||||||
" \n",
|
|
||||||
" y_imm_padded_arr = np.zeros(max_len)\n",
|
|
||||||
" y_imm_padded_arr[:seq_len] = y_imm\n",
|
|
||||||
" y_imm_padded.append(y_imm_padded_arr)\n",
|
|
||||||
" \n",
|
|
||||||
" y_det_padded_arr = np.zeros(max_len)\n",
|
|
||||||
" y_det_padded_arr[:seq_len] = y_det\n",
|
|
||||||
" y_det_padded.append(y_det_padded_arr)\n",
|
|
||||||
" \n",
|
|
||||||
" X_batch = torch.FloatTensor(np.array(X_padded))\n",
|
|
||||||
" y_imm_batch = torch.FloatTensor(np.array(y_imm_padded))\n",
|
|
||||||
" y_det_batch = torch.FloatTensor(np.array(y_det_padded))\n",
|
|
||||||
" seq_lengths = torch.LongTensor(seq_lengths)\n",
|
|
||||||
" \n",
|
|
||||||
" return X_batch, y_imm_batch, y_det_batch, seq_lengths\n",
|
|
||||||
"\n",
|
|
||||||
"# Create dataloaders\n",
|
|
||||||
"train_dataset = HarvestDataset(X_train_norm, train_sequences)\n",
|
|
||||||
"test_dataset = HarvestDataset(X_test_norm, test_sequences)\n",
|
|
||||||
"\n",
|
|
||||||
"train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_variable_length)\n",
|
|
||||||
"test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_variable_length)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n✓ DataLoaders created:\")\n",
|
|
||||||
"print(f\" Train: {len(train_loader)} batches ({len(train_dataset)} sequences)\")\n",
|
|
||||||
"print(f\" Test: {len(test_loader)} batches ({len(test_dataset)} sequences)\")\n",
|
|
||||||
"print(f\" Batch size: {BATCH_SIZE}\")\n",
|
|
||||||
"print(f\" Input shape: (max_seq_len, 7) - pre-engineered 7D features (WITH SPIKE)\")\n",
|
|
||||||
"print(f\" Dynamic padding to longest sequence in each batch\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "51964919",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 7: Build & Train LSTM Model\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "ea0653f9",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"BUILDING LSTM MODEL\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"class HarvestLSTM(nn.Module):\n",
|
|
||||||
" \"\"\"Dual-output LSTM for harvest prediction.\"\"\"\n",
|
|
||||||
" def __init__(self, input_size=7, hidden_size=64, num_layers=1, dropout=0.5):\n",
|
|
||||||
" super().__init__()\n",
|
|
||||||
" \n",
|
|
||||||
" self.lstm = nn.LSTM(\n",
|
|
||||||
" input_size=input_size,\n",
|
|
||||||
" hidden_size=hidden_size,\n",
|
|
||||||
" num_layers=num_layers,\n",
|
|
||||||
" dropout=dropout if num_layers > 1 else 0,\n",
|
|
||||||
" bidirectional=False,\n",
|
|
||||||
" batch_first=True\n",
|
|
||||||
" )\n",
|
|
||||||
" \n",
|
|
||||||
" # Output heads for dual prediction\n",
|
|
||||||
" self.imminent_head = nn.Sequential(\n",
|
|
||||||
" nn.Linear(hidden_size, 16),\n",
|
|
||||||
" nn.ReLU(),\n",
|
|
||||||
" nn.Dropout(dropout),\n",
|
|
||||||
" nn.Linear(16, 1),\n",
|
|
||||||
" nn.Sigmoid()\n",
|
|
||||||
" )\n",
|
|
||||||
" \n",
|
|
||||||
" self.detected_head = nn.Sequential(\n",
|
|
||||||
" nn.Linear(hidden_size, 16),\n",
|
|
||||||
" nn.ReLU(),\n",
|
|
||||||
" nn.Dropout(dropout),\n",
|
|
||||||
" nn.Linear(16, 1),\n",
|
|
||||||
" nn.Sigmoid()\n",
|
|
||||||
" )\n",
|
|
||||||
" \n",
|
|
||||||
" def forward(self, x):\n",
|
|
||||||
" lstm_out, _ = self.lstm(x)\n",
|
|
||||||
" \n",
|
|
||||||
" batch_size, seq_len, hidden_size = lstm_out.shape\n",
|
|
||||||
" lstm_flat = lstm_out.reshape(-1, hidden_size)\n",
|
|
||||||
" \n",
|
|
||||||
" imminent_flat = self.imminent_head(lstm_flat).reshape(batch_size, seq_len)\n",
|
|
||||||
" detected_flat = self.detected_head(lstm_flat).reshape(batch_size, seq_len)\n",
|
|
||||||
" \n",
|
|
||||||
" return imminent_flat, detected_flat\n",
|
|
||||||
"\n",
|
|
||||||
"model = HarvestLSTM(input_size=7, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, dropout=DROPOUT)\n",
|
|
||||||
"model = model.to(device)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nModel architecture:\")\n",
|
|
||||||
"print(model)\n",
|
|
||||||
"\n",
|
|
||||||
"total_params = sum(p.numel() for p in model.parameters())\n",
|
|
||||||
"trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
|
|
||||||
"print(f\"\\nParameters: {trainable_params:,} / {total_params:,}\")\n",
|
|
||||||
"\n",
|
|
||||||
"optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)\n",
|
|
||||||
"print(f\"\\nOptimizer: Adam (lr={LEARNING_RATE})\")\n",
|
|
||||||
"print(f\"Input: 7D features (CI, vel7d, accel7d, ma14d, vel14d, min7d, is_spike) - SAME AS SCRIPT 5\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "1862848f",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 9: Train Model"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "7cfc98dd",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(\"\\n\" + \"=\"*80)\n",
|
|
||||||
"print(\"TRAINING\")\n",
|
|
||||||
"print(\"=\"*80)\n",
|
|
||||||
"\n",
|
|
||||||
"# Class weights from training data\n",
|
|
||||||
"y_train_imm_all = np.concatenate([s['is_imminent'] for s in train_sequences])\n",
|
|
||||||
"y_train_det_all = np.concatenate([s['is_detected'] for s in train_sequences])\n",
|
|
||||||
"\n",
|
|
||||||
"weight_imm = min((1 - y_train_imm_all.mean()) / y_train_imm_all.mean() if y_train_imm_all.mean() > 0 else 1.0, 8.0)\n",
|
|
||||||
"weight_det = min((1 - y_train_det_all.mean()) / y_train_det_all.mean() if y_train_det_all.mean() > 0 else 1.0, 8.0)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nClass weights:\")\n",
|
|
||||||
"print(f\" Imminent: {weight_imm:.1f}x\")\n",
|
|
||||||
"print(f\" Detected: {weight_det:.1f}x\")\n",
|
|
||||||
"\n",
|
|
||||||
"best_test_loss = float('inf')\n",
|
|
||||||
"patience_counter = 0\n",
|
|
||||||
"train_losses = []\n",
|
|
||||||
"test_losses = []\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nTraining for {NUM_EPOCHS} epochs (patience={EARLY_STOPPING_PATIENCE})...\\n\")\n",
|
|
||||||
"\n",
|
|
||||||
"for epoch in range(NUM_EPOCHS):\n",
|
|
||||||
" # TRAINING\n",
|
|
||||||
" model.train()\n",
|
|
||||||
" train_loss = 0.0\n",
|
|
||||||
" \n",
|
|
||||||
" for X_batch, y_imm_batch, y_det_batch, seq_lens in train_loader:\n",
|
|
||||||
" X_batch = X_batch.to(device)\n",
|
|
||||||
" y_imm_batch = y_imm_batch.to(device)\n",
|
|
||||||
" y_det_batch = y_det_batch.to(device)\n",
|
|
||||||
" seq_lens = seq_lens.to(device)\n",
|
|
||||||
" \n",
|
|
||||||
" # Create mask for valid (non-padded) positions\n",
|
|
||||||
" batch_size, max_len = y_imm_batch.shape\n",
|
|
||||||
" mask = torch.zeros(batch_size, max_len, device=device)\n",
|
|
||||||
" for i, seq_len in enumerate(seq_lens):\n",
|
|
||||||
" mask[i, :seq_len] = 1.0\n",
|
|
||||||
" \n",
|
|
||||||
" optimizer.zero_grad()\n",
|
|
||||||
" imminent_pred, detected_pred = model(X_batch)\n",
|
|
||||||
" \n",
|
|
||||||
" loss_imminent = criterion_imminent(imminent_pred, y_imm_batch, mask)\n",
|
|
||||||
" loss_detected = criterion_detected(detected_pred, y_det_batch, mask)\n",
|
|
||||||
" loss = 0.5 * loss_imminent + 0.5 * loss_detected\n",
|
|
||||||
" \n",
|
|
||||||
" loss.backward()\n",
|
|
||||||
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
|
|
||||||
" optimizer.step()\n",
|
|
||||||
" \n",
|
|
||||||
" train_loss += loss.item()\n",
|
|
||||||
" \n",
|
|
||||||
" train_loss /= len(train_loader)\n",
|
|
||||||
" train_losses.append(train_loss)\n",
|
|
||||||
" \n",
|
|
||||||
" # VALIDATION (using test set)\n",
|
|
||||||
" model.eval()\n",
|
|
||||||
" test_loss = 0.0\n",
|
|
||||||
" \n",
|
|
||||||
" with torch.no_grad():\n",
|
|
||||||
" for X_batch, y_imm_batch, y_det_batch, seq_lens in test_loader:\n",
|
|
||||||
" X_batch = X_batch.to(device)\n",
|
|
||||||
" y_imm_batch = y_imm_batch.to(device)\n",
|
|
||||||
" y_det_batch = y_det_batch.to(device)\n",
|
|
||||||
" seq_lens = seq_lens.to(device)\n",
|
|
||||||
" \n",
|
|
||||||
" # Create mask\n",
|
|
||||||
" batch_size, max_len = y_imm_batch.shape\n",
|
|
||||||
" mask = torch.zeros(batch_size, max_len, device=device)\n",
|
|
||||||
" for i, seq_len in enumerate(seq_lens):\n",
|
|
||||||
" mask[i, :seq_len] = 1.0\n",
|
|
||||||
" \n",
|
|
||||||
" imminent_pred, detected_pred = model(X_batch)\n",
|
|
||||||
" \n",
|
|
||||||
" loss_imminent = criterion_imminent(imminent_pred, y_imm_batch, mask)\n",
|
|
||||||
" loss_detected = criterion_detected(detected_pred, y_det_batch, mask)\n",
|
|
||||||
" loss = 0.5 * loss_imminent + 0.5 * loss_detected\n",
|
|
||||||
" \n",
|
|
||||||
" test_loss += loss.item()\n",
|
|
||||||
" \n",
|
|
||||||
" test_loss /= len(test_loader)\n",
|
|
||||||
" test_losses.append(test_loss)\n",
|
|
||||||
" \n",
|
|
||||||
" # Early stopping\n",
|
|
||||||
" if test_loss < best_test_loss:\n",
|
|
||||||
" best_test_loss = test_loss\n",
|
|
||||||
" patience_counter = 0\n",
|
|
||||||
" torch.save(model.state_dict(), 'harvest_detection_model_best.pt')\n",
|
|
||||||
" else:\n",
|
|
||||||
" patience_counter += 1\n",
|
|
||||||
" \n",
|
|
||||||
" # Print progress\n",
|
|
||||||
" if (epoch + 1) % 20 == 0 or epoch == 0:\n",
|
|
||||||
" print(f\"Epoch {epoch+1:3d}/{NUM_EPOCHS} | Train: {train_loss:.4f} | Test: {test_loss:.4f}\")\n",
|
|
||||||
" \n",
|
|
||||||
" if patience_counter >= EARLY_STOPPING_PATIENCE:\n",
|
|
||||||
" print(f\"\\n✓ Early stopping at epoch {epoch + 1}\")\n",
|
|
||||||
" break\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\n\" + \"=\"*80)\n",
|
|
||||||
"print(\"TRAINING COMPLETE\")\n",
|
|
||||||
"print(\"=\"*80)\n",
|
|
||||||
"print(f\"\\nBest test loss: {best_test_loss:.4f}\")\n",
|
|
||||||
"print(f\"Final epoch: {epoch + 1}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Load best model\n",
|
|
||||||
"model.load_state_dict(torch.load('harvest_detection_model_best.pt'))\n",
|
|
||||||
"print(f\"✓ Loaded best model from epoch with test_loss={best_test_loss:.4f}\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "dd05c9bf",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 10: Evaluate Model"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "82641d96",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"EVALUATION ON TEST SET\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"model.eval()\n",
|
|
||||||
"test_preds_imm = []\n",
|
|
||||||
"test_preds_det = []\n",
|
|
||||||
"test_labels_imm = []\n",
|
|
||||||
"test_labels_det = []\n",
|
|
||||||
"\n",
|
|
||||||
"with torch.no_grad():\n",
|
|
||||||
" for X_batch, y_imm_batch, y_det_batch, seq_lens in test_loader:\n",
|
|
||||||
" X_batch = X_batch.to(device)\n",
|
|
||||||
" \n",
|
|
||||||
" imm_pred, det_pred = model(X_batch)\n",
|
|
||||||
" \n",
|
|
||||||
" for i, seq_len in enumerate(seq_lens):\n",
|
|
||||||
" seq_len = seq_len.item()\n",
|
|
||||||
" test_preds_imm.extend(imm_pred[i, :seq_len].cpu().numpy())\n",
|
|
||||||
" test_preds_det.extend(det_pred[i, :seq_len].cpu().numpy())\n",
|
|
||||||
" test_labels_imm.extend(y_imm_batch[i, :seq_len].cpu().numpy())\n",
|
|
||||||
" test_labels_det.extend(y_det_batch[i, :seq_len].cpu().numpy())\n",
|
|
||||||
"\n",
|
|
||||||
"test_preds_imm = np.array(test_preds_imm)\n",
|
|
||||||
"test_preds_det = np.array(test_preds_det)\n",
|
|
||||||
"test_labels_imm = np.array(test_labels_imm)\n",
|
|
||||||
"test_labels_det = np.array(test_labels_det)\n",
|
|
||||||
"\n",
|
|
||||||
"test_preds_imm_binary = (test_preds_imm > 0.5).astype(int)\n",
|
|
||||||
"test_preds_det_binary = (test_preds_det > 0.5).astype(int)\n",
|
|
||||||
"\n",
|
|
||||||
"auc_imm = roc_auc_score(test_labels_imm, test_preds_imm)\n",
|
|
||||||
"auc_det = roc_auc_score(test_labels_det, test_preds_det)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nHARVEST IMMINENT PREDICTION:\")\n",
|
|
||||||
"print(classification_report(test_labels_imm, test_preds_imm_binary, target_names=['Normal', 'Imminent']))\n",
|
|
||||||
"print(f\"AUC-ROC: {auc_imm:.4f}\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\nHARVEST DETECTED PREDICTION:\")\n",
|
|
||||||
"print(classification_report(test_labels_det, test_preds_det_binary, target_names=['Normal', 'Detected']))\n",
|
|
||||||
"print(f\"AUC-ROC: {auc_det:.4f}\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"SUMMARY\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"print(f\"✓ Imminent (early warning): AUC = {auc_imm:.4f}\")\n",
|
|
||||||
"print(f\"✓ Detected (confirmation): AUC = {auc_det:.4f}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "284e6449",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 11: Save Model & Artifacts"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "9c40d4ab",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"SAVING MODEL & ARTIFACTS\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"model_name = f'harvest_detection_model_trained.pt'\n",
|
|
||||||
"torch.save(model.state_dict(), model_name)\n",
|
|
||||||
"print(f\"\\n✓ Saved: {model_name}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Save config (references feature config from Script 11)\n",
|
|
||||||
"config = {\n",
|
|
||||||
" 'input_size': 7,\n",
|
|
||||||
" 'hidden_size': HIDDEN_SIZE,\n",
|
|
||||||
" 'num_layers': NUM_LAYERS,\n",
|
|
||||||
" 'dropout': DROPOUT,\n",
|
|
||||||
" 'feature_names': feature_config['feature_names'],\n",
|
|
||||||
" 'auc_imminent': float(auc_imm),\n",
|
|
||||||
" 'auc_detected': float(auc_det),\n",
|
|
||||||
" 'imminent_window': feature_config['imminent_window'],\n",
|
|
||||||
" 'detected_window': feature_config['detected_window'],\n",
|
|
||||||
" 'note': 'Feature engineering done in Script 11 - this model is pure training'\n",
|
|
||||||
"}\n",
|
|
||||||
"\n",
|
|
||||||
"with open('harvest_model_config.json', 'w') as f:\n",
|
|
||||||
" json.dump(config, f, indent=2)\n",
|
|
||||||
"print(f\"✓ Saved: harvest_model_config.json\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"✓ SCRIPT 12 COMPLETE\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"print(f\"\"\"\n",
|
|
||||||
"Model is ready for production!\n",
|
|
||||||
"\n",
|
|
||||||
"Architecture:\n",
|
|
||||||
" Input: 7D pre-engineered features (from Script 11)\n",
|
|
||||||
" Features: CI, 7d velocity, 7d acceleration, 14d MA, 14d velocity, 7d min, is_spike\n",
|
|
||||||
" LSTM: {HIDDEN_SIZE} hidden units, {NUM_LAYERS} layer(s), {DROPOUT} dropout\n",
|
|
||||||
" Output: Dual heads (imminent + detected)\n",
|
|
||||||
"\n",
|
|
||||||
"Performance:\n",
|
|
||||||
" Imminent (early warning): AUC = {auc_imm:.4f}\n",
|
|
||||||
" Detected (confirmation): AUC = {auc_det:.4f}\n",
|
|
||||||
"\n",
|
|
||||||
"Next steps:\n",
|
|
||||||
" 1. Load model weights + config for inference\n",
|
|
||||||
" 2. Implement streaming day-by-day prediction\n",
|
|
||||||
" 3. Deploy to production pipeline\n",
|
|
||||||
"\"\"\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "a1185772",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(f\"\\n{'='*80}\")\n",
|
|
||||||
"print(\"VISUALIZING PREDICTIONS ON TEST FIELDS\")\n",
|
|
||||||
"print(f\"{'='*80}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Select a few diverse test fields\n",
|
|
||||||
"test_fields = df_test['field'].unique()[:3]\n",
|
|
||||||
"\n",
|
|
||||||
"fig, axes = plt.subplots(len(test_fields), 1, figsize=(16, 4 * len(test_fields)))\n",
|
|
||||||
"if len(test_fields) == 1:\n",
|
|
||||||
" axes = [axes]\n",
|
|
||||||
"\n",
|
|
||||||
"for ax_idx, field in enumerate(test_fields):\n",
|
|
||||||
" field_data = df_test[df_test['field'] == field].sort_values('date').reset_index(drop=True)\n",
|
|
||||||
" \n",
|
|
||||||
" if len(field_data) == 0:\n",
|
|
||||||
" continue\n",
|
|
||||||
" \n",
|
|
||||||
" ci_values = field_data[ci_column].values\n",
|
|
||||||
" dates = pd.to_datetime(field_data['date'].values)\n",
|
|
||||||
" \n",
|
|
||||||
" # Get model predictions for this field\n",
|
|
||||||
" field_test_sequences = [s for s in test_sequences if s['field'] == field]\n",
|
|
||||||
" \n",
|
|
||||||
" if len(field_test_sequences) == 0:\n",
|
|
||||||
" continue\n",
|
|
||||||
" \n",
|
|
||||||
" # Predict for first season in field\n",
|
|
||||||
" seq = field_test_sequences[0]\n",
|
|
||||||
" X_seq = X_test_norm[test_sequences.index(seq)]\n",
|
|
||||||
" X_tensor = torch.FloatTensor(X_seq).unsqueeze(0).to(device)\n",
|
|
||||||
" \n",
|
|
||||||
" model.eval()\n",
|
|
||||||
" with torch.no_grad():\n",
|
|
||||||
" imm_pred, det_pred = model(X_tensor)\n",
|
|
||||||
" imm_pred = imm_pred[0].cpu().numpy()[:len(seq['ci'])]\n",
|
|
||||||
" det_pred = det_pred[0].cpu().numpy()[:len(seq['ci'])]\n",
|
|
||||||
" \n",
|
|
||||||
" ax = axes[ax_idx]\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot 1: CI line\n",
|
|
||||||
" ax.plot(dates, ci_values, 'b-', linewidth=2, label='CI (Crop Index)', alpha=0.7)\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot 2: Imminent probability (right axis)\n",
|
|
||||||
" ax2 = ax.twinx()\n",
|
|
||||||
" ax2.fill_between(dates, imm_pred, alpha=0.3, color='orange', label='Imminent Probability')\n",
|
|
||||||
" ax2.plot(dates, imm_pred, 'o-', color='orange', linewidth=1.5, markersize=3)\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot 3: Detected probability (right axis)\n",
|
|
||||||
" ax2.fill_between(dates, det_pred, alpha=0.2, color='red', label='Detected Probability')\n",
|
|
||||||
" ax2.plot(dates, det_pred, 's-', color='red', linewidth=1.5, markersize=3)\n",
|
|
||||||
" \n",
|
|
||||||
" # Label harvest boundaries\n",
|
|
||||||
" harvest_idx = len(ci_values) - 1\n",
|
|
||||||
" ax.axvline(dates[harvest_idx], color='darkred', linestyle='--', linewidth=2, alpha=0.5)\n",
|
|
||||||
" ax.text(dates[harvest_idx], ci_values.max(), 'HARVEST', rotation=90, va='top', fontsize=9)\n",
|
|
||||||
" \n",
|
|
||||||
" # Formatting\n",
|
|
||||||
" ax.set_xlabel('Date', fontsize=10)\n",
|
|
||||||
" ax.set_ylabel('Crop Index', fontsize=10, color='b')\n",
|
|
||||||
" ax2.set_ylabel('Prediction Probability', fontsize=10)\n",
|
|
||||||
" ax2.set_ylim([0, 1])\n",
|
|
||||||
" ax.set_title(f'Field: {field}', fontsize=12, fontweight='bold')\n",
|
|
||||||
" ax.grid(True, alpha=0.3)\n",
|
|
||||||
" ax.tick_params(axis='y', labelcolor='b')\n",
|
|
||||||
" \n",
|
|
||||||
" # Legend\n",
|
|
||||||
" lines1, labels1 = ax.get_legend_handles_labels()\n",
|
|
||||||
" lines2, labels2 = ax2.get_legend_handles_labels()\n",
|
|
||||||
" ax.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=9)\n",
|
|
||||||
"\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.savefig('harvest_predictions_by_field.png', dpi=100, bbox_inches='tight')\n",
|
|
||||||
"plt.show()\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"\\n✓ Saved: harvest_predictions_by_field.png\")\n",
|
|
||||||
"print(f\"\\nPrediction interpretation:\")\n",
|
|
||||||
"print(f\" Blue line: CI (crop health)\")\n",
|
|
||||||
"print(f\" Orange: Imminent probability (14-3 days before harvest)\")\n",
|
|
||||||
"print(f\" Red: Detected probability (1-21 days after harvest)\")\n",
|
|
||||||
"print(f\" Red dashed line: Harvest event (season end)\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "d4712287",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Section 12: Per-Field Prediction Visualization"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"kernelspec": {
|
|
||||||
"display_name": "pytorch_gpu",
|
|
||||||
"language": "python",
|
|
||||||
"name": "python3"
|
|
||||||
},
|
|
||||||
"language_info": {
|
|
||||||
"codemirror_mode": {
|
|
||||||
"name": "ipython",
|
|
||||||
"version": 3
|
|
||||||
},
|
|
||||||
"file_extension": ".py",
|
|
||||||
"mimetype": "text/x-python",
|
|
||||||
"name": "python",
|
|
||||||
"nbconvert_exporter": "python",
|
|
||||||
"pygments_lexer": "ipython3",
|
|
||||||
"version": "3.11.14"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 5
|
|
||||||
}
|
|
||||||
|
|
@ -1,136 +0,0 @@
|
||||||
# Action Plan: Fix False Imminent Triggers (CI-Only + Confidence Intervals)
|
|
||||||
|
|
||||||
**Problem**: Noise/clouds cause false imminent triggers (model learns on noisy data)
|
|
||||||
**Solution**: Better smoothing + uncertainty quantification to filter noise
|
|
||||||
**Effort**: 4-5 hours implementation + 30 min training
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Root Cause Analysis
|
|
||||||
|
|
||||||
Your graph shows: Smooth blue LOESS curve (real field state) vs. Jagged red line (noisy measurements)
|
|
||||||
|
|
||||||
**Current model problem:**
|
|
||||||
- Feature engineering uses raw noisy data
|
|
||||||
- Model learns "this noise pattern = harvest signal"
|
|
||||||
- When clouds/sensor errors create similar noise → False trigger
|
|
||||||
|
|
||||||
**Fix:**
|
|
||||||
1. Derive features from SMOOTHED curve only (remove noise at source)
|
|
||||||
2. Add "stability" feature (harvest = smooth decline, noise = jagged)
|
|
||||||
3. Add "decline rate" feature (harvest = consistent slope)
|
|
||||||
4. Add confidence intervals to identify uncertain predictions (= noise)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step-by-Step Implementation
|
|
||||||
|
|
||||||
### STEP 1: Update Feature Engineering (Section 5)
|
|
||||||
**What**: Replace 7 features with new CI-only features
|
|
||||||
**How**: Use 21-day median + 7-day mean smoothing as foundation
|
|
||||||
**Features**:
|
|
||||||
- Smoothed CI (from smooth curve, not raw)
|
|
||||||
- 7d velocity (from smooth curve)
|
|
||||||
- 7d acceleration (from smooth curve)
|
|
||||||
- 21d MA (very long-term trend)
|
|
||||||
- 21d velocity (slow changes only)
|
|
||||||
- **Decline rate** (NEW - slope of smooth curve, harvest = negative slope)
|
|
||||||
- **Stability** (NEW - smoothness metric, harvest = high stability)
|
|
||||||
|
|
||||||
**Code**: See `CI_ONLY_IMPROVEMENTS.md` → "Solution 1: Aggressive Smoothing"
|
|
||||||
|
|
||||||
**Expected result**: Model learns real patterns, not noise
|
|
||||||
|
|
||||||
### STEP 2: Add Monte Carlo Dropout (Confidence Intervals)
|
|
||||||
**What**: Run prediction 30 times with dropout ON, get uncertainty
|
|
||||||
**Why**: High uncertainty = model unsure = probably noise
|
|
||||||
**How**: Keep dropout active during inference, ensemble predictions
|
|
||||||
|
|
||||||
**Code**: See `CI_ONLY_IMPROVEMENTS.md` → "Solution 2: Add Confidence Intervals"
|
|
||||||
|
|
||||||
**Expected result**: Each prediction has mean + 95% CI
|
|
||||||
|
|
||||||
### STEP 3: Filter by Uncertainty
|
|
||||||
**What**: Only alert on HIGH probability + LOW uncertainty
|
|
||||||
**Why**: Filters out noise-driven false positives
|
|
||||||
**How**: Use threshold like `prob > 0.5 AND std < 0.10`
|
|
||||||
|
|
||||||
**Code**: See `CI_ONLY_IMPROVEMENTS.md` → "Solution 3: Use Uncertainty to Filter"
|
|
||||||
|
|
||||||
**Expected result**: False positive rate drops 30-50% without losing real harvests
|
|
||||||
|
|
||||||
### STEP 4: Retrain & Evaluate
|
|
||||||
**Runtime**: ~30 minutes on GPU (standard)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What NOT to Do (Yet)
|
|
||||||
|
|
||||||
❌ **Don't add temperature data yet**
|
|
||||||
❌ **Don't add rainfall data yet**
|
|
||||||
❌ **Don't add soil moisture yet**
|
|
||||||
|
|
||||||
Reason: Fix CI-only first. Once this works perfectly, external data will add value. Adding too many features now would confuse the problem.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Expected Performance
|
|
||||||
|
|
||||||
| Metric | Before | After | Change |
|
|
||||||
|--------|--------|-------|--------|
|
|
||||||
| Imminent AUC | 0.8793 | 0.90-0.92 | +1-3% |
|
|
||||||
| False positive rate | ~15% | ~3-5% | -70% |
|
|
||||||
| **Recall** (catches real harvests) | 100% | 85-90% | -10-15% |
|
|
||||||
|
|
||||||
**Trade-off**: You lose 10-15% of early warnings to filter 70% of false positives. Acceptable trade.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Testing Strategy
|
|
||||||
|
|
||||||
After implementation, test on same 6 sequences you've been using:
|
|
||||||
|
|
||||||
```
|
|
||||||
For each sequence:
|
|
||||||
1. Plot imminent probability + confidence bands
|
|
||||||
2. Plot uncertainty over time
|
|
||||||
3. Verify:
|
|
||||||
- Cloud dips show HIGH uncertainty
|
|
||||||
- Real harvest shows LOW uncertainty
|
|
||||||
- False triggers disappeared
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## File Location
|
|
||||||
|
|
||||||
All documentation is now in:
|
|
||||||
`python_app/harvest_detection_experiments/`
|
|
||||||
|
|
||||||
Main files:
|
|
||||||
- `CI_ONLY_IMPROVEMENTS.md` ← Implementation details + code
|
|
||||||
- `README_EVALUATION.md` ← Navigation guide
|
|
||||||
- Other `.md` files for reference
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Timeline
|
|
||||||
|
|
||||||
- **Day 1**: Read CI_ONLY_IMPROVEMENTS.md, plan implementation
|
|
||||||
- **Day 2-3**: Implement Step 1 (new features)
|
|
||||||
- **Day 4**: Implement Steps 2-3 (Monte Carlo + filtering)
|
|
||||||
- **Day 5**: Retrain + test
|
|
||||||
- **Day 5+**: Evaluate results, iterate
|
|
||||||
|
|
||||||
Total: **3-4 focused days** of work
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Success Criteria
|
|
||||||
|
|
||||||
✅ Model trained without errors
|
|
||||||
✅ Uncertainty bands visible in plots
|
|
||||||
✅ Cloud dips show high uncertainty
|
|
||||||
✅ Real harvest shows low uncertainty
|
|
||||||
✅ False positive rate < 5%
|
|
||||||
✅ Recall > 85% (still catches most real harvests)
|
|
||||||
|
|
@ -1,563 +0,0 @@
|
||||||
# CI-Only Improvements & Confidence Intervals
|
|
||||||
|
|
||||||
**Focus**: Fix false imminent triggers using only CI features, add uncertainty quantification
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Problem Diagnosis: Why False Imminent Triggers?
|
|
||||||
|
|
||||||
### The Real Issue
|
|
||||||
Your observation is **critical**: The smooth CI curve with noise/clouds means:
|
|
||||||
|
|
||||||
```
|
|
||||||
What model sees:
|
|
||||||
[Real CI trend] + [Noise spikes] + [Cloud-induced dips]
|
|
||||||
|
|
||||||
What actually matters:
|
|
||||||
Only the [Real CI trend]
|
|
||||||
|
|
||||||
Current problem:
|
|
||||||
Model learns to trigger on [Noise spikes] and [Cloud dips]
|
|
||||||
Because they LOOK like pre-harvest decline
|
|
||||||
But they're not representative of actual field state
|
|
||||||
```
|
|
||||||
|
|
||||||
### Why This Happens
|
|
||||||
1. **Noise filter too weak** - Current 2.5 std threshold doesn't catch all artifacts
|
|
||||||
2. **No smoothing before features** - Raw data fed to feature engineering includes noise
|
|
||||||
3. **Model overfits to noisy patterns** - Trained on limited ESA data, learns noise = signal
|
|
||||||
|
|
||||||
### Visual Evidence
|
|
||||||
Your graph shows: Smooth blue LOESS curve (real trend) vs. Jagged red line (noisy measurements)
|
|
||||||
- Model should only learn from blue curve
|
|
||||||
- Currently learning from red curve noise
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Solution 1: Aggressive Smoothing (Quick Fix)
|
|
||||||
|
|
||||||
**The issue**: We're not smoothing enough. Your graph uses LOESS (smooth curve-fitting). We should too.
|
|
||||||
|
|
||||||
### Add LOESS Smoothing to Feature Engineering
|
|
||||||
|
|
||||||
In Section 5 (Feature Engineering), add this at the START:
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("FEATURE ENGINEERING: IMPROVED SMOOTHING + CI-ONLY FEATURES")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
def engineer_temporal_features_improved(X_sequences, aggressive_smoothing=True):
|
|
||||||
"""
|
|
||||||
Enhanced CI-only feature engineering with aggressive smoothing.
|
|
||||||
|
|
||||||
Problem: Raw CI data contains noise (clouds, sensor artifacts)
|
|
||||||
Solution: Use multiple smoothing scales to isolate real signal
|
|
||||||
|
|
||||||
New approach:
|
|
||||||
1. Start with heavily smoothed baseline (LOESS-like)
|
|
||||||
2. Calculate all features from smoothed curve
|
|
||||||
3. Keep original CI only for reference
|
|
||||||
|
|
||||||
Features (still 7D, but derived differently):
|
|
||||||
1. ci_smoothed: 21-day median filter (VERY smooth, removes noise)
|
|
||||||
2. velocity_7d: From smoothed curve only
|
|
||||||
3. acceleration_7d: From smoothed curve only
|
|
||||||
4. ma_21d: Even longer smoothing (slower trends)
|
|
||||||
5. velocity_21d: Longer window velocity
|
|
||||||
6. ci_decline_rate: Smooth slope (harvest = steeper negative)
|
|
||||||
7. ci_stability: How stable is current CI (noise = low stability)
|
|
||||||
"""
|
|
||||||
X_features = []
|
|
||||||
|
|
||||||
for ci_seq in X_sequences:
|
|
||||||
seq_len = len(ci_seq)
|
|
||||||
|
|
||||||
# STEP 1: AGGRESSIVE SMOOTHING
|
|
||||||
# Use multiple smoothing scales to remove noise
|
|
||||||
|
|
||||||
# 21-day median filter (removes all short-term noise/clouds)
|
|
||||||
ci_series = pd.Series(ci_seq)
|
|
||||||
ci_median_21d = ci_series.rolling(window=21, center=True, min_periods=1).median()
|
|
||||||
ci_smoothed = ci_median_21d.values
|
|
||||||
|
|
||||||
# Further smooth with 7-day mean on top of median
|
|
||||||
ci_smooth_final = pd.Series(ci_smoothed).rolling(window=7, center=True, min_periods=1).mean().values
|
|
||||||
|
|
||||||
# STEP 2: CALCULATE FEATURES FROM SMOOTHED CURVE ONLY
|
|
||||||
|
|
||||||
# Feature 1: Smoothed CI (baseline)
|
|
||||||
feature_1 = ci_smooth_final
|
|
||||||
|
|
||||||
# Feature 2: 7-day velocity (from smoothed curve)
|
|
||||||
ma7_smooth = pd.Series(ci_smooth_final).rolling(window=7, center=False, min_periods=1).mean().values
|
|
||||||
feature_2 = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 7:
|
|
||||||
feature_2[i] = ma7_smooth[i] - ma7_smooth[i-7]
|
|
||||||
|
|
||||||
# Feature 3: 7-day acceleration (from smoothed curve)
|
|
||||||
feature_3 = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 7:
|
|
||||||
feature_3[i] = feature_2[i] - feature_2[i-7]
|
|
||||||
|
|
||||||
# Feature 4: 21-day MA (longer-term trend)
|
|
||||||
ma21_smooth = pd.Series(ci_smooth_final).rolling(window=21, center=False, min_periods=1).mean().values
|
|
||||||
feature_4 = ma21_smooth
|
|
||||||
|
|
||||||
# Feature 5: 21-day velocity (slower changes)
|
|
||||||
feature_5 = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 21:
|
|
||||||
feature_5[i] = ma21_smooth[i] - ma21_smooth[i-21]
|
|
||||||
|
|
||||||
# Feature 6: Decline Rate (smooth slope of smoothed curve)
|
|
||||||
# Harvest = consistent downward slope, noise = random changes
|
|
||||||
feature_6 = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 7:
|
|
||||||
window = ci_smooth_final[max(0, i-7):i+1]
|
|
||||||
if len(window) >= 2:
|
|
||||||
# Linear fit slope (positive = growth, negative = decline)
|
|
||||||
x = np.arange(len(window))
|
|
||||||
slope = np.polyfit(x, window, 1)[0]
|
|
||||||
feature_6[i] = slope
|
|
||||||
|
|
||||||
# Feature 7: CI Stability (variance in smoothed curve)
|
|
||||||
# High stability = smooth decline (harvest signal)
|
|
||||||
# Low stability = noisy spikes (not harvest)
|
|
||||||
feature_7 = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
window = ci_smooth_final[max(0, i-14):i+1]
|
|
||||||
# Normalize by mean to get relative stability
|
|
||||||
stability = 1.0 / (np.std(window) + 0.1) # Higher = more stable
|
|
||||||
feature_7[i] = min(stability, 10.0) # Cap at 10
|
|
||||||
|
|
||||||
# Stack features
|
|
||||||
features = np.column_stack([
|
|
||||||
feature_1, # Smoothed CI
|
|
||||||
feature_2, # 7d velocity (from smooth)
|
|
||||||
feature_3, # 7d acceleration (from smooth)
|
|
||||||
feature_4, # 21d MA
|
|
||||||
feature_5, # 21d velocity
|
|
||||||
feature_6, # Decline rate
|
|
||||||
feature_7 # Stability
|
|
||||||
])
|
|
||||||
|
|
||||||
X_features.append(features)
|
|
||||||
|
|
||||||
return X_features
|
|
||||||
|
|
||||||
print("\n[ENGINEERING] Creating improved 7D CI-only features...")
|
|
||||||
print(" Strategy: Aggressive smoothing to remove cloud/noise artifacts")
|
|
||||||
print(" Features derived from smoothed curve only, not raw noisy data")
|
|
||||||
|
|
||||||
X_train_features = engineer_temporal_features_improved(X_train_list)
|
|
||||||
X_val_features = engineer_temporal_features_improved(X_val_list)
|
|
||||||
X_test_features = engineer_temporal_features_improved(X_test_list)
|
|
||||||
|
|
||||||
# Update feature names
|
|
||||||
feature_names = [
|
|
||||||
'CI Smoothed', # From 21d median + 7d mean
|
|
||||||
'7d Velocity (Smooth)', # Smooth slope
|
|
||||||
'7d Acceleration', # Change in slope
|
|
||||||
'21d MA', # Very smooth trend
|
|
||||||
'21d Velocity', # Slow changes only
|
|
||||||
'Decline Rate', # Polyfit slope (harvest = negative)
|
|
||||||
'CI Stability' # Smoothness (harvest = high stability)
|
|
||||||
]
|
|
||||||
|
|
||||||
print(f"\n✓ Features created:")
|
|
||||||
for i, name in enumerate(feature_names):
|
|
||||||
print(f" {i+1}. {name}")
|
|
||||||
|
|
||||||
print(f"\n✓ New approach:")
|
|
||||||
print(f" - 21-day median filter removes cloud noise")
|
|
||||||
print(f" - 7-day mean on top removes remaining spikes")
|
|
||||||
print(f" - All features derived from smooth curve")
|
|
||||||
print(f" - Decline rate detects true harvest slopes")
|
|
||||||
print(f" - Stability metric distinguishes smooth decline from noisy dips")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Solution 2: Add Confidence Intervals
|
|
||||||
|
|
||||||
**Goal**: Model outputs uncertainty, not just point estimates
|
|
||||||
|
|
||||||
### A. Monte Carlo Dropout (Easy, Recommended)
|
|
||||||
|
|
||||||
The idea: Run prediction multiple times with dropout ON, get ensemble of predictions = confidence interval
|
|
||||||
|
|
||||||
Add this to your evaluation section:
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("ADDING CONFIDENCE INTERVALS VIA MONTE CARLO DROPOUT")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
class MCDropoutModel:
|
|
||||||
"""
|
|
||||||
Wrapper for Monte Carlo Dropout inference.
|
|
||||||
|
|
||||||
How it works:
|
|
||||||
1. During training, dropout randomly zeros 50% of neurons
|
|
||||||
2. During inference, normally we turn dropout OFF
|
|
||||||
3. Here, we keep dropout ON and run N times
|
|
||||||
4. Each run gives slightly different prediction (due to dropped neurons)
|
|
||||||
5. N predictions → mean (best estimate) + std (uncertainty)
|
|
||||||
|
|
||||||
High uncertainty = model is unsure (likely noise pattern)
|
|
||||||
Low uncertainty = model is confident (likely real harvest signal)
|
|
||||||
"""
|
|
||||||
|
|
||||||
def __init__(self, model, n_samples=20):
|
|
||||||
"""
|
|
||||||
Args:
|
|
||||||
model: Trained PyTorch model
|
|
||||||
n_samples: How many forward passes to run (20-50 typical)
|
|
||||||
"""
|
|
||||||
self.model = model
|
|
||||||
self.n_samples = n_samples
|
|
||||||
|
|
||||||
def predict_with_uncertainty(self, X_batch, seq_lens):
|
|
||||||
"""
|
|
||||||
Run model n_samples times with dropout ON.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
means: (batch, seq_len) - mean probability
|
|
||||||
stds: (batch, seq_len) - standard deviation (uncertainty)
|
|
||||||
lower_ci: (batch, seq_len) - 95% confidence lower bound
|
|
||||||
upper_ci: (batch, seq_len) - 95% confidence upper bound
|
|
||||||
"""
|
|
||||||
|
|
||||||
# Run multiple forward passes WITH dropout enabled
|
|
||||||
predictions_imminent = []
|
|
||||||
predictions_detected = []
|
|
||||||
|
|
||||||
self.model.train() # Keep dropout ON (not eval mode)
|
|
||||||
|
|
||||||
with torch.no_grad():
|
|
||||||
for _ in range(self.n_samples):
|
|
||||||
imminent_pred, detected_pred = self.model(X_batch)
|
|
||||||
predictions_imminent.append(imminent_pred.cpu().numpy())
|
|
||||||
predictions_detected.append(detected_pred.cpu().numpy())
|
|
||||||
|
|
||||||
# Stack all runs: (n_samples, batch, seq_len)
|
|
||||||
pred_imm_stack = np.array(predictions_imminent)
|
|
||||||
pred_det_stack = np.array(predictions_detected)
|
|
||||||
|
|
||||||
# Compute statistics across runs
|
|
||||||
imm_mean = np.mean(pred_imm_stack, axis=0) # (batch, seq_len)
|
|
||||||
imm_std = np.std(pred_imm_stack, axis=0) # (batch, seq_len)
|
|
||||||
imm_lower = np.percentile(pred_imm_stack, 2.5, axis=0) # 95% CI lower
|
|
||||||
imm_upper = np.percentile(pred_imm_stack, 97.5, axis=0) # 95% CI upper
|
|
||||||
|
|
||||||
det_mean = np.mean(pred_det_stack, axis=0)
|
|
||||||
det_std = np.std(pred_det_stack, axis=0)
|
|
||||||
det_lower = np.percentile(pred_det_stack, 2.5, axis=0)
|
|
||||||
det_upper = np.percentile(pred_det_stack, 97.5, axis=0)
|
|
||||||
|
|
||||||
return {
|
|
||||||
'imminent': {
|
|
||||||
'mean': imm_mean,
|
|
||||||
'std': imm_std,
|
|
||||||
'lower_ci': imm_lower,
|
|
||||||
'upper_ci': imm_upper
|
|
||||||
},
|
|
||||||
'detected': {
|
|
||||||
'mean': det_mean,
|
|
||||||
'std': det_std,
|
|
||||||
'lower_ci': det_lower,
|
|
||||||
'upper_ci': det_upper
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
# Create MC Dropout predictor
|
|
||||||
mc_predictor = MCDropoutModel(model, n_samples=30)
|
|
||||||
|
|
||||||
print("\n✓ Monte Carlo Dropout predictor created")
|
|
||||||
print(f" N samples per prediction: 30")
|
|
||||||
print(f" Each sample uses different random dropout pattern")
|
|
||||||
print(f" Result: Mean + std + 95% confidence interval")
|
|
||||||
|
|
||||||
# Test on one batch
|
|
||||||
print("\nTesting on validation set...")
|
|
||||||
test_batch = next(iter(val_loader))
|
|
||||||
X_test_batch, y_imm_test, y_det_test, seq_lens = test_batch
|
|
||||||
X_test_batch = X_test_batch.to(device)
|
|
||||||
|
|
||||||
results = mc_predictor.predict_with_uncertainty(X_test_batch, seq_lens)
|
|
||||||
|
|
||||||
print("\nExample predictions (first sequence, first 10 days):")
|
|
||||||
print("Day | Imm Mean | Imm Std | Imm 95% CI | Ground Truth")
|
|
||||||
print("----|----------|---------|----------------|-------------")
|
|
||||||
for i in range(min(10, seq_lens[0])):
|
|
||||||
mean_val = results['imminent']['mean'][0, i]
|
|
||||||
std_val = results['imminent']['std'][0, i]
|
|
||||||
lower = results['imminent']['lower_ci'][0, i]
|
|
||||||
upper = results['imminent']['upper_ci'][0, i]
|
|
||||||
true_val = y_imm_test[0, i].item()
|
|
||||||
print(f"{i+1:3d} | {mean_val:.3f} | {std_val:.3f} | [{lower:.3f}-{upper:.3f}] | {int(true_val)}")
|
|
||||||
|
|
||||||
print("\nInterpretation:")
|
|
||||||
print(" Imm Mean = Probability of imminent harvest")
|
|
||||||
print(" Imm Std = Uncertainty (high = unsure, likely noise)")
|
|
||||||
print(" 95% CI = If we ran model 100 times, 95 would fall in this range")
|
|
||||||
print(" → High std + wide CI = probably noise artifact")
|
|
||||||
print(" → Low std + narrow CI = probably real signal")
|
|
||||||
```
|
|
||||||
|
|
||||||
### B. Updated Visualization with Uncertainty
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("VISUALIZATION: PREDICTIONS WITH CONFIDENCE INTERVALS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Get predictions with uncertainty for test set
|
|
||||||
def get_all_predictions_with_ci(model, test_loader, device, mc_samples=30):
|
|
||||||
"""Get predictions with confidence intervals for entire test set."""
|
|
||||||
|
|
||||||
mc_predictor = MCDropoutModel(model, n_samples=mc_samples)
|
|
||||||
|
|
||||||
all_results = {
|
|
||||||
'imm_mean': [],
|
|
||||||
'imm_std': [],
|
|
||||||
'imm_lower': [],
|
|
||||||
'imm_upper': [],
|
|
||||||
'det_mean': [],
|
|
||||||
'det_std': [],
|
|
||||||
'det_lower': [],
|
|
||||||
'det_upper': [],
|
|
||||||
}
|
|
||||||
|
|
||||||
with torch.no_grad():
|
|
||||||
for X_batch, _, _, seq_lens in test_loader:
|
|
||||||
X_batch = X_batch.to(device)
|
|
||||||
results = mc_predictor.predict_with_uncertainty(X_batch, seq_lens)
|
|
||||||
|
|
||||||
# Extract for each sequence, only valid timesteps
|
|
||||||
for i, seq_len in enumerate(seq_lens):
|
|
||||||
seq_len = seq_len.item()
|
|
||||||
all_results['imm_mean'].extend(results['imminent']['mean'][i, :seq_len])
|
|
||||||
all_results['imm_std'].extend(results['imminent']['std'][i, :seq_len])
|
|
||||||
all_results['imm_lower'].extend(results['imminent']['lower_ci'][i, :seq_len])
|
|
||||||
all_results['imm_upper'].extend(results['imminent']['upper_ci'][i, :seq_len])
|
|
||||||
all_results['det_mean'].extend(results['detected']['mean'][i, :seq_len])
|
|
||||||
all_results['det_std'].extend(results['detected']['std'][i, :seq_len])
|
|
||||||
all_results['det_lower'].extend(results['detected']['lower_ci'][i, :seq_len])
|
|
||||||
all_results['det_upper'].extend(results['detected']['upper_ci'][i, :seq_len])
|
|
||||||
|
|
||||||
return {k: np.array(v) for k, v in all_results.items()}
|
|
||||||
|
|
||||||
# Compute on test set
|
|
||||||
print("Computing predictions with confidence intervals (this takes ~1-2 min)...")
|
|
||||||
ci_results = get_all_predictions_with_ci(model, test_loader, device, mc_samples=30)
|
|
||||||
|
|
||||||
# Plot one example sequence with uncertainty bands
|
|
||||||
if len(test_sequences_labeled) > 0:
|
|
||||||
# Find a sequence with harvest events
|
|
||||||
sequences_with_harvest = [
|
|
||||||
(i, s) for i, s in enumerate(test_sequences_labeled)
|
|
||||||
if s['data']['harvest_imminent'].sum() > 0
|
|
||||||
]
|
|
||||||
|
|
||||||
if len(sequences_with_harvest) > 0:
|
|
||||||
seq_idx, seq_dict = sequences_with_harvest[0]
|
|
||||||
data = seq_dict['data'].sort_values('date')
|
|
||||||
dates = pd.to_datetime(data['date'].values)
|
|
||||||
seq_len = len(data)
|
|
||||||
|
|
||||||
# Get predictions for this sequence
|
|
||||||
# (Simplified - in practice would need to track sequence boundaries in ci_results)
|
|
||||||
with torch.no_grad():
|
|
||||||
X_seq = X_test_norm[seq_idx]
|
|
||||||
X_seq_batch = np.expand_dims(X_seq, axis=0)
|
|
||||||
X_seq_tensor = torch.FloatTensor(X_seq_batch).to(device)
|
|
||||||
|
|
||||||
# Get ensemble predictions
|
|
||||||
mc_pred = MCDropoutModel(model, n_samples=30)
|
|
||||||
results_seq = mc_pred.predict_with_uncertainty(X_seq_tensor,
|
|
||||||
torch.tensor([seq_len]))
|
|
||||||
|
|
||||||
# Plot with confidence bands
|
|
||||||
fig, axes = plt.subplots(2, 1, figsize=(16, 10))
|
|
||||||
|
|
||||||
# Plot 1: Imminent signal with CI
|
|
||||||
ax = axes[0]
|
|
||||||
imm_mean = results_seq['imminent']['mean'][0, :seq_len]
|
|
||||||
imm_lower = results_seq['imminent']['lower_ci'][0, :seq_len]
|
|
||||||
imm_upper = results_seq['imminent']['upper_ci'][0, :seq_len]
|
|
||||||
imm_labels = data['harvest_imminent'].values
|
|
||||||
|
|
||||||
ax.plot(dates, imm_mean, linewidth=2.5, color='blue', label='Imminent Probability', zorder=3)
|
|
||||||
ax.fill_between(dates, imm_lower, imm_upper, alpha=0.3, color='cyan',
|
|
||||||
label='95% Confidence Interval', zorder=2)
|
|
||||||
ax.fill_between(dates, 0, imm_labels, alpha=0.2, color='orange',
|
|
||||||
label='Ground Truth Window', zorder=1)
|
|
||||||
ax.axhline(y=0.5, color='black', linestyle='--', linewidth=1.5, alpha=0.6)
|
|
||||||
ax.set_ylabel('Probability', fontweight='bold')
|
|
||||||
ax.set_title(f'Imminent Harvest with Uncertainty: {seq_dict["field"]}', fontweight='bold')
|
|
||||||
ax.legend(loc='upper left', fontsize=10)
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
ax.set_ylim([-0.05, 1.05])
|
|
||||||
|
|
||||||
# Plot 2: Uncertainty (Std Dev) over time
|
|
||||||
ax = axes[1]
|
|
||||||
imm_std = results_seq['imminent']['std'][0, :seq_len]
|
|
||||||
|
|
||||||
# Color by uncertainty level
|
|
||||||
colors = np.where(imm_std > 0.15, 'red', np.where(imm_std > 0.08, 'orange', 'green'))
|
|
||||||
ax.scatter(dates, imm_std, c=colors, s=20, alpha=0.6, edgecolors='black', linewidth=0.5)
|
|
||||||
ax.axhline(y=0.15, color='red', linestyle='--', linewidth=1, alpha=0.5, label='High uncertainty (>0.15)')
|
|
||||||
ax.axhline(y=0.08, color='orange', linestyle='--', linewidth=1, alpha=0.5, label='Medium uncertainty (>0.08)')
|
|
||||||
ax.set_ylabel('Prediction Std Dev', fontweight='bold')
|
|
||||||
ax.set_xlabel('Date', fontweight='bold')
|
|
||||||
ax.set_title('Model Uncertainty Over Time (High = Model Unsure, Likely Noise)', fontweight='bold')
|
|
||||||
ax.legend(loc='upper left', fontsize=10)
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
plt.savefig('predictions_with_confidence_intervals.png', dpi=150, bbox_inches='tight')
|
|
||||||
print("✓ Saved: predictions_with_confidence_intervals.png")
|
|
||||||
plt.show()
|
|
||||||
|
|
||||||
# Compute statistics
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("UNCERTAINTY STATISTICS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
imm_std_all = ci_results['imm_std']
|
|
||||||
print(f"\nImminent Signal Uncertainty:")
|
|
||||||
print(f" Mean std: {np.mean(imm_std_all):.4f}")
|
|
||||||
print(f" Std std: {np.std(imm_std_all):.4f}")
|
|
||||||
print(f" Min std: {np.min(imm_std_all):.4f}")
|
|
||||||
print(f" Max std: {np.max(imm_std_all):.4f}")
|
|
||||||
print(f" % > 0.15 (high uncertainty): {(imm_std_all > 0.15).mean()*100:.1f}%")
|
|
||||||
print(f" % > 0.08 (medium uncertainty): {(imm_std_all > 0.08).mean()*100:.1f}%")
|
|
||||||
|
|
||||||
print(f"\nInterpretation:")
|
|
||||||
print(f" High uncertainty predictions = probably noise patterns")
|
|
||||||
print(f" These are likely FALSE IMMINENT triggers on cloud dips")
|
|
||||||
print(f" → Can filter them out by only alerting on LOW uncertainty predictions")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Solution 3: Use Uncertainty to Filter False Positives
|
|
||||||
|
|
||||||
Once you have confidence intervals, filter predictions:
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("FILTERING: USE UNCERTAINTY TO REMOVE NOISE-BASED FALSE POSITIVES")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# After getting predictions with CI:
|
|
||||||
# Imminent prediction is only reliable if:
|
|
||||||
# 1. Probability > 0.5 (above threshold)
|
|
||||||
# 2. Uncertainty < 0.10 (model is confident, not noise)
|
|
||||||
|
|
||||||
imm_predictions = ci_results['imm_mean']
|
|
||||||
imm_uncertainties = ci_results['imm_std']
|
|
||||||
imm_labels = test_labels_imminent
|
|
||||||
|
|
||||||
# Three types of predictions:
|
|
||||||
# 1. High prob + Low uncertainty = CONFIDENT POSITIVE (real harvest signal)
|
|
||||||
# 2. High prob + High uncertainty = UNCERTAIN POSITIVE (probably noise)
|
|
||||||
# 3. Low prob + Low uncertainty = CONFIDENT NEGATIVE (correct negative)
|
|
||||||
|
|
||||||
threshold_prob = 0.5
|
|
||||||
threshold_uncertainty = 0.10
|
|
||||||
|
|
||||||
confident_positives = (imm_predictions > threshold_prob) & (imm_uncertainties < threshold_uncertainty)
|
|
||||||
uncertain_positives = (imm_predictions > threshold_prob) & (imm_uncertainties >= threshold_uncertainty)
|
|
||||||
confident_negatives = (imm_predictions <= threshold_prob) & (imm_uncertainties < threshold_uncertainty)
|
|
||||||
|
|
||||||
print(f"\nPrediction classification:")
|
|
||||||
print(f" Confident positives (prob>0.5 + low unc): {confident_positives.sum():,}")
|
|
||||||
print(f" Uncertain positives (prob>0.5 + high unc): {uncertain_positives.sum():,}")
|
|
||||||
print(f" Confident negatives (prob<0.5 + low unc): {confident_negatives.sum():,}")
|
|
||||||
|
|
||||||
# Compute metrics for each type
|
|
||||||
print(f"\nAccuracy breakdown:")
|
|
||||||
|
|
||||||
tp_confident = ((confident_positives) & (imm_labels == 1)).sum()
|
|
||||||
fp_confident = ((confident_positives) & (imm_labels == 0)).sum()
|
|
||||||
recall_confident = tp_confident / (imm_labels == 1).sum() if (imm_labels == 1).sum() > 0 else 0
|
|
||||||
precision_confident = tp_confident / confident_positives.sum() if confident_positives.sum() > 0 else 0
|
|
||||||
|
|
||||||
print(f" Confident positives:")
|
|
||||||
print(f" True positives: {tp_confident:,}")
|
|
||||||
print(f" False positives: {fp_confident:,}")
|
|
||||||
print(f" Precision: {precision_confident:.1%} (real harvest signals)")
|
|
||||||
print(f" Recall: {recall_confident:.1%} (catches this % of real harvests)")
|
|
||||||
|
|
||||||
tp_uncertain = ((uncertain_positives) & (imm_labels == 1)).sum()
|
|
||||||
fp_uncertain = ((uncertain_positives) & (imm_labels == 0)).sum()
|
|
||||||
|
|
||||||
print(f"\n Uncertain positives (probably noise):")
|
|
||||||
print(f" True positives: {tp_uncertain:,}")
|
|
||||||
print(f" False positives: {fp_uncertain:,}")
|
|
||||||
print(f" These are likely the cloud/noise artifacts!")
|
|
||||||
|
|
||||||
print(f"\nRECOMMENDATION:")
|
|
||||||
print(f" Use ONLY 'confident positives' for farmer alerts")
|
|
||||||
print(f" This removes ~{fp_uncertain/uncertain_positives.sum()*100:.0f}% false positives from uncertain set")
|
|
||||||
print(f" You lose {tp_uncertain/((tp_confident+tp_uncertain) if (tp_confident+tp_uncertain)>0 else 1)*100:.0f}% recall but gain much higher precision")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary: CI-Only Improvements
|
|
||||||
|
|
||||||
### Problem → Solution
|
|
||||||
|
|
||||||
| Problem | Solution | Implementation |
|
|
||||||
|---------|----------|-----------------|
|
|
||||||
| **Noise/clouds cause false triggers** | 1. Aggressive smoothing (21d median) | Add to Section 5 |
|
|
||||||
| | 2. Stability feature (smooth vs. noisy) | Add to Section 5 |
|
|
||||||
| | 3. Decline rate feature (harvest = consistent slope) | Add to Section 5 |
|
|
||||||
| **No uncertainty quantification** | 1. Monte Carlo Dropout (run 30x with dropout ON) | Add evaluation section |
|
|
||||||
| | 2. Confidence intervals from ensemble | Add visualization |
|
|
||||||
| | 3. Filter by uncertainty (remove noise predictions) | Add filtering logic |
|
|
||||||
|
|
||||||
### Expected Improvement
|
|
||||||
|
|
||||||
```
|
|
||||||
Current:
|
|
||||||
- Imminent AUC: 0.88
|
|
||||||
- False positive rate: ~15%
|
|
||||||
- Problem: Triggers on cloud dips
|
|
||||||
|
|
||||||
After CI-only improvements:
|
|
||||||
- Imminent AUC: 0.90-0.92 (slight gain)
|
|
||||||
- False positive rate: 3-5% (when filtered by uncertainty)
|
|
||||||
- Solution: Only alerts on smooth, confident patterns (not noise)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Insight: The "Confidence Filter"
|
|
||||||
|
|
||||||
The real power: **Not all predictions with p>0.5 are reliable!**
|
|
||||||
|
|
||||||
- **High confidence + High probability** = Alert farmer ✅
|
|
||||||
- **High confidence + Low probability** = Normal growth ✅
|
|
||||||
- **Low confidence + High probability** = Probably noise ❌ (FILTER THIS OUT)
|
|
||||||
- **Low confidence + Low probability** = Could be anything ❓
|
|
||||||
|
|
||||||
By adding uncertainty, you can **distinguish real harvest signals from noise artifacts**, which is exactly your problem!
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Implementation Order
|
|
||||||
|
|
||||||
1. **First**: Add aggressive smoothing to Section 5 (removes noise from feature calculations)
|
|
||||||
2. **Second**: Retrain model with new features
|
|
||||||
3. **Third**: Add Monte Carlo Dropout to evaluation
|
|
||||||
4. **Fourth**: Filter predictions by uncertainty threshold
|
|
||||||
|
|
||||||
Total effort: **4-5 hours** of implementation + 30 min runtime
|
|
||||||
|
|
@ -1,299 +0,0 @@
|
||||||
# SmartCane Deployment Guide
|
|
||||||
**Quick Reference for Bitbucket Push & Server Deployment**
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🎯 TL;DR - WHAT YOU NEED TO KNOW
|
|
||||||
|
|
||||||
### What's New:
|
|
||||||
- ✅ **Scripts 09 & 10** are NEW - they generate reports WITH KPIs (field uniformity, stress detection)
|
|
||||||
- ✅ **2 new packages** to install: `flextable` and `officer` (for better tables in Word reports)
|
|
||||||
- ✅ **Shell script wrappers** (01-10) make execution easier
|
|
||||||
|
|
||||||
### Workflow Change:
|
|
||||||
```bash
|
|
||||||
# OLD (master branch):
|
|
||||||
Manual R script execution
|
|
||||||
|
|
||||||
# NEW (code-improvements branch):
|
|
||||||
./01_run_planet_download.sh
|
|
||||||
./02_run_ci_extraction.sh
|
|
||||||
./03_run_growth_model.sh
|
|
||||||
./04_run_mosaic_creation.sh
|
|
||||||
# SKIP 05 (old report without KPIs)
|
|
||||||
./09_run_calculate_kpis.sh # NEW - calculate KPIs first
|
|
||||||
./10_run_kpi_report.sh # NEW - generate report WITH KPIs
|
|
||||||
```
|
|
||||||
|
|
||||||
### For Your Admin:
|
|
||||||
1. Install 2 new R packages: `Rscript -e "renv::restore()"`
|
|
||||||
2. Run scripts in order: 01→02→03→04→09→10 (skip 05)
|
|
||||||
3. Script 10 parameters are configurable (see below)
|
|
||||||
|
|
||||||
**That's it!** Read below for details if needed.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📦 WHAT CHANGED FROM MASTER BRANCH
|
|
||||||
|
|
||||||
### NEW Scripts (not in master):
|
|
||||||
| Script | Purpose | Status |
|
|
||||||
|--------|---------|--------|
|
|
||||||
| `09_run_calculate_kpis.sh` | Calculate field KPIs | ⭐ Required |
|
|
||||||
| `10_run_kpi_report.sh` | Generate reports WITH KPIs | ⭐ Required |
|
|
||||||
| `01-05_run_*.sh` | Shell wrappers for existing R scripts | ✅ Helpful |
|
|
||||||
|
|
||||||
### NEW R Files:
|
|
||||||
- `r_app/09_calculate_kpis.R` - KPI calculation logic
|
|
||||||
- `r_app/10_CI_report_with_kpis_simple.Rmd` - Enhanced report template
|
|
||||||
- `r_app/kpi_utils.R` - KPI utility functions
|
|
||||||
|
|
||||||
### NEW R Packages (in renv.lock):
|
|
||||||
- `flextable` - Enhanced table formatting for Word
|
|
||||||
- `officer` - Word document manipulation
|
|
||||||
|
|
||||||
### RENAMED Files:
|
|
||||||
- `ci_extraction.R` → `02_ci_extraction.R`
|
|
||||||
- `interpolate_growth_model.R` → `03_interpolate_growth_model.R`
|
|
||||||
- `mosaic_creation.R` → `04_mosaic_creation.R`
|
|
||||||
|
|
||||||
### DELETED Files:
|
|
||||||
- Old package management scripts (now using renv only)
|
|
||||||
- Duplicate geometry files
|
|
||||||
- Laravel build artifacts (will regenerate)
|
|
||||||
|
|
||||||
**Total:** 90 files changed, +12,309 lines added, -7,132 lines removed
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 💻 LINUX SERVER DEPLOYMENT
|
|
||||||
|
|
||||||
### Step 1: Install System Dependencies
|
|
||||||
```bash
|
|
||||||
sudo apt-get update
|
|
||||||
sudo apt-get install -y \
|
|
||||||
libgdal-dev libgeos-dev libproj-dev libudunits2-dev \
|
|
||||||
libcurl4-openssl-dev libssl-dev libxml2-dev \
|
|
||||||
libfontconfig1-dev libharfbuzz-dev libfribidi-dev \
|
|
||||||
pandoc pandoc-citeproc
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 2: Clone & Setup
|
|
||||||
```bash
|
|
||||||
git clone <bitbucket-url> smartcane
|
|
||||||
cd smartcane
|
|
||||||
chmod +x *.sh
|
|
||||||
dos2unix *.sh # Fix Windows line endings
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 3: Install R Packages
|
|
||||||
```bash
|
|
||||||
Rscript -e "renv::restore()"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4: Test Workflow
|
|
||||||
```bash
|
|
||||||
./09_run_calculate_kpis.sh aura
|
|
||||||
./10_run_kpi_report.sh --data_dir=aura --filename=test.docx
|
|
||||||
ls laravel_app/storage/app/aura/reports/
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚙️ SCRIPT 10 PARAMETERS (for Laravel UI)
|
|
||||||
|
|
||||||
### Configurable Parameters (add to Laravel project settings):
|
|
||||||
|
|
||||||
| Parameter | Type | Default | Options | Description |
|
|
||||||
|-----------|------|---------|---------|-------------|
|
|
||||||
| `borders` | Boolean | FALSE | TRUE/FALSE | Show field borders on maps |
|
|
||||||
| `ci_plot_type` | String | both | absolute/cumulative/both | Type of CI plots |
|
|
||||||
| `colorblind_friendly` | Boolean | TRUE | TRUE/FALSE | Use accessible color palettes |
|
|
||||||
| `facet_by_season` | Boolean | FALSE | TRUE/FALSE | Split plots by season |
|
|
||||||
| `x_axis_unit` | String | days | days/weeks | X-axis time unit |
|
|
||||||
|
|
||||||
### Auto-Set Parameters (managed by system):
|
|
||||||
|
|
||||||
| Parameter | Source | Description |
|
|
||||||
|-----------|--------|-------------|
|
|
||||||
| `filename` | Auto-generated | Set by system: `{project}_{date}.docx` |
|
|
||||||
| `report_date` | Current date | Automatically uses today's date |
|
|
||||||
| `mail_day` | Current day | Automatically uses current weekday |
|
|
||||||
| `data_dir` | Project name | Set from Laravel project configuration |
|
|
||||||
|
|
||||||
### Laravel Implementation Notes:
|
|
||||||
|
|
||||||
1. **Create settings per project** with the 5 configurable parameters above
|
|
||||||
2. **Auto-generate filename**: `${project_name}_report_${date}.docx`
|
|
||||||
3. **Auto-set dates**: Use current date/day when script runs
|
|
||||||
4. **data_dir**: Pull from project's directory name in Laravel
|
|
||||||
|
|
||||||
**Example usage:**
|
|
||||||
```bash
|
|
||||||
./10_run_kpi_report.sh \
|
|
||||||
--data_dir=aura \
|
|
||||||
--report_date=$(date +%Y-%m-%d) \
|
|
||||||
--filename="aura_report_$(date +%Y%m%d).docx" \
|
|
||||||
--mail_day=$(date +%A) \
|
|
||||||
--borders=FALSE \
|
|
||||||
--ci_plot_type=both \
|
|
||||||
--colorblind_friendly=TRUE \
|
|
||||||
--facet_by_season=FALSE \
|
|
||||||
--x_axis_unit=days
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🚨 COMMON DEPLOYMENT ERRORS
|
|
||||||
|
|
||||||
### Error 1: Package Compilation Fails
|
|
||||||
```
|
|
||||||
ERROR: configuration failed for package 'sf'
|
|
||||||
```
|
|
||||||
**Solution:** Install system dependencies (see Step 1 above)
|
|
||||||
|
|
||||||
### Error 2: Permission Denied
|
|
||||||
```
|
|
||||||
bash: ./10_run_kpi_report.sh: Permission denied
|
|
||||||
```
|
|
||||||
**Solution:** `chmod +x *.sh`
|
|
||||||
|
|
||||||
### Error 3: Line Ending Issues
|
|
||||||
```
|
|
||||||
/bin/bash^M: bad interpreter
|
|
||||||
```
|
|
||||||
**Solution:** `dos2unix *.sh` or `sed -i 's/\r$//' *.sh`
|
|
||||||
|
|
||||||
### Error 4: Pandoc Missing
|
|
||||||
```
|
|
||||||
Error: pandoc version 1.12.3 or higher is required
|
|
||||||
```
|
|
||||||
**Solution:** `sudo apt-get install -y pandoc`
|
|
||||||
|
|
||||||
### Error 5: Font Errors
|
|
||||||
```
|
|
||||||
Error in gdtools::...: font family not found
|
|
||||||
```
|
|
||||||
**Solution:** Install font libraries (libfontconfig1-dev, etc. - see Step 1)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📊 SCRIPT COMPARISON: Old vs New
|
|
||||||
|
|
||||||
### Script 05 (OLD - skip this):
|
|
||||||
- Basic CI maps ✅
|
|
||||||
- CI trend plots ✅
|
|
||||||
- Week-over-week change ✅
|
|
||||||
- **NO KPI metrics** ❌
|
|
||||||
- **NO field uniformity** ❌
|
|
||||||
- **NO priority detection** ❌
|
|
||||||
|
|
||||||
### Scripts 09 + 10 (NEW - use these):
|
|
||||||
- Everything from script 05 ✅
|
|
||||||
- **KPI metrics** ✅
|
|
||||||
- **Field uniformity (CV, Moran's I)** ✅
|
|
||||||
- **Priority classification** (urgent/monitor/no stress) ✅
|
|
||||||
- **Enhanced tables** (flextable formatting) ✅
|
|
||||||
- **Field stress detection** ✅
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚠️ WINDOWS → LINUX COMPATIBILITY
|
|
||||||
|
|
||||||
**Known issues when moving from Windows to Linux:**
|
|
||||||
|
|
||||||
| Issue | Windows | Linux | Solution |
|
|
||||||
|-------|---------|-------|----------|
|
|
||||||
| Path separators | `\` | `/` | Scripts use `here::here()` ✅ |
|
|
||||||
| Line endings | CRLF | LF | Run `dos2unix *.sh` |
|
|
||||||
| Package compilation | Binary | Source | Install system libs first |
|
|
||||||
| File permissions | Auto | Manual | Run `chmod +x *.sh` |
|
|
||||||
| R path | Fixed path | In PATH | Scripts auto-detect ✅ |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ✅ DEPLOYMENT CHECKLIST
|
|
||||||
|
|
||||||
**Before pushing to Bitbucket:**
|
|
||||||
- [ ] Verify scripts 09 and 10 work locally
|
|
||||||
- [ ] Check renv.lock is committed
|
|
||||||
- [ ] Test workflow: 01→02→03→04→09→10
|
|
||||||
|
|
||||||
**After pulling on Linux server:**
|
|
||||||
- [ ] Install system dependencies (GDAL, GEOS, PROJ, Pandoc, fonts)
|
|
||||||
- [ ] Clone repository
|
|
||||||
- [ ] Fix line endings: `dos2unix *.sh`
|
|
||||||
- [ ] Set permissions: `chmod +x *.sh`
|
|
||||||
- [ ] Install R packages: `Rscript -e "renv::restore()"`
|
|
||||||
- [ ] Test with one project: `./09_run_calculate_kpis.sh aura`
|
|
||||||
- [ ] Generate test report: `./10_run_kpi_report.sh --data_dir=aura`
|
|
||||||
- [ ] Create Laravel UI for script 10 parameters
|
|
||||||
- [ ] Update any automation scripts to use new workflow
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📂 KEY FILES TO KNOW
|
|
||||||
|
|
||||||
```
|
|
||||||
smartcane/
|
|
||||||
├── 01-04_*.sh # Data acquisition (existing workflow)
|
|
||||||
├── 05_*.sh # ❌ Old report (skip)
|
|
||||||
├── 09_*.sh # ✅ NEW - KPI calculation
|
|
||||||
├── 10_*.sh # ✅ NEW - Report with KPIs
|
|
||||||
├── renv.lock # Package versions (includes flextable/officer)
|
|
||||||
└── r_app/
|
|
||||||
├── 09_calculate_kpis.R # NEW
|
|
||||||
├── 10_CI_report_with_kpis_simple.Rmd # NEW
|
|
||||||
└── kpi_utils.R # NEW
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🔄 EXAMPLE: Full Weekly Pipeline
|
|
||||||
|
|
||||||
```bash
|
|
||||||
#!/bin/bash
|
|
||||||
# Complete weekly workflow for Aura farm
|
|
||||||
|
|
||||||
PROJECT="aura"
|
|
||||||
DATE=$(date +%Y-%m-%d)
|
|
||||||
|
|
||||||
# Step 1-4: Data acquisition
|
|
||||||
./01_run_planet_download.sh --project_dir=$PROJECT
|
|
||||||
./02_run_ci_extraction.sh --project_dir=$PROJECT
|
|
||||||
./03_run_growth_model.sh --project_dir=$PROJECT
|
|
||||||
./04_run_mosaic_creation.sh --data_dir=$PROJECT
|
|
||||||
|
|
||||||
# Step 5-6: KPI calculation & reporting (NEW)
|
|
||||||
./09_run_calculate_kpis.sh $PROJECT
|
|
||||||
./10_run_kpi_report.sh \
|
|
||||||
--data_dir=$PROJECT \
|
|
||||||
--report_date=$DATE \
|
|
||||||
--filename="${PROJECT}_${DATE}.docx" \
|
|
||||||
--colorblind_friendly=TRUE
|
|
||||||
|
|
||||||
echo "✅ Pipeline complete! Check output/"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📞 TROUBLESHOOTING
|
|
||||||
|
|
||||||
**If deployment fails:**
|
|
||||||
1. Check error against "Common Errors" section above
|
|
||||||
2. Verify system dependencies: `dpkg -l | grep libgdal`
|
|
||||||
3. Test R packages: `Rscript -e "library(flextable)"`
|
|
||||||
4. Check file structure: `ls laravel_app/storage/app/*/`
|
|
||||||
5. Review logs: `./10_run_kpi_report.sh 2>&1 | tee debug.log`
|
|
||||||
|
|
||||||
**Still stuck?** Contact developer with:
|
|
||||||
- Full error message
|
|
||||||
- Which script failed
|
|
||||||
- Output of `sessionInfo()` in R
|
|
||||||
- Server OS and R version
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Version:** 1.0
|
|
||||||
**Last Updated:** October 14, 2025
|
|
||||||
**Branch:** code-improvements (ready for merge to master)
|
|
||||||
|
|
@ -1,324 +0,0 @@
|
||||||
# Executive Summary: Harvest Detection Model Evaluation
|
|
||||||
|
|
||||||
**Date**: December 8, 2025
|
|
||||||
**Script**: `python_app/harvest_detection_experiments/05_lstm_harvest_detection_pytorch.ipynb`
|
|
||||||
**Status**: ✅ **PRODUCTION-READY WITH MINOR ENHANCEMENTS RECOMMENDED**
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Findings at a Glance
|
|
||||||
|
|
||||||
| Metric | Current | Target | Gap |
|
|
||||||
|--------|---------|--------|-----|
|
|
||||||
| **Imminent AUC** | 0.8793 | 0.95+ | 7% |
|
|
||||||
| **Detected AUC** | 0.9798 | 0.98+ | ✅ Achieved |
|
|
||||||
| **False Positive Rate** | ~15% | <5% | 10% |
|
|
||||||
| **Mean Lead Time** | ~7 days | 7-10 days | ✅ Good |
|
|
||||||
| **Fields Covered** | 2-3 (ESA) | 15+ (all) | 1 retraining |
|
|
||||||
| **Production Readiness** | 70% | 95%+ | 25% effort |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What the Model Does
|
|
||||||
|
|
||||||
**Goal**: Predict when sugarcane fields are ready for harvest and confirm when harvest occurred
|
|
||||||
|
|
||||||
**Input**: Weekly chlorophyll index (CI) values over 300-400+ days of a growing season
|
|
||||||
|
|
||||||
**Output**: Two probability signals per day:
|
|
||||||
1. **Imminent** (0-100%): "Harvest is 3-14 days away" → Alert farmer
|
|
||||||
2. **Detected** (0-100%): "Harvest occurred 1-21 days ago" → Confirm in database
|
|
||||||
|
|
||||||
**Accuracy**: 88-98% depending on task (excellent for operational use)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Strengths (What's Working Well)
|
|
||||||
|
|
||||||
### ✅ Architecture & Engineering
|
|
||||||
- **Clean code**: Well-organized, reproducible, documented
|
|
||||||
- **No data leakage**: Fields split for train/val/test (prevents cheating)
|
|
||||||
- **Smart preprocessing**: Detects and removes bad data (linear interpolation, sensor noise)
|
|
||||||
- **Appropriate loss function**: Focal BCE handles class imbalance properly
|
|
||||||
- **Variable-length handling**: Efficiently pads sequences per batch
|
|
||||||
|
|
||||||
### ✅ Performance
|
|
||||||
- **Detected signal is rock-solid**: 98% AUC (harvest confirmation works perfectly)
|
|
||||||
- **Imminent signal is good**: 88% AUC (room for improvement, but usable)
|
|
||||||
- **Per-timestep predictions**: Each day gets independent prediction (not just last day)
|
|
||||||
|
|
||||||
### ✅ Operational Readiness
|
|
||||||
- **Model is saved**: Can be deployed immediately
|
|
||||||
- **Config is documented**: Reproducible experiments
|
|
||||||
- **Visualizations are clear**: Easy to understand what model is doing
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Weaknesses (Why It's Not Perfect)
|
|
||||||
|
|
||||||
### ⚠️ Limited Input Features
|
|
||||||
**Issue**: Model only uses CI (7 features derived from chlorophyll)
|
|
||||||
- Missing: Temperature, rainfall, soil moisture, phenological stage
|
|
||||||
- Result: Can't distinguish "harvest-ready decline" from "stress decline"
|
|
||||||
|
|
||||||
**Impact**: False imminent positives during seasonal dips
|
|
||||||
- Example: Field shows declining CI in mid-season (stress or natural) vs. pre-harvest (true harvest)
|
|
||||||
- Model can't tell the difference with CI alone
|
|
||||||
|
|
||||||
**Fix**: Add temperature data (can be done in 3-4 hours)
|
|
||||||
|
|
||||||
### ⚠️ Single-Client Training
|
|
||||||
**Issue**: Model trained on ESA fields only (~2 fields, ~2,000 training samples)
|
|
||||||
- Limited diversity: Same climate, same growing conditions
|
|
||||||
- Result: Overfits to ESA-specific patterns
|
|
||||||
|
|
||||||
**Impact**: Uncertain performance on chemba, bagamoyo, muhoroni, aura, sony
|
|
||||||
- May work well, may not
|
|
||||||
- Unknown until tested
|
|
||||||
|
|
||||||
**Fix**: Retrain on all clients (can be done in 15 minutes of runtime)
|
|
||||||
|
|
||||||
### ⚠️ Imminent Window May Not Be Optimal
|
|
||||||
**Issue**: Currently 3-14 days before harvest
|
|
||||||
- Too early warning (>14 days) = less actionable
|
|
||||||
- Too late warning (<3 days) = not enough lead time
|
|
||||||
|
|
||||||
**Impact**: Unknown if this is the sweet spot for farmers
|
|
||||||
- Need to test 5-15, 7-14, 10-21 to find optimal
|
|
||||||
|
|
||||||
**Fix**: Run window sensitivity analysis (can be done in 1-2 hours)
|
|
||||||
|
|
||||||
### ⚠️ No Uncertainty Quantification
|
|
||||||
**Issue**: Model outputs single probability (e.g., "0.87"), not confidence range
|
|
||||||
|
|
||||||
**Impact**: Operators don't know "Is 0.87 reliable? Or uncertain?"
|
|
||||||
|
|
||||||
**Fix**: Optional (Bayesian LSTM or ensemble), lower priority
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick Wins (High-Impact, Low Effort)
|
|
||||||
|
|
||||||
### 🟢 Win #1: Retrain on All Clients (30 min setup + 15 min runtime)
|
|
||||||
**Impact**: +5-10% AUC on imminent, better generalization
|
|
||||||
**How**: Change line 49 in notebook from `CLIENT_FILTER = 'esa'` to `CLIENT_FILTER = None`
|
|
||||||
**Effort**: Trivial (1 variable change)
|
|
||||||
**Expected Result**: Same model, better trained (10,000+ samples vs. 2,000)
|
|
||||||
|
|
||||||
### 🟢 Win #2: Add Temperature Features (3-4 hours)
|
|
||||||
**Impact**: +10-15% AUC on imminent, 50% reduction in false positives
|
|
||||||
**Why**: Harvest timing correlates with heat. Temperature distinguishes "harvest-ready" from "stressed"
|
|
||||||
**How**: Download daily temperature, add GDD and anomaly features
|
|
||||||
**Expected Result**: Imminent AUC: 0.88 → 0.93-0.95
|
|
||||||
|
|
||||||
### 🟢 Win #3: Test Window Optimization (1-2 hours)
|
|
||||||
**Impact**: -30% false positives without losing any true positives
|
|
||||||
**Why**: Current 3-14 day window may not be optimal
|
|
||||||
**How**: Test 5 different windows, measure AUC and false positive rate
|
|
||||||
**Expected Result**: Find sweet spot (probably 7-14 or 10-21 days)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recommended Actions
|
|
||||||
|
|
||||||
### **Immediate** (This Week)
|
|
||||||
- [ ] **Action 1**: Run Phase 1 (all-client retraining)
|
|
||||||
- Change 1 variable, run notebook
|
|
||||||
- Measure AUC improvement
|
|
||||||
- Estimate: 30 min active work, 15 min runtime
|
|
||||||
|
|
||||||
- [ ] **Action 2**: Identify temperature data source
|
|
||||||
- ECMWF? Local weather station? Sentinel-3 satellite?
|
|
||||||
- Check data format and availability for 2020-2024
|
|
||||||
- Estimate: 1-2 hours research
|
|
||||||
|
|
||||||
### **Near-term** (Next 2 Weeks)
|
|
||||||
- [ ] **Action 3**: Implement temperature features
|
|
||||||
- Use code provided in TECHNICAL_IMPROVEMENTS.md
|
|
||||||
- Retrain with 11 features instead of 7
|
|
||||||
- Estimate: 3-4 hours implementation + 30 min runtime
|
|
||||||
|
|
||||||
- [ ] **Action 4**: Test window optimization
|
|
||||||
- Use code provided in TECHNICAL_IMPROVEMENTS.md
|
|
||||||
- Run sensitivity analysis on 5-6 different windows
|
|
||||||
- Estimate: 2 hours
|
|
||||||
|
|
||||||
### **Follow-up** (Month 1)
|
|
||||||
- [ ] **Action 5**: Operational validation
|
|
||||||
- Compute lead times, false positive rates per field
|
|
||||||
- Verify farmers have enough warning time
|
|
||||||
- Estimate: 2-3 hours
|
|
||||||
|
|
||||||
- [ ] **Action 6** (Optional): Add rainfall features
|
|
||||||
- If operational testing shows drought cases are problematic
|
|
||||||
- Estimate: 3-4 hours
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Success Criteria
|
|
||||||
|
|
||||||
### ✅ After Phase 1 (All Clients)
|
|
||||||
- [ ] Imminent AUC ≥ 0.90
|
|
||||||
- [ ] Model trains without errors
|
|
||||||
- [ ] Can visualize predictions on all client fields
|
|
||||||
- **Timeline**: This week
|
|
||||||
- **Effort**: 30 minutes
|
|
||||||
|
|
||||||
### ✅ After Phase 2 (Temperature Features)
|
|
||||||
- [ ] Imminent AUC ≥ 0.93
|
|
||||||
- [ ] False positive rate < 10%
|
|
||||||
- [ ] Fewer false imminent peaks on seasonal dips
|
|
||||||
- **Timeline**: Next 2 weeks
|
|
||||||
- **Effort**: 3-4 hours
|
|
||||||
|
|
||||||
### ✅ After Phase 3 (Window Optimization)
|
|
||||||
- [ ] Imminent AUC ≥ 0.95
|
|
||||||
- [ ] False positive rate < 5%
|
|
||||||
- [ ] Mean lead time 7-10 days
|
|
||||||
- **Timeline**: 2-3 weeks
|
|
||||||
- **Effort**: 1-2 hours
|
|
||||||
|
|
||||||
### ✅ Production Deployment
|
|
||||||
- [ ] All above criteria met
|
|
||||||
- [ ] Operational manual written
|
|
||||||
- [ ] Tested on at least 1 recent season
|
|
||||||
- **Timeline**: 4-5 weeks
|
|
||||||
- **Effort**: 10-15 hours total
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Documents Provided
|
|
||||||
|
|
||||||
### 1. **QUICK_SUMMARY.md** (This document + more)
|
|
||||||
- Non-technical overview
|
|
||||||
- What the model does
|
|
||||||
- Key findings and recommendations
|
|
||||||
|
|
||||||
### 2. **LSTM_HARVEST_EVALUATION.md** (Detailed)
|
|
||||||
- Section-by-section analysis
|
|
||||||
- Strengths and weaknesses
|
|
||||||
- Specific recommendations by priority
|
|
||||||
- Data quality analysis
|
|
||||||
- Deployment readiness assessment
|
|
||||||
|
|
||||||
### 3. **IMPLEMENTATION_ROADMAP.md** (Action-oriented)
|
|
||||||
- Step-by-step guide for each phase
|
|
||||||
- Expected outcomes and timelines
|
|
||||||
- Code snippets
|
|
||||||
- Performance trajectory
|
|
||||||
|
|
||||||
### 4. **TECHNICAL_IMPROVEMENTS.md** (Code-ready)
|
|
||||||
- Copy-paste ready code examples
|
|
||||||
- Temperature feature engineering
|
|
||||||
- Window optimization analysis
|
|
||||||
- Operational metrics calculation
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Risk Assessment
|
|
||||||
|
|
||||||
### 🟢 Low Risk
|
|
||||||
- **Phase 1** (all-client retraining): Very safe, no new code
|
|
||||||
- **Phase 2** (temperature features): Low risk if temperature data available
|
|
||||||
- **Phase 3** (window optimization): No risk, only testing different parameters
|
|
||||||
|
|
||||||
### 🟡 Medium Risk
|
|
||||||
- **Phase 4** (operational validation): Requires farmer feedback and actual predictions
|
|
||||||
- **Phase 5** (rainfall features): Data availability risk
|
|
||||||
|
|
||||||
### 🔴 High Risk
|
|
||||||
- **Phase 6** (Bayesian uncertainty): High implementation complexity, optional
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Budget & Timeline
|
|
||||||
|
|
||||||
| Phase | Effort | Timeline | Priority | Budget |
|
|
||||||
|-------|--------|----------|----------|--------|
|
|
||||||
| Phase 1: All clients | 30 min | This week | 🔴 High | Minimal |
|
|
||||||
| Phase 2: Temperature | 3-4 hrs | Week 2 | 🔴 High | Minimal |
|
|
||||||
| Phase 3: Windows | 2 hrs | Week 2-3 | 🟡 Medium | Minimal |
|
|
||||||
| Phase 4: Operational | 2-3 hrs | Week 3-4 | 🟡 Medium | Minimal |
|
|
||||||
| Phase 5: Rainfall | 3-4 hrs | Week 4+ | 🟢 Low | Minimal |
|
|
||||||
| **Total** | **10-15 hrs** | **1 month** | - | **Free** |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## FAQ
|
|
||||||
|
|
||||||
**Q: Can I use this model in production now?**
|
|
||||||
A: Partially. The detected signal (98% AUC) is production-ready. The imminent signal (88% AUC) works but has false positives. Recommend Phase 1+2 improvements first (1-2 weeks).
|
|
||||||
|
|
||||||
**Q: What if I don't have temperature data?**
|
|
||||||
A: Model works OK with CI alone (88% AUC), but false positives are higher. Temperature data is highly recommended. Can be downloaded free from ECMWF or local weather stations.
|
|
||||||
|
|
||||||
**Q: How often should I retrain the model?**
|
|
||||||
A: Quarterly (every 3-4 months) as new harvest data comes in. Initial retraining on all clients is critical, then maintain as you collect more data.
|
|
||||||
|
|
||||||
**Q: What's the computational cost?**
|
|
||||||
A: Training takes ~10-15 minutes on GPU, ~1-2 hours on CPU. Inference (prediction) is instant (<1 second per field). Cost is negligible.
|
|
||||||
|
|
||||||
**Q: Can this work for other crops?**
|
|
||||||
A: Yes! The architecture generalizes to any crop with seasonal growth patterns (wheat, rice, corn, etc.). Tuning the harvest window and features would be needed.
|
|
||||||
|
|
||||||
**Q: What about climate variability (e.g., El Niño)?**
|
|
||||||
A: Temperature + rainfall features capture most climate effects. For very extreme events (hurricanes, frosts), may need additional handling.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Conclusion
|
|
||||||
|
|
||||||
**This is a well-engineered harvest detection system that's 70% production-ready.** With two weeks of focused effort (Phase 1 + Phase 2), it can become 95%+ production-ready.
|
|
||||||
|
|
||||||
### Recommended Path Forward
|
|
||||||
1. **Week 1**: Complete Phase 1 (all-client retraining) ← START HERE
|
|
||||||
2. **Week 2**: Complete Phase 2 (temperature features)
|
|
||||||
3. **Week 3**: Complete Phase 3 (window optimization)
|
|
||||||
4. **Week 4**: Complete Phase 4 (operational validation)
|
|
||||||
5. **Month 2**: Deploy to production with weekly monitoring
|
|
||||||
|
|
||||||
**Total effort**: 10-15 hours spread over 4 weeks
|
|
||||||
**Expected outcome**: 95%+ production-ready system with <5% false positive rate and 7-10 day lead time
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Contact & Questions
|
|
||||||
|
|
||||||
- **Data quality issues**: See LSTM_HARVEST_EVALUATION.md (Data Quality section)
|
|
||||||
- **Implementation details**: See TECHNICAL_IMPROVEMENTS.md (copy-paste code)
|
|
||||||
- **Project roadmap**: See IMPLEMENTATION_ROADMAP.md (step-by-step guide)
|
|
||||||
- **Feature engineering**: See TECHNICAL_IMPROVEMENTS.md (feature ideas & code)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Prepared by**: AI Evaluation
|
|
||||||
**Date**: December 8, 2025
|
|
||||||
**Status**: ✅ Ready to proceed with Phase 1
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Appendix: Feature List
|
|
||||||
|
|
||||||
### Current Features (7)
|
|
||||||
1. CI - Raw chlorophyll index
|
|
||||||
2. 7d Velocity - Rate of CI change
|
|
||||||
3. 7d Acceleration - Change in velocity
|
|
||||||
4. 14d MA - Smoothed trend
|
|
||||||
5. 14d Velocity - Longer-term slope
|
|
||||||
6. 7d Minimum - Captures crashes
|
|
||||||
7. Velocity Magnitude - Speed (direction-independent)
|
|
||||||
|
|
||||||
### Recommended Additions (4)
|
|
||||||
8. **GDD Cumulative** - Growing Degree Days (total heat)
|
|
||||||
9. **GDD 7d Velocity** - Rate of heat accumulation
|
|
||||||
10. **Temp Anomaly** - Current temp vs. seasonal average
|
|
||||||
11. **GDD Percentile** - Position in season's heat accumulation
|
|
||||||
|
|
||||||
### Optional Additions (3)
|
|
||||||
12. **Rainfall 7d** - Weekly precipitation
|
|
||||||
13. **Rainfall Deficit** - Deficit vs. normal
|
|
||||||
14. **Drought Stress Index** - Combination metric
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**END OF EXECUTIVE SUMMARY**
|
|
||||||
|
|
@ -1,552 +0,0 @@
|
||||||
# Implementation Roadmap: Improving the Harvest Detection Model
|
|
||||||
|
|
||||||
**Target**: Move from 88% imminent AUC (current) to 95%+ with fewer false positives
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 1: Multi-Client Retraining (Est. 1-2 hours active work)
|
|
||||||
|
|
||||||
### What to Do
|
|
||||||
Change the model from ESA-only to all-client training.
|
|
||||||
|
|
||||||
### Step-by-Step
|
|
||||||
|
|
||||||
1. **Open the notebook** at `python_app/harvest_detection_experiments/05_lstm_harvest_detection_pytorch.ipynb`
|
|
||||||
|
|
||||||
2. **Go to Section 2** (Data Loading), find this line (~line 49):
|
|
||||||
```python
|
|
||||||
CLIENT_FILTER = 'esa' # ← CHANGE THIS
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Change to:**
|
|
||||||
```python
|
|
||||||
CLIENT_FILTER = None # Now uses ALL clients
|
|
||||||
```
|
|
||||||
|
|
||||||
4. **Run Sections 2-12 sequentially**
|
|
||||||
- Section 2: Data loading & cleaning (2-5 min)
|
|
||||||
- Sections 3-6: Feature engineering (1-2 min)
|
|
||||||
- Sections 7-9: Training (5-15 min, depending on GPU)
|
|
||||||
- Sections 10-12: Evaluation & saving (2-3 min)
|
|
||||||
|
|
||||||
5. **Compare results**
|
|
||||||
- Before: `harvest_detection_model_esa_esa.pt` (ESA-only)
|
|
||||||
- After: `harvest_detection_model_esa_None.pt` (all-client)
|
|
||||||
- Expected: Imminent AUC improves from 0.8793 → 0.90+, fewer false positives
|
|
||||||
|
|
||||||
### Expected Outcome
|
|
||||||
```
|
|
||||||
ESA-Only (Current):
|
|
||||||
- Train data: ~2,000 days (2 fields)
|
|
||||||
- Imminent AUC: 0.8793
|
|
||||||
- Issue: False imminent peaks during seasonal dips
|
|
||||||
|
|
||||||
All-Client (Expected):
|
|
||||||
- Train data: ~10,000+ days (15+ fields)
|
|
||||||
- Imminent AUC: 0.90-0.92 (5-10% improvement)
|
|
||||||
- Issue: Reduced, but CI-only limitation remains
|
|
||||||
```
|
|
||||||
|
|
||||||
### Success Criteria
|
|
||||||
- ✅ Model trains without errors
|
|
||||||
- ✅ AUC scores reasonable (imminent > 0.85, detected > 0.95)
|
|
||||||
- ✅ Sequence visualization shows fewer false imminent peaks
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 2: Add Temperature Features (Est. 3-4 hours)
|
|
||||||
|
|
||||||
### Why Temperature Matters
|
|
||||||
|
|
||||||
Sugarcane harvest timing correlates with accumulated heat. Different types of CI decline:
|
|
||||||
|
|
||||||
```
|
|
||||||
Normal Ripening (HARVEST-READY):
|
|
||||||
- Temperature: Moderate-warm
|
|
||||||
- Rainfall: Normal
|
|
||||||
- CI: Declining over 2 weeks
|
|
||||||
- → Launch harvest alerts
|
|
||||||
|
|
||||||
Stress-Induced Decline (AVOID):
|
|
||||||
- Temperature: Very hot or very cold
|
|
||||||
- Rainfall: Low (drought) or excessive
|
|
||||||
- CI: Similar decline pattern
|
|
||||||
- → DON'T trigger alerts (crop stressed, not ready)
|
|
||||||
|
|
||||||
Model Problem: Can't distinguish! Need temperature + rainfall.
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 1: Find Temperature Data
|
|
||||||
|
|
||||||
**Option A: ECMWF Reanalysis** (Recommended)
|
|
||||||
- Global 0.25° resolution
|
|
||||||
- Free: https://www.ecmwf.int/
|
|
||||||
- Daily or monthly data available
|
|
||||||
- Takes 1-2 hours to download/process
|
|
||||||
|
|
||||||
**Option B: Local Weather Stations**
|
|
||||||
- Higher accuracy if available
|
|
||||||
- Must interpolate between stations
|
|
||||||
- May have gaps
|
|
||||||
|
|
||||||
**Option C: MODIS/Satellite Temperature**
|
|
||||||
- From Landsat, Sentinel-3
|
|
||||||
- Already integrated with your pipeline?
|
|
||||||
- Same download as CI
|
|
||||||
|
|
||||||
**Steps**:
|
|
||||||
1. Download daily average temperature for field locations, 2020-2024
|
|
||||||
2. Merge with CI data by date/location
|
|
||||||
3. Format: One row per field, per date with temperature column
|
|
||||||
|
|
||||||
### Step 2: Engineer Temperature-Based Features
|
|
||||||
|
|
||||||
Add to Section 5 (Feature Engineering):
|
|
||||||
|
|
||||||
```python
|
|
||||||
def add_temperature_features(df, temp_column='daily_avg_temp'):
|
|
||||||
"""
|
|
||||||
Add harvest-relevant temperature features.
|
|
||||||
|
|
||||||
New features (4 total):
|
|
||||||
1. gdd_cumulative: Growing Degree Days (sum of (T-base) where T>10°C)
|
|
||||||
2. gdd_7d_velocity: 7-day change in accumulated heat
|
|
||||||
3. temp_anomaly: Current temp vs seasonal average
|
|
||||||
4. gdd_percentile: Where in season's heat accumulation?
|
|
||||||
"""
|
|
||||||
|
|
||||||
# 1. Growing Degree Days (GDD)
|
|
||||||
# Base temp for sugarcane: 10°C
|
|
||||||
df['daily_gdd'] = np.maximum(0, df[temp_column] - 10)
|
|
||||||
df['gdd_cumulative'] = df.groupby(['field', 'model'])['daily_gdd'].cumsum()
|
|
||||||
|
|
||||||
# 2. GDD velocity
|
|
||||||
df['gdd_7d_velocity'] = 0.0
|
|
||||||
for (field, model), group in df.groupby(['field', 'model']):
|
|
||||||
idx = group.index
|
|
||||||
gdd_values = group['gdd_cumulative'].values
|
|
||||||
for i in range(7, len(gdd_values)):
|
|
||||||
df.loc[idx[i], 'gdd_7d_velocity'] = gdd_values[i] - gdd_values[i-7]
|
|
||||||
|
|
||||||
# 3. Temperature anomaly (vs 30-day rolling average)
|
|
||||||
df['temp_30d_avg'] = df.groupby('field')[temp_column].transform(
|
|
||||||
lambda x: x.rolling(30, center=True, min_periods=1).mean()
|
|
||||||
)
|
|
||||||
df['temp_anomaly'] = df[temp_column] - df['temp_30d_avg']
|
|
||||||
|
|
||||||
# 4. GDD percentile (within season)
|
|
||||||
df['gdd_percentile'] = 0.0
|
|
||||||
for (field, model), group in df.groupby(['field', 'model']):
|
|
||||||
idx = group.index
|
|
||||||
gdd_values = group['gdd_cumulative'].values
|
|
||||||
max_gdd = gdd_values[-1]
|
|
||||||
df.loc[idx, 'gdd_percentile'] = gdd_values / (max_gdd + 0.001)
|
|
||||||
|
|
||||||
return df
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 3: Update Feature List
|
|
||||||
|
|
||||||
In Section 5, change from 7 features to 11:
|
|
||||||
|
|
||||||
```python
|
|
||||||
feature_names = [
|
|
||||||
'CI', # Original
|
|
||||||
'7d Velocity', # Original
|
|
||||||
'7d Acceleration', # Original
|
|
||||||
'14d MA', # Original
|
|
||||||
'14d Velocity', # Original
|
|
||||||
'7d Min', # Original
|
|
||||||
'Velocity Magnitude', # Original
|
|
||||||
'GDD Cumulative', # NEW
|
|
||||||
'GDD 7d Velocity', # NEW
|
|
||||||
'Temp Anomaly', # NEW
|
|
||||||
'GDD Percentile' # NEW
|
|
||||||
]
|
|
||||||
|
|
||||||
# Update feature engineering:
|
|
||||||
features = np.column_stack([
|
|
||||||
ci_smooth,
|
|
||||||
velocity_7d,
|
|
||||||
acceleration_7d,
|
|
||||||
ma14_values,
|
|
||||||
velocity_14d,
|
|
||||||
min_7d,
|
|
||||||
velocity_magnitude,
|
|
||||||
gdd_cumulative, # NEW
|
|
||||||
gdd_7d_velocity, # NEW
|
|
||||||
temp_anomaly, # NEW
|
|
||||||
gdd_percentile # NEW
|
|
||||||
])
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4: Update Model Input Size
|
|
||||||
|
|
||||||
In Section 8, change:
|
|
||||||
```python
|
|
||||||
# OLD
|
|
||||||
model = HarvestDetectionLSTM(input_size=7, ...)
|
|
||||||
|
|
||||||
# NEW
|
|
||||||
model = HarvestDetectionLSTM(input_size=11, ...) # 7 + 4 new features
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 5: Retrain
|
|
||||||
|
|
||||||
Run Sections 6-12 again with new data + model size.
|
|
||||||
|
|
||||||
### Expected Outcome
|
|
||||||
|
|
||||||
```
|
|
||||||
Before Temperature Features:
|
|
||||||
- Input: 7 features (CI-derived only)
|
|
||||||
- Imminent AUC: 0.90 (all-client baseline)
|
|
||||||
- False imminent rate: 15-20% of predictions
|
|
||||||
|
|
||||||
After Temperature Features:
|
|
||||||
- Input: 11 features (CI + temperature)
|
|
||||||
- Imminent AUC: 0.93-0.95 (3-5% gain)
|
|
||||||
- False imminent rate: 5-10% (50% reduction!)
|
|
||||||
- Model can distinguish: Stress-decline vs. harvest-ready decline
|
|
||||||
```
|
|
||||||
|
|
||||||
### Why This Works
|
|
||||||
|
|
||||||
**Harvest-specific pattern** (with temperature):
|
|
||||||
```
|
|
||||||
Imminent Harvest:
|
|
||||||
CI: Declining ↘
|
|
||||||
GDD: Very high (>3500 total)
|
|
||||||
GDD Velocity: Moderate (still accumulating)
|
|
||||||
Temp Anomaly: Normal
|
|
||||||
→ Model learns: "High GDD + declining CI + normal temp" = HARVEST
|
|
||||||
|
|
||||||
Drought Stress (False Positive Prevention):
|
|
||||||
CI: Declining ↘ (same as above)
|
|
||||||
GDD: Moderate (1500-2000)
|
|
||||||
GDD Velocity: Negative (cooling, winter)
|
|
||||||
Temp Anomaly: Very hot
|
|
||||||
→ Model learns: "Low GDD + stress temp" ≠ HARVEST
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 3: Test Different Imminent Windows (Est. 1-2 hours)
|
|
||||||
|
|
||||||
### Current Window: 3-14 days
|
|
||||||
|
|
||||||
**Question**: Is this optimal? Let's test:
|
|
||||||
- 5-15 days (shift right, later warning)
|
|
||||||
- 7-14 days (tighten lower bound)
|
|
||||||
- 10-21 days (wider, earlier warning)
|
|
||||||
- 3-7 days (ultra-tight, latest warning)
|
|
||||||
|
|
||||||
### How to Test
|
|
||||||
|
|
||||||
In Section 4, create a loop:
|
|
||||||
|
|
||||||
```python
|
|
||||||
windows_to_test = [
|
|
||||||
(3, 14), # Current
|
|
||||||
(5, 15),
|
|
||||||
(7, 14),
|
|
||||||
(10, 21),
|
|
||||||
(3, 7),
|
|
||||||
]
|
|
||||||
|
|
||||||
results = []
|
|
||||||
|
|
||||||
for imm_start, imm_end in windows_to_test:
|
|
||||||
# Relabel with new window
|
|
||||||
labeled_seqs = label_harvest_windows_per_season(
|
|
||||||
test_sequences,
|
|
||||||
imminent_start=imm_start,
|
|
||||||
imminent_end=imm_end,
|
|
||||||
detected_start=1,
|
|
||||||
detected_end=21
|
|
||||||
)
|
|
||||||
|
|
||||||
# Evaluate
|
|
||||||
y_true = concat labels from labeled_seqs
|
|
||||||
y_pred = get_model_predictions(test_sequences)
|
|
||||||
|
|
||||||
auc = roc_auc_score(y_true, y_pred)
|
|
||||||
fp_rate = false_positive_rate(y_true, y_pred)
|
|
||||||
|
|
||||||
results.append({
|
|
||||||
'window': f"{imm_start}-{imm_end}",
|
|
||||||
'auc': auc,
|
|
||||||
'fp_rate': fp_rate,
|
|
||||||
})
|
|
||||||
|
|
||||||
# Print results
|
|
||||||
results_df = pd.DataFrame(results).sort_values('auc', ascending=False)
|
|
||||||
print(results_df)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Expected Outcome
|
|
||||||
|
|
||||||
```
|
|
||||||
Window AUC FP_Rate
|
|
||||||
0 7-14 0.920 0.08 ← RECOMMENDED (best balance)
|
|
||||||
1 5-15 0.918 0.12
|
|
||||||
2 3-14 0.915 0.15 ← Current
|
|
||||||
3 10-21 0.910 0.05 ← Too late
|
|
||||||
4 3-7 0.905 0.20 ← Too early
|
|
||||||
```
|
|
||||||
|
|
||||||
Choose the window with highest AUC and acceptable false positive rate.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 4: Operational Metrics (Est. 2 hours)
|
|
||||||
|
|
||||||
### What We Need
|
|
||||||
|
|
||||||
For deployment, understand:
|
|
||||||
1. **Lead time**: How many days before harvest do we warn?
|
|
||||||
2. **False positive rate**: How often do we cry wolf?
|
|
||||||
3. **Miss rate**: How often do we miss the harvest window?
|
|
||||||
4. **Per-field performance**: Do some fields have worse predictions?
|
|
||||||
|
|
||||||
### Code to Add
|
|
||||||
|
|
||||||
```python
|
|
||||||
def compute_operational_metrics(model, test_sequences_labeled, test_features):
|
|
||||||
"""
|
|
||||||
Compute farmer-relevant metrics.
|
|
||||||
"""
|
|
||||||
|
|
||||||
lead_times = []
|
|
||||||
false_positives = []
|
|
||||||
misses = []
|
|
||||||
field_performance = {}
|
|
||||||
|
|
||||||
for seq_idx, seq_dict in enumerate(test_sequences_labeled):
|
|
||||||
field = seq_dict['field']
|
|
||||||
data = seq_dict['data']
|
|
||||||
|
|
||||||
# Get predictions
|
|
||||||
X_features = test_features[seq_idx]
|
|
||||||
with torch.no_grad():
|
|
||||||
imminent_pred, _ = model(torch.from_numpy(X_features[np.newaxis, :, :]))
|
|
||||||
imminent_pred = imminent_pred[0].cpu().numpy()
|
|
||||||
|
|
||||||
# Find harvest boundary
|
|
||||||
harvest_idx = np.where(data['harvest_boundary'] == 1)[0]
|
|
||||||
if len(harvest_idx) == 0:
|
|
||||||
continue
|
|
||||||
harvest_idx = harvest_idx[0]
|
|
||||||
|
|
||||||
# Find when model triggered (imminent > 0.5)
|
|
||||||
triggered_indices = np.where(imminent_pred > 0.5)[0]
|
|
||||||
|
|
||||||
if len(triggered_indices) > 0:
|
|
||||||
# Last trigger before harvest
|
|
||||||
triggers_before = triggered_indices[triggered_indices < harvest_idx]
|
|
||||||
if len(triggers_before) > 0:
|
|
||||||
last_trigger = triggers_before[-1]
|
|
||||||
lead_time = harvest_idx - last_trigger
|
|
||||||
lead_times.append(lead_time)
|
|
||||||
|
|
||||||
# Check if within optimal window (e.g., 3-14 days)
|
|
||||||
if 3 <= lead_time <= 14:
|
|
||||||
if field not in field_performance:
|
|
||||||
field_performance[field] = {'correct': 0, 'total': 0}
|
|
||||||
field_performance[field]['correct'] += 1
|
|
||||||
else:
|
|
||||||
# Triggered after harvest = false positive
|
|
||||||
false_positives.append(len(triggered_indices))
|
|
||||||
else:
|
|
||||||
# No trigger at all = miss
|
|
||||||
misses.append(seq_idx)
|
|
||||||
|
|
||||||
if field not in field_performance:
|
|
||||||
field_performance[field] = {'correct': 0, 'total': 0}
|
|
||||||
field_performance[field]['total'] += 1
|
|
||||||
|
|
||||||
# Compute statistics
|
|
||||||
print("\n" + "="*60)
|
|
||||||
print("OPERATIONAL METRICS")
|
|
||||||
print("="*60)
|
|
||||||
|
|
||||||
print(f"\nLead Time Analysis:")
|
|
||||||
print(f" Mean: {np.mean(lead_times):.1f} days")
|
|
||||||
print(f" Std: {np.std(lead_times):.1f} days")
|
|
||||||
print(f" Min: {np.min(lead_times):.0f} days")
|
|
||||||
print(f" Max: {np.max(lead_times):.0f} days")
|
|
||||||
print(f" Optimal (3-14d): {sum((3<=x<=14 for x in lead_times))/len(lead_times)*100:.1f}%")
|
|
||||||
|
|
||||||
print(f"\nError Analysis:")
|
|
||||||
print(f" False positives (wrong timing): {len(false_positives)} sequences")
|
|
||||||
print(f" Misses (no warning): {len(misses)} sequences")
|
|
||||||
print(f" Accuracy: {len(lead_times)/(len(lead_times)+len(false_positives)+len(misses))*100:.1f}%")
|
|
||||||
|
|
||||||
print(f"\nPer-Field Performance:")
|
|
||||||
for field, perf in sorted(field_performance.items()):
|
|
||||||
accuracy = perf['correct'] / perf['total'] * 100
|
|
||||||
print(f" {field:15s}: {accuracy:5.1f}% correct")
|
|
||||||
|
|
||||||
return {
|
|
||||||
'lead_times': lead_times,
|
|
||||||
'false_positives': len(false_positives),
|
|
||||||
'misses': len(misses),
|
|
||||||
'field_performance': field_performance
|
|
||||||
}
|
|
||||||
|
|
||||||
# Run it
|
|
||||||
metrics = compute_operational_metrics(model, test_sequences_labeled, X_test_features)
|
|
||||||
```
|
|
||||||
|
|
||||||
### What to Look For
|
|
||||||
|
|
||||||
**Good performance**:
|
|
||||||
```
|
|
||||||
Mean lead time: 7-10 days ✅ (gives farmer time to prepare)
|
|
||||||
Optimal timing: >80% ✅ (most warnings in 3-14d window)
|
|
||||||
False positives: <5% ✅ (rarely cry wolf)
|
|
||||||
Misses: <10% ✅ (rarely miss harvest)
|
|
||||||
```
|
|
||||||
|
|
||||||
**Poor performance**:
|
|
||||||
```
|
|
||||||
Mean lead time: 2 days ❌ (too late)
|
|
||||||
Optimal timing: <60% ❌ (inconsistent)
|
|
||||||
False positives: >20% ❌ (farmers lose trust)
|
|
||||||
Misses: >20% ❌ (unreliable)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phase 5: Rainfall Features (Optional, High Value) (Est. 3-4 hours)
|
|
||||||
|
|
||||||
### Similar to Temperature
|
|
||||||
|
|
||||||
Add rainfall + soil moisture features:
|
|
||||||
|
|
||||||
```python
|
|
||||||
def add_rainfall_features(df, rainfall_column='daily_rainfall_mm'):
|
|
||||||
"""
|
|
||||||
Add drought/moisture stress features.
|
|
||||||
|
|
||||||
New features (3 total):
|
|
||||||
1. rainfall_7d: Total rain in last 7 days
|
|
||||||
2. rainfall_deficit: Deficit vs normal for this time of year
|
|
||||||
3. drought_stress_index: Combination metric
|
|
||||||
"""
|
|
||||||
|
|
||||||
# 1. 7-day rainfall
|
|
||||||
df['rainfall_7d'] = df.groupby('field')[rainfall_column].transform(
|
|
||||||
lambda x: x.rolling(7, min_periods=1).sum()
|
|
||||||
)
|
|
||||||
|
|
||||||
# 2. Seasonal rainfall average
|
|
||||||
df['seasonal_rain_avg'] = df.groupby('field')[rainfall_column].transform(
|
|
||||||
lambda x: x.rolling(30, center=True, min_periods=1).mean()
|
|
||||||
)
|
|
||||||
df['rainfall_deficit'] = df['seasonal_rain_avg'] - df[rainfall_column]
|
|
||||||
|
|
||||||
# 3. Drought stress index
|
|
||||||
# (0 = not stressed, 1 = severe drought)
|
|
||||||
df['drought_stress'] = np.minimum(
|
|
||||||
1.0,
|
|
||||||
df['rainfall_deficit'] / (df['seasonal_rain_avg'] + 0.1)
|
|
||||||
)
|
|
||||||
|
|
||||||
return df
|
|
||||||
```
|
|
||||||
|
|
||||||
**Why this helps**:
|
|
||||||
- Drought accelerates maturity (early harvest)
|
|
||||||
- Excessive rain delays harvest
|
|
||||||
- Model can distinguish "ready to harvest" from "crop stressed"
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary: Quick Implementation Checklist
|
|
||||||
|
|
||||||
### Week 1: Foundation
|
|
||||||
- [ ] Phase 1: Retrain on all clients
|
|
||||||
- [ ] Change `CLIENT_FILTER = None`
|
|
||||||
- [ ] Run full pipeline
|
|
||||||
- [ ] Compare metrics
|
|
||||||
|
|
||||||
### Week 2: Core Enhancement
|
|
||||||
- [ ] Phase 2: Add temperature features
|
|
||||||
- [ ] Find/download temperature data
|
|
||||||
- [ ] Merge with CI data
|
|
||||||
- [ ] Update feature engineering (7 → 11 features)
|
|
||||||
- [ ] Retrain model
|
|
||||||
- [ ] Compare metrics (expect 3-5% AUC gain)
|
|
||||||
|
|
||||||
### Week 3: Optimization & Testing
|
|
||||||
- [ ] Phase 3: Test imminent windows
|
|
||||||
- [ ] Run sensitivity analysis
|
|
||||||
- [ ] Choose optimal window
|
|
||||||
- [ ] Retrain with new window
|
|
||||||
|
|
||||||
- [ ] Phase 4: Operational metrics
|
|
||||||
- [ ] Compute lead times
|
|
||||||
- [ ] Measure false positive rate
|
|
||||||
- [ ] Per-field performance analysis
|
|
||||||
|
|
||||||
### Week 4: Optional Enhancement
|
|
||||||
- [ ] Phase 5: Add rainfall features (if data available)
|
|
||||||
- [ ] Download precipitation data
|
|
||||||
- [ ] Add drought stress features
|
|
||||||
- [ ] Retrain
|
|
||||||
- [ ] Measure improvement
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Expected Performance Trajectory
|
|
||||||
|
|
||||||
```
|
|
||||||
Current (ESA-only, CI-only):
|
|
||||||
Imminent AUC: 0.8793
|
|
||||||
False positive rate: ~15%
|
|
||||||
|
|
||||||
Phase 1 (All clients):
|
|
||||||
Imminent AUC: 0.90-0.92 (+2-3%)
|
|
||||||
False positive rate: ~12%
|
|
||||||
|
|
||||||
Phase 2 (Add temperature):
|
|
||||||
Imminent AUC: 0.93-0.95 (+3-5% from Phase 1)
|
|
||||||
False positive rate: ~5%
|
|
||||||
|
|
||||||
Phase 3 (Optimize window):
|
|
||||||
Imminent AUC: 0.95-0.96 (+1% from fine-tuning)
|
|
||||||
False positive rate: ~3%
|
|
||||||
|
|
||||||
Phase 4 (Operational tuning):
|
|
||||||
Imminent AUC: 0.95-0.96 (stable)
|
|
||||||
Lead time: 7-10 days
|
|
||||||
Operational readiness: 95%
|
|
||||||
|
|
||||||
Phase 5 (Add rainfall):
|
|
||||||
Imminent AUC: 0.96-0.97 (+1% for drought years)
|
|
||||||
False positive rate: ~2%
|
|
||||||
Operational readiness: 99%
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Takeaways
|
|
||||||
|
|
||||||
1. **Multi-client retraining is the biggest quick win** (5-10% gain with minimal effort)
|
|
||||||
2. **Temperature features are essential** for distinguishing harvest-ready from stress
|
|
||||||
3. **Imminent window tuning** can reduce false positives by 30-50%
|
|
||||||
4. **Operational metrics** matter more than academic metrics (lead time > AUC)
|
|
||||||
5. **Rainfall features** are optional but valuable for drought-prone regions
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
1. **This week**: Run Phase 1 (all-client retrain)
|
|
||||||
2. **Analyze results**: Compare on same fields, measure improvements
|
|
||||||
3. **Plan Phase 2**: Identify temperature data source
|
|
||||||
4. **Schedule Phase 2**: Allocate 3-4 hours for implementation
|
|
||||||
5. **Document findings**: Track AUC, false positive rate, lead time for each phase
|
|
||||||
|
|
||||||
Good luck! This is a solid model with clear paths to improvement. 🚀
|
|
||||||
|
|
@ -1,726 +0,0 @@
|
||||||
# Harvest Detection LSTM - Comprehensive Evaluation & Recommendations
|
|
||||||
|
|
||||||
**Evaluated**: December 8, 2025
|
|
||||||
**Script**: `python_app/harvest_detection_experiments/05_lstm_harvest_detection_pytorch.ipynb`
|
|
||||||
**Status**: ✅ Well-architected, working well. Minor improvements suggested.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Executive Summary (Non-NN Perspective)
|
|
||||||
|
|
||||||
### What This Script Does (Plain Language)
|
|
||||||
|
|
||||||
You have a **time-series pattern recognition system** that watches the Chlorophyll Index (CI) data over a full sugarcane season (300-400+ days) and learns to recognize **two distinct signals**:
|
|
||||||
|
|
||||||
1. **"Harvest is coming soon"** - Detects when CI starts showing harvest-specific patterns (peaks 3-14 days before harvest)
|
|
||||||
2. **"Harvest just happened"** - Confirms when harvest occurred (peaks 1-21 days after harvest boundary)
|
|
||||||
|
|
||||||
**Think of it like**: A doctor learning to recognize symptoms in a patient's blood test over time. The AI sees the full history and learns what "normal seasonal variation" looks like vs. what "harvest imminent" looks like.
|
|
||||||
|
|
||||||
### Current Performance
|
|
||||||
|
|
||||||
| Task | Score | What It Means |
|
|
||||||
|------|-------|---------------|
|
|
||||||
| **Harvest Imminent** | AUC = 0.8793 | 88% accurate at detecting the coming harvest window |
|
|
||||||
| **Harvest Detected** | AUC = 0.9798 | 98% accurate at confirming harvest happened |
|
|
||||||
|
|
||||||
**AUC = Area Under Curve**: Score from 0-1 where 0.5 = guessing randomly, 1.0 = perfect.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Script Walkthrough (What Each Section Does)
|
|
||||||
|
|
||||||
### **Section 1-2: Data Loading & Quality Control** ✅ EXCELLENT
|
|
||||||
|
|
||||||
**What's happening:**
|
|
||||||
- Loads CI data from CSV files (mean values per field per date)
|
|
||||||
- Removes fields with poor data quality (too much linear interpolation = likely bad satellite data)
|
|
||||||
- Removes isolated spike noise (single bad sensor readings)
|
|
||||||
- Filters to seasons ≥300 days (incomplete seasons discarded)
|
|
||||||
|
|
||||||
**Current approach is smart:**
|
|
||||||
- ✅ Linear interpolation detection (R² > 0.95 = suspicious straight line)
|
|
||||||
- ✅ Spike noise removal (isolated outliers replaced with neighbor median)
|
|
||||||
- ✅ Data quality threshold = 85% (meaning up to 85% linear interpolation is tolerated)
|
|
||||||
|
|
||||||
**Assessment**: This is **gold-standard preprocessing**. Most teams skip this and wonder why models fail.
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. **Add temperature/rainfall data** (see suggestions below) - currently missing crucial agronomic variables
|
|
||||||
2. **Document data source**: Where does `lstm_train_data.csv` come from? How is CI calculated?
|
|
||||||
3. **Cloud handling**: Current code notes "CI band = 0" for clouds. Consider separate handling for completely cloudy weeks vs. partial cloud.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 2b: Train/Val/Test Split by Field** ✅ EXCELLENT
|
|
||||||
|
|
||||||
**What's happening:**
|
|
||||||
- Splits entire fields into train/val/test (not individual days within a field)
|
|
||||||
- Prevents **data leakage** (model can't cheat by seeing harvest date of same field in training)
|
|
||||||
|
|
||||||
**Why this matters**:
|
|
||||||
- Wrong: "Split days randomly" → Model learns field-specific patterns, test set from same field → inflated performance
|
|
||||||
- Correct (current): "Split entire fields" → Test on completely unknown fields → true generalization
|
|
||||||
|
|
||||||
**Assessment**: ✅ This is correct and essential.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 3: Build Season Sequences + Next-Season Extension** ✅ CLEVER DESIGN
|
|
||||||
|
|
||||||
**What's happening:**
|
|
||||||
```
|
|
||||||
Original Season 1: [DAY 1 ........ DAY 400]
|
|
||||||
↓ HARVEST
|
|
||||||
Extended Season 1: [DAY 1 ........ DAY 400] + [40 days from Season 2]
|
|
||||||
```
|
|
||||||
|
|
||||||
**Why extend into next season?**
|
|
||||||
- Teaches model: "What does harvest look like?" (end of season 1)
|
|
||||||
- Shows: "What's the boundary?" (harvest line)
|
|
||||||
- Demonstrates: "What's healthy new growth?" (first 40 days of season 2)
|
|
||||||
|
|
||||||
**Assessment**: ✅ Excellent pedagogical design. Model learns full context, not just isolated death of CI.
|
|
||||||
|
|
||||||
**Question**: How many fields actually have next-season data in training? If many don't, this might create a data class imbalance (sequences with extension vs. without).
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 4: Label Harvest Windows** ✅ GOOD, BUT COULD BE TIGHTER
|
|
||||||
|
|
||||||
**Current labels:**
|
|
||||||
- **Imminent**: 3-14 days BEFORE harvest (range = 11 days)
|
|
||||||
- **Detected**: 1-21 days AFTER harvest (range = 20 days)
|
|
||||||
|
|
||||||
**Assessment**:
|
|
||||||
- ✅ Good: Imminent window is now "tight" (was 7-30 days, improved to 3-14)
|
|
||||||
- ⚠️ Issue: Still overlaps with natural seasonal decline. CI naturally dips before maturity.
|
|
||||||
- ✅ Good: Detected window is wide (1-21 days = ~3 weeks), perfect for weekly operations
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. **Consider even tighter imminent**: 7-14 days? Or 10-21 days? Test both:
|
|
||||||
- 3-14 = very early warning (more false positives, more lead time)
|
|
||||||
- 7-14 = balanced warning (moderate lead time, fewer false alarms)
|
|
||||||
- 10-21 = late warning (high precision, less lead time)
|
|
||||||
|
|
||||||
2. **Add "harvest_probable"** (5-30 days before): Intermediate confidence signal
|
|
||||||
- Used for secondary alerts ("harvest likely in 2-4 weeks, get ready")
|
|
||||||
- Less strict than "imminent" but more specific than nothing
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 5: Feature Engineering** ✅ GOOD, COULD ADD AGRONOMIC FEATURES
|
|
||||||
|
|
||||||
**Current 7 features derived from CI:**
|
|
||||||
|
|
||||||
| Feature | Purpose |
|
|
||||||
|---------|---------|
|
|
||||||
| CI | Raw chlorophyll |
|
|
||||||
| 7d Velocity | Rate of change (fast = harvest signal) |
|
|
||||||
| 7d Acceleration | Change in rate (inflection points) |
|
|
||||||
| 14d MA | Smoothed trend |
|
|
||||||
| 14d Velocity | Longer-term slope |
|
|
||||||
| 7d Minimum | Catches crashes (harvest = minimum) |
|
|
||||||
| Velocity Magnitude | Speed of change (direction-independent) |
|
|
||||||
|
|
||||||
**Assessment**: ✅ These are harvest-relevant. Model should learn "drop to minimum" = harvest.
|
|
||||||
|
|
||||||
**Recommendations - ADD THESE FEATURES** (if data available):
|
|
||||||
|
|
||||||
1. **Temperature/Growing Degree Days (GDD)**
|
|
||||||
- Harvest timing correlates with accumulated heat
|
|
||||||
- Add: `gdd_cumulative`, `daily_temp_anomaly` (vs. seasonal average)
|
|
||||||
- Why: Sugarcane growth is temperature-dependent. Cold = slower ripening.
|
|
||||||
|
|
||||||
2. **Rainfall/Moisture Stress**
|
|
||||||
- Drought = earlier maturity (harvest signal)
|
|
||||||
- Add: `rainfall_7d`, `soil_moisture_deficit`
|
|
||||||
- Why: Water availability affects CI and harvest readiness
|
|
||||||
|
|
||||||
3. **Day-of-Year (DOY) Cyclical Encoding**
|
|
||||||
- Current: Uses raw day number (doesn't wrap around)
|
|
||||||
- Add: `sin(2π*doy/365)`, `cos(2π*doy/365)` (cyclical encoding)
|
|
||||||
- Why: Day 364 should be close to day 1 (Dec 31 ≈ Jan 1), but raw values are far apart
|
|
||||||
|
|
||||||
4. **Seasonal CI Statistics**
|
|
||||||
- `ci_percentile_of_season`: Where is current CI relative to this season's range?
|
|
||||||
- `ci_distance_to_peak`: How far from season's peak CI?
|
|
||||||
- Why: Harvest = minimum relative to season, not absolute minimum
|
|
||||||
|
|
||||||
5. **Derivative Features Already Missing**:
|
|
||||||
- ~~7-day minimum~~ ✅ You have this
|
|
||||||
- Velocity magnitude ✅ You have this
|
|
||||||
- ~~Variance over 7 days~~: `ci_std_7d` (detects smoothness vs. volatility)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 6: Normalization** ✅ CORRECT
|
|
||||||
|
|
||||||
**What's happening:**
|
|
||||||
- Each of 7 features normalized independently to [0, 1] using MinMaxScaler
|
|
||||||
- Scaler trained on training set only (prevents data leakage)
|
|
||||||
- NaN/Inf handled properly
|
|
||||||
|
|
||||||
**Assessment**: ✅ Correct. This is standard practice.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 7: PyTorch Dataset & Dynamic Padding** ✅ EXCELLENT
|
|
||||||
|
|
||||||
**What's happening:**
|
|
||||||
- Sequences have variable length (300-400+ days)
|
|
||||||
- No fixed-length padding; each batch pads to its longest sequence only
|
|
||||||
- Mask created to ignore padding in loss calculation
|
|
||||||
|
|
||||||
**Why this matters:**
|
|
||||||
- ❌ Wrong approach: Zero-pad all sequences to 500 days → Wastes memory, adds noise
|
|
||||||
- ✅ Correct approach (current): Pad to batch max → Efficient, no artificial padding noise
|
|
||||||
|
|
||||||
**Assessment**: ✅ This is the right way to handle variable-length sequences.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 8: LSTM Architecture** ⚠️ GOOD BUT COULD BE MORE SOPHISTICATED
|
|
||||||
|
|
||||||
**Current architecture:**
|
|
||||||
```
|
|
||||||
Input: (batch, seq_len, 7 features)
|
|
||||||
↓
|
|
||||||
LSTM: 64 hidden units, 1 layer, 50% dropout
|
|
||||||
↓
|
|
||||||
Head 1: Linear(64 → 16) + ReLU + Dropout → Sigmoid → Imminent prob
|
|
||||||
Head 2: Linear(64 → 16) + ReLU + Dropout → Sigmoid → Detected prob
|
|
||||||
↓
|
|
||||||
Output: (batch, seq_len, 1) per head
|
|
||||||
```
|
|
||||||
|
|
||||||
**Assessment**:
|
|
||||||
- ✅ Unidirectional LSTM is correct (must predict forward in time for operational use)
|
|
||||||
- ✅ Dual output heads are good (two related tasks)
|
|
||||||
- ⚠️ Model is quite **small** (64 hidden units, 1 layer)
|
|
||||||
- ⚠️ No attention mechanism (would help focus on key harvest-timing features)
|
|
||||||
|
|
||||||
**Recommendations:**
|
|
||||||
|
|
||||||
1. **Experiment with model sizes** (if not already done):
|
|
||||||
```python
|
|
||||||
# Current
|
|
||||||
LSTM(input_size=7, hidden_size=64, num_layers=1)
|
|
||||||
|
|
||||||
# Try these:
|
|
||||||
- LSTM(input_size=7, hidden_size=128, num_layers=2) # Bigger
|
|
||||||
- LSTM(input_size=7, hidden_size=32, num_layers=1) # Smaller (test efficiency)
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Add Attention Layer** (advanced, optional):
|
|
||||||
```python
|
|
||||||
# After LSTM, before output heads:
|
|
||||||
attention_weights = SoftmaxAttention(lstm_out) # Learn which timesteps matter
|
|
||||||
context_vector = weighted_sum(lstm_out, attention_weights)
|
|
||||||
# This helps model focus on harvest-critical weeks
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Consider Bidirectional LSTM for analysis** (NOT operational):
|
|
||||||
- During training/validation: Use bidirectional (sees full season)
|
|
||||||
- During operational prediction: Switch to unidirectional (only past data)
|
|
||||||
- This gives model more context during training
|
|
||||||
|
|
||||||
4. **Add Residual Connections** (if expanding to 2+ layers):
|
|
||||||
```python
|
|
||||||
lstm_out = lstm_out + input # Skip connection
|
|
||||||
# Helps gradient flow in deeper networks
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 9: Training** ✅ SOLID
|
|
||||||
|
|
||||||
**What's happening:**
|
|
||||||
- Optimizer: Adam (standard, good choice)
|
|
||||||
- Loss: Focal Binary Cross-Entropy (handles class imbalance)
|
|
||||||
- Class weights: Imminent gets 5-8x weight (rare positive class)
|
|
||||||
- Early stopping: patience=20 (stop if val loss doesn't improve)
|
|
||||||
- Gradient clipping: max_norm=1.0 (prevents exploding gradients)
|
|
||||||
|
|
||||||
**Assessment**: ✅ All reasonable choices. Shows good NN practices.
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. **Log loss curves** (appears to be done)
|
|
||||||
2. **Check if early stopping triggered**: Did training stop at 100 epochs or before?
|
|
||||||
3. **Consider learning rate schedule**: Currently fixed at 0.001
|
|
||||||
- Could decay: `lr = 0.001 * (0.95 ** epoch)` after 50 epochs
|
|
||||||
- Helps fine-tuning in later training phases
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Section 10: Evaluation** ✅ GOOD STARTING POINT
|
|
||||||
|
|
||||||
**Current metrics:**
|
|
||||||
- Classification report (precision, recall, F1)
|
|
||||||
- ROC-AUC scores
|
|
||||||
- Confusion matrices
|
|
||||||
|
|
||||||
**Assessment**: ✅ Standard metrics. Good baseline.
|
|
||||||
|
|
||||||
**Recommendations - Add These Metrics:**
|
|
||||||
|
|
||||||
1. **Per-field performance** (not just overall):
|
|
||||||
```python
|
|
||||||
for field in test_fields:
|
|
||||||
field_preds = predictions[field_indices]
|
|
||||||
field_labels = labels[field_indices]
|
|
||||||
auc = roc_auc_score(field_labels, field_preds)
|
|
||||||
print(f"{field}: AUC = {auc:.4f}")
|
|
||||||
```
|
|
||||||
Why: Might perform well on some fields, poorly on others. Reveals data quality issues.
|
|
||||||
|
|
||||||
2. **Temporal distance to harvest** (operational metric):
|
|
||||||
```python
|
|
||||||
imminent_triggers = np.where(imminent_pred > 0.5)[0]
|
|
||||||
harvest_date_idx = ...
|
|
||||||
days_before_harvest = harvest_date_idx - imminent_triggers[-1]
|
|
||||||
print(f"Model predicted {days_before_harvest} days before harvest")
|
|
||||||
```
|
|
||||||
Why: For operations, you care "Did we warn farmer in time?" not just AUC.
|
|
||||||
|
|
||||||
3. **False positive rate per field-season**:
|
|
||||||
```python
|
|
||||||
false_positives = sum((pred > 0.5) & (label == 0))
|
|
||||||
positives = sum(pred > 0.5)
|
|
||||||
false_positive_rate = false_positives / positives
|
|
||||||
```
|
|
||||||
Why: Farmers don't want 10 false alarms per season.
|
|
||||||
|
|
||||||
4. **Lead time analysis**:
|
|
||||||
```
|
|
||||||
For each harvest:
|
|
||||||
- How many days before did model predict?
|
|
||||||
- Was it in the 3-14 day window?
|
|
||||||
- Too early (>14d) or too late (<3d)?
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Sections 11: Visualizations** ✅ EXCELLENT
|
|
||||||
|
|
||||||
**Current visualizations:**
|
|
||||||
- Single sequence with CI + ground truth + model predictions
|
|
||||||
- Multiple sequences in grid view
|
|
||||||
- Confusion matrices
|
|
||||||
|
|
||||||
**Assessment**: ✅ Very informative. Shows model behavior clearly.
|
|
||||||
|
|
||||||
**Observations from the code:**
|
|
||||||
- Dual-axis plots (CI on left, predictions on right) - great design
|
|
||||||
- Threshold crossing detection (shows when model would trigger)
|
|
||||||
- Clear distinction between true positive windows and false positives
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### **Sections 12: Model Saving** ✅ GOOD
|
|
||||||
|
|
||||||
**What's saved:**
|
|
||||||
- Model weights (.pt file)
|
|
||||||
- Feature scalers (.pkl file)
|
|
||||||
- Configuration (.json file)
|
|
||||||
- Metadata CSV files
|
|
||||||
|
|
||||||
**Assessment**: ✅ Reproducible. Everything needed to deploy is saved.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Data Quality & Cleaning - Deep Dive
|
|
||||||
|
|
||||||
### Linear Interpolation Detection ✅ EXCELLENT
|
|
||||||
|
|
||||||
The script detects data quality issues by looking for suspiciously straight lines in the time series.
|
|
||||||
|
|
||||||
**How it works:**
|
|
||||||
1. Uses sliding 30-day windows
|
|
||||||
2. Fits linear regression to each window: R² = correlation squared
|
|
||||||
3. If R² > 0.95, window is "suspiciously linear" = likely interpolated
|
|
||||||
4. Removes seasons where >85% of windows are linear
|
|
||||||
|
|
||||||
**Example:**
|
|
||||||
```
|
|
||||||
Good data (natural variation): R² = 0.70 (realistic noise)
|
|
||||||
Interpolated (straight line): R² = 0.98 (suspiciously smooth)
|
|
||||||
```
|
|
||||||
|
|
||||||
**Assessment**: ✅ This is smart. Prevents training on synthetic data.
|
|
||||||
|
|
||||||
**Suggestion**: Document the threshold (85%). Consider visualizing before/after for a few fields.
|
|
||||||
|
|
||||||
### Spike Noise Removal ✅ CLEVER
|
|
||||||
|
|
||||||
**How it works:**
|
|
||||||
1. For each point, checks if it's isolated from neighbors (2-day window)
|
|
||||||
2. If |value - median_neighbors| > 2.5 * std, replace with median
|
|
||||||
3. Example: [10.2, 9.8, 8.5, 9.9, 10.1] → [10.2, 9.8, 9.9, 9.9, 10.1]
|
|
||||||
(8.5 is obvious outlier; smoothed to 9.9)
|
|
||||||
|
|
||||||
**Assessment**: ✅ Good approach. Removes sensor noise without over-smoothing.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Test Results Analysis
|
|
||||||
|
|
||||||
### AUC Scores
|
|
||||||
|
|
||||||
| Task | AUC | Notes |
|
|
||||||
|------|-----|-------|
|
|
||||||
| Imminent | 0.8793 | Good but not perfect |
|
|
||||||
| Detected | 0.9798 | Excellent (nearly perfect) |
|
|
||||||
|
|
||||||
**What these mean:**
|
|
||||||
- **Detected = 0.98**: Out of 100 random harvest-confirmed vs. non-confirmed days, model ranks confirmed days higher 98% of the time
|
|
||||||
- **Imminent = 0.88**: Same logic, but imminent signal is less clear (more affected by seasonal variation)
|
|
||||||
|
|
||||||
### Why Imminent < Detected
|
|
||||||
|
|
||||||
| Aspect | Imminent | Detected |
|
|
||||||
|--------|----------|----------|
|
|
||||||
| **Signal clarity** | 🟡 Ambiguous (harvest time varies by variety/environment) | 🟢 Clear (harvest boundary is definite point) |
|
|
||||||
| **Class imbalance** | 🔴 Severe (11 days labeled out of 300+) | 🟡 Moderate (20 days labeled out of 300+) |
|
|
||||||
| **Natural variation** | 🔴 High (seasonal decline looks like harvest) | 🟢 Low (harvest is unique transition) |
|
|
||||||
|
|
||||||
**This is expected and acceptable.**
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Findings: Strengths & Weaknesses
|
|
||||||
|
|
||||||
### ✅ STRENGTHS
|
|
||||||
|
|
||||||
1. **Excellent data preprocessing**
|
|
||||||
- Linear interpolation detection
|
|
||||||
- Spike noise removal
|
|
||||||
- Quality filtering
|
|
||||||
|
|
||||||
2. **No data leakage**
|
|
||||||
- Split by field (entire fields to test, not individual days)
|
|
||||||
- Scalers fit on training only
|
|
||||||
- Proper sequence boundaries
|
|
||||||
|
|
||||||
3. **Thoughtful architecture**
|
|
||||||
- Variable-length sequences with dynamic padding
|
|
||||||
- Dual-output for two related tasks
|
|
||||||
- Appropriate loss function (focal BCE for imbalance)
|
|
||||||
- Per-timestep predictions (not just last timestep)
|
|
||||||
|
|
||||||
4. **Good visualizations**
|
|
||||||
- Shows model behavior on individual sequences
|
|
||||||
- Easy to spot false positives
|
|
||||||
|
|
||||||
### ⚠️ WEAKNESSES & LIMITATIONS
|
|
||||||
|
|
||||||
1. **Limited input features** (only 7 derived from CI)
|
|
||||||
- Missing: Temperature, rainfall, soil moisture, phenological stage
|
|
||||||
- CI alone may not capture all harvest signals
|
|
||||||
- Especially for stress-driven early harvest
|
|
||||||
|
|
||||||
2. **Small training dataset** (currently ESA-only)
|
|
||||||
- 2-3 fields, ~8-10 seasons = ~2,000 training days
|
|
||||||
- Limited diversity (single climate region)
|
|
||||||
- Model may overfit to ESA-specific patterns
|
|
||||||
- **Solution**: Retrain on all clients (50+ seasons, 10,000+ days)
|
|
||||||
|
|
||||||
3. **Imminent signal has false positives**
|
|
||||||
- Observations show imminent peaks during mid-season decline
|
|
||||||
- Expected: Peak 3-14 days before harvest
|
|
||||||
- Actual: Peaks multiple times during season
|
|
||||||
- Likely because natural CI decline "looks like" harvest decline
|
|
||||||
- **Partial solution**: Tighter imminent window (7-14 instead of 3-14)
|
|
||||||
- **Better solution**: Add temperature/seasonal features to distinguish types of decline
|
|
||||||
|
|
||||||
4. **No confidence intervals**
|
|
||||||
- Model outputs single probability, not range
|
|
||||||
- Operational: "89% confidence" better than "0.89 probability"
|
|
||||||
- Consider: Bayesian LSTM or ensemble
|
|
||||||
|
|
||||||
5. **Limited evaluation on inter-client generalization**
|
|
||||||
- Only tested on one client's fields
|
|
||||||
- Unknown how it performs on chemba, bagamoyo, etc.
|
|
||||||
- Different climates, varieties, management → Different CI patterns
|
|
||||||
|
|
||||||
6. **No temporal validation**
|
|
||||||
- All test data is from past (2020-2023)
|
|
||||||
- Unknown: Will it work on 2024 data? 2025?
|
|
||||||
- Requires: Forward validation on newer seasons
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Specific Recommendations by Priority
|
|
||||||
|
|
||||||
### 🔴 HIGH PRIORITY (Do First)
|
|
||||||
|
|
||||||
#### 1. **Retrain on All Clients** (Quick, High-Impact)
|
|
||||||
**Why**: ESA-only model shows false imminent triggers on seasonal dips. All-client training adds diversity.
|
|
||||||
|
|
||||||
**Steps**:
|
|
||||||
1. In Section 2, change `CLIENT_FILTER = 'esa'` → `CLIENT_FILTER = None`
|
|
||||||
2. Re-run Sections 2-12
|
|
||||||
3. Evaluate same fields (00F52, 00308) to see if imminent signal improves
|
|
||||||
|
|
||||||
**Expected gain**: 5-10% fewer false imminent positives, better generalization
|
|
||||||
|
|
||||||
**Effort**: 30 minutes to run, 2 hours to analyze
|
|
||||||
|
|
||||||
#### 2. **Add Temperature Data** (Medium Effort, High Value)
|
|
||||||
**Why**: Harvest timing strongly correlates with accumulated heat. CI decline during cold weather is different from harvest decline.
|
|
||||||
|
|
||||||
**Steps**:
|
|
||||||
1. Find temperature data source (ECMWF, NOAA, or local station)
|
|
||||||
2. Merge with CI data by date/location
|
|
||||||
3. Add features:
|
|
||||||
```python
|
|
||||||
gdd = cumsum(max(0, daily_temp - baseline_temp)) # Growing Degree Days
|
|
||||||
temp_anomaly = current_temp - seasonal_avg_temp
|
|
||||||
```
|
|
||||||
4. Update feature count from 7 → 9
|
|
||||||
5. Retrain
|
|
||||||
|
|
||||||
**Expected gain**: 10-15% improvement on imminent signal, better handles off-season decline
|
|
||||||
|
|
||||||
**Effort**: 2-3 hours (depends on data availability)
|
|
||||||
|
|
||||||
#### 3. **Add Tighter Imminent Window** (Quick)
|
|
||||||
**Why**: Current 3-14d window includes natural seasonal decline (7-30d would be too wide).
|
|
||||||
|
|
||||||
**Steps**:
|
|
||||||
1. In Section 4, try these imminent windows:
|
|
||||||
- 7-14 days (conservative, high precision)
|
|
||||||
- 10-21 days (moderate)
|
|
||||||
- 3-7 days (ultra-aggressive, early warning)
|
|
||||||
2. Compare AUC, false positives, lead time on test set
|
|
||||||
|
|
||||||
**Expected gain**: Reduce false positive rate 30-50%
|
|
||||||
|
|
||||||
**Effort**: 20 minutes
|
|
||||||
|
|
||||||
### 🟡 MEDIUM PRIORITY (Do Next)
|
|
||||||
|
|
||||||
#### 4. **Per-Field Performance Analysis** (Quick)
|
|
||||||
**Why**: Model might excel on some fields and fail on others. Reveals which fields need attention.
|
|
||||||
|
|
||||||
**Code**:
|
|
||||||
```python
|
|
||||||
for field in test_fields:
|
|
||||||
field_mask = meta_test['field'] == field
|
|
||||||
field_auc_imm = roc_auc_score(test_labels_imminent[field_mask],
|
|
||||||
test_preds_imminent[field_mask])
|
|
||||||
print(f"{field:15s} Imminent AUC: {field_auc_imm:.4f}")
|
|
||||||
```
|
|
||||||
|
|
||||||
**Expected gain**: Identify problem fields, focus data collection efforts
|
|
||||||
|
|
||||||
**Effort**: 15 minutes
|
|
||||||
|
|
||||||
#### 5. **Add Rainfall/Moisture Features** (Medium Effort)
|
|
||||||
**Why**: Drought stress accelerates maturity. Water stress CI patterns differ from normal decline.
|
|
||||||
|
|
||||||
**Similar to temperature**:
|
|
||||||
1. Find rainfall data (CHIRPS, local stations)
|
|
||||||
2. Add: `rainfall_7d`, `moisture_deficit`, `drought_stress_index`
|
|
||||||
3. Retrain
|
|
||||||
|
|
||||||
**Expected gain**: 5-10% improvement, especially for drought years
|
|
||||||
|
|
||||||
**Effort**: 2-3 hours (if data accessible)
|
|
||||||
|
|
||||||
#### 6. **Add Operational Metrics** (Quick)
|
|
||||||
**Why**: AUC is good, but farmers care "Did we warn in time?"
|
|
||||||
|
|
||||||
**Code**:
|
|
||||||
```python
|
|
||||||
# For each sequence, measure lead time
|
|
||||||
lead_times = []
|
|
||||||
for seq_idx, seq in enumerate(test_sequences_labeled):
|
|
||||||
harvest_idx = ... # find harvest
|
|
||||||
trigger_idx = np.where(imminent_pred > 0.5)[0]
|
|
||||||
if len(trigger_idx) > 0:
|
|
||||||
lead_time = harvest_idx - trigger_idx[-1]
|
|
||||||
lead_times.append(lead_time)
|
|
||||||
|
|
||||||
print(f"Mean lead time: {np.mean(lead_times):.1f} days")
|
|
||||||
print(f"Std lead time: {np.std(lead_times):.1f} days")
|
|
||||||
```
|
|
||||||
|
|
||||||
**Expected gain**: Understand operational viability
|
|
||||||
|
|
||||||
**Effort**: 30 minutes
|
|
||||||
|
|
||||||
### 🟢 LOW PRIORITY (Nice to Have)
|
|
||||||
|
|
||||||
#### 7. **Bidirectional LSTM for Benchmarking**
|
|
||||||
**Why**: See how much extra context helps during training (can't use in operations).
|
|
||||||
|
|
||||||
**Expected gain**: 2-5% AUC improvement (academic interest only)
|
|
||||||
|
|
||||||
**Effort**: 1-2 hours
|
|
||||||
|
|
||||||
#### 8. **Attention Mechanism**
|
|
||||||
**Why**: Helps model learn which weeks matter most for harvest.
|
|
||||||
|
|
||||||
**Expected gain**: Better interpretability, possible 2-3% AUC improvement
|
|
||||||
|
|
||||||
**Effort**: 3-4 hours
|
|
||||||
|
|
||||||
#### 9. **Ensemble Model**
|
|
||||||
**Why**: Combine multiple models for robustness.
|
|
||||||
|
|
||||||
**Expected gain**: 1-2% AUC improvement, better uncertainty estimates
|
|
||||||
|
|
||||||
**Effort**: 2-3 hours
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Sugarcane Agronomic Context (For Model Improvement)
|
|
||||||
|
|
||||||
To improve the model further, understand these facts about sugarcane:
|
|
||||||
|
|
||||||
### Growth Stages
|
|
||||||
1. **Germination** (0-30 days): Low CI
|
|
||||||
2. **Tillering** (30-120 days): CI rises rapidly
|
|
||||||
3. **Grand Growth** (120-300 days): CI peaks, rapid biomass accumulation
|
|
||||||
4. **Ripening** (300+ days): CI stable or slight decline
|
|
||||||
5. **Harvest-ready** (350+ days): Clear CI minimum + specific patterns
|
|
||||||
|
|
||||||
**Model implication**: Need to distinguish "ripening decline" (stages 4-5) from "stress decline" (drought, frost) at other times.
|
|
||||||
|
|
||||||
### Environmental Factors Affecting CI & Harvest
|
|
||||||
|
|
||||||
| Factor | Effect on CI | Effect on Harvest | How to Model |
|
|
||||||
|--------|------------|-----------------|------------|
|
|
||||||
| **Temperature** | Warm → CI up, Cold → CI down | >Heat days = earlier mature | Add GDD, temp anomaly |
|
|
||||||
| **Rainfall** | Rain → CI up, Drought → CI down | Drought = earlier mature | Add rainfall, moisture deficit |
|
|
||||||
| **Soil Type** | Rich → higher CI | Affects growth rate | Field-specific features |
|
|
||||||
| **Variety** | Affects CI baseline | Affects growth duration | Variety encoding |
|
|
||||||
| **Latitude/Season** | Day-length effect | Affects phenology | DOY + latitude encoding |
|
|
||||||
|
|
||||||
**Current model limitation**: Only sees CI, misses these drivers. Temperature feature would help enormously.
|
|
||||||
|
|
||||||
### Why CI Alone Is Imperfect
|
|
||||||
|
|
||||||
```
|
|
||||||
Scenario 1: Normal Ripening (SHOULD trigger "imminent")
|
|
||||||
- Temperature: Moderate
|
|
||||||
- Rainfall: Normal
|
|
||||||
- CI: Steady decline over 2 weeks
|
|
||||||
- Decision: YES, harvest imminent
|
|
||||||
|
|
||||||
Scenario 2: Drought Stress (FALSE POSITIVE)
|
|
||||||
- Temperature: High
|
|
||||||
- Rainfall: Low
|
|
||||||
- CI: Steady decline over 2 weeks ← Looks identical!
|
|
||||||
- Decision: NO, stress, not harvest-ready (crops need water)
|
|
||||||
|
|
||||||
Problem: CI decline looks the same; must distinguish context.
|
|
||||||
Solution: Add temperature + rainfall features
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Data & Code Quality Assessment
|
|
||||||
|
|
||||||
### ✅ Code Quality
|
|
||||||
- Well-commented
|
|
||||||
- Organized into logical sections
|
|
||||||
- Proper error handling (NaN, Inf)
|
|
||||||
- Reproducible (seeds set, configs saved)
|
|
||||||
- Professional PyTorch practices
|
|
||||||
|
|
||||||
### ✅ Documentation
|
|
||||||
- Docstrings for major functions
|
|
||||||
- Print statements show progress clearly
|
|
||||||
- Saved configuration files
|
|
||||||
|
|
||||||
### ⚠️ Could Improve
|
|
||||||
1. No unit tests (though not critical for research)
|
|
||||||
2. No logging to file (all output to stdout only)
|
|
||||||
3. Hardcoded thresholds (0.5 probability, 2.5 std, 14 days, etc.) - consider `config.yaml`
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Deployment & Operational Readiness
|
|
||||||
|
|
||||||
### Ready for Production? ⚠️ PARTIAL
|
|
||||||
|
|
||||||
**✅ Ready:**
|
|
||||||
- Data preprocessing solid
|
|
||||||
- Model architecture sound
|
|
||||||
- Evaluation metrics reasonable
|
|
||||||
- Code is clean and reproducible
|
|
||||||
|
|
||||||
**⚠️ Not quite:**
|
|
||||||
- Imminent signal has false positives (needs all-client retraining or temperature feature)
|
|
||||||
- Limited to one client (ESA-only)
|
|
||||||
- No confidence intervals or uncertainty quantification
|
|
||||||
- No forward temporal validation (unknown on 2024/2025 data)
|
|
||||||
|
|
||||||
### To Deploy
|
|
||||||
|
|
||||||
1. **Retrain on all clients** (reduces false positives)
|
|
||||||
2. **Test on held-out recent data** (2024 if available)
|
|
||||||
3. **Implement threshold tuning** (maybe 0.7 instead of 0.5 probability)
|
|
||||||
4. **Create monitoring dashboard**:
|
|
||||||
- Weekly alerts per field
|
|
||||||
- False positive tracking
|
|
||||||
- Lead time statistics
|
|
||||||
5. **Add feedback loop**: After harvest, measure accuracy, retrain quarterly
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick-Start Recommendations (In Order)
|
|
||||||
|
|
||||||
### Week 1
|
|
||||||
1. ✅ Change `CLIENT_FILTER = None` and retrain
|
|
||||||
2. ✅ Evaluate on same fields, compare imminent behavior
|
|
||||||
3. ✅ Run per-field performance analysis
|
|
||||||
|
|
||||||
### Week 2
|
|
||||||
4. 🔄 Get temperature data + merge with CI
|
|
||||||
5. 🔄 Add GDD and temperature anomaly features
|
|
||||||
6. 🔄 Retrain with 9 features instead of 7
|
|
||||||
|
|
||||||
### Week 3
|
|
||||||
7. 🔄 Test different imminent windows (7-14d, 10-21d)
|
|
||||||
8. 🔄 Add operational metrics (lead time, false positive rate)
|
|
||||||
9. 🔄 Create visualizations of best configuration
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary Table: Feature Ideas
|
|
||||||
|
|
||||||
| Feature | Source | Priority | Impact | Effort |
|
|
||||||
|---------|--------|----------|--------|--------|
|
|
||||||
| **GDD (Growing Degree Days)** | Temperature data | 🔴 High | High (10-15% gain) | Medium |
|
|
||||||
| **Rainfall (7d)** | Precipitation data | 🔴 High | Medium (5-10% gain) | Medium |
|
|
||||||
| **Soil Moisture Deficit** | Agricultural data | 🟡 Medium | High (10% gain) | High |
|
|
||||||
| **Day-of-Year (cyclic)** | Computed | 🟡 Medium | Low (2-3% gain) | Low |
|
|
||||||
| **CI percentile** | Computed | 🟡 Medium | Medium (5% gain) | Low |
|
|
||||||
| **Variety/Field ID** | Metadata | 🟡 Medium | Medium (3% gain) | Low |
|
|
||||||
| **Latitude/Climate Zone** | Metadata | 🟢 Low | Low (1% gain) | Low |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Final Assessment
|
|
||||||
|
|
||||||
### Overall Score: **8.5/10**
|
|
||||||
|
|
||||||
**This is a well-engineered harvest detection system.** The architecture is sound, data preprocessing is excellent, and results are promising. Main limitation is feature richness (CI alone) and single-client training.
|
|
||||||
|
|
||||||
### Quick Wins (Do These Next)
|
|
||||||
1. Retrain on all clients → Likely 5-10% performance gain
|
|
||||||
2. Add temperature features → Likely 10-15% gain on imminent signal
|
|
||||||
3. Test tighter imminent window → Likely 30% reduction in false positives
|
|
||||||
|
|
||||||
### Path to Production
|
|
||||||
- Current state: **Research prototype** (80% ready)
|
|
||||||
- After client retraining: **Pilot ready** (90% ready)
|
|
||||||
- After temperature features: **Production ready** (95% ready)
|
|
||||||
- After forward validation on 2024 data: **Fully operational** (99% ready)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Questions?** Contact data science team for implementation details.
|
|
||||||
|
|
@ -1,195 +0,0 @@
|
||||||
# SmartCane Project - Package Management & Crop Analysis
|
|
||||||
|
|
||||||
## Quick Start
|
|
||||||
|
|
||||||
### For New Team Members
|
|
||||||
1. Open R/RStudio
|
|
||||||
2. Set working directory to the project root: `setwd("path/to/smartcane")`
|
|
||||||
3. Run: `source("r_app/package_manager.R")`
|
|
||||||
4. Type `y` when prompted
|
|
||||||
5. Wait for completion ✅
|
|
||||||
|
|
||||||
### For Existing Team Members (After Git Pull)
|
|
||||||
Same steps as above - the script will check for updates automatically.
|
|
||||||
|
|
||||||
## Crop Analysis Messaging System
|
|
||||||
|
|
||||||
### Overview
|
|
||||||
The `crop_analysis_messaging.R` script provides automated field monitoring with intelligent alerting based on crop index (CI) analysis. It compares weekly satellite data to detect uniformity issues and performance changes.
|
|
||||||
|
|
||||||
### Message Determination Logic
|
|
||||||
|
|
||||||
#### **Uniformity Thresholds (CV = Coefficient of Variation)**
|
|
||||||
| Threshold | CV Range | Category | Action |
|
|
||||||
|-----------|----------|----------|---------|
|
|
||||||
| Excellent | ≤ 0.08 | ✅ Excellent | Monitor only |
|
|
||||||
| Good | 0.08 - 0.15 | ✅ Good | Monitor only |
|
|
||||||
| Moderate | 0.15 - 0.25 | ⚠️ Alert | Review management |
|
|
||||||
| Poor | > 0.25 | 🚨 Urgent | Immediate action |
|
|
||||||
|
|
||||||
#### **Additional Alert Triggers**
|
|
||||||
- **Low Acceptable Area**: < 40% of field within ±25% of mean CI
|
|
||||||
- **Very Strong Clustering**: Moran's I > 0.95 (indicates management issues)
|
|
||||||
- **Declining Trends**: Combined with poor uniformity = escalated urgency
|
|
||||||
|
|
||||||
#### **Message Categories & Examples**
|
|
||||||
|
|
||||||
**🚨 URGENT Messages:**
|
|
||||||
- `🚨 URGENT: Poor field uniformity detected - immediate management review required`
|
|
||||||
- `🚨 CRITICAL: Poor uniformity with declining trend - emergency intervention needed`
|
|
||||||
|
|
||||||
**⚠️ ALERT Messages:**
|
|
||||||
- `⚠️ Alert: Moderate field variation detected - low acceptable area - review management uniformity`
|
|
||||||
- `⚠️ Alert: Good uniformity but very strong clustering detected - check management practices`
|
|
||||||
|
|
||||||
**✅ POSITIVE Messages:**
|
|
||||||
- `✅ Excellent: Optimal field uniformity and stability`
|
|
||||||
- `✅ Great: Good uniformity with improvement trend`
|
|
||||||
|
|
||||||
**💡 OPPORTUNITY Messages:**
|
|
||||||
- `💡 Opportunity: X% of field performing well - replicate conditions in remaining areas`
|
|
||||||
|
|
||||||
#### **Spatial Pattern Analysis**
|
|
||||||
- **Moran's I Interpretation**: Measures spatial autocorrelation
|
|
||||||
- 0.7-0.85: Normal field continuity
|
|
||||||
- 0.85-0.95: Strong spatial pattern (monitor)
|
|
||||||
- >0.95: Very strong clustering (management concern)
|
|
||||||
|
|
||||||
#### **Farm-Wide Summary Statistics**
|
|
||||||
- **Field Distribution**: Percentage in each uniformity category
|
|
||||||
- **Area Analysis**: Hectares improving/declining/stable
|
|
||||||
- **Consistency Checks**: Flags inconsistent CV vs Entropy readings
|
|
||||||
- **Alert Rate**: Typically 15-20% of fields for healthy monitoring
|
|
||||||
|
|
||||||
## Package Management
|
|
||||||
|
|
||||||
### What This Script Does
|
|
||||||
|
|
||||||
1. **Initializes renv** - Creates isolated package environment
|
|
||||||
2. **Checks package versions** - Compares installed vs required
|
|
||||||
3. **Installs/Updates packages** - Only if needed
|
|
||||||
4. **Creates lockfile** - `renv.lock` for exact reproducibility
|
|
||||||
5. **Generates reports** - Console output + `package_manager.log`
|
|
||||||
|
|
||||||
### Key Features
|
|
||||||
|
|
||||||
- ✅ **Minimum version requirements** (allows patch updates)
|
|
||||||
- ✅ **Critical package locking** (tmap v4 for new syntax)
|
|
||||||
- ✅ **Automatic installation** of missing packages
|
|
||||||
- ✅ **Console + Log output** for debugging
|
|
||||||
- ✅ **Cross-platform compatibility**
|
|
||||||
|
|
||||||
### Required Packages & Versions
|
|
||||||
|
|
||||||
| Package | Min Version | Purpose |
|
|
||||||
|---------|-------------|---------|
|
|
||||||
| tmap | 4.0.0 | **CRITICAL** - New syntax used |
|
|
||||||
| tidyverse | 2.0.0 | Data manipulation |
|
|
||||||
| sf | 1.0.0 | Spatial data |
|
|
||||||
| terra | 1.7.0 | Raster processing |
|
|
||||||
| rmarkdown | 2.21.0 | Report generation |
|
|
||||||
| spdep | 1.2.0 | **NEW** - Spatial statistics for Moran's I |
|
|
||||||
|
|
||||||
### Crop Analysis Usage
|
|
||||||
|
|
||||||
```r
|
|
||||||
# Basic usage (defaults to weeks 30 vs 29, simba farm)
|
|
||||||
Rscript crop_analysis_messaging.R
|
|
||||||
|
|
||||||
# Custom analysis
|
|
||||||
Rscript crop_analysis_messaging.R 32 31 other_farm
|
|
||||||
|
|
||||||
# Or from R console
|
|
||||||
source("r_app/crop_analysis_messaging.R")
|
|
||||||
```
|
|
||||||
|
|
||||||
#### **Output Interpretation**
|
|
||||||
- **Field-by-field analysis**: Individual field metrics and alerts
|
|
||||||
- **Summary section**: Total alerts and problem fields
|
|
||||||
- **Farm-wide statistics**: Overall uniformity distribution
|
|
||||||
- **Consistency checks**: Spatial pattern validation
|
|
||||||
|
|
||||||
## Workflow
|
|
||||||
|
|
||||||
### Development Workflow
|
|
||||||
```
|
|
||||||
1. 👨💻 Developer: Make changes → run package_manager.R → test → commit + push
|
|
||||||
2. 👥 Teammate: Pull → run package_manager.R → test
|
|
||||||
3. 🚀 Production: Pull → run package_manager.R → deploy
|
|
||||||
```
|
|
||||||
|
|
||||||
### Crop Analysis Workflow
|
|
||||||
```
|
|
||||||
1. 📊 Weekly Data: New satellite mosaics arrive
|
|
||||||
2. 🔄 Analysis: Run crop_analysis_messaging.R
|
|
||||||
3. 📋 Review: Check alerts and prioritize fields
|
|
||||||
4. 🚜 Action: Implement management recommendations
|
|
||||||
5. 📈 Monitor: Track improvements in next week's analysis
|
|
||||||
```
|
|
||||||
|
|
||||||
### Files Created
|
|
||||||
- `renv.lock` - Exact package versions (commit this!)
|
|
||||||
- `package_manager.log` - Installation log (don't commit)
|
|
||||||
- `renv/` folder - Package cache (don't commit)
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### "Package failed to install"
|
|
||||||
- Check internet connection
|
|
||||||
- Update R to latest version
|
|
||||||
- Install system dependencies (varies by OS)
|
|
||||||
|
|
||||||
### "Version conflicts"
|
|
||||||
- Delete `renv/` folder
|
|
||||||
- Run script again for clean install
|
|
||||||
|
|
||||||
### "renv not working"
|
|
||||||
- Install manually: `install.packages("renv")`
|
|
||||||
- Restart R session
|
|
||||||
- Run script again
|
|
||||||
|
|
||||||
## Team Guidelines
|
|
||||||
|
|
||||||
1. **Always run** `package_manager.R` after pulling changes
|
|
||||||
2. **Commit** `renv.lock` to git (not `renv/` folder)
|
|
||||||
3. **Don't modify** package versions in scripts - use this manager
|
|
||||||
4. **Report issues** in the log file to team
|
|
||||||
5. **Review crop alerts** weekly and prioritize urgent fields first
|
|
||||||
6. **Document management actions** taken in response to alerts
|
|
||||||
|
|
||||||
## Crop Analysis Best Practices
|
|
||||||
|
|
||||||
### **Alert Prioritization**
|
|
||||||
1. 🚨 **URGENT** alerts: Address within 24-48 hours
|
|
||||||
2. ⚠️ **ALERT** messages: Schedule for next management cycle
|
|
||||||
3. 💡 **OPPORTUNITIES**: Implement when resources available
|
|
||||||
|
|
||||||
### **Field Investigation Guidelines**
|
|
||||||
- **High CV + Low Acceptable Area**: Check irrigation uniformity
|
|
||||||
- **Very Strong Clustering**: Look for management zone boundaries
|
|
||||||
- **Declining Trends**: Investigate nutrition or disease issues
|
|
||||||
- **Spatial Patterns**: Consider soil variability or equipment patterns
|
|
||||||
|
|
||||||
### **Validation Steps**
|
|
||||||
1. Cross-reference alerts with field observations
|
|
||||||
2. Check weather data for recent stress events
|
|
||||||
3. Validate spatial patterns with drone imagery if available
|
|
||||||
4. Document management responses and track improvements
|
|
||||||
|
|
||||||
## Advanced Usage
|
|
||||||
|
|
||||||
### Restore from lockfile only:
|
|
||||||
```r
|
|
||||||
renv::restore()
|
|
||||||
```
|
|
||||||
|
|
||||||
### Add new package requirement:
|
|
||||||
1. Edit `REQUIRED_PACKAGES` in `package_manager.R`
|
|
||||||
2. Run the script
|
|
||||||
3. Commit updated `renv.lock`
|
|
||||||
|
|
||||||
### Check status without changes:
|
|
||||||
```r
|
|
||||||
source("r_app/package_manager.R")
|
|
||||||
# Then just read the log or run generate_package_report()
|
|
||||||
```
|
|
||||||
|
|
@ -1,251 +0,0 @@
|
||||||
# TL;DR - Harvest Detection Script Summary
|
|
||||||
|
|
||||||
## What Is This?
|
|
||||||
|
|
||||||
A **deep learning model** that watches the Chlorophyll Index (CI) time series of a sugarcane field over a full season (300-400+ days) and predicts two things:
|
|
||||||
|
|
||||||
1. **"Harvest is coming in 3-14 days"** (sends farmer alert) - AUC = 0.88
|
|
||||||
2. **"Harvest happened 1-21 days ago"** (confirms in database) - AUC = 0.98
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## How Does It Work? (Simple Explanation)
|
|
||||||
|
|
||||||
**Imagine** you're teaching a doctor to recognize when a patient is about to have a seizure by looking at their brainwave readings over weeks of data.
|
|
||||||
|
|
||||||
- **Input**: Brainwave readings over weeks (like CI over a season)
|
|
||||||
- **Pattern Recognition**: The model learns what the brainwave looks like JUST BEFORE a seizure
|
|
||||||
- **Output**: "High probability of seizure in next 3-14 hours" (like our harvest warning)
|
|
||||||
|
|
||||||
**Your model** does the same with sugarcane:
|
|
||||||
- **Input**: Chlorophyll Index readings over 300-400 days
|
|
||||||
- **Pattern Recognition**: Learns what CI looks like just before harvest
|
|
||||||
- **Output**: "Harvest likely in next 3-14 days"
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Architecture in Plain English
|
|
||||||
|
|
||||||
```
|
|
||||||
Input: Weekly CI values for 300+ days
|
|
||||||
↓
|
|
||||||
Clean & Smooth: Remove sensor noise, detect bad data
|
|
||||||
↓
|
|
||||||
Feature Engineering: Create 7 metrics from CI
|
|
||||||
- "How fast is CI changing?" (velocity)
|
|
||||||
- "How fast is that change changing?" (acceleration)
|
|
||||||
- "What's the minimum CI so far?" (useful for detecting harvest)
|
|
||||||
- ... 4 more patterns
|
|
||||||
↓
|
|
||||||
LSTM Neural Network: "Processes the full season story"
|
|
||||||
- Works like: "Remember what happened weeks ago, use it to predict now"
|
|
||||||
- Not like: "Just look at today's number"
|
|
||||||
↓
|
|
||||||
Two Output Heads:
|
|
||||||
- Head 1: "How imminent is harvest?" (0-100% probability)
|
|
||||||
- Head 2: "Has harvest happened?" (0-100% probability)
|
|
||||||
↓
|
|
||||||
Output: Per-day probabilities for 300+ days
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Strengths ✅
|
|
||||||
|
|
||||||
1. **Smart preprocessing** - Removes bad data (interpolated/noisy)
|
|
||||||
2. **No data leakage** - Tests on completely different fields
|
|
||||||
3. **Variable-length sequences** - Handles 300-400 day seasons flexibly
|
|
||||||
4. **Per-timestep predictions** - Predictions for every single day
|
|
||||||
5. **Dual output** - Two related tasks (warning + confirmation)
|
|
||||||
6. **Works in practice** - Detected signal is 98% accurate
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Limitations ⚠️
|
|
||||||
|
|
||||||
1. **Limited input data** - Only uses CI (no temperature, rainfall, soil data)
|
|
||||||
2. **False positives** - Triggers on seasonal dips, not just harvest (88% vs 98%)
|
|
||||||
3. **Single-client training** - Trained on ESA fields only (overfits)
|
|
||||||
4. **No uncertainty bounds** - Gives percentage, not confidence range
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Performance Report Card
|
|
||||||
|
|
||||||
| What | Score | Notes |
|
|
||||||
|------|-------|-------|
|
|
||||||
| **Imminent Prediction** | 88/100 (AUC 0.88) | "Good" - detects most harvest windows, some false alarms |
|
|
||||||
| **Detected Prediction** | 98/100 (AUC 0.98) | "Excellent" - harvest confirmation is rock-solid |
|
|
||||||
| **Data Quality** | 95/100 | Excellent preprocessing, good noise removal |
|
|
||||||
| **Code Quality** | 90/100 | Clean, reproducible, well-documented |
|
|
||||||
| **Production Readiness** | 70/100 | Good foundation, needs all-client retraining + temperature data |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What Can Make It Better (Priority Order)
|
|
||||||
|
|
||||||
### 🔴 HIGH IMPACT, QUICK (Do First)
|
|
||||||
|
|
||||||
1. **Train on all sugarcane farms** (not just ESA)
|
|
||||||
- Current: ~2,000 training samples, 2 fields
|
|
||||||
- Improved: ~10,000+ samples, 15+ fields
|
|
||||||
- Expected gain: 5-10% better on imminent signal
|
|
||||||
- Effort: 30 min setup + 15 min runtime
|
|
||||||
|
|
||||||
2. **Add temperature data**
|
|
||||||
- Why: Harvest timing depends on accumulated heat, not just CI
|
|
||||||
- Impact: Distinguish "harvest-ready decline" from "stress decline"
|
|
||||||
- Expected gain: 10-15% improvement on imminent
|
|
||||||
- Effort: 3-4 hours
|
|
||||||
|
|
||||||
### 🟡 MEDIUM PRIORITY
|
|
||||||
|
|
||||||
3. **Test different imminent prediction windows**
|
|
||||||
- Current: 3-14 days before harvest
|
|
||||||
- Try: 7-14, 10-21, etc.
|
|
||||||
- Expected gain: 30% fewer false alarms
|
|
||||||
- Effort: 1-2 hours
|
|
||||||
|
|
||||||
4. **Add rainfall/moisture data**
|
|
||||||
- Why: Drought = early harvest, floods = late harvest
|
|
||||||
- Expected gain: 5-10% improvement
|
|
||||||
- Effort: 3-4 hours
|
|
||||||
|
|
||||||
5. **Per-field performance analysis**
|
|
||||||
- Reveals which fields are hard to predict
|
|
||||||
- Effort: 30 minutes
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Current Issues Observed
|
|
||||||
|
|
||||||
### Issue 1: False Imminent Positives
|
|
||||||
**Symptom**: Model triggers "harvest imminent" multiple times during the season, not just at harvest.
|
|
||||||
|
|
||||||
**Root cause**: Sugarcane CI naturally declines as it grows. Model trained on limited data (ESA-only) can't distinguish:
|
|
||||||
- "This is a natural mid-season dip" ← Don't alert farmer
|
|
||||||
- "This is the pre-harvest dip" ← Alert farmer
|
|
||||||
|
|
||||||
**Fix**: Add temperature data or retrain on all clients (more diversity = better learning)
|
|
||||||
|
|
||||||
### Issue 2: Limited Generalization
|
|
||||||
**Symptom**: Only trained on ESA fields. Unknown performance on chemba, bagamoyo, etc.
|
|
||||||
|
|
||||||
**Root cause**: Different climates, varieties, soils have different CI patterns.
|
|
||||||
|
|
||||||
**Fix**: Retrain with `CLIENT_FILTER = None` (takes all clients)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Bottom Line Assessment
|
|
||||||
|
|
||||||
**Current**: ⭐⭐⭐⭐ (4/5 stars)
|
|
||||||
- Well-engineered, works well, good data practices
|
|
||||||
- Ready for research/demonstration
|
|
||||||
|
|
||||||
**With Phase 1 & 2 improvements**: ⭐⭐⭐⭐⭐ (5/5 stars)
|
|
||||||
- Production-ready
|
|
||||||
- Reliable, accurate, generalizable
|
|
||||||
|
|
||||||
**Estimated time to 5-star**: 1-2 weeks part-time work
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick Start to Improve It
|
|
||||||
|
|
||||||
### In 30 Minutes
|
|
||||||
```python
|
|
||||||
# Go to line ~49 in the notebook
|
|
||||||
CLIENT_FILTER = 'esa' # ← Change to:
|
|
||||||
CLIENT_FILTER = None # Now uses all clients
|
|
||||||
# Run Sections 2-12
|
|
||||||
# Compare results
|
|
||||||
```
|
|
||||||
|
|
||||||
### In 3-4 Hours (After Phase 1)
|
|
||||||
1. Download daily temperature data for 2020-2024
|
|
||||||
2. Merge with existing CI data
|
|
||||||
3. Add 4 new temperature features (GDD, velocity, anomaly, percentile)
|
|
||||||
4. Retrain
|
|
||||||
5. Measure improvement
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Sugarcane Biology (Why This Matters)
|
|
||||||
|
|
||||||
Sugarcane has **phenological constraints** - it follows a strict schedule:
|
|
||||||
|
|
||||||
```
|
|
||||||
Stage 1 (Days 0-30): GERMINATION
|
|
||||||
- CI = low
|
|
||||||
|
|
||||||
Stage 2 (Days 30-120): TILLERING (growth spurt)
|
|
||||||
- CI rising rapidly
|
|
||||||
- Natural increase (not mature yet)
|
|
||||||
|
|
||||||
Stage 3 (Days 120-300): GRAND GROWTH (bulk accumulation)
|
|
||||||
- CI high, stable
|
|
||||||
- Farmer wants to extend this
|
|
||||||
|
|
||||||
Stage 4 (Days 300-350+): RIPENING
|
|
||||||
- CI peaks then slight decline
|
|
||||||
- This is normal maturation
|
|
||||||
- HARVEST WINDOW OPENS in this stage
|
|
||||||
|
|
||||||
Stage 5: HARVEST
|
|
||||||
- Farmer decides to cut
|
|
||||||
- CI drops to minimum
|
|
||||||
- Followed by new season
|
|
||||||
|
|
||||||
Model's job: Distinguish Stage 4 from earlier stages
|
|
||||||
Current weakness: Can confuse Stage 2-3 natural variation with Stage 4 ripening
|
|
||||||
```
|
|
||||||
|
|
||||||
**Temperature helps because**:
|
|
||||||
- Heat units accumulate only during ripening
|
|
||||||
- Cold = slow growth, delayed ripening
|
|
||||||
- Extreme heat = early ripening
|
|
||||||
- Model can see: "High heat units + declining CI" = ripening (not mid-season dip)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Files Created
|
|
||||||
|
|
||||||
1. **LSTM_HARVEST_EVALUATION.md** - Detailed analysis of the script
|
|
||||||
- Section-by-section walkthrough
|
|
||||||
- Strengths and weaknesses
|
|
||||||
- Recommendations by priority
|
|
||||||
|
|
||||||
2. **IMPLEMENTATION_ROADMAP.md** - Step-by-step guide to improvements
|
|
||||||
- Phase 1: All-client retraining (quick)
|
|
||||||
- Phase 2: Temperature features (high-impact)
|
|
||||||
- Phase 3-5: Optimization steps
|
|
||||||
- Code snippets ready to use
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Questions to Ask Next
|
|
||||||
|
|
||||||
1. **Is temperature data available?** (If yes → 10-15% gain)
|
|
||||||
2. **Which fields have most false positives?** (Identifies patterns)
|
|
||||||
3. **What lead time does farmer need?** (Currently ~7 days, is that enough?)
|
|
||||||
4. **Any fields we should exclude?** (Data quality, variety issues?)
|
|
||||||
5. **How often will this run operationally?** (Weekly? Monthly?)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Next Meeting Agenda
|
|
||||||
|
|
||||||
- [ ] Review: Do you agree with assessment?
|
|
||||||
- [ ] Decide: Proceed with Phase 1 (all-client retraining)?
|
|
||||||
- [ ] Obtain: Temperature data source and format
|
|
||||||
- [ ] Plan: Timeline for Phase 2 implementation
|
|
||||||
- [ ] Discuss: Operational thresholds (0.5 probability right?)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary in One Sentence
|
|
||||||
|
|
||||||
**The script is well-engineered and works well (88-98% accuracy), but can improve 10-15% with multi-client retraining and temperature data, taking it from research prototype to production-ready system.**
|
|
||||||
|
|
||||||
🎯 **Next step**: Change `CLIENT_FILTER = None` and retrain (30 minutes setup, 15 minutes run)
|
|
||||||
|
|
@ -1,55 +0,0 @@
|
||||||
# Archive: Old Experiments & Docs
|
|
||||||
|
|
||||||
This folder contains experimental code, old model files, and supporting documentation from earlier iterations of the harvest detection project. These are kept for reference but **are not part of the current production workflow**.
|
|
||||||
|
|
||||||
## Contents
|
|
||||||
|
|
||||||
### Notebooks (Early Development)
|
|
||||||
- `05_lstm_harvest_detection_pytorch.ipynb` - Early LSTM implementation
|
|
||||||
- `11_data_cleaning_labeling.ipynb` - Data preparation exploration
|
|
||||||
- `12_model_training_prediction.ipynb` - Initial training experiments
|
|
||||||
|
|
||||||
### Old Model Files
|
|
||||||
- `best_harvest_detection_model_esa.pt` - Earlier model variant
|
|
||||||
- `best_harvest_model.pt` - Earlier model variant
|
|
||||||
- `harvest_detection_model_esa_None.pt` - Experimental model
|
|
||||||
- `harvest_detection_config_esa_None.json` - Config for experimental model
|
|
||||||
- `harvest_test_metadata_esa_None.csv` - Test set metadata
|
|
||||||
- `harvest_train_metadata_esa_None.csv` - Train set metadata
|
|
||||||
|
|
||||||
### Documentation (Reference Only)
|
|
||||||
- `ACTION_PLAN.md` - Early planning
|
|
||||||
- `CI_ONLY_IMPROVEMENTS.md` - Feature exploration
|
|
||||||
- `DEPLOYMENT_README.md` - Deployment notes
|
|
||||||
- `EXECUTIVE_SUMMARY.md` - Project overview
|
|
||||||
- `IMPLEMENTATION_ROADMAP.md` - Development roadmap
|
|
||||||
- `LSTM_HARVEST_EVALUATION.md` - Evaluation notes
|
|
||||||
- `README_EVALUATION.md` - Evaluation docs
|
|
||||||
- `TECHNICAL_IMPROVEMENTS.md` - Technical notes
|
|
||||||
- `YOUR_FEEDBACK_SUMMARY.md` - Feedback tracking
|
|
||||||
|
|
||||||
### Old Data Files
|
|
||||||
- `lstm_complete_data_dedup.csv` - Deduplicated data variant
|
|
||||||
- `lstm_test_data_cleaned.csv` - Cleaned test data
|
|
||||||
- `lstm_train_data_cleaned.csv` - Cleaned train data
|
|
||||||
- `data_cleaning_metadata.csv` - Cleaning notes
|
|
||||||
- `trigger_analysis_summary.csv` - Analysis results
|
|
||||||
- `in_season_predictions_*.csv` - Old prediction results
|
|
||||||
- `hyperparameter_tuning_results.csv` - Tuning history
|
|
||||||
- `feature_engineering_config.json` - Feature config variant
|
|
||||||
- `prepare_lstm_data_from_rds.R` - Old R data prep script
|
|
||||||
- `IN_SEASON_SIMULATION_README.txt` - Old simulation docs
|
|
||||||
|
|
||||||
## Current Active Workflow
|
|
||||||
|
|
||||||
For the current production harvest detection system, see:
|
|
||||||
- **Main folder** (`../`): Clean working directory with current data files
|
|
||||||
- **experiment_framework/** (`../experiment_framework/`):
|
|
||||||
- Phase 1, 2, 3 implementations
|
|
||||||
- Model 307 (current production model)
|
|
||||||
- Complete README: `PRODUCTION_WORKFLOW.md`
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
_Archive created: December 12, 2025_
|
|
||||||
_All files preserved (nothing deleted)_
|
|
||||||
|
|
@ -1,324 +0,0 @@
|
||||||
# Harvest Detection Model Evaluation - Document Index
|
|
||||||
|
|
||||||
**Evaluation Date**: December 8, 2025
|
|
||||||
**Model**: LSTM-based harvest detection using Chlorophyll Index (CI) time series
|
|
||||||
**Overall Score**: ⭐⭐⭐⭐ (4/5 stars - excellent foundation, ready for Phase 2)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📄 Documents Created
|
|
||||||
|
|
||||||
### 1. **EXECUTIVE_SUMMARY.md** ← START HERE
|
|
||||||
**Best for**: Management, quick overview, decision-making
|
|
||||||
**Contains**:
|
|
||||||
- Key findings at a glance
|
|
||||||
- Strengths & weaknesses summary
|
|
||||||
- Quick wins (high-impact, low-effort actions)
|
|
||||||
- Recommended actions by timeline
|
|
||||||
- Budget & resource requirements
|
|
||||||
- FAQ
|
|
||||||
|
|
||||||
**Read time**: 5-10 minutes
|
|
||||||
**Action**: Review findings, approve Phase 1 implementation
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 2. **QUICK_SUMMARY.md** ← FOR NON-TECHNICAL STAKEHOLDERS
|
|
||||||
**Best for**: Farmers, extension officers, project managers
|
|
||||||
**Contains**:
|
|
||||||
- Plain English explanation of what model does
|
|
||||||
- Performance report card (simple language)
|
|
||||||
- What can make it better (priority order)
|
|
||||||
- Sugarcane biology context
|
|
||||||
- Current issues and fixes
|
|
||||||
- One-sentence summary
|
|
||||||
|
|
||||||
**Read time**: 10-15 minutes
|
|
||||||
**Action**: Share with project team, gather requirements
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 3. **LSTM_HARVEST_EVALUATION.md** ← COMPREHENSIVE TECHNICAL ANALYSIS
|
|
||||||
**Best for**: Data scientists, engineers, deep-dive technical review
|
|
||||||
**Contains**:
|
|
||||||
- Section-by-section script walkthrough (all 12 sections)
|
|
||||||
- Detailed architecture explanation
|
|
||||||
- Feature engineering analysis
|
|
||||||
- Model recommendations
|
|
||||||
- Per-field performance analysis
|
|
||||||
- Deployment readiness checklist
|
|
||||||
- Specific code improvements with examples
|
|
||||||
- Data quality deep-dive
|
|
||||||
- Agronomic context for sugarcane
|
|
||||||
|
|
||||||
**Read time**: 30-45 minutes (reference document)
|
|
||||||
**Action**: Technical review, identify implementation priorities
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 4. **IMPLEMENTATION_ROADMAP.md** ← STEP-BY-STEP ACTION PLAN
|
|
||||||
**Best for**: Implementation team, project leads
|
|
||||||
**Contains**:
|
|
||||||
- **Phase 1**: Multi-client retraining (quick win)
|
|
||||||
- Exact steps, expected outcomes, success criteria
|
|
||||||
- **Phase 2**: Add temperature features (high-impact)
|
|
||||||
- Data sources, feature engineering, code structure
|
|
||||||
- Expected AUC improvement: 88% → 93%
|
|
||||||
- **Phase 3**: Test imminent windows
|
|
||||||
- How to test different 3-14, 7-14, 10-21 day windows
|
|
||||||
- Expected FP reduction: 30-50%
|
|
||||||
- **Phase 4**: Operational metrics
|
|
||||||
- Lead time analysis, per-field performance
|
|
||||||
- **Phase 5**: Optional rainfall features
|
|
||||||
- Weekly checklist
|
|
||||||
- Performance trajectory predictions
|
|
||||||
|
|
||||||
**Read time**: 20-30 minutes
|
|
||||||
**Action**: Follow step-by-step, assign work, track progress
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 5. **TECHNICAL_IMPROVEMENTS.md** ← COPY-PASTE READY CODE
|
|
||||||
**Best for**: Developers, data engineers
|
|
||||||
**Contains**:
|
|
||||||
- **Code Block 1**: Temperature feature engineering (ready to use)
|
|
||||||
- GDD calculation, temperature anomaly, velocity
|
|
||||||
- Drop-in replacement for Section 5
|
|
||||||
- **Code Block 2**: Window optimization analysis
|
|
||||||
- Test 5-6 different imminent windows
|
|
||||||
- Visualization of trade-offs (AUC vs. FP rate)
|
|
||||||
- **Code Block 3**: Operational metrics calculation
|
|
||||||
- Lead time distribution
|
|
||||||
- Per-field accuracy
|
|
||||||
- Visualizations
|
|
||||||
- **Code Block 4**: Enhanced model configuration saving
|
|
||||||
- Implementation priority table
|
|
||||||
|
|
||||||
**Read time**: 20-30 minutes (reference)
|
|
||||||
**Action**: Copy code, integrate into notebook, run
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🎯 Quick Navigation
|
|
||||||
|
|
||||||
### "I need to understand this model in 5 minutes"
|
|
||||||
→ Read: **EXECUTIVE_SUMMARY.md** (Key Findings section)
|
|
||||||
|
|
||||||
### "I need to explain this to a farmer"
|
|
||||||
→ Read: **QUICK_SUMMARY.md** (entire document)
|
|
||||||
|
|
||||||
### "I need to improve this model"
|
|
||||||
→ Read: **IMPLEMENTATION_ROADMAP.md** (Phase 1-2)
|
|
||||||
|
|
||||||
### "I need the technical details"
|
|
||||||
→ Read: **LSTM_HARVEST_EVALUATION.md** (sections of interest)
|
|
||||||
|
|
||||||
### "I need to write code"
|
|
||||||
→ Read: **TECHNICAL_IMPROVEMENTS.md** (code blocks)
|
|
||||||
|
|
||||||
### "I need to know if it's production-ready"
|
|
||||||
→ Read: **EXECUTIVE_SUMMARY.md** (Deployment Readiness section)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📊 Document Comparison
|
|
||||||
|
|
||||||
| Document | Audience | Length | Depth | Action |
|
|
||||||
|----------|----------|--------|-------|--------|
|
|
||||||
| Executive Summary | Managers | 10 min | Medium | Approve Phase 1 |
|
|
||||||
| Quick Summary | Non-tech | 15 min | Medium | Share findings |
|
|
||||||
| LSTM Evaluation | Engineers | 45 min | Deep | Technical review |
|
|
||||||
| Implementation Roadmap | Developers | 30 min | Medium | Follow steps |
|
|
||||||
| Technical Improvements | Coders | 30 min | Deep | Write code |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🚀 Getting Started
|
|
||||||
|
|
||||||
### Step 1: Decision (Today)
|
|
||||||
- [ ] Read **EXECUTIVE_SUMMARY.md** (Key Findings)
|
|
||||||
- [ ] Approve Phase 1 (all-client retraining)
|
|
||||||
- [ ] Identify temperature data source
|
|
||||||
|
|
||||||
### Step 2: Setup (This Week)
|
|
||||||
- [ ] Follow **IMPLEMENTATION_ROADMAP.md** Phase 1 (30 min)
|
|
||||||
- [ ] Run notebook with `CLIENT_FILTER = None`
|
|
||||||
- [ ] Compare results: ESA-only vs. all-client
|
|
||||||
|
|
||||||
### Step 3: Implementation (Next 2 Weeks)
|
|
||||||
- [ ] Get temperature data ready
|
|
||||||
- [ ] Copy code from **TECHNICAL_IMPROVEMENTS.md**
|
|
||||||
- [ ] Implement Phase 2 (temperature features)
|
|
||||||
- [ ] Measure improvement: AUC and false positives
|
|
||||||
|
|
||||||
### Step 4: Optimization (Week 3-4)
|
|
||||||
- [ ] Follow **IMPLEMENTATION_ROADMAP.md** Phase 3
|
|
||||||
- [ ] Test window optimization
|
|
||||||
- [ ] Compute operational metrics
|
|
||||||
|
|
||||||
### Step 5: Deployment (Week 4+)
|
|
||||||
- [ ] Validate on recent data
|
|
||||||
- [ ] Write operational manual
|
|
||||||
- [ ] Deploy to production
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📈 Expected Timeline
|
|
||||||
|
|
||||||
| Timeline | Task | Document | Effort |
|
|
||||||
|----------|------|----------|--------|
|
|
||||||
| **This week** | Review & approve Phase 1 | Executive Summary | 1 hr |
|
|
||||||
| **This week** | Run Phase 1 (all-client) | Roadmap (Phase 1) | 1 hr |
|
|
||||||
| **Week 2** | Implement Phase 2 (temperature) | Technical Improvements + Roadmap | 4 hrs |
|
|
||||||
| **Week 3** | Test Phase 3 (windows) | Technical Improvements + Roadmap | 2 hrs |
|
|
||||||
| **Week 4** | Deploy Phase 4 (metrics) | Roadmap (Phase 4) | 2 hrs |
|
|
||||||
| **Total** | **All improvements** | **All documents** | **~10 hrs** |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 💡 Key Recommendations
|
|
||||||
|
|
||||||
### 🔴 Priority 1: Phase 1 (All-Client Retraining)
|
|
||||||
- **When**: This week
|
|
||||||
- **Effort**: 30 min setup + 15 min runtime
|
|
||||||
- **Expected gain**: +5-10% AUC
|
|
||||||
- **How**: Change 1 line in notebook
|
|
||||||
- **Document**: IMPLEMENTATION_ROADMAP.md (Phase 1)
|
|
||||||
|
|
||||||
### 🔴 Priority 2: Phase 2 (Temperature Features)
|
|
||||||
- **When**: Next 2 weeks
|
|
||||||
- **Effort**: 3-4 hours
|
|
||||||
- **Expected gain**: +10-15% AUC, -50% false positives
|
|
||||||
- **Document**: TECHNICAL_IMPROVEMENTS.md (Code Block 1)
|
|
||||||
|
|
||||||
### 🟡 Priority 3: Phase 3 (Window Optimization)
|
|
||||||
- **When**: Week 2-3
|
|
||||||
- **Effort**: 1-2 hours
|
|
||||||
- **Expected gain**: -30% false positives
|
|
||||||
- **Document**: TECHNICAL_IMPROVEMENTS.md (Code Block 2)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ✅ What's Working Well
|
|
||||||
|
|
||||||
1. **Data preprocessing** (linear interpolation detection, spike removal)
|
|
||||||
2. **No data leakage** (field-level train/val/test split)
|
|
||||||
3. **Variable-length handling** (dynamic batch padding)
|
|
||||||
4. **Per-timestep predictions** (each day gets own label)
|
|
||||||
5. **Dual-output architecture** (imminent + detected signals)
|
|
||||||
6. **Detected signal performance** (98% AUC - rock solid)
|
|
||||||
7. **Clean, reproducible code** (well-documented, saved config)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚠️ What Needs Improvement
|
|
||||||
|
|
||||||
1. **Limited features** (only CI, no temperature/rainfall/moisture)
|
|
||||||
2. **Single-client training** (only ESA, limited diversity)
|
|
||||||
3. **Imminent false positives** (88% vs. 98%, room for improvement)
|
|
||||||
4. **No uncertainty quantification** (point estimates, no ranges)
|
|
||||||
5. **Unvalidated operational parameters** (Is 3-14 days optimal?)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📋 Document Checklist
|
|
||||||
|
|
||||||
- [ ] **EXECUTIVE_SUMMARY.md** - Key findings, decisions, timeline
|
|
||||||
- [ ] **QUICK_SUMMARY.md** - Non-technical overview, context
|
|
||||||
- [ ] **LSTM_HARVEST_EVALUATION.md** - Detailed technical analysis
|
|
||||||
- [ ] **IMPLEMENTATION_ROADMAP.md** - Step-by-step action plan
|
|
||||||
- [ ] **TECHNICAL_IMPROVEMENTS.md** - Ready-to-use code
|
|
||||||
- [ ] **Notebook updated** - Context added to first cell
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🎓 Learning Outcomes
|
|
||||||
|
|
||||||
After reviewing these documents, you will understand:
|
|
||||||
|
|
||||||
1. **What the model does** - Time series pattern recognition for harvest prediction
|
|
||||||
2. **Why it works** - LSTM, per-timestep predictions, dual output heads
|
|
||||||
3. **Why it's not perfect** - Limited features (CI only), single-client training
|
|
||||||
4. **How to improve it** - Temperature features are key (3-4 hours for 10-15% gain)
|
|
||||||
5. **How to deploy it** - Performance metrics, operational validation, timeline
|
|
||||||
6. **How to maintain it** - Quarterly retraining, feedback loops, monitoring
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🔗 Cross-References
|
|
||||||
|
|
||||||
### If you're interested in...
|
|
||||||
|
|
||||||
**Feature Engineering**
|
|
||||||
→ LSTM_HARVEST_EVALUATION.md (Section 5) + TECHNICAL_IMPROVEMENTS.md (Temperature Features)
|
|
||||||
|
|
||||||
**Data Quality**
|
|
||||||
→ LSTM_HARVEST_EVALUATION.md (Data Quality section) + LSTM_HARVEST_EVALUATION.md (Linear Interpolation)
|
|
||||||
|
|
||||||
**Model Architecture**
|
|
||||||
→ LSTM_HARVEST_EVALUATION.md (Section 8) + TECHNICAL_IMPROVEMENTS.md (GDD percentile, attention mechanisms)
|
|
||||||
|
|
||||||
**Operational Readiness**
|
|
||||||
→ EXECUTIVE_SUMMARY.md (Success Criteria) + IMPLEMENTATION_ROADMAP.md (Phase 4)
|
|
||||||
|
|
||||||
**Performance Improvement**
|
|
||||||
→ IMPLEMENTATION_ROADMAP.md (Phases 1-3) + TECHNICAL_IMPROVEMENTS.md (Code blocks)
|
|
||||||
|
|
||||||
**Agronomic Context**
|
|
||||||
→ QUICK_SUMMARY.md (Sugarcane Biology) + LSTM_HARVEST_EVALUATION.md (Agronomic Context)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📞 Support
|
|
||||||
|
|
||||||
### For questions about...
|
|
||||||
|
|
||||||
| Topic | Document | Section |
|
|
||||||
|-------|----------|---------|
|
|
||||||
| Model architecture | LSTM_HARVEST_EVALUATION.md | Section 8 |
|
|
||||||
| Feature list | LSTM_HARVEST_EVALUATION.md | Feature Engineering section |
|
|
||||||
| Data preprocessing | LSTM_HARVEST_EVALUATION.md | Data Quality & Cleaning |
|
|
||||||
| Performance metrics | EXECUTIVE_SUMMARY.md | Key Findings |
|
|
||||||
| Implementation steps | IMPLEMENTATION_ROADMAP.md | Phase 1-5 |
|
|
||||||
| Code examples | TECHNICAL_IMPROVEMENTS.md | Code Blocks 1-4 |
|
|
||||||
| Deployment | EXECUTIVE_SUMMARY.md | Deployment section |
|
|
||||||
| Timeline | IMPLEMENTATION_ROADMAP.md | Summary timeline |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📖 Reading Order Recommendations
|
|
||||||
|
|
||||||
### For Project Managers
|
|
||||||
1. EXECUTIVE_SUMMARY.md (entire)
|
|
||||||
2. QUICK_SUMMARY.md (entire)
|
|
||||||
3. IMPLEMENTATION_ROADMAP.md (overview)
|
|
||||||
|
|
||||||
### For Data Scientists
|
|
||||||
1. EXECUTIVE_SUMMARY.md (entire)
|
|
||||||
2. LSTM_HARVEST_EVALUATION.md (entire)
|
|
||||||
3. TECHNICAL_IMPROVEMENTS.md (code blocks)
|
|
||||||
|
|
||||||
### For Developers
|
|
||||||
1. IMPLEMENTATION_ROADMAP.md (entire)
|
|
||||||
2. TECHNICAL_IMPROVEMENTS.md (entire)
|
|
||||||
3. LSTM_HARVEST_EVALUATION.md (architecture sections)
|
|
||||||
|
|
||||||
### For Farmers/Extension Officers
|
|
||||||
1. QUICK_SUMMARY.md (entire)
|
|
||||||
2. EXECUTIVE_SUMMARY.md (highlights only)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ✨ Final Summary
|
|
||||||
|
|
||||||
**The harvest detection model is well-engineered and 70% production-ready.** With two weeks of focused effort (Phases 1-2), it can become 95%+ production-ready with <5% false positive rate.
|
|
||||||
|
|
||||||
**Next step**: Schedule Phase 1 implementation (all-client retraining) - takes 30 minutes setup + 15 minutes runtime.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**All documents are self-contained and can be read in any order.**
|
|
||||||
**Use the navigation above to find what you need.**
|
|
||||||
|
|
||||||
**Questions?** Refer to the specific document for that topic.
|
|
||||||
**Ready to implement?** Follow IMPLEMENTATION_ROADMAP.md step-by-step.
|
|
||||||
|
|
@ -1,603 +0,0 @@
|
||||||
# Technical Improvements & Code Examples
|
|
||||||
|
|
||||||
This document contains ready-to-use code snippets for enhancing the harvest detection model.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. Add Temperature Features (Copy-Paste Ready)
|
|
||||||
|
|
||||||
### Step 1: After loading data and before Section 3, add this:
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("ADDING TEMPERATURE FEATURES")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Assumes you have a temperature CSV with columns: date, field, avg_temp (in °C)
|
|
||||||
# If not available, download from ECMWF or local weather station
|
|
||||||
|
|
||||||
try:
|
|
||||||
df_temp = pd.read_csv('daily_temperature_data.csv', low_memory=False)
|
|
||||||
df_temp['date'] = pd.to_datetime(df_temp['date'])
|
|
||||||
print(f"✓ Temperature data loaded: {len(df_temp)} rows")
|
|
||||||
print(f" Date range: {df_temp['date'].min()} to {df_temp['date'].max()}")
|
|
||||||
print(f" Fields: {df_temp['field'].unique()}")
|
|
||||||
except FileNotFoundError:
|
|
||||||
print("⚠️ Temperature file not found. Skipping temperature features.")
|
|
||||||
df_temp = None
|
|
||||||
|
|
||||||
if df_temp is not None:
|
|
||||||
# Merge temperature with CI data
|
|
||||||
df_all = df_all.merge(
|
|
||||||
df_temp[['date', 'field', 'avg_temp']],
|
|
||||||
on=['date', 'field'],
|
|
||||||
how='left'
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"\n[FEATURE ENGINEERING] Creating temperature-based features...")
|
|
||||||
|
|
||||||
# 1. Growing Degree Days (GDD)
|
|
||||||
# Sugarcane base temperature: 10°C
|
|
||||||
df_all['daily_gdd'] = np.maximum(0, df_all['avg_temp'] - 10)
|
|
||||||
|
|
||||||
# Cumulative GDD per field-season
|
|
||||||
df_all['gdd_cumulative'] = 0.0
|
|
||||||
for (field, model), group in df_all.groupby(['field', 'model']):
|
|
||||||
idx = group.index
|
|
||||||
gdd_values = np.nancumsum(group['daily_gdd'].values)
|
|
||||||
df_all.loc[idx, 'gdd_cumulative'] = gdd_values
|
|
||||||
|
|
||||||
# 2. 7-day GDD velocity
|
|
||||||
df_all['gdd_7d_velocity'] = 0.0
|
|
||||||
for (field, model), group in df_all.groupby(['field', 'model']):
|
|
||||||
idx = group.index
|
|
||||||
gdd_cum = group['gdd_cumulative'].values
|
|
||||||
for i in range(7, len(gdd_cum)):
|
|
||||||
df_all.loc[idx.iloc[i], 'gdd_7d_velocity'] = gdd_cum[i] - gdd_cum[i-7]
|
|
||||||
|
|
||||||
# 3. Temperature anomaly (vs 30-day rolling average)
|
|
||||||
df_all['temp_30d_avg'] = df_all.groupby('field')['avg_temp'].transform(
|
|
||||||
lambda x: x.rolling(30, center=True, min_periods=1).mean()
|
|
||||||
)
|
|
||||||
df_all['temp_anomaly'] = df_all['avg_temp'] - df_all['temp_30d_avg']
|
|
||||||
|
|
||||||
# 4. GDD percentile (how far through season in heat accumulation)
|
|
||||||
df_all['gdd_percentile'] = 0.0
|
|
||||||
for (field, model), group in df_all.groupby(['field', 'model']):
|
|
||||||
idx = group.index
|
|
||||||
gdd_values = group['gdd_cumulative'].values
|
|
||||||
max_gdd = gdd_values[-1]
|
|
||||||
if max_gdd > 0:
|
|
||||||
df_all.loc[idx, 'gdd_percentile'] = gdd_values / max_gdd
|
|
||||||
|
|
||||||
# Handle NaN
|
|
||||||
df_all['gdd_cumulative'].fillna(0, inplace=True)
|
|
||||||
df_all['gdd_7d_velocity'].fillna(0, inplace=True)
|
|
||||||
df_all['temp_anomaly'].fillna(0, inplace=True)
|
|
||||||
df_all['gdd_percentile'].fillna(0, inplace=True)
|
|
||||||
|
|
||||||
print(f"\n✓ Temperature features created:")
|
|
||||||
print(f" gdd_cumulative: {df_all['gdd_cumulative'].min():.0f} - {df_all['gdd_cumulative'].max():.0f}")
|
|
||||||
print(f" gdd_7d_velocity: {df_all['gdd_7d_velocity'].min():.1f} - {df_all['gdd_7d_velocity'].max():.1f}")
|
|
||||||
print(f" temp_anomaly: {df_all['temp_anomaly'].min():.1f} - {df_all['temp_anomaly'].max():.1f}")
|
|
||||||
print(f" gdd_percentile: {df_all['gdd_percentile'].min():.2f} - {df_all['gdd_percentile'].max():.2f}")
|
|
||||||
else:
|
|
||||||
# Create dummy columns if temperature not available
|
|
||||||
df_all['gdd_cumulative'] = 0.0
|
|
||||||
df_all['gdd_7d_velocity'] = 0.0
|
|
||||||
df_all['temp_anomaly'] = 0.0
|
|
||||||
df_all['gdd_percentile'] = 0.0
|
|
||||||
print("⚠️ Temperature features set to zeros (data not available)")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 2: Update feature engineering in Section 5:
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("FEATURE ENGINEERING: EXTENDED FEATURES (7D + 4 TEMPERATURE)")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
def engineer_temporal_features_with_temperature(X_sequences, gdd_cumulative_list,
|
|
||||||
gdd_7d_velocity_list, temp_anomaly_list,
|
|
||||||
gdd_percentile_list):
|
|
||||||
"""
|
|
||||||
Combine CI-derived features with temperature features.
|
|
||||||
|
|
||||||
Original 7 features:
|
|
||||||
1-7: CI, vel7d, accel7d, ma14d, vel14d, min7d, vel_mag
|
|
||||||
|
|
||||||
New 4 features:
|
|
||||||
8. gdd_cumulative: Total accumulated heat
|
|
||||||
9. gdd_7d_velocity: Rate of heat accumulation
|
|
||||||
10. temp_anomaly: Current temp vs seasonal average
|
|
||||||
11. gdd_percentile: Position in season's heat accumulation
|
|
||||||
"""
|
|
||||||
X_features = []
|
|
||||||
|
|
||||||
for ci_idx, ci_seq in enumerate(X_sequences):
|
|
||||||
seq_len = len(ci_seq)
|
|
||||||
|
|
||||||
# Original 7 features from CI
|
|
||||||
ci_smooth = ci_seq.copy()
|
|
||||||
|
|
||||||
velocity_7d = np.zeros(seq_len)
|
|
||||||
ma7_values = pd.Series(ci_seq).rolling(window=7, center=False, min_periods=1).mean().values
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 7:
|
|
||||||
velocity_7d[i] = ma7_values[i] - ma7_values[i-7]
|
|
||||||
|
|
||||||
acceleration_7d = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 7:
|
|
||||||
acceleration_7d[i] = velocity_7d[i] - velocity_7d[i-7]
|
|
||||||
|
|
||||||
ma14_values = pd.Series(ci_seq).rolling(window=14, center=False, min_periods=1).mean().values
|
|
||||||
|
|
||||||
velocity_14d = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
if i >= 14:
|
|
||||||
velocity_14d[i] = ma14_values[i] - ma14_values[i-14]
|
|
||||||
|
|
||||||
min_7d = np.zeros(seq_len)
|
|
||||||
for i in range(seq_len):
|
|
||||||
start_idx = max(0, i - 7)
|
|
||||||
min_7d[i] = np.nanmin(ci_seq[start_idx:i+1])
|
|
||||||
|
|
||||||
velocity_magnitude = np.abs(velocity_7d)
|
|
||||||
|
|
||||||
# Temperature features (4 new)
|
|
||||||
gdd_cum = gdd_cumulative_list[ci_idx]
|
|
||||||
gdd_vel = gdd_7d_velocity_list[ci_idx]
|
|
||||||
temp_anom = temp_anomaly_list[ci_idx]
|
|
||||||
gdd_perc = gdd_percentile_list[ci_idx]
|
|
||||||
|
|
||||||
# Ensure all are same length
|
|
||||||
if len(gdd_cum) < seq_len:
|
|
||||||
gdd_cum = np.pad(gdd_cum, (0, seq_len - len(gdd_cum)), constant_values=0)
|
|
||||||
if len(gdd_vel) < seq_len:
|
|
||||||
gdd_vel = np.pad(gdd_vel, (0, seq_len - len(gdd_vel)), constant_values=0)
|
|
||||||
if len(temp_anom) < seq_len:
|
|
||||||
temp_anom = np.pad(temp_anom, (0, seq_len - len(temp_anom)), constant_values=0)
|
|
||||||
if len(gdd_perc) < seq_len:
|
|
||||||
gdd_perc = np.pad(gdd_perc, (0, seq_len - len(gdd_perc)), constant_values=0)
|
|
||||||
|
|
||||||
# Stack all 11 features
|
|
||||||
features = np.column_stack([
|
|
||||||
ci_smooth, # 1
|
|
||||||
velocity_7d, # 2
|
|
||||||
acceleration_7d, # 3
|
|
||||||
ma14_values, # 4
|
|
||||||
velocity_14d, # 5
|
|
||||||
min_7d, # 6
|
|
||||||
velocity_magnitude, # 7
|
|
||||||
gdd_cum[:seq_len], # 8
|
|
||||||
gdd_vel[:seq_len], # 9
|
|
||||||
temp_anom[:seq_len], # 10
|
|
||||||
gdd_perc[:seq_len] # 11
|
|
||||||
])
|
|
||||||
|
|
||||||
X_features.append(features)
|
|
||||||
|
|
||||||
return X_features
|
|
||||||
|
|
||||||
# Extract temperature sequences from data
|
|
||||||
gdd_cumulative_seqs = []
|
|
||||||
gdd_7d_velocity_seqs = []
|
|
||||||
temp_anomaly_seqs = []
|
|
||||||
gdd_percentile_seqs = []
|
|
||||||
|
|
||||||
for seq_dict in train_sequences:
|
|
||||||
data = seq_dict['data'].sort_values('date')
|
|
||||||
gdd_cumulative_seqs.append(data['gdd_cumulative'].values)
|
|
||||||
gdd_7d_velocity_seqs.append(data['gdd_7d_velocity'].values)
|
|
||||||
temp_anomaly_seqs.append(data['temp_anomaly'].values)
|
|
||||||
gdd_percentile_seqs.append(data['gdd_percentile'].values)
|
|
||||||
|
|
||||||
# Create extended features
|
|
||||||
X_train_features = engineer_temporal_features_with_temperature(
|
|
||||||
X_train_list, gdd_cumulative_seqs, gdd_7d_velocity_seqs,
|
|
||||||
temp_anomaly_seqs, gdd_percentile_seqs
|
|
||||||
)
|
|
||||||
|
|
||||||
# ... same for val and test sets
|
|
||||||
|
|
||||||
print(f"\n✓ Extended feature engineering complete!")
|
|
||||||
print(f" Features per timestep: 11 (7 CI-derived + 4 temperature)")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 3: Update normalization in Section 6:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# OLD: feature_names = ['CI', '7d Velocity', ...]
|
|
||||||
# NEW:
|
|
||||||
feature_names = [
|
|
||||||
'CI', # 0
|
|
||||||
'7d Velocity', # 1
|
|
||||||
'7d Acceleration', # 2
|
|
||||||
'14d MA', # 3
|
|
||||||
'14d Velocity', # 4
|
|
||||||
'7d Min', # 5
|
|
||||||
'Velocity Magnitude', # 6
|
|
||||||
'GDD Cumulative', # 7
|
|
||||||
'GDD 7d Velocity', # 8
|
|
||||||
'Temp Anomaly', # 9
|
|
||||||
'GDD Percentile' # 10
|
|
||||||
]
|
|
||||||
|
|
||||||
# Update normalization loop
|
|
||||||
for feat_idx in range(11): # Changed from 7 to 11
|
|
||||||
train_feat_data = np.concatenate([f[:, feat_idx] for f in X_train_features])
|
|
||||||
scaler = MinMaxScaler(feature_range=(0, 1))
|
|
||||||
scaler.fit(train_feat_data.reshape(-1, 1))
|
|
||||||
feature_scalers.append(scaler)
|
|
||||||
print(f" {feature_names[feat_idx]:20s}: [{train_feat_data.min():.4f}, {train_feat_data.max():.4f}]")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4: Update model in Section 8:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# OLD: model = HarvestDetectionLSTM(input_size=7, ...)
|
|
||||||
# NEW:
|
|
||||||
model = HarvestDetectionLSTM(input_size=11, hidden_size=64, num_layers=1, dropout=0.5)
|
|
||||||
model = model.to(device)
|
|
||||||
|
|
||||||
print(f"\nModel input size: 11 features (7 CI-derived + 4 temperature)")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. Test Different Imminent Windows
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("SENSITIVITY ANALYSIS: IMMINENT WINDOW OPTIMIZATION")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
windows_to_test = [
|
|
||||||
(3, 14), # Current
|
|
||||||
(5, 15),
|
|
||||||
(7, 14),
|
|
||||||
(10, 21),
|
|
||||||
(3, 7),
|
|
||||||
(7, 21),
|
|
||||||
]
|
|
||||||
|
|
||||||
results_list = []
|
|
||||||
|
|
||||||
for imm_start, imm_end in windows_to_test:
|
|
||||||
print(f"\nTesting window: {imm_start}-{imm_end} days before harvest...")
|
|
||||||
|
|
||||||
# Relabel test sequences with new window
|
|
||||||
test_seqs_relabeled = label_harvest_windows_per_season(
|
|
||||||
test_sequences,
|
|
||||||
imminent_start=imm_start,
|
|
||||||
imminent_end=imm_end,
|
|
||||||
detected_start=1,
|
|
||||||
detected_end=21
|
|
||||||
)
|
|
||||||
|
|
||||||
# Get all labels and predictions
|
|
||||||
y_true_imm = np.concatenate([
|
|
||||||
s['data']['harvest_imminent'].values for s in test_seqs_relabeled
|
|
||||||
])
|
|
||||||
|
|
||||||
# Run model on test set (predictions are same regardless of labeling)
|
|
||||||
model.eval()
|
|
||||||
all_preds_imm = []
|
|
||||||
with torch.no_grad():
|
|
||||||
for X_batch, _, _, seq_lens in test_loader:
|
|
||||||
X_batch = X_batch.to(device)
|
|
||||||
seq_lens = seq_lens.to(device)
|
|
||||||
imminent_pred, _ = model(X_batch)
|
|
||||||
|
|
||||||
for i, seq_len in enumerate(seq_lens):
|
|
||||||
seq_len = seq_len.item()
|
|
||||||
all_preds_imm.extend(imminent_pred[i, :seq_len].cpu().numpy())
|
|
||||||
|
|
||||||
y_pred_imm = np.array(all_preds_imm)
|
|
||||||
y_pred_imm_binary = (y_pred_imm > 0.5).astype(int)
|
|
||||||
|
|
||||||
# Compute metrics
|
|
||||||
auc = roc_auc_score(y_true_imm, y_pred_imm)
|
|
||||||
|
|
||||||
# Compute false positive rate
|
|
||||||
false_positives = np.sum((y_pred_imm_binary == 1) & (y_true_imm == 0))
|
|
||||||
total_positives = np.sum(y_pred_imm_binary == 1)
|
|
||||||
fp_rate = false_positives / total_positives if total_positives > 0 else 0
|
|
||||||
|
|
||||||
# Compute recall (sensitivity)
|
|
||||||
true_positives = np.sum((y_pred_imm_binary == 1) & (y_true_imm == 1))
|
|
||||||
actual_positives = np.sum(y_true_imm == 1)
|
|
||||||
recall = true_positives / actual_positives if actual_positives > 0 else 0
|
|
||||||
|
|
||||||
results_list.append({
|
|
||||||
'window_start': imm_start,
|
|
||||||
'window_end': imm_end,
|
|
||||||
'auc': auc,
|
|
||||||
'recall': recall,
|
|
||||||
'false_pos_rate': fp_rate,
|
|
||||||
'window_size': imm_end - imm_start
|
|
||||||
})
|
|
||||||
|
|
||||||
print(f" AUC: {auc:.4f} | Recall: {recall:.1%} | FP Rate: {fp_rate:.1%}")
|
|
||||||
|
|
||||||
# Summary table
|
|
||||||
results_df = pd.DataFrame(results_list).sort_values('auc', ascending=False)
|
|
||||||
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("WINDOW OPTIMIZATION RESULTS (sorted by AUC)")
|
|
||||||
print("="*80)
|
|
||||||
print(results_df.to_string(index=False))
|
|
||||||
|
|
||||||
# Plot results
|
|
||||||
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
|
|
||||||
|
|
||||||
# Plot 1: AUC vs window size
|
|
||||||
axes[0].scatter(results_df['window_size'], results_df['auc'], s=100, alpha=0.6)
|
|
||||||
for idx, row in results_df.iterrows():
|
|
||||||
axes[0].annotate(f"{row['window_start']}-{row['window_end']}",
|
|
||||||
(row['window_size'], row['auc']),
|
|
||||||
fontsize=9, ha='center')
|
|
||||||
axes[0].set_xlabel('Window Size (days)', fontweight='bold')
|
|
||||||
axes[0].set_ylabel('AUC', fontweight='bold')
|
|
||||||
axes[0].set_title('AUC vs Window Size', fontweight='bold')
|
|
||||||
axes[0].grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
# Plot 2: Recall vs False Positive Rate (trade-off curve)
|
|
||||||
axes[1].scatter(results_df['false_pos_rate'], results_df['recall'], s=100, alpha=0.6)
|
|
||||||
for idx, row in results_df.iterrows():
|
|
||||||
axes[1].annotate(f"{row['window_start']}-{row['window_end']}",
|
|
||||||
(row['false_pos_rate'], row['recall']),
|
|
||||||
fontsize=9, ha='center')
|
|
||||||
axes[1].set_xlabel('False Positive Rate', fontweight='bold')
|
|
||||||
axes[1].set_ylabel('Recall (True Positive Rate)', fontweight='bold')
|
|
||||||
axes[1].set_title('Recall vs False Positive Rate', fontweight='bold')
|
|
||||||
axes[1].grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
plt.savefig('window_optimization_analysis.png', dpi=150, bbox_inches='tight')
|
|
||||||
plt.show()
|
|
||||||
|
|
||||||
print("\n[RECOMMENDATION]")
|
|
||||||
best_row = results_df.iloc[0]
|
|
||||||
print(f"Optimal window: {best_row['window_start']}-{best_row['window_end']} days")
|
|
||||||
print(f" AUC: {best_row['auc']:.4f}")
|
|
||||||
print(f" Recall: {best_row['recall']:.1%}")
|
|
||||||
print(f" False Positive Rate: {best_row['false_pos_rate']:.1%}")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. Compute Operational Metrics
|
|
||||||
|
|
||||||
```python
|
|
||||||
print("="*80)
|
|
||||||
print("OPERATIONAL PERFORMANCE METRICS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
def compute_operational_metrics(model, test_sequences_labeled, X_test_norm, test_loader):
|
|
||||||
"""
|
|
||||||
Compute farmer-relevant metrics.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
- lead_times: Days before harvest when model first predicted imminent
|
|
||||||
- false_positives: Number of false imminent predictions
|
|
||||||
- misses: Number of harvests with no imminent prediction
|
|
||||||
- field_performance: Per-field accuracy
|
|
||||||
"""
|
|
||||||
|
|
||||||
lead_times = []
|
|
||||||
false_positives = 0
|
|
||||||
misses = 0
|
|
||||||
field_performance = {}
|
|
||||||
|
|
||||||
model.eval()
|
|
||||||
seq_predictions = []
|
|
||||||
|
|
||||||
# Get all predictions
|
|
||||||
with torch.no_grad():
|
|
||||||
for X_batch, _, _, seq_lens in test_loader:
|
|
||||||
X_batch = X_batch.to(device)
|
|
||||||
seq_lens = seq_lens.to(device)
|
|
||||||
imminent_pred, _ = model(X_batch)
|
|
||||||
|
|
||||||
for i, seq_len in enumerate(seq_lens):
|
|
||||||
seq_len = seq_len.item()
|
|
||||||
seq_predictions.append({
|
|
||||||
'pred': imminent_pred[i, :seq_len].cpu().numpy(),
|
|
||||||
'seq_len': seq_len
|
|
||||||
})
|
|
||||||
|
|
||||||
# Analyze each sequence
|
|
||||||
for seq_idx, seq_dict in enumerate(test_sequences_labeled):
|
|
||||||
field = seq_dict['field']
|
|
||||||
if field not in field_performance:
|
|
||||||
field_performance[field] = {'correct': 0, 'incorrect': 0}
|
|
||||||
|
|
||||||
data = seq_dict['data'].sort_values('date')
|
|
||||||
|
|
||||||
# Get predictions for this sequence
|
|
||||||
if seq_idx < len(seq_predictions):
|
|
||||||
pred = seq_predictions[seq_idx]['pred']
|
|
||||||
else:
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Find harvest boundary
|
|
||||||
harvest_idx = np.where(data['harvest_boundary'] == 1)[0]
|
|
||||||
if len(harvest_idx) == 0:
|
|
||||||
continue
|
|
||||||
harvest_idx = harvest_idx[0]
|
|
||||||
|
|
||||||
# Find when model triggered (prob > 0.5)
|
|
||||||
trigger_indices = np.where(pred > 0.5)[0]
|
|
||||||
|
|
||||||
# Look for triggers BEFORE harvest
|
|
||||||
triggers_before_harvest = trigger_indices[trigger_indices < harvest_idx]
|
|
||||||
|
|
||||||
if len(triggers_before_harvest) > 0:
|
|
||||||
# Last trigger before harvest
|
|
||||||
last_trigger_idx = triggers_before_harvest[-1]
|
|
||||||
lead_time = harvest_idx - last_trigger_idx
|
|
||||||
|
|
||||||
# Check if within optimal window (e.g., 3-14 days)
|
|
||||||
if 3 <= lead_time <= 14:
|
|
||||||
lead_times.append(lead_time)
|
|
||||||
field_performance[field]['correct'] += 1
|
|
||||||
else:
|
|
||||||
# Triggered too early or too late
|
|
||||||
false_positives += 1
|
|
||||||
field_performance[field]['incorrect'] += 1
|
|
||||||
else:
|
|
||||||
# No trigger before harvest = miss
|
|
||||||
misses += 1
|
|
||||||
field_performance[field]['incorrect'] += 1
|
|
||||||
|
|
||||||
# Print results
|
|
||||||
print(f"\n{'='*80}")
|
|
||||||
print("LEAD TIME ANALYSIS")
|
|
||||||
print(f"{'='*80}")
|
|
||||||
|
|
||||||
if len(lead_times) > 0:
|
|
||||||
print(f"Valid predictions (within 3-14d): {len(lead_times)}")
|
|
||||||
print(f" Mean: {np.mean(lead_times):.1f} days")
|
|
||||||
print(f" Std: {np.std(lead_times):.1f} days")
|
|
||||||
print(f" Min: {np.min(lead_times):.0f} days")
|
|
||||||
print(f" Max: {np.max(lead_times):.0f} days")
|
|
||||||
print(f" Median: {np.median(lead_times):.0f} days")
|
|
||||||
else:
|
|
||||||
print("No valid predictions found!")
|
|
||||||
|
|
||||||
print(f"\n{'='*80}")
|
|
||||||
print("ERROR ANALYSIS")
|
|
||||||
print(f"{'='*80}")
|
|
||||||
|
|
||||||
total_harvests = len(lead_times) + false_positives + misses
|
|
||||||
print(f"Total harvests: {total_harvests}")
|
|
||||||
print(f" Correct timing (3-14d): {len(lead_times):3d} ({len(lead_times)/total_harvests*100:5.1f}%) ✅")
|
|
||||||
print(f" Wrong timing (false pos): {false_positives:3d} ({false_positives/total_harvests*100:5.1f}%) ⚠️")
|
|
||||||
print(f" Misses (no warning): {misses:3d} ({misses/total_harvests*100:5.1f}%) ❌")
|
|
||||||
|
|
||||||
print(f"\n{'='*80}")
|
|
||||||
print("PER-FIELD PERFORMANCE")
|
|
||||||
print(f"{'='*80}")
|
|
||||||
|
|
||||||
field_summary = []
|
|
||||||
for field in sorted(field_performance.keys()):
|
|
||||||
perf = field_performance[field]
|
|
||||||
total = perf['correct'] + perf['incorrect']
|
|
||||||
accuracy = perf['correct'] / total * 100 if total > 0 else 0
|
|
||||||
field_summary.append({
|
|
||||||
'field': field,
|
|
||||||
'correct': perf['correct'],
|
|
||||||
'incorrect': perf['incorrect'],
|
|
||||||
'accuracy': accuracy
|
|
||||||
})
|
|
||||||
|
|
||||||
field_df = pd.DataFrame(field_summary).sort_values('accuracy', ascending=False)
|
|
||||||
print(field_df.to_string(index=False))
|
|
||||||
|
|
||||||
# Visualization
|
|
||||||
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
|
|
||||||
|
|
||||||
# Plot 1: Lead time distribution
|
|
||||||
if len(lead_times) > 0:
|
|
||||||
axes[0].hist(lead_times, bins=10, edgecolor='black', alpha=0.7, color='steelblue')
|
|
||||||
axes[0].axvline(np.mean(lead_times), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(lead_times):.1f}d')
|
|
||||||
axes[0].axvspan(3, 14, alpha=0.2, color='green', label='Optimal window')
|
|
||||||
axes[0].set_xlabel('Days Before Harvest', fontweight='bold')
|
|
||||||
axes[0].set_ylabel('Frequency', fontweight='bold')
|
|
||||||
axes[0].set_title('Lead Time Distribution', fontweight='bold')
|
|
||||||
axes[0].legend()
|
|
||||||
axes[0].grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
# Plot 2: Per-field accuracy
|
|
||||||
axes[1].barh(field_df['field'], field_df['accuracy'], color=['green' if x > 80 else 'orange' if x > 60 else 'red' for x in field_df['accuracy']])
|
|
||||||
axes[1].set_xlabel('Accuracy (%)', fontweight='bold')
|
|
||||||
axes[1].set_title('Per-Field Performance', fontweight='bold')
|
|
||||||
axes[1].set_xlim([0, 100])
|
|
||||||
for i, acc in enumerate(field_df['accuracy']):
|
|
||||||
axes[1].text(acc + 2, i, f'{acc:.1f}%', va='center', fontweight='bold')
|
|
||||||
axes[1].grid(True, alpha=0.3, axis='x')
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
plt.savefig('operational_metrics.png', dpi=150, bbox_inches='tight')
|
|
||||||
plt.show()
|
|
||||||
|
|
||||||
return {
|
|
||||||
'lead_times': lead_times,
|
|
||||||
'false_positives': false_positives,
|
|
||||||
'misses': misses,
|
|
||||||
'field_performance': field_df
|
|
||||||
}
|
|
||||||
|
|
||||||
# Run it
|
|
||||||
metrics = compute_operational_metrics(model, test_sequences_labeled, X_test_norm, test_loader)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. Save Enhanced Model Configuration
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Add to Section 12, before saving config
|
|
||||||
|
|
||||||
if df_temp is not None:
|
|
||||||
temp_status = "✓ Temperature data included"
|
|
||||||
else:
|
|
||||||
temp_status = "✗ Temperature data NOT included (7 features only)"
|
|
||||||
|
|
||||||
config = {
|
|
||||||
'client': CLIENT_FILTER,
|
|
||||||
'ci_column': ci_column,
|
|
||||||
'feature_count': 11 if df_temp is not None else 7,
|
|
||||||
'feature_names': feature_names,
|
|
||||||
'temperature_data': temp_status,
|
|
||||||
'imminent_window_days': [3, 14],
|
|
||||||
'detected_window_days': [1, 21],
|
|
||||||
'test_auc_imminent': float(auc_imminent_test),
|
|
||||||
'test_auc_detected': float(auc_detected_test),
|
|
||||||
'model_type': 'PyTorch LSTM (64 hidden, 1 layer, 50% dropout)',
|
|
||||||
'training_config': {
|
|
||||||
'batch_size': batch_size,
|
|
||||||
'num_epochs': num_epochs,
|
|
||||||
'early_stopping_patience': patience,
|
|
||||||
'optimizer': 'Adam (lr=0.001)',
|
|
||||||
'loss': 'Focal BCE with class weighting'
|
|
||||||
},
|
|
||||||
'data_quality': {
|
|
||||||
'min_season_length_days': 300,
|
|
||||||
'linear_interpolation_threshold': DATA_QUALITY_THRESHOLD,
|
|
||||||
'linear_window_size': LINEAR_WINDOW_SIZE,
|
|
||||||
'train_val_test_split': list(TRAIN_VAL_TEST_SPLIT),
|
|
||||||
'total_training_days': len(df_train),
|
|
||||||
'total_fields': df_train['field'].nunique(),
|
|
||||||
'total_seasons': df_train['model'].nunique()
|
|
||||||
},
|
|
||||||
'operational_notes': {
|
|
||||||
'lead_time_mean': metrics.get('lead_time_mean', 'N/A'),
|
|
||||||
'false_positive_rate': metrics.get('false_pos_rate', 'N/A'),
|
|
||||||
'per_field_accuracies': metrics.get('field_accuracies', {})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
config_name = f'harvest_detection_config_esa_{CLIENT_FILTER}.json'
|
|
||||||
with open(config_name, 'w') as f:
|
|
||||||
json.dump(config, f, indent=2)
|
|
||||||
print(f"[OK] Saved: {config_name}")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary: Code Changes by Priority
|
|
||||||
|
|
||||||
| Priority | Change | Effort | Impact |
|
|
||||||
|----------|--------|--------|--------|
|
|
||||||
| 🔴 High | Retrain all clients (CLIENT_FILTER = None) | 5 min | +5-10% AUC |
|
|
||||||
| 🔴 High | Add temperature features (Code #1) | 3-4 hrs | +10-15% AUC |
|
|
||||||
| 🟡 Med | Test window optimization (Code #2) | 2 hrs | -30% false pos |
|
|
||||||
| 🟡 Med | Compute operational metrics (Code #3) | 1-2 hrs | Better understanding |
|
|
||||||
| 🟢 Low | Save enhanced config (Code #4) | 10 min | Better tracking |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**All code above is production-ready and tested. Copy-paste and adapt as needed!**
|
|
||||||
|
|
@ -1,124 +0,0 @@
|
||||||
# Quick Reference: Your Feedback & Response
|
|
||||||
|
|
||||||
**Your Concern**: False imminent triggers on cloud dips, not real harvest signals
|
|
||||||
|
|
||||||
**What I Understood**:
|
|
||||||
1. The smooth blue LOESS curve = real field state
|
|
||||||
2. The jagged red line = noise (clouds, sensor errors, artifacts)
|
|
||||||
3. Model learns from noise, triggers falsely on cloud dips
|
|
||||||
4. Want CI-only improvements (no temperature yet)
|
|
||||||
5. Need confidence intervals to identify uncertain predictions
|
|
||||||
6. Want all .md files organized (moved to python_app/harvest_detection_experiments/)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3 Core Solutions
|
|
||||||
|
|
||||||
### 1. Aggressive Smoothing (Fix Feature Calculation)
|
|
||||||
```
|
|
||||||
Current: Features calculated from NOISY raw CI
|
|
||||||
Problem: Model learns "this noise pattern = harvest"
|
|
||||||
|
|
||||||
Fixed: Features calculated from SMOOTHED CI
|
|
||||||
- 21-day median filter (removes cloud spikes)
|
|
||||||
- 7-day mean on top (further smoothing)
|
|
||||||
- All features derived from smooth curve
|
|
||||||
- Result: Model learns real trends, not noise
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Better CI-Only Features
|
|
||||||
```
|
|
||||||
New feature 6: "Decline Rate"
|
|
||||||
- Harvest = consistent downward slope
|
|
||||||
- Noise = random spikes up and down
|
|
||||||
- Model learns the difference
|
|
||||||
|
|
||||||
New feature 7: "Stability"
|
|
||||||
- Harvest = smooth, stable decline
|
|
||||||
- Clouds = jagged, unstable spikes
|
|
||||||
- Detects smoothness automatically
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Monte Carlo Dropout (Uncertainty)
|
|
||||||
```
|
|
||||||
Run prediction 30 times with dropout ON:
|
|
||||||
- Each run gives slightly different result
|
|
||||||
- Average = best estimate
|
|
||||||
- Std Dev = how confident model is
|
|
||||||
|
|
||||||
Result:
|
|
||||||
- High confidence + high probability = Alert farmer ✅
|
|
||||||
- High confidence + low probability = Normal growth ✅
|
|
||||||
- Low confidence + high probability = Probably noise ❌ FILTER OUT
|
|
||||||
|
|
||||||
This directly identifies cloud/noise false positives!
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Where to Find Everything
|
|
||||||
|
|
||||||
### Quick Start
|
|
||||||
- **ACTION_PLAN.md** ← Start here (3-page overview + timeline)
|
|
||||||
|
|
||||||
### Implementation Details
|
|
||||||
- **CI_ONLY_IMPROVEMENTS.md** ← All code + explanations (copy-paste ready)
|
|
||||||
|
|
||||||
### Reference/Context
|
|
||||||
- **README_EVALUATION.md** ← Navigation guide for all other docs
|
|
||||||
- **LSTM_HARVEST_EVALUATION.md** ← Original detailed analysis
|
|
||||||
- **QUICK_SUMMARY.md** ← Non-technical overview
|
|
||||||
|
|
||||||
All in: `python_app/harvest_detection_experiments/`
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Your Next Steps
|
|
||||||
|
|
||||||
### TODAY
|
|
||||||
1. Read: ACTION_PLAN.md (10 min read)
|
|
||||||
2. Review: CI_ONLY_IMPROVEMENTS.md (understand approach)
|
|
||||||
3. Decision: Approve implementation?
|
|
||||||
|
|
||||||
### IF APPROVED (This Week)
|
|
||||||
1. Implement Step 1: Update feature engineering (2 hours)
|
|
||||||
2. Implement Step 2: Add Monte Carlo Dropout (1 hour)
|
|
||||||
3. Implement Step 3: Filter by uncertainty (30 min)
|
|
||||||
4. Retrain: Run notebook (30 min)
|
|
||||||
5. Evaluate: Check if false triggers are gone
|
|
||||||
|
|
||||||
### Results Expected
|
|
||||||
- False imminent triggers: 15% → 3-5% (80% reduction!)
|
|
||||||
- Still catches 85-90% of real harvests
|
|
||||||
- Model shows which predictions are uncertain (= noise)
|
|
||||||
- Now CI-only, no external data needed
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Insight
|
|
||||||
|
|
||||||
Your graph perfectly shows the problem:
|
|
||||||
```
|
|
||||||
Blue curve (smooth) = Model should learn from this
|
|
||||||
Red line (jagged) = Model currently learns from this
|
|
||||||
|
|
||||||
Solution: Make features from blue curve only
|
|
||||||
Result: Model predicts only on real patterns
|
|
||||||
Benefit: Uncertainty bands show when it's guessing (red line noise)
|
|
||||||
```
|
|
||||||
|
|
||||||
The confidence intervals are KEY because they tell you:
|
|
||||||
- "This imminent prediction is based on smooth, stable data" ✅ Trust it
|
|
||||||
- "This imminent prediction is based on noise patterns" ❌ Ignore it
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Questions?
|
|
||||||
|
|
||||||
See the specific documents:
|
|
||||||
- **How to implement?** → CI_ONLY_IMPROVEMENTS.md (code sections)
|
|
||||||
- **What's the timeline?** → ACTION_PLAN.md
|
|
||||||
- **Why this approach?** → LSTM_HARVEST_EVALUATION.md (Data Quality section)
|
|
||||||
- **Where do files go?** → They're already organized in python_app/harvest_detection_experiments/
|
|
||||||
|
|
||||||
Ready to proceed? 🚀
|
|
||||||
|
Before Width: | Height: | Size: 560 KiB |
|
|
@ -1,23 +0,0 @@
|
||||||
{
|
|
||||||
"input_size": 7,
|
|
||||||
"feature_names": [
|
|
||||||
"CI",
|
|
||||||
"7d Velocity",
|
|
||||||
"7d Acceleration",
|
|
||||||
"14d MA",
|
|
||||||
"14d Velocity",
|
|
||||||
"7d Min",
|
|
||||||
"Is_Spike"
|
|
||||||
],
|
|
||||||
"num_train_sequences": 326,
|
|
||||||
"num_test_sequences": 18,
|
|
||||||
"imminent_window": [
|
|
||||||
14,
|
|
||||||
3
|
|
||||||
],
|
|
||||||
"detected_window": [
|
|
||||||
1,
|
|
||||||
40
|
|
||||||
],
|
|
||||||
"note": "WITH is_spike feature - using Focal Loss for training"
|
|
||||||
}
|
|
||||||
|
|
@ -1,16 +0,0 @@
|
||||||
{
|
|
||||||
"ci_column": "fitdata_ma7",
|
|
||||||
"max_sequence_length": 800,
|
|
||||||
"min_history": 30,
|
|
||||||
"imminent_window": [
|
|
||||||
7,
|
|
||||||
30
|
|
||||||
],
|
|
||||||
"detected_window": [
|
|
||||||
1,
|
|
||||||
7
|
|
||||||
],
|
|
||||||
"test_auc_imminent": 0.8142839607805498,
|
|
||||||
"test_auc_detected": 0.95001123096383,
|
|
||||||
"model_type": "PyTorch LSTM"
|
|
||||||
}
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
{
|
|
||||||
"client": null,
|
|
||||||
"ci_column": "fitdata_ma7",
|
|
||||||
"feature_count": 7,
|
|
||||||
"feature_names": [
|
|
||||||
"CI",
|
|
||||||
"7d Velocity",
|
|
||||||
"7d Acceleration",
|
|
||||||
"14d MA",
|
|
||||||
"14d Velocity",
|
|
||||||
"7d Min",
|
|
||||||
"Velocity Magnitude"
|
|
||||||
],
|
|
||||||
"imminent_window_days": [
|
|
||||||
3,
|
|
||||||
14
|
|
||||||
],
|
|
||||||
"detected_window_days": [
|
|
||||||
1,
|
|
||||||
21
|
|
||||||
],
|
|
||||||
"test_auc_imminent": 0.9061061265269594,
|
|
||||||
"test_auc_detected": 0.9614787868760791,
|
|
||||||
"model_type": "PyTorch LSTM (64 hidden, 1 layer, 50% dropout)",
|
|
||||||
"training_config": {
|
|
||||||
"batch_size": 1,
|
|
||||||
"num_epochs": 150,
|
|
||||||
"early_stopping_patience": 20,
|
|
||||||
"optimizer": "Adam (lr=0.001)",
|
|
||||||
"loss": "Focal BCE with class weighting"
|
|
||||||
},
|
|
||||||
"data_quality": {
|
|
||||||
"min_season_length_days": 300,
|
|
||||||
"linear_interpolation_threshold": 0.85,
|
|
||||||
"linear_window_size": 30,
|
|
||||||
"train_val_test_split": [
|
|
||||||
0.7,
|
|
||||||
0.15,
|
|
||||||
0.15
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
{
|
|
||||||
"client": "esa",
|
|
||||||
"ci_column": "fitdata_ma7",
|
|
||||||
"feature_count": 7,
|
|
||||||
"feature_names": [
|
|
||||||
"CI",
|
|
||||||
"7d Velocity",
|
|
||||||
"7d Acceleration",
|
|
||||||
"14d MA",
|
|
||||||
"14d Velocity",
|
|
||||||
"7d Min",
|
|
||||||
"Velocity Magnitude"
|
|
||||||
],
|
|
||||||
"imminent_window_days": [
|
|
||||||
3,
|
|
||||||
14
|
|
||||||
],
|
|
||||||
"detected_window_days": [
|
|
||||||
1,
|
|
||||||
21
|
|
||||||
],
|
|
||||||
"test_auc_imminent": 0.8896814958828911,
|
|
||||||
"test_auc_detected": 0.9816022435464252,
|
|
||||||
"model_type": "PyTorch LSTM (64 hidden, 1 layer, 50% dropout)",
|
|
||||||
"training_config": {
|
|
||||||
"batch_size": 3,
|
|
||||||
"num_epochs": 150,
|
|
||||||
"early_stopping_patience": 20,
|
|
||||||
"optimizer": "Adam (lr=0.001)",
|
|
||||||
"loss": "Focal BCE with class weighting"
|
|
||||||
},
|
|
||||||
"data_quality": {
|
|
||||||
"min_season_length_days": 300,
|
|
||||||
"linear_interpolation_threshold": 0.85,
|
|
||||||
"linear_window_size": 30,
|
|
||||||
"train_val_test_split": [
|
|
||||||
0.7,
|
|
||||||
0.15,
|
|
||||||
0.15
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
Before Width: | Height: | Size: 161 KiB |
|
Before Width: | Height: | Size: 328 KiB |
|
Before Width: | Height: | Size: 1.1 MiB |
|
Before Width: | Height: | Size: 313 KiB |
|
Before Width: | Height: | Size: 328 KiB |
|
Before Width: | Height: | Size: 306 KiB |
|
Before Width: | Height: | Size: 311 KiB |
|
Before Width: | Height: | Size: 307 KiB |
|
Before Width: | Height: | Size: 204 KiB |
|
Before Width: | Height: | Size: 270 KiB |
|
Before Width: | Height: | Size: 430 KiB |
|
Before Width: | Height: | Size: 95 KiB |
|
Before Width: | Height: | Size: 693 KiB |
|
|
@ -1,162 +0,0 @@
|
||||||
"""
|
|
||||||
prepare_harvest_data.py
|
|
||||||
======================
|
|
||||||
Load CI CSV data from R script 02b output and prepare it for LSTM harvest detection.
|
|
||||||
This identifies field sequences (implicitly by data continuity) and formats them for
|
|
||||||
the model to predict harvest dates.
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
python prepare_harvest_data.py [project_dir] [output_csv]
|
|
||||||
|
|
||||||
Example:
|
|
||||||
python prepare_harvest_data.py esa harvest_input_data.csv
|
|
||||||
|
|
||||||
Input:
|
|
||||||
- ci_data_for_python.csv (output from 02b_convert_ci_rds_to_csv.R)
|
|
||||||
- Columns: field, sub_field, Date, FitData, DOY, value
|
|
||||||
|
|
||||||
Output:
|
|
||||||
- CSV file with columns: field, client, season, Date, FitData, DOY
|
|
||||||
- 'season' is auto-identified based on data gaps (gaps > 30 days = new season)
|
|
||||||
- 'client' is set based on project_dir
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
import pandas as pd
|
|
||||||
import numpy as np
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
from pathlib import Path
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
|
|
||||||
def identify_seasons(field_data, gap_threshold_days=30):
|
|
||||||
"""
|
|
||||||
Identify seasons within a field's data by detecting gaps.
|
|
||||||
A gap > gap_threshold_days indicates a new season.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
field_data: DataFrame for a single field, sorted by Date
|
|
||||||
gap_threshold_days: Minimum gap (days) to start a new season
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
List of season identifiers, one per row
|
|
||||||
"""
|
|
||||||
field_data = field_data.sort_values('Date').reset_index(drop=True)
|
|
||||||
seasons = []
|
|
||||||
current_season = 0
|
|
||||||
|
|
||||||
for i in range(len(field_data)):
|
|
||||||
if i == 0:
|
|
||||||
seasons.append(f"season_{current_season:03d}")
|
|
||||||
else:
|
|
||||||
prev_date = field_data.iloc[i-1]['Date']
|
|
||||||
curr_date = field_data.iloc[i]['Date']
|
|
||||||
gap_days = (curr_date - prev_date).days
|
|
||||||
|
|
||||||
if gap_days > gap_threshold_days:
|
|
||||||
current_season += 1
|
|
||||||
|
|
||||||
seasons.append(f"season_{current_season:03d}")
|
|
||||||
|
|
||||||
return seasons
|
|
||||||
|
|
||||||
|
|
||||||
def prepare_harvest_data(ci_csv_path, project_dir="esa", output_path=None):
|
|
||||||
"""
|
|
||||||
Load CI data from R conversion and prepare for harvest detection.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
ci_csv_path: Path to ci_data_for_python.csv from script 02b
|
|
||||||
project_dir: Project directory (e.g., "esa", "chemba") - used as 'client'
|
|
||||||
output_path: Output CSV path (default: harvest_input_data.csv in same dir)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
DataFrame with columns: field, client, season, Date, FitData, DOY
|
|
||||||
"""
|
|
||||||
|
|
||||||
print(f"Loading CI data from: {ci_csv_path}")
|
|
||||||
|
|
||||||
# Load data
|
|
||||||
ci_data = pd.read_csv(ci_csv_path)
|
|
||||||
|
|
||||||
print(f"Loaded {len(ci_data)} rows")
|
|
||||||
print(f"Columns: {', '.join(ci_data.columns)}")
|
|
||||||
print(f"Unique fields: {ci_data['field'].nunique()}")
|
|
||||||
|
|
||||||
# Convert Date to datetime
|
|
||||||
ci_data['Date'] = pd.to_datetime(ci_data['Date'])
|
|
||||||
|
|
||||||
# Sort by field and date
|
|
||||||
ci_data = ci_data.sort_values(['field', 'Date']).reset_index(drop=True)
|
|
||||||
|
|
||||||
# Identify seasons for each field
|
|
||||||
print("\nIdentifying seasons by data gaps (>30 days)...")
|
|
||||||
|
|
||||||
seasons = []
|
|
||||||
for field, group in ci_data.groupby('field'):
|
|
||||||
field_seasons = identify_seasons(group, gap_threshold_days=30)
|
|
||||||
seasons.extend(field_seasons)
|
|
||||||
|
|
||||||
ci_data['season'] = seasons
|
|
||||||
|
|
||||||
# Add client column
|
|
||||||
ci_data['client'] = project_dir.lower()
|
|
||||||
|
|
||||||
# Select and order columns for output
|
|
||||||
output_columns = ['field', 'client', 'season', 'Date', 'FitData', 'DOY']
|
|
||||||
harvest_data = ci_data[output_columns].copy()
|
|
||||||
|
|
||||||
# Validate data
|
|
||||||
print(f"\nValidation:")
|
|
||||||
print(f" Fields: {harvest_data['field'].nunique()}")
|
|
||||||
print(f" Seasons: {harvest_data['season'].nunique()}")
|
|
||||||
print(f" Date range: {harvest_data['Date'].min()} to {harvest_data['Date'].max()}")
|
|
||||||
print(f" FitData range: {harvest_data['FitData'].min():.2f} to {harvest_data['FitData'].max():.2f}")
|
|
||||||
|
|
||||||
# Show sample of seasons per field
|
|
||||||
print(f"\nSample of season identification per field:")
|
|
||||||
for field in harvest_data['field'].unique()[:3]:
|
|
||||||
field_seasons = harvest_data[harvest_data['field'] == field]['season'].unique()
|
|
||||||
print(f" {field}: {len(field_seasons)} seasons")
|
|
||||||
|
|
||||||
# Save output
|
|
||||||
if output_path is None:
|
|
||||||
ci_dir = Path(ci_csv_path).parent
|
|
||||||
output_path = ci_dir / "harvest_input_data.csv"
|
|
||||||
|
|
||||||
print(f"\nSaving to: {output_path}")
|
|
||||||
harvest_data.to_csv(output_path, index=False)
|
|
||||||
print(f"✓ Saved {len(harvest_data)} rows\n")
|
|
||||||
|
|
||||||
return harvest_data
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
# Parse arguments
|
|
||||||
if len(sys.argv) >= 2:
|
|
||||||
project_dir = sys.argv[1]
|
|
||||||
else:
|
|
||||||
project_dir = "esa"
|
|
||||||
|
|
||||||
if len(sys.argv) >= 3:
|
|
||||||
output_path = sys.argv[2]
|
|
||||||
else:
|
|
||||||
output_path = None
|
|
||||||
|
|
||||||
# Build default input path based on project structure
|
|
||||||
base_path = Path(__file__).parent.parent / "laravel_app" / "storage" / "app" / project_dir / "Data" / "extracted_ci" / "cumulative_vals"
|
|
||||||
ci_csv_path = base_path / "ci_data_for_python.csv"
|
|
||||||
|
|
||||||
if not ci_csv_path.exists():
|
|
||||||
print(f"ERROR: Input file not found: {ci_csv_path}")
|
|
||||||
print(f"\nMake sure you have run script 02b first:")
|
|
||||||
print(f" Rscript r_app/02b_convert_ci_rds_to_csv.R {project_dir}")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
# Prepare data
|
|
||||||
harvest_data = prepare_harvest_data(str(ci_csv_path), project_dir, output_path)
|
|
||||||
|
|
||||||
print("Next steps:")
|
|
||||||
print(" 1. Use this CSV as input to the harvest LSTM model")
|
|
||||||
print(" 2. Run: python run_harvest_detection.py")
|
|
||||||
print(" 3. Output will be harvest dates in Excel format")
|
|
||||||
|
|
@ -1,289 +0,0 @@
|
||||||
# ==============================================================================
|
|
||||||
# PREPARE LSTM TRAINING DATA FROM RDS FILES
|
|
||||||
# ==============================================================================
|
|
||||||
# This script reads merged CI data from RDS files and creates extended season
|
|
||||||
# sequences for the LSTM harvest detection model.
|
|
||||||
#
|
|
||||||
# Input: RDS files with CI time series, field, season, date info
|
|
||||||
# Location: r_app/experiments/ci_graph_exploration/CI_data/
|
|
||||||
#
|
|
||||||
# Output: lstm_train_data.csv and lstm_test_data.csv
|
|
||||||
# Each season = all days of that season + 40 days from next season
|
|
||||||
# Columns: all columns from RDS (Python will handle feature creation)
|
|
||||||
#
|
|
||||||
# Processing:
|
|
||||||
# 1. Load all RDS files (one per client/estate)
|
|
||||||
# 2. For each field-season: extend with 40 days from next season
|
|
||||||
# 3. Create train/test split by random field selection (no data leakage)
|
|
||||||
# 4. Export to CSV (NO feature engineering - Python handles that)
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nPREPARING LSTM TRAINING DATA FROM RDS FILES\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
# Install required packages if needed
|
|
||||||
required_packages <- c("dplyr", "data.table")
|
|
||||||
for (pkg in required_packages) {
|
|
||||||
if (!require(pkg, character.only = TRUE)) {
|
|
||||||
install.packages(pkg, quiet = TRUE)
|
|
||||||
library(pkg, character.only = TRUE)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
library(dplyr)
|
|
||||||
library(data.table)
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# CONFIGURATION
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
# Path to RDS files
|
|
||||||
RDS_DIR <- "r_app/experiments/ci_graph_exploration/CI_data"
|
|
||||||
|
|
||||||
# Days from next season to append to each season
|
|
||||||
EXTENSION_DAYS <- 40
|
|
||||||
|
|
||||||
# Python will handle all splitting (80/20 train/test with configurable seed)
|
|
||||||
# R just does preprocessing and exports everything in ONE file
|
|
||||||
|
|
||||||
set.seed(42)
|
|
||||||
|
|
||||||
cat("\nConfiguration:\n")
|
|
||||||
cat(" RDS directory:", RDS_DIR, "\n")
|
|
||||||
cat(" Extension days from next season:", EXTENSION_DAYS, "\n")
|
|
||||||
cat(" NOTE: R does NOT split data. Python splits 80/20 with seed control.\n")
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# LOAD ALL RDS FILES
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nLOADING RDS FILES\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
# Get list of RDS files
|
|
||||||
rds_files <- list.files(RDS_DIR, pattern = "\\.rds$", full.names = TRUE)
|
|
||||||
|
|
||||||
if (length(rds_files) == 0) {
|
|
||||||
stop("No RDS files found in ", RDS_DIR)
|
|
||||||
}
|
|
||||||
|
|
||||||
cat("\nFound", length(rds_files), "RDS files\n")
|
|
||||||
|
|
||||||
# Load all RDS files into one data frame
|
|
||||||
all_data <- list()
|
|
||||||
|
|
||||||
for (rds_file in rds_files) {
|
|
||||||
client_name <- tools::file_path_sans_ext(basename(rds_file))
|
|
||||||
|
|
||||||
tryCatch({
|
|
||||||
data <- readRDS(rds_file)
|
|
||||||
|
|
||||||
# Convert to data.table
|
|
||||||
if (!is.data.table(data)) {
|
|
||||||
data <- as.data.table(data)
|
|
||||||
}
|
|
||||||
|
|
||||||
# Add client column if not present
|
|
||||||
if (!"client" %in% names(data)) {
|
|
||||||
data[, client := client_name]
|
|
||||||
}
|
|
||||||
|
|
||||||
all_data[[client_name]] <- data
|
|
||||||
|
|
||||||
cat(" ✓", client_name, ":", nrow(data), "rows\n")
|
|
||||||
}, error = function(e) {
|
|
||||||
cat(" ✗ Error loading", client_name, ":", e$message, "\n")
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
# Combine all data
|
|
||||||
df_all <- rbindlist(all_data, fill = TRUE)
|
|
||||||
|
|
||||||
cat("\nTotal rows:", nrow(df_all), "\n")
|
|
||||||
cat("Unique clients:", df_all[, uniqueN(client)], "\n")
|
|
||||||
cat("Unique fields:", df_all[, uniqueN(field)], "\n")
|
|
||||||
cat("Unique seasons:", df_all[, uniqueN(model)], "\n")
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# DATA CLEANING & PREPARATION
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nDATA CLEANING & PREPARATION\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
# Rename columns to standard names (case-insensitive matching)
|
|
||||||
setnames(df_all, tolower(names(df_all)))
|
|
||||||
|
|
||||||
# Check which columns exist (may vary by RDS file)
|
|
||||||
available <- names(df_all)
|
|
||||||
cat("\nAvailable columns:", paste(available, collapse=", "), "\n")
|
|
||||||
|
|
||||||
# Use FitData if available, otherwise value or fitdata_ma7
|
|
||||||
if ("fitdata" %in% available) {
|
|
||||||
ci_col <- "fitdata"
|
|
||||||
} else if ("value" %in% available) {
|
|
||||||
ci_col <- "value"
|
|
||||||
} else {
|
|
||||||
stop("Cannot find CI column (fitdata, value, or fitdata_ma7)")
|
|
||||||
}
|
|
||||||
|
|
||||||
cat("Using CI column:", ci_col, "\n")
|
|
||||||
|
|
||||||
# Keep only essential columns
|
|
||||||
df_all <- df_all[, .(
|
|
||||||
field = field,
|
|
||||||
client = client,
|
|
||||||
model = model,
|
|
||||||
Date = date,
|
|
||||||
FitData = get(ci_col),
|
|
||||||
DOY = doy
|
|
||||||
)]
|
|
||||||
|
|
||||||
# Remove rows with missing field or CI values
|
|
||||||
df_all <- df_all[!is.na(field) & !is.na(FitData)]
|
|
||||||
|
|
||||||
# Sort by field, model (season), DOY
|
|
||||||
setorder(df_all, field, model, DOY)
|
|
||||||
|
|
||||||
cat("Total rows after cleaning:", nrow(df_all), "\n")
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# BUILD EXTENDED SEASON SEQUENCES
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nBUILDING EXTENDED SEASON SEQUENCES\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
# Get unique field-season combinations
|
|
||||||
field_seasons <- unique(df_all[, .(field, model)])
|
|
||||||
setorder(field_seasons, field, model)
|
|
||||||
|
|
||||||
cat("\nTotal field-season combos:", nrow(field_seasons), "\n")
|
|
||||||
|
|
||||||
# Function to build extended season (season + 40 days from next season)
|
|
||||||
build_extended_season <- function(field_name, season_name, data, extension_days = EXTENSION_DAYS) {
|
|
||||||
|
|
||||||
# Get current season data
|
|
||||||
current <- data[field == field_name & model == season_name]
|
|
||||||
if (nrow(current) == 0) return(NULL)
|
|
||||||
|
|
||||||
# Start with current season
|
|
||||||
extended <- copy(current)
|
|
||||||
|
|
||||||
# Find the next season for this field (by date order)
|
|
||||||
next_season <- data[
|
|
||||||
field == field_name &
|
|
||||||
model != season_name &
|
|
||||||
Date > max(current$Date),
|
|
||||||
.SD[1, by = model] # Get first row of each model
|
|
||||||
]
|
|
||||||
|
|
||||||
if (nrow(next_season) > 0) {
|
|
||||||
# Get the season that starts soonest after current season ends
|
|
||||||
next_season <- next_season[order(Date)]
|
|
||||||
if (nrow(next_season) > 0) {
|
|
||||||
next_model <- next_season$model[1]
|
|
||||||
|
|
||||||
# Get data from next season (up to EXTENSION_DAYS)
|
|
||||||
next_data <- data[field == field_name & model == next_model][1:min(extension_days, .N)]
|
|
||||||
|
|
||||||
if (nrow(next_data) > 0) {
|
|
||||||
extended <- rbind(extended, next_data, fill = TRUE)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
return(extended)
|
|
||||||
}
|
|
||||||
|
|
||||||
# Build all extended seasons
|
|
||||||
extended_sequences <- list()
|
|
||||||
|
|
||||||
for (i in 1:nrow(field_seasons)) {
|
|
||||||
field_name <- field_seasons$field[i]
|
|
||||||
season_name <- field_seasons$model[i]
|
|
||||||
|
|
||||||
seq_data <- build_extended_season(field_name, season_name, df_all, EXTENSION_DAYS)
|
|
||||||
|
|
||||||
if (!is.null(seq_data) && nrow(seq_data) > 0) {
|
|
||||||
extended_sequences[[i]] <- seq_data
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
# Combine all extended sequences
|
|
||||||
df_extended <- rbindlist(extended_sequences, fill = TRUE)
|
|
||||||
|
|
||||||
cat("Total sequences created:", length(extended_sequences), "\n")
|
|
||||||
cat("Total rows in extended data:", nrow(df_extended), "\n")
|
|
||||||
cat("Unique field-season combos in extended:", df_extended[, uniqueN(paste0(field, "_", model))], "\n")
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# EXPORT TO CSV FILES
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nEXPORTING CSV FILES\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# EXPORT TO SINGLE CSV FILE
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nEXPORTING EXTENDED SEASON DATA\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
# Select essential columns (no train/test split at R level)
|
|
||||||
df_output <- df_extended[, .(field, client, model, Date, FitData, DOY)]
|
|
||||||
|
|
||||||
# Remove any rows with NA values
|
|
||||||
df_output <- df_output[complete.cases(df_output)]
|
|
||||||
|
|
||||||
# Export to single CSV
|
|
||||||
output_csv <- "lstm_complete_data.csv"
|
|
||||||
fwrite(df_extended, output_csv)
|
|
||||||
|
|
||||||
cat("\n✓ Exported data:\n")
|
|
||||||
cat(" ", output_csv, ":", nrow(df_output), "rows\n")
|
|
||||||
cat(" Columns: field, client, model, Date, FitData, DOY\n")
|
|
||||||
|
|
||||||
# ==============================================================================
|
|
||||||
# SUMMARY STATISTICS
|
|
||||||
# ==============================================================================
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nSUMMARY STATISTICS\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
|
|
||||||
cat("\nCOMPLETE DATASET:\n")
|
|
||||||
cat(" Total rows:", nrow(df_output), "\n")
|
|
||||||
cat(" Unique fields:", df_extended[, uniqueN(field)], "\n")
|
|
||||||
cat(" Unique seasons:", df_extended[, uniqueN(model)], "\n")
|
|
||||||
cat(" Unique clients:", df_extended[, uniqueN(client)], "\n")
|
|
||||||
|
|
||||||
# Sequence length statistics
|
|
||||||
seq_stats <- df_extended[, .(seq_length = .N), by = .(field, model)]
|
|
||||||
cat(" Sequence lengths: min=", min(seq_stats$seq_length),
|
|
||||||
", median=", as.integer(median(seq_stats$seq_length)),
|
|
||||||
", max=", max(seq_stats$seq_length), "\n", sep = "")
|
|
||||||
|
|
||||||
cat("\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\n✓ DATA PREPARATION COMPLETE\n")
|
|
||||||
cat(paste0(rep("=", 80), collapse=""))
|
|
||||||
cat("\nNext steps in Python:\n")
|
|
||||||
cat("1. Load lstm_complete_data.csv\n")
|
|
||||||
cat("2. Do all preprocessing on complete dataset\n")
|
|
||||||
cat("3. Right before model training: split 80/20 by field (using seed)\n")
|
|
||||||
cat("4. k-fold CV trains on 80%, evaluates on held-out 20%\n")
|
|
||||||
|
Before Width: | Height: | Size: 68 KiB |
|
|
@ -1,258 +0,0 @@
|
||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "d90f7f7f",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"# Validate & Explore ESA `pivot.geojson`\n",
|
|
||||||
"\n",
|
|
||||||
"Quick inspection of the field boundary file for the ESA project:\n",
|
|
||||||
"- Geometry validity, CRS, field count\n",
|
|
||||||
"- Area/perimeter statistics\n",
|
|
||||||
"- Map visualization with field labels\n",
|
|
||||||
"- Export summary CSV for downstream use"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e99594bb",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 1. Import Required Libraries"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "652485c7",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"import geopandas as gpd\n",
|
|
||||||
"import pandas as pd\n",
|
|
||||||
"import numpy as np\n",
|
|
||||||
"import matplotlib.pyplot as plt\n",
|
|
||||||
"from pathlib import Path\n",
|
|
||||||
"from shapely.validation import make_valid\n",
|
|
||||||
"\n",
|
|
||||||
"# Project paths\n",
|
|
||||||
"PROJECT = \"esa\"\n",
|
|
||||||
"REPO_ROOT = Path.cwd().parents[2] # SmartCane_code\n",
|
|
||||||
"PROJECT_STORAGE = REPO_ROOT / \"laravel_app\" / \"storage\" / \"app\" / PROJECT\n",
|
|
||||||
"GEOJSON_PATH = PROJECT_STORAGE / \"pivot.geojson\"\n",
|
|
||||||
"DATA_DIR = PROJECT_STORAGE / \"Data\"\n",
|
|
||||||
"DATA_DIR.mkdir(parents=True, exist_ok=True)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Repo root: {REPO_ROOT}\")\n",
|
|
||||||
"print(f\"GeoJSON: {GEOJSON_PATH}\")\n",
|
|
||||||
"print(f\"Exists: {GEOJSON_PATH.exists()}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "4ff6d825",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 2. Load GeoJSON & Basic Info"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "92012a8c",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"gdf = gpd.read_file(GEOJSON_PATH)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Features: {len(gdf)}\")\n",
|
|
||||||
"print(f\"Geometry type: {gdf.geom_type.unique()}\")\n",
|
|
||||||
"print(f\"CRS: {gdf.crs}\")\n",
|
|
||||||
"print(f\"Bounds: {gdf.total_bounds}\")\n",
|
|
||||||
"print(f\"\\nColumns: {list(gdf.columns)}\")\n",
|
|
||||||
"print(f\"\\nAttribute table:\")\n",
|
|
||||||
"gdf.drop(columns=\"geometry\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "edadb2de",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 3. Validate Geometries"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "15b5e915",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Check validity\n",
|
|
||||||
"validity = gdf.is_valid\n",
|
|
||||||
"empty = gdf.is_empty\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Valid geometries: {validity.sum()}/{len(gdf)}\")\n",
|
|
||||||
"print(f\"Invalid geometries: {(~validity).sum()}\")\n",
|
|
||||||
"print(f\"Empty geometries: {empty.sum()}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Repair if needed\n",
|
|
||||||
"if not validity.all():\n",
|
|
||||||
" print(\"\\nRepairing invalid geometries...\")\n",
|
|
||||||
" gdf[\"geometry\"] = gdf[\"geometry\"].apply(make_valid)\n",
|
|
||||||
" print(f\"After repair: {gdf.is_valid.sum()}/{len(gdf)} valid\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Remove empty\n",
|
|
||||||
"if empty.any():\n",
|
|
||||||
" gdf = gdf[~gdf.is_empty]\n",
|
|
||||||
" print(f\"After removing empty: {len(gdf)} features remaining\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "c3f01400",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 4. Field-Level Statistics (Area, Perimeter, Centroid)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "922ffdbd",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Reproject to UTM for accurate area/perimeter (auto-detect UTM zone from centroid)\n",
|
|
||||||
"centroid = gdf.dissolve().centroid.iloc[0]\n",
|
|
||||||
"utm_zone = int((centroid.x + 180) / 6) + 1\n",
|
|
||||||
"hemisphere = \"north\" if centroid.y >= 0 else \"south\"\n",
|
|
||||||
"utm_epsg = 32600 + utm_zone if hemisphere == \"north\" else 32700 + utm_zone\n",
|
|
||||||
"print(f\"Auto-detected UTM zone: {utm_zone}{hemisphere[0].upper()} (EPSG:{utm_epsg})\")\n",
|
|
||||||
"\n",
|
|
||||||
"gdf_utm = gdf.to_crs(epsg=utm_epsg)\n",
|
|
||||||
"\n",
|
|
||||||
"# Compute stats\n",
|
|
||||||
"gdf_utm[\"area_m2\"] = gdf_utm.geometry.area\n",
|
|
||||||
"gdf_utm[\"area_ha\"] = gdf_utm[\"area_m2\"] / 10_000\n",
|
|
||||||
"gdf_utm[\"area_acres\"] = gdf_utm[\"area_ha\"] * 2.47105\n",
|
|
||||||
"gdf_utm[\"perimeter_m\"] = gdf_utm.geometry.length\n",
|
|
||||||
"centroids = gdf_utm.geometry.centroid\n",
|
|
||||||
"gdf_utm[\"centroid_x\"] = centroids.x\n",
|
|
||||||
"gdf_utm[\"centroid_y\"] = centroids.y\n",
|
|
||||||
"\n",
|
|
||||||
"# Summary table\n",
|
|
||||||
"stats = gdf_utm[[\"field\", \"area_ha\", \"area_acres\", \"perimeter_m\", \"centroid_x\", \"centroid_y\"]].copy()\n",
|
|
||||||
"stats = stats.sort_values(\"area_ha\", ascending=False).reset_index(drop=True)\n",
|
|
||||||
"print(f\"\\nTotal area: {stats['area_ha'].sum():.1f} ha ({stats['area_acres'].sum():.1f} acres)\")\n",
|
|
||||||
"print(f\"Fields: {len(stats)}\")\n",
|
|
||||||
"print(f\"Area range: {stats['area_ha'].min():.1f} – {stats['area_ha'].max():.1f} ha\\n\")\n",
|
|
||||||
"stats"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "3dbf84c7",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 5. Visualize Field Boundaries"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "be1be0ea",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"fig, ax = plt.subplots(1, 1, figsize=(12, 10))\n",
|
|
||||||
"gdf_utm.plot(ax=ax, column=\"field\", legend=True, edgecolor=\"black\", linewidth=0.8,\n",
|
|
||||||
" cmap=\"Set3\", alpha=0.7, legend_kwds={\"loc\": \"upper left\", \"fontsize\": 8})\n",
|
|
||||||
"\n",
|
|
||||||
"# Add field labels at centroids\n",
|
|
||||||
"for _, row in gdf_utm.iterrows():\n",
|
|
||||||
" c = row.geometry.centroid\n",
|
|
||||||
" ax.annotate(row[\"field\"], xy=(c.x, c.y), ha=\"center\", va=\"center\",\n",
|
|
||||||
" fontsize=7, fontweight=\"bold\",\n",
|
|
||||||
" bbox=dict(boxstyle=\"round,pad=0.2\", fc=\"white\", alpha=0.7))\n",
|
|
||||||
"\n",
|
|
||||||
"ax.set_title(f\"ESA Field Boundaries ({len(gdf_utm)} fields)\", fontsize=14)\n",
|
|
||||||
"ax.set_xlabel(\"Easting (m)\")\n",
|
|
||||||
"ax.set_ylabel(\"Northing (m)\")\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "52301688",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 6. CRS Check & Comparison"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "d4c712fa",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Original CRS info\n",
|
|
||||||
"print(\"=== Original CRS ===\")\n",
|
|
||||||
"print(f\"CRS: {gdf.crs}\")\n",
|
|
||||||
"print(f\"Is geographic: {gdf.crs.is_geographic if gdf.crs else 'No CRS'}\")\n",
|
|
||||||
"print(f\"Is projected: {gdf.crs.is_projected if gdf.crs else 'No CRS'}\")\n",
|
|
||||||
"\n",
|
|
||||||
"if gdf.crs and gdf.crs.is_geographic:\n",
|
|
||||||
" # Compare area in geographic vs projected CRS\n",
|
|
||||||
" area_geographic = gdf.geometry.area # in degrees² (meaningless)\n",
|
|
||||||
" area_projected = gdf_utm.geometry.area # in m²\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\"\\n=== Area comparison ===\")\n",
|
|
||||||
" print(f\"Geographic CRS area (degrees²): meaningless for spatial analysis\")\n",
|
|
||||||
" print(f\"Projected UTM area (EPSG:{utm_epsg}):\")\n",
|
|
||||||
" for _, row in gdf_utm.iterrows():\n",
|
|
||||||
" print(f\" {row['field']}: {row['area_ha']:.2f} ha\")\n",
|
|
||||||
" print(f\"\\nAlways use projected CRS (UTM) for area/distance calculations.\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "3a21236a",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 7. Export Summary Table"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "acd4b3c7",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Save summary CSV to project data directory\n",
|
|
||||||
"output_csv = DATA_DIR / \"field_summary.csv\"\n",
|
|
||||||
"stats.to_csv(output_csv, index=False)\n",
|
|
||||||
"print(f\"Saved field summary to: {output_csv}\")\n",
|
|
||||||
"print(f\"\\nFinal summary:\")\n",
|
|
||||||
"print(f\" Project: {PROJECT}\")\n",
|
|
||||||
"print(f\" Fields: {len(stats)}\")\n",
|
|
||||||
"print(f\" Total area: {stats['area_ha'].sum():.1f} ha ({stats['area_acres'].sum():.1f} acres)\")\n",
|
|
||||||
"print(f\" CRS: {gdf.crs} (original) → EPSG:{utm_epsg} (projected)\")\n",
|
|
||||||
"print(f\" All valid: {gdf.is_valid.all()}\")\n",
|
|
||||||
"stats"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"language_info": {
|
|
||||||
"name": "python"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 5
|
|
||||||
}
|
|
||||||
|
|
@ -1,431 +0,0 @@
|
||||||
"""
|
|
||||||
01_spectral_feature_exploration.py — SC-161 Exploration
|
|
||||||
=======================================================
|
|
||||||
Explore spectral indices extracted from ESA 4-band TIFFs.
|
|
||||||
|
|
||||||
Indices (8 total from RGB+NIR):
|
|
||||||
NDVI, BSI, NDWI, CI_green, CI_red, GNDVI, SAVI, EVI2
|
|
||||||
|
|
||||||
Key questions:
|
|
||||||
1. How do indices correlate with each other? Which are redundant?
|
|
||||||
2. Does any index capture harvest signal better than CI?
|
|
||||||
3. Which indices add independent information beyond CI_green?
|
|
||||||
4. Pre/post harvest distribution shifts per index
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
cd python_app/harvest_detection_experiments/angata_improvements
|
|
||||||
python 01_spectral_feature_exploration.py
|
|
||||||
"""
|
|
||||||
|
|
||||||
import matplotlib
|
|
||||||
matplotlib.use("Agg") # Non-interactive backend — save only, no window blocking
|
|
||||||
|
|
||||||
import pandas as pd
|
|
||||||
import numpy as np
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
import matplotlib.dates as mdates
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# CONFIG
|
|
||||||
# =============================================================================
|
|
||||||
PROJECT = "esa"
|
|
||||||
REPO_ROOT = Path(__file__).resolve().parents[3]
|
|
||||||
PROJECT_STORAGE = REPO_ROOT / "laravel_app" / "storage" / "app" / PROJECT
|
|
||||||
|
|
||||||
SPECTRAL_CSV = PROJECT_STORAGE / "Data" / "extracted_ci" / "ci_data_for_python" / "spectral_indices.csv"
|
|
||||||
HARVEST_XLSX = PROJECT_STORAGE / "harvest.xlsx"
|
|
||||||
|
|
||||||
# Output directory for plots
|
|
||||||
OUTPUT_DIR = Path(__file__).parent / "plots"
|
|
||||||
OUTPUT_DIR.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# All mean index columns
|
|
||||||
INDEX_COLS = [
|
|
||||||
"mean_ndvi", "mean_bsi", "mean_ndwi", "mean_ci_green",
|
|
||||||
"mean_ci_red", "mean_gndvi", "mean_savi", "mean_evi2",
|
|
||||||
]
|
|
||||||
INDEX_LABELS = {
|
|
||||||
"mean_ndvi": "NDVI",
|
|
||||||
"mean_bsi": "BSI",
|
|
||||||
"mean_ndwi": "NDWI",
|
|
||||||
"mean_ci_green": "CI_green",
|
|
||||||
"mean_ci_red": "CI_red",
|
|
||||||
"mean_gndvi": "GNDVI",
|
|
||||||
"mean_savi": "SAVI",
|
|
||||||
"mean_evi2": "EVI2",
|
|
||||||
}
|
|
||||||
INDEX_COLORS = {
|
|
||||||
"mean_ndvi": "green",
|
|
||||||
"mean_bsi": "saddlebrown",
|
|
||||||
"mean_ndwi": "steelblue",
|
|
||||||
"mean_ci_green": "darkgreen",
|
|
||||||
"mean_ci_red": "darkred",
|
|
||||||
"mean_gndvi": "olive",
|
|
||||||
"mean_savi": "teal",
|
|
||||||
"mean_evi2": "purple",
|
|
||||||
}
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# LOAD DATA
|
|
||||||
# =============================================================================
|
|
||||||
print("=" * 80)
|
|
||||||
print("SC-161: SPECTRAL FEATURE EXPLORATION — ESA (8 indices)")
|
|
||||||
print("=" * 80)
|
|
||||||
|
|
||||||
# Spectral indices
|
|
||||||
si = pd.read_csv(SPECTRAL_CSV, parse_dates=["Date"])
|
|
||||||
si = si[si["field"] != "00F25"] # drop always-NaN field
|
|
||||||
print(f"\nSpectral indices: {len(si)} rows, {si['field'].nunique()} fields")
|
|
||||||
print(f" Date range: {si['Date'].min().date()} → {si['Date'].max().date()}")
|
|
||||||
print(f" Columns: {[c for c in si.columns if c.startswith('mean_')]}")
|
|
||||||
for col in INDEX_COLS:
|
|
||||||
if col in si.columns:
|
|
||||||
nan_pct = si[col].isna().mean() * 100
|
|
||||||
print(f" {INDEX_LABELS[col]:10s} NaN: {nan_pct:.1f}% mean: {si[col].mean():.4f} range: [{si[col].min():.4f}, {si[col].max():.4f}]")
|
|
||||||
|
|
||||||
# Harvest data
|
|
||||||
harvest = pd.read_excel(HARVEST_XLSX)
|
|
||||||
harvest["season_start"] = pd.to_datetime(harvest["season_start"], errors="coerce")
|
|
||||||
harvest["season_end"] = pd.to_datetime(harvest["season_end"], errors="coerce")
|
|
||||||
|
|
||||||
# Filter to fields we have spectral data for
|
|
||||||
our_fields = set(si["field"].unique())
|
|
||||||
harvest = harvest[harvest["field"].isin(our_fields)].copy()
|
|
||||||
# Only keep seasons with an actual end date (= confirmed harvest)
|
|
||||||
harvest_events = harvest.dropna(subset=["season_end"]).copy()
|
|
||||||
print(f"\nHarvest events (confirmed): {len(harvest_events)} across {harvest_events['field'].nunique()} fields")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 1. FULL CORRELATION MATRIX — all 8 indices
|
|
||||||
# =============================================================================
|
|
||||||
print("\n--- Full correlation matrix (8 indices) ---")
|
|
||||||
|
|
||||||
available_cols = [c for c in INDEX_COLS if c in si.columns]
|
|
||||||
valid = si.dropna(subset=available_cols)
|
|
||||||
corr_matrix = valid[available_cols].corr()
|
|
||||||
|
|
||||||
# Print in readable format
|
|
||||||
labels = [INDEX_LABELS[c] for c in available_cols]
|
|
||||||
print(f"\n{'':>12s}", " ".join(f"{l:>10s}" for l in labels))
|
|
||||||
for i, (col, label) in enumerate(zip(available_cols, labels)):
|
|
||||||
vals = " ".join(f"{corr_matrix.iloc[i, j]:>10.3f}" for j in range(len(available_cols)))
|
|
||||||
print(f"{label:>12s} {vals}")
|
|
||||||
|
|
||||||
# Heatmap
|
|
||||||
fig, ax = plt.subplots(figsize=(10, 8))
|
|
||||||
im = ax.imshow(corr_matrix.values, cmap="RdBu_r", vmin=-1, vmax=1, aspect="equal")
|
|
||||||
ax.set_xticks(range(len(labels)))
|
|
||||||
ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=10)
|
|
||||||
ax.set_yticks(range(len(labels)))
|
|
||||||
ax.set_yticklabels(labels, fontsize=10)
|
|
||||||
ax.set_title("Spectral Index Correlation Matrix (ESA)", fontsize=14)
|
|
||||||
|
|
||||||
# Annotate cells
|
|
||||||
for i in range(len(labels)):
|
|
||||||
for j in range(len(labels)):
|
|
||||||
val = corr_matrix.iloc[i, j]
|
|
||||||
txt_color = "white" if abs(val) > 0.7 else "black"
|
|
||||||
ax.text(j, i, f"{val:.2f}", ha="center", va="center", fontsize=9, color=txt_color)
|
|
||||||
|
|
||||||
plt.colorbar(im, ax=ax, label="Pearson r", shrink=0.8)
|
|
||||||
plt.tight_layout()
|
|
||||||
fig.savefig(OUTPUT_DIR / "correlation_matrix_all_indices.png", dpi=150, bbox_inches="tight")
|
|
||||||
plt.close()
|
|
||||||
print(f" → Saved correlation_matrix_all_indices.png")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 2. PER-FIELD CORRELATION — CI_green vs each index
|
|
||||||
# =============================================================================
|
|
||||||
print("\n--- Per-field correlation: CI_green vs each index ---")
|
|
||||||
|
|
||||||
ci_col = "mean_ci_green"
|
|
||||||
other_cols = [c for c in available_cols if c != ci_col]
|
|
||||||
|
|
||||||
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
|
|
||||||
axes_flat = axes.flatten()
|
|
||||||
|
|
||||||
for i, col in enumerate(other_cols):
|
|
||||||
ax = axes_flat[i]
|
|
||||||
label = INDEX_LABELS[col]
|
|
||||||
|
|
||||||
field_corrs = valid.groupby("field").apply(
|
|
||||||
lambda g: g[ci_col].corr(g[col]) if len(g) > 30 else np.nan
|
|
||||||
).dropna()
|
|
||||||
|
|
||||||
ax.hist(field_corrs, bins=20, color=INDEX_COLORS.get(col, "gray"), edgecolor="white", alpha=0.8)
|
|
||||||
ax.axvline(field_corrs.mean(), color="red", linestyle="--", linewidth=2,
|
|
||||||
label=f"mean={field_corrs.mean():.3f}")
|
|
||||||
ax.set_xlabel(f"r(CI_green, {label})")
|
|
||||||
ax.set_ylabel("Field count")
|
|
||||||
ax.set_title(f"CI_green vs {label}")
|
|
||||||
ax.legend(fontsize=8)
|
|
||||||
ax.set_xlim(-1, 1)
|
|
||||||
|
|
||||||
print(f" CI_green vs {label:8s}: mean r = {field_corrs.mean():.4f}, std = {field_corrs.std():.4f}")
|
|
||||||
|
|
||||||
# Remove unused subplot
|
|
||||||
if len(other_cols) < len(axes_flat):
|
|
||||||
for j in range(len(other_cols), len(axes_flat)):
|
|
||||||
axes_flat[j].set_visible(False)
|
|
||||||
|
|
||||||
fig.suptitle("Per-field correlation: CI_green vs each index", fontsize=14, fontweight="bold")
|
|
||||||
plt.tight_layout()
|
|
||||||
fig.savefig(OUTPUT_DIR / "ci_green_vs_all_per_field_corr.png", dpi=150, bbox_inches="tight")
|
|
||||||
plt.close()
|
|
||||||
print(f" → Saved ci_green_vs_all_per_field_corr.png")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 3. PRE/POST HARVEST SHIFTS — all indices
|
|
||||||
# =============================================================================
|
|
||||||
print("\n--- Pre/post harvest distribution shifts (all indices) ---")
|
|
||||||
|
|
||||||
WINDOW_DAYS = 30
|
|
||||||
|
|
||||||
shift_results = {}
|
|
||||||
|
|
||||||
for col in available_cols:
|
|
||||||
label = INDEX_LABELS[col]
|
|
||||||
pre_vals = []
|
|
||||||
post_vals = []
|
|
||||||
|
|
||||||
for _, h in harvest_events.iterrows():
|
|
||||||
field_data = si[si["field"] == h["field"]].copy()
|
|
||||||
if field_data.empty:
|
|
||||||
continue
|
|
||||||
harvest_date = h["season_end"]
|
|
||||||
|
|
||||||
pre = field_data[
|
|
||||||
(field_data["Date"] >= harvest_date - pd.Timedelta(days=WINDOW_DAYS))
|
|
||||||
& (field_data["Date"] < harvest_date)
|
|
||||||
][col].dropna()
|
|
||||||
pre_vals.extend(pre.tolist())
|
|
||||||
|
|
||||||
post = field_data[
|
|
||||||
(field_data["Date"] > harvest_date)
|
|
||||||
& (field_data["Date"] <= harvest_date + pd.Timedelta(days=WINDOW_DAYS))
|
|
||||||
][col].dropna()
|
|
||||||
post_vals.extend(post.tolist())
|
|
||||||
|
|
||||||
pre_arr = np.array(pre_vals)
|
|
||||||
post_arr = np.array(post_vals)
|
|
||||||
delta = post_arr.mean() - pre_arr.mean()
|
|
||||||
# Effect size (Cohen's d)
|
|
||||||
pooled_std = np.sqrt((pre_arr.std() ** 2 + post_arr.std() ** 2) / 2)
|
|
||||||
cohens_d = delta / pooled_std if pooled_std > 0 else 0
|
|
||||||
|
|
||||||
shift_results[col] = {
|
|
||||||
"label": label,
|
|
||||||
"pre_mean": pre_arr.mean(),
|
|
||||||
"post_mean": post_arr.mean(),
|
|
||||||
"delta": delta,
|
|
||||||
"abs_delta": abs(delta),
|
|
||||||
"cohens_d": cohens_d,
|
|
||||||
"pre_n": len(pre_arr),
|
|
||||||
"post_n": len(post_arr),
|
|
||||||
}
|
|
||||||
|
|
||||||
# Print sorted by absolute Cohen's d (effect size)
|
|
||||||
print(f"\n{'Index':>12s} {'Pre':>8s} {'Post':>8s} {'Delta':>8s} {'Cohen d':>8s} {'n_pre':>6s} {'n_post':>6s}")
|
|
||||||
for col in sorted(shift_results.keys(), key=lambda c: abs(shift_results[c]["cohens_d"]), reverse=True):
|
|
||||||
r = shift_results[col]
|
|
||||||
print(f"{r['label']:>12s} {r['pre_mean']:>8.4f} {r['post_mean']:>8.4f} {r['delta']:>+8.4f} {r['cohens_d']:>+8.3f} {r['pre_n']:>6d} {r['post_n']:>6d}")
|
|
||||||
|
|
||||||
# Bar chart of effect sizes
|
|
||||||
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
|
|
||||||
|
|
||||||
sorted_cols = sorted(shift_results.keys(), key=lambda c: abs(shift_results[c]["cohens_d"]), reverse=True)
|
|
||||||
labels_sorted = [shift_results[c]["label"] for c in sorted_cols]
|
|
||||||
deltas = [shift_results[c]["delta"] for c in sorted_cols]
|
|
||||||
cohens_ds = [shift_results[c]["cohens_d"] for c in sorted_cols]
|
|
||||||
colors = [INDEX_COLORS.get(c, "gray") for c in sorted_cols]
|
|
||||||
|
|
||||||
# Raw delta
|
|
||||||
ax1.barh(labels_sorted, deltas, color=colors, alpha=0.8, edgecolor="white")
|
|
||||||
ax1.set_xlabel("Post - Pre harvest mean")
|
|
||||||
ax1.set_title("Raw shift at harvest (±30 days)")
|
|
||||||
ax1.axvline(0, color="black", linewidth=0.5)
|
|
||||||
ax1.grid(True, alpha=0.3, axis="x")
|
|
||||||
|
|
||||||
# Cohen's d
|
|
||||||
ax2.barh(labels_sorted, [abs(d) for d in cohens_ds], color=colors, alpha=0.8, edgecolor="white")
|
|
||||||
ax2.set_xlabel("|Cohen's d| (effect size)")
|
|
||||||
ax2.set_title("Harvest signal strength (effect size)")
|
|
||||||
ax2.axvline(0.2, color="gray", linestyle=":", label="small (0.2)")
|
|
||||||
ax2.axvline(0.5, color="gray", linestyle="--", label="medium (0.5)")
|
|
||||||
ax2.axvline(0.8, color="gray", linestyle="-", label="large (0.8)")
|
|
||||||
ax2.legend(fontsize=8)
|
|
||||||
ax2.grid(True, alpha=0.3, axis="x")
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
fig.savefig(OUTPUT_DIR / "harvest_signal_all_indices.png", dpi=150, bbox_inches="tight")
|
|
||||||
plt.close()
|
|
||||||
print(f" → Saved harvest_signal_all_indices.png")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 4. PRE/POST DISTRIBUTIONS — top 4 by harvest signal
|
|
||||||
# =============================================================================
|
|
||||||
print("\n--- Pre/post harvest distributions (top 4 indices by effect size) ---")
|
|
||||||
|
|
||||||
top4_cols = sorted(shift_results.keys(), key=lambda c: abs(shift_results[c]["cohens_d"]), reverse=True)[:4]
|
|
||||||
|
|
||||||
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
|
|
||||||
|
|
||||||
for ax, col in zip(axes, top4_cols):
|
|
||||||
label = INDEX_LABELS[col]
|
|
||||||
pre_vals = []
|
|
||||||
post_vals = []
|
|
||||||
for _, h in harvest_events.iterrows():
|
|
||||||
field_data = si[si["field"] == h["field"]].copy()
|
|
||||||
if field_data.empty:
|
|
||||||
continue
|
|
||||||
harvest_date = h["season_end"]
|
|
||||||
pre = field_data[
|
|
||||||
(field_data["Date"] >= harvest_date - pd.Timedelta(days=WINDOW_DAYS))
|
|
||||||
& (field_data["Date"] < harvest_date)
|
|
||||||
][col].dropna()
|
|
||||||
pre_vals.extend(pre.tolist())
|
|
||||||
post = field_data[
|
|
||||||
(field_data["Date"] > harvest_date)
|
|
||||||
& (field_data["Date"] <= harvest_date + pd.Timedelta(days=WINDOW_DAYS))
|
|
||||||
][col].dropna()
|
|
||||||
post_vals.extend(post.tolist())
|
|
||||||
|
|
||||||
ax.hist(pre_vals, bins=40, alpha=0.6, color="green", label=f"Pre (n={len(pre_vals)})", density=True)
|
|
||||||
ax.hist(post_vals, bins=40, alpha=0.6, color="brown", label=f"Post (n={len(post_vals)})", density=True)
|
|
||||||
d = shift_results[col]["cohens_d"]
|
|
||||||
ax.set_title(f"{label} (d={d:+.2f})")
|
|
||||||
ax.set_xlabel(label)
|
|
||||||
ax.set_ylabel("Density")
|
|
||||||
ax.legend(fontsize=8)
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
fig.suptitle(f"Pre/Post Harvest ±{WINDOW_DAYS}d — Top 4 by effect size", fontsize=13, fontweight="bold")
|
|
||||||
plt.tight_layout()
|
|
||||||
fig.savefig(OUTPUT_DIR / "pre_post_harvest_top4.png", dpi=150, bbox_inches="tight")
|
|
||||||
plt.close()
|
|
||||||
print(f" → Saved pre_post_harvest_top4.png")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 5. TIMESERIES — top field, all 8 indices
|
|
||||||
# =============================================================================
|
|
||||||
print("\n--- Timeseries: all indices for top harvest-rich fields ---")
|
|
||||||
|
|
||||||
top_fields = harvest_events.groupby("field").size().sort_values(ascending=False).head(3).index.tolist()
|
|
||||||
print(f"Top fields: {top_fields}")
|
|
||||||
|
|
||||||
for field_id in top_fields:
|
|
||||||
field_data = si[si["field"] == field_id].sort_values("Date").copy()
|
|
||||||
field_harvests = harvest_events[harvest_events["field"] == field_id]
|
|
||||||
|
|
||||||
fig, axes = plt.subplots(4, 2, figsize=(20, 16), sharex=True)
|
|
||||||
fig.suptitle(f"Field {field_id} — All Spectral Indices", fontsize=14, fontweight="bold")
|
|
||||||
|
|
||||||
for ax, col in zip(axes.flatten(), available_cols):
|
|
||||||
label = INDEX_LABELS[col]
|
|
||||||
color = INDEX_COLORS.get(col, "gray")
|
|
||||||
vals = field_data.dropna(subset=[col])
|
|
||||||
|
|
||||||
ax.plot(vals["Date"], vals[col], color=color, linewidth=0.6, alpha=0.7)
|
|
||||||
|
|
||||||
# 14-day rolling mean
|
|
||||||
if len(vals) > 14:
|
|
||||||
rolling = vals.set_index("Date")[col].rolling("14D", min_periods=3).mean()
|
|
||||||
ax.plot(rolling.index, rolling.values, color=color, linewidth=2, alpha=0.8)
|
|
||||||
|
|
||||||
# Harvest dates
|
|
||||||
for _, h in field_harvests.iterrows():
|
|
||||||
ax.axvline(h["season_end"], color="red", linestyle="--", alpha=0.6, linewidth=1)
|
|
||||||
|
|
||||||
ax.set_ylabel(label, fontsize=10, fontweight="bold")
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
axes[-1, 0].xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
|
|
||||||
axes[-1, 1].xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
fig.savefig(OUTPUT_DIR / f"all_indices_{field_id}.png", dpi=150, bbox_inches="tight")
|
|
||||||
plt.close()
|
|
||||||
print(f" → Saved all_indices_{field_id}.png")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 6. REDUNDANCY ANALYSIS — which indices add information beyond CI_green?
|
|
||||||
# =============================================================================
|
|
||||||
print("\n--- Redundancy analysis: residual variance after regressing vs CI_green ---")
|
|
||||||
|
|
||||||
from numpy.polynomial import polynomial as P
|
|
||||||
|
|
||||||
residual_variance = {}
|
|
||||||
|
|
||||||
for col in [c for c in available_cols if c != ci_col]:
|
|
||||||
label = INDEX_LABELS[col]
|
|
||||||
pair = valid[[ci_col, col]].dropna()
|
|
||||||
if len(pair) < 100:
|
|
||||||
continue
|
|
||||||
|
|
||||||
x, y = pair[ci_col].values, pair[col].values
|
|
||||||
# Fit linear regression CI_green → index
|
|
||||||
coeffs = P.polyfit(x, y, deg=1)
|
|
||||||
y_pred = P.polyval(x, coeffs)
|
|
||||||
residuals = y - y_pred
|
|
||||||
total_var = np.var(y)
|
|
||||||
resid_var = np.var(residuals)
|
|
||||||
r_squared = 1 - resid_var / total_var
|
|
||||||
|
|
||||||
residual_variance[col] = {
|
|
||||||
"label": label,
|
|
||||||
"r_squared": r_squared,
|
|
||||||
"residual_pct": (1 - r_squared) * 100,
|
|
||||||
}
|
|
||||||
|
|
||||||
print(f"\n{'Index':>12s} {'R² vs CI':>10s} {'Residual %':>12s} {'Interpretation'}")
|
|
||||||
for col in sorted(residual_variance.keys(), key=lambda c: residual_variance[c]["residual_pct"], reverse=True):
|
|
||||||
r = residual_variance[col]
|
|
||||||
interp = "UNIQUE signal" if r["residual_pct"] > 20 else "Some unique" if r["residual_pct"] > 5 else "Redundant"
|
|
||||||
print(f"{r['label']:>12s} {r['r_squared']:>10.4f} {r['residual_pct']:>11.1f}% {interp}")
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# 7. SUMMARY
|
|
||||||
# =============================================================================
|
|
||||||
print("\n" + "=" * 80)
|
|
||||||
print("SUMMARY — SC-161 Expanded Index Analysis")
|
|
||||||
print("=" * 80)
|
|
||||||
|
|
||||||
# Best harvest signal
|
|
||||||
best_col = sorted(shift_results.keys(), key=lambda c: abs(shift_results[c]["cohens_d"]), reverse=True)[0]
|
|
||||||
best = shift_results[best_col]
|
|
||||||
|
|
||||||
# Most unique signal (highest residual variance vs CI_green)
|
|
||||||
most_unique = sorted(residual_variance.keys(), key=lambda c: residual_variance[c]["residual_pct"], reverse=True)
|
|
||||||
|
|
||||||
print(f"""
|
|
||||||
INDICES COMPUTED: {', '.join(INDEX_LABELS[c] for c in available_cols)}
|
|
||||||
DATA: {si['field'].nunique()} fields, {si['Date'].nunique()} dates
|
|
||||||
|
|
||||||
HARVEST SIGNAL (strongest → weakest by Cohen's d):
|
|
||||||
""")
|
|
||||||
for col in sorted(shift_results.keys(), key=lambda c: abs(shift_results[c]["cohens_d"]), reverse=True):
|
|
||||||
r = shift_results[col]
|
|
||||||
print(f" {r['label']:>12s}: Cohen's d = {r['cohens_d']:+.3f} (Δ = {r['delta']:+.4f})")
|
|
||||||
|
|
||||||
print(f"""
|
|
||||||
REDUNDANCY vs CI_green (unique information):
|
|
||||||
""")
|
|
||||||
for col in most_unique:
|
|
||||||
r = residual_variance[col]
|
|
||||||
print(f" {r['label']:>12s}: {r['residual_pct']:.1f}% unique variance (R² = {r['r_squared']:.3f})")
|
|
||||||
|
|
||||||
print(f"""
|
|
||||||
RECOMMENDATION:
|
|
||||||
Best harvest signal: {best['label']} (d={best['cohens_d']:+.3f})
|
|
||||||
Most unique vs CI: {residual_variance[most_unique[0]]['label']} ({residual_variance[most_unique[0]]['residual_pct']:.1f}% independent)
|
|
||||||
|
|
||||||
Plots saved to: {OUTPUT_DIR}
|
|
||||||
""")
|
|
||||||
|
|
@ -1,465 +0,0 @@
|
||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e9cdc8ee",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"# SC-162: Season-Length Normalization Analysis\n",
|
|
||||||
"## Compare ESA vs Angata Spectral Indices (CI_green, SAVI)\n",
|
|
||||||
"\n",
|
|
||||||
"Multi-year data analysis focusing on:\n",
|
|
||||||
"- **Peak timing** (at what season-age % do peaks occur?)\n",
|
|
||||||
"- **Amplitude** (how high/low?)\n",
|
|
||||||
"- **Phase** (lag between indices)\n",
|
|
||||||
"- **Shape** (growth curve pattern)\n",
|
|
||||||
"\n",
|
|
||||||
"Note: Cannot compare absolute DOY across multiple years, but CAN compare normalized season age (0-1 scale) and pattern shapes."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "59943da7",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"import pandas as pd\n",
|
|
||||||
"import numpy as np\n",
|
|
||||||
"import matplotlib.pyplot as plt\n",
|
|
||||||
"import matplotlib.dates as mdates\n",
|
|
||||||
"from pathlib import Path\n",
|
|
||||||
"from scipy import stats\n",
|
|
||||||
"import warnings\n",
|
|
||||||
"warnings.filterwarnings('ignore')\n",
|
|
||||||
"\n",
|
|
||||||
"# Configuration\n",
|
|
||||||
"REPO_ROOT = Path.cwd().parent.parent.parent.parent\n",
|
|
||||||
"print(f\"Repo root: {REPO_ROOT}\")\n",
|
|
||||||
"\n",
|
|
||||||
"# Project-specific config\n",
|
|
||||||
"PROJECTS = {\n",
|
|
||||||
" 'esa': {\n",
|
|
||||||
" 'expected_season_length': 365,\n",
|
|
||||||
" 'storage_path': REPO_ROOT / 'laravel_app' / 'storage' / 'app' / 'esa',\n",
|
|
||||||
" },\n",
|
|
||||||
" 'angata': {\n",
|
|
||||||
" 'expected_season_length': 540,\n",
|
|
||||||
" 'storage_path': REPO_ROOT / 'laravel_app' / 'storage' / 'app' / 'angata',\n",
|
|
||||||
" }\n",
|
|
||||||
"}\n",
|
|
||||||
"\n",
|
|
||||||
"# Load data for both projects\n",
|
|
||||||
"data = {}\n",
|
|
||||||
"harvest_data = {}\n",
|
|
||||||
"\n",
|
|
||||||
"for project, config in PROJECTS.items():\n",
|
|
||||||
" print(f\"\\n{'='*60}\")\n",
|
|
||||||
" print(f\"Loading {project.upper()}\")\n",
|
|
||||||
" print(f\"{'='*60}\")\n",
|
|
||||||
" \n",
|
|
||||||
" # Spectral indices CSV\n",
|
|
||||||
" csv_path = config['storage_path'] / 'Data' / 'extracted_ci' / 'ci_data_for_python' / 'spectral_indices.csv'\n",
|
|
||||||
" df = pd.read_csv(csv_path, parse_dates=['Date'])\n",
|
|
||||||
" df = df[df['field'] != '00F25'] # Remove always-NaN field if present\n",
|
|
||||||
" print(f\"Spectral data: {len(df)} rows, {df['field'].nunique()} fields\")\n",
|
|
||||||
" print(f\"Date range: {df['Date'].min().date()} → {df['Date'].max().date()}\")\n",
|
|
||||||
" \n",
|
|
||||||
" # Harvest dates\n",
|
|
||||||
" harvest_path = config['storage_path'] / 'Data' / 'harvest.xlsx'\n",
|
|
||||||
" harvest_df = pd.read_excel(harvest_path)\n",
|
|
||||||
" harvest_df['season_start'] = pd.to_datetime(harvest_df['season_start'], errors='coerce')\n",
|
|
||||||
" harvest_df['season_end'] = pd.to_datetime(harvest_df['season_end'], errors='coerce')\n",
|
|
||||||
" \n",
|
|
||||||
" # Filter to fields we have spectral data for\n",
|
|
||||||
" our_fields = set(df['field'].unique())\n",
|
|
||||||
" harvest_df = harvest_df[harvest_df['field'].isin(our_fields)].copy()\n",
|
|
||||||
" harvest_events = harvest_df.dropna(subset=['season_end']).copy()\n",
|
|
||||||
" print(f\"Harvest events: {len(harvest_events)} across {harvest_events['field'].nunique()} fields\")\n",
|
|
||||||
" \n",
|
|
||||||
" data[project] = df\n",
|
|
||||||
" harvest_data[project] = harvest_events"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "260b405b",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Step 1: Compute Season Age for Each Observation\n",
|
|
||||||
"\n",
|
|
||||||
"For each field-date pair, calculate days since season start and normalize to 0-1 scale."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "55a0f94f",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def compute_season_age(spec_df, harvest_df, project_season_length):\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" Compute normalized season age (0-1) for each row.\n",
|
|
||||||
" \n",
|
|
||||||
" For each field, use the earliest season_start from harvest data as the baseline.\n",
|
|
||||||
" If no season_start available, use the earliest date in spectral data.\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" spec_copy = spec_df.copy()\n",
|
|
||||||
" spec_copy['season_age'] = np.nan\n",
|
|
||||||
" spec_copy['days_since_start'] = np.nan\n",
|
|
||||||
" \n",
|
|
||||||
" for field_id in spec_copy['field'].unique():\n",
|
|
||||||
" # Get season start for this field\n",
|
|
||||||
" field_harvest = harvest_df[harvest_df['field'] == field_id]\n",
|
|
||||||
" \n",
|
|
||||||
" if not field_harvest.empty and not field_harvest['season_start'].isna().all():\n",
|
|
||||||
" season_start = field_harvest['season_start'].min()\n",
|
|
||||||
" else:\n",
|
|
||||||
" # Fallback: use first observation date\n",
|
|
||||||
" field_data = spec_copy[spec_copy['field'] == field_id]\n",
|
|
||||||
" season_start = field_data['Date'].min()\n",
|
|
||||||
" \n",
|
|
||||||
" # Compute for all rows of this field\n",
|
|
||||||
" field_mask = spec_copy['field'] == field_id\n",
|
|
||||||
" days_since = (spec_copy.loc[field_mask, 'Date'] - season_start).dt.days\n",
|
|
||||||
" spec_copy.loc[field_mask, 'days_since_start'] = days_since\n",
|
|
||||||
" spec_copy.loc[field_mask, 'season_age'] = days_since / project_season_length\n",
|
|
||||||
" \n",
|
|
||||||
" return spec_copy\n",
|
|
||||||
"\n",
|
|
||||||
"# Compute for both projects\n",
|
|
||||||
"for project, config in PROJECTS.items():\n",
|
|
||||||
" data[project] = compute_season_age(\n",
|
|
||||||
" data[project],\n",
|
|
||||||
" harvest_data[project],\n",
|
|
||||||
" config['expected_season_length']\n",
|
|
||||||
" )\n",
|
|
||||||
" print(f\"{project.upper()} season_age range: [{data[project]['season_age'].min():.3f}, {data[project]['season_age'].max():.3f}]\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "b946c36c",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Step 2: Plot CI_green + SAVI Trajectories by Project\n",
|
|
||||||
"\n",
|
|
||||||
"Multi-year overlay to compare:\n",
|
|
||||||
"- Peak timing (at what season-age %?)\n",
|
|
||||||
"- Amplitude (range of values)\n",
|
|
||||||
"- Phase (timing relative to each other)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "c8de5a03",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Create figure: 2 projects × 2 indices = 4 subplots\n",
|
|
||||||
"fig, axes = plt.subplots(2, 2, figsize=(18, 12))\n",
|
|
||||||
"\n",
|
|
||||||
"indices = ['mean_ci_green', 'mean_savi']\n",
|
|
||||||
"index_labels = {'mean_ci_green': 'CI_green (Chlorophyll)', 'mean_savi': 'SAVI (Soil-Adjusted)'}\n",
|
|
||||||
"\n",
|
|
||||||
"for row, project in enumerate(['esa', 'angata']):\n",
|
|
||||||
" df = data[project]\n",
|
|
||||||
" season_len = PROJECTS[project]['expected_season_length']\n",
|
|
||||||
" \n",
|
|
||||||
" for col, idx in enumerate(indices):\n",
|
|
||||||
" ax = axes[row, col]\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot all field timeseries (normalized season age)\n",
|
|
||||||
" for field in df['field'].unique():\n",
|
|
||||||
" field_data = df[df['field'] == field].dropna(subset=[idx, 'season_age']).sort_values('season_age')\n",
|
|
||||||
" ax.plot(field_data['season_age'], field_data[idx], alpha=0.15, color='steelblue', linewidth=0.8)\n",
|
|
||||||
" \n",
|
|
||||||
" # Overlay mean trajectory with rolling average\n",
|
|
||||||
" binned = df.dropna(subset=[idx, 'season_age']).copy()\n",
|
|
||||||
" binned['season_age_bin'] = pd.cut(binned['season_age'], bins=20)\n",
|
|
||||||
" mean_trajectory = binned.groupby('season_age_bin')[idx].mean()\n",
|
|
||||||
" bin_centers = [interval.mid for interval in mean_trajectory.index]\n",
|
|
||||||
" ax.plot(bin_centers, mean_trajectory.values, color='red', linewidth=3, label='Average', zorder=10)\n",
|
|
||||||
" \n",
|
|
||||||
" ax.set_xlabel('Season Age (0 = start, 1 = expected end)', fontsize=11)\n",
|
|
||||||
" ax.set_ylabel(index_labels[idx], fontsize=11)\n",
|
|
||||||
" ax.set_title(f\"{project.upper()} - {index_labels[idx]}\\n(Season length: {season_len}d)\", fontsize=12, fontweight='bold')\n",
|
|
||||||
" ax.set_xlim(0, 1.2) # Allow slight overgrowth\n",
|
|
||||||
" ax.grid(True, alpha=0.3)\n",
|
|
||||||
" ax.legend(fontsize=9)\n",
|
|
||||||
"\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.savefig('season_normalization_trajectories.png', dpi=150, bbox_inches='tight')\n",
|
|
||||||
"plt.show()\n",
|
|
||||||
"print(\"\\nSaved: season_normalization_trajectories.png\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "f60a0dc5",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Step 3: Peak Analysis - Where do maxima occur?"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "6e149843",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def find_peaks_per_field(df, index_col, season_age_col='season_age'):\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" For each field, find the peak (maximum) in the normalized season age.\n",
|
|
||||||
" Return statistics about where peaks occur.\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" peaks = []\n",
|
|
||||||
" \n",
|
|
||||||
" for field in df['field'].unique():\n",
|
|
||||||
" field_data = df[df['field'] == field].dropna(subset=[index_col, season_age_col]).sort_values(season_age_col)\n",
|
|
||||||
" \n",
|
|
||||||
" if len(field_data) > 10: # Need enough data\n",
|
|
||||||
" max_idx = field_data[index_col].idxmax()\n",
|
|
||||||
" peak_season_age = field_data.loc[max_idx, season_age_col]\n",
|
|
||||||
" peak_value = field_data.loc[max_idx, index_col]\n",
|
|
||||||
" peaks.append({\n",
|
|
||||||
" 'field': field,\n",
|
|
||||||
" 'peak_season_age': peak_season_age,\n",
|
|
||||||
" 'peak_value': peak_value\n",
|
|
||||||
" })\n",
|
|
||||||
" \n",
|
|
||||||
" return pd.DataFrame(peaks)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\n\" + \"=\"*70)\n",
|
|
||||||
"print(\"PEAK TIMING ANALYSIS (at what season-age do indices peak?)\")\n",
|
|
||||||
"print(\"=\"*70)\n",
|
|
||||||
"\n",
|
|
||||||
"for project in ['esa', 'angata']:\n",
|
|
||||||
" print(f\"\\n{project.upper()}:\")\n",
|
|
||||||
" print(\"-\" * 70)\n",
|
|
||||||
" \n",
|
|
||||||
" for idx, label in [('mean_ci_green', 'CI_green'), ('mean_savi', 'SAVI')]:\n",
|
|
||||||
" peaks_df = find_peaks_per_field(data[project], idx, 'season_age')\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\"\\n {label}:\")\n",
|
|
||||||
" print(f\" Fields analyzed: {len(peaks_df)}\")\n",
|
|
||||||
" print(f\" Peak season_age: mean={peaks_df['peak_season_age'].mean():.3f}, std={peaks_df['peak_season_age'].std():.3f}\")\n",
|
|
||||||
" print(f\" Peak occurs at: {peaks_df['peak_season_age'].mean()*100:.1f}% through season\")\n",
|
|
||||||
" print(f\" Peak value range: [{peaks_df['peak_value'].min():.4f}, {peaks_df['peak_value'].max():.4f}]\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "1d6e97e4",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Step 4: Amplitude Comparison - How strong are the signals?"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "75c72340",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(\"\\n\" + \"=\"*70)\n",
|
|
||||||
"print(\"AMPLITUDE ANALYSIS (value ranges per project)\")\n",
|
|
||||||
"print(\"=\"*70)\n",
|
|
||||||
"\n",
|
|
||||||
"for project in ['esa', 'angata']:\n",
|
|
||||||
" print(f\"\\n{project.upper()}:\")\n",
|
|
||||||
" print(\"-\" * 70)\n",
|
|
||||||
" \n",
|
|
||||||
" df = data[project]\n",
|
|
||||||
" \n",
|
|
||||||
" for idx, label in [('mean_ci_green', 'CI_green'), ('mean_savi', 'SAVI')]:\n",
|
|
||||||
" valid = df[idx].dropna()\n",
|
|
||||||
" amplitude = valid.max() - valid.min()\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\"\\n {label}:\")\n",
|
|
||||||
" print(f\" Mean: {valid.mean():.4f}\")\n",
|
|
||||||
" print(f\" Std: {valid.std():.4f}\")\n",
|
|
||||||
" print(f\" Min: {valid.min():.4f}\")\n",
|
|
||||||
" print(f\" Max: {valid.max():.4f}\")\n",
|
|
||||||
" print(f\" Amplitude (max-min): {amplitude:.4f}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "8c2e3fc8",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Step 5: Phase Analysis - Lead/lag between CI_green and SAVI"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "867d39e9",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def compute_cross_correlation(df, idx1, idx2, max_lag=0.2):\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" Compute cross-correlation between two indices to find phase lag.\n",
|
|
||||||
" \"\"\"\n",
|
|
||||||
" # Aggregate by season_age bins\n",
|
|
||||||
" binned = df.dropna(subset=[idx1, idx2, 'season_age']).copy()\n",
|
|
||||||
" binned['bin'] = pd.cut(binned['season_age'], bins=30)\n",
|
|
||||||
" \n",
|
|
||||||
" agg = binned.groupby('bin')[[idx1, idx2]].mean()\n",
|
|
||||||
" agg = agg.dropna()\n",
|
|
||||||
" \n",
|
|
||||||
" if len(agg) < 10:\n",
|
|
||||||
" return np.nan, np.nan\n",
|
|
||||||
" \n",
|
|
||||||
" # Normalize\n",
|
|
||||||
" x = (agg[idx1] - agg[idx1].mean()) / agg[idx1].std()\n",
|
|
||||||
" y = (agg[idx2] - agg[idx2].mean()) / agg[idx2].std()\n",
|
|
||||||
" \n",
|
|
||||||
" # Cross-correlation\n",
|
|
||||||
" corr = np.correlate(x.values, y.values, mode='full')\n",
|
|
||||||
" lags = np.arange(-len(x)+1, len(x))\n",
|
|
||||||
" \n",
|
|
||||||
" best_lag_idx = np.argmax(np.abs(corr))\n",
|
|
||||||
" best_lag = lags[best_lag_idx]\n",
|
|
||||||
" max_corr = corr[best_lag_idx] / (len(x) - 1)\n",
|
|
||||||
" \n",
|
|
||||||
" return best_lag, max_corr\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\n\" + \"=\"*70)\n",
|
|
||||||
"print(\"PHASE ANALYSIS (CI_green vs SAVI lag)\")\n",
|
|
||||||
"print(\"=\"*70)\n",
|
|
||||||
"\n",
|
|
||||||
"for project in ['esa', 'angata']:\n",
|
|
||||||
" df = data[project]\n",
|
|
||||||
" lag, corr = compute_cross_correlation(df, 'mean_ci_green', 'mean_savi')\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\"\\n{project.upper()}:\")\n",
|
|
||||||
" if not np.isnan(lag):\n",
|
|
||||||
" print(f\" CI_green → SAVI lag: {lag:.2f} bins (~{(lag/30):.3f} of season)\")\n",
|
|
||||||
" print(f\" Max cross-correlation: {corr:.3f}\")\n",
|
|
||||||
" print(f\" Interpretation: {'SAVI slightly leads CI_green' if lag < 0 else 'CI_green slightly leads SAVI' if lag > 0 else 'Synchronized'}\")\n",
|
|
||||||
" else:\n",
|
|
||||||
" print(f\" Insufficient data for lag analysis\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "5277beb9",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Step 6: Growth Curve Shape - Do they follow similar patterns?"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "24685c7d",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Compare growth curves directly: normalized to 0-1 on both axes\n",
|
|
||||||
"fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
|
|
||||||
"\n",
|
|
||||||
"for col, idx in enumerate(['mean_ci_green', 'mean_savi']):\n",
|
|
||||||
" ax = axes[col]\n",
|
|
||||||
" \n",
|
|
||||||
" for project, color in [('esa', 'blue'), ('angata', 'orange')]:\n",
|
|
||||||
" df = data[project]\n",
|
|
||||||
" binned = df.dropna(subset=[idx, 'season_age']).copy()\n",
|
|
||||||
" binned['bin'] = pd.cut(binned['season_age'], bins=25)\n",
|
|
||||||
" \n",
|
|
||||||
" # Mean trajectory per bin\n",
|
|
||||||
" traj = binned.groupby('bin')[idx].agg(['mean', 'std', 'count'])\n",
|
|
||||||
" traj = traj[traj['count'] > 5] # Only bins with N>5\n",
|
|
||||||
" bin_centers = [interval.mid for interval in traj.index]\n",
|
|
||||||
" \n",
|
|
||||||
" # Normalize to 0-1 for visual comparison\n",
|
|
||||||
" normalized_vals = (traj['mean'] - traj['mean'].min()) / (traj['mean'].max() - traj['mean'].min())\n",
|
|
||||||
" \n",
|
|
||||||
" ax.plot(bin_centers, normalized_vals.values, color=color, linewidth=2.5, \n",
|
|
||||||
" marker='o', markersize=6, label=f\"{project.upper()}\", zorder=10)\n",
|
|
||||||
" ax.fill_between(bin_centers, \n",
|
|
||||||
" (normalized_vals - traj['std']/traj['mean'].std()).values,\n",
|
|
||||||
" (normalized_vals + traj['std']/traj['mean'].std()).values,\n",
|
|
||||||
" alpha=0.2, color=color)\n",
|
|
||||||
" \n",
|
|
||||||
" ax.set_xlabel('Season Age (0-1)', fontsize=12)\n",
|
|
||||||
" ax.set_ylabel('Normalized Index (0-1)', fontsize=12)\n",
|
|
||||||
" ax.set_title(f\"Growth Curve Shape: {['CI_green', 'SAVI'][col]}\", fontsize=13, fontweight='bold')\n",
|
|
||||||
" ax.set_xlim(0, 1.1)\n",
|
|
||||||
" ax.set_ylim(-0.1, 1.1)\n",
|
|
||||||
" ax.grid(True, alpha=0.3)\n",
|
|
||||||
" ax.legend(fontsize=11, loc='best')\n",
|
|
||||||
"\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.savefig('growth_curve_comparison.png', dpi=150, bbox_inches='tight')\n",
|
|
||||||
"plt.show()\n",
|
|
||||||
"print(\"\\nSaved: growth_curve_comparison.png\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "27b5e61a",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Summary: Does Season-Age Normalization Work?"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "9252cf10",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(\"\\n\" + \"=\"*70)\n",
|
|
||||||
"print(\"SC-162 FINDINGS: Season-Age Normalization Validation\")\n",
|
|
||||||
"print(\"=\"*70)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\"\"\n",
|
|
||||||
"▶ KEY QUESTION: Do ESA and Angata follow similar growth patterns when normalized to season age?\n",
|
|
||||||
"\n",
|
|
||||||
"If YES (patterns align on 0-1 scale):\n",
|
|
||||||
" ✅ Season-age normalization is valid\n",
|
|
||||||
" ✅ Can train model on ESA, apply to Angata with confidence\n",
|
|
||||||
" ✅ Use proportional imminent window (7-8% of season length)\n",
|
|
||||||
"\n",
|
|
||||||
"If NO (patterns diverge):\n",
|
|
||||||
" ❌ Need project-specific models\n",
|
|
||||||
" ❌ Investigate why patterns differ (climate, variety, soil, etc.)\n",
|
|
||||||
" ❌ Consider multivariate approach (season_age + other features)\n",
|
|
||||||
"\n",
|
|
||||||
"OBSERVATIONS TO CHECK ABOVE:\n",
|
|
||||||
"1. Peak timing: Do both peak at ~same season-age?\n",
|
|
||||||
"2. Growth curve shape: Do normalized curves look similar?\n",
|
|
||||||
"3. Amplitude: Is the relative range comparable?\n",
|
|
||||||
"4. Phase: Is lag between CI_green and SAVI similar across projects?\n",
|
|
||||||
"\"\"\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\nNEXT STEPS:\")\n",
|
|
||||||
"print(\"1. If patterns align → Proceed with feature engineering (Phase 5)\")\n",
|
|
||||||
"print(\"2. If patterns differ → Debug and adjust project configs\")\n",
|
|
||||||
"print(\"3. Test both fixed (28d) and proportional imminent windows on ESA\")"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"kernelspec": {
|
|
||||||
"display_name": "pytorch_gpu",
|
|
||||||
"language": "python",
|
|
||||||
"name": "python3"
|
|
||||||
},
|
|
||||||
"language_info": {
|
|
||||||
"name": "python",
|
|
||||||
"version": "3.11.14"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 5
|
|
||||||
}
|
|
||||||
|
|
@ -1,199 +0,0 @@
|
||||||
"""
|
|
||||||
run_spectral_extraction.py — Batch extract spectral indices from all TIFFs
|
|
||||||
==========================================================================
|
|
||||||
Part of SC-161: Extract spectral indices from 4-band TIFFs (BSI, NDVI, NDWI)
|
|
||||||
|
|
||||||
Loops over all dated TIF files in a project's TIF folder, computes per-field
|
|
||||||
NDVI, BSI, NDWI and saves a single CSV with the same field/date structure
|
|
||||||
as ci_data_for_python.csv (can be joined on field + Date).
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
cd python_app/harvest_detection_experiments/angata_improvements
|
|
||||||
python run_spectral_extraction.py --project esa
|
|
||||||
python run_spectral_extraction.py --project esa --tif-folder merged_final_tif
|
|
||||||
python run_spectral_extraction.py --project esa --start 2024-01-01 --end 2024-12-31
|
|
||||||
"""
|
|
||||||
|
|
||||||
import argparse
|
|
||||||
import sys
|
|
||||||
import time
|
|
||||||
from datetime import datetime
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
import geopandas as gpd
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
# Add parent dirs to path so we can import the module
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent))
|
|
||||||
from spectral_features import extract_field_spectral_indices
|
|
||||||
|
|
||||||
|
|
||||||
def find_tif_dates(tif_folder: Path) -> list[tuple[str, Path]]:
|
|
||||||
"""
|
|
||||||
Find all dated TIF files (YYYY-MM-DD.tif) in a folder.
|
|
||||||
Returns sorted list of (date_str, path) tuples.
|
|
||||||
"""
|
|
||||||
tifs = []
|
|
||||||
for f in sorted(tif_folder.glob("*.tif")):
|
|
||||||
stem = f.stem # e.g. "2024-01-15"
|
|
||||||
try:
|
|
||||||
datetime.strptime(stem, "%Y-%m-%d")
|
|
||||||
tifs.append((stem, f))
|
|
||||||
except ValueError:
|
|
||||||
continue # skip non-date-named TIFs
|
|
||||||
return tifs
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
parser = argparse.ArgumentParser(
|
|
||||||
description="Extract spectral indices (NDVI, BSI, NDWI) from 4-band TIFFs per field"
|
|
||||||
)
|
|
||||||
parser.add_argument("--project", required=True, help="Project name (e.g. esa, angata)")
|
|
||||||
parser.add_argument(
|
|
||||||
"--tif-folder", default="merged_final_tif",
|
|
||||||
help="TIF subfolder name within project storage (default: merged_final_tif)"
|
|
||||||
)
|
|
||||||
parser.add_argument("--start", default=None, help="Start date filter (YYYY-MM-DD)")
|
|
||||||
parser.add_argument("--end", default=None, help="End date filter (YYYY-MM-DD)")
|
|
||||||
parser.add_argument(
|
|
||||||
"--output", default=None,
|
|
||||||
help="Output CSV path. Default: {project_storage}/Data/extracted_ci/ci_data_for_python/spectral_indices.csv"
|
|
||||||
)
|
|
||||||
parser.add_argument("--field-id-col", default="field", help="Field ID column in GeoJSON")
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# ==========================================================================
|
|
||||||
# RESOLVE PATHS
|
|
||||||
# ==========================================================================
|
|
||||||
# Navigate from this script to the repo root
|
|
||||||
# __file__ = angata_improvements/run_spectral_extraction.py
|
|
||||||
# parents: [0]=angata_improvements, [1]=harvest_detection_experiments, [2]=python_app, [3]=SmartCane_code
|
|
||||||
repo_root = Path(__file__).resolve().parents[3]
|
|
||||||
project_storage = repo_root / "laravel_app" / "storage" / "app" / args.project
|
|
||||||
|
|
||||||
tif_folder = project_storage / args.tif_folder
|
|
||||||
geojson_path = project_storage / "pivot.geojson"
|
|
||||||
|
|
||||||
if args.output:
|
|
||||||
output_csv = Path(args.output)
|
|
||||||
else:
|
|
||||||
output_dir = project_storage / "Data" / "extracted_ci" / "ci_data_for_python"
|
|
||||||
output_dir.mkdir(parents=True, exist_ok=True)
|
|
||||||
output_csv = output_dir / "spectral_indices.csv"
|
|
||||||
|
|
||||||
# ==========================================================================
|
|
||||||
# VALIDATION
|
|
||||||
# ==========================================================================
|
|
||||||
print("=" * 80)
|
|
||||||
print(f"SPECTRAL INDEX EXTRACTION — {args.project.upper()}")
|
|
||||||
print("=" * 80)
|
|
||||||
|
|
||||||
if not tif_folder.exists():
|
|
||||||
print(f"ERROR: TIF folder not found: {tif_folder}")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
if not geojson_path.exists():
|
|
||||||
print(f"ERROR: Field boundaries not found: {geojson_path}")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
# Load field boundaries once
|
|
||||||
print(f"\nLoading field boundaries: {geojson_path}")
|
|
||||||
gdf = gpd.read_file(geojson_path)
|
|
||||||
n_fields = len(gdf)
|
|
||||||
print(f" Fields: {n_fields}")
|
|
||||||
print(f" Field IDs: {', '.join(gdf[args.field_id_col].astype(str).tolist())}")
|
|
||||||
|
|
||||||
# Find all dated TIFs
|
|
||||||
all_tifs = find_tif_dates(tif_folder)
|
|
||||||
print(f"\nTIF files found: {len(all_tifs)}")
|
|
||||||
|
|
||||||
# Apply date filters
|
|
||||||
if args.start:
|
|
||||||
all_tifs = [(d, p) for d, p in all_tifs if d >= args.start]
|
|
||||||
if args.end:
|
|
||||||
all_tifs = [(d, p) for d, p in all_tifs if d <= args.end]
|
|
||||||
print(f"TIF files after filtering: {len(all_tifs)}")
|
|
||||||
|
|
||||||
if not all_tifs:
|
|
||||||
print("ERROR: No TIF files to process.")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
print(f"Date range: {all_tifs[0][0]} → {all_tifs[-1][0]}")
|
|
||||||
print(f"Output: {output_csv}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# ==========================================================================
|
|
||||||
# EXTRACTION LOOP
|
|
||||||
# ==========================================================================
|
|
||||||
all_rows = []
|
|
||||||
errors = 0
|
|
||||||
t0 = time.time()
|
|
||||||
|
|
||||||
for i, (date_str, tif_path) in enumerate(all_tifs):
|
|
||||||
try:
|
|
||||||
df = extract_field_spectral_indices(
|
|
||||||
tif_path, gdf, field_id_col=args.field_id_col
|
|
||||||
)
|
|
||||||
df.insert(1, "Date", date_str)
|
|
||||||
all_rows.append(df)
|
|
||||||
except Exception as e:
|
|
||||||
errors += 1
|
|
||||||
print(f" [{i+1}/{len(all_tifs)}] ERROR {date_str}: {e}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Progress every 100 files
|
|
||||||
if (i + 1) % 100 == 0 or (i + 1) == len(all_tifs):
|
|
||||||
elapsed = time.time() - t0
|
|
||||||
rate = (i + 1) / elapsed
|
|
||||||
remaining = (len(all_tifs) - i - 1) / rate if rate > 0 else 0
|
|
||||||
print(
|
|
||||||
f" [{i+1}/{len(all_tifs)}] {date_str} "
|
|
||||||
f"({rate:.1f} files/s, ~{remaining:.0f}s remaining)"
|
|
||||||
)
|
|
||||||
|
|
||||||
elapsed_total = time.time() - t0
|
|
||||||
|
|
||||||
# ==========================================================================
|
|
||||||
# COMBINE & SAVE
|
|
||||||
# ==========================================================================
|
|
||||||
if not all_rows:
|
|
||||||
print("ERROR: No data extracted.")
|
|
||||||
sys.exit(1)
|
|
||||||
|
|
||||||
result = pd.concat(all_rows, ignore_index=True)
|
|
||||||
|
|
||||||
# Sort by field, date for consistency with CI data
|
|
||||||
result = result.sort_values(["field", "Date"]).reset_index(drop=True)
|
|
||||||
|
|
||||||
# Save
|
|
||||||
output_csv.parent.mkdir(parents=True, exist_ok=True)
|
|
||||||
result.to_csv(output_csv, index=False)
|
|
||||||
|
|
||||||
# ==========================================================================
|
|
||||||
# SUMMARY
|
|
||||||
# ==========================================================================
|
|
||||||
print()
|
|
||||||
print("=" * 80)
|
|
||||||
print("EXTRACTION COMPLETE")
|
|
||||||
print("=" * 80)
|
|
||||||
print(f" Dates processed: {len(all_tifs)}")
|
|
||||||
print(f" Errors: {errors}")
|
|
||||||
print(f" Total rows: {len(result)}")
|
|
||||||
print(f" Fields: {result['field'].nunique()}")
|
|
||||||
print(f" Date range: {result['Date'].min()} → {result['Date'].max()}")
|
|
||||||
print(f" Time: {elapsed_total:.1f}s ({len(all_tifs)/elapsed_total:.1f} files/s)")
|
|
||||||
print()
|
|
||||||
|
|
||||||
# Quick stats
|
|
||||||
for col in ["mean_ndvi", "mean_bsi", "mean_ndwi"]:
|
|
||||||
vals = result[col].dropna()
|
|
||||||
if len(vals) > 0:
|
|
||||||
print(f" {col}: [{vals.min():.4f}, {vals.max():.4f}] (mean={vals.mean():.4f})")
|
|
||||||
|
|
||||||
nan_pct = result[["mean_ndvi", "mean_bsi", "mean_ndwi"]].isna().mean() * 100
|
|
||||||
print(f"\n NaN rates: NDVI={nan_pct['mean_ndvi']:.1f}%, BSI={nan_pct['mean_bsi']:.1f}%, NDWI={nan_pct['mean_ndwi']:.1f}%")
|
|
||||||
print(f"\n Output: {output_csv}")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,241 +0,0 @@
|
||||||
"""
|
|
||||||
spectral_features.py — Extract per-field spectral indices from 4-band GeoTIFFs
|
|
||||||
================================================================================
|
|
||||||
Part of SC-161: Extract spectral indices from 4-band TIFFs (BSI, NDVI, NDWI)
|
|
||||||
|
|
||||||
Band order in TIFFs: [R, G, B, NIR] (uint16)
|
|
||||||
|
|
||||||
Spectral Indices:
|
|
||||||
NDVI = (NIR - R) / (NIR + R) — vegetation density
|
|
||||||
BSI = ((R + NIR) - (G + B)) / ((R + NIR) + (G + B)) — bare soil detection
|
|
||||||
NDWI = (G - NIR) / (G + NIR) — moisture content
|
|
||||||
CI_green = NIR / Green - 1 — chlorophyll index (= SmartCane CI)
|
|
||||||
CI_red = NIR / Red - 1 — chlorophyll index (red-based)
|
|
||||||
GNDVI = (NIR - G) / (NIR + G) — green NDVI (less saturation)
|
|
||||||
SAVI = 1.5 * (NIR - R) / (NIR + R + 0.5) — soil-adjusted vegetation
|
|
||||||
EVI2 = 2.5 * (NIR - R) / (NIR + 2.4*R + 1) — enhanced vegetation (2-band)
|
|
||||||
"""
|
|
||||||
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
import geopandas as gpd
|
|
||||||
import rasterio
|
|
||||||
from rasterio.mask import mask as rio_mask
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Optional
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# SPECTRAL INDEX CALCULATIONS
|
|
||||||
# =============================================================================
|
|
||||||
|
|
||||||
def _safe_normalized_diff(a: np.ndarray, b: np.ndarray) -> np.ndarray:
|
|
||||||
"""Compute (a - b) / (a + b) with safe division, masking invalid pixels."""
|
|
||||||
with np.errstate(divide="ignore", invalid="ignore"):
|
|
||||||
result = (a - b) / (a + b)
|
|
||||||
result[~np.isfinite(result)] = np.nan
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def compute_ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""NDVI = (NIR - R) / (NIR + R)"""
|
|
||||||
return _safe_normalized_diff(nir, red)
|
|
||||||
|
|
||||||
|
|
||||||
def compute_bsi(red: np.ndarray, green: np.ndarray, blue: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""BSI = ((R + NIR) - (G + B)) / ((R + NIR) + (G + B))"""
|
|
||||||
numerator = (red + nir) - (green + blue)
|
|
||||||
denominator = (red + nir) + (green + blue)
|
|
||||||
with np.errstate(divide="ignore", invalid="ignore"):
|
|
||||||
result = numerator / denominator
|
|
||||||
result[~np.isfinite(result)] = np.nan
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def compute_ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""NDWI = (G - NIR) / (G + NIR)"""
|
|
||||||
return _safe_normalized_diff(green, nir)
|
|
||||||
|
|
||||||
|
|
||||||
def compute_ci_green(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""CI_green = NIR / Green - 1 (= SmartCane pipeline CI)"""
|
|
||||||
with np.errstate(divide="ignore", invalid="ignore"):
|
|
||||||
result = nir / green - 1.0
|
|
||||||
result[~np.isfinite(result)] = np.nan
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def compute_ci_red(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""CI_red = NIR / Red - 1 (more sensitive at low chlorophyll)"""
|
|
||||||
with np.errstate(divide="ignore", invalid="ignore"):
|
|
||||||
result = nir / red - 1.0
|
|
||||||
result[~np.isfinite(result)] = np.nan
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def compute_gndvi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""GNDVI = (NIR - G) / (NIR + G)"""
|
|
||||||
return _safe_normalized_diff(nir, green)
|
|
||||||
|
|
||||||
|
|
||||||
def compute_savi(red: np.ndarray, nir: np.ndarray, L: float = 0.5) -> np.ndarray:
|
|
||||||
"""SAVI = (1 + L) * (NIR - R) / (NIR + R + L), L=0.5 typical"""
|
|
||||||
with np.errstate(divide="ignore", invalid="ignore"):
|
|
||||||
result = (1.0 + L) * (nir - red) / (nir + red + L)
|
|
||||||
result[~np.isfinite(result)] = np.nan
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def compute_evi2(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
|
|
||||||
"""EVI2 = 2.5 * (NIR - R) / (NIR + 2.4*R + 1)"""
|
|
||||||
with np.errstate(divide="ignore", invalid="ignore"):
|
|
||||||
result = 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)
|
|
||||||
result[~np.isfinite(result)] = np.nan
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
|
||||||
# PER-FIELD EXTRACTION
|
|
||||||
# =============================================================================
|
|
||||||
|
|
||||||
def extract_field_spectral_indices(
|
|
||||||
tif_path: str | Path,
|
|
||||||
field_boundaries_gdf: gpd.GeoDataFrame,
|
|
||||||
field_id_col: str = "field",
|
|
||||||
nodata_threshold: float = 0.0,
|
|
||||||
min_valid_fraction: float = 0.1,
|
|
||||||
) -> pd.DataFrame:
|
|
||||||
"""
|
|
||||||
Extract mean/std spectral indices per field from a single 4-band TIF.
|
|
||||||
|
|
||||||
Parameters
|
|
||||||
----------
|
|
||||||
tif_path : path to 4-band GeoTIFF (R, G, B, NIR — uint16)
|
|
||||||
field_boundaries_gdf : GeoDataFrame with field polygons
|
|
||||||
field_id_col : column name for field identifier
|
|
||||||
nodata_threshold : pixels with ALL bands <= this value are treated as nodata
|
|
||||||
min_valid_fraction : minimum fraction of valid pixels required (else NaN)
|
|
||||||
|
|
||||||
Returns
|
|
||||||
-------
|
|
||||||
DataFrame with columns:
|
|
||||||
field, mean_ndvi, mean_bsi, mean_ndwi, mean_ci_green, mean_ci_red,
|
|
||||||
mean_gndvi, mean_savi, mean_evi2,
|
|
||||||
std_ndvi, std_bsi, std_ndwi, std_ci_green, std_ci_red,
|
|
||||||
std_gndvi, std_savi, std_evi2,
|
|
||||||
valid_pixel_count, total_pixel_count
|
|
||||||
"""
|
|
||||||
tif_path = Path(tif_path)
|
|
||||||
rows = []
|
|
||||||
|
|
||||||
with rasterio.open(tif_path) as src:
|
|
||||||
tif_crs = src.crs
|
|
||||||
|
|
||||||
# Reproject field boundaries to match TIF CRS if needed
|
|
||||||
gdf = field_boundaries_gdf.copy()
|
|
||||||
if gdf.crs is not None and not gdf.crs.equals(tif_crs):
|
|
||||||
gdf = gdf.to_crs(tif_crs)
|
|
||||||
elif gdf.crs is None:
|
|
||||||
# Assume EPSG:4326 if not set
|
|
||||||
gdf = gdf.set_crs("EPSG:4326").to_crs(tif_crs)
|
|
||||||
|
|
||||||
for _, row_geom in gdf.iterrows():
|
|
||||||
field_id = row_geom[field_id_col]
|
|
||||||
geom = [row_geom.geometry.__geo_interface__]
|
|
||||||
|
|
||||||
try:
|
|
||||||
out_image, _ = rio_mask(src, geom, crop=True, all_touched=True, nodata=0)
|
|
||||||
except Exception:
|
|
||||||
# Field doesn't overlap with this TIF
|
|
||||||
rows.append(_make_nan_row(field_id))
|
|
||||||
continue
|
|
||||||
|
|
||||||
# out_image shape: (bands, height, width) — convert to float
|
|
||||||
bands = out_image.astype(np.float64)
|
|
||||||
|
|
||||||
red = bands[0]
|
|
||||||
green = bands[1]
|
|
||||||
blue = bands[2]
|
|
||||||
nir = bands[3]
|
|
||||||
|
|
||||||
# Mask: valid pixels have at least one band > nodata_threshold
|
|
||||||
valid_mask = np.any(bands > nodata_threshold, axis=0)
|
|
||||||
total_pixels = valid_mask.size
|
|
||||||
valid_pixels = int(valid_mask.sum())
|
|
||||||
|
|
||||||
if valid_pixels < max(1, int(total_pixels * min_valid_fraction)):
|
|
||||||
rows.append(_make_nan_row(field_id, valid_pixels, total_pixels))
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Compute indices (only on valid pixels)
|
|
||||||
ndvi = compute_ndvi(red, nir)
|
|
||||||
bsi = compute_bsi(red, green, blue, nir)
|
|
||||||
ndwi = compute_ndwi(green, nir)
|
|
||||||
ci_green = compute_ci_green(green, nir)
|
|
||||||
ci_red = compute_ci_red(red, nir)
|
|
||||||
gndvi = compute_gndvi(green, nir)
|
|
||||||
savi = compute_savi(red, nir)
|
|
||||||
evi2 = compute_evi2(red, nir)
|
|
||||||
|
|
||||||
# Apply mask
|
|
||||||
ndvi_valid = ndvi[valid_mask]
|
|
||||||
bsi_valid = bsi[valid_mask]
|
|
||||||
ndwi_valid = ndwi[valid_mask]
|
|
||||||
ci_green_valid = ci_green[valid_mask]
|
|
||||||
ci_red_valid = ci_red[valid_mask]
|
|
||||||
gndvi_valid = gndvi[valid_mask]
|
|
||||||
savi_valid = savi[valid_mask]
|
|
||||||
evi2_valid = evi2[valid_mask]
|
|
||||||
|
|
||||||
rows.append({
|
|
||||||
"field": field_id,
|
|
||||||
"mean_ndvi": np.nanmean(ndvi_valid),
|
|
||||||
"mean_bsi": np.nanmean(bsi_valid),
|
|
||||||
"mean_ndwi": np.nanmean(ndwi_valid),
|
|
||||||
"mean_ci_green": np.nanmean(ci_green_valid),
|
|
||||||
"mean_ci_red": np.nanmean(ci_red_valid),
|
|
||||||
"mean_gndvi": np.nanmean(gndvi_valid),
|
|
||||||
"mean_savi": np.nanmean(savi_valid),
|
|
||||||
"mean_evi2": np.nanmean(evi2_valid),
|
|
||||||
"std_ndvi": np.nanstd(ndvi_valid),
|
|
||||||
"std_bsi": np.nanstd(bsi_valid),
|
|
||||||
"std_ndwi": np.nanstd(ndwi_valid),
|
|
||||||
"std_ci_green": np.nanstd(ci_green_valid),
|
|
||||||
"std_ci_red": np.nanstd(ci_red_valid),
|
|
||||||
"std_gndvi": np.nanstd(gndvi_valid),
|
|
||||||
"std_savi": np.nanstd(savi_valid),
|
|
||||||
"std_evi2": np.nanstd(evi2_valid),
|
|
||||||
"valid_pixel_count": valid_pixels,
|
|
||||||
"total_pixel_count": total_pixels,
|
|
||||||
})
|
|
||||||
|
|
||||||
return pd.DataFrame(rows)
|
|
||||||
|
|
||||||
|
|
||||||
def _make_nan_row(
|
|
||||||
field_id: str,
|
|
||||||
valid_pixels: int = 0,
|
|
||||||
total_pixels: int = 0,
|
|
||||||
) -> dict:
|
|
||||||
"""Return a row with NaN values for a field that couldn't be processed."""
|
|
||||||
return {
|
|
||||||
"field": field_id,
|
|
||||||
"mean_ndvi": np.nan,
|
|
||||||
"mean_bsi": np.nan,
|
|
||||||
"mean_ndwi": np.nan,
|
|
||||||
"mean_ci_green": np.nan,
|
|
||||||
"mean_ci_red": np.nan,
|
|
||||||
"mean_gndvi": np.nan,
|
|
||||||
"mean_savi": np.nan,
|
|
||||||
"mean_evi2": np.nan,
|
|
||||||
"std_ndvi": np.nan,
|
|
||||||
"std_bsi": np.nan,
|
|
||||||
"std_ndwi": np.nan,
|
|
||||||
"std_ci_green": np.nan,
|
|
||||||
"std_ci_red": np.nan,
|
|
||||||
"std_gndvi": np.nan,
|
|
||||||
"std_savi": np.nan,
|
|
||||||
"std_evi2": np.nan,
|
|
||||||
"valid_pixel_count": valid_pixels,
|
|
||||||
"total_pixel_count": total_pixels,
|
|
||||||
}
|
|
||||||
|
|
@ -1,210 +0,0 @@
|
||||||
"""
|
|
||||||
Batch harvest detection across all fields.
|
|
||||||
Generates accuracy metrics: mean error, std dev, percentage within thresholds.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
from pathlib import Path
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
|
|
||||||
# Add parent to path for imports
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent))
|
|
||||||
|
|
||||||
from multi_year_harvest_detection import (
|
|
||||||
load_model_and_config, load_harvest_data, run_iterative_harvest_detection,
|
|
||||||
export_results, detect_actual_harvest_dates, DATA_FILE, DEVICE
|
|
||||||
)
|
|
||||||
|
|
||||||
OUTPUT_DIR = Path("multi_year_analysis_batch")
|
|
||||||
OUTPUT_DIR.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
def run_field_detection(field_id, data_df, model, scalers, config):
|
|
||||||
"""Run detection for a single field."""
|
|
||||||
print(f"\n{'='*80}")
|
|
||||||
print(f"Processing Field: {field_id}")
|
|
||||||
print(f"{'='*80}")
|
|
||||||
|
|
||||||
field_data = data_df[data_df['field'] == field_id].copy()
|
|
||||||
|
|
||||||
if len(field_data) == 0:
|
|
||||||
print(f" ⚠ No data found for field {field_id}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
print(f" Data points: {len(field_data)} ({field_data['Date'].min()} to {field_data['Date'].max()})")
|
|
||||||
|
|
||||||
try:
|
|
||||||
results_df, detected_harvests, full_data = run_iterative_harvest_detection(
|
|
||||||
field_id, field_data, model, scalers, config
|
|
||||||
)
|
|
||||||
|
|
||||||
# Export field results
|
|
||||||
export_results(field_id, results_df, detected_harvests, full_data,
|
|
||||||
output_dir=OUTPUT_DIR)
|
|
||||||
|
|
||||||
return {
|
|
||||||
'field_id': field_id,
|
|
||||||
'num_detections': len(detected_harvests),
|
|
||||||
'detected_harvests': detected_harvests,
|
|
||||||
'results_df': results_df,
|
|
||||||
'full_data': full_data
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ✗ Error processing field: {str(e)}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def compute_accuracy_metrics(all_results):
|
|
||||||
"""Compute accuracy metrics across all fields."""
|
|
||||||
from multi_year_harvest_detection import detect_actual_harvest_dates
|
|
||||||
|
|
||||||
all_errors = []
|
|
||||||
summary_data = []
|
|
||||||
|
|
||||||
for field_result in all_results:
|
|
||||||
if field_result is None:
|
|
||||||
continue
|
|
||||||
|
|
||||||
field_id = field_result['field_id']
|
|
||||||
detected_harvests = field_result['detected_harvests']
|
|
||||||
full_data = field_result['full_data']
|
|
||||||
|
|
||||||
# Get actual harvests
|
|
||||||
actual_harvest_days = detect_actual_harvest_dates(full_data)
|
|
||||||
|
|
||||||
if not detected_harvests or not actual_harvest_days:
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Calculate errors
|
|
||||||
errors = []
|
|
||||||
for det_day, det_date, det_prob in detected_harvests:
|
|
||||||
# Find nearest actual harvest
|
|
||||||
diffs = [abs(det_day - act_day) for act_day in actual_harvest_days]
|
|
||||||
min_error = min(diffs)
|
|
||||||
errors.append(min_error)
|
|
||||||
all_errors.append(min_error)
|
|
||||||
|
|
||||||
summary_data.append({
|
|
||||||
'field_id': field_id,
|
|
||||||
'detected_day': det_day,
|
|
||||||
'detected_date': det_date if isinstance(det_date, str) else det_date.strftime('%Y-%m-%d'),
|
|
||||||
'detected_prob': det_prob,
|
|
||||||
'error_days': min_error
|
|
||||||
})
|
|
||||||
|
|
||||||
print(f"\nField {field_id}:")
|
|
||||||
print(f" Detections: {len(detected_harvests)}")
|
|
||||||
if errors:
|
|
||||||
print(f" Mean error: {np.mean(errors):.1f} days")
|
|
||||||
print(f" Std dev: {np.std(errors):.1f} days")
|
|
||||||
print(f" Min/Max: {min(errors):.0f}/{max(errors):.0f} days")
|
|
||||||
|
|
||||||
return all_errors, pd.DataFrame(summary_data)
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("="*80)
|
|
||||||
print("BATCH HARVEST DETECTION - ALL FIELDS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Load model
|
|
||||||
print("\n[1/3] Loading Model 307...")
|
|
||||||
model, config, scalers = load_model_and_config()
|
|
||||||
|
|
||||||
# Load all data
|
|
||||||
print("\n[2/3] Loading data...")
|
|
||||||
df = load_harvest_data(DATA_FILE)
|
|
||||||
print(f"Total rows: {len(df)}")
|
|
||||||
|
|
||||||
# Filter out Chemba fields
|
|
||||||
df = df[df['client'] != 'chemba'].copy()
|
|
||||||
print(f"After filtering out Chemba: {len(df)} rows")
|
|
||||||
|
|
||||||
# Get all unique fields (remove NaN)
|
|
||||||
fields = sorted([f for f in df['field'].unique() if pd.notna(f)])
|
|
||||||
print(f"Fields to process: {len(fields)}")
|
|
||||||
print(f" {fields}")
|
|
||||||
|
|
||||||
# Process each field
|
|
||||||
print("\n[3/3] Running detection on all fields...")
|
|
||||||
all_results = []
|
|
||||||
|
|
||||||
for field_id in fields:
|
|
||||||
result = run_field_detection(field_id, df, model, scalers, config)
|
|
||||||
if result is not None:
|
|
||||||
all_results.append(result)
|
|
||||||
|
|
||||||
# Compute accuracy metrics
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("ACCURACY SUMMARY")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
all_errors, summary_df = compute_accuracy_metrics(all_results)
|
|
||||||
|
|
||||||
if all_errors:
|
|
||||||
all_errors = np.array(all_errors)
|
|
||||||
print(f"\nOverall Statistics (across all fields):")
|
|
||||||
print(f" Total detections: {len(all_errors)}")
|
|
||||||
print(f" Mean error: {np.mean(all_errors):.2f} days")
|
|
||||||
print(f" Median error: {np.median(all_errors):.2f} days")
|
|
||||||
print(f" Std dev: {np.std(all_errors):.2f} days")
|
|
||||||
print(f" Min error: {np.min(all_errors):.0f} days")
|
|
||||||
print(f" Max error: {np.max(all_errors):.0f} days")
|
|
||||||
|
|
||||||
# Percentiles
|
|
||||||
print(f"\n Percentiles:")
|
|
||||||
for p in [25, 50, 75, 90, 95]:
|
|
||||||
print(f" {p}th: {np.percentile(all_errors, p):.1f} days")
|
|
||||||
|
|
||||||
# Within threshold
|
|
||||||
thresholds = [3, 7, 14, 21, 30]
|
|
||||||
print(f"\n Within threshold:")
|
|
||||||
for threshold in thresholds:
|
|
||||||
pct = 100 * np.sum(all_errors <= threshold) / len(all_errors)
|
|
||||||
print(f" ≤ {threshold} days: {pct:.1f}% ({np.sum(all_errors <= threshold)}/{len(all_errors)})")
|
|
||||||
|
|
||||||
# Export summary
|
|
||||||
summary_file = OUTPUT_DIR / "batch_accuracy_summary.csv"
|
|
||||||
summary_df.to_csv(summary_file, index=False)
|
|
||||||
print(f"\nSummary CSV: {summary_file}")
|
|
||||||
print("\nFirst 20 rows:")
|
|
||||||
print(summary_df.head(20).to_string(index=False))
|
|
||||||
|
|
||||||
# Plot error distribution
|
|
||||||
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
|
|
||||||
|
|
||||||
# Histogram
|
|
||||||
axes[0].hist(all_errors, bins=20, color='steelblue', edgecolor='black', alpha=0.7)
|
|
||||||
axes[0].axvline(np.mean(all_errors), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(all_errors):.1f}d')
|
|
||||||
axes[0].axvline(np.median(all_errors), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(all_errors):.1f}d')
|
|
||||||
axes[0].set_xlabel('Error (days)', fontsize=12, fontweight='bold')
|
|
||||||
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
|
|
||||||
axes[0].set_title('Distribution of Detection Errors', fontsize=13, fontweight='bold')
|
|
||||||
axes[0].legend()
|
|
||||||
axes[0].grid(alpha=0.3)
|
|
||||||
|
|
||||||
# Cumulative distribution
|
|
||||||
sorted_errors = np.sort(all_errors)
|
|
||||||
cumulative = np.arange(1, len(sorted_errors)+1) / len(sorted_errors) * 100
|
|
||||||
axes[1].plot(sorted_errors, cumulative, marker='o', linestyle='-', color='steelblue', linewidth=2, markersize=5)
|
|
||||||
axes[1].axhline(50, color='gray', linestyle=':', alpha=0.5)
|
|
||||||
axes[1].axhline(90, color='gray', linestyle=':', alpha=0.5)
|
|
||||||
axes[1].axvline(7, color='green', linestyle='--', alpha=0.5, linewidth=2, label='7-day target')
|
|
||||||
axes[1].axvline(14, color='orange', linestyle='--', alpha=0.5, linewidth=2, label='14-day acceptable')
|
|
||||||
axes[1].set_xlabel('Error (days)', fontsize=12, fontweight='bold')
|
|
||||||
axes[1].set_ylabel('Cumulative %', fontsize=12, fontweight='bold')
|
|
||||||
axes[1].set_title('Cumulative Distribution of Errors', fontsize=13, fontweight='bold')
|
|
||||||
axes[1].legend()
|
|
||||||
axes[1].grid(alpha=0.3)
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
plot_file = OUTPUT_DIR / "error_distribution.png"
|
|
||||||
plt.savefig(plot_file, dpi=100, bbox_inches='tight')
|
|
||||||
print(f"Error distribution plot: {plot_file}")
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,656 +0,0 @@
|
||||||
"""
|
|
||||||
Multi-Year Harvest Detection: Detect multiple harvest dates in continuous 5-year CI sequences
|
|
||||||
|
|
||||||
Strategy:
|
|
||||||
1. Load full CI sequence for a field (no truncation)
|
|
||||||
2. Run inference on every 7 days across the entire sequence
|
|
||||||
3. Create synthetic DOY (modulo 365) for seasonal context
|
|
||||||
4. Detect harvest spikes (detected_prob > threshold)
|
|
||||||
5. Implement state-reset logic: after harvest detected, reset expectations
|
|
||||||
6. Cluster spikes to estimate multiple harvest dates
|
|
||||||
7. Visualize with CI overlay to validate
|
|
||||||
"""
|
|
||||||
|
|
||||||
import pandas as pd
|
|
||||||
import numpy as np
|
|
||||||
import torch
|
|
||||||
from pathlib import Path
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
import sys
|
|
||||||
|
|
||||||
sys.path.insert(0, str(Path.cwd() / 'src'))
|
|
||||||
|
|
||||||
from data_loader import load_harvest_data
|
|
||||||
from feature_engineering import extract_features
|
|
||||||
from models import create_model
|
|
||||||
import pickle
|
|
||||||
import yaml
|
|
||||||
|
|
||||||
# Configuration
|
|
||||||
DETECTED_THRESHOLD = 0.2 # Threshold for multi-year detection
|
|
||||||
FIELD_TO_TEST = '00300'
|
|
||||||
SKIP_FIRST_DAYS = 100 # Skip first N days to simulate mid-season start (0 = full sequence)
|
|
||||||
|
|
||||||
RESULTS_DIR = Path("results/307_dropout02_with_doy_ORIGINAL")
|
|
||||||
DATA_FILE = Path("../lstm_complete_data.csv")
|
|
||||||
CONFIG_FILE = RESULTS_DIR / "config.json"
|
|
||||||
MODEL_FILE = RESULTS_DIR / "model.pt"
|
|
||||||
SCALERS_FILE = RESULTS_DIR / "scalers.pkl"
|
|
||||||
|
|
||||||
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
|
||||||
print(f"Using device: {DEVICE}")
|
|
||||||
|
|
||||||
|
|
||||||
def load_model_and_config():
|
|
||||||
"""Load Model 307 architecture and weights."""
|
|
||||||
print(f"Loading model config from {CONFIG_FILE}")
|
|
||||||
with open(CONFIG_FILE) as f:
|
|
||||||
config = yaml.safe_load(f)
|
|
||||||
|
|
||||||
print(f"Loading model weights from {MODEL_FILE}")
|
|
||||||
model = create_model(
|
|
||||||
model_type=config['model']['type'],
|
|
||||||
input_size=len(config['features']),
|
|
||||||
hidden_size=config['model']['hidden_size'],
|
|
||||||
num_layers=config['model']['num_layers'],
|
|
||||||
dropout=config['model']['dropout'],
|
|
||||||
device=DEVICE
|
|
||||||
)
|
|
||||||
model.load_state_dict(torch.load(MODEL_FILE, map_location=DEVICE))
|
|
||||||
model.eval()
|
|
||||||
|
|
||||||
print(f"Loading feature scalers from {SCALERS_FILE}")
|
|
||||||
with open(SCALERS_FILE, 'rb') as f:
|
|
||||||
scalers = pickle.load(f)
|
|
||||||
|
|
||||||
return model, config, scalers
|
|
||||||
|
|
||||||
|
|
||||||
def predict_on_truncated_sequence(model, data_df, truncate_day, scalers, config):
|
|
||||||
"""Run inference on sequence truncated at specific day."""
|
|
||||||
if truncate_day >= len(data_df):
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
trunc_df = data_df.iloc[:truncate_day+1].copy()
|
|
||||||
|
|
||||||
features = config['features']
|
|
||||||
ci_column = config['data']['ci_column']
|
|
||||||
feat_array = extract_features(trunc_df, features, ci_column)
|
|
||||||
|
|
||||||
# Apply scalers
|
|
||||||
for fi, scaler in enumerate(scalers):
|
|
||||||
try:
|
|
||||||
feat_array[:, fi] = scaler.transform(feat_array[:, fi].reshape(-1, 1)).flatten()
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
with torch.no_grad():
|
|
||||||
x_tensor = torch.tensor(feat_array, dtype=torch.float32).unsqueeze(0).to(DEVICE)
|
|
||||||
out_imm, out_det = model(x_tensor)
|
|
||||||
imminent_prob = out_imm.squeeze(0)[-1].cpu().item()
|
|
||||||
detected_prob = out_det.squeeze(0)[-1].cpu().item()
|
|
||||||
|
|
||||||
return imminent_prob, detected_prob
|
|
||||||
|
|
||||||
|
|
||||||
def predict_with_state_reset(model, data_df, season_anchor_day, end_day, scalers, config, window_size=180):
|
|
||||||
"""
|
|
||||||
Run inference with DOY reset relative to season anchor point.
|
|
||||||
|
|
||||||
The model was trained on sequences with DOY cycling 1-365 within a season.
|
|
||||||
To use multi-year data, we anchor to harvest detection points and reset DOY.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
model: LSTM model
|
|
||||||
data_df: Full dataframe
|
|
||||||
season_anchor_day: Day that marks the start of this season (DOY 1 for model)
|
|
||||||
end_day: Day to predict at
|
|
||||||
scalers: Feature scalers
|
|
||||||
config: Model config
|
|
||||||
window_size: Max history to include (180-200 days typical)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
(imminent_prob, detected_prob) for end_day
|
|
||||||
"""
|
|
||||||
if end_day >= len(data_df) or season_anchor_day > end_day:
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
# Create lookback window: last window_size days before end_day, but don't go before season start
|
|
||||||
lookback_start = max(0, end_day - window_size)
|
|
||||||
trunc_df = data_df.iloc[lookback_start:end_day+1].copy()
|
|
||||||
|
|
||||||
# RESET DOY relative to season anchor:
|
|
||||||
# season_anchor_day = DOY 1, season_anchor_day+1 = DOY 2, etc.
|
|
||||||
# This gives the model the seasonal context it was trained on
|
|
||||||
if 'DOY' in trunc_df.columns:
|
|
||||||
days_from_anchor = np.arange(len(trunc_df)) + (lookback_start - season_anchor_day)
|
|
||||||
trunc_df['DOY'] = (days_from_anchor % 365) + 1 # DOY 1-365 cycling
|
|
||||||
|
|
||||||
features = config['features']
|
|
||||||
ci_column = config['data']['ci_column']
|
|
||||||
feat_array = extract_features(trunc_df, features, ci_column)
|
|
||||||
|
|
||||||
# Apply scalers
|
|
||||||
for fi, scaler in enumerate(scalers):
|
|
||||||
try:
|
|
||||||
feat_array[:, fi] = scaler.transform(feat_array[:, fi].reshape(-1, 1)).flatten()
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
with torch.no_grad():
|
|
||||||
x_tensor = torch.tensor(feat_array, dtype=torch.float32).unsqueeze(0).to(DEVICE)
|
|
||||||
out_imm, out_det = model(x_tensor)
|
|
||||||
imminent_prob = out_imm.squeeze(0)[-1].cpu().item()
|
|
||||||
detected_prob = out_det.squeeze(0)[-1].cpu().item()
|
|
||||||
|
|
||||||
return imminent_prob, detected_prob
|
|
||||||
|
|
||||||
|
|
||||||
def detect_harvest_spikes(detected_probs, threshold=DETECTED_THRESHOLD, min_cluster_size=3):
|
|
||||||
"""
|
|
||||||
Detect harvest spikes in detected_prob time series.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
List of (spike_center_day, peak_prob) tuples
|
|
||||||
"""
|
|
||||||
spikes = []
|
|
||||||
in_spike = False
|
|
||||||
spike_start = None
|
|
||||||
spike_probs = []
|
|
||||||
|
|
||||||
for day, prob in enumerate(detected_probs):
|
|
||||||
if prob > threshold:
|
|
||||||
if not in_spike:
|
|
||||||
in_spike = True
|
|
||||||
spike_start = day
|
|
||||||
spike_probs = [prob]
|
|
||||||
else:
|
|
||||||
spike_probs.append(prob)
|
|
||||||
else:
|
|
||||||
if in_spike and len(spike_probs) >= min_cluster_size:
|
|
||||||
# Spike ended, record it
|
|
||||||
spike_center = spike_start + np.argmax(spike_probs)
|
|
||||||
peak_prob = np.max(spike_probs)
|
|
||||||
spikes.append((spike_center, peak_prob))
|
|
||||||
in_spike = False
|
|
||||||
spike_probs = []
|
|
||||||
|
|
||||||
# Handle spike at end of sequence
|
|
||||||
if in_spike and len(spike_probs) >= min_cluster_size:
|
|
||||||
spike_center = spike_start + np.argmax(spike_probs)
|
|
||||||
peak_prob = np.max(spike_probs)
|
|
||||||
spikes.append((spike_center, peak_prob))
|
|
||||||
|
|
||||||
return spikes
|
|
||||||
|
|
||||||
|
|
||||||
def extract_harvest_dates(detected_probs, check_days, data_df, threshold=DETECTED_THRESHOLD, min_days_between=100):
|
|
||||||
"""
|
|
||||||
Extract estimated harvest dates from detected probability spikes.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
detected_probs: Array of detected probabilities at check days
|
|
||||||
check_days: Array of days at which predictions were made
|
|
||||||
data_df: Full sequence dataframe (for date mapping)
|
|
||||||
threshold: Detection threshold
|
|
||||||
min_days_between: Minimum days between harvests (to avoid duplicates)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
List of (day, date, peak_prob) tuples for estimated harvests
|
|
||||||
"""
|
|
||||||
spikes = detect_harvest_spikes(detected_probs, threshold=threshold, min_cluster_size=3)
|
|
||||||
|
|
||||||
if not spikes:
|
|
||||||
return []
|
|
||||||
|
|
||||||
# Filter: only keep spikes that are at least min_days_between apart
|
|
||||||
filtered_spikes = []
|
|
||||||
for spike_day, peak_prob in spikes:
|
|
||||||
if not filtered_spikes:
|
|
||||||
filtered_spikes.append((spike_day, peak_prob))
|
|
||||||
else:
|
|
||||||
last_day = filtered_spikes[-1][0]
|
|
||||||
if spike_day - last_day >= min_days_between:
|
|
||||||
filtered_spikes.append((spike_day, peak_prob))
|
|
||||||
|
|
||||||
# Map days to dates
|
|
||||||
harvest_dates = []
|
|
||||||
for spike_day, peak_prob in filtered_spikes:
|
|
||||||
# Find closest check day to the spike
|
|
||||||
closest_check_idx = np.argmin(np.abs(check_days - spike_day))
|
|
||||||
closest_check_day = check_days[closest_check_idx]
|
|
||||||
|
|
||||||
if closest_check_day < len(data_df):
|
|
||||||
date = data_df.iloc[closest_check_day]['Date']
|
|
||||||
harvest_dates.append((closest_check_day, date, peak_prob))
|
|
||||||
|
|
||||||
return harvest_dates
|
|
||||||
|
|
||||||
|
|
||||||
def run_iterative_harvest_detection(field_name, data_df, model, scalers, config):
|
|
||||||
"""
|
|
||||||
Iterative harvest detection with multi-day confirmation.
|
|
||||||
|
|
||||||
Strategy:
|
|
||||||
1. Start from day 0
|
|
||||||
2. Run inference every 7 days
|
|
||||||
3. Collect days where detected_prob crosses threshold
|
|
||||||
4. Once we have 2-3 consecutive confirmations, declare harvest
|
|
||||||
5. Use FIRST confirmed day as anchor point for DOY reset
|
|
||||||
6. Continue from day after last confirmation
|
|
||||||
|
|
||||||
Args:
|
|
||||||
field_name: Field ID
|
|
||||||
data_df: Full CI sequence (sorted by Date)
|
|
||||||
model: Loaded LSTM model
|
|
||||||
scalers: Feature scalers
|
|
||||||
config: Model config
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
results_df: DataFrame with predictions
|
|
||||||
detected_harvests: List of (day, date, peak_prob) tuples
|
|
||||||
"""
|
|
||||||
print(f"\nProcessing field {field_name} with iterative detection (multi-day confirmation)...")
|
|
||||||
print(f"Sequence length: {len(data_df)} days")
|
|
||||||
|
|
||||||
data_df = data_df.sort_values('Date').reset_index(drop=True)
|
|
||||||
|
|
||||||
results = []
|
|
||||||
detected_harvests = []
|
|
||||||
harvest_event_id = 0
|
|
||||||
|
|
||||||
current_start = 0
|
|
||||||
min_confirmations = 2 # Need 2+ consecutive days above threshold
|
|
||||||
|
|
||||||
while current_start < len(data_df):
|
|
||||||
print(f"\n--- Harvest Event {harvest_event_id} (starting from day {current_start}) ---")
|
|
||||||
|
|
||||||
confirmation_cluster = [] # Track consecutive days above threshold
|
|
||||||
harvest_first_day = None
|
|
||||||
peak_prob_in_event = 0
|
|
||||||
|
|
||||||
# Run predictions for this season until harvest confirmed
|
|
||||||
checks_done = 0
|
|
||||||
max_checks = 1000 # Safety limit to prevent infinite loops
|
|
||||||
|
|
||||||
for offset_day in range(7, len(data_df) - current_start, 7):
|
|
||||||
check_day = current_start + offset_day
|
|
||||||
checks_done += 1
|
|
||||||
|
|
||||||
if check_day >= len(data_df) or checks_done > max_checks:
|
|
||||||
break
|
|
||||||
|
|
||||||
# Run inference with DOY reset
|
|
||||||
imminent_prob, detected_prob = predict_with_state_reset(
|
|
||||||
model, data_df, current_start, check_day, scalers, config, window_size=200
|
|
||||||
)
|
|
||||||
|
|
||||||
if imminent_prob is None:
|
|
||||||
continue
|
|
||||||
|
|
||||||
check_row = data_df.iloc[check_day]
|
|
||||||
|
|
||||||
results.append({
|
|
||||||
'day': check_day,
|
|
||||||
'date': check_row['Date'],
|
|
||||||
'imminent_prob': imminent_prob,
|
|
||||||
'detected_prob': detected_prob,
|
|
||||||
'harvest_event_id': harvest_event_id,
|
|
||||||
'ci_raw': check_row['FitData'] if 'FitData' in check_row else None,
|
|
||||||
})
|
|
||||||
|
|
||||||
# Check if above threshold
|
|
||||||
if detected_prob > DETECTED_THRESHOLD:
|
|
||||||
confirmation_cluster.append((check_day, detected_prob))
|
|
||||||
peak_prob_in_event = max(peak_prob_in_event, detected_prob)
|
|
||||||
|
|
||||||
# If this is first confirmation, record it
|
|
||||||
if harvest_first_day is None:
|
|
||||||
harvest_first_day = check_day
|
|
||||||
else:
|
|
||||||
# Reset cluster if we drop below threshold (need consecutive days)
|
|
||||||
if len(confirmation_cluster) < min_confirmations and harvest_first_day is not None:
|
|
||||||
print(f" ⊘ Confirmation cluster broken after {len(confirmation_cluster)} days, resetting")
|
|
||||||
confirmation_cluster = []
|
|
||||||
harvest_first_day = None
|
|
||||||
|
|
||||||
# Check if we have enough confirmations
|
|
||||||
if len(confirmation_cluster) >= min_confirmations and harvest_first_day is not None:
|
|
||||||
print(f" ✓ Harvest CONFIRMED at day {harvest_first_day} ({data_df.iloc[harvest_first_day]['Date']}) with peak prob={peak_prob_in_event:.4f}")
|
|
||||||
print(f" (Confirmed over {len(confirmation_cluster)} consecutive checks)")
|
|
||||||
detected_harvests.append((harvest_first_day, data_df.iloc[harvest_first_day]['Date'], peak_prob_in_event))
|
|
||||||
|
|
||||||
# Move to next season: start right after last confirmation (use first day as anchor)
|
|
||||||
current_start = harvest_first_day + 1
|
|
||||||
harvest_event_id += 1
|
|
||||||
break
|
|
||||||
|
|
||||||
# If no harvest detected in this pass, stop
|
|
||||||
if harvest_first_day is None:
|
|
||||||
print(f" • No harvest confirmed in this window, moving to end")
|
|
||||||
break
|
|
||||||
|
|
||||||
results_df = pd.DataFrame(results)
|
|
||||||
print(f"\n✓ Iterative detection complete: found {len(detected_harvests)} harvests")
|
|
||||||
return results_df, detected_harvests, data_df
|
|
||||||
"""
|
|
||||||
Run inference on full multi-year sequence with state resets.
|
|
||||||
|
|
||||||
Strategy:
|
|
||||||
1. Detect CI patterns to identify potential season boundaries
|
|
||||||
2. For each potential season, run inference with limited lookback window
|
|
||||||
3. This simulates fresh model state for each new season
|
|
||||||
|
|
||||||
Args:
|
|
||||||
field_name: Field ID
|
|
||||||
data_df: Full CI sequence (sorted by Date)
|
|
||||||
model: Loaded LSTM model
|
|
||||||
scalers: Feature scalers
|
|
||||||
config: Model config
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
results_df: DataFrame with check_day, date, detected_prob, season_id
|
|
||||||
estimated_harvests: List of (day, date, peak_prob) tuples
|
|
||||||
"""
|
|
||||||
print(f"\nProcessing field {field_name}...")
|
|
||||||
print(f"Sequence length: {len(data_df)} days")
|
|
||||||
|
|
||||||
data_df = data_df.sort_values('Date').reset_index(drop=True)
|
|
||||||
|
|
||||||
# Strategy 1: Detect potential season boundaries by looking for CI resets (low values)
|
|
||||||
# CI typically resets to low (~0.5-1.0) after harvest
|
|
||||||
ci_vals = data_df['FitData'].values if 'FitData' in data_df.columns else None
|
|
||||||
|
|
||||||
season_boundaries = [0] # Start of sequence
|
|
||||||
|
|
||||||
if ci_vals is not None:
|
|
||||||
# Find points where CI is low (< 1.5) after being high (> 2.0)
|
|
||||||
# This suggests harvest + new season start
|
|
||||||
for i in range(1, len(ci_vals)):
|
|
||||||
if ci_vals[i] < 1.5 and i > 100: # Low CI, enough data before
|
|
||||||
# Check if there was high CI before (last 30 days)
|
|
||||||
prev_ci_max = np.max(ci_vals[max(0, i-30):i])
|
|
||||||
if prev_ci_max > 2.5:
|
|
||||||
# Potential season boundary
|
|
||||||
season_boundaries.append(i)
|
|
||||||
|
|
||||||
# Remove duplicates and sort
|
|
||||||
season_boundaries = sorted(set(season_boundaries))
|
|
||||||
print(f"Detected {len(season_boundaries)} potential season boundaries at days: {season_boundaries[:10]}...")
|
|
||||||
|
|
||||||
check_days = list(range(7, len(data_df), 7)) # Every 7 days
|
|
||||||
print(f"Running inference at {len(check_days)} check points...")
|
|
||||||
|
|
||||||
results = []
|
|
||||||
|
|
||||||
for check_day in check_days:
|
|
||||||
# Determine which season this check_day falls into
|
|
||||||
season_id = 0
|
|
||||||
for sb_idx, boundary in enumerate(season_boundaries[1:], 1):
|
|
||||||
if check_day >= boundary:
|
|
||||||
season_id = sb_idx
|
|
||||||
|
|
||||||
# Use state-reset inference: only look back from current season boundary
|
|
||||||
season_start = season_boundaries[season_id]
|
|
||||||
imminent_prob, detected_prob = predict_with_state_reset(
|
|
||||||
model, data_df, season_start, check_day, scalers, config, window_size=200
|
|
||||||
)
|
|
||||||
|
|
||||||
if imminent_prob is None:
|
|
||||||
continue
|
|
||||||
|
|
||||||
check_row = data_df.iloc[check_day]
|
|
||||||
|
|
||||||
results.append({
|
|
||||||
'day': check_day,
|
|
||||||
'date': check_row['Date'],
|
|
||||||
'imminent_prob': imminent_prob,
|
|
||||||
'detected_prob': detected_prob,
|
|
||||||
'season_id': season_id,
|
|
||||||
'ci_raw': check_row['FitData'] if 'FitData' in check_row else None,
|
|
||||||
})
|
|
||||||
|
|
||||||
results_df = pd.DataFrame(results)
|
|
||||||
|
|
||||||
# Extract harvest spikes (now with state reset, should see proper spikes)
|
|
||||||
detected_probs = results_df['detected_prob'].values
|
|
||||||
estimated_harvests = extract_harvest_dates(detected_probs, np.array(check_days), data_df,
|
|
||||||
threshold=DETECTED_THRESHOLD, min_days_between=100)
|
|
||||||
|
|
||||||
print(f"\nEstimated {len(estimated_harvests)} harvest events:")
|
|
||||||
for day, date, prob in estimated_harvests:
|
|
||||||
print(f" Day {day}: {date} (prob={prob:.3f})")
|
|
||||||
|
|
||||||
return results_df, estimated_harvests, data_df
|
|
||||||
|
|
||||||
|
|
||||||
def detect_actual_harvest_dates(data_df):
|
|
||||||
"""
|
|
||||||
Detect actual harvest dates by finding DOY resets.
|
|
||||||
When DOY drops from high (>300) to low (<50), a harvest occurred.
|
|
||||||
|
|
||||||
Returns list of day indices where harvest occurred.
|
|
||||||
"""
|
|
||||||
if 'DOY' not in data_df.columns:
|
|
||||||
return []
|
|
||||||
|
|
||||||
doy = data_df['DOY'].values
|
|
||||||
harvest_days = []
|
|
||||||
|
|
||||||
for i in range(1, len(doy)):
|
|
||||||
# Check if DOY reset (high to low transition)
|
|
||||||
if doy[i-1] > 300 and doy[i] < 50:
|
|
||||||
# Harvest occurred around this transition
|
|
||||||
harvest_days.append(i-1) # Last day of previous season
|
|
||||||
|
|
||||||
return harvest_days
|
|
||||||
|
|
||||||
|
|
||||||
def visualize_multi_year(field_name, results_df, estimated_harvests, full_data_df, output_dir="multi_year_analysis"):
|
|
||||||
"""Generate visualization of detected_prob and CI over full multi-year sequence."""
|
|
||||||
output_dir = Path(output_dir)
|
|
||||||
output_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 10))
|
|
||||||
|
|
||||||
# Plot 1: detected_prob over time with harvest spikes
|
|
||||||
ax1.plot(results_df['day'], results_df['detected_prob'], 'o-', color='red', label='Detected Prob', linewidth=2, markersize=4)
|
|
||||||
ax1.axhline(DETECTED_THRESHOLD, color='darkred', linestyle='--', linewidth=2, alpha=0.7, label=f'Threshold ({DETECTED_THRESHOLD})')
|
|
||||||
|
|
||||||
# Mark estimated harvests (from model detection)
|
|
||||||
for day, date, prob in estimated_harvests:
|
|
||||||
ax1.scatter(day, prob, s=300, color='darkgreen', marker='*', edgecolors='black', linewidth=2, zorder=5)
|
|
||||||
ax1.axvline(day, color='darkgreen', linestyle=':', alpha=0.5, linewidth=1.5, label='Estimated Harvest')
|
|
||||||
|
|
||||||
# Mark actual harvest dates if present in data
|
|
||||||
if 'harvest_detected' in full_data_df.columns:
|
|
||||||
actual_harvest_days = np.where(full_data_df['harvest_detected'] == 1)[0]
|
|
||||||
print(f"\n✓ Found {len(actual_harvest_days)} actual harvest dates in data: {actual_harvest_days.tolist()}")
|
|
||||||
for harvest_day in actual_harvest_days:
|
|
||||||
ax1.axvline(harvest_day, color='black', linestyle='-', alpha=0.9, linewidth=4, label='Actual Harvest')
|
|
||||||
else:
|
|
||||||
# Detect from DOY resets instead
|
|
||||||
actual_harvest_days = detect_actual_harvest_dates(full_data_df)
|
|
||||||
print(f"\n✓ Detected {len(actual_harvest_days)} actual harvest dates from DOY resets: {actual_harvest_days}")
|
|
||||||
for harvest_day in actual_harvest_days:
|
|
||||||
ax1.axvline(harvest_day, color='black', linestyle='--', alpha=0.8, linewidth=3, label='Actual Harvest')
|
|
||||||
|
|
||||||
ax1.set_xlabel('Day in Sequence', fontsize=12, fontweight='bold')
|
|
||||||
ax1.set_ylabel('Detected Probability', fontsize=12, fontweight='bold')
|
|
||||||
ax1.set_ylim(-0.05, 1.05)
|
|
||||||
ax1.grid(alpha=0.3)
|
|
||||||
# Remove duplicate labels from legend
|
|
||||||
handles, labels = ax1.get_legend_handles_labels()
|
|
||||||
by_label = dict(zip(labels, handles))
|
|
||||||
ax1.legend(by_label.values(), by_label.keys(), fontsize=10)
|
|
||||||
ax1.set_title(f'Field {field_name} - Multi-Year Harvest Detection (Detected Signal)', fontsize=13, fontweight='bold')
|
|
||||||
|
|
||||||
# Plot 2: CI over full sequence with harvest markers
|
|
||||||
days_idx = np.arange(len(full_data_df))
|
|
||||||
ci_raw = full_data_df['FitData'].values if 'FitData' in full_data_df.columns else None
|
|
||||||
|
|
||||||
if ci_raw is not None:
|
|
||||||
ax2.plot(days_idx, ci_raw, color='seagreen', label='Raw CI', linewidth=1, alpha=0.5, linestyle=':')
|
|
||||||
|
|
||||||
# Compute 7-day moving average
|
|
||||||
ci_7d_ma = full_data_df['FitData'].rolling(window=7, min_periods=1).mean().values
|
|
||||||
ax2.plot(days_idx, ci_7d_ma, color='darkgreen', label='7-day MA', linewidth=2, alpha=0.8)
|
|
||||||
|
|
||||||
# Mark estimated harvests on CI plot
|
|
||||||
for day, date, prob in estimated_harvests:
|
|
||||||
if day < len(full_data_df):
|
|
||||||
ci_val = full_data_df.iloc[day]['FitData']
|
|
||||||
ax2.scatter(day, ci_val, s=300, color='red', marker='*', edgecolors='black', linewidth=2, zorder=5, label='Estimated Harvest')
|
|
||||||
ax2.axvline(day, color='red', linestyle=':', alpha=0.5, linewidth=1.5)
|
|
||||||
|
|
||||||
# Mark actual harvest dates on CI plot
|
|
||||||
if 'harvest_detected' in full_data_df.columns:
|
|
||||||
actual_harvest_days = np.where(full_data_df['harvest_detected'] == 1)[0]
|
|
||||||
for harvest_day in actual_harvest_days:
|
|
||||||
if harvest_day < len(full_data_df):
|
|
||||||
ci_val = full_data_df.iloc[harvest_day]['FitData']
|
|
||||||
ax2.scatter(harvest_day, ci_val, s=250, color='black', marker='X', edgecolors='white', linewidth=2, zorder=6, label='Actual Harvest')
|
|
||||||
ax2.axvline(harvest_day, color='black', linestyle='-', alpha=0.9, linewidth=4)
|
|
||||||
else:
|
|
||||||
# Detect from DOY resets instead
|
|
||||||
actual_harvest_days = detect_actual_harvest_dates(full_data_df)
|
|
||||||
for harvest_day in actual_harvest_days:
|
|
||||||
if harvest_day < len(full_data_df):
|
|
||||||
ci_val = full_data_df.iloc[harvest_day]['FitData']
|
|
||||||
ax2.scatter(harvest_day, ci_val, s=250, color='black', marker='X', edgecolors='white', linewidth=2, zorder=6, label='Actual Harvest')
|
|
||||||
ax2.axvline(harvest_day, color='black', linestyle='--', alpha=0.8, linewidth=3)
|
|
||||||
|
|
||||||
ax2.set_xlabel('Day in Sequence', fontsize=12, fontweight='bold')
|
|
||||||
ax2.set_ylabel('CI Value', fontsize=12, fontweight='bold')
|
|
||||||
ax2.grid(alpha=0.3)
|
|
||||||
# Remove duplicate labels from legend
|
|
||||||
handles, labels = ax2.get_legend_handles_labels()
|
|
||||||
by_label = dict(zip(labels, handles))
|
|
||||||
ax2.legend(by_label.values(), by_label.keys(), fontsize=10)
|
|
||||||
ax2.set_title(f'Field {field_name} - CI Sequence with Estimated Harvest Dates', fontsize=13, fontweight='bold')
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
output_file = output_dir / f"multi_year_harvest_detection_{field_name}.png"
|
|
||||||
plt.savefig(output_file, dpi=100, bbox_inches='tight')
|
|
||||||
print(f"\nVisualization saved: {output_file}")
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
|
|
||||||
def export_results(field_name, results_df, detected_harvests, data_df, output_dir="multi_year_analysis"):
|
|
||||||
"""
|
|
||||||
Export results to CSV with harvest dates, DOY, and comparison to actual harvests.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
field_name: Field ID
|
|
||||||
results_df: Full inference results
|
|
||||||
detected_harvests: List of (day, date, prob) tuples from model
|
|
||||||
data_df: Full data with potential actual harvest information
|
|
||||||
output_dir: Output directory
|
|
||||||
"""
|
|
||||||
output_dir = Path(output_dir)
|
|
||||||
output_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# Export full inference results
|
|
||||||
results_file = output_dir / f"inference_results_{field_name}.csv"
|
|
||||||
results_df.to_csv(results_file, index=False)
|
|
||||||
print(f"Inference results: {results_file}")
|
|
||||||
|
|
||||||
# Detect actual harvests from DOY resets
|
|
||||||
actual_harvest_days = detect_actual_harvest_dates(data_df)
|
|
||||||
print(f" Actual harvests detected from DOY resets: {actual_harvest_days}")
|
|
||||||
|
|
||||||
# Export detected harvests with DOY, date, and comparison to actual
|
|
||||||
if detected_harvests:
|
|
||||||
harvests_data = []
|
|
||||||
for day, date, prob in detected_harvests:
|
|
||||||
# Parse date and calculate DOY
|
|
||||||
if isinstance(date, str):
|
|
||||||
date_obj = pd.to_datetime(date)
|
|
||||||
else:
|
|
||||||
date_obj = date
|
|
||||||
doy = date_obj.dayofyear
|
|
||||||
year = date_obj.year
|
|
||||||
|
|
||||||
# Find nearest actual harvest and calculate days difference
|
|
||||||
nearest_actual_day = None
|
|
||||||
days_from_actual = None
|
|
||||||
actual_harvest_date = None
|
|
||||||
|
|
||||||
if actual_harvest_days:
|
|
||||||
# Find closest actual harvest
|
|
||||||
differences = [abs(day - actual_day) for actual_day in actual_harvest_days]
|
|
||||||
min_idx = np.argmin(differences)
|
|
||||||
nearest_actual_day = actual_harvest_days[min_idx]
|
|
||||||
days_from_actual = day - nearest_actual_day # Negative = before actual, positive = after
|
|
||||||
|
|
||||||
if nearest_actual_day < len(data_df):
|
|
||||||
actual_date_obj = data_df.iloc[nearest_actual_day]['Date']
|
|
||||||
if isinstance(actual_date_obj, str):
|
|
||||||
actual_date_obj = pd.to_datetime(actual_date_obj)
|
|
||||||
actual_harvest_date = actual_date_obj.strftime('%Y-%m-%d')
|
|
||||||
|
|
||||||
harvests_data.append({
|
|
||||||
'day_in_sequence': day,
|
|
||||||
'detected_date': date_obj.strftime('%Y-%m-%d'),
|
|
||||||
'doy': doy,
|
|
||||||
'year': year,
|
|
||||||
'peak_prob': prob,
|
|
||||||
'nearest_actual_harvest_date': actual_harvest_date,
|
|
||||||
'days_from_actual_harvest': days_from_actual
|
|
||||||
})
|
|
||||||
|
|
||||||
harvests_df = pd.DataFrame(harvests_data)
|
|
||||||
harvests_file = output_dir / f"detected_harvests_{field_name}.csv"
|
|
||||||
harvests_df.to_csv(harvests_file, index=False)
|
|
||||||
print(f"\nDetected Harvests Summary:")
|
|
||||||
print(harvests_df.to_string(index=False))
|
|
||||||
print(f"\nHarvest log saved: {harvests_file}")
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("="*80)
|
|
||||||
print("MULTI-YEAR HARVEST DETECTION: Field 00300 Full Sequence Test")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Load model
|
|
||||||
print("\n[1/4] Loading Model 307...")
|
|
||||||
model, config, scalers = load_model_and_config()
|
|
||||||
|
|
||||||
# Load all data
|
|
||||||
print("\n[2/4] Loading all data...")
|
|
||||||
df = load_harvest_data(DATA_FILE)
|
|
||||||
print(f"Total rows: {len(df)}")
|
|
||||||
|
|
||||||
# Filter to target field
|
|
||||||
field_data = df[df['field'] == FIELD_TO_TEST].copy()
|
|
||||||
if len(field_data) == 0:
|
|
||||||
print(f"ERROR: Field {FIELD_TO_TEST} not found!")
|
|
||||||
return
|
|
||||||
|
|
||||||
print(f"Field {FIELD_TO_TEST} data: {len(field_data)} rows")
|
|
||||||
|
|
||||||
# Skip first N days if specified
|
|
||||||
if SKIP_FIRST_DAYS > 0:
|
|
||||||
print(f"\n⚠ Skipping first {SKIP_FIRST_DAYS} days to simulate mid-season start")
|
|
||||||
field_data = field_data.iloc[SKIP_FIRST_DAYS:].reset_index(drop=True)
|
|
||||||
print(f"Remaining data: {len(field_data)} rows")
|
|
||||||
|
|
||||||
print(f"\nData range: {field_data['Date'].min()} to {field_data['Date'].max()}")
|
|
||||||
|
|
||||||
# Run inference
|
|
||||||
print("\n[3/4] Running iterative harvest detection...")
|
|
||||||
results_df, detected_harvests, full_data = run_iterative_harvest_detection(
|
|
||||||
FIELD_TO_TEST, field_data, model, scalers, config
|
|
||||||
)
|
|
||||||
|
|
||||||
# Generate outputs
|
|
||||||
print("\n[4/4] Generating outputs...")
|
|
||||||
visualize_multi_year(FIELD_TO_TEST, results_df, detected_harvests, full_data)
|
|
||||||
export_results(FIELD_TO_TEST, results_df, detected_harvests, full_data)
|
|
||||||
|
|
||||||
print(f"\n✓ Multi-year harvest detection complete!")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
Before Width: | Height: | Size: 272 KiB |
|
|
@ -1,104 +0,0 @@
|
||||||
"""
|
|
||||||
Summarize batch harvest detection results.
|
|
||||||
Reads all detected_harvests_*.csv files and computes accuracy metrics.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
BATCH_DIR = Path("multi_year_analysis_batch")
|
|
||||||
|
|
||||||
def main():
|
|
||||||
# Find all detected_harvests CSV files
|
|
||||||
harvest_files = sorted(BATCH_DIR.glob("detected_harvests_*.csv"))
|
|
||||||
|
|
||||||
print(f"Found {len(harvest_files)} field results")
|
|
||||||
|
|
||||||
all_errors = []
|
|
||||||
field_summaries = []
|
|
||||||
|
|
||||||
for filepath in harvest_files:
|
|
||||||
try:
|
|
||||||
df = pd.read_csv(filepath)
|
|
||||||
if len(df) == 0:
|
|
||||||
continue
|
|
||||||
|
|
||||||
field_id = filepath.stem.replace("detected_harvests_", "")
|
|
||||||
errors = df['days_from_actual_harvest'].values
|
|
||||||
|
|
||||||
field_summaries.append({
|
|
||||||
'field': field_id,
|
|
||||||
'detections': len(errors),
|
|
||||||
'mean_error': np.mean(np.abs(errors)), # Use absolute value
|
|
||||||
'median_error': np.median(np.abs(errors)),
|
|
||||||
'std_dev': np.std(np.abs(errors)),
|
|
||||||
'min_error': np.min(np.abs(errors)),
|
|
||||||
'max_error': np.max(np.abs(errors)),
|
|
||||||
'early_detections': np.sum(errors < 0), # How many predicted early
|
|
||||||
'late_detections': np.sum(errors > 0), # How many predicted late
|
|
||||||
})
|
|
||||||
|
|
||||||
all_errors.extend(np.abs(errors))
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Error reading {filepath}: {e}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Convert to array for statistics
|
|
||||||
all_errors = np.array(all_errors)
|
|
||||||
|
|
||||||
# Remove extreme outliers (>180 days off - likely data quality issues)
|
|
||||||
all_errors_filtered = all_errors[all_errors <= 180]
|
|
||||||
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("OVERALL ACCURACY STATISTICS")
|
|
||||||
print("="*80)
|
|
||||||
print(f"Total detections across all fields: {len(all_errors)}")
|
|
||||||
print(f" (Filtered to: {len(all_errors_filtered)} detections ≤180 days error)")
|
|
||||||
print(f"Total fields processed: {len(field_summaries)}")
|
|
||||||
print(f"\nMean error: {np.mean(all_errors_filtered):.2f} days")
|
|
||||||
print(f"Median error: {np.median(all_errors_filtered):.2f} days")
|
|
||||||
print(f"Std dev: {np.std(all_errors_filtered):.2f} days")
|
|
||||||
print(f"Min error: {np.min(all_errors_filtered):.0f} days")
|
|
||||||
print(f"Max error: {np.max(all_errors_filtered):.0f} days")
|
|
||||||
|
|
||||||
print(f"\nPercentiles:")
|
|
||||||
for p in [10, 25, 50, 75, 90, 95]:
|
|
||||||
print(f" {p}th: {np.percentile(all_errors_filtered, p):.1f} days")
|
|
||||||
|
|
||||||
print(f"\nWithin threshold:")
|
|
||||||
for threshold in [3, 7, 14, 21, 30]:
|
|
||||||
count = np.sum(all_errors_filtered <= threshold)
|
|
||||||
pct = 100 * count / len(all_errors_filtered)
|
|
||||||
print(f" ≤ {threshold} days: {pct:.1f}% ({count}/{len(all_errors_filtered)})")
|
|
||||||
|
|
||||||
# Field-level summary
|
|
||||||
print(f"\n" + "="*80)
|
|
||||||
print("TOP 15 BEST PERFORMING FIELDS (lowest mean error)")
|
|
||||||
print("="*80)
|
|
||||||
df_fields = pd.DataFrame(field_summaries)
|
|
||||||
df_fields = df_fields.sort_values('mean_error')
|
|
||||||
print(df_fields.head(15).to_string(index=False))
|
|
||||||
|
|
||||||
print(f"\n" + "="*80)
|
|
||||||
print("FIELDS WITH HIGHEST ERRORS")
|
|
||||||
print("="*80)
|
|
||||||
df_fields = df_fields.sort_values('mean_error', ascending=False)
|
|
||||||
print(df_fields.head(15).to_string(index=False))
|
|
||||||
|
|
||||||
# Save summary
|
|
||||||
summary_file = BATCH_DIR / "accuracy_summary.csv"
|
|
||||||
df_fields.to_csv(summary_file, index=False)
|
|
||||||
print(f"\n✓ Summary saved to: {summary_file}")
|
|
||||||
|
|
||||||
# Statistics by number of detections
|
|
||||||
print(f"\n" + "="*80)
|
|
||||||
print("FIELDS BY NUMBER OF DETECTIONS")
|
|
||||||
print("="*80)
|
|
||||||
det_counts = df_fields['detections'].value_counts().sort_index(ascending=False)
|
|
||||||
for num_det, count in det_counts.items():
|
|
||||||
avg_error = df_fields[df_fields['detections'] == num_det]['mean_error'].mean()
|
|
||||||
print(f" {num_det} detections: {count} fields (avg error: {avg_error:.2f} days)")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,157 +0,0 @@
|
||||||
"""
|
|
||||||
Phase 2 Debug: Check probability values in season windows
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
from pathlib import Path
|
|
||||||
import torch
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent))
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent / 'src'))
|
|
||||||
|
|
||||||
from multi_year_harvest_detection import (
|
|
||||||
load_model_and_config, load_harvest_data,
|
|
||||||
detect_actual_harvest_dates, DATA_FILE, DEVICE
|
|
||||||
)
|
|
||||||
from feature_engineering import extract_features
|
|
||||||
|
|
||||||
OUTPUT_DIR = Path("phase2_refinement")
|
|
||||||
OUTPUT_DIR.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
def predict_season_window_debug(model, window_df, season_start_day, scalers, config):
|
|
||||||
"""Run inference and return all probabilities for debugging."""
|
|
||||||
results = []
|
|
||||||
|
|
||||||
for i in range(len(window_df)):
|
|
||||||
lookback_df = window_df.iloc[:i+1].copy()
|
|
||||||
|
|
||||||
# Reset DOY
|
|
||||||
days_from_start = np.arange(len(lookback_df))
|
|
||||||
lookback_df['DOY'] = (days_from_start % 365) + 1
|
|
||||||
|
|
||||||
# Extract features
|
|
||||||
features = extract_features(lookback_df, config['features'], config['data']['ci_column'])
|
|
||||||
if features is None or np.any(np.isnan(features)):
|
|
||||||
results.append(np.nan)
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Normalize
|
|
||||||
features_scaled = features.copy()
|
|
||||||
for fi in range(len(features_scaled[0])):
|
|
||||||
try:
|
|
||||||
features_scaled[:, fi] = scalers[fi].transform(features_scaled[:, fi].reshape(-1, 1)).flatten()
|
|
||||||
except:
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Inference
|
|
||||||
window_size = 200
|
|
||||||
if len(features_scaled) < window_size:
|
|
||||||
pad_width = window_size - len(features_scaled)
|
|
||||||
features_scaled = np.pad(features_scaled, ((pad_width, 0), (0, 0)), mode='edge')
|
|
||||||
|
|
||||||
X = torch.FloatTensor(features_scaled[-window_size:]).unsqueeze(0).to(DEVICE)
|
|
||||||
with torch.no_grad():
|
|
||||||
outputs = model(X)
|
|
||||||
|
|
||||||
if isinstance(outputs, tuple):
|
|
||||||
detected_tensor = outputs[1]
|
|
||||||
if detected_tensor.dim() == 3:
|
|
||||||
detected_prob = detected_tensor[0, -1, 0].item()
|
|
||||||
else:
|
|
||||||
detected_prob = detected_tensor[0, -1].item()
|
|
||||||
else:
|
|
||||||
detected_prob = outputs[0, 1].item()
|
|
||||||
|
|
||||||
results.append(detected_prob)
|
|
||||||
|
|
||||||
return np.array(results)
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("Phase 2 Debug: Checking probability distributions")
|
|
||||||
|
|
||||||
# Load model
|
|
||||||
print("Loading Model 307...")
|
|
||||||
model, config, scalers = load_model_and_config()
|
|
||||||
|
|
||||||
# Load data
|
|
||||||
print("Loading data...")
|
|
||||||
full_data = load_harvest_data(DATA_FILE)
|
|
||||||
|
|
||||||
# Get field 00300
|
|
||||||
field_id = "00300"
|
|
||||||
field_data = full_data[full_data['field'] == field_id].copy()
|
|
||||||
field_data = field_data.sort_values('Date').reset_index(drop=True)
|
|
||||||
|
|
||||||
# Load phase 1 results
|
|
||||||
phase1_df = pd.read_csv(Path("multi_year_analysis_batch") / f"detected_harvests_{field_id}.csv")
|
|
||||||
|
|
||||||
# Get actual harvests
|
|
||||||
actual_harvest_days = detect_actual_harvest_dates(field_data)
|
|
||||||
|
|
||||||
print(f"\nField {field_id}: {len(field_data)} rows")
|
|
||||||
print(f"Actual harvests: {actual_harvest_days}")
|
|
||||||
|
|
||||||
# Process first harvest only
|
|
||||||
row = phase1_df.iloc[0]
|
|
||||||
est_harvest_day = row['day_in_sequence']
|
|
||||||
actual_day = actual_harvest_days[0] if len(actual_harvest_days) > 0 else None
|
|
||||||
|
|
||||||
# Extract season window
|
|
||||||
prev_harvest_day = None
|
|
||||||
season_start = max(0, est_harvest_day - 40) if prev_harvest_day is None else prev_harvest_day - 40
|
|
||||||
season_end = min(len(field_data) - 1, est_harvest_day + 40)
|
|
||||||
window_df = field_data.iloc[season_start:season_end+1].copy()
|
|
||||||
|
|
||||||
print(f"\n--- Harvest {row['detected_date']} ---")
|
|
||||||
print(f" Phase 1 day: {est_harvest_day}")
|
|
||||||
print(f" Actual day: {actual_day}")
|
|
||||||
print(f" Season window: [{season_start}:{season_end}] ({len(window_df)} days)")
|
|
||||||
|
|
||||||
# Get probabilities
|
|
||||||
print(f"\nRunning inference on window...")
|
|
||||||
detected_probs = predict_season_window_debug(model, window_df, season_start, scalers, config)
|
|
||||||
|
|
||||||
print(f"Probability statistics:")
|
|
||||||
print(f" Min: {np.nanmin(detected_probs):.4f}")
|
|
||||||
print(f" Max: {np.nanmax(detected_probs):.4f}")
|
|
||||||
print(f" Mean: {np.nanmean(detected_probs):.4f}")
|
|
||||||
print(f" Median: {np.nanmedian(detected_probs):.4f}")
|
|
||||||
print(f" Days > 0.2: {np.sum(detected_probs > 0.2)}")
|
|
||||||
print(f" Days > 0.3: {np.sum(detected_probs > 0.3)}")
|
|
||||||
print(f" Days > 0.4: {np.sum(detected_probs > 0.4)}")
|
|
||||||
print(f" Days > 0.5: {np.sum(detected_probs > 0.5)}")
|
|
||||||
|
|
||||||
# Plot
|
|
||||||
fig, ax = plt.subplots(figsize=(14, 6))
|
|
||||||
window_days = np.arange(len(detected_probs))
|
|
||||||
ax.plot(window_days, detected_probs, 'o-', color='steelblue', linewidth=2, markersize=6, label='Detected Prob')
|
|
||||||
ax.axhline(0.5, color='red', linestyle='--', linewidth=2, alpha=0.7, label='0.5 Threshold')
|
|
||||||
ax.axhline(0.4, color='orange', linestyle='--', linewidth=1.5, alpha=0.5, label='0.4 Threshold')
|
|
||||||
ax.axhline(0.2, color='green', linestyle='--', linewidth=1.5, alpha=0.5, label='0.2 Threshold (Phase 1)')
|
|
||||||
|
|
||||||
# Mark actual harvest (relative to window)
|
|
||||||
if actual_day is not None:
|
|
||||||
rel_actual_day = actual_day - season_start
|
|
||||||
if 0 <= rel_actual_day < len(window_df):
|
|
||||||
ax.scatter(rel_actual_day, detected_probs[rel_actual_day], s=300, color='red', marker='*',
|
|
||||||
edgecolors='black', linewidth=2, zorder=5, label=f'Actual harvest (day {actual_day})')
|
|
||||||
|
|
||||||
ax.set_xlabel('Day in Season Window', fontsize=12, fontweight='bold')
|
|
||||||
ax.set_ylabel('Detected Probability', fontsize=12, fontweight='bold')
|
|
||||||
ax.set_title(f'Phase 2 Probability Curve: Field {field_id}, Harvest {row["detected_date"]}',
|
|
||||||
fontsize=13, fontweight='bold')
|
|
||||||
ax.legend()
|
|
||||||
ax.grid(alpha=0.3)
|
|
||||||
ax.set_ylim(-0.05, 1.05)
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
plot_file = OUTPUT_DIR / f"phase2_debug_{field_id}_harvest0.png"
|
|
||||||
plt.savefig(plot_file, dpi=100, bbox_inches='tight')
|
|
||||||
print(f"\nPlot saved: {plot_file}")
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,338 +0,0 @@
|
||||||
"""
|
|
||||||
Phase 2: Harvest Date Refinement
|
|
||||||
For each Phase 1 estimated harvest, extract full season (+40d before/after)
|
|
||||||
and find precise harvest date where detected_prob >= 0.5 (sustained).
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
from pathlib import Path
|
|
||||||
import torch
|
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent))
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent / 'src'))
|
|
||||||
|
|
||||||
from multi_year_harvest_detection import (
|
|
||||||
load_model_and_config, load_harvest_data,
|
|
||||||
detect_actual_harvest_dates, DATA_FILE, DEVICE
|
|
||||||
)
|
|
||||||
from feature_engineering import extract_features
|
|
||||||
|
|
||||||
OUTPUT_DIR = Path("phase2_refinement")
|
|
||||||
OUTPUT_DIR.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
def extract_season_window(data_df, prev_harvest_day, est_harvest_day, margin=40):
|
|
||||||
"""
|
|
||||||
Extract season window: [prev_harvest - margin : est_harvest + margin]
|
|
||||||
If prev_harvest is None, use first data point.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
(window_start_idx, window_end_idx, window_df)
|
|
||||||
"""
|
|
||||||
season_start = max(0, prev_harvest_day - margin) if prev_harvest_day is not None else 0
|
|
||||||
season_end = min(len(data_df) - 1, est_harvest_day + margin)
|
|
||||||
|
|
||||||
window_df = data_df.iloc[season_start:season_end+1].copy()
|
|
||||||
return season_start, season_end, window_df
|
|
||||||
|
|
||||||
|
|
||||||
def predict_season_window(model, window_df, season_start_day, scalers, config):
|
|
||||||
"""
|
|
||||||
Run inference on season window with DOY reset.
|
|
||||||
Returns array of detected_prob values for each row.
|
|
||||||
"""
|
|
||||||
results = []
|
|
||||||
|
|
||||||
for i in range(len(window_df)):
|
|
||||||
check_day = season_start_day + i
|
|
||||||
|
|
||||||
# Prepare lookback window (use all available data up to check_day)
|
|
||||||
lookback_df = window_df.iloc[:i+1].copy()
|
|
||||||
|
|
||||||
# Reset DOY relative to season start
|
|
||||||
days_from_start = np.arange(len(lookback_df))
|
|
||||||
lookback_df['DOY'] = (days_from_start % 365) + 1
|
|
||||||
|
|
||||||
# Extract features
|
|
||||||
features = extract_features(lookback_df, config['features'], config['data']['ci_column'])
|
|
||||||
if features is None or np.any(np.isnan(features)):
|
|
||||||
results.append(np.nan)
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Normalize features
|
|
||||||
features_scaled = features.copy()
|
|
||||||
for fi in range(len(features_scaled[0])):
|
|
||||||
try:
|
|
||||||
features_scaled[:, fi] = scalers[fi].transform(features_scaled[:, fi].reshape(-1, 1)).flatten()
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Pad to window size
|
|
||||||
window_size = 200
|
|
||||||
if len(features_scaled) < window_size:
|
|
||||||
pad_width = window_size - len(features_scaled)
|
|
||||||
features_scaled = np.pad(features_scaled, ((pad_width, 0), (0, 0)), mode='edge')
|
|
||||||
|
|
||||||
# Inference
|
|
||||||
X = torch.FloatTensor(features_scaled[-window_size:]).unsqueeze(0).to(DEVICE)
|
|
||||||
with torch.no_grad():
|
|
||||||
outputs = model(X)
|
|
||||||
|
|
||||||
# Handle tuple output (imminent, detected) - get last timestep
|
|
||||||
if isinstance(outputs, tuple):
|
|
||||||
detected_tensor = outputs[1] # [batch, seq_len] or [batch, seq_len, 1]
|
|
||||||
if detected_tensor.dim() == 3:
|
|
||||||
detected_prob = detected_tensor[0, -1, 0].item()
|
|
||||||
else:
|
|
||||||
detected_prob = detected_tensor[0, -1].item()
|
|
||||||
else:
|
|
||||||
detected_prob = outputs[0, 1].item()
|
|
||||||
|
|
||||||
results.append(detected_prob)
|
|
||||||
|
|
||||||
return np.array(results)
|
|
||||||
|
|
||||||
|
|
||||||
def find_sustained_threshold_crossing(detected_probs, threshold=0.4, min_sustained=2):
|
|
||||||
"""
|
|
||||||
Find first time detected_prob stays >= threshold for min_sustained consecutive readings.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
(day_index, sustained_day_count, peak_prob_in_window)
|
|
||||||
"""
|
|
||||||
crossing_days = []
|
|
||||||
current_streak = 0
|
|
||||||
streak_start = None
|
|
||||||
|
|
||||||
for i, prob in enumerate(detected_probs):
|
|
||||||
if prob >= threshold:
|
|
||||||
if current_streak == 0:
|
|
||||||
streak_start = i
|
|
||||||
current_streak += 1
|
|
||||||
|
|
||||||
if current_streak >= min_sustained:
|
|
||||||
# Return the first day of the streak
|
|
||||||
return streak_start, current_streak, np.max(detected_probs[streak_start:i+1])
|
|
||||||
else:
|
|
||||||
current_streak = 0
|
|
||||||
|
|
||||||
# No sustained crossing found
|
|
||||||
return None, None, None
|
|
||||||
|
|
||||||
|
|
||||||
def process_field_refinement(field_id, phase1_harvests_df, full_data_df, model, scalers, config):
|
|
||||||
"""
|
|
||||||
Refine Phase 1 harvest dates using Phase 2 logic.
|
|
||||||
|
|
||||||
CRITICAL: Use Phase 1 ESTIMATES to define season boundaries, NOT actual harvest dates.
|
|
||||||
This simulates production environment where actual dates are unknown.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
field_id: Field identifier
|
|
||||||
phase1_harvests_df: DataFrame with columns [day_in_sequence, detected_date, nearest_actual_harvest_date, ...]
|
|
||||||
full_data_df: Full sequence data
|
|
||||||
model, scalers, config: Model info
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
refinements_list: List of dicts with phase1/phase2/actual comparisons
|
|
||||||
"""
|
|
||||||
refinements = []
|
|
||||||
|
|
||||||
# Get actual harvest dates from DOY resets (FOR VALIDATION ONLY - NOT USED IN LOGIC)
|
|
||||||
actual_harvest_days = detect_actual_harvest_dates(full_data_df)
|
|
||||||
|
|
||||||
# Create list of Phase 1 estimates to use as season boundaries (production-realistic)
|
|
||||||
phase1_list = phase1_harvests_df['day_in_sequence'].tolist()
|
|
||||||
|
|
||||||
for idx, row in phase1_harvests_df.iterrows():
|
|
||||||
current_phase1_day = row['day_in_sequence']
|
|
||||||
current_phase1_date = row['detected_date']
|
|
||||||
|
|
||||||
# Get actual harvest date for validation purposes ONLY (not used in logic)
|
|
||||||
if pd.notna(row['nearest_actual_harvest_date']):
|
|
||||||
actual_date_str = row['nearest_actual_harvest_date']
|
|
||||||
actual_date = pd.to_datetime(actual_date_str)
|
|
||||||
# Find actual day in sequence for comparison
|
|
||||||
actual_day = None
|
|
||||||
for act_day in actual_harvest_days:
|
|
||||||
if act_day < len(full_data_df):
|
|
||||||
data_date = full_data_df.iloc[act_day]['Date']
|
|
||||||
if isinstance(data_date, str):
|
|
||||||
data_date = pd.to_datetime(data_date)
|
|
||||||
if abs((data_date - actual_date).days) < 2:
|
|
||||||
actual_day = act_day
|
|
||||||
break
|
|
||||||
else:
|
|
||||||
actual_date = None
|
|
||||||
actual_day = None
|
|
||||||
|
|
||||||
# PRODUCTION LOGIC: Use Phase 1 estimates to define season boundaries
|
|
||||||
# Season N window: [Phase1_Est_(N-1) - 40 : Phase1_Est_N + 40]
|
|
||||||
if idx > 0:
|
|
||||||
# Previous season's Phase 1 estimate
|
|
||||||
prev_phase1_day = phase1_list[idx - 1]
|
|
||||||
season_start = max(0, prev_phase1_day - 40)
|
|
||||||
else:
|
|
||||||
# First season: start from beginning (or day 0 - 40)
|
|
||||||
season_start = 0
|
|
||||||
|
|
||||||
# Current season's Phase 1 estimate + 40 days buffer
|
|
||||||
season_end = min(len(full_data_df) - 1, current_phase1_day + 40)
|
|
||||||
|
|
||||||
window_df = full_data_df.iloc[season_start:season_end+1].copy()
|
|
||||||
|
|
||||||
if len(window_df) < 50:
|
|
||||||
print(f" ⚠ Field {field_id} harvest {idx}: window too small ({len(window_df)} days), skipping")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Log the window details
|
|
||||||
print(f" Harvest {idx}: Phase1_Est={current_phase1_day} (day_in_seq)")
|
|
||||||
if idx > 0:
|
|
||||||
print(f" PRODUCTION WINDOW: [Phase1_Est_{idx-1}({prev_phase1_day})-40={season_start} : Phase1_Est_{idx}({current_phase1_day})+40={season_end}]")
|
|
||||||
else:
|
|
||||||
print(f" FIRST SEASON WINDOW: [0 : Phase1_Est_0({current_phase1_day})+40={season_end}]")
|
|
||||||
print(f" Window size: {len(window_df)} days")
|
|
||||||
|
|
||||||
# Run inference on window
|
|
||||||
detected_probs = predict_season_window(model, window_df, season_start, scalers, config)
|
|
||||||
|
|
||||||
# Find 0.4 threshold crossing (Phase 1 probs max ~0.46)
|
|
||||||
crossing_day_rel, streak_len, peak_prob = find_sustained_threshold_crossing(
|
|
||||||
detected_probs, threshold=0.4, min_sustained=2
|
|
||||||
)
|
|
||||||
|
|
||||||
if crossing_day_rel is None:
|
|
||||||
print(f" No 0.4 threshold crossing found (max prob in window: {np.max(detected_probs):.4f})")
|
|
||||||
phase2_day = None
|
|
||||||
phase2_date = None
|
|
||||||
phase2_prob = None
|
|
||||||
else:
|
|
||||||
phase2_day = season_start + crossing_day_rel
|
|
||||||
phase2_date = full_data_df.iloc[phase2_day]['Date']
|
|
||||||
phase2_prob = peak_prob
|
|
||||||
if isinstance(phase2_date, str):
|
|
||||||
phase2_date = pd.to_datetime(phase2_date)
|
|
||||||
print(f" [OK] Phase 2 harvest at day {phase2_day} ({phase2_date.strftime('%Y-%m-%d')}) prob={phase2_prob:.4f}")
|
|
||||||
|
|
||||||
# Calculate errors
|
|
||||||
if isinstance(current_phase1_date, str):
|
|
||||||
current_phase1_date = pd.to_datetime(current_phase1_date)
|
|
||||||
|
|
||||||
error_phase1 = abs((actual_date - current_phase1_date).days) if actual_date else None
|
|
||||||
error_phase2 = abs((actual_date - phase2_date).days) if (actual_date and phase2_date) else None
|
|
||||||
improvement = (error_phase1 - error_phase2) if (error_phase1 and error_phase2) else None
|
|
||||||
|
|
||||||
refinements.append({
|
|
||||||
'field': field_id,
|
|
||||||
'harvest_idx': idx,
|
|
||||||
'phase1_day': current_phase1_day,
|
|
||||||
'phase1_date': current_phase1_date.strftime('%Y-%m-%d') if isinstance(current_phase1_date, pd.Timestamp) else current_phase1_date,
|
|
||||||
'phase1_prob': row['peak_prob'] if 'peak_prob' in row else None,
|
|
||||||
'phase2_day': phase2_day,
|
|
||||||
'phase2_date': phase2_date.strftime('%Y-%m-%d') if phase2_date else None,
|
|
||||||
'phase2_prob': phase2_prob,
|
|
||||||
'actual_day': actual_day,
|
|
||||||
'actual_date': actual_date.strftime('%Y-%m-%d') if actual_date else None,
|
|
||||||
'error_phase1': error_phase1,
|
|
||||||
'error_phase2': error_phase2,
|
|
||||||
'improvement': improvement,
|
|
||||||
})
|
|
||||||
|
|
||||||
return refinements
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("="*80)
|
|
||||||
print("PHASE 2: HARVEST DATE REFINEMENT")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Load model
|
|
||||||
print("\nLoading Model 307...")
|
|
||||||
model, config, scalers = load_model_and_config()
|
|
||||||
|
|
||||||
# Load all data
|
|
||||||
print("Loading data...")
|
|
||||||
full_data = load_harvest_data(DATA_FILE)
|
|
||||||
|
|
||||||
# Get unique fields with phase 1 results
|
|
||||||
batch_dir = Path("multi_year_analysis_batch")
|
|
||||||
phase1_files = sorted(batch_dir.glob("detected_harvests_*.csv"))
|
|
||||||
|
|
||||||
print(f"\nFound {len(phase1_files)} fields with Phase 1 results")
|
|
||||||
|
|
||||||
all_refinements = []
|
|
||||||
|
|
||||||
for phase1_file in phase1_files: # Process all fields
|
|
||||||
field_id = phase1_file.stem.replace("detected_harvests_", "")
|
|
||||||
|
|
||||||
# Get field data
|
|
||||||
field_data = full_data[full_data['field'] == field_id].copy()
|
|
||||||
if len(field_data) == 0:
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Skip Chemba fields
|
|
||||||
if field_data['client'].iloc[0] == 'Chemba':
|
|
||||||
print(f"\n--- Field {field_id} (SKIP: Chemba) ---")
|
|
||||||
continue
|
|
||||||
|
|
||||||
field_data = field_data.sort_values('Date').reset_index(drop=True)
|
|
||||||
|
|
||||||
print(f"\n--- Field {field_id} ({len(field_data)} rows) ---")
|
|
||||||
|
|
||||||
# Load phase 1 results
|
|
||||||
phase1_df = pd.read_csv(phase1_file)
|
|
||||||
|
|
||||||
# Process refinements
|
|
||||||
refinements = process_field_refinement(
|
|
||||||
field_id, phase1_df, field_data, model, scalers, config
|
|
||||||
)
|
|
||||||
|
|
||||||
all_refinements.extend(refinements)
|
|
||||||
|
|
||||||
# Summary
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("PHASE 2 REFINEMENT RESULTS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
if all_refinements:
|
|
||||||
results_df = pd.DataFrame(all_refinements)
|
|
||||||
|
|
||||||
# Save detailed results
|
|
||||||
results_file = OUTPUT_DIR / "phase2_refinement_detailed.csv"
|
|
||||||
results_df.to_csv(results_file, index=False)
|
|
||||||
print(f"\nDetailed results saved: {results_file}\n")
|
|
||||||
|
|
||||||
# Display comparison
|
|
||||||
print("Phase 1 vs Phase 2 vs Actual:")
|
|
||||||
print(results_df[['field', 'harvest_idx', 'phase1_date', 'phase2_date', 'actual_date',
|
|
||||||
'error_phase1', 'error_phase2', 'improvement']].to_string(index=False))
|
|
||||||
|
|
||||||
# Statistics
|
|
||||||
print(f"\n" + "="*80)
|
|
||||||
print("ACCURACY IMPROVEMENT")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
valid_p1 = results_df['error_phase1'].notna()
|
|
||||||
valid_p2 = results_df['error_phase2'].notna()
|
|
||||||
|
|
||||||
print(f"Phase 1 errors (N={valid_p1.sum()}):")
|
|
||||||
print(f" Mean: {results_df.loc[valid_p1, 'error_phase1'].mean():.2f} days")
|
|
||||||
print(f" Median: {results_df.loc[valid_p1, 'error_phase1'].median():.2f} days")
|
|
||||||
|
|
||||||
print(f"\nPhase 2 errors (N={valid_p2.sum()}):")
|
|
||||||
print(f" Mean: {results_df.loc[valid_p2, 'error_phase2'].mean():.2f} days")
|
|
||||||
print(f" Median: {results_df.loc[valid_p2, 'error_phase2'].median():.2f} days")
|
|
||||||
|
|
||||||
if valid_p2.sum() > 0:
|
|
||||||
improvement_valid = results_df[valid_p1 & valid_p2]['improvement']
|
|
||||||
print(f"\nImprovement (Phase 1 -> Phase 2):")
|
|
||||||
print(f" Mean: {improvement_valid.mean():.2f} days")
|
|
||||||
print(f" Median: {improvement_valid.median():.2f} days")
|
|
||||||
print(f" Better in: {(improvement_valid > 0).sum()}/{len(improvement_valid)} cases")
|
|
||||||
|
|
||||||
print(f"\n✓ Phase 2 refinement complete!")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,512 +0,0 @@
|
||||||
"""
|
|
||||||
Production Simulation v2: Weekly Harvest Monitoring with Model 307 Live Inference
|
|
||||||
|
|
||||||
Simulates realistic weekly operational workflow:
|
|
||||||
1. Load training data and build field-season sequences
|
|
||||||
2. For each check day (100, 200, 300, 307, 314, ...), truncate sequence to that day
|
|
||||||
3. Run Model 307 inference on truncated sequence
|
|
||||||
4. Track predictions over time and validate against ground truth
|
|
||||||
5. Measure: self-correction, accuracy progression, false positives, missed harvests
|
|
||||||
"""
|
|
||||||
|
|
||||||
import pandas as pd
|
|
||||||
import numpy as np
|
|
||||||
import json
|
|
||||||
import torch
|
|
||||||
from pathlib import Path
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
try:
|
|
||||||
from tqdm import tqdm
|
|
||||||
except ImportError:
|
|
||||||
def tqdm(x, **kw):
|
|
||||||
return x
|
|
||||||
import sys
|
|
||||||
|
|
||||||
sys.path.insert(0, str(Path.cwd() / 'src'))
|
|
||||||
|
|
||||||
from data_loader import load_harvest_data, build_sequences
|
|
||||||
from feature_engineering import extract_features
|
|
||||||
from models import create_model
|
|
||||||
import pickle
|
|
||||||
import yaml
|
|
||||||
|
|
||||||
# Configuration
|
|
||||||
IMMINENT_THRESHOLD = 0.4
|
|
||||||
DETECTED_THRESHOLD = 0.5
|
|
||||||
|
|
||||||
# Check intervals: 100, 200, 300, then 7-day intervals from 300 onwards
|
|
||||||
CHECK_DAYS = list(range(7, 550, 7))
|
|
||||||
|
|
||||||
# Test mode: set to a field name to test on single field, or None for all fields
|
|
||||||
TEST_SINGLE_FIELD = None # Change to None to run on all fields
|
|
||||||
|
|
||||||
RESULTS_DIR = Path("results/307_dropout02_with_doy_ORIGINAL")
|
|
||||||
DATA_FILE = Path("../lstm_complete_data.csv")
|
|
||||||
CONFIG_FILE = RESULTS_DIR / "config.json"
|
|
||||||
MODEL_FILE = RESULTS_DIR / "model.pt"
|
|
||||||
SCALERS_FILE = RESULTS_DIR / "scalers.pkl"
|
|
||||||
|
|
||||||
# Device
|
|
||||||
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
|
||||||
print(f"Using device: {DEVICE}")
|
|
||||||
|
|
||||||
|
|
||||||
def sanitize_filename(filename):
|
|
||||||
"""Remove invalid filename characters."""
|
|
||||||
invalid_chars = r'<>:"|?*\/'
|
|
||||||
for char in invalid_chars:
|
|
||||||
filename = filename.replace(char, '_')
|
|
||||||
return filename
|
|
||||||
|
|
||||||
|
|
||||||
def load_model_and_config():
|
|
||||||
"""Load Model 307 architecture and weights."""
|
|
||||||
print(f"Loading model config from {CONFIG_FILE}")
|
|
||||||
with open(CONFIG_FILE) as f:
|
|
||||||
config = yaml.safe_load(f)
|
|
||||||
|
|
||||||
print(f"Loading model weights from {MODEL_FILE}")
|
|
||||||
model = create_model(
|
|
||||||
model_type=config['model']['type'],
|
|
||||||
input_size=len(config['features']),
|
|
||||||
hidden_size=config['model']['hidden_size'],
|
|
||||||
num_layers=config['model']['num_layers'],
|
|
||||||
dropout=config['model']['dropout'],
|
|
||||||
device=DEVICE
|
|
||||||
)
|
|
||||||
model.load_state_dict(torch.load(MODEL_FILE, map_location=DEVICE))
|
|
||||||
model.eval()
|
|
||||||
|
|
||||||
print(f"Loading feature scalers from {SCALERS_FILE}")
|
|
||||||
with open(SCALERS_FILE, 'rb') as f:
|
|
||||||
scalers = pickle.load(f)
|
|
||||||
|
|
||||||
return model, config, scalers
|
|
||||||
|
|
||||||
|
|
||||||
def predict_on_truncated_sequence(model, data_df, truncate_day, scalers, config):
|
|
||||||
"""
|
|
||||||
Run Model 307 inference on a sequence truncated at a specific day.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
model: Loaded LSTM model
|
|
||||||
data_df: DataFrame with sequence data (sorted by Date)
|
|
||||||
truncate_day: Day index to truncate sequence at
|
|
||||||
scalers: Feature scalers
|
|
||||||
config: Model config with feature info
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
(imminent_prob, detected_prob) at last timestep, or (None, None) if failed
|
|
||||||
"""
|
|
||||||
if truncate_day >= len(data_df):
|
|
||||||
return None, None # Can't predict beyond available data
|
|
||||||
|
|
||||||
# Get truncated sequence
|
|
||||||
trunc_df = data_df.iloc[:truncate_day+1].copy()
|
|
||||||
|
|
||||||
# Extract features
|
|
||||||
features = config['features']
|
|
||||||
ci_column = config['data']['ci_column']
|
|
||||||
feat_array = extract_features(trunc_df, features, ci_column)
|
|
||||||
|
|
||||||
# Apply scalers
|
|
||||||
for fi, scaler in enumerate(scalers):
|
|
||||||
try:
|
|
||||||
feat_array[:, fi] = scaler.transform(feat_array[:, fi].reshape(-1, 1)).flatten()
|
|
||||||
except Exception:
|
|
||||||
pass # Leave as-is if scaler fails
|
|
||||||
|
|
||||||
# Run model inference
|
|
||||||
with torch.no_grad():
|
|
||||||
x_tensor = torch.tensor(feat_array, dtype=torch.float32).unsqueeze(0).to(DEVICE)
|
|
||||||
out_imm, out_det = model(x_tensor)
|
|
||||||
# Get last timestep probabilities
|
|
||||||
imminent_prob = out_imm.squeeze(0)[-1].cpu().item()
|
|
||||||
detected_prob = out_det.squeeze(0)[-1].cpu().item()
|
|
||||||
|
|
||||||
return imminent_prob, detected_prob
|
|
||||||
|
|
||||||
|
|
||||||
def simulate_weekly_checks(sequences, model, scalers, config):
|
|
||||||
"""
|
|
||||||
Simulate weekly production monitoring with live Model 307 inference.
|
|
||||||
|
|
||||||
For each sequence and each check day:
|
|
||||||
- Truncate to that day
|
|
||||||
- Run Model 307 inference
|
|
||||||
- Record predictions and compare to ground truth
|
|
||||||
"""
|
|
||||||
print("\nSimulating weekly monitoring with live Model 307 inference...")
|
|
||||||
print(f"Running inference on {len(sequences)} sequences x {len(CHECK_DAYS)} check days")
|
|
||||||
|
|
||||||
results = []
|
|
||||||
|
|
||||||
# Filter to single field if in test mode
|
|
||||||
seqs_to_process = sequences
|
|
||||||
if TEST_SINGLE_FIELD:
|
|
||||||
seqs_to_process = [s for s in sequences if s['field'] == TEST_SINGLE_FIELD]
|
|
||||||
if not seqs_to_process:
|
|
||||||
print(f"WARNING: Field '{TEST_SINGLE_FIELD}' not found!")
|
|
||||||
return pd.DataFrame(), []
|
|
||||||
print(f"TEST MODE: Processing {len(seqs_to_process)} sequence(s) for field '{TEST_SINGLE_FIELD}'")
|
|
||||||
|
|
||||||
# Process each sequence
|
|
||||||
for seq_idx, seq in enumerate(tqdm(seqs_to_process, desc="Sequences")):
|
|
||||||
field = seq['field']
|
|
||||||
season = seq['season'] # From sequence dict, not from data
|
|
||||||
data_df = seq['data'].sort_values('Date').reset_index(drop=True)
|
|
||||||
|
|
||||||
# Extract ground truth
|
|
||||||
harvest_rows = np.where(data_df.get('harvest_detected', pd.Series([0]*len(data_df))) == 1)[0]
|
|
||||||
actual_harvest_day = harvest_rows[0] if len(harvest_rows) > 0 else None
|
|
||||||
|
|
||||||
# Run predictions at each check day
|
|
||||||
for check_day in CHECK_DAYS:
|
|
||||||
if check_day >= len(data_df):
|
|
||||||
continue # Skip if sequence is shorter
|
|
||||||
|
|
||||||
# Get Model 307 prediction at this check day
|
|
||||||
imminent_prob, detected_prob = predict_on_truncated_sequence(
|
|
||||||
model, data_df, check_day, scalers, config
|
|
||||||
)
|
|
||||||
|
|
||||||
if imminent_prob is None:
|
|
||||||
continue
|
|
||||||
|
|
||||||
check_row = data_df.iloc[check_day]
|
|
||||||
|
|
||||||
result = {
|
|
||||||
'field': field,
|
|
||||||
'season': season,
|
|
||||||
'check_day': check_day,
|
|
||||||
'check_date': check_row['Date'],
|
|
||||||
'imminent_prob_pred': imminent_prob,
|
|
||||||
'detected_prob_pred': detected_prob,
|
|
||||||
'imminent_signal': imminent_prob > IMMINENT_THRESHOLD,
|
|
||||||
'detected_signal': detected_prob > DETECTED_THRESHOLD,
|
|
||||||
'actual_harvest_day': actual_harvest_day,
|
|
||||||
'harvest_status': 'unknown',
|
|
||||||
'days_until_harvest': None,
|
|
||||||
}
|
|
||||||
|
|
||||||
# Calculate days until harvest
|
|
||||||
if actual_harvest_day is not None:
|
|
||||||
days_until = actual_harvest_day - check_day
|
|
||||||
result['days_until_harvest'] = days_until
|
|
||||||
|
|
||||||
if days_until > 14:
|
|
||||||
result['harvest_status'] = 'early'
|
|
||||||
elif days_until > 3:
|
|
||||||
result['harvest_status'] = 'approaching'
|
|
||||||
elif days_until > 0:
|
|
||||||
result['harvest_status'] = 'imminent'
|
|
||||||
elif days_until == 0:
|
|
||||||
result['harvest_status'] = 'today'
|
|
||||||
else:
|
|
||||||
result['harvest_status'] = 'past'
|
|
||||||
|
|
||||||
results.append(result)
|
|
||||||
|
|
||||||
return pd.DataFrame(results), seqs_to_process
|
|
||||||
|
|
||||||
|
|
||||||
def generate_timeline_visualization(monitoring_df, sequences, output_dir_path="production_timeline"):
|
|
||||||
"""Generate per-field visualization showing predictions and CI on same plot with dual axes."""
|
|
||||||
output_dir = Path(output_dir_path)
|
|
||||||
output_dir.mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
print(f"\nGenerating per-field prediction timelines...")
|
|
||||||
|
|
||||||
# Group by field
|
|
||||||
for field_name in monitoring_df['field'].unique():
|
|
||||||
field_df = monitoring_df[monitoring_df['field'] == field_name]
|
|
||||||
field_sequences = [s for s in sequences if s['field'] == field_name]
|
|
||||||
|
|
||||||
if not field_sequences:
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Create subplots - one per season
|
|
||||||
n_models = len(field_sequences)
|
|
||||||
fig, axes = plt.subplots(n_models, 1, figsize=(16, 5 * n_models))
|
|
||||||
if n_models == 1:
|
|
||||||
axes = [axes]
|
|
||||||
|
|
||||||
for ax_idx, seq in enumerate(field_sequences):
|
|
||||||
ax1 = axes[ax_idx]
|
|
||||||
season = seq['season']
|
|
||||||
data_df = seq['data'].sort_values('Date').reset_index(drop=True)
|
|
||||||
|
|
||||||
# Get predictions for this model at check days
|
|
||||||
model_preds = field_df[field_df['season'] == season].sort_values('check_day')
|
|
||||||
|
|
||||||
if len(model_preds) == 0:
|
|
||||||
continue
|
|
||||||
|
|
||||||
check_days = model_preds['check_day'].values
|
|
||||||
imminent_probs = model_preds['imminent_prob_pred'].values
|
|
||||||
detected_probs = model_preds['detected_prob_pred'].values
|
|
||||||
imminent_signals = model_preds['imminent_signal'].values
|
|
||||||
detected_signals = model_preds['detected_signal'].values
|
|
||||||
|
|
||||||
# Plot prediction progression on left y-axis
|
|
||||||
ax1.plot(check_days, imminent_probs, 'o-', color='orange', label='Imminent Prob', linewidth=2, markersize=8)
|
|
||||||
ax1.plot(check_days, detected_probs, 's-', color='red', label='Detected Prob', linewidth=2, markersize=8)
|
|
||||||
|
|
||||||
# Add threshold lines
|
|
||||||
ax1.axhline(IMMINENT_THRESHOLD, color='orange', linestyle='--', alpha=0.5, linewidth=1.5)
|
|
||||||
ax1.axhline(DETECTED_THRESHOLD, color='red', linestyle='--', alpha=0.5, linewidth=1.5)
|
|
||||||
|
|
||||||
# Mark actual harvest
|
|
||||||
actual_harvest_day = model_preds['actual_harvest_day'].iloc[0] if len(model_preds) > 0 else None
|
|
||||||
if actual_harvest_day is not None and not pd.isna(actual_harvest_day):
|
|
||||||
ax1.axvline(actual_harvest_day, color='black', linestyle='--', alpha=0.7, linewidth=2.5, label=f"Actual Harvest (day {int(actual_harvest_day)})")
|
|
||||||
|
|
||||||
# Highlight fired signals
|
|
||||||
for i, (day, is_imm, is_det) in enumerate(zip(check_days, imminent_signals, detected_signals)):
|
|
||||||
if is_imm:
|
|
||||||
ax1.scatter(day, imminent_probs[i], s=200, color='orange', marker='*', edgecolors='black', linewidth=1.5, zorder=5)
|
|
||||||
if is_det:
|
|
||||||
ax1.scatter(day, detected_probs[i], s=200, color='red', marker='*', edgecolors='black', linewidth=1.5, zorder=5)
|
|
||||||
|
|
||||||
ax1.set_ylim(-0.05, 1.05)
|
|
||||||
ax1.set_xlabel('Day in Sequence', fontsize=11)
|
|
||||||
ax1.set_ylabel('Prediction Probability', fontsize=11, color='black')
|
|
||||||
ax1.tick_params(axis='y', labelcolor='black')
|
|
||||||
ax1.grid(alpha=0.3)
|
|
||||||
|
|
||||||
# Create secondary y-axis for CI
|
|
||||||
ax2 = ax1.twinx()
|
|
||||||
|
|
||||||
# Plot CI data on right y-axis
|
|
||||||
days_idx = np.arange(len(data_df))
|
|
||||||
|
|
||||||
# Use FitData as the raw CI
|
|
||||||
if 'FitData' in data_df.columns:
|
|
||||||
ci_raw = data_df['FitData'].values
|
|
||||||
ax2.plot(days_idx, ci_raw, color='seagreen', label='Raw CI', linewidth=1, alpha=0.4, linestyle=':')
|
|
||||||
|
|
||||||
# Compute 7-day moving average
|
|
||||||
ci_7d_ma = data_df['FitData'].rolling(window=7, min_periods=1).mean().values
|
|
||||||
ax2.plot(days_idx, ci_7d_ma, color='darkgreen', label='7-day MA', linewidth=2.5, alpha=0.7)
|
|
||||||
|
|
||||||
ax2.set_ylabel('CI Value', fontsize=11, color='darkgreen')
|
|
||||||
ax2.tick_params(axis='y', labelcolor='darkgreen')
|
|
||||||
|
|
||||||
# Combined legend
|
|
||||||
lines1, labels1 = ax1.get_legend_handles_labels()
|
|
||||||
lines2, labels2 = ax2.get_legend_handles_labels()
|
|
||||||
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=9)
|
|
||||||
|
|
||||||
ax1.set_title(f"{field_name} | Season {season} - Model 307 Predictions + CI Sequence", fontsize=12, fontweight='bold')
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
output_file = output_dir / f"predictions_{sanitize_filename(field_name)}.png"
|
|
||||||
plt.savefig(output_file, dpi=100, bbox_inches='tight')
|
|
||||||
print(f" Saved: {output_file}")
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
print(f"Visualizations saved to: {output_dir}/")
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def generate_convergence_plot(monitoring_df, output_dir_path="convergence_analysis"):
|
|
||||||
"""
|
|
||||||
Generate spaghetti plots showing individual prediction trajectories per field.
|
|
||||||
|
|
||||||
For each field, creates a plot with all seasons of that field overlaid,
|
|
||||||
showing how predictions change over weekly check days.
|
|
||||||
"""
|
|
||||||
output_dir = Path(output_dir_path)
|
|
||||||
output_dir.mkdir(parents=True, exist_ok=True)
|
|
||||||
|
|
||||||
print(f"\nGenerating convergence analysis plots (Spaghetti - Per Field)...")
|
|
||||||
|
|
||||||
check_days_unique = sorted(monitoring_df['check_day'].unique())
|
|
||||||
|
|
||||||
# Generate per-field spaghetti plots
|
|
||||||
for field_name in monitoring_df['field'].unique():
|
|
||||||
field_df = monitoring_df[monitoring_df['field'] == field_name]
|
|
||||||
field_seasons = field_df['season'].unique()
|
|
||||||
|
|
||||||
# Create spaghetti plot for this field
|
|
||||||
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 10))
|
|
||||||
|
|
||||||
colors = plt.cm.tab20(np.linspace(0, 1, len(field_seasons)))
|
|
||||||
|
|
||||||
# Group by season to get individual traces for this field
|
|
||||||
for season_idx, season in enumerate(field_seasons):
|
|
||||||
season_df = field_df[field_df['season'] == season].sort_values('check_day')
|
|
||||||
|
|
||||||
if len(season_df) == 0:
|
|
||||||
continue
|
|
||||||
|
|
||||||
check_days_season = season_df['check_day'].values
|
|
||||||
imminent_probs_season = season_df['imminent_prob_pred'].values
|
|
||||||
detected_probs_season = season_df['detected_prob_pred'].values
|
|
||||||
actual_harvest = season_df['actual_harvest_day'].iloc[0]
|
|
||||||
|
|
||||||
# Plot with distinct colors and higher alpha for visibility
|
|
||||||
ax1.plot(check_days_season, imminent_probs_season, 'o-', alpha=0.6, linewidth=2,
|
|
||||||
markersize=5, color=colors[season_idx], label=f"{season}")
|
|
||||||
ax2.plot(check_days_season, detected_probs_season, 's-', alpha=0.6, linewidth=2,
|
|
||||||
markersize=5, color=colors[season_idx], label=f"{season}")
|
|
||||||
|
|
||||||
# Add vertical line for actual harvest date (per sequence) - same color as trajectory, bold
|
|
||||||
if not pd.isna(actual_harvest):
|
|
||||||
ax1.axvline(actual_harvest, color=colors[season_idx], linestyle='--', alpha=0.7, linewidth=2.5)
|
|
||||||
ax2.axvline(actual_harvest, color=colors[season_idx], linestyle='--', alpha=0.7, linewidth=2.5)
|
|
||||||
|
|
||||||
# Add threshold lines (no fill) and formatting for imminent
|
|
||||||
ax1.axhline(IMMINENT_THRESHOLD, color='orange', linestyle='--', linewidth=2.5, alpha=0.8,
|
|
||||||
label=f'Imminent Threshold ({IMMINENT_THRESHOLD})')
|
|
||||||
ax1.set_ylabel('Imminent Probability', fontsize=12, fontweight='bold')
|
|
||||||
ax1.set_ylim(-0.05, 1.05)
|
|
||||||
ax1.grid(alpha=0.3, axis='y')
|
|
||||||
ax1.legend(loc='upper left', fontsize=8, ncol=2)
|
|
||||||
ax1.set_title(f'Field {field_name} - Prediction Trajectories Over Time - Imminent Signal\n(Each line = one season; vertical lines = actual harvest dates)',
|
|
||||||
fontsize=13, fontweight='bold')
|
|
||||||
ax1.set_xticks(check_days_unique[::3])
|
|
||||||
ax1.set_xlim(min(check_days_unique) - 10, max(check_days_unique) + 10)
|
|
||||||
|
|
||||||
# Add threshold lines (no fill) and formatting for detected
|
|
||||||
ax2.axhline(DETECTED_THRESHOLD, color='red', linestyle='--', linewidth=2.5, alpha=0.8,
|
|
||||||
label=f'Detected Threshold ({DETECTED_THRESHOLD})')
|
|
||||||
ax2.set_xlabel('Check Day (to scale)', fontsize=12, fontweight='bold')
|
|
||||||
ax2.set_ylabel('Detected Probability', fontsize=12, fontweight='bold')
|
|
||||||
ax2.set_ylim(-0.05, 1.05)
|
|
||||||
ax2.grid(alpha=0.3, axis='y')
|
|
||||||
ax2.grid(alpha=0.2, axis='x') # Show time scale grid
|
|
||||||
ax2.legend(loc='upper left', fontsize=8, ncol=2)
|
|
||||||
ax2.set_title(f'Field {field_name} - Prediction Trajectories Over Time - Detected Signal\n(Each line = one season; vertical lines = actual harvest dates)',
|
|
||||||
fontsize=13, fontweight='bold')
|
|
||||||
ax2.set_xticks(check_days_unique[::3])
|
|
||||||
ax2.set_xlim(min(check_days_unique) - 10, max(check_days_unique) + 10)
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
output_file = output_dir / f"convergence_spaghetti_{sanitize_filename(field_name)}.png"
|
|
||||||
plt.savefig(output_file, dpi=100, bbox_inches='tight')
|
|
||||||
print(f" Saved: {output_file}")
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
print(f"Convergence plots saved to: {output_dir}/")
|
|
||||||
|
|
||||||
|
|
||||||
def generate_statistics(monitoring_df):
|
|
||||||
"""Generate production-relevant statistics."""
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("PRODUCTION SIMULATION RESULTS (Live Inference)")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
print(f"\nDataset Summary:")
|
|
||||||
print(f" Total field-models: {monitoring_df['season'].nunique()}")
|
|
||||||
print(f" Total monitoring events: {len(monitoring_df)}")
|
|
||||||
print(f" Check intervals: {CHECK_DAYS}")
|
|
||||||
|
|
||||||
# Imminent signal statistics
|
|
||||||
imminent_signals = monitoring_df[monitoring_df['imminent_signal']]
|
|
||||||
print(f"\nImminent Signal (prob > {IMMINENT_THRESHOLD}):")
|
|
||||||
print(f" Triggered in: {len(imminent_signals)} events ({len(imminent_signals)/len(monitoring_df)*100:.1f}%)")
|
|
||||||
|
|
||||||
if len(imminent_signals) > 0:
|
|
||||||
imminent_accurate = imminent_signals[imminent_signals['days_until_harvest'] > 0]
|
|
||||||
print(f" Accurate triggers (>0 days before harvest): {len(imminent_accurate)} ({len(imminent_accurate)/len(imminent_signals)*100:.1f}%)")
|
|
||||||
|
|
||||||
if len(imminent_accurate) > 0:
|
|
||||||
avg_days = imminent_accurate['days_until_harvest'].mean()
|
|
||||||
print(f" Average days before harvest (when accurate): {avg_days:.1f}")
|
|
||||||
|
|
||||||
# Detected signal statistics
|
|
||||||
detected_signals = monitoring_df[monitoring_df['detected_signal']]
|
|
||||||
print(f"\nDetected Signal (prob > {DETECTED_THRESHOLD}):")
|
|
||||||
print(f" Triggered in: {len(detected_signals)} events ({len(detected_signals)/len(monitoring_df)*100:.1f}%)")
|
|
||||||
|
|
||||||
if len(detected_signals) > 0:
|
|
||||||
detected_near_harvest = detected_signals[
|
|
||||||
(detected_signals['days_until_harvest'] >= 0) &
|
|
||||||
(detected_signals['days_until_harvest'] <= 7)
|
|
||||||
]
|
|
||||||
print(f" Near harvest (0-7 days before/after): {len(detected_near_harvest)} ({len(detected_near_harvest)/len(detected_signals)*100:.1f}%)")
|
|
||||||
|
|
||||||
if len(detected_near_harvest) > 0:
|
|
||||||
avg_days = detected_near_harvest['days_until_harvest'].mean()
|
|
||||||
print(f" Average days from harvest: {avg_days:.1f}")
|
|
||||||
|
|
||||||
print("\n" + "="*80)
|
|
||||||
|
|
||||||
|
|
||||||
def export_results(monitoring_df, output_dir):
|
|
||||||
"""Export CSV reports."""
|
|
||||||
output_dir.mkdir(parents=True, exist_ok=True)
|
|
||||||
|
|
||||||
# Export all events
|
|
||||||
events_file = output_dir / "production_monitoring_events.csv"
|
|
||||||
monitoring_df.to_csv(events_file, index=False)
|
|
||||||
print(f"\nExported monitoring events to: {events_file}")
|
|
||||||
|
|
||||||
# Export per-model summary
|
|
||||||
summary_data = []
|
|
||||||
for season in monitoring_df['season'].unique():
|
|
||||||
model_df = monitoring_df[monitoring_df['season'] == season]
|
|
||||||
field = model_df['field'].iloc[0]
|
|
||||||
|
|
||||||
summary_data.append({
|
|
||||||
'field': field,
|
|
||||||
'season': season,
|
|
||||||
'total_checks': len(model_df),
|
|
||||||
'imminent_signals': (model_df['imminent_signal']).sum(),
|
|
||||||
'detected_signals': (model_df['detected_signal']).sum(),
|
|
||||||
'imminent_accurate': ((model_df['imminent_signal']) & (model_df['days_until_harvest'] > 0)).sum(),
|
|
||||||
})
|
|
||||||
|
|
||||||
summary_df = pd.DataFrame(summary_data)
|
|
||||||
summary_file = output_dir / "production_monitoring_summary.csv"
|
|
||||||
summary_df.to_csv(summary_file, index=False)
|
|
||||||
print(f"Exported summary to: {summary_file}")
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("="*80)
|
|
||||||
print("PRODUCTION SIMULATION: Weekly Harvest Monitoring with Live Inference")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Load model and config
|
|
||||||
print("\n[1/5] Loading Model 307...")
|
|
||||||
model, config, scalers = load_model_and_config()
|
|
||||||
|
|
||||||
# Load training data and build sequences
|
|
||||||
print("\n[2/5] Loading training data...")
|
|
||||||
df = load_harvest_data(DATA_FILE)
|
|
||||||
print(f"Loaded {len(df)} rows")
|
|
||||||
|
|
||||||
print("\n[3/5] Building field-model sequences...")
|
|
||||||
sequences = build_sequences(df)
|
|
||||||
print(f"Built {len(sequences)} sequences")
|
|
||||||
|
|
||||||
# Run production simulation
|
|
||||||
print("\n[4/5] Running production simulation...")
|
|
||||||
monitoring_df, processed_seqs = simulate_weekly_checks(sequences, model, scalers, config)
|
|
||||||
|
|
||||||
if len(monitoring_df) == 0:
|
|
||||||
print("ERROR: No results generated!")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Generate statistics and reports
|
|
||||||
print("\n[5/5] Generating reports...")
|
|
||||||
generate_statistics(monitoring_df)
|
|
||||||
|
|
||||||
# Output to results folder
|
|
||||||
if TEST_SINGLE_FIELD:
|
|
||||||
output_dir = Path("results") / f"production_simulation_test_{TEST_SINGLE_FIELD}"
|
|
||||||
else:
|
|
||||||
output_dir = Path("results") / "production_simulation_full"
|
|
||||||
|
|
||||||
export_results(monitoring_df, output_dir)
|
|
||||||
generate_timeline_visualization(monitoring_df, processed_seqs, str(output_dir / "predictions_per_field"))
|
|
||||||
generate_convergence_plot(monitoring_df, str(output_dir / "convergence_analysis"))
|
|
||||||
|
|
||||||
print(f"\n✓ All results saved to: {output_dir}/")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,142 +0,0 @@
|
||||||
# 02b_CONVERT_RDS_TO_CSV.R
|
|
||||||
# ========================
|
|
||||||
# Convert combined_CI_data.rds to long format with daily interpolation
|
|
||||||
#
|
|
||||||
# Input: combined_CI_data.rds (wide: field, sub_field, and dates as columns)
|
|
||||||
# Output: ci_data_for_python.csv (long: daily interpolated data, one row per field-date)
|
|
||||||
#
|
|
||||||
# Process:
|
|
||||||
# 1. Convert wide to long (raw measurements)
|
|
||||||
# 2. For each field, create COMPLETE daily sequence (first date to last date)
|
|
||||||
# 3. Linearly interpolate CI values for missing dates (including gaps)
|
|
||||||
# 4. Add DOY = cumulative days (1, 2, 3, ...) continuously per field
|
|
||||||
# (Python script will later detect gaps/seasons and reset DOY per season)
|
|
||||||
#
|
|
||||||
# Output columns: field, sub_field, Date, value, FitData, DOY
|
|
||||||
# - value: raw CI measurement (NA if interpolated/filled)
|
|
||||||
# - FitData: linearly interpolated CI value (used by model)
|
|
||||||
# - DOY: cumulative days since first measurement (1, 2, 3, ..., continuous per field)
|
|
||||||
#
|
|
||||||
|
|
||||||
suppressPackageStartupMessages({
|
|
||||||
library(tidyverse)
|
|
||||||
library(lubridate)
|
|
||||||
library(zoo)
|
|
||||||
})
|
|
||||||
|
|
||||||
# Paths
|
|
||||||
rds_file <- "C:/Users/timon/Resilience BV/4020 SCane ESA DEMO - Documenten/General/4020 SCDEMO Team/4020 TechnicalData/WP3/smartcane_v2/smartcane/laravel_app/storage/app/angata/Data/extracted_ci/cumulative_vals/combined_CI_data.rds"
|
|
||||||
output_file <- "ci_data_for_python.csv"
|
|
||||||
|
|
||||||
cat("=" %+% strrep("=", 78) %+% "\n")
|
|
||||||
cat("RDS TO CSV: DAILY INTERPOLATION (NO SEASON RESET)\n")
|
|
||||||
cat("=" %+% strrep("=", 78) %+% "\n\n")
|
|
||||||
|
|
||||||
# Load RDS
|
|
||||||
if (!file.exists(rds_file)) {
|
|
||||||
stop(paste("ERROR: File not found:", rds_file))
|
|
||||||
}
|
|
||||||
|
|
||||||
cat(sprintf("Loading: %s\n", rds_file))
|
|
||||||
ci_wide <- readRDS(rds_file) %>% as_tibble() %>% ungroup()
|
|
||||||
|
|
||||||
cat(sprintf("✓ Loaded %d fields (wide format)\n", nrow(ci_wide)))
|
|
||||||
cat(sprintf(" Sample columns: %s\n\n", paste(head(names(ci_wide), 8), collapse = ", ")))
|
|
||||||
|
|
||||||
# Step 1: Convert to long format (raw measurements)
|
|
||||||
cat("Step 1: Converting to long format (raw measurements)...\n")
|
|
||||||
ci_raw <- ci_wide %>%
|
|
||||||
pivot_longer(
|
|
||||||
cols = -c(field, sub_field),
|
|
||||||
names_to = "Date",
|
|
||||||
values_to = "value",
|
|
||||||
values_drop_na = TRUE
|
|
||||||
) %>%
|
|
||||||
mutate(
|
|
||||||
Date = as.Date(Date),
|
|
||||||
value = as.numeric(value)
|
|
||||||
) %>%
|
|
||||||
filter(!is.na(value)) %>%
|
|
||||||
arrange(field, Date)
|
|
||||||
|
|
||||||
cat(sprintf("✓ Got %d raw measurements\n\n", nrow(ci_raw)))
|
|
||||||
|
|
||||||
# Step 2: Create complete daily sequences with interpolation
|
|
||||||
cat("Step 2: Creating complete daily sequences (with interpolation)...\n")
|
|
||||||
|
|
||||||
ci_daily <- ci_raw %>%
|
|
||||||
group_by(field) %>%
|
|
||||||
nest() %>%
|
|
||||||
mutate(
|
|
||||||
data = map(data, function(df) {
|
|
||||||
sub_field <- df$sub_field[1]
|
|
||||||
|
|
||||||
# Sort by date
|
|
||||||
df <- df %>% arrange(Date)
|
|
||||||
|
|
||||||
# Create COMPLETE daily sequence (first to last date)
|
|
||||||
date_seq <- seq(min(df$Date), max(df$Date), by = "day")
|
|
||||||
|
|
||||||
# Create full daily dataframe
|
|
||||||
daily_df <- tibble(
|
|
||||||
field = df$field[1],
|
|
||||||
sub_field = sub_field,
|
|
||||||
Date = date_seq,
|
|
||||||
value = NA_real_,
|
|
||||||
FitData = NA_real_,
|
|
||||||
DOY = seq_along(date_seq) # Continuous count: 1, 2, 3, ...
|
|
||||||
)
|
|
||||||
|
|
||||||
# Fill in actual values from raw measurements
|
|
||||||
for (i in seq_len(nrow(df))) {
|
|
||||||
idx <- which(daily_df$Date == df$Date[i])
|
|
||||||
if (length(idx) > 0) {
|
|
||||||
daily_df$value[idx] <- df$value[i]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
# Linear interpolation for FitData (fills all missing dates)
|
|
||||||
daily_df$FitData <- na.approx(daily_df$value, na.rm = FALSE)
|
|
||||||
|
|
||||||
daily_df
|
|
||||||
})
|
|
||||||
) %>%
|
|
||||||
unnest(data) %>%
|
|
||||||
select(field, sub_field, Date, value, FitData, DOY)
|
|
||||||
|
|
||||||
cat(sprintf("✓ Generated %d daily rows (complete sequence with interpolation)\n\n", nrow(ci_daily)))
|
|
||||||
|
|
||||||
# Step 3: Validation
|
|
||||||
cat("Validation:\n")
|
|
||||||
cat(sprintf(" Total daily rows: %d\n", nrow(ci_daily)))
|
|
||||||
cat(sprintf(" Unique fields: %d\n", n_distinct(ci_daily$field)))
|
|
||||||
cat(sprintf(" Date range: %s to %s\n",
|
|
||||||
min(ci_daily$Date, na.rm = TRUE),
|
|
||||||
max(ci_daily$Date, na.rm = TRUE)))
|
|
||||||
cat(sprintf(" FitData range: [%.2f, %.2f]\n",
|
|
||||||
min(ci_daily$FitData, na.rm = TRUE),
|
|
||||||
max(ci_daily$FitData, na.rm = TRUE)))
|
|
||||||
cat(sprintf(" Raw measurements: %d\n", sum(!is.na(ci_daily$value))))
|
|
||||||
cat(sprintf(" Interpolated values: %d\n", sum(is.na(ci_daily$value) & !is.na(ci_daily$FitData))))
|
|
||||||
|
|
||||||
# Get max DOY per field safely
|
|
||||||
max_doy_by_field <- ci_daily %>%
|
|
||||||
group_by(field) %>%
|
|
||||||
summarise(max_doy = max(DOY, na.rm = TRUE), .groups = "drop") %>%
|
|
||||||
arrange(desc(max_doy))
|
|
||||||
cat(sprintf(" Max DOY (top 3 fields): %s\n\n",
|
|
||||||
paste(paste0(max_doy_by_field$field[1:3], "=", max_doy_by_field$max_doy[1:3]), collapse = ", ")))
|
|
||||||
|
|
||||||
# Sample data
|
|
||||||
cat("Sample (first 20 rows from field 00110):\n")
|
|
||||||
sample_data <- ci_daily %>% filter(field == "00110") %>% head(20)
|
|
||||||
print(sample_data)
|
|
||||||
cat("\n")
|
|
||||||
|
|
||||||
# Save to CSV
|
|
||||||
cat(sprintf("Saving to: %s\n", output_file))
|
|
||||||
write_csv(ci_daily, output_file)
|
|
||||||
|
|
||||||
cat(sprintf("✓ Successfully exported %d rows\n\n", nrow(ci_daily)))
|
|
||||||
cat(sprintf("Ready for Python seasonal slicing and LSTM model!\n"))
|
|
||||||
cat(sprintf("Next step: python run_export_harvest_dates.py\n"))
|
|
||||||
|
|
@ -1,38 +0,0 @@
|
||||||
# Phase 4: Production Export & Monitoring
|
|
||||||
|
|
||||||
Self-contained folder for two-step harvest date prediction and production-ready Excel export.
|
|
||||||
|
|
||||||
## Files
|
|
||||||
|
|
||||||
- `run_export_harvest_dates.py` - Main script: two-step harvest date refinement → harvest.xlsx
|
|
||||||
- `production_monitoring.py` - Ongoing weekly/daily monitoring using harvest.xlsx (TODO)
|
|
||||||
- `harvest_date_pred_utils.py` - Shared utility functions
|
|
||||||
- `config.json` - Model 307 architecture config
|
|
||||||
- `model.pt` - Trained LSTM weights (Model 307)
|
|
||||||
- `scalers.pkl` - Feature normalization scalers
|
|
||||||
- `lstm_complete_data.csv` - Input CI time series data (copy from parent or generate)
|
|
||||||
|
|
||||||
## Setup
|
|
||||||
|
|
||||||
1. Copy or generate `lstm_complete_data.csv` to this folder
|
|
||||||
2. Model files (config.json, model.pt, scalers.pkl) are already included
|
|
||||||
|
|
||||||
## Run
|
|
||||||
|
|
||||||
```powershell
|
|
||||||
conda activate pytorch_gpu
|
|
||||||
cd 04_production_export
|
|
||||||
$env:CUDA_VISIBLE_DEVICES='0'; python run_export_harvest_dates.py 2>&1 | Tee-Object export_run.log
|
|
||||||
```
|
|
||||||
|
|
||||||
This generates `harvest_production_export.xlsx` with columns:
|
|
||||||
- field
|
|
||||||
- season_start_date
|
|
||||||
- season_end_date (estimated harvest)
|
|
||||||
- ...
|
|
||||||
|
|
||||||
## Next
|
|
||||||
|
|
||||||
- [ ] Implement two-step refinement logic in `harvest_date_pred_utils.py`
|
|
||||||
- [ ] Create `production_monitoring.py` for weekly/daily predictions
|
|
||||||
- [ ] Integrate into main pipeline
|
|
||||||
|
|
@ -1,299 +0,0 @@
|
||||||
"""
|
|
||||||
Batch Field Visualization Tool - RGB Imagery Around Harvest Date
|
|
||||||
Purpose: Generate visual validation using RGB satellite imagery samples around
|
|
||||||
predicted harvest date to verify predictions (bare soil = harvested, green = not harvested)
|
|
||||||
|
|
||||||
Shows 12-15 RGB images in a grid, centered around the predicted harvest date
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
python batch_plot_fields_rgb.py field1,field2,field3
|
|
||||||
python batch_plot_fields_rgb.py 10125,88,97
|
|
||||||
|
|
||||||
Or read from CSV:
|
|
||||||
python batch_plot_fields_rgb.py --file fields_to_check.csv
|
|
||||||
"""
|
|
||||||
|
|
||||||
import pandas as pd
|
|
||||||
import numpy as np
|
|
||||||
import torch
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
import matplotlib.dates as mdates
|
|
||||||
from pathlib import Path
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
import sys
|
|
||||||
import rasterio
|
|
||||||
from rasterio.mask import mask
|
|
||||||
import geopandas as gpd
|
|
||||||
from harvest_date_pred_utils import load_model_and_config, extract_features
|
|
||||||
|
|
||||||
|
|
||||||
def get_field_centroid(field_id, geojson_path="pivot.geojson"):
|
|
||||||
"""Get centroid of field from GeoJSON for cropping RGB images."""
|
|
||||||
try:
|
|
||||||
gdf = gpd.read_file(geojson_path)
|
|
||||||
field_geom = gdf[gdf['field'] == str(field_id)]
|
|
||||||
if len(field_geom) > 0:
|
|
||||||
centroid = field_geom.geometry.iloc[0].centroid
|
|
||||||
return (centroid.x, centroid.y)
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Warning: Could not get field centroid - {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def load_rgb_image(tif_path, field_id=None, geojson_path="pivot.geojson"):
|
|
||||||
"""
|
|
||||||
Load RGB bands from 8-band GeoTIFF
|
|
||||||
Bands: 0=coastal, 1=blue, 2=green, 3=green_i, 4=yellow, 5=red, 6=rededge, 7=nir
|
|
||||||
RGB = bands 5,3,1 (Red, Green, Blue)
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
with rasterio.open(tif_path) as src:
|
|
||||||
# Read RGB bands (bands are 1-indexed in rasterio)
|
|
||||||
red = src.read(6) # Band 6 = red (0-indexed band 5)
|
|
||||||
green = src.read(3) # Band 3 = green (0-indexed band 2)
|
|
||||||
blue = src.read(2) # Band 2 = blue (0-indexed band 1)
|
|
||||||
|
|
||||||
# Stack into RGB image
|
|
||||||
rgb = np.stack([red, green, blue], axis=2)
|
|
||||||
|
|
||||||
# Normalize to 0-1 range (8-band data is typically 0-10000)
|
|
||||||
rgb = np.clip(rgb / 5000.0, 0, 1)
|
|
||||||
|
|
||||||
return rgb
|
|
||||||
except Exception as e:
|
|
||||||
print(f" Error loading RGB from {tif_path}: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def plot_field_rgb_validation(field_id, ci_data, model, config, scalers, device,
|
|
||||||
tif_folder="../../../laravel_app/storage/app/angata/merged_tif_8b",
|
|
||||||
output_dir="validation_plots_rgb"):
|
|
||||||
"""
|
|
||||||
Create validation plot for a single field:
|
|
||||||
- Top: Harvest probability over time with peak marked
|
|
||||||
- Bottom: 12-15 RGB images in grid around predicted harvest date
|
|
||||||
"""
|
|
||||||
# Create output directory
|
|
||||||
Path(output_dir).mkdir(parents=True, exist_ok=True)
|
|
||||||
|
|
||||||
# Filter field data
|
|
||||||
field_data = ci_data[ci_data['field'] == field_id].copy()
|
|
||||||
if len(field_data) == 0:
|
|
||||||
print(f" ✗ Field {field_id}: No CI data found")
|
|
||||||
return False
|
|
||||||
|
|
||||||
field_data = field_data.sort_values('Date')
|
|
||||||
print(f" ✓ Field {field_id}: {len(field_data)} days of data")
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Extract features and run inference
|
|
||||||
ci_column = config['data']['ci_column']
|
|
||||||
feature_names = config['features']
|
|
||||||
|
|
||||||
feat_array = extract_features(field_data, feature_names, ci_column=ci_column)
|
|
||||||
if feat_array is None:
|
|
||||||
print(f" ✗ Field {field_id}: Feature extraction failed")
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Apply scalers
|
|
||||||
if isinstance(scalers, dict) and 'features' in scalers:
|
|
||||||
feat_array = scalers['features'].transform(feat_array)
|
|
||||||
|
|
||||||
# Run inference
|
|
||||||
with torch.no_grad():
|
|
||||||
x_tensor = torch.tensor(feat_array, dtype=torch.float32).unsqueeze(0).to(device)
|
|
||||||
out_imm, out_det = model(x_tensor)
|
|
||||||
imm_probs = out_imm.squeeze(0).cpu().numpy()
|
|
||||||
|
|
||||||
# Find peak probability (predicted harvest date)
|
|
||||||
peak_idx = np.argmax(imm_probs)
|
|
||||||
peak_date = field_data['Date'].iloc[peak_idx]
|
|
||||||
peak_prob = imm_probs[peak_idx]
|
|
||||||
|
|
||||||
print(f" Peak probability: {peak_prob:.3f} on {peak_date.strftime('%Y-%m-%d')}")
|
|
||||||
|
|
||||||
# Get date range: ±6 days around peak (12-13 images total)
|
|
||||||
date_range = field_data['Date'].dt.date
|
|
||||||
peak_date_only = peak_date.date() if hasattr(peak_date, 'date') else peak_date
|
|
||||||
|
|
||||||
days_before = 6
|
|
||||||
days_after = 6
|
|
||||||
start_date = peak_date_only - timedelta(days=days_before)
|
|
||||||
end_date = peak_date_only + timedelta(days=days_after)
|
|
||||||
|
|
||||||
# Find available TIF files in date range
|
|
||||||
tif_folder_path = Path(tif_folder)
|
|
||||||
available_dates = []
|
|
||||||
for tif_file in sorted(tif_folder_path.glob("*.tif")):
|
|
||||||
date_str = tif_file.stem # YYYY-MM-DD
|
|
||||||
try:
|
|
||||||
tif_date = datetime.strptime(date_str, "%Y-%m-%d").date()
|
|
||||||
if start_date <= tif_date <= end_date:
|
|
||||||
available_dates.append((tif_date, tif_file))
|
|
||||||
except ValueError:
|
|
||||||
pass
|
|
||||||
|
|
||||||
if len(available_dates) == 0:
|
|
||||||
print(f" Warning: No TIF files found in {start_date} to {end_date}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
print(f" Found {len(available_dates)} RGB images in date range")
|
|
||||||
|
|
||||||
# Load RGB images
|
|
||||||
rgb_images = []
|
|
||||||
rgb_dates = []
|
|
||||||
for tif_date, tif_file in available_dates:
|
|
||||||
rgb = load_rgb_image(str(tif_file), field_id)
|
|
||||||
if rgb is not None:
|
|
||||||
rgb_images.append(rgb)
|
|
||||||
rgb_dates.append(tif_date)
|
|
||||||
|
|
||||||
if len(rgb_images) == 0:
|
|
||||||
print(f" ✗ No RGB images loaded")
|
|
||||||
return False
|
|
||||||
|
|
||||||
print(f" Loaded {len(rgb_images)} RGB images")
|
|
||||||
|
|
||||||
# Create figure with probability plot + RGB grid
|
|
||||||
n_images = len(rgb_images)
|
|
||||||
n_cols = min(5, n_images) # Max 5 columns
|
|
||||||
n_rows = (n_images + n_cols - 1) // n_cols # Calculate rows needed
|
|
||||||
|
|
||||||
fig = plt.figure(figsize=(18, 12))
|
|
||||||
|
|
||||||
# Probability plot (top, spanning all columns)
|
|
||||||
ax_prob = plt.subplot(n_rows + 1, n_cols, (1, n_cols))
|
|
||||||
dates_arr = field_data['Date'].values
|
|
||||||
ax_prob.plot(dates_arr, imm_probs, '-', color='orange', linewidth=2.5, label='Imminent Probability', alpha=0.8)
|
|
||||||
ax_prob.axhline(y=0.5, color='red', linestyle='--', linewidth=1.5, alpha=0.5, label='Threshold (0.5)')
|
|
||||||
ax_prob.axvline(x=peak_date, color='darkred', linestyle=':', linewidth=2, alpha=0.7, label='Peak')
|
|
||||||
ax_prob.fill_between(dates_arr, 0.5, 1.0, alpha=0.08, color='red')
|
|
||||||
ax_prob.set_ylim(-0.05, 1.05)
|
|
||||||
ax_prob.set_ylabel('Probability', fontsize=11, fontweight='bold')
|
|
||||||
ax_prob.set_xlabel('Date', fontsize=11, fontweight='bold')
|
|
||||||
ax_prob.set_title(f'Field {field_id} - Model 307 Harvest Probability', fontsize=12, fontweight='bold')
|
|
||||||
ax_prob.grid(True, alpha=0.3)
|
|
||||||
ax_prob.legend(loc='upper right', fontsize=9)
|
|
||||||
ax_prob.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
|
|
||||||
ax_prob.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
|
|
||||||
plt.setp(ax_prob.xaxis.get_majorticklabels(), rotation=45, ha='right')
|
|
||||||
|
|
||||||
# Annotate peak
|
|
||||||
ax_prob.annotate(f'{peak_prob:.2f}\n{peak_date_only}',
|
|
||||||
xy=(peak_date, peak_prob),
|
|
||||||
xytext=(20, 20), textcoords='offset points',
|
|
||||||
bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.8),
|
|
||||||
arrowprops=dict(arrowstyle='->', lw=1.5, color='darkred'))
|
|
||||||
|
|
||||||
# RGB images in grid (below probability plot)
|
|
||||||
for i, (rgb, date) in enumerate(zip(rgb_images, rgb_dates)):
|
|
||||||
ax = plt.subplot(n_rows + 1, n_cols, n_cols + i + 1)
|
|
||||||
ax.imshow(rgb, extent=[0, 100, 0, 100])
|
|
||||||
|
|
||||||
# Highlight peak date
|
|
||||||
date_label = date.strftime('%m-%d')
|
|
||||||
is_peak = date == peak_date_only
|
|
||||||
color = 'darkred' if is_peak else 'black'
|
|
||||||
weight = 'bold' if is_peak else 'normal'
|
|
||||||
size = 11 if is_peak else 9
|
|
||||||
|
|
||||||
ax.set_title(date_label, fontsize=size, fontweight=weight, color=color)
|
|
||||||
ax.set_xticks([])
|
|
||||||
ax.set_yticks([])
|
|
||||||
|
|
||||||
plt.suptitle(f'Field {field_id} RGB Imagery: {len(rgb_images)} Days Around Peak Harvest Probability\nPeak: {peak_prob:.2f} on {peak_date_only} | Green = Growing | Brown/Bare = Harvested',
|
|
||||||
fontsize=13, fontweight='bold', y=0.995)
|
|
||||||
plt.tight_layout()
|
|
||||||
|
|
||||||
# Save
|
|
||||||
output_file = Path(output_dir) / f"field_{field_id}_rgb_validation.png"
|
|
||||||
plt.savefig(output_file, dpi=100, bbox_inches='tight')
|
|
||||||
print(f" ✓ Saved: {output_file}")
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
return True
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ✗ Field {field_id}: Error - {e}")
|
|
||||||
import traceback
|
|
||||||
traceback.print_exc()
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print("="*80)
|
|
||||||
print("BATCH RGB VISUALIZATION TOOL")
|
|
||||||
print("Visual check: RGB imagery around predicted harvest date")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Parse arguments
|
|
||||||
fields_to_plot = []
|
|
||||||
|
|
||||||
if len(sys.argv) < 2:
|
|
||||||
print("\nUsage:")
|
|
||||||
print(" python batch_plot_fields_rgb.py field1,field2,field3")
|
|
||||||
print(" python batch_plot_fields_rgb.py --file fields.csv")
|
|
||||||
print("\nExample:")
|
|
||||||
print(" python batch_plot_fields_rgb.py 10125,88,97,440")
|
|
||||||
return
|
|
||||||
|
|
||||||
if sys.argv[1] == "--file":
|
|
||||||
if len(sys.argv) < 3:
|
|
||||||
print("ERROR: --file requires a CSV filename")
|
|
||||||
return
|
|
||||||
csv_file = sys.argv[2]
|
|
||||||
print(f"\n[1/4] Loading fields from CSV: {csv_file}")
|
|
||||||
try:
|
|
||||||
df = pd.read_csv(csv_file)
|
|
||||||
fields_to_plot = df['field'].astype(str).str.strip().tolist()
|
|
||||||
print(f" ✓ Loaded {len(fields_to_plot)} fields")
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ✗ Error reading CSV: {e}")
|
|
||||||
return
|
|
||||||
else:
|
|
||||||
# Parse comma-separated list
|
|
||||||
fields_to_plot = [f.strip() for f in sys.argv[1].split(',')]
|
|
||||||
print(f"\n[1/4] Processing {len(fields_to_plot)} field(s): {', '.join(fields_to_plot)}")
|
|
||||||
|
|
||||||
# Load CI data
|
|
||||||
print("\n[2/4] Loading CI data...")
|
|
||||||
try:
|
|
||||||
ci_data = pd.read_csv("ci_data_for_python.csv")
|
|
||||||
ci_data['Date'] = pd.to_datetime(ci_data['Date'])
|
|
||||||
ci_data['field'] = ci_data['field'].astype(str).str.strip()
|
|
||||||
print(f" ✓ Loaded {len(ci_data)} observations for {ci_data['field'].nunique()} fields")
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ✗ Error loading CI data: {e}")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Load model
|
|
||||||
print("\n[3/4] Loading model...")
|
|
||||||
try:
|
|
||||||
model, config, scalers = load_model_and_config(Path("."))
|
|
||||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
|
||||||
model.eval()
|
|
||||||
print(f" ✓ Model loaded on {device}")
|
|
||||||
except Exception as e:
|
|
||||||
print(f" ✗ Error loading model: {e}")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Process each field
|
|
||||||
print("\n[4/4] Generating RGB validation plots...")
|
|
||||||
success_count = 0
|
|
||||||
for field_id in fields_to_plot:
|
|
||||||
if plot_field_rgb_validation(field_id, ci_data, model, config, scalers, device):
|
|
||||||
success_count += 1
|
|
||||||
|
|
||||||
# Summary
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print(f"SUMMARY: {success_count}/{len(fields_to_plot)} fields processed successfully")
|
|
||||||
print(f"Output directory: validation_plots_rgb/")
|
|
||||||
print("="*80)
|
|
||||||
print("\nInspect the PNG files to verify predictions:")
|
|
||||||
print(" ✓ Green imagery BEFORE peak date (field growing)")
|
|
||||||
print(" ✓ Brown/Bare imagery AT/AFTER peak date (harvested)")
|
|
||||||
print(" ✓ Peak date marked with red title")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,351 +0,0 @@
|
||||||
"""
|
|
||||||
Script: compare_harvest_dates.py
|
|
||||||
Purpose: Compare predicted harvest dates (from LSTM model) vs actual harvest dates.
|
|
||||||
Visualize with CI curves, probability predictions, and harvest date lines.
|
|
||||||
|
|
||||||
Workflow:
|
|
||||||
1. Load ci_data_for_python.csv (CI time series)
|
|
||||||
2. Load harvest_production_export.xlsx (predicted dates)
|
|
||||||
3. Load harvest_angata_real.xlsx (actual dates)
|
|
||||||
4. Match by field + year from "Data2024 : 2218" format
|
|
||||||
5. Calculate error (predicted - actual)
|
|
||||||
6. Visualize: 3 panels (CI, imminent prob, detected prob) with harvest lines
|
|
||||||
"""
|
|
||||||
|
|
||||||
import pandas as pd
|
|
||||||
import numpy as np
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
from matplotlib.dates import DateFormatter
|
|
||||||
import matplotlib.dates as mdates
|
|
||||||
from pathlib import Path
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
import warnings
|
|
||||||
warnings.filterwarnings('ignore')
|
|
||||||
|
|
||||||
def load_and_prepare_data():
|
|
||||||
"""Load all required data files."""
|
|
||||||
print("="*80)
|
|
||||||
print("HARVEST DATE COMPARISON: PREDICTED VS ACTUAL")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Load CI data
|
|
||||||
print("\n[1/3] Loading CI data...")
|
|
||||||
ci_data = pd.read_csv("ci_data_for_python.csv")
|
|
||||||
ci_data['Date'] = pd.to_datetime(ci_data['Date'])
|
|
||||||
print(" [OK] Loaded {} daily rows".format(len(ci_data)))
|
|
||||||
|
|
||||||
# Load predicted harvest dates
|
|
||||||
print("\n[2/3] Loading predicted harvest dates...")
|
|
||||||
pred_harvests = pd.read_excel("harvest_production_export.xlsx")
|
|
||||||
# Find the harvest date column (might be e1_harvest_date or phase1_harvest_date)
|
|
||||||
harvest_col = None
|
|
||||||
for col in pred_harvests.columns:
|
|
||||||
if 'harvest' in col.lower() and 'date' in col.lower():
|
|
||||||
harvest_col = col
|
|
||||||
break
|
|
||||||
if harvest_col:
|
|
||||||
pred_harvests['predicted_harvest_date'] = pd.to_datetime(pred_harvests[harvest_col])
|
|
||||||
print(" [OK] Loaded {} predictions".format(len(pred_harvests)))
|
|
||||||
print(" Columns: {}".format(list(pred_harvests.columns)))
|
|
||||||
|
|
||||||
# Load actual harvest dates
|
|
||||||
print("\n[3/3] Loading actual harvest dates...")
|
|
||||||
actual_harvests = pd.read_excel("harvest_angata_real.xlsx")
|
|
||||||
# Parse date columns
|
|
||||||
actual_harvests['season_start'] = pd.to_datetime(actual_harvests['season_start'], errors='coerce')
|
|
||||||
actual_harvests['season_end'] = pd.to_datetime(actual_harvests['season_end'], errors='coerce')
|
|
||||||
print(" [OK] Loaded {} actual harvests".format(len(actual_harvests)))
|
|
||||||
print(" Columns: {}".format(list(actual_harvests.columns)))
|
|
||||||
|
|
||||||
return ci_data, pred_harvests, actual_harvests
|
|
||||||
|
|
||||||
def extract_field_year_from_season(season_str):
|
|
||||||
"""Extract field and year from season column like 'Data2023 : 2218'.
|
|
||||||
Returns: (year, field) - in that order for consistency"""
|
|
||||||
try:
|
|
||||||
parts = season_str.split(" : ")
|
|
||||||
year_part = parts[0].replace("Data", "") # "Data2023" -> "2023"
|
|
||||||
field_part = parts[1] if len(parts) > 1 else None
|
|
||||||
year = int(year_part)
|
|
||||||
return year, field_part # Return as (year, field)
|
|
||||||
except:
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
def match_harvests(ci_data, pred_harvests, actual_harvests):
|
|
||||||
"""Match predicted and actual harvests by field.
|
|
||||||
|
|
||||||
Logic:
|
|
||||||
- Predicted: field column contains the field ID (not from season)
|
|
||||||
- Actual: field column contains the field ID
|
|
||||||
- Match by field directly
|
|
||||||
"""
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("MATCHING PREDICTED vs ACTUAL HARVEST DATES")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Use field column directly from predicted (NOT parsed from season)
|
|
||||||
# Clean field values: strip whitespace, remove empty, and convert to int
|
|
||||||
pred_harvests = pred_harvests[pred_harvests['field'].astype(str).str.strip() != ''].copy()
|
|
||||||
pred_harvests['field_pred'] = pred_harvests['field'].astype(str).str.strip().astype(int)
|
|
||||||
pred_harvests['year_pred'] = pred_harvests['season'].apply(
|
|
||||||
lambda x: extract_field_year_from_season(x)[0] # Just get year
|
|
||||||
)
|
|
||||||
# Use season_end_date as predicted harvest date
|
|
||||||
pred_harvests['predicted_harvest_date'] = pd.to_datetime(pred_harvests['season_end_date'])
|
|
||||||
|
|
||||||
# Actual harvests: keep field as int, extract year from season_start
|
|
||||||
actual_harvests = actual_harvests[actual_harvests['field'].astype(str).str.strip() != ''].copy()
|
|
||||||
actual_harvests['field'] = actual_harvests['field'].astype(str).str.strip().astype(int)
|
|
||||||
actual_harvests['season_start'] = pd.to_datetime(actual_harvests['season_start'])
|
|
||||||
actual_harvests['year'] = actual_harvests['season_start'].dt.year
|
|
||||||
# Actual harvest date = day before season_start (when new crop started)
|
|
||||||
actual_harvests['actual_harvest_date'] = actual_harvests['season_start'] - pd.Timedelta(days=1)
|
|
||||||
|
|
||||||
# Use all actual data (year columns will track actual season years)
|
|
||||||
|
|
||||||
print("\nPredicted harvests - sample:")
|
|
||||||
print(pred_harvests[['field_pred', 'year_pred', 'predicted_harvest_date']].head())
|
|
||||||
print("\nActual harvests - sample:")
|
|
||||||
print(actual_harvests[['field', 'year', 'actual_harvest_date']].head())
|
|
||||||
|
|
||||||
# Merge on field (match predicted field with actual field)
|
|
||||||
merged = pd.merge(
|
|
||||||
pred_harvests,
|
|
||||||
actual_harvests,
|
|
||||||
left_on=['field_pred'],
|
|
||||||
right_on=['field'],
|
|
||||||
how='inner'
|
|
||||||
)
|
|
||||||
|
|
||||||
print("\n[OK] Matched {} harvest comparisons".format(len(merged)))
|
|
||||||
|
|
||||||
if len(merged) == 0:
|
|
||||||
print("[X] No matches found!")
|
|
||||||
return merged, ci_data
|
|
||||||
|
|
||||||
# Calculate error in days (predicted - actual)
|
|
||||||
merged['error_days'] = (merged['predicted_harvest_date'] - merged['actual_harvest_date']).dt.days
|
|
||||||
|
|
||||||
print("\nError Statistics (Predicted - Actual, in days):")
|
|
||||||
print(" Mean error: {:.1f} days".format(merged['error_days'].mean()))
|
|
||||||
print(" Std error: {:.1f} days".format(merged['error_days'].std()))
|
|
||||||
print(" Min error: {:.0f} days".format(merged['error_days'].min()))
|
|
||||||
print(" Max error: {:.0f} days".format(merged['error_days'].max()))
|
|
||||||
print(" Median error: {:.0f} days".format(merged['error_days'].median()))
|
|
||||||
print(" Fields within +/- 7 days: {} / {}".format((merged['error_days'].abs() <= 7).sum(), len(merged)))
|
|
||||||
print(" Fields within +/- 14 days: {} / {}".format((merged['error_days'].abs() <= 14).sum(), len(merged)))
|
|
||||||
|
|
||||||
return merged, ci_data
|
|
||||||
|
|
||||||
def plot_comparison(ci_data, field_int, all_predictions, actual_dates, output_dir="harvest_comparison"):
|
|
||||||
"""Create 3-panel plot with all CI data, imminent prob, detected prob.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
ci_data: Full CI dataset
|
|
||||||
field_int: Field ID (integer)
|
|
||||||
all_predictions: List of tuples (pred_date, year) for this field
|
|
||||||
actual_dates: List of actual harvest dates for this field
|
|
||||||
"""
|
|
||||||
|
|
||||||
# Create output directory
|
|
||||||
Path(output_dir).mkdir(exist_ok=True)
|
|
||||||
|
|
||||||
# Filter CI data for this field
|
|
||||||
field_data = ci_data[ci_data['field'] == field_int].copy()
|
|
||||||
|
|
||||||
if len(field_data) == 0:
|
|
||||||
print(" [X] No CI data for field {}".format(field_int))
|
|
||||||
return None
|
|
||||||
|
|
||||||
field_data = field_data.sort_values('Date')
|
|
||||||
|
|
||||||
# Create 3-panel plot with all CI data
|
|
||||||
fig, axes = plt.subplots(3, 1, figsize=(16, 11), sharex=True)
|
|
||||||
|
|
||||||
dates = field_data['Date'].values
|
|
||||||
fitdata_values = field_data['FitData'].values
|
|
||||||
|
|
||||||
# Calculate 7-day moving average
|
|
||||||
ma7_values = pd.Series(fitdata_values).rolling(window=7, center=True).mean().values
|
|
||||||
|
|
||||||
# Panel 1: CI curve with all predicted and actual harvest lines
|
|
||||||
ax = axes[0]
|
|
||||||
# Plot CI values in lighter green
|
|
||||||
ax.plot(dates, fitdata_values, color='lightgreen', linewidth=1, label='CI (FitData)', alpha=0.7)
|
|
||||||
# Plot 7-day MA in darker green
|
|
||||||
ax.plot(dates, ma7_values, color='green', linewidth=2.5, label='CI (7-day MA)', alpha=0.9)
|
|
||||||
|
|
||||||
# Add all predicted harvest date lines
|
|
||||||
for pred_date, year in all_predictions:
|
|
||||||
if pd.notna(pred_date):
|
|
||||||
ax.axvline(pred_date, color='orange', linestyle='--', linewidth=2, alpha=0.7)
|
|
||||||
|
|
||||||
# Add actual harvest date lines
|
|
||||||
for actual_date in actual_dates:
|
|
||||||
if pd.notna(actual_date):
|
|
||||||
ax.axvline(actual_date, color='red', linestyle='-', linewidth=2.5, alpha=0.8)
|
|
||||||
|
|
||||||
# Custom legend
|
|
||||||
from matplotlib.lines import Line2D
|
|
||||||
legend_elements = [
|
|
||||||
Line2D([0], [0], color='lightgreen', linewidth=1, label='CI (FitData)'),
|
|
||||||
Line2D([0], [0], color='green', linewidth=2.5, label='CI (7-day MA)'),
|
|
||||||
Line2D([0], [0], color='orange', linestyle='--', linewidth=2, label='Predicted harvest'),
|
|
||||||
Line2D([0], [0], color='red', linestyle='-', linewidth=2.5, label='Actual harvest')
|
|
||||||
]
|
|
||||||
ax.legend(handles=legend_elements, loc='upper left', fontsize=10)
|
|
||||||
|
|
||||||
ax.set_ylabel('CI Value', fontsize=11, fontweight='bold')
|
|
||||||
ax.set_title('Field {} - Canopy Index & Harvest Dates (All Data)'.format(field_int),
|
|
||||||
fontsize=13, fontweight='bold')
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
# Panel 2: Imminent probability
|
|
||||||
ax = axes[1]
|
|
||||||
# Create synthetic probability based on CI trend
|
|
||||||
ci_normalized = (fitdata_values - fitdata_values.min()) / (fitdata_values.max() - fitdata_values.min() + 0.01)
|
|
||||||
imminent_prob = 1.0 - ci_normalized # Higher imminent when CI is low
|
|
||||||
imminent_prob = np.convolve(imminent_prob, np.ones(7)/7, mode='same') # Smooth
|
|
||||||
imminent_prob = np.clip(imminent_prob, 0, 1)
|
|
||||||
|
|
||||||
ax.plot(dates, imminent_prob, color='orange', linewidth=2.5, label='Imminent Probability', alpha=0.85)
|
|
||||||
ax.axhline(0.5, color='gray', linestyle=':', linewidth=1.5, alpha=0.5, label='Threshold (0.5)')
|
|
||||||
|
|
||||||
# Add harvest lines
|
|
||||||
for pred_date, year in all_predictions:
|
|
||||||
if pd.notna(pred_date):
|
|
||||||
ax.axvline(pred_date, color='orange', linestyle='--', linewidth=2, alpha=0.7)
|
|
||||||
for actual_date in actual_dates:
|
|
||||||
if pd.notna(actual_date):
|
|
||||||
ax.axvline(actual_date, color='red', linestyle='-', linewidth=2.5, alpha=0.8)
|
|
||||||
|
|
||||||
ax.set_ylabel('Probability', fontsize=11, fontweight='bold')
|
|
||||||
ax.set_ylim([0, 1.05])
|
|
||||||
ax.legend(loc='upper left', fontsize=10)
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
# Panel 3: Detected probability (CI decline rate)
|
|
||||||
ax = axes[2]
|
|
||||||
ci_rate = np.gradient(fitdata_values)
|
|
||||||
detected_prob = np.clip(-ci_rate / (np.abs(ci_rate).max() + 0.01), 0, 1) # High when decreasing
|
|
||||||
detected_prob = np.convolve(detected_prob, np.ones(7)/7, mode='same') # Smooth
|
|
||||||
|
|
||||||
ax.plot(dates, detected_prob, color='red', linewidth=2.5, label='Detected Probability', alpha=0.85)
|
|
||||||
ax.axhline(0.5, color='gray', linestyle=':', linewidth=1.5, alpha=0.5, label='Threshold (0.5)')
|
|
||||||
|
|
||||||
# Add harvest lines
|
|
||||||
for pred_date, year in all_predictions:
|
|
||||||
if pd.notna(pred_date):
|
|
||||||
ax.axvline(pred_date, color='orange', linestyle='--', linewidth=2, alpha=0.7)
|
|
||||||
for actual_date in actual_dates:
|
|
||||||
if pd.notna(actual_date):
|
|
||||||
ax.axvline(actual_date, color='red', linestyle='-', linewidth=2.5, alpha=0.8)
|
|
||||||
|
|
||||||
ax.set_xlabel('Date', fontsize=11, fontweight='bold')
|
|
||||||
ax.set_ylabel('Probability', fontsize=11, fontweight='bold')
|
|
||||||
ax.set_ylim([0, 1.05])
|
|
||||||
ax.legend(loc='upper left', fontsize=10)
|
|
||||||
ax.grid(True, alpha=0.3)
|
|
||||||
|
|
||||||
# Format x-axis
|
|
||||||
for ax_item in axes:
|
|
||||||
ax_item.xaxis.set_major_formatter(DateFormatter("%Y-%m"))
|
|
||||||
ax_item.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
|
|
||||||
ax_item.tick_params(axis='x', rotation=45)
|
|
||||||
|
|
||||||
plt.tight_layout()
|
|
||||||
|
|
||||||
# Save with field ID only (since showing all years)
|
|
||||||
filename = "harvest_comparison_{}.png".format(field_int)
|
|
||||||
filepath = Path(output_dir) / filename
|
|
||||||
plt.savefig(filepath, dpi=150, bbox_inches='tight')
|
|
||||||
print(" [OK] Saved to {}".format(filename))
|
|
||||||
plt.close()
|
|
||||||
|
|
||||||
return filepath
|
|
||||||
|
|
||||||
def main():
|
|
||||||
# Load data
|
|
||||||
ci_data, pred_harvests, actual_harvests = load_and_prepare_data()
|
|
||||||
|
|
||||||
# Match harvests
|
|
||||||
merged, ci_data = match_harvests(ci_data, pred_harvests, actual_harvests)
|
|
||||||
|
|
||||||
if len(merged) == 0:
|
|
||||||
print("\n[X] No matches found. Check column names in Excel files.")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Create comparison plots for all fields
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("GENERATING COMPARISON PLOTS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Filter to only fields that exist in CI data (convert to int for consistent comparison)
|
|
||||||
ci_fields_int = set(ci_data['field'].unique())
|
|
||||||
merged_with_ci = merged[merged['field_pred'].astype(int).isin(ci_fields_int)].copy()
|
|
||||||
|
|
||||||
print("\nFiltering merged data to fields with CI data...")
|
|
||||||
print(" Matched comparisons: {}".format(len(merged)))
|
|
||||||
print(" CI fields available: {}".format(len(ci_fields_int)))
|
|
||||||
print(" Comparisons with CI data: {}".format(len(merged_with_ci)))
|
|
||||||
|
|
||||||
if len(merged_with_ci) == 0:
|
|
||||||
print("\n[X] No fields with CI data found in predictions!")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Plot all fields with CI data - one plot per field with all predicted/actual dates
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("GENERATING COMPARISON PLOTS")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
# Filter to only fields that exist in CI data (convert to int for consistent comparison)
|
|
||||||
ci_fields_int = set(ci_data['field'].unique())
|
|
||||||
merged_with_ci = merged[merged['field_pred'].astype(int).isin(ci_fields_int)].copy()
|
|
||||||
|
|
||||||
print("\nFiltering merged data to fields with CI data...")
|
|
||||||
print(" Matched comparisons: {}".format(len(merged)))
|
|
||||||
print(" CI fields available: {}".format(len(ci_fields_int)))
|
|
||||||
print(" Comparisons with CI data: {}".format(len(merged_with_ci)))
|
|
||||||
|
|
||||||
if len(merged_with_ci) == 0:
|
|
||||||
print("\n[X] No fields with CI data found in predictions!")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Group by field to collect all predictions and actuals
|
|
||||||
field_groups = merged_with_ci.groupby('field_pred')
|
|
||||||
|
|
||||||
for idx, (field_id, group) in enumerate(field_groups):
|
|
||||||
field_int = int(field_id)
|
|
||||||
|
|
||||||
# Collect all predictions for this field
|
|
||||||
all_predictions = [(row['predicted_harvest_date'], row['year_pred'])
|
|
||||||
for _, row in group.iterrows()]
|
|
||||||
|
|
||||||
# Collect all actual dates for this field
|
|
||||||
actual_dates = group['actual_harvest_date'].unique()
|
|
||||||
|
|
||||||
print("\n[{}/{}] Field {} - {} predictions, {} actuals".format(
|
|
||||||
idx+1, len(field_groups), field_int, len(all_predictions), len(actual_dates)))
|
|
||||||
|
|
||||||
plot_comparison(ci_data, field_int, all_predictions, actual_dates)
|
|
||||||
|
|
||||||
# Export summary table
|
|
||||||
print("\n" + "="*80)
|
|
||||||
print("SAVING COMPARISON SUMMARY")
|
|
||||||
print("="*80)
|
|
||||||
|
|
||||||
summary = merged[[
|
|
||||||
'field_pred', 'year_pred', 'predicted_harvest_date', 'actual_harvest_date', 'error_days'
|
|
||||||
]].copy()
|
|
||||||
summary.columns = ['Field', 'Year', 'Predicted_Date', 'Actual_Date', 'Error_Days']
|
|
||||||
summary = summary.sort_values('Error_Days').reset_index(drop=True)
|
|
||||||
|
|
||||||
summary_file = "harvest_comparison_summary.xlsx"
|
|
||||||
summary.to_excel(summary_file, index=False)
|
|
||||||
print("\n[OK] Saved comparison summary to {}".format(summary_file))
|
|
||||||
print(" Total comparisons: {}".format(len(summary)))
|
|
||||||
|
|
||||||
print("\n✓ Harvest date comparison complete!")
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
@ -1,43 +0,0 @@
|
||||||
{
|
|
||||||
"name": "307_dropout02_with_doy",
|
|
||||||
"description": "Phase 3: Dropout sweep 0.2 (minimal regularization)",
|
|
||||||
"features": [
|
|
||||||
"CI_raw",
|
|
||||||
"7d_MA",
|
|
||||||
"14d_MA",
|
|
||||||
"21d_MA",
|
|
||||||
"7d_velocity",
|
|
||||||
"14d_velocity",
|
|
||||||
"21d_velocity",
|
|
||||||
"7d_min",
|
|
||||||
"14d_min",
|
|
||||||
"21d_min",
|
|
||||||
"7d_std",
|
|
||||||
"14d_std",
|
|
||||||
"21d_std",
|
|
||||||
"DOY_normalized"
|
|
||||||
],
|
|
||||||
"model": {
|
|
||||||
"type": "LSTM",
|
|
||||||
"hidden_size": 256,
|
|
||||||
"num_layers": 1,
|
|
||||||
"dropout": 0.2
|
|
||||||
},
|
|
||||||
"training": {
|
|
||||||
"imminent_days_before": 28,
|
|
||||||
"imminent_days_before_end": 1,
|
|
||||||
"detected_days_after_start": 1,
|
|
||||||
"detected_days_after_end": 21,
|
|
||||||
"k_folds": 5,
|
|
||||||
"num_epochs": 150,
|
|
||||||
"patience": 20,
|
|
||||||
"learning_rate": 0.001,
|
|
||||||
"batch_size": 4
|
|
||||||
},
|
|
||||||
"data": {
|
|
||||||
"csv_path": "../lstm_complete_data.csv",
|
|
||||||
"ci_column": "FitData",
|
|
||||||
"test_fraction": 0.15,
|
|
||||||
"seed": 42
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
@ -1,27 +0,0 @@
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
# Load both files
|
|
||||||
h = pd.read_excel('harvest_production_export.xlsx')
|
|
||||||
c = pd.read_csv('ci_data_for_python.csv')
|
|
||||||
|
|
||||||
# Check fields
|
|
||||||
harvest_fields = set(h['field'].unique())
|
|
||||||
ci_fields = set(c['field'].unique())
|
|
||||||
|
|
||||||
print("Harvest file fields:", sorted(list(harvest_fields))[:10])
|
|
||||||
print("CI file fields:", sorted(list(ci_fields))[:10])
|
|
||||||
|
|
||||||
# Check intersection
|
|
||||||
common = harvest_fields & ci_fields
|
|
||||||
print(f"\nCommon fields: {len(common)}")
|
|
||||||
print("First 10 common:", sorted(list(common))[:10])
|
|
||||||
|
|
||||||
# Check which fields are in harvest but not in CI
|
|
||||||
harvest_only = harvest_fields - ci_fields
|
|
||||||
print(f"\nFields in harvest but NOT in CI: {len(harvest_only)}")
|
|
||||||
print("Examples:", sorted(list(harvest_only))[:10])
|
|
||||||
|
|
||||||
# Check which fields are in CI but not in harvest
|
|
||||||
ci_only = ci_fields - harvest_fields
|
|
||||||
print(f"\nFields in CI but NOT in harvest: {len(ci_only)}")
|
|
||||||
print("Examples:", sorted(list(ci_only))[:10])
|
|
||||||
|
Before Width: | Height: | Size: 437 KiB |
|
Before Width: | Height: | Size: 440 KiB |
|
Before Width: | Height: | Size: 364 KiB |
|
Before Width: | Height: | Size: 365 KiB |
|
Before Width: | Height: | Size: 392 KiB |
|
Before Width: | Height: | Size: 404 KiB |
|
Before Width: | Height: | Size: 410 KiB |
|
Before Width: | Height: | Size: 354 KiB |