SmartCane/data_validation_tool/README.md
2026-01-06 14:17:37 +01:00


# SmartCane Data Validation Tool
A standalone, client-side tool for validating Excel harvest data and GeoJSON field boundaries before they are uploaded to the SmartCane system.
## Features
### 🚦 Traffic Light System
- **🟢 GREEN**: All checks passed
- **🟡 YELLOW**: Warnings detected (non-critical issues)
- **🔴 RED**: Errors detected (blocking issues)
### ✅ Validation Checks
1. **Excel Column Validation**
   - Checks for all 8 required columns: `field`, `sub_field`, `year`, `season_start`, `season_end`, `age`, `sub_area`, `tonnage_ha`
   - Identifies extra columns that will be ignored
   - Shows missing columns that must be added
2. **GeoJSON Properties Validation**
   - Checks all features have required properties: `field`, `sub_field`
   - Identifies redundant properties that will be ignored
3. **Coordinate Reference System (CRS)**
   - Validates correct CRS: **EPSG:32736 (UTM Zone 36S)**
   - This CRS was validated from your Angata farm coordinates
   - Explains why this specific CRS is required
4. **Field Name Matching**
   - Compares field names between Excel and GeoJSON
   - Shows which fields exist in only one dataset
   - Highlights misspellings or missing fields
   - Provides complete matching summary table
5. **Data Type & Content Validation**
   - Checks column data types:
     - `year`: Must be integer
     - `season_start`, `season_end`: Must be valid dates
     - `age`, `sub_area`, `tonnage_ha`: Must be numeric (decimal)
   - Identifies rows with missing `season_start` dates
   - Flags invalid date formats and numeric values
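
The column check in item 1 can be sketched as a plain set comparison. This is a minimal illustration, not the tool's actual code: `checkColumns` and its return shape are hypothetical, and the header row is assumed to be an array of strings already read from the first sheet.

```javascript
// The 8 required columns listed above.
const REQUIRED_COLUMNS = [
  "field", "sub_field", "year", "season_start",
  "season_end", "age", "sub_area", "tonnage_ha",
];

// Hypothetical helper: compare a header row against the required columns.
function checkColumns(headers) {
  const present = new Set(headers.map((h) => String(h).trim()));
  const missing = REQUIRED_COLUMNS.filter((c) => !present.has(c));
  const extra = [...present].filter((c) => !REQUIRED_COLUMNS.includes(c));
  return { missing, extra };
}

const columnResult = checkColumns([
  "field", "sub_field", "year", "season_start", "age", "notes",
]);
console.log(columnResult.missing); // columns that must be added
console.log(columnResult.extra);   // columns that will be ignored
```

Under the traffic light rules, a non-empty `missing` list would be an error and a non-empty `extra` list only a warning.
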
## File Requirements
### Excel File (harvest.xlsx)
```
| field  | sub_field  | year | season_start | season_end | age | sub_area | tonnage_ha |
|--------|------------|------|--------------|------------|-----|----------|------------|
| kowawa | kowawa     | 2023 | 2023-01-15   | 2024-01-14 | 1.5 | 45       | 125.5      |
| Tamu   | Tamu Upper | 2023 | 2023-02-01   | 2024-01-31 | 1.0 | 30       | 98.0       |
```
**Data Types:**
- `field`, `sub_field`: Text (can be numeric as text)
- `year`: Integer
- `season_start`, `season_end`: Date (YYYY-MM-DD format)
- `age`, `sub_area`, `tonnage_ha`: Decimal/Float
**Extra columns** are allowed but will not be processed.
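
A row-level check for these data types might look like the sketch below. It assumes each row has already been parsed into a plain object with dates as `YYYY-MM-DD` strings. `validateRow`, `isIsoDate`, and `isNumeric` are hypothetical names, and the handling of Excel-native date cells is not covered here.

```javascript
// Accepts only real calendar dates written as YYYY-MM-DD.
function isIsoDate(value) {
  if (typeof value !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const d = new Date(value + "T00:00:00Z");
  // The round trip rejects dates such as 2023-02-30 that some engines roll over.
  return !Number.isNaN(d.getTime()) && d.toISOString().slice(0, 10) === value;
}

// Accepts finite numbers and non-empty numeric strings.
function isNumeric(value) {
  if (typeof value === "number") return Number.isFinite(value);
  if (typeof value !== "string" || value.trim() === "") return false;
  return Number.isFinite(Number(value));
}

// Returns a list of human-readable problems for one parsed row.
function validateRow(row) {
  const problems = [];
  if (!isNumeric(row.year) || !Number.isInteger(Number(row.year))) {
    problems.push("year must be an integer");
  }
  // Assumption: both date columns are treated the same way in this sketch.
  for (const col of ["season_start", "season_end"]) {
    if (row[col] === undefined || row[col] === null || row[col] === "") {
      problems.push(`${col} is missing`);
    } else if (!isIsoDate(row[col])) {
      problems.push(`${col} is not a valid YYYY-MM-DD date`);
    }
  }
  for (const col of ["age", "sub_area", "tonnage_ha"]) {
    if (!isNumeric(row[col])) problems.push(`${col} must be numeric`);
  }
  return problems;
}

console.log(validateRow({
  field: "Tamu", sub_field: "Tamu Upper", year: 2023,
  season_start: "2023-02-01", season_end: "2024-01-31",
  age: 1.0, sub_area: 30, tonnage_ha: 98.0,
})); // empty list: the row is valid
```
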
### GeoJSON File (pivot.geojson)
```json
{
  "type": "FeatureCollection",
  "crs": {
    "type": "name",
    "properties": {
      "name": "urn:ogc:def:crs:EPSG::32736"
    }
  },
  "features": [
    {
      "type": "Feature",
      "properties": {
        "field": "kowawa",
        "sub_field": "kowawa"
      },
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [...]
      }
    }
  ]
}
```
**Required Properties:**
- `field`: Field identifier (must match Excel)
- `sub_field`: Sub-field identifier (must match Excel)
**Optional Properties:**
- `STATUS`, `name`, `age`, etc. are allowed but not required; they are ignored during validation
**CRS:**
- Must be EPSG:32736 (UTM Zone 36S)
- This was determined from analyzing your Angata farm coordinates
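
The property and CRS requirements above can be checked on the parsed GeoJSON object roughly as follows. This is a sketch under assumptions: `getEpsgCode` and `checkGeoJson` are hypothetical helpers, and the CRS is assumed to be declared in the `crs.properties.name` member as in the example file.

```javascript
const EXPECTED_EPSG = "32736";

// Pull the EPSG code out of names like "urn:ogc:def:crs:EPSG::32736" or "EPSG:32736".
function getEpsgCode(geojson) {
  const name = geojson && geojson.crs && geojson.crs.properties
    ? geojson.crs.properties.name
    : undefined;
  if (typeof name !== "string") return null;
  const match = name.match(/EPSG:{1,2}(\d+)$/i);
  return match ? match[1] : null;
}

// Report the declared CRS and every feature lacking a required property.
function checkGeoJson(geojson) {
  const features = Array.isArray(geojson.features) ? geojson.features : [];
  const missingProps = [];
  features.forEach((feature, index) => {
    const props = feature.properties || {};
    for (const key of ["field", "sub_field"]) {
      if (props[key] === undefined || props[key] === null || props[key] === "") {
        missingProps.push({ index, property: key });
      }
    }
  });
  const epsg = getEpsgCode(geojson);
  return { epsg, crsOk: epsg === EXPECTED_EPSG, missingProps };
}

const sample = {
  type: "FeatureCollection",
  crs: { type: "name", properties: { name: "urn:ogc:def:crs:EPSG::32736" } },
  features: [
    { type: "Feature", properties: { field: "kowawa", sub_field: "kowawa" }, geometry: null },
    { type: "Feature", properties: { field: "Tamu" }, geometry: null },
  ],
};
console.log(checkGeoJson(sample));
```

In this sample the CRS passes, and the second feature is reported because it has no `sub_field`.
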
## Deployment
### Local Use (Recommended for Security)
1. Download the `data_validation_tool` folder
2. Open `index.html` in a web browser
3. Files are processed entirely client-side - no data is sent to servers
### Netlify Deployment
1. Connect to your GitHub repository
2. Leave the build command empty (the tool is static files and needs no build step)
3. Set publish directory: `data_validation_tool`
4. Deploy
Or use Netlify CLI:
```bash
npm install -g netlify-cli
netlify deploy --dir data_validation_tool
```
### Manual Testing
1. Use the provided sample files:
   - Excel: `laravel_app/storage/app/aura/Data/harvest.xlsx`
   - GeoJSON: `laravel_app/storage/app/aura/Data/pivot.geojson`
2. Open `index.html`
3. Upload both files
4. Review validation results
## Technical Details
### Browser Requirements
- Modern browser with ES6 support (Chrome, Firefox, Safari, Edge)
- Must support FileReader API and JSON parsing
- Requires XLSX library for Excel parsing
### Dependencies
- **XLSX.js**: For reading Excel files (loaded via CDN in index.html)
### What Happens When You Upload
1. File is read into memory (client-side only)
2. Excel: Parsed using XLSX library into JSON
3. GeoJSON: Parsed directly as JSON
4. All validation runs in your browser
5. Results displayed locally
6. **No files are sent to any server**
## Validation Rules
### Traffic Light Logic
**All GREEN (✓ Passed)**
- All required columns/properties present
- Correct CRS
- All field names match
- All data types valid
**YELLOW (⚠️ Warnings)**
- Extra columns detected (will be ignored)
- Extra properties detected (will be ignored)
- Missing `season_start` dates in some rows
- Data type issues in specific rows
**RED (✗ Failed)**
- Missing required columns/properties
- Wrong CRS
- Field names mismatch between files
- Fundamental data structure issues
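
The rules above reduce to a simple precedence: any error gives RED, otherwise any warning gives YELLOW, otherwise GREEN. A minimal sketch, assuming each check reports an `errors` list and a `warnings` list (this result shape and the name `trafficLight` are hypothetical):

```javascript
// Each check reports blocking errors and non-critical warnings.
function trafficLight(checks) {
  if (checks.some((c) => c.errors.length > 0)) return "RED";
  if (checks.some((c) => c.warnings.length > 0)) return "YELLOW";
  return "GREEN";
}

const exampleChecks = [
  { name: "columns", errors: [], warnings: ["Extra column: notes"] },
  { name: "crs", errors: [], warnings: [] },
];
console.log(trafficLight(exampleChecks)); // YELLOW
```
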
### CRS Explanation
From your project's geospatial analysis:
- **Original issue**: Angata farm GeoJSON had coordinates in UTM Zone 37S but marked as WGS84
- **Root cause**: UTM Zone mismatch - farm is actually in UTM Zone 36S
- **Solution**: Reproject to EPSG:32736 (UTM Zone 36S)
- **Why**: This aligns with actual Angata farm coordinates (longitude ~34.4°E, which falls inside UTM Zone 36, spanning 30°E to 36°E)
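
The zone can be confirmed with the standard UTM arithmetic: zones are 6° wide and numbered eastward from 180°W. The sketch below ignores the special zones around Norway and Svalbard, and the latitude passed in the example is only an assumed southern-hemisphere value, since the README gives the farm's longitude but not its latitude.

```javascript
// Standard UTM zone: 6-degree bands counted eastward from 180°W.
function utmZone(longitude) {
  return Math.floor((longitude + 180) / 6) + 1;
}

// WGS 84 / UTM EPSG codes: 326xx in the northern hemisphere, 327xx in the southern.
function utmEpsg(longitude, latitude) {
  const base = latitude >= 0 ? 32600 : 32700;
  return base + utmZone(longitude);
}

console.log(utmZone(34.4));       // 36
console.log(utmEpsg(34.4, -1.0)); // 32736 (latitude -1.0 is an assumed value)
```
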
## Troubleshooting
### "Failed to read Excel file"
- Ensure file is `.xlsx` format
- File should not be open in Excel while uploading
- Try saving as Excel 2007+ format
### "Failed to parse GeoJSON"
- Ensure file is valid JSON
- Check for syntax errors (extra commas, missing brackets)
- Use an online JSON validator such as jsonlint.com
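
To see the exact parse error locally, the file contents can be run through `JSON.parse` with basic structure checks. This is an illustrative sketch: `parseGeoJson` is a hypothetical helper, not necessarily how the tool reports errors.

```javascript
// Parse GeoJSON text and return either the data or a readable error.
function parseGeoJson(text) {
  let data;
  try {
    data = JSON.parse(text);
  } catch (err) {
    return { ok: false, error: `Invalid JSON: ${err.message}` };
  }
  if (!data || data.type !== "FeatureCollection" || !Array.isArray(data.features)) {
    return { ok: false, error: "Not a GeoJSON FeatureCollection with a features array" };
  }
  return { ok: true, data };
}

// A trailing comma is a common cause of parse failures.
console.log(parseGeoJson('{"type": "FeatureCollection", "features": [],}'));
console.log(parseGeoJson('{"type": "FeatureCollection", "features": []}').ok);
```
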
### "Wrong CRS detected"
- GeoJSON must explicitly state CRS as EPSG:32736
- Example: `"name": "urn:ogc:def:crs:EPSG::32736"`
- Reproject in QGIS or R if needed
### "Field names don't match"
- Check for typos and capitalization differences
- Remove leading or trailing spaces from field names
- Use exactly the same field names in both files
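
A quick way to spot such mismatches is to compare the two name lists exactly, then flag pairs that differ only by case or surrounding whitespace. The sketch below is illustrative: `compareFieldNames` and its output shape are hypothetical, and `extra_field` is a made-up name used only for the example.

```javascript
// Compare field names from both files and flag near-misses.
function compareFieldNames(excelNames, geojsonNames) {
  const excel = new Set(excelNames);
  const geo = new Set(geojsonNames);
  const onlyInExcel = [...excel].filter((n) => !geo.has(n));
  const onlyInGeoJson = [...geo].filter((n) => !excel.has(n));

  // Near-misses: names that match once trimmed and lowercased.
  const normalize = (s) => String(s).trim().toLowerCase();
  const geoByNorm = new Map(onlyInGeoJson.map((n) => [normalize(n), n]));
  const likelyTypos = onlyInExcel
    .filter((n) => geoByNorm.has(normalize(n)))
    .map((n) => ({ excel: n, geojson: geoByNorm.get(normalize(n)) }));

  return { onlyInExcel, onlyInGeoJson, likelyTypos };
}

const nameReport = compareFieldNames(
  ["kowawa", "Tamu "],               // note the trailing space
  ["kowawa", "tamu", "extra_field"]  // note the lowercase "tamu"
);
console.log(nameReport.likelyTypos); // pairs that probably refer to the same field
```

Here `"Tamu "` and `"tamu"` are reported as a likely typo pair, while `extra_field` exists only in the GeoJSON.
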
## Future Enhancements
- [ ] Download validation report as PDF
- [ ] Batch upload multiple Excel/GeoJSON pairs
- [ ] Auto-detect and suggest field mappings
- [ ] Geometry validity checks (self-intersecting polygons)
- [ ] Area comparison between Excel and GeoJSON
- [ ] Export cleaned/standardized files
## Support
For questions about data validation requirements, contact the SmartCane team.
---
**Tool Version**: 1.0
**Last Updated**: December 2025
**CRS Reference**: EPSG:32736 (UTM Zone 36S)