# Watchdog Reboot Debug Report - 2026-03-26
## Problem
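Cerbo GX (einstein) triggered a watchdog reboot on Mar 24 at 20:13:43 after a sustained high load average (11/9/7 over 1/5/15 minutes) exceeded the watchdog thresholds (0/10/6). Only the 15-minute value was over its limit (7 vs. 6): the 5-minute value (9) was still under 10, and a threshold of 0 normally disables the 1-minute check in the watchdog daemon.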
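To make the threshold comparison concrete, here is a minimal sketch that checks a load-average triple against per-window limits. It assumes the three limits map to the 1-, 5-, and 15-minute windows and that a limit of 0 means "check disabled", as with the Linux watchdog daemon's `max-load-*` settings. The function name `exceeded_windows` is illustrative and not part of any Venus OS tooling.

```python
def exceeded_windows(load, limits):
    """Return the load-average windows whose value is above its limit.

    A limit of 0 is treated as "check disabled" (assumption, see lead-in).
    """
    names = ("1min", "5min", "15min")
    return [
        name
        for name, value, limit in zip(names, load, limits)
        if limit > 0 and value > limit
    ]


# Values from this report: load 11/9/7 against thresholds 0/10/6.
print(exceeded_windows((11, 9, 7), (0, 10, 6)))  # ['15min']
```

Under these assumptions the reboot is attributable to the 15-minute window alone, which is why the 15-minute column is the one to watch in the results below.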
## Root Cause
Cumulative CPU pressure from the **7 custom D-Bus services** listed below (Python services plus the **dbus-raymarine-publisher** multicast decoder) running simultaneously on a dual-core ARM Cortex-A7.
### Services Running at Time of Reboot
1. **dbus-anchor-alarm** - 1 Hz, 10 D-Bus reads/sec, circle fitting on 1800 points, JSON serializing 5000 track points every 5s
2. **dbus-generator-ramp** - 2 Hz (500ms), multiple D-Bus reads + regression math
3. **dbus-tides** - 1 Hz, SQLite writes + harmonic calculations
4. **dbus-meteoblue-forecast** - periodic HTTP API calls
5. **dbus-no-foreign-land** - periodic GPS uploads
6. **dbus-windy-station** - periodic sensor uploads
7. **dbus-raymarine-publisher** - continuous multicast protobuf decoding (12% CPU sustained)
### Key Findings
- **dbus-daemon**: 13% CPU (bottleneck from ~20 services making synchronous GetValue() calls)
- **dbus-raymarine-publisher**: 12% CPU (multiple threads continuously parsing multicast packets)
- **Total Python CPU**: ~50% aggregate across all custom services
- **Memory**: OK (609MB available of 1GB)
- **No crash loops**: All services had 152K+ second uptimes (roughly 42 hours)
## Optimizations Applied (v2.1.0)
### 1. dbus-generator-ramp
- **Changed**: Main loop from 500ms → 1000ms (2Hz → 1Hz)
- **File**: `dbus-generator-ramp/config.py` line 257
- **Impact**: 50% reduction in D-Bus polling and math operations
- **Version**: 2.0.0 → 2.1.0
### 2. dbus-anchor-alarm
- **Changed**: JSON update interval from 5s → 20s
- **File**: `dbus-anchor-alarm/anchor_alarm.py` line 78
- **Impact**: 75% reduction in large JSON serializations
- **Changed**: Track buffer from 5000 → 2000 points
- **File**: `dbus-anchor-alarm/track_buffer.py` line 16
- **Impact**: 60% less data to serialize and transmit over MQTT
- **Version**: 2.0.0 → 2.1.0
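The impact percentages above can be re-derived with a few lines of arithmetic. This is only a check of the stated figures, assuming work scales linearly with loop rate (1 / interval) and with buffer size; `reduction_pct` is a helper name introduced here for illustration.

```python
def reduction_pct(old, new):
    """Percentage reduction when a quantity goes from old to new."""
    return round(100 * (1 - new / old), 1)


# Interval changes are compared as rates (iterations per second).
print(reduction_pct(1 / 0.5, 1 / 1.0))  # generator-ramp loop, 500 ms -> 1000 ms: 50.0
print(reduction_pct(1 / 5, 1 / 20))     # anchor-alarm JSON updates, 5 s -> 20 s: 75.0
print(reduction_pct(5000, 2000))        # track buffer points, 5000 -> 2000: 60.0
```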
## Load Average Results
**Before optimizations:**
```
14:40:30 load average: 1.04, 2.30, 2.71
14:44:10 load average: 3.91, 2.50, 2.65 (after anchor-alarm restarted)
14:52:37 load average: 1.29, 3.93, 3.77
14:55:10 load average: 7.04, 6.21, 4.76 (critical)
```
**After optimizations (v2.1.0):**
```
15:05:21 load average: 1.69, 3.95, 4.24
15:06:01 load average: 0.99, 3.48, 4.07
15:06:41 load average: 0.64, 3.08, 3.91
15:07:42 load average: 1.35, 2.87, 3.78 (trending down)
```
**Status**: 15-minute load has declined from 4.76 (last reading before the optimizations) to 3.78, which is below the watchdog threshold (6.0). It should continue dropping over the next 15 minutes if the 1-minute load stays low.
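As a rough way to reason about how fast the 15-minute figure can fall, the sketch below treats it as an exponentially damped average that relaxes toward a steady run-queue length. This is an approximation (the kernel samples every few seconds in fixed-point arithmetic), and the steady value of 1.35 is an assumption borrowed from the last 1-minute reading, not a measurement of future load.

```python
import math


def project_load(current, steady, minutes, window_minutes=15):
    """Project a damped load average forward, assuming a constant run-queue length."""
    decay = math.exp(-minutes / window_minutes)
    return steady + (current - steady) * decay


# Start from the last 15-minute reading (3.78) and an assumed steady load of 1.35.
for minutes in (5, 15, 30):
    print(minutes, round(project_load(3.78, 1.35, minutes), 2))
```

If the real 1-minute load spikes again, as it did at 14:55:10, the projection no longer applies and the 15-minute value will climb instead.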
## Remaining Concerns
### High-Risk Service: dbus-raymarine-publisher (12% sustained CPU)
- Continuous multicast parsing with multiple threads
- Running at 1Hz D-Bus update but packet decoding is continuous
- **Recommendation**: Monitor this service closely; consider adding `--update-interval 2000` (1Hz → 0.5Hz) if load remains elevated
### System-Wide D-Bus Pressure
- `dbus-daemon` at 13% CPU indicates bus saturation
- 20+ services making synchronous calls
- **Future optimization**: Implement D-Bus signal subscriptions instead of polling where possible
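A minimal sketch of the signal-subscription idea follows. It assumes the services expose values through Victron's `com.victronenergy.BusItem` interface and its `PropertiesChanged` signal, and that `bus` is a dbus-python bus object (for example `dbus.SystemBus()`) in a process running a GLib main loop. `ValueCache` and `subscribe` are hypothetical names, not existing code in these services.

```python
class ValueCache:
    """Keeps the latest value per D-Bus object path, updated by signals."""

    def __init__(self):
        self._values = {}

    def on_properties_changed(self, changes, path=None):
        # The signal payload is a dict; "Value" holds the new value when present.
        if "Value" in changes:
            self._values[path] = changes["Value"]

    def get(self, path, default=None):
        # Local lookup: no synchronous round trip through dbus-daemon.
        return self._values.get(path, default)


def subscribe(bus, service_name, cache):
    """Route PropertiesChanged signals from one service into the cache."""
    bus.add_signal_receiver(
        cache.on_properties_changed,
        signal_name="PropertiesChanged",
        dbus_interface="com.victronenergy.BusItem",
        bus_name=service_name,
        path_keyword="path",  # dbus-python passes the object path as `path`
    )


# Simulated signal delivery, so the cache logic can be tried without a bus.
cache = ValueCache()
cache.on_properties_changed({"Value": 12.6, "Text": "12.6V"}, path="/Dc/0/Voltage")
print(cache.get("/Dc/0/Voltage"))  # 12.6
```

With this pattern a service would still need one initial `GetValue()` per path at startup to seed the cache; after that, reads in the main loop become dictionary lookups instead of bus calls.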
## Monitoring Commands
Check load every minute:
```bash
ssh -t cerbo "watch -n 60 uptime"
```
Monitor Python service CPU:
```bash
ssh cerbo "while true; do top -b -n 1 | grep python3 | head -n 10; sleep 30; done"
```
Check service health:
```bash
ssh cerbo "svstat /service/dbus-* 2>/dev/null | grep -v 'up.*seconds'"
```
## Next Steps if Load Remains High
1. Reduce raymarine publisher update rate to 2000ms
2. Consider disabling debug logging on anchor-alarm (SQLite writes every 15s)
3. Evaluate if all 7 services need to run continuously (some could be on-demand)
4. Long-term: consolidate low-frequency services (meteoblue, windy, nfl) into a single process
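For step 4, one possible shape for a consolidated process is a single loop that runs several periodic tasks, sketched below with only the standard library. The class name `PeriodicRunner` and the task names in the usage comment are hypothetical; the real services may be better served by their existing GLib main loop and timers.

```python
import heapq
import time


class PeriodicRunner:
    """Runs several low-frequency tasks from one process and one loop."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._queue = []  # heap of (next_run_time, sequence, interval_s, task)
        self._seq = 0

    def add(self, interval_s, task):
        # The sequence number breaks ties so tasks are never compared directly.
        entry = (self._clock() + interval_s, self._seq, interval_s, task)
        heapq.heappush(self._queue, entry)
        self._seq += 1

    def run(self, duration_s):
        """Run due tasks until duration_s seconds have elapsed."""
        deadline = self._clock() + duration_s
        while self._queue and self._queue[0][0] <= deadline:
            due, seq, interval_s, task = heapq.heappop(self._queue)
            delay = due - self._clock()
            if delay > 0:
                self._sleep(delay)
            try:
                task()
            except Exception as exc:  # one failing task must not stop the others
                print(f"task failed: {exc}")
            heapq.heappush(self._queue, (due + interval_s, seq, interval_s, task))


# Usage (hypothetical task functions and intervals):
#   runner = PeriodicRunner()
#   runner.add(600, upload_windy_observation)
#   runner.add(900, upload_nfl_position)
#   runner.add(3600, fetch_meteoblue_forecast)
#   runner.run(24 * 3600)
```

The trade-off is that one interpreter replaces three, which saves memory and idle wakeups, but a blocking HTTP call in one task delays the others, so each task would need a request timeout.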
## Files Modified
- `dbus-generator-ramp/config.py` (main_loop_interval_ms: 500 → 1000)
- `dbus-generator-ramp/dbus-generator-ramp.py` (VERSION: 2.0.0 → 2.1.0)
- `dbus-generator-ramp/build-package.sh` (VERSION: 1.0.0 → 2.1.0)
- `dbus-anchor-alarm/config.py` (VERSION: 2.0.0 → 2.1.0)
- `dbus-anchor-alarm/anchor_alarm.py` (_JSON_UPDATE_INTERVAL_SEC: 5.0 → 20.0)
- `dbus-anchor-alarm/track_buffer.py` (MAX_POINTS: 5000 → 2000)
- `dbus-anchor-alarm/build-package.sh` (VERSION: 2.0.0 → 2.1.0)
## Deployed Packages
- `dbus-generator-ramp-2.1.0.tar.gz` (installed and running)
- `dbus-anchor-alarm-2.1.0.tar.gz` (installed and running)