# Watchdog Reboot Debug Report - 2026-03-26

## Problem
The Cerbo GX (einstein) triggered a watchdog reboot on Mar 24 at 20:13:43 because the sustained load average (11/9/7 over 1/5/15 minutes) exceeded the watchdog thresholds (0/10/6).
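To see which averaging window tripped the watchdog, the reported values can be compared against the thresholds one by one. The sketch below assumes the convention of the Linux `watchdog` daemon's `max-load-1`/`max-load-5`/`max-load-15` settings, where a threshold of 0 disables that check; the function name `exceeded_windows` is illustrative and not part of any Venus OS tooling.

```python
def exceeded_windows(load, thresholds):
    """Return the averaging windows (in minutes) whose load exceeds its threshold.

    Assumption: a threshold of 0 means the check for that window is disabled.
    """
    windows = (1, 5, 15)
    return [
        w
        for w, value, limit in zip(windows, load, thresholds)
        if limit > 0 and value > limit
    ]


# Values from the Mar 24 reboot: load 11/9/7 against thresholds 0/10/6.
print(exceeded_windows((11, 9, 7), (0, 10, 6)))  # [15]
```

Under that assumption only the 15-minute average (7 > 6) was over its limit, which points to sustained overload rather than a short spike.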

## Root Cause
Cumulative CPU pressure from **seven custom services** running simultaneously on a dual-core ARM Cortex-A7: six Python D-Bus services plus **dbus-raymarine-publisher** (multicast decoder).

### Services Running at Time of Reboot
1. **dbus-anchor-alarm** - 1 Hz, 10 D-Bus reads/sec, circle fitting on 1800 points, JSON serializing 5000 track points every 5s
2. **dbus-generator-ramp** - 2 Hz (500ms), multiple D-Bus reads + regression math
3. **dbus-tides** - 1 Hz, SQLite writes + harmonic calculations
4. **dbus-meteoblue-forecast** - periodic HTTP API calls
5. **dbus-no-foreign-land** - periodic GPS uploads
6. **dbus-windy-station** - periodic sensor uploads
7. **dbus-raymarine-publisher** - continuous multicast protobuf decoding (12% CPU sustained)

### Key Findings
- **dbus-daemon**: 13% CPU (bottleneck from ~20 services making synchronous GetValue() calls)
- **dbus-raymarine-publisher**: 12% CPU (multiple threads continuously parsing multicast packets)
- **Total Python CPU**: ~50% aggregate across all custom services
- **Memory**: OK (609MB available of 1GB)
- **No crash loops**: All services had 152K+ second uptimes

## Optimizations Applied (v2.1.0)

### 1. dbus-generator-ramp
- **Changed**: Main loop from 500ms → 1000ms (2Hz → 1Hz)
- **File**: `dbus-generator-ramp/config.py` line 257
- **Impact**: 50% reduction in D-Bus polling and math operations
- **Version**: 2.0.0 → 2.1.0

### 2. dbus-anchor-alarm
- **Changed**: JSON update interval from 5s → 20s
- **File**: `dbus-anchor-alarm/anchor_alarm.py` line 78
- **Impact**: 75% reduction in large JSON serializations
- **Changed**: Track buffer from 5000 → 2000 points
- **File**: `dbus-anchor-alarm/track_buffer.py` line 16
- **Impact**: 60% less data to serialize and transmit over MQTT
- **Version**: 2.0.0 → 2.1.0
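The two anchor-alarm changes can be pictured as a bounded buffer combined with a time-based throttle on serialization. The following is a minimal sketch, not the service's actual code: the class `TrackPublisher` and its injected `clock` are hypothetical, and only the constant names and values (`MAX_POINTS`, `_JSON_UPDATE_INTERVAL_SEC`) are taken from this report.

```python
import json
import time
from collections import deque

MAX_POINTS = 2000                  # was 5000
_JSON_UPDATE_INTERVAL_SEC = 20.0   # was 5.0


class TrackPublisher:
    """Hypothetical holder for track points with throttled JSON output."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # A deque with maxlen drops the oldest point automatically when full.
        self._points = deque(maxlen=MAX_POINTS)
        self._last_publish = None

    def add_point(self, lat, lon):
        self._points.append((lat, lon))

    def maybe_serialize(self):
        """Return a JSON string at most once per interval, otherwise None."""
        now = self._clock()
        if (
            self._last_publish is not None
            and now - self._last_publish < _JSON_UPDATE_INTERVAL_SEC
        ):
            return None
        self._last_publish = now
        return json.dumps(list(self._points))
```

In a 1 Hz main loop, `add_point()` and `maybe_serialize()` would both be called on every tick, but the expensive `json.dumps` only runs on roughly every 20th tick and never sees more than 2000 points.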

## Load Average Results

**Before optimizations:**
```
14:40:30 load average: 1.04, 2.30, 2.71
14:44:10 load average: 3.91, 2.50, 2.65 (after anchor-alarm restarted)
14:52:37 load average: 1.29, 3.93, 3.77
14:55:10 load average: 7.04, 6.21, 4.76 (critical)
```

**After optimizations (v2.1.0):**
```
15:05:21 load average: 1.69, 3.95, 4.24
15:06:01 load average: 0.99, 3.48, 4.07
15:06:41 load average: 0.64, 3.08, 3.91
15:07:42 load average: 1.35, 2.87, 3.78 (trending down)
```

**Status**: The 15-minute load has declined from 4.76 to 3.78 and is below the watchdog threshold (6.0); it should continue dropping over the next 15 minutes.
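The expectation that the 15-minute figure keeps falling follows from how Linux computes load averages: each one is an exponentially damped average, so after demand drops it decays toward the new steady level with a time constant roughly equal to its window. A rough projection could look like the sketch below. It assumes the instantaneous load settles near the latest 1-minute reading of 1.35, which is an assumption rather than a measurement, and `project_load` is a hypothetical helper.

```python
import math


def project_load(current, steady, minutes, window=15.0):
    """Project a damped load average `minutes` ahead.

    Models the average as decaying exponentially from `current` toward
    `steady` with a time constant of `window` minutes.
    """
    return steady + (current - steady) * math.exp(-minutes / window)


# 15-minute load at 15:07:42 was 3.78; 1.35 is the assumed steady level.
for minutes in (5, 10, 15):
    print(f"+{minutes:2d} min: {project_load(3.78, 1.35, minutes):.2f}")
```

If the steady level turns out higher, for example after another service restart, the projection shifts up accordingly, so the measured values remain the real check.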

## Remaining Concerns

### High-Risk Service: dbus-raymarine-publisher (12% sustained CPU)
- Continuous multicast parsing with multiple threads
- D-Bus updates run at 1Hz, but packet decoding is continuous
- **Recommendation**: Monitor this service closely; consider adding `--update-interval 2000` (1Hz → 0.5Hz) if load remains elevated

### System-Wide D-Bus Pressure
- `dbus-daemon` at 13% CPU indicates bus saturation
- 20+ services making synchronous calls
- **Future optimization**: Implement D-Bus signal subscriptions instead of polling where possible
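One way the signal-based approach could be structured is a small local cache that is updated by change signals and read by the service loop, so that routine reads no longer go through `dbus-daemon`. This is a sketch under assumptions, not code from any of the services: `ValueCache` and `subscribe` are hypothetical names, and the wiring assumes dbus-python plus the `PropertiesChanged` signal on the `com.victronenergy.BusItem` interface with a payload dict that carries a `Value` key.

```python
class ValueCache:
    """Keeps the latest value per (service, path), fed by change signals."""

    def __init__(self):
        self._values = {}

    def on_properties_changed(self, changes, service=None, path=None):
        # Assumption: the signal payload is a dict carrying the new 'Value'.
        if "Value" in changes:
            self._values[(service, path)] = changes["Value"]

    def get(self, service, path, default=None):
        # Local lookup; no round trip through dbus-daemon.
        return self._values.get((service, path), default)


def subscribe(cache, bus_name, object_path):
    """Wire the cache to D-Bus (needs dbus-python and a running GLib main loop)."""
    import dbus
    from dbus.mainloop.glib import DBusGMainLoop

    DBusGMainLoop(set_as_default=True)
    bus = dbus.SystemBus()
    bus.add_signal_receiver(
        lambda changes: cache.on_properties_changed(changes, bus_name, object_path),
        signal_name="PropertiesChanged",
        dbus_interface="com.victronenergy.BusItem",
        bus_name=bus_name,
        path=object_path,
    )


# Simulated signal delivery, using a made-up service name and path.
cache = ValueCache()
cache.on_properties_changed(
    {"Value": 12.6, "Text": "12.6V"}, "com.example.battery", "/Dc/0/Voltage"
)
print(cache.get("com.example.battery", "/Dc/0/Voltage"))
```

A service using this pattern would still need one initial read per path to seed the cache, since signals only arrive when a value changes.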

## Monitoring Commands

Check load every minute:
```bash
ssh cerbo "watch -n 60 uptime"
```

Monitor Python service CPU:
```bash
ssh cerbo "while true; do top -b -n 1 | grep python3 | head -n 10; sleep 30; done"
```

Check service health (prints only services that are not reported as up):
```bash
ssh cerbo "svstat /service/dbus-* 2>/dev/null | grep -v 'up.*seconds'"
```

## Next Steps if Load Remains High

1. Reduce the raymarine publisher update interval to 2000ms (0.5Hz)
2. Consider disabling debug logging on anchor-alarm (SQLite writes every 15s)
3. Evaluate whether all 7 services need to run continuously (some could be on-demand)
4. Long-term: consolidate low-frequency services (meteoblue, windy, no-foreign-land) into a single process

## Files Modified

- `dbus-generator-ramp/config.py` (main_loop_interval_ms: 500 → 1000)
- `dbus-generator-ramp/dbus-generator-ramp.py` (VERSION: 2.0.0 → 2.1.0)
- `dbus-generator-ramp/build-package.sh` (VERSION: 1.0.0 → 2.1.0)
- `dbus-anchor-alarm/config.py` (VERSION: 2.0.0 → 2.1.0)
- `dbus-anchor-alarm/anchor_alarm.py` (_JSON_UPDATE_INTERVAL_SEC: 5.0 → 20.0)
- `dbus-anchor-alarm/track_buffer.py` (MAX_POINTS: 5000 → 2000)
- `dbus-anchor-alarm/build-package.sh` (VERSION: 2.0.0 → 2.1.0)
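The impact percentages quoted for these changes follow directly from the before and after values. A quick check, with `reduction_pct` as an illustrative helper:

```python
def reduction_pct(before, after):
    """Percentage by which a rate or size drops going from `before` to `after`."""
    return round((1 - after / before) * 100, 1)


# Loop and serialization rates are the inverse of their intervals in seconds.
print(reduction_pct(1 / 0.5, 1 / 1.0))   # generator-ramp polling: 50.0
print(reduction_pct(1 / 5.0, 1 / 20.0))  # anchor-alarm JSON updates: 75.0
print(reduction_pct(5000, 2000))         # track buffer points: 60.0
```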

## Deployed Packages

- `dbus-generator-ramp-2.1.0.tar.gz` (installed and running)
- `dbus-anchor-alarm-2.1.0.tar.gz` (installed and running)