venus/WATCHDOG_DEBUG_REPORT.md
2026-03-26 18:23:55 +00:00

Watchdog Reboot Debug Report - 2026-03-26

Problem

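Cerbo GX (einstein) triggered a watchdog reboot on Mar 24 at 20:13:43 after sustained high load. The 1/5/15-minute load averages were 11/9/7 against watchdog thresholds of 0/10/6: the 15-minute average (7) exceeded its limit (6), while the 5-minute average (9) stayed under its limit (10). The 1-minute threshold of 0 presumably means that check is disabled.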
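
Reading the figures above as 1/5/15-minute values, and assuming a threshold of 0 disables the check for that window, the comparison can be sketched as follows. The function name is illustrative and is not part of the watchdog itself.

```python
def exceeded_thresholds(load, limits):
    """Return the windows (in minutes) whose load average exceeds its limit.

    Assumption: a limit of 0 means the check for that window is disabled.
    """
    windows = (1, 5, 15)
    return [
        window
        for window, value, limit in zip(windows, load, limits)
        if limit > 0 and value > limit
    ]


# Figures from this report: load 11/9/7 against thresholds 0/10/6.
print(exceeded_thresholds((11, 9, 7), (0, 10, 6)))  # [15]
```

Under that reading, only the 15-minute window tripped the watchdog.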

Root Cause

Cumulative CPU pressure from seven custom D-Bus services running simultaneously on a dual-core ARM Cortex-A7: six Python services plus dbus-raymarine-publisher (multicast decoder).

Services Running at Time of Reboot

  1. dbus-anchor-alarm - 1 Hz, 10 D-Bus reads/sec, circle fitting on 1800 points, JSON serializing 5000 track points every 5s
  2. dbus-generator-ramp - 2 Hz (500ms), multiple D-Bus reads + regression math
  3. dbus-tides - 1 Hz, SQLite writes + harmonic calculations
  4. dbus-meteoblue-forecast - periodic HTTP API calls
  5. dbus-no-foreign-land - periodic GPS uploads
  6. dbus-windy-station - periodic sensor uploads
  7. dbus-raymarine-publisher - continuous multicast protobuf decoding (12% CPU sustained)

Key Findings

  • dbus-daemon: 13% CPU (bottleneck from ~20 services making synchronous GetValue() calls)
  • dbus-raymarine-publisher: 12% CPU (multiple threads continuously parsing multicast packets)
  • Total Python CPU: ~50% aggregate across all custom services
  • Memory: OK (609MB available of 1GB)
  • No crash loops: All services had uptimes above 152K seconds (about 42 hours)

Optimizations Applied (v2.1.0)

1. dbus-generator-ramp

  • Changed: Main loop from 500ms → 1000ms (2Hz → 1Hz)
  • File: dbus-generator-ramp/config.py line 257
  • Impact: 50% reduction in D-Bus polling and math operations
  • Version: 2.0.0 → 2.1.0

2. dbus-anchor-alarm

  • Changed: JSON update interval from 5s → 20s
  • File: dbus-anchor-alarm/anchor_alarm.py line 78
  • Impact: 75% reduction in large JSON serializations
  • Changed: Track buffer from 5000 → 2000 points
  • File: dbus-anchor-alarm/track_buffer.py line 16
  • Impact: 60% less data to serialize and transmit over MQTT
  • Version: 2.0.0 → 2.1.0
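
The two anchor-alarm changes amount to a bounded buffer plus a rate-limited export: 1 - 5/20 = 75% fewer serializations and 1 - 2000/5000 = 60% fewer points per serialization. A minimal sketch of that pattern is below. The class and method names are hypothetical and do not reflect the actual anchor_alarm.py or track_buffer.py code; only the two constants come from this report.

```python
import json
import time
from collections import deque

MAX_POINTS = 2000                 # value from this report
JSON_UPDATE_INTERVAL_SEC = 20.0   # value from this report


class ThrottledTrack:
    """Hypothetical helper: bounded track plus rate-limited JSON export."""

    def __init__(self, clock=time.monotonic):
        # deque(maxlen=...) silently drops the oldest point when full.
        self._points = deque(maxlen=MAX_POINTS)
        self._clock = clock
        self._last_export = None

    def add(self, lat, lon):
        self._points.append((lat, lon))

    def maybe_export(self):
        """Return a JSON string if the interval has elapsed, else None."""
        now = self._clock()
        if (self._last_export is not None
                and now - self._last_export < JSON_UPDATE_INTERVAL_SEC):
            return None
        self._last_export = now
        return json.dumps(list(self._points))
```

Calling maybe_export() from a 1 Hz loop then serializes at most once every 20 seconds, and never more than 2000 points.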

Load Average Results

Before optimizations:

 14:40:30  load average: 1.04, 2.30, 2.71
 14:44:10  load average: 3.91, 2.50, 2.65  (after anchor-alarm restarted)
 14:52:37  load average: 1.29, 3.93, 3.77
 14:55:10  load average: 7.04, 6.21, 4.76  (critical)

After optimizations (v2.1.0):

 15:05:21  load average: 1.69, 3.95, 4.24
 15:06:01  load average: 0.99, 3.48, 4.07
 15:06:41  load average: 0.64, 3.08, 3.91
 15:07:42  load average: 1.35, 2.87, 3.78  (trending down)

Status: 15-minute load declined from 4.76 → 3.78 and is already below the watchdog threshold (6.0); it should keep dropping over the next 15 minutes.
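
For a rough sense of how quickly the 15-minute figure should fall: Linux load averages are exponentially damped averages, so a projection can be sketched as below. This is an estimate under a loud assumption, namely that the number of runnable tasks stays near the latest 1-minute value (1.35); the function name and that input are illustrative.

```python
import math


def project_load(current, steady_tasks, seconds, window=900.0):
    """Project an exponentially damped load average `seconds` ahead.

    current:      load average now (here, the 15-minute figure)
    steady_tasks: assumed constant number of runnable tasks (hypothetical)
    window:       time constant in seconds (900 for the 15-minute average)
    """
    decay = math.exp(-seconds / window)
    return steady_tasks + (current - steady_tasks) * decay


# 15-minute load of 3.78, assuming ~1.35 runnable tasks for the next 15 min.
print(round(project_load(3.78, 1.35, 15 * 60), 2))  # 2.24
```

If the real run queue is busier than assumed, the projected value will be correspondingly higher.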

Remaining Concerns

High-Risk Service: dbus-raymarine-publisher (12% sustained CPU)

  • Continuous multicast parsing with multiple threads
  • Running at 1Hz D-Bus update but packet decoding is continuous
  • Recommendation: Monitor this service closely; consider adding --update-interval 2000 (1Hz → 0.5Hz) if load remains elevated

System-Wide D-Bus Pressure

  • dbus-daemon at 13% CPU points to heavy bus traffic
  • 20+ services making synchronous calls
  • Future optimization: Implement D-Bus signal subscriptions instead of polling where possible
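
The signal-subscription idea can be sketched as a local cache that is updated when a value changes, so reads become dictionary lookups instead of synchronous GetValue() round trips. This sketch assumes dbus-python's add_signal_receiver and assumes the publishing services emit a PropertiesChanged signal on the com.victronenergy.BusItem interface carrying a dict with a "Value" key; both should be verified against the services in use. ValueCache and subscribe are hypothetical names.

```python
class ValueCache:
    """Keeps the latest value per (service, path), updated by change signals."""

    def __init__(self):
        self._values = {}

    def on_change(self, service, path, changes):
        # `changes` mirrors the dict carried by the change signal; only the
        # "Value" key is used here (assumed signal payload).
        if "Value" in changes:
            self._values[(service, path)] = changes["Value"]

    def get(self, service, path, default=None):
        # Local lookup: no round trip through dbus-daemon.
        return self._values.get((service, path), default)


def subscribe(bus, cache, service, path):
    """Route change signals for one item into the cache.

    `bus` is expected to be a dbus-python bus object; the signal and
    interface names are assumptions, not confirmed by this report.
    """
    bus.add_signal_receiver(
        lambda changes: cache.on_change(service, path, changes),
        signal_name="PropertiesChanged",
        dbus_interface="com.victronenergy.BusItem",
        bus_name=service,
        path=path,
    )
```

A service would still need one initial GetValue() per item to seed the cache, after which the 1 Hz loop reads from cache.get() only.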

Monitoring Commands

Check load every minute:

ssh cerbo "watch -n 60 uptime"

Monitor Python service CPU:

ssh cerbo "while true; do top -b -n 1 | grep python3 | head -n 10; sleep 30; done"

Check service health:

ssh cerbo "svstat /service/dbus-* 2>/dev/null | grep -v 'up.*seconds'"
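
The health check above hides every service that reports as up, so a service restarting in a loop would not appear in its output. A small sketch for flagging short uptimes is below. It assumes daemontools-style svstat lines of the form `name: up (pid N) S seconds`; the function name and the 300-second cutoff are illustrative.

```python
import re

# Matches lines such as:
#   /service/dbus-tides: up (pid 1234) 152345 seconds
_UP_LINE = re.compile(r"^(?P<name>\S+): up \(pid \d+\) (?P<secs>\d+) seconds")


def recently_restarted(svstat_output, min_uptime=300):
    """Return names of 'up' services whose uptime is below min_uptime seconds."""
    flagged = []
    for line in svstat_output.splitlines():
        match = _UP_LINE.match(line.strip())
        if match and int(match.group("secs")) < min_uptime:
            flagged.append(match.group("name"))
    return flagged
```

The function takes the text captured from the svstat command, for example via subprocess, and returns an empty list when nothing restarted recently.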

Next Steps if Load Remains High

  1. Reduce the raymarine publisher update rate to 0.5Hz (2000ms interval)
  2. Consider disabling debug logging on anchor-alarm (SQLite writes every 15s)
  3. Evaluate if all 7 services need to run continuously (some could be on-demand)
  4. Long-term: consolidate low-frequency services (meteoblue, windy, nfl) into a single process
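
The consolidation in step 4 could take the shape of one process that owns a small scheduler and runs each low-frequency job when it is due. The sketch below is illustrative only: PeriodicRunner is a hypothetical name, the jobs are plain callables, and a real implementation on this platform would more likely hook into the services' existing main loop.

```python
import heapq
import itertools


class PeriodicRunner:
    """Runs several low-frequency jobs from one process (illustrative)."""

    def __init__(self):
        self._queue = []               # heap of (due_time, seq, interval, job)
        self._seq = itertools.count()  # tie-breaker so jobs are never compared

    def add(self, interval_sec, job, start=0.0):
        entry = (start + interval_sec, next(self._seq), interval_sec, job)
        heapq.heappush(self._queue, entry)

    def run_due(self, now):
        """Run every job whose due time has passed; return how many ran."""
        due_jobs = []
        while self._queue and self._queue[0][0] <= now:
            due_jobs.append(heapq.heappop(self._queue))
        for _, _, interval, job in due_jobs:
            job()
            # Reschedule relative to `now` to avoid catch-up bursts.
            entry = (now + interval, next(self._seq), interval, job)
            heapq.heappush(self._queue, entry)
        return len(due_jobs)
```

A single loop calling run_due(time.monotonic()) once per second would replace three separate interpreters and their individual D-Bus connections.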

Files Modified

  • dbus-generator-ramp/config.py (main_loop_interval_ms: 500 → 1000)
  • dbus-generator-ramp/dbus-generator-ramp.py (VERSION: 2.0.0 → 2.1.0)
  • dbus-generator-ramp/build-package.sh (VERSION: 1.0.0 → 2.1.0)
  • dbus-anchor-alarm/config.py (VERSION: 2.0.0 → 2.1.0)
  • dbus-anchor-alarm/anchor_alarm.py (_JSON_UPDATE_INTERVAL_SEC: 5.0 → 20.0)
  • dbus-anchor-alarm/track_buffer.py (MAX_POINTS: 5000 → 2000)
  • dbus-anchor-alarm/build-package.sh (VERSION: 2.0.0 → 2.1.0)

Deployed Packages

  • dbus-generator-ramp-2.1.0.tar.gz (installed and running)
  • dbus-anchor-alarm-2.1.0.tar.gz (installed and running)