| author | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 14:12:24 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 14:12:24 +0000 |
| commit | 93a1684f068603b354ba3c05957a25459c73de05 (patch) | |
| tree | 324e6d0e2a6a34fd4804ef94133cd35233081bb9 /docs/explanation/monitoring.md | |
| parent | c34492069abacae67482af4c8356241958a524f7 (diff) | |
feat(sync): add ConnectedDegraded status for failed historic sync

- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries

Fixes issue where relays with failed negentropy retries would
incorrectly transition to Connected status despite missing data.
Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
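
The transition described in this commit message could look roughly like the sketch below. It is not the code from this commit: only `ConnectionStatus::ConnectedDegraded`, `PendingBatch.failed`, `RelayState.historic_sync_had_failures`, `is_live_sync_active()` and the gauge values 0 to 4 come from the message and the docs diff. The other names (`as_metric`, `new_syncing`, `record_batch`, `finish_historic_sync`) are hypothetical, and deciding the final status when historic sync ends (rather than on the first failed batch) is an assumption.

```rust
/// Connection lifecycle states. The discriminants mirror the gauge values
/// documented for `ngit_sync_relay_connected`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected = 0,
    Connecting = 1,
    Syncing = 2,
    Connected = 3,
    ConnectedDegraded = 4,
}

impl ConnectionStatus {
    /// Value exported to the gauge (hypothetical helper name).
    pub fn as_metric(self) -> i64 {
        self as i64
    }

    /// True for both states in which live sync is running.
    pub fn is_live_sync_active(self) -> bool {
        matches!(self, Self::Connected | Self::ConnectedDegraded)
    }
}

/// One historic-sync batch. Assumption: `failed` is set once retries are exhausted.
#[derive(Debug, Default)]
pub struct PendingBatch {
    pub failed: bool,
}

/// Per-relay sync state, reduced to the two fields this sketch needs.
#[derive(Debug)]
pub struct RelayState {
    pub status: ConnectionStatus,
    pub historic_sync_had_failures: bool,
}

impl RelayState {
    /// Hypothetical constructor: a relay that has just started historic sync.
    pub fn new_syncing() -> Self {
        Self {
            status: ConnectionStatus::Syncing,
            historic_sync_had_failures: false,
        }
    }

    /// Record a finished batch, remembering any failure at relay level.
    pub fn record_batch(&mut self, batch: &PendingBatch) {
        if batch.failed {
            self.historic_sync_had_failures = true;
        }
    }

    /// Hypothetical hook called when historic sync ends: pick the final status.
    /// Does nothing unless the relay is currently in `Syncing`.
    pub fn finish_historic_sync(&mut self) {
        if self.status != ConnectionStatus::Syncing {
            return;
        }
        self.status = if self.historic_sync_had_failures {
            ConnectionStatus::ConnectedDegraded
        } else {
            ConnectionStatus::Connected
        };
    }
}
```

Keeping the failure flag on the relay rather than only on each batch means one failed batch is enough to mark the relay as degraded, even if later batches succeed.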
Diffstat (limited to 'docs/explanation/monitoring.md')
| -rw-r--r-- | docs/explanation/monitoring.md | 9 |
1 files changed, 7 insertions, 2 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index d2d20c0..cc164ab 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -98,7 +98,7 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added
 
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) |
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_degraded) |
 | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
 | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
 | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
@@ -115,8 +115,9 @@ The `ngit_sync_relay_connected` metric tracks the connection lifecycle:
 - `1` = **Connecting** - Connection attempt in progress
 - `2` = **Syncing** - Connected, historic sync in progress
 - `3` = **Connected** - Connected, historic sync complete, live sync active
+- `4` = **ConnectedDegraded** - Connected, historic sync failed, live sync active, partial data
 
-This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected).
+This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "degraded - missing historic data" (ConnectedDegraded).
 
 ### Relay Health States
 
@@ -136,10 +137,14 @@ sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected
 sum by (relay) (ngit_sync_relay_connected == 1) # Connecting
 sum by (relay) (ngit_sync_relay_connected == 2) # Syncing
 sum by (relay) (ngit_sync_relay_connected == 3) # Connected
+sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedDegraded
 
 # Relays still syncing (not yet fully caught up)
 count(ngit_sync_relay_connected == 2)
 
+# Relays with degraded sync (missing historic data)
+count(ngit_sync_relay_connected == 4)
+
 # Connection success rate over last hour
 sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
 / sum(rate(ngit_sync_connection_attempts_total[1h]))