upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

path: root/docs/explanation/monitoring.md
author: DanConwayDev <DanConwayDev@protonmail.com> 2026-01-09 14:12:24 +0000
committer: DanConwayDev <DanConwayDev@protonmail.com> 2026-01-09 14:12:24 +0000
commit: 93a1684f068603b354ba3c05957a25459c73de05
tree: 324e6d0e2a6a34fd4804ef94133cd35233081bb9 /docs/explanation/monitoring.md
parent: c34492069abacae67482af4c8356241958a524f7
feat(sync): add ConnectedDegraded status for failed historic sync

- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries

Fixes issue where relays with failed negentropy retries would incorrectly
transition to Connected status despite missing data. Now operators can
distinguish 'fully synced' vs 'degraded (partial data)'.

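The transition described in the commit message can be sketched roughly as follows. This is an illustrative sketch, not the code from the commit: only the names `ConnectionStatus::ConnectedDegraded`, `PendingBatch.failed`, `RelayState.historic_sync_had_failures` and `is_live_sync_active()` come from the message, and their exact types and signatures are assumed here. The methods `as_metric`, `record_batch` and `finish_historic_sync` are hypothetical names introduced for the example.

```rust
/// Connection states as documented for the `ngit_sync_relay_connected` gauge.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedDegraded,
}

impl ConnectionStatus {
    /// Hypothetical helper: gauge value per the metrics table (0 to 4).
    pub fn as_metric(self) -> i64 {
        match self {
            Self::Disconnected => 0,
            Self::Connecting => 1,
            Self::Syncing => 2,
            Self::Connected => 3,
            Self::ConnectedDegraded => 4,
        }
    }

    /// Assumed signature: live sync is active in both "connected" variants.
    pub fn is_live_sync_active(self) -> bool {
        matches!(self, Self::Connected | Self::ConnectedDegraded)
    }
}

/// Assumed shape: only the `failed` field is named in the commit message.
pub struct PendingBatch {
    pub failed: bool,
}

/// Assumed shape: only `historic_sync_had_failures` is named in the commit message.
pub struct RelayState {
    pub status: ConnectionStatus,
    pub historic_sync_had_failures: bool,
}

impl RelayState {
    /// Hypothetical: remember at relay level that some batch failed.
    pub fn record_batch(&mut self, batch: &PendingBatch) {
        if batch.failed {
            self.historic_sync_had_failures = true;
        }
    }

    /// Hypothetical: leave Syncing once historic sync ends, choosing
    /// ConnectedDegraded if any batch failed and Connected otherwise.
    pub fn finish_historic_sync(&mut self) {
        if self.status == ConnectionStatus::Syncing {
            self.status = if self.historic_sync_had_failures {
                ConnectionStatus::ConnectedDegraded
            } else {
                ConnectionStatus::Connected
            };
        }
    }
}
```

Under these assumptions, a relay that records one failed batch and then finishes historic sync ends in `ConnectedDegraded` (gauge value 4) with live sync still active, while a relay with no failed batches ends in `Connected` (gauge value 3).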
Diffstat (limited to 'docs/explanation/monitoring.md')
-rw-r--r-- docs/explanation/monitoring.md 9
1 files changed, 7 insertions, 2 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index d2d20c0..cc164ab 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -98,7 +98,7 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added
 
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) |
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_degraded) |
 | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
 | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
 | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
@@ -115,8 +115,9 @@ The `ngit_sync_relay_connected` metric tracks the connection lifecycle:
 - `1` = **Connecting** - Connection attempt in progress
 - `2` = **Syncing** - Connected, historic sync in progress
 - `3` = **Connected** - Connected, historic sync complete, live sync active
+- `4` = **ConnectedDegraded** - Connected, historic sync failed, live sync active, partial data
 
-This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected).
+This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "degraded - missing historic data" (ConnectedDegraded).
 
 ### Relay Health States
 
@@ -136,10 +137,14 @@ sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected
 sum by (relay) (ngit_sync_relay_connected == 1) # Connecting
 sum by (relay) (ngit_sync_relay_connected == 2) # Syncing
 sum by (relay) (ngit_sync_relay_connected == 3) # Connected
+sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedDegraded
 
 # Relays still syncing (not yet fully caught up)
 count(ngit_sync_relay_connected == 2)
 
+# Relays with degraded sync (missing historic data)
+count(ngit_sync_relay_connected == 4)
+
 # Connection success rate over last hour
 sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
 / sum(rate(ngit_sync_connection_attempts_total[1h]))