upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

author DanConwayDev <DanConwayDev@protonmail.com> 2026-01-09 14:12:24 +0000
committer DanConwayDev <DanConwayDev@protonmail.com> 2026-01-09 14:12:24 +0000
commit 93a1684f068603b354ba3c05957a25459c73de05
tree 324e6d0e2a6a34fd4804ef94133cd35233081bb9 /docs
parent c34492069abacae67482af4c8356241958a524f7
feat(sync): add ConnectedDegraded status for failed historic sync
- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries

Fixes issue where relays with failed negentropy retries would incorrectly transition to Connected status despite missing data. Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
Diffstat (limited to 'docs')
-rw-r--r-- docs/explanation/grasp-02-proactive-sync.md 29
-rw-r--r-- docs/explanation/monitoring.md 9
2 files changed, 30 insertions, 8 deletions
diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md
index e1fb367..b17b8bf 100644
--- a/docs/explanation/grasp-02-proactive-sync.md
+++ b/docs/explanation/grasp-02-proactive-sync.md
@@ -79,6 +79,8 @@ pub enum ConnectionStatus {
     Syncing,
     /// Successfully connected, historic sync completed
     Connected,
+    /// Successfully connected, historic sync failed but live sync active
+    ConnectedDegraded,
 }
 
 /// Complete state for a single relay - combines sync needs with connection lifecycle
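The commit message mentions an `is_live_sync_active()` helper that this hunk does not show. A minimal sketch of what such a helper might look like follows. The variant list is inferred from the metric values documented later in this commit, and the method body is an assumption (it treats only the two post-sync states as "live"), not the code from the repository.

```rust
/// Connection lifecycle states. `Disconnected` and `Connecting` are inferred
/// from the documented metric values; they are not visible in the hunk above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    /// Connected, historic sync in progress
    Syncing,
    /// Successfully connected, historic sync completed
    Connected,
    /// Successfully connected, historic sync failed but live sync active
    ConnectedDegraded,
}

impl ConnectionStatus {
    /// Assumed body: live sync is considered active once historic sync has
    /// finished, whether it succeeded or not.
    pub fn is_live_sync_active(self) -> bool {
        matches!(self, Self::Connected | Self::ConnectedDegraded)
    }
}
```

With a helper like this, call sites that previously matched on `Connected` alone can ask one question instead of listing both variants.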
@@ -207,15 +209,19 @@ stateDiagram-v2
     Disconnected --> Connecting: retry_disconnected_relays → try_connect_relay
     Connecting --> Syncing: success → handle_connect_or_reconnect
     Connecting --> Disconnected: failure + record in health tracker
-    Syncing --> Connected: all historic batches complete → check_and_complete_historic_sync
+    Syncing --> Connected: all batches succeed → check_and_complete_historic_sync
+    Syncing --> ConnectedDegraded: any batch failed → check_and_complete_historic_sync
     Syncing --> Disconnected: connection lost → handle_disconnect
     Connected --> Disconnected: connection lost → handle_disconnect
+    ConnectedDegraded --> Disconnected: connection lost → handle_disconnect
     Connected --> [*]: intentional disconnect via check_disconnects
+    ConnectedDegraded --> [*]: intentional disconnect via check_disconnects
 
     note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap
     note right of Connecting: connection attempt with timeout
     note right of Syncing: historic sync in progress<br/>event loop spawned here
     note right of Connected: historic sync complete<br/>last_connected tracked for since filter
+    note right of ConnectedDegraded: historic sync failed (missing events)<br/>live sync active, partial data
 ```
 
 ### Connection Flow Methods
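The edges visible in this hunk can be written down as a transition check. The sketch below is illustrative only: `is_valid_transition` is a hypothetical helper, the terminal `[*]` state is not modelled, and any edges defined in parts of the diagram outside this hunk are not covered.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedDegraded,
}

/// Hypothetical helper: true when the hunk's diagram has an edge `from --> to`.
fn is_valid_transition(from: ConnectionStatus, to: ConnectionStatus) -> bool {
    use ConnectionStatus::*;
    matches!(
        (from, to),
        (Disconnected, Connecting)
            | (Connecting, Syncing)
            | (Connecting, Disconnected)
            | (Syncing, Connected)
            | (Syncing, ConnectedDegraded)
            | (Syncing, Disconnected)
            | (Connected, Disconnected)
            | (ConnectedDegraded, Disconnected)
    )
}
```

Among these edges there is none from `ConnectedDegraded` to `Connected`, so a degraded relay reaches `Connected` only by disconnecting and passing through `Syncing` again.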
@@ -240,17 +246,28 @@ When a relay first connects, it enters the **Syncing** state and begins historic
 Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncIndex`. As EOSE messages arrive:
 
 - `handle_eose()` confirms each batch via `confirm_batch()`
-- `confirm_batch()` moves items to confirmed state and calls `check_and_complete_historic_sync()`
-- `check_and_complete_historic_sync()` checks if `PendingSyncIndex` is empty for this relay
-- When empty: transitions `Syncing` → `Connected`, sets `historic_sync_completed = true`
+- `confirm_batch()` moves items to confirmed state, tracks if batch failed, and calls `check_and_complete_historic_sync()`
+- `check_and_complete_historic_sync()` uses a **double-check pattern** to avoid race conditions:
+  1. First check: Are there pending batches? If yes, return early
+  2. Wait 6 seconds (batch window + buffer) for self-subscriber to process in-flight events
+  3. Second check: Are there still no pending batches? If yes, return early
+  4. If no pending batches after wait:
+     - If any batch failed: transition `Syncing` → `ConnectedDegraded`
+     - If all batches succeeded: transition `Syncing` → `Connected`
+     - Set `historic_sync_completed = true`
+
+**Why the double-check?** There's an async gap between receiving EOSE and the self-subscriber processing events to create Layer 2/3 filters. The 6-second wait (5s batch window + 1s buffer) ensures we don't prematurely mark sync complete while Layer 2/3 batches are being created.
+
+**Batch Failure Tracking**: When negentropy retry protection triggers (relay returns zero requested events on retry), the batch is marked as `failed = true`. This causes the relay to transition to `ConnectedDegraded` instead of `Connected`, signaling that live sync is active but historic sync is incomplete.
 
 **Metrics tracking**: The `ngit_sync_relay_connected` metric shows:
 - `0` = Disconnected
 - `1` = Connecting
 - `2` = Syncing (historic sync in progress)
 - `3` = Connected (historic sync complete, live sync active)
+- `4` = ConnectedDegraded (historic sync failed, live sync active, partial data)
 
-This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live".
+This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live" vs "degraded (missing historic data)".
 
 ### Event Loop Lifecycle
 
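The double-check and failure tracking described in this hunk can be sketched synchronously. This is a simplified model under stated assumptions, not the repository's implementation: the pending-batch index is reduced to a counter (`pending_batches` is a hypothetical field), the 6-second wait is replaced by a caller-supplied closure that may register new batches, and the second check is read as "return early if batches are pending again", which is what the stated rationale implies.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnectionStatus {
    Syncing,
    Connected,
    ConnectedDegraded,
}

/// Simplified stand-in for per-relay state. `pending_batches` is hypothetical;
/// the two `historic_sync_*` names come from the commit text.
struct RelayState {
    status: ConnectionStatus,
    pending_batches: usize,
    historic_sync_had_failures: bool,
    historic_sync_completed: bool,
}

impl RelayState {
    fn new(pending_batches: usize) -> Self {
        Self {
            status: ConnectionStatus::Syncing,
            pending_batches,
            historic_sync_had_failures: false,
            historic_sync_completed: false,
        }
    }

    /// Confirm one batch and remember whether any batch has failed so far.
    fn confirm_batch(&mut self, failed: bool) {
        self.pending_batches = self.pending_batches.saturating_sub(1);
        self.historic_sync_had_failures |= failed;
    }

    /// `wait` stands in for the 6-second pause and may add new batches.
    /// Returns true when the relay left `Syncing`.
    fn check_and_complete_historic_sync(&mut self, wait: impl FnOnce(&mut Self)) -> bool {
        // First check: bail out while batches are still pending.
        if self.status != ConnectionStatus::Syncing || self.pending_batches > 0 {
            return false;
        }
        wait(&mut *self);
        // Second check: Layer 2/3 batches may have appeared during the wait.
        if self.pending_batches > 0 {
            return false;
        }
        self.status = if self.historic_sync_had_failures {
            ConnectionStatus::ConnectedDegraded
        } else {
            ConnectionStatus::Connected
        };
        self.historic_sync_completed = true;
        true
    }
}
```

Passing the wait as a closure keeps the sketch free of an async runtime and lets a test simulate batches arriving during the pause.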
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index d2d20c0..cc164ab 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -98,7 +98,7 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added
 
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) |
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_degraded) |
 | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
 | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
 | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
@@ -115,8 +115,9 @@ The `ngit_sync_relay_connected` metric tracks the connection lifecycle:
 - `1` = **Connecting** - Connection attempt in progress
 - `2` = **Syncing** - Connected, historic sync in progress
 - `3` = **Connected** - Connected, historic sync complete, live sync active
+- `4` = **ConnectedDegraded** - Connected, historic sync failed, live sync active, partial data
 
-This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected).
+This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "degraded - missing historic data" (ConnectedDegraded).
 
 ### Relay Health States
 
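The documented gauge values map one-to-one onto the connection states. A small sketch of that mapping in both directions is shown here; the method names `as_metric_value` and `from_metric_value` are hypothetical and only the numeric values come from the docs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedDegraded,
}

impl ConnectionStatus {
    /// Value reported by `ngit_sync_relay_connected`, per the documented list.
    fn as_metric_value(self) -> i64 {
        match self {
            Self::Disconnected => 0,
            Self::Connecting => 1,
            Self::Syncing => 2,
            Self::Connected => 3,
            Self::ConnectedDegraded => 4,
        }
    }

    /// Inverse mapping; unknown gauge values yield `None`.
    fn from_metric_value(value: i64) -> Option<Self> {
        match value {
            0 => Some(Self::Disconnected),
            1 => Some(Self::Connecting),
            2 => Some(Self::Syncing),
            3 => Some(Self::Connected),
            4 => Some(Self::ConnectedDegraded),
            _ => None,
        }
    }
}
```

Keeping both directions as exhaustive matches means a newly added variant fails to compile until it is given a gauge value.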
@@ -136,10 +137,14 @@ sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected
 sum by (relay) (ngit_sync_relay_connected == 1) # Connecting
 sum by (relay) (ngit_sync_relay_connected == 2) # Syncing
 sum by (relay) (ngit_sync_relay_connected == 3) # Connected
+sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedDegraded
 
 # Relays still syncing (not yet fully caught up)
 count(ngit_sync_relay_connected == 2)
 
+# Relays with degraded sync (missing historic data)
+count(ngit_sync_relay_connected == 4)
+
 # Connection success rate over last hour
 sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
 / sum(rate(ngit_sync_connection_attempts_total[1h]))