| author | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 14:12:24 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 14:12:24 +0000 |
| commit | 93a1684f068603b354ba3c05957a25459c73de05 (patch) | |
| tree | 324e6d0e2a6a34fd4804ef94133cd35233081bb9 /docs/explanation | |
| parent | c34492069abacae67482af4c8356241958a524f7 (diff) | |
feat(sync): add ConnectedDegraded status for failed historic sync
- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries
Fixes issue where relays with failed negentropy retries would
incorrectly transition to Connected status despite missing data.
Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
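
The status model described in this commit message can be sketched roughly as follows. This is an illustration, not the code from the commit. The variant names and the gauge values 0 to 4 come from the documentation changed in this diff. The `metric_value` method is hypothetical, and the shape of `is_live_sync_active` (a method on the enum returning `bool`) is an assumption, since the commit message only names the helper.

```rust
/// Connection lifecycle states as listed in the updated docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    /// Connected, historic sync in progress
    Syncing,
    /// Connected, historic sync completed
    Connected,
    /// Connected, historic sync failed but live sync active
    ConnectedDegraded,
}

impl ConnectionStatus {
    /// Hypothetical helper: the value reported by the
    /// `ngit_sync_relay_connected` gauge, per the metrics docs.
    pub fn metric_value(self) -> i64 {
        match self {
            ConnectionStatus::Disconnected => 0,
            ConnectionStatus::Connecting => 1,
            ConnectionStatus::Syncing => 2,
            ConnectionStatus::Connected => 3,
            ConnectionStatus::ConnectedDegraded => 4,
        }
    }

    /// Assumed signature: live sync runs in both fully synced and
    /// degraded states, so callers can match on one predicate.
    pub fn is_live_sync_active(self) -> bool {
        matches!(
            self,
            ConnectionStatus::Connected | ConnectionStatus::ConnectedDegraded
        )
    }
}
```

A single predicate like this keeps call sites from having to list both `Connected` and `ConnectedDegraded` every time they only care whether live events are flowing.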
Diffstat (limited to 'docs/explanation')
| -rw-r--r-- | docs/explanation/grasp-02-proactive-sync.md | 29 | ||||
| -rw-r--r-- | docs/explanation/monitoring.md | 9 |
2 files changed, 30 insertions, 8 deletions
diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md index e1fb367..b17b8bf 100644 --- a/docs/explanation/grasp-02-proactive-sync.md +++ b/docs/explanation/grasp-02-proactive-sync.md | |||
| @@ -79,6 +79,8 @@ pub enum ConnectionStatus { | |||
| 79 | Syncing, | 79 | Syncing, |
| 80 | /// Successfully connected, historic sync completed | 80 | /// Successfully connected, historic sync completed |
| 81 | Connected, | 81 | Connected, |
| 82 | /// Successfully connected, historic sync failed but live sync active | ||
| 83 | ConnectedDegraded, | ||
| 82 | } | 84 | } |
| 83 | 85 | ||
| 84 | /// Complete state for a single relay - combines sync needs with connection lifecycle | 86 | /// Complete state for a single relay - combines sync needs with connection lifecycle |
| @@ -207,15 +209,19 @@ stateDiagram-v2 | |||
| 207 | Disconnected --> Connecting: retry_disconnected_relays → try_connect_relay | 209 | Disconnected --> Connecting: retry_disconnected_relays → try_connect_relay |
| 208 | Connecting --> Syncing: success → handle_connect_or_reconnect | 210 | Connecting --> Syncing: success → handle_connect_or_reconnect |
| 209 | Connecting --> Disconnected: failure + record in health tracker | 211 | Connecting --> Disconnected: failure + record in health tracker |
| 210 | Syncing --> Connected: all historic batches complete → check_and_complete_historic_sync | 212 | Syncing --> Connected: all batches succeed → check_and_complete_historic_sync |
| 213 | Syncing --> ConnectedDegraded: any batch failed → check_and_complete_historic_sync | ||
| 211 | Syncing --> Disconnected: connection lost → handle_disconnect | 214 | Syncing --> Disconnected: connection lost → handle_disconnect |
| 212 | Connected --> Disconnected: connection lost → handle_disconnect | 215 | Connected --> Disconnected: connection lost → handle_disconnect |
| 216 | ConnectedDegraded --> Disconnected: connection lost → handle_disconnect | ||
| 213 | Connected --> [*]: intentional disconnect via check_disconnects | 217 | Connected --> [*]: intentional disconnect via check_disconnects |
| 218 | ConnectedDegraded --> [*]: intentional disconnect via check_disconnects | ||
| 214 | 219 | ||
| 215 | note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap | 220 | note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap |
| 216 | note right of Connecting: connection attempt with timeout | 221 | note right of Connecting: connection attempt with timeout |
| 217 | note right of Syncing: historic sync in progress<br/>event loop spawned here | 222 | note right of Syncing: historic sync in progress<br/>event loop spawned here |
| 218 | note right of Connected: historic sync complete<br/>last_connected tracked for since filter | 223 | note right of Connected: historic sync complete<br/>last_connected tracked for since filter |
| 224 | note right of ConnectedDegraded: historic sync failed (missing events)<br/>live sync active, partial data | ||
| 219 | ``` | 225 | ``` |
| 220 | 226 | ||
| 221 | ### Connection Flow Methods | 227 | ### Connection Flow Methods |
| @@ -240,17 +246,28 @@ When a relay first connects, it enters the **Syncing** state and begins historic | |||
| 240 | Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncIndex`. As EOSE messages arrive: | 246 | Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncIndex`. As EOSE messages arrive: |
| 241 | 247 | ||
| 242 | - `handle_eose()` confirms each batch via `confirm_batch()` | 248 | - `handle_eose()` confirms each batch via `confirm_batch()` |
| 243 | - `confirm_batch()` moves items to confirmed state and calls `check_and_complete_historic_sync()` | 249 | - `confirm_batch()` moves items to confirmed state, tracks if batch failed, and calls `check_and_complete_historic_sync()` |
| 244 | - `check_and_complete_historic_sync()` checks if `PendingSyncIndex` is empty for this relay | 250 | - `check_and_complete_historic_sync()` uses a **double-check pattern** to avoid race conditions: |
| 245 | - When empty: transitions `Syncing` → `Connected`, sets `historic_sync_completed = true` | 251 | 1. First check: Are there pending batches? If yes, return early |
| 252 | 2. Wait 6 seconds (batch window + buffer) for self-subscriber to process in-flight events | ||
| 253 | 3. Second check: Are any batches pending now? If yes, return early | ||
| 254 | 4. If no pending batches after wait: | ||
| 255 | - If any batch failed: transition `Syncing` → `ConnectedDegraded` | ||
| 256 | - If all batches succeeded: transition `Syncing` → `Connected` | ||
| 257 | - Set `historic_sync_completed = true` | ||
| 258 | |||
| 259 | **Why the double-check?** There's an async gap between receiving EOSE and the self-subscriber processing events to create Layer 2/3 filters. The 6-second wait (5s batch window + 1s buffer) ensures we don't prematurely mark sync complete while Layer 2/3 batches are being created. | ||
| 260 | |||
| 261 | **Batch Failure Tracking**: When negentropy retry protection triggers (relay returns zero requested events on retry), the batch is marked as `failed = true`. This causes the relay to transition to `ConnectedDegraded` instead of `Connected`, signaling that live sync is active but historic sync is incomplete. | ||
| 246 | 262 | ||
| 247 | **Metrics tracking**: The `ngit_sync_relay_connected` metric shows: | 263 | **Metrics tracking**: The `ngit_sync_relay_connected` metric shows: |
| 248 | - `0` = Disconnected | 264 | - `0` = Disconnected |
| 249 | - `1` = Connecting | 265 | - `1` = Connecting |
| 250 | - `2` = Syncing (historic sync in progress) | 266 | - `2` = Syncing (historic sync in progress) |
| 251 | - `3` = Connected (historic sync complete, live sync active) | 267 | - `3` = Connected (historic sync complete, live sync active) |
| 268 | - `4` = ConnectedDegraded (historic sync failed, live sync active, partial data) | ||
| 252 | 269 | ||
| 253 | This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live". | 270 | This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live" vs "degraded (missing historic data)". |
| 254 | 271 | ||
| 255 | ### Event Loop Lifecycle | 272 | ### Event Loop Lifecycle |
| 256 | 273 | ||
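
The double-check completion logic documented in the hunk above can be sketched as a small synchronous function. This is a simplified illustration under stated assumptions, not the project's implementation: the real `check_and_complete_historic_sync()` is presumably async and reads relay state directly, whereas here the pending-batch count and the wait are injected as closures so the decision logic can be exercised on its own. The names `SyncOutcome`, `SETTLE_WAIT` and `check_historic_sync` are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical result of a completed historic sync.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
    Connected,
    ConnectedDegraded,
}

/// 5s batch window + 1s buffer, as described in the docs.
pub const SETTLE_WAIT: Duration = Duration::from_secs(6);

/// Returns `None` while batches are still pending (before or after the
/// wait), otherwise the state the relay should transition to.
pub fn check_historic_sync<P, W>(
    mut pending_batches: P,
    wait: W,
    had_failures: bool,
) -> Option<SyncOutcome>
where
    P: FnMut() -> usize,
    W: FnOnce(Duration),
{
    // First check: batches still pending, nothing to complete yet.
    if pending_batches() > 0 {
        return None;
    }
    // Give in-flight events time to create Layer 2/3 batches.
    wait(SETTLE_WAIT);
    // Second check: new batches appeared during the wait.
    if pending_batches() > 0 {
        return None;
    }
    // No pending batches after the wait: pick the final state.
    Some(if had_failures {
        SyncOutcome::ConnectedDegraded
    } else {
        SyncOutcome::Connected
    })
}
```

In production the `wait` closure would block or sleep for the given duration (for example `std::thread::sleep`); passing a no-op closure makes the logic testable without a six-second delay.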
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md index d2d20c0..cc164ab 100644 --- a/docs/explanation/monitoring.md +++ b/docs/explanation/monitoring.md | |||
| @@ -98,7 +98,7 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added | |||
| 98 | 98 | ||
| 99 | | Metric | Type | Labels | Description | | 99 | | Metric | Type | Labels | Description | |
| 100 | |--------|------|--------|-------------| | 100 | |--------|------|--------|-------------| |
| 101 | | `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) | | 101 | | `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_degraded) | |
| 102 | | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes | | 102 | | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes | |
| 103 | | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) | | 103 | | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) | |
| 104 | | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count | | 104 | | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count | |
| @@ -115,8 +115,9 @@ The `ngit_sync_relay_connected` metric tracks the connection lifecycle: | |||
| 115 | - `1` = **Connecting** - Connection attempt in progress | 115 | - `1` = **Connecting** - Connection attempt in progress |
| 116 | - `2` = **Syncing** - Connected, historic sync in progress | 116 | - `2` = **Syncing** - Connected, historic sync in progress |
| 117 | - `3` = **Connected** - Connected, historic sync complete, live sync active | 117 | - `3` = **Connected** - Connected, historic sync complete, live sync active |
| 118 | - `4` = **ConnectedDegraded** - Connected, historic sync failed, live sync active, partial data | ||
| 118 | 119 | ||
| 119 | This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected). | 120 | This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "degraded - missing historic data" (ConnectedDegraded). |
| 120 | 121 | ||
| 121 | ### Relay Health States | 122 | ### Relay Health States |
| 122 | 123 | ||
| @@ -136,10 +137,14 @@ sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected | |||
| 136 | sum by (relay) (ngit_sync_relay_connected == 1) # Connecting | 137 | sum by (relay) (ngit_sync_relay_connected == 1) # Connecting |
| 137 | sum by (relay) (ngit_sync_relay_connected == 2) # Syncing | 138 | sum by (relay) (ngit_sync_relay_connected == 2) # Syncing |
| 138 | sum by (relay) (ngit_sync_relay_connected == 3) # Connected | 139 | sum by (relay) (ngit_sync_relay_connected == 3) # Connected |
| 140 | sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedDegraded | ||
| 139 | 141 | ||
| 140 | # Relays still syncing (not yet fully caught up) | 142 | # Relays still syncing (not yet fully caught up) |
| 141 | count(ngit_sync_relay_connected == 2) | 143 | count(ngit_sync_relay_connected == 2) |
| 142 | 144 | ||
| 145 | # Relays with degraded sync (missing historic data) | ||
| 146 | count(ngit_sync_relay_connected == 4) | ||
| 147 | |||
| 143 | # Connection success rate over last hour | 148 | # Connection success rate over last hour |
| 144 | sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h])) | 149 | sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h])) |
| 145 | / sum(rate(ngit_sync_connection_attempts_total[1h])) | 150 | / sum(rate(ngit_sync_connection_attempts_total[1h])) |