From 93a1684f068603b354ba3c05957a25459c73de05 Mon Sep 17 00:00:00 2001
From: DanConwayDev
Date: Fri, 9 Jan 2026 14:12:24 +0000
Subject: feat(sync): add ConnectedDegraded status for failed historic sync

- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries

Fixes issue where relays with failed negentropy retries would incorrectly
transition to Connected status despite missing data. Now operators can
distinguish 'fully synced' vs 'degraded (partial data)'.
---
 docs/explanation/grasp-02-proactive-sync.md | 29 +++++++++++++++++++++++------
 docs/explanation/monitoring.md              |  9 +++++++--
 2 files changed, 30 insertions(+), 8 deletions(-)

(limited to 'docs')

diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md
index e1fb367..b17b8bf 100644
--- a/docs/explanation/grasp-02-proactive-sync.md
+++ b/docs/explanation/grasp-02-proactive-sync.md
@@ -79,6 +79,8 @@ pub enum ConnectionStatus {
     Syncing,
     /// Successfully connected, historic sync completed
     Connected,
+    /// Successfully connected, historic sync failed but live sync active
+    ConnectedDegraded,
 }
 
 /// Complete state for a single relay - combines sync needs with connection lifecycle
@@ -207,15 +209,19 @@ stateDiagram-v2
     Disconnected --> Connecting: retry_disconnected_relays → try_connect_relay
     Connecting --> Syncing: success → handle_connect_or_reconnect
     Connecting --> Disconnected: failure + record in health tracker
-    Syncing --> Connected: all historic batches complete → check_and_complete_historic_sync
+    Syncing --> Connected: all batches succeed → check_and_complete_historic_sync
+    Syncing --> ConnectedDegraded: any batch failed → check_and_complete_historic_sync
     Syncing --> Disconnected: connection lost → handle_disconnect
     Connected --> Disconnected: connection lost → handle_disconnect
+    ConnectedDegraded --> Disconnected: connection lost → handle_disconnect
     Connected --> [*]: intentional disconnect via check_disconnects
+    ConnectedDegraded --> [*]: intentional disconnect via check_disconnects
 
     note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap
     note right of Connecting: connection attempt with timeout
     note right of Syncing: historic sync in progress<br/>event loop spawned here
     note right of Connected: historic sync complete<br/>last_connected tracked for since filter
+    note right of ConnectedDegraded: historic sync failed (missing events)<br/>live sync active, partial data
 ```
 
 ### Connection Flow Methods
@@ -240,17 +246,28 @@ When a relay first connects, it enters the **Syncing** state and begins historic
 
 Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncIndex`. As EOSE messages arrive:
 - `handle_eose()` confirms each batch via `confirm_batch()`
-- `confirm_batch()` moves items to confirmed state and calls `check_and_complete_historic_sync()`
-- `check_and_complete_historic_sync()` checks if `PendingSyncIndex` is empty for this relay
-- When empty: transitions `Syncing` → `Connected`, sets `historic_sync_completed = true`
+- `confirm_batch()` moves items to confirmed state, tracks if batch failed, and calls `check_and_complete_historic_sync()`
+- `check_and_complete_historic_sync()` uses a **double-check pattern** to avoid race conditions:
+  1. First check: Are there pending batches? If yes, return early
+  2. Wait 6 seconds (batch window + buffer) for self-subscriber to process in-flight events
+  3. Second check: Are there still no pending batches? If yes, return early
+  4. If no pending batches after wait:
+     - If any batch failed: transition `Syncing` → `ConnectedDegraded`
+     - If all batches succeeded: transition `Syncing` → `Connected`
+     - Set `historic_sync_completed = true`
+
+**Why the double-check?** There's an async gap between receiving EOSE and the self-subscriber processing events to create Layer 2/3 filters. The 6-second wait (5s batch window + 1s buffer) ensures we don't prematurely mark sync complete while Layer 2/3 batches are being created.
+
+**Batch Failure Tracking**: When negentropy retry protection triggers (relay returns zero requested events on retry), the batch is marked as `failed = true`. This causes the relay to transition to `ConnectedDegraded` instead of `Connected`, signaling that live sync is active but historic sync is incomplete.
 
 **Metrics tracking**: The `ngit_sync_relay_connected` metric shows:
 - `0` = Disconnected
-- `1` = Connecting
+- `1` = Connecting
 - `2` = Syncing (historic sync in progress)
 - `3` = Connected (historic sync complete, live sync active)
+- `4` = ConnectedDegraded (historic sync failed, live sync active, partial data)
 
-This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live".
+This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live" vs "degraded (missing historic data)".
 
 ### Event Loop Lifecycle
 
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index d2d20c0..cc164ab 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -98,7 +98,7 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added
 
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) |
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_degraded) |
 | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
 | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
 | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
@@ -115,8 +115,9 @@ The `ngit_sync_relay_connected` metric tracks the connection lifecycle:
 - `1` = **Connecting** - Connection attempt in progress
 - `2` = **Syncing** - Connected, historic sync in progress
 - `3` = **Connected** - Connected, historic sync complete, live sync active
+- `4` = **ConnectedDegraded** - Connected, historic sync failed, live sync active, partial data
 
-This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected).
+This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "degraded - missing historic data" (ConnectedDegraded).
 
 ### Relay Health States
 
@@ -136,10 +137,14 @@ sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected
 sum by (relay) (ngit_sync_relay_connected == 1) # Connecting
 sum by (relay) (ngit_sync_relay_connected == 2) # Syncing
 sum by (relay) (ngit_sync_relay_connected == 3) # Connected
+sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedDegraded
 
 # Relays still syncing (not yet fully caught up)
 count(ngit_sync_relay_connected == 2)
 
+# Relays with degraded sync (missing historic data)
+count(ngit_sync_relay_connected == 4)
+
 # Connection success rate over last hour
 sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h])) / sum(rate(ngit_sync_connection_attempts_total[1h]))
 
-- 
cgit v1.2.3
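
Review note: the completion decision described in the updated docs can be sketched as a small pure function, which may help when checking the state diagram against the metric values. This is only a minimal sketch under stated assumptions, not the code from the commit. `RelaySnapshot`, `status_after_historic_sync`, and `metric_value` are hypothetical names, the signature of `is_live_sync_active` is assumed (the commit message names the helper but the diff does not show it), and the 6-second wait is left out so that only the final check is modelled. Step 3 of the double-check is read here as "if batches have appeared during the wait, return early".

```rust
/// Connection states named in the patch. The numeric values returned by
/// `metric_value` are the ones documented for `ngit_sync_relay_connected`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedDegraded,
}

impl ConnectionStatus {
    /// Hypothetical mapping to the gauge value (0..=4) listed in the docs.
    pub fn metric_value(self) -> i64 {
        match self {
            ConnectionStatus::Disconnected => 0,
            ConnectionStatus::Connecting => 1,
            ConnectionStatus::Syncing => 2,
            ConnectionStatus::Connected => 3,
            ConnectionStatus::ConnectedDegraded => 4,
        }
    }

    /// Assumed shape of the helper named in the commit message: live sync
    /// runs in both states that follow historic sync.
    pub fn is_live_sync_active(self) -> bool {
        matches!(
            self,
            ConnectionStatus::Connected | ConnectionStatus::ConnectedDegraded
        )
    }
}

/// Hypothetical, simplified stand-in for the per-relay bookkeeping.
pub struct RelaySnapshot {
    pub status: ConnectionStatus,
    /// Number of batches still pending for this relay.
    pub pending_batches: usize,
    /// True if any batch was marked `failed` during historic sync.
    pub historic_sync_had_failures: bool,
}

/// Returns the status to transition to once the post-wait check runs,
/// or `None` when the relay is not syncing or batches are still pending.
pub fn status_after_historic_sync(relay: &RelaySnapshot) -> Option<ConnectionStatus> {
    if relay.status != ConnectionStatus::Syncing || relay.pending_batches > 0 {
        return None;
    }
    Some(if relay.historic_sync_had_failures {
        ConnectionStatus::ConnectedDegraded
    } else {
        ConnectionStatus::Connected
    })
}
```

Under these assumptions, a relay in `Syncing` with no pending batches and one failed batch would map to `ConnectedDegraded`, which reports `4` on the gauge and still counts as live.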