From 208ea60836cfc98857cf3359a73d8874ed5d935a Mon Sep 17 00:00:00 2001
From: DanConwayDev
Date: Fri, 9 Jan 2026 14:23:44 +0000
Subject: refactor(sync): rename ConnectedDegraded to ConnectedHistoricSyncFailures
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Resolves naming conflict with RelayHealthState::Degraded by using a more
explicit name that clearly indicates the connection status relates to
historic sync failures, not connection health degradation.

Changes:
- ConnectionStatus::ConnectedDegraded → ConnectedHistoricSyncFailures
- Updated all documentation and comments
- Updated Prometheus metric descriptions
- Metric value remains 4 for backward compatibility

This makes it clear that:
- ConnectedHistoricSyncFailures = connection lifecycle (missing historic data)
- RelayHealthState::Degraded = connection health (reliability issues)

These are orthogonal concerns - a relay can be ConnectedHistoricSyncFailures
but Healthy, or Connected but Degraded.
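The orthogonality described above can be sketched as two independent enums. This is a minimal illustration, not the project's code: the `ConnectionStatus` variant names and gauge values come from the documentation touched by this patch, the `RelayHealthState` variants are assumed from the `ngit_sync_relay_status` metric description, and `metric_value`, `RelayView`, `has_partial_history`, and `is_unreliable` are hypothetical names introduced only for this sketch.

```rust
/// Connection lifecycle, as documented in grasp-02-proactive-sync.md.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedHistoricSyncFailures,
}

impl ConnectionStatus {
    /// Hypothetical helper: the gauge value documented for
    /// `ngit_sync_relay_connected`. The renamed variant keeps value 4.
    pub fn metric_value(self) -> i64 {
        match self {
            ConnectionStatus::Disconnected => 0,
            ConnectionStatus::Connecting => 1,
            ConnectionStatus::Syncing => 2,
            ConnectionStatus::Connected => 3,
            ConnectionStatus::ConnectedHistoricSyncFailures => 4,
        }
    }
}

/// Connection health. Variants are assumed from the
/// `ngit_sync_relay_status` description, not copied from the source tree.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelayHealthState {
    Healthy,
    Disconnected,
    Degraded,
    Dead,
    RateLimited,
}

/// Hypothetical pairing of the two independent dimensions for one relay.
pub struct RelayView {
    pub connection: ConnectionStatus,
    pub health: RelayHealthState,
}

impl RelayView {
    /// Historic data is missing, whatever the health state is.
    pub fn has_partial_history(&self) -> bool {
        self.connection == ConnectionStatus::ConnectedHistoricSyncFailures
    }

    /// The connection is unreliable, whatever the sync outcome was.
    pub fn is_unreliable(&self) -> bool {
        self.health == RelayHealthState::Degraded
    }
}
```

Because the two questions read different fields, a relay can answer "yes" to either one without affecting the other, which is the ambiguity the old `ConnectedDegraded` name obscured.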
---
 docs/explanation/grasp-02-proactive-sync.md | 20 ++++++++++----------
 docs/explanation/monitoring.md              | 10 +++++-----
 2 files changed, 15 insertions(+), 15 deletions(-)
 (limited to 'docs')

diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md
index b17b8bf..e983316 100644
--- a/docs/explanation/grasp-02-proactive-sync.md
+++ b/docs/explanation/grasp-02-proactive-sync.md
@@ -79,8 +79,8 @@ pub enum ConnectionStatus {
     Syncing,
     /// Successfully connected, historic sync completed
     Connected,
-    /// Successfully connected, historic sync failed but live sync active
-    ConnectedDegraded,
+    /// Successfully connected, historic sync had failures but live sync active
+    ConnectedHistoricSyncFailures,
 }
 
 /// Complete state for a single relay - combines sync needs with connection lifecycle
@@ -210,18 +210,18 @@ stateDiagram-v2
     Connecting --> Syncing: success → handle_connect_or_reconnect
     Connecting --> Disconnected: failure + record in health tracker
     Syncing --> Connected: all batches succeed → check_and_complete_historic_sync
-    Syncing --> ConnectedDegraded: any batch failed → check_and_complete_historic_sync
+    Syncing --> ConnectedHistoricSyncFailures: any batch failed → check_and_complete_historic_sync
     Syncing --> Disconnected: connection lost → handle_disconnect
     Connected --> Disconnected: connection lost → handle_disconnect
-    ConnectedDegraded --> Disconnected: connection lost → handle_disconnect
+    ConnectedHistoricSyncFailures --> Disconnected: connection lost → handle_disconnect
     Connected --> [*]: intentional disconnect via check_disconnects
-    ConnectedDegraded --> [*]: intentional disconnect via check_disconnects
+    ConnectedHistoricSyncFailures --> [*]: intentional disconnect via check_disconnects
 
     note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap
     note right of Connecting: connection attempt with timeout
     note right of Syncing: historic sync in progress<br/>event loop spawned here
     note right of Connected: historic sync complete<br/>last_connected tracked for since filter
-    note right of ConnectedDegraded: historic sync failed (missing events)<br/>live sync active, partial data
+    note right of ConnectedHistoricSyncFailures: historic sync had failures (missing events)<br/>live sync active, partial data
 ```
 
 ### Connection Flow Methods
@@ -252,22 +252,22 @@ Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncInd
 2. Wait 6 seconds (batch window + buffer) for self-subscriber to process in-flight events
 3. Second check: Are there still no pending batches? If yes, return early
 4. If no pending batches after wait:
-   - If any batch failed: transition `Syncing` → `ConnectedDegraded`
+   - If any batch failed: transition `Syncing` → `ConnectedHistoricSyncFailures`
    - If all batches succeeded: transition `Syncing` → `Connected`
    - Set `historic_sync_completed = true`
 
 **Why the double-check?** There's an async gap between receiving EOSE and the self-subscriber processing events to create Layer 2/3 filters. The 6-second wait (5s batch window + 1s buffer) ensures we don't prematurely mark sync complete while Layer 2/3 batches are being created.
 
-**Batch Failure Tracking**: When negentropy retry protection triggers (relay returns zero requested events on retry), the batch is marked as `failed = true`. This causes the relay to transition to `ConnectedDegraded` instead of `Connected`, signaling that live sync is active but historic sync is incomplete.
+**Batch Failure Tracking**: When negentropy retry protection triggers (relay returns zero requested events on retry), the batch is marked as `failed = true`. This causes the relay to transition to `ConnectedHistoricSyncFailures` instead of `Connected`, signaling that live sync is active but historic sync is incomplete.
 
 **Metrics tracking**: The `ngit_sync_relay_connected` metric shows:
 - `0` = Disconnected
 - `1` = Connecting
 - `2` = Syncing (historic sync in progress)
 - `3` = Connected (historic sync complete, live sync active)
-- `4` = ConnectedDegraded (historic sync failed, live sync active, partial data)
+- `4` = ConnectedHistoricSyncFailures (historic sync had failures, live sync active, partial data)
 
-This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live" vs "degraded (missing historic data)".
+This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live" vs "historic sync failures (missing historic data)".
 
 ### Event Loop Lifecycle
 
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index cc164ab..7520813 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -98,7 +98,7 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added
 
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_degraded) |
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected, 4=connected_historic_sync_failures) |
 | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
 | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
 | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
@@ -115,9 +115,9 @@ The `ngit_sync_relay_connected` metric tracks the connection lifecycle:
 - `1` = **Connecting** - Connection attempt in progress
 - `2` = **Syncing** - Connected, historic sync in progress
 - `3` = **Connected** - Connected, historic sync complete, live sync active
-- `4` = **ConnectedDegraded** - Connected, historic sync failed, live sync active, partial data
+- `4` = **ConnectedHistoricSyncFailures** - Connected, historic sync had failures, live sync active, partial data
 
-This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "degraded - missing historic data" (ConnectedDegraded).
+This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected) vs "historic sync failures - missing historic data" (ConnectedHistoricSyncFailures).
 
 ### Relay Health States
 
@@ -137,12 +137,12 @@ sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected
 sum by (relay) (ngit_sync_relay_connected == 1) # Connecting
 sum by (relay) (ngit_sync_relay_connected == 2) # Syncing
 sum by (relay) (ngit_sync_relay_connected == 3) # Connected
-sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedDegraded
+sum by (relay) (ngit_sync_relay_connected == 4) # ConnectedHistoricSyncFailures
 
 # Relays still syncing (not yet fully caught up)
 count(ngit_sync_relay_connected == 2)
 
-# Relays with degraded sync (missing historic data)
+# Relays with historic sync failures (missing historic data)
 count(ngit_sync_relay_connected == 4)
 
 # Connection success rate over last hour
-- 
cgit v1.2.3
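The completion rule in the `grasp-02-proactive-sync.md` hunk (any failed batch yields `ConnectedHistoricSyncFailures`, otherwise `Connected`) might look roughly like the Rust sketch below. Only the `failed` flag and the two status names come from the patch. The single-field `PendingBatch` shape and the `status_after_historic_sync` function are assumptions, and the double check with the 6-second wait that the documentation attributes to `check_and_complete_historic_sync` is deliberately left out.

```rust
/// Subset of the documented `ConnectionStatus` variants: only the two
/// states that the `Syncing` state can complete into.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Connected,
    ConnectedHistoricSyncFailures,
}

/// Assumed minimal shape of a finished batch. The patch only tells us
/// that a batch carries a `failed` flag.
pub struct PendingBatch {
    pub failed: bool,
}

/// Hypothetical helper: choose the status to enter once no batches remain
/// pending. A single failed batch is enough to report missing history.
pub fn status_after_historic_sync(completed: &[PendingBatch]) -> ConnectionStatus {
    if completed.iter().any(|batch| batch.failed) {
        ConnectionStatus::ConnectedHistoricSyncFailures
    } else {
        ConnectionStatus::Connected
    }
}
```

In this sketch an empty batch list resolves to `Connected`, since no batch reported a failure. Whether the real code treats that case the same way is not stated in the patch.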