| author | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 13:28:11 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 13:28:11 +0000 |
| commit | c34492069abacae67482af4c8356241958a524f7 (patch) | |
| tree | fd9b8ca3c26a96742bad4e9e359a20fc37c998aa /docs/explanation/monitoring.md | |
| parent | eb10e85f199266affd3bca0a3d4cd934f74f3e7f (diff) | |
feat(sync): add Syncing connection status to track historic sync progress
- Add ConnectionStatus::Syncing state between Connecting and Connected
- Track historic_sync_completed and historic_sync_completed_at in RelayState
- Auto-detect sync completion via check_and_complete_historic_sync()
- Update metrics: ngit_sync_relay_connected now shows 0-3 (disconnected/connecting/syncing/connected)
- Update Prometheus metric documentation with new status values
- Add state machine diagram showing Syncing transition
- Operators can now distinguish 'connected but catching up' vs 'fully synced'
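
The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ConnectionStatus`, its four states, the 0-3 gauge values, and the `historic_sync_completed` field come from the commit message, while `gauge_value`, `mark_historic_sync_done`, and the shape of `RelayState` shown here are hypothetical names and simplifications introduced for the example.

```rust
/// Connection lifecycle states named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
}

impl ConnectionStatus {
    /// Hypothetical helper: the value exported via `ngit_sync_relay_connected`
    /// (0=disconnected, 1=connecting, 2=syncing, 3=connected).
    fn gauge_value(self) -> i64 {
        match self {
            ConnectionStatus::Disconnected => 0,
            ConnectionStatus::Connecting => 1,
            ConnectionStatus::Syncing => 2,
            ConnectionStatus::Connected => 3,
        }
    }
}

/// Simplified stand-in for the real `RelayState`; only the fields needed
/// for this sketch are modelled (assumption).
struct RelayState {
    status: ConnectionStatus,
    historic_sync_completed: bool,
}

impl RelayState {
    /// Hypothetical transition: a relay moves to `Connected` only from
    /// `Syncing`. Returns true if the transition happened.
    fn mark_historic_sync_done(&mut self) -> bool {
        if self.status == ConnectionStatus::Syncing {
            self.historic_sync_completed = true;
            self.status = ConnectionStatus::Connected;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Print the gauge value for every state in lifecycle order.
    for status in [
        ConnectionStatus::Disconnected,
        ConnectionStatus::Connecting,
        ConnectionStatus::Syncing,
        ConnectionStatus::Connected,
    ] {
        println!("{:?} -> {}", status, status.gauge_value());
    }

    // Walk one relay from Syncing to Connected.
    let mut relay = RelayState {
        status: ConnectionStatus::Syncing,
        historic_sync_completed: false,
    };
    let changed = relay.mark_historic_sync_done();
    println!(
        "changed={} completed={} gauge={}",
        changed,
        relay.historic_sync_completed,
        relay.status.gauge_value()
    );
}
```

Keeping the numeric mapping in one place means the gauge and the documented status values cannot drift apart when a state is added.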
Diffstat (limited to 'docs/explanation/monitoring.md')
| -rw-r--r-- | docs/explanation/monitoring.md | 83 |
1 files changed, 50 insertions, 33 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 9368bf4..d2d20c0 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
| @@ -98,54 +98,64 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added | |||
| 98 | 98 | ||
| 99 | | Metric | Type | Labels | Description | | 99 | | Metric | Type | Labels | Description | |
| 100 | |--------|------|--------|-------------| | 100 | |--------|------|--------|-------------| |
| 101 | | `ngit_sync_relay_connected` | Gauge | relay | 1 if connected, 0 if not | | 101 | | `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) | |
| 102 | | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes | | 102 | | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes | |
| 103 | | `ngit_sync_relay_status` | Gauge | relay, status | 1 for current status, 0 otherwise | | 103 | | `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) | |
| 104 | | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count | | 104 | | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count | |
| 105 | | `ngit_sync_events_total` | Counter | source | Events received by source type | | 105 | | `ngit_sync_events_synced_total` | Counter | - | Events synced (newly saved events only) | |
| 106 | | `ngit_sync_gap_events_total` | Counter | relay | Events found during catchup | | ||
| 107 | | `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered | | 106 | | `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered | |
| 108 | | `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count | | 107 | | `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count | |
| 109 | | `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead | | 108 | | `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead | |
| 110 | 109 | ||
| 111 | ### Event Sources | 110 | ### Connection Status Values |
| 112 | 111 | ||
| 113 | The `source` label on `ngit_sync_events_total` tracks how events were received: | 112 | The `ngit_sync_relay_connected` metric tracks the connection lifecycle: |
| 114 | 113 | ||
| 115 | - `direct` - Submitted directly to our relay by a user | 114 | - `0` = **Disconnected** - Not currently connected |
| 116 | - `live_sync` - Received via live WebSocket subscription (expected path) | 115 | - `1` = **Connecting** - Connection attempt in progress |
| 117 | - `catchup` - Found during negentropy catchup after reconnect | 116 | - `2` = **Syncing** - Connected, historic sync in progress |
| 118 | - `daily_catchup` - Found during daily reconciliation | 117 | - `3` = **Connected** - Connected, historic sync complete, live sync active |
| 119 | 118 | ||
| 120 | **Catchup events indicate sync failures** - these should have been received via live sync. High catchup rates suggest connectivity issues or filter mismatches. | 119 | This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected). |
| 121 | 120 | ||
| 122 | ### Relay Health States | 121 | ### Relay Health States |
| 123 | 122 | ||
| 124 | The `status` label on `ngit_sync_relay_status` tracks relay health: | 123 | The `ngit_sync_relay_status` metric tracks relay health: |
| 125 | 124 | ||
| 126 | - `healthy` - Normal operation, connections working | 125 | - `1` = **Healthy** - Connected and stable |
| 127 | - `backoff` - Exponential backoff after failures (5s → 10s → ... → 1h) | 126 | - `2` = **Disconnected** - Not connected, but no issues detected |
| 128 | - `dead` - 24h of continuous failures, daily retry only | 127 | - `3` = **Degraded** - Connection problems or unstable after recovery |
| 128 | - `4` = **Dead** - 24h+ of continuous failures | ||
| 129 | - `5` = **RateLimited** - Rate limit cooldown active (65s) | ||
| 129 | 130 | ||
| 130 | ### Example Grafana Queries | 131 | ### Example Grafana Queries |
| 131 | 132 | ||
| 132 | ```promql | 133 | ```promql |
| 133 | # Relay health overview - count by status | 134 | # Relay connection status overview - count by status |
| 134 | sum by (status) (ngit_sync_relay_status == 1) | 135 | sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected |
| 136 | sum by (relay) (ngit_sync_relay_connected == 1) # Connecting | ||
| 137 | sum by (relay) (ngit_sync_relay_connected == 2) # Syncing | ||
| 138 | sum by (relay) (ngit_sync_relay_connected == 3) # Connected | ||
| 139 | |||
| 140 | # Relays still syncing (not yet fully caught up) | ||
| 141 | count(ngit_sync_relay_connected == 2) | ||
| 135 | 142 | ||
| 136 | # Connection success rate over last hour | 143 | # Connection success rate over last hour |
| 137 | sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h])) | 144 | sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h])) |
| 138 | / sum(rate(ngit_sync_connection_attempts_total[1h])) | 145 | / sum(rate(ngit_sync_connection_attempts_total[1h])) |
| 139 | 146 | ||
| 140 | # Sync gap detection - events that should have been live synced | 147 | # Event sync rate (newly saved events) |
| 141 | sum(rate(ngit_sync_gap_events_total[1h])) by (relay) | 148 | rate(ngit_sync_events_synced_total[5m]) |
| 142 | |||
| 143 | # Live sync effectiveness (lower is better - fewer gaps) | ||
| 144 | sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h])) | ||
| 145 | / sum(rate(ngit_sync_events_total[1h])) | ||
| 146 | 149 | ||
| 147 | # Relays with high failure counts (potential issues) | 150 | # Relays with high failure counts (potential issues) |
| 148 | topk(10, ngit_sync_relay_failures) | 151 | topk(10, ngit_sync_relay_failures) |
| 152 | |||
| 153 | # Relay health overview - count by health state | ||
| 154 | sum(ngit_sync_relay_status == 1) # Healthy | ||
| 155 | sum(ngit_sync_relay_status == 2) # Disconnected | ||
| 156 | sum(ngit_sync_relay_status == 3) # Degraded | ||
| 157 | sum(ngit_sync_relay_status == 4) # Dead | ||
| 158 | sum(ngit_sync_relay_status == 5) # RateLimited | ||
| 149 | ``` | 159 | ``` |
| 150 | 160 | ||
| 151 | ### Example Alerts | 161 | ### Example Alerts |
| @@ -153,23 +163,30 @@ topk(10, ngit_sync_relay_failures) | |||
| 153 | ```yaml | 163 | ```yaml |
| 154 | # Alert if relay stuck in dead state for > 1 day | 164 | # Alert if relay stuck in dead state for > 1 day |
| 155 | - alert: SyncRelayDead | 165 | - alert: SyncRelayDead |
| 156 | expr: ngit_sync_relay_status{status="dead"} == 1 | 166 | expr: ngit_sync_relay_status == 4 # Dead state |
| 157 | for: 1d | 167 | for: 1d |
| 158 | labels: | 168 | labels: |
| 159 | severity: warning | 169 | severity: warning |
| 160 | annotations: | 170 | annotations: |
| 161 | summary: "Sync relay {{ $labels.relay }} is dead" | 171 | summary: "Sync relay {{ $labels.relay }} is dead (24h+ failures)" |
| 162 | 172 | ||
| 163 | # Alert if sync gap rate is high (>10% of events from catchup) | 173 | # Alert if relay stuck in syncing state for > 1 hour |
| 164 | - alert: SyncGapHigh | 174 | - alert: SyncRelaySlow |
| 165 | expr: > | 175 | expr: ngit_sync_relay_connected == 2 # Syncing state |
| 166 | sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h])) | 176 | for: 1h |
| 167 | / sum(rate(ngit_sync_events_total[1h])) > 0.1 | 177 | labels: |
| 168 | for: 30m | 178 | severity: info |
| 179 | annotations: | ||
| 180 | summary: "Sync relay {{ $labels.relay }} taking >1h to complete historic sync" | ||
| 181 | |||
| 182 | # Alert if too many relays are degraded | ||
| 183 | - alert: SyncManyDegraded | ||
| 184 | expr: sum(ngit_sync_relay_status == 3) > 5 # Degraded state | ||
| 185 | for: 15m | ||
| 169 | labels: | 186 | labels: |
| 170 | severity: warning | 187 | severity: warning |
| 171 | annotations: | 188 | annotations: |
| 172 | summary: "High sync gap rate - {{ $value | humanizePercentage }} of events from catchup" | 189 | summary: "{{ $value }} relays in degraded state" |
| 173 | ``` | 190 | ``` |
| 174 | 191 | ||
| 175 | ### Design Rationale | 192 | ### Design Rationale |