upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

author DanConwayDev <DanConwayDev@protonmail.com> 2026-01-09 13:28:11 +0000
committer DanConwayDev <DanConwayDev@protonmail.com> 2026-01-09 13:28:11 +0000
commit c34492069abacae67482af4c8356241958a524f7
tree fd9b8ca3c26a96742bad4e9e359a20fc37c998aa /docs/explanation/monitoring.md
parent eb10e85f199266affd3bca0a3d4cd934f74f3e7f
feat(sync): add Syncing connection status to track historic sync progress
- Add ConnectionStatus::Syncing state between Connecting and Connected
- Track historic_sync_completed and historic_sync_completed_at in RelayState
- Auto-detect sync completion via check_and_complete_historic_sync()
- Update metrics: ngit_sync_relay_connected now shows 0-3 (disconnected/connecting/syncing/connected)
- Update Prometheus metric documentation with new status values
- Add state machine diagram showing Syncing transition
- Operators can now distinguish 'connected but catching up' vs 'fully synced'
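The state transition described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: only `ConnectionStatus`, `Syncing`, `RelayState`, `historic_sync_completed`, `historic_sync_completed_at`, and `check_and_complete_historic_sync()` are named in the source. The method signatures, the `pending_historic_batches` counter, the `on_*` helpers, and the use of `SystemTime` are assumptions made for the example.

```rust
use std::time::SystemTime;

/// Connection lifecycle; `Syncing` sits between `Connecting` and `Connected`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
}

impl ConnectionStatus {
    /// Value exported on the `ngit_sync_relay_connected` gauge (0-3).
    pub fn gauge_value(self) -> i64 {
        match self {
            ConnectionStatus::Disconnected => 0,
            ConnectionStatus::Connecting => 1,
            ConnectionStatus::Syncing => 2,
            ConnectionStatus::Connected => 3,
        }
    }
}

#[derive(Debug)]
pub struct RelayState {
    pub status: ConnectionStatus,
    pub historic_sync_completed: bool,
    pub historic_sync_completed_at: Option<SystemTime>,
    /// Assumption: outstanding historic-sync work is tracked as a plain counter.
    pub pending_historic_batches: usize,
}

impl RelayState {
    pub fn new() -> Self {
        RelayState {
            status: ConnectionStatus::Disconnected,
            historic_sync_completed: false,
            historic_sync_completed_at: None,
            pending_historic_batches: 0,
        }
    }

    /// Hypothetical helper: a connection attempt has started.
    pub fn on_connecting(&mut self) {
        self.status = ConnectionStatus::Connecting;
    }

    /// Hypothetical helper: the socket is open, so historic sync begins.
    /// The relay is reported as Syncing, not yet Connected.
    pub fn on_socket_open(&mut self, pending_historic_batches: usize) {
        self.status = ConnectionStatus::Syncing;
        self.historic_sync_completed = false;
        self.historic_sync_completed_at = None;
        self.pending_historic_batches = pending_historic_batches;
    }

    /// Hypothetical helper: one unit of historic sync work has finished.
    pub fn on_historic_batch_done(&mut self) {
        self.pending_historic_batches = self.pending_historic_batches.saturating_sub(1);
    }

    /// Moves Syncing -> Connected once no historic work remains.
    /// Returns true only when this call performed the transition.
    pub fn check_and_complete_historic_sync(&mut self, now: SystemTime) -> bool {
        if self.status == ConnectionStatus::Syncing && self.pending_historic_batches == 0 {
            self.status = ConnectionStatus::Connected;
            self.historic_sync_completed = true;
            self.historic_sync_completed_at = Some(now);
            return true;
        }
        false
    }
}
```

Under these assumptions, a relay that has opened its socket but still has historic batches outstanding reports gauge value 2, and only reaches 3 after `check_and_complete_historic_sync()` sees the counter hit zero.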
Diffstat (limited to 'docs/explanation/monitoring.md')
-rw-r--r-- docs/explanation/monitoring.md 83
1 file changed, 50 insertions, 33 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 9368bf4..d2d20c0 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -98,54 +98,64 @@ When GRASP-02 proactive sync is implemented, the following metrics will be added
 
 | Metric | Type | Labels | Description |
 |--------|------|--------|-------------|
-| `ngit_sync_relay_connected` | Gauge | relay | 1 if connected, 0 if not |
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (0=disconnected, 1=connecting, 2=syncing, 3=connected) |
 | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
-| `ngit_sync_relay_status` | Gauge | relay, status | 1 for current status, 0 otherwise |
+| `ngit_sync_relay_status` | Gauge | relay | Health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
 | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
-| `ngit_sync_events_total` | Counter | source | Events received by source type |
-| `ngit_sync_gap_events_total` | Counter | relay | Events found during catchup |
+| `ngit_sync_events_synced_total` | Counter | - | Events synced (newly saved events only) |
 | `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered |
 | `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count |
 | `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead |
 
-### Event Sources
+### Connection Status Values
 
-The `source` label on `ngit_sync_events_total` tracks how events were received:
+The `ngit_sync_relay_connected` metric tracks the connection lifecycle:
 
-- `direct` - Submitted directly to our relay by a user
-- `live_sync` - Received via live WebSocket subscription (expected path)
-- `catchup` - Found during negentropy catchup after reconnect
-- `daily_catchup` - Found during daily reconciliation
+- `0` = **Disconnected** - Not currently connected
+- `1` = **Connecting** - Connection attempt in progress
+- `2` = **Syncing** - Connected, historic sync in progress
+- `3` = **Connected** - Connected, historic sync complete, live sync active
 
-**Catchup events indicate sync failures** - these should have been received via live sync. High catchup rates suggest connectivity issues or filter mismatches.
+This allows operators to distinguish between "connected but still catching up" (Syncing) vs "fully synced and live" (Connected).
 
 ### Relay Health States
 
-The `status` label on `ngit_sync_relay_status` tracks relay health:
+The `ngit_sync_relay_status` metric tracks relay health:
 
-- `healthy` - Normal operation, connections working
-- `backoff` - Exponential backoff after failures (5s → 10s → ... → 1h)
-- `dead` - 24h of continuous failures, daily retry only
+- `1` = **Healthy** - Connected and stable
+- `2` = **Disconnected** - Not connected, but no issues detected
+- `3` = **Degraded** - Connection problems or unstable after recovery
+- `4` = **Dead** - 24h+ of continuous failures
+- `5` = **RateLimited** - Rate limit cooldown active (65s)
 
 ### Example Grafana Queries
 
 ```promql
-# Relay health overview - count by status
-sum by (status) (ngit_sync_relay_status == 1)
+# Relay connection status overview - count by status
+sum by (relay) (ngit_sync_relay_connected == 0) # Disconnected
+sum by (relay) (ngit_sync_relay_connected == 1) # Connecting
+sum by (relay) (ngit_sync_relay_connected == 2) # Syncing
+sum by (relay) (ngit_sync_relay_connected == 3) # Connected
+
+# Relays still syncing (not yet fully caught up)
+count(ngit_sync_relay_connected == 2)
 
 # Connection success rate over last hour
 sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
 / sum(rate(ngit_sync_connection_attempts_total[1h]))
 
-# Sync gap detection - events that should have been live synced
-sum(rate(ngit_sync_gap_events_total[1h])) by (relay)
-
-# Live sync effectiveness (lower is better - fewer gaps)
-sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
-/ sum(rate(ngit_sync_events_total[1h]))
+# Event sync rate (newly saved events)
+rate(ngit_sync_events_synced_total[5m])
 
 # Relays with high failure counts (potential issues)
 topk(10, ngit_sync_relay_failures)
+
+# Relay health overview - count by health state
+sum(ngit_sync_relay_status == 1) # Healthy
+sum(ngit_sync_relay_status == 2) # Disconnected
+sum(ngit_sync_relay_status == 3) # Degraded
+sum(ngit_sync_relay_status == 4) # Dead
+sum(ngit_sync_relay_status == 5) # RateLimited
 ```
 
 ### Example Alerts
@@ -153,23 +163,30 @@ topk(10, ngit_sync_relay_failures)
 ```yaml
 # Alert if relay stuck in dead state for > 1 day
 - alert: SyncRelayDead
-  expr: ngit_sync_relay_status{status="dead"} == 1
+  expr: ngit_sync_relay_status == 4 # Dead state
   for: 1d
   labels:
     severity: warning
   annotations:
-    summary: "Sync relay {{ $labels.relay }} is dead"
+    summary: "Sync relay {{ $labels.relay }} is dead (24h+ failures)"
 
-# Alert if sync gap rate is high (>10% of events from catchup)
-- alert: SyncGapHigh
-  expr: >
-    sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
-    / sum(rate(ngit_sync_events_total[1h])) > 0.1
-  for: 30m
+# Alert if relay stuck in syncing state for > 1 hour
+- alert: SyncRelaySlow
+  expr: ngit_sync_relay_connected == 2 # Syncing state
+  for: 1h
+  labels:
+    severity: info
+  annotations:
+    summary: "Sync relay {{ $labels.relay }} taking >1h to complete historic sync"
+
+# Alert if too many relays are degraded
+- alert: SyncManyDegraded
+  expr: sum(ngit_sync_relay_status == 3) > 5 # Degraded state
+  for: 15m
   labels:
     severity: warning
   annotations:
-    summary: "High sync gap rate - {{ $value | humanizePercentage }} of events from catchup"
+    summary: "{{ $value }} relays in degraded state"
 ```
 
 ### Design Rationale
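
Several queries and one alert in this diff apply `sum(...)` to a numerically encoded gauge after an equality filter. In PromQL, a comparison without the `bool` modifier keeps each matching series at its original value, so `sum(ngit_sync_relay_status == 3)` evaluates to 3 times the number of degraded relays, while `count(...)` returns the number of relays. The Rust sketch below only models that arithmetic. The function names and sample data are hypothetical and are not part of ngit or Prometheus.

```rust
/// Models PromQL `sum(metric == wanted)`: keep matching samples, then add
/// up their values. Simplification: PromQL returns an empty result rather
/// than 0 when nothing matches; this model returns 0.
fn sum_filtered(samples: &[i64], wanted: i64) -> i64 {
    samples.iter().filter(|&&v| v == wanted).sum()
}

/// Models PromQL `count(metric == wanted)`: number of matching samples.
fn count_filtered(samples: &[i64], wanted: i64) -> usize {
    samples.iter().filter(|&&v| v == wanted).count()
}

/// Hypothetical snapshot of `ngit_sync_relay_status`, one value per relay:
/// two healthy (1), two degraded (3), one dead (4).
fn sample_statuses() -> Vec<i64> {
    vec![1, 3, 3, 4, 1]
}
```

With this sample, `sum_filtered(&sample_statuses(), 3)` is 6 and `count_filtered(&sample_statuses(), 3)` is 2, so a `> 5` threshold on the sum would already be crossed with only two degraded relays. When the intent is "number of relays in a state", `count(...)` or `sum(... == bool 3)` expresses it directly.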