upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

path: root/docs/explanation/monitoring.md
Diffstat (limited to 'docs/explanation/monitoring.md')
-rw-r--r-- docs/explanation/monitoring.md 96
1 files changed, 91 insertions, 5 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 3b1b1ac..9368bf4 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -90,10 +90,96 @@ For detailed per-repository investigation at scale, consider adding **Loki** (lo
 - Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB)
 - Pairs with Prometheus for long-term trends
 
-## Future: Sync Metrics (GRASP-02)
+## Sync Metrics (GRASP-02)
 
-When GRASP-02 proactive sync is implemented, additional metrics will track:
+When GRASP-02 proactive sync is implemented, the following metrics will be added to track relay synchronization health. These metrics use in-memory tracking with Prometheus for operator visibility (no database persistence needed for <100 relays).
 
-- Events received from sync (live vs catchup)
-- Active outbound relay connections
-- Catchup gap (events found during catchup indicating sync failures)
\ No newline at end of file
+### Sync Metrics Overview
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `ngit_sync_relay_connected` | Gauge | relay | 1 if connected, 0 if not |
+| `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
+| `ngit_sync_relay_status` | Gauge | relay, status | 1 for current status, 0 otherwise |
+| `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
+| `ngit_sync_events_total` | Counter | source | Events received by source type |
+| `ngit_sync_gap_events_total` | Counter | relay | Events found during catchup |
+| `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered |
+| `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count |
+| `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead |
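
The "1 for current status, 0 otherwise" convention for `ngit_sync_relay_status` means each relay exposes one series per possible status, with exactly one of them set to 1. A minimal sketch of that encoding follows, rendering in-memory state as Prometheus text-format samples. The function name, the input dictionary, and the example relay URL are hypothetical, and label values are not escaped here; this is an illustration, not the relay's actual exporter.

```python
# Status values taken from the "Relay Health States" section of the document.
STATUSES = ("healthy", "backoff", "dead")


def relay_status_samples(current):
    """Render one-hot status samples in Prometheus text exposition format.

    `current` maps a relay URL to its current status (hypothetical shape).
    """
    lines = []
    for relay, status in sorted(current.items()):
        for candidate in STATUSES:
            # Exactly one status per relay gets the value 1.
            value = 1 if candidate == status else 0
            lines.append(
                f'ngit_sync_relay_status{{relay="{relay}",status="{candidate}"}} {value}'
            )
    return lines


if __name__ == "__main__":
    for line in relay_status_samples({"wss://relay.example": "backoff"}):
        print(line)
```

With this encoding, filtering on `== 1` and summing by `status` counts the relays in each state.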
+
+### Event Sources
+
+The `source` label on `ngit_sync_events_total` tracks how events were received:
+
+- `direct` - Submitted directly to our relay by a user
+- `live_sync` - Received via live WebSocket subscription (expected path)
+- `catchup` - Found during negentropy catchup after reconnect
+- `daily_catchup` - Found during daily reconciliation
+
+**Catchup events indicate sync failures** - these should have been received via live sync. High catchup rates suggest connectivity issues or filter mismatches.
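
To make "catchup rate" concrete, the sketch below computes the share of events that arrived through either catchup path out of all events in a window. The function name and the sample counts are hypothetical; only the four `source` values come from the list above.

```python
def catchup_share(events_by_source):
    """Fraction of events that arrived via catchup instead of direct or live paths.

    `events_by_source` maps a `source` label value to an event count for
    some time window (hypothetical input shape).
    """
    total = sum(events_by_source.values())
    if total == 0:
        return 0.0  # no events in the window, so there is no measurable gap
    gaps = events_by_source.get("catchup", 0) + events_by_source.get("daily_catchup", 0)
    return gaps / total


if __name__ == "__main__":
    # Made-up counts for illustration only.
    window = {"direct": 40, "live_sync": 50, "catchup": 8, "daily_catchup": 2}
    print(f"{catchup_share(window):.1%} of events came from catchup")
```

The denominator includes `direct` events, so a relay that receives mostly direct submissions will show a low share even if live sync is missing events.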
+
+### Relay Health States
+
+The `status` label on `ngit_sync_relay_status` tracks relay health:
+
+- `healthy` - Normal operation, connections working
+- `backoff` - Exponential backoff after failures (5s → 10s → ... → 1h)
+- `dead` - 24h of continuous failures, daily retry only
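
The three states can be sketched as a small pure function plus a delay schedule. The constants come from the list above; the doubling factor is an assumption read off "5s → 10s", and the jitter parameter, function names, and argument shapes are hypothetical rather than taken from the implementation.

```python
import random

INITIAL_DELAY_S = 5            # from the text: backoff starts at 5s
MAX_DELAY_S = 60 * 60          # from the text: backoff is capped at 1h
DEAD_AFTER_S = 24 * 60 * 60    # from the text: 24h of continuous failures


def backoff_delay(consecutive_failures, jitter=0.0):
    """Seconds to wait before the next attempt after N consecutive failures.

    Assumes the delay doubles per failure (5s, 10s, 20s, ...) up to the cap.
    `jitter` is a hypothetical fraction of the base delay added at random.
    """
    if consecutive_failures < 1:
        raise ValueError("backoff only applies after at least one failure")
    base = min(INITIAL_DELAY_S * 2 ** (consecutive_failures - 1), MAX_DELAY_S)
    return base + random.uniform(0, jitter * base)


def relay_status(consecutive_failures, failing_for_s):
    """Map failure history to the `status` label value."""
    if consecutive_failures == 0:
        return "healthy"
    if failing_for_s >= DEAD_AFTER_S:
        return "dead"
    return "backoff"


if __name__ == "__main__":
    for n in (1, 2, 3, 11):
        print(n, backoff_delay(n), relay_status(n, failing_for_s=60))
```

Under the doubling assumption the delay reaches the 1h cap on the eleventh consecutive failure, well before the 24h threshold that marks a relay as dead.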
+
+### Example Grafana Queries
+
+```promql
+# Relay health overview - count by status
+sum by (status) (ngit_sync_relay_status == 1)
+
+# Connection success rate over last hour
+sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
+/ sum(rate(ngit_sync_connection_attempts_total[1h]))
+
+# Sync gap detection - events that should have been live synced
+sum(rate(ngit_sync_gap_events_total[1h])) by (relay)
+
+# Live sync effectiveness (lower is better - fewer gaps)
+sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
+/ sum(rate(ngit_sync_events_total[1h]))
+
+# Relays with high failure counts (potential issues)
+topk(10, ngit_sync_relay_failures)
+```
+
+### Example Alerts
+
+```yaml
+# Alert if relay stuck in dead state for > 1 day
+- alert: SyncRelayDead
+  expr: ngit_sync_relay_status{status="dead"} == 1
+  for: 1d
+  labels:
+    severity: warning
+  annotations:
+    summary: "Sync relay {{ $labels.relay }} is dead"
+
+# Alert if sync gap rate is high (>10% of events from catchup)
+- alert: SyncGapHigh
+  expr: >
+    sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
+    / sum(rate(ngit_sync_events_total[1h])) > 0.1
+  for: 30m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High sync gap rate - {{ $value | humanizePercentage }} of events from catchup"
+```
+
+### Design Rationale
+
+**In-memory health tracking with Prometheus visibility** was chosen over database persistence because:
+
+1. **Scale**: <100 relays means per-relay labels have acceptable cardinality
+2. **Simplicity**: No database schema, migrations, or cleanup needed
+3. **Operator visibility**: Prometheus + Grafana provide better dashboards than custom queries
+4. **Restart behavior**: Conservative initial backoff (5s + jitter) avoids thundering herd on restart
+5. **Historical data**: Prometheus retains health history; in-memory state only needs current status
+
+See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details.
\ No newline at end of file