From fdbc8895e1e9e712882bd854908295a95e7afcb9 Mon Sep 17 00:00:00 2001
From: DanConwayDev
Date: Thu, 4 Dec 2025 16:54:38 +0000
Subject: docs: update GRASP-02 proactive sync event sync approach

---
 docs/explanation/monitoring.md | 96 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 91 insertions(+), 5 deletions(-)

(limited to 'docs/explanation/monitoring.md')

diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 3b1b1ac..9368bf4 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -90,10 +90,96 @@ For detailed per-repository investigation at scale, consider adding **Loki** (lo
 - Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB)
 - Pairs with Prometheus for long-term trends
 
-## Future: Sync Metrics (GRASP-02)
+## Sync Metrics (GRASP-02)
 
-When GRASP-02 proactive sync is implemented, additional metrics will track:
+When GRASP-02 proactive sync is implemented, the following metrics will be added to track relay synchronization health. These metrics use in-memory tracking with Prometheus for operator visibility (no database persistence needed for <100 relays).
 
-- Events received from sync (live vs catchup)
-- Active outbound relay connections
-- Catchup gap (events found during catchup indicating sync failures)
\ No newline at end of file
+### Sync Metrics Overview
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `ngit_sync_relay_connected` | Gauge | relay | 1 if connected, 0 if not |
+| `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
+| `ngit_sync_relay_status` | Gauge | relay, status | 1 for current status, 0 otherwise |
+| `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
+| `ngit_sync_events_total` | Counter | source | Events received by source type |
+| `ngit_sync_gap_events_total` | Counter | relay | Events found during catchup |
+| `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered |
+| `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count |
+| `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead |
+
+### Event Sources
+
+The `source` label on `ngit_sync_events_total` tracks how events were received:
+
+- `direct` - Submitted directly to our relay by a user
+- `live_sync` - Received via live WebSocket subscription (expected path)
+- `catchup` - Found during negentropy catchup after reconnect
+- `daily_catchup` - Found during daily reconciliation
+
+**Catchup events indicate sync failures** - these should have been received via live sync. High catchup rates suggest connectivity issues or filter mismatches.
+
+### Relay Health States
+
+The `status` label on `ngit_sync_relay_status` tracks relay health:
+
+- `healthy` - Normal operation, connections working
+- `backoff` - Exponential backoff after failures (5s → 10s → ... → 1h)
+- `dead` - 24h of continuous failures, daily retry only
+
+### Example Grafana Queries
+
+```promql
+# Relay health overview - count by status
+sum by (status) (ngit_sync_relay_status == 1)
+
+# Connection success rate over last hour
+sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
+/ sum(rate(ngit_sync_connection_attempts_total[1h]))
+
+# Sync gap detection - events that should have been live synced
+sum(rate(ngit_sync_gap_events_total[1h])) by (relay)
+
+# Live sync effectiveness (lower is better - fewer gaps)
+sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
+/ sum(rate(ngit_sync_events_total[1h]))
+
+# Relays with high failure counts (potential issues)
+topk(10, ngit_sync_relay_failures)
+```
+
+### Example Alerts
+
+```yaml
+# Alert if relay stuck in dead state for > 1 day
+- alert: SyncRelayDead
+  expr: ngit_sync_relay_status{status="dead"} == 1
+  for: 1d
+  labels:
+    severity: warning
+  annotations:
+    summary: "Sync relay {{ $labels.relay }} is dead"
+
+# Alert if sync gap rate is high (>10% of events from catchup)
+- alert: SyncGapHigh
+  expr: >
+    sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
+    / sum(rate(ngit_sync_events_total[1h])) > 0.1
+  for: 30m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High sync gap rate - {{ $value | humanizePercentage }} of events from catchup"
+```
+
+### Design Rationale
+
+**In-memory health tracking with Prometheus visibility** was chosen over database persistence because:
+
+1. **Scale**: <100 relays means per-relay labels have acceptable cardinality
+2. **Simplicity**: No database schema, migrations, or cleanup needed
+3. **Operator visibility**: Prometheus + Grafana provide better dashboards than custom queries
+4. **Restart behavior**: Conservative initial backoff (5s + jitter) avoids thundering herd on restart
+5. **Historical data**: Prometheus retains health history; in-memory state only needs current status
+
+See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details.
\ No newline at end of file
--
cgit v1.2.3
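
The relay health states in this patch (`healthy`, `backoff`, `dead`) can be read as a small in-memory state machine. The Python sketch below is illustrative only and is not taken from the project. The class `RelayHealth`, its method names, and the plain doubling rule are assumptions inferred from the "5s → 10s → ... → 1h" wording. Jitter is left out because the patch does not say how large it is.

```python
from dataclasses import dataclass
from typing import Dict, Optional

INITIAL_BACKOFF_S = 5.0      # "5s" starting delay from the patch
MAX_BACKOFF_S = 3600.0       # "1h" cap from the patch
DEAD_AFTER_S = 24 * 3600.0   # "24h of continuous failures"
DAILY_RETRY_S = 24 * 3600.0  # "daily retry only" once dead


@dataclass
class RelayHealth:
    """Hypothetical in-memory health record for a single relay."""

    consecutive_failures: int = 0
    first_failure_at: Optional[float] = None  # start of the current failure streak

    def record_success(self) -> None:
        # Any success ends the failure streak and returns the relay to healthy.
        self.consecutive_failures = 0
        self.first_failure_at = None

    def record_failure(self, now: float) -> None:
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.consecutive_failures += 1

    def status(self, now: float) -> str:
        if self.first_failure_at is None:
            return "healthy"
        if now - self.first_failure_at >= DEAD_AFTER_S:
            return "dead"
        return "backoff"

    def retry_delay(self, now: float) -> float:
        state = self.status(now)
        if state == "healthy":
            return 0.0
        if state == "dead":
            return DAILY_RETRY_S
        # Assumed doubling schedule: 5s, 10s, 20s, ... capped at 1h (no jitter here).
        exponent = min(self.consecutive_failures - 1, 32)
        return min(INITIAL_BACKOFF_S * 2 ** exponent, MAX_BACKOFF_S)

    def status_gauges(self, now: float) -> Dict[str, int]:
        # Mirrors the `ngit_sync_relay_status` convention: 1 for current status, 0 otherwise.
        current = self.status(now)
        return {s: 1 if s == current else 0 for s in ("healthy", "backoff", "dead")}
```

Under these assumptions, a restart begins every relay at `healthy` with no stored history, which fits the rationale that only current status needs to live in memory while Prometheus keeps the history.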