From dd403b17e7c74db9443d0891a9de1f0f0f9f89eb Mon Sep 17 00:00:00 2001
From: DanConwayDev
Date: Thu, 4 Dec 2025 18:43:49 +0000
Subject: feat(sync): Phase 6 - observability and production readiness

- Add SyncMetrics with full Prometheus integration
- Track sync gaps via catchup events
- Update Grafana dashboard with sync panels
- Document all sync configuration options
- Update design doc with implementation notes
---
 docs/explanation/grasp-02-proactive-sync.md | 128 +++++++++++
 docs/grafana/ngit-grasp-dashboard.json      | 334 ++++++++++++++++++++++++++++
 docs/reference/configuration.md             | 137 ++++++++++++
 3 files changed, 599 insertions(+)

(limited to 'docs')

diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md
index a8af3f4..98531ec 100644
--- a/docs/explanation/grasp-02-proactive-sync.md
+++ b/docs/explanation/grasp-02-proactive-sync.md
@@ -745,3 +745,131 @@ pub struct SyncConfig {
 8. **Dynamic subscription addition** with periodic consolidation
 9. **Custom acceptance policy** excluding rate limiting defaults
 10. **Catchup as failure signal** - events found during catchup/daily indicate live sync gaps, tracked in Prometheus
+
+---
+
+## Implementation Notes (Phase 6)
+
+This section documents the final implementation as of Phase 6 (Observability & Production Readiness).
+
+### What Was Actually Built
+
+The implementation closely follows the design document with the following completed components:
+
+#### Phase 1: Basic Sync (commit b167f1b)
+- [`SyncManager`](../../src/sync/manager.rs) - Main coordinator for proactive sync
+- Single relay sync via `NGIT_SYNC_RELAY_URL` configuration
+- Event validation through existing [`Nip34WritePolicy`](../../src/nostr/builder.rs)
+
+#### Phase 2: Three-Layer Filters (commit bf558b0)
+- [`FilterService`](../../src/sync/filter.rs) - Builds three-layer filter strategy
+- Layer 1: All kind 30617+30618 (announcements)
+- Layer 2: A/a tag filters for repository events
+- Layer 3: E/e tag filters for related events (PRs, Issues)
+- Multi-relay discovery from stored announcements
+
+#### Phase 3: Health Tracking (commit f639ecf)
+- [`RelayHealthTracker`](../../src/sync/health.rs) - DashMap-based health tracking
+- Three states: Healthy → Degraded → Dead
+- Exponential backoff: 5s → 10s → 20s → ... → max (default 1h)
+- Dead relay detection after 24h continuous failures
+- Startup jitter (0-10s) to prevent thundering herd
+
+#### Phase 4: Dynamic Subscriptions (commit a19ff57)
+- [`SubscriptionManager`](../../src/sync/subscription.rs) - Per-connection subscription tracking
+- Dynamic Layer 2 subscriptions when new announcements arrive
+- Dynamic Layer 3 subscriptions when new PRs/Issues arrive
+- Filter consolidation at threshold (150 filters)
+
+#### Phase 5: Catchup & Gap Detection (commit 950c2e4)
+- [`NegentropyService`](../../src/sync/negentropy.rs) - Gap-filling catchup operations
+- Startup catchup (configurable delay)
+- Reconnection catchup (limited lookback)
+- Daily catchup (not yet implemented - placeholder)
+
+#### Phase 6: Observability (this phase)
+- [`SyncMetrics`](../../src/sync/metrics.rs) - Full Prometheus integration
+- Grafana dashboard panels for sync monitoring
+- Documentation updates
+
+### Differences from Original Design
+
+1. **Negentropy (NIP-77)**: Simplified gap-filling was used instead of full NIP-77 negentropy reconciliation, as nostr-sdk 0.44 lacks built-in negentropy support. The current implementation uses timestamp-based catchup queries.
+
+2. **Filter Consolidation Threshold**: Set at 150 filters (as designed) based on typical relay filter limits.
+
+3. **Health Tracking**: Implemented exactly as designed - in-memory only (not persisted to database), which is acceptable for production as health state rebuilds quickly on restart.
+
+4. **Metric Label Strategy**: Used simpler numeric encoding for health status (1=healthy, 2=degraded, 3=dead) instead of multiple label values per relay, reducing cardinality.
+
+5. **Event Source Tracking**: Implemented four source types (`live`, `startup`, `reconnect`, `daily`) instead of the original (`direct`, `live_sync`, `catchup`, `daily_catchup`).
+
+### Three-Layer Filter Strategy (As Implemented)
+
+```
+Layer 1: Discovery Layer
+├── Query: kinds [30617, 30618] (announcements)
+├── Applied: At startup and during sync
+└── Purpose: Discover all repositories across network
+
+Layer 2: Repository Events
+├── Query: Events with A/a tags pointing to tracked repos
+├── Format: A tag = "30617::"
+├── Triggered: When new announcement is accepted
+└── Purpose: Get PRs, issues, patches for repositories
+
+Layer 3: Related Events
+├── Query: Events with E/e tags pointing to tracked PRs/Issues
+├── Triggered: When new PR/Issue is accepted
+└── Purpose: Get comments, reviews, status updates
+```
+
+### Prometheus Metrics (As Implemented)
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `ngit_sync_relay_connected` | Gauge | relay | Connection status (1/0) |
+| `ngit_sync_connection_attempts_total` | Counter | relay, result | Attempts by outcome |
+| `ngit_sync_relay_status` | Gauge | relay | Health state (1/2/3) |
+| `ngit_sync_relay_failures` | Gauge | relay | Consecutive failures |
+| `ngit_sync_events_total` | Counter | source | Events by source type |
+| `ngit_sync_gap_events_total` | Counter | relay | Gap events filled |
+| `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered |
+| `ngit_sync_relays_connected_total` | Gauge | - | Currently connected |
+| `ngit_sync_relays_dead_total` | Gauge | - | Dead relay count |
+
+### Configuration Options (As Implemented)
+
+All configuration via environment variables or CLI flags:
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `NGIT_SYNC_RELAY_URL` | String | None | Primary sync relay URL |
+| `NGIT_SYNC_MAX_BACKOFF_SECS` | u64 | 3600 | Max backoff delay (seconds) |
+| `NGIT_SYNC_STARTUP_DELAY_SECS` | u64 | 30 | Catchup delay after startup |
+| `NGIT_SYNC_RECONNECT_DELAY_SECS` | u64 | 10 | Catchup delay after reconnect |
+| `NGIT_SYNC_RECONNECT_LOOKBACK_DAYS` | u64 | 3 | Days to look back on reconnect |
+
+### Module Structure (As Implemented)
+
+```
+src/sync/
+├── mod.rs           # Module exports, constants
+├── manager.rs       # SyncManager - orchestrates sync
+├── connection.rs    # SyncConnection - per-relay WebSocket
+├── filter.rs        # FilterService - three-layer filters
+├── health.rs        # RelayHealthTracker - health states
+├── metrics.rs       # SyncMetrics - Prometheus integration
+├── negentropy.rs    # NegentropyService - gap-filling
+└── subscription.rs  # SubscriptionManager - dynamic subs
+```
+
+### Production Readiness Checklist
+
+- [x] All metrics exposed at `/metrics` endpoint
+- [x] Health state tracking with configurable backoff
+- [x] Dead relay detection and minimal retry
+- [x] Startup jitter to prevent thundering herd
+- [x] Grafana dashboard with sync panels
+- [x] Configuration documented
+- [x] Integration tests passing
diff --git a/docs/grafana/ngit-grasp-dashboard.json b/docs/grafana/ngit-grasp-dashboard.json
index bd1b6fe..3b9b216 100644
--- a/docs/grafana/ngit-grasp-dashboard.json
+++ b/docs/grafana/ngit-grasp-dashboard.json
@@ -641,6 +641,340 @@
       ],
       "title": "Events Stored vs Rejected (5m)",
       "type": "timeseries"
+    },
+    {
+      "collapsed": false,
+      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 48 },
+      "id": 40,
+      "title": "Proactive Sync",
+      "type": "row"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "none" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": { "mode": "absolute", "steps": [] },
+          "unit": "short"
+        }
+      },
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 49 },
+      "id": 41,
+      "options": {
+        "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "none" }
+      },
+      "targets": [
+        {
+          "expr": "ngit_sync_relays_connected_total",
+          "legendFormat": "Connected",
+          "refId": "A"
+        },
+        {
+          "expr": "ngit_sync_relays_tracked_total",
+          "legendFormat": "Tracked",
+          "refId": "B"
+        },
+        {
+          "expr": "ngit_sync_relays_dead_total",
+          "legendFormat": "Dead",
+          "refId": "C"
+        }
+      ],
+      "title": "Sync Relays Over Time",
+      "type": "timeseries"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": { "hideFrom": { "legend": false, "tooltip": false, "viz": false } },
+          "mappings": [],
+          "unit": "short"
+        },
+        "overrides": [
+          {
+            "matcher": { "id": "byName", "options": "healthy" },
+            "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }]
+          },
+          {
+            "matcher": { "id": "byName", "options": "degraded" },
+            "properties": [{ "id": "color", "value": { "fixedColor": "yellow", "mode": "fixed" } }]
+          },
+          {
+            "matcher": { "id": "byName", "options": "dead" },
+            "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }]
+          }
+        ]
+      },
+      "gridPos": { "h": 8, "w": 6, "x": 12, "y": 49 },
+      "id": 42,
+      "options": {
+        "legend": { "displayMode": "list", "placement": "right", "showLegend": true },
+        "pieType": "pie",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "tooltip": { "mode": "single", "sort": "none" }
+      },
+      "targets": [
+        {
+          "expr": "count(ngit_sync_relay_status == 1)",
+          "legendFormat": "healthy",
+          "refId": "A"
+        },
+        {
+          "expr": "count(ngit_sync_relay_status == 2)",
+          "legendFormat": "degraded",
+          "refId": "B"
+        },
+        {
+          "expr": "count(ngit_sync_relay_status == 3)",
+          "legendFormat": "dead",
+          "refId": "C"
+        }
+      ],
+      "title": "Relay Health Distribution",
+      "type": "piechart"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 1 },
+              { "color": "red", "value": 5 }
+            ]
+          },
+          "unit": "short"
+        }
+      },
+      "gridPos": { "h": 4, "w": 3, "x": 18, "y": 49 },
+      "id": 43,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "expr": "ngit_sync_relays_dead_total",
+          "legendFormat": "Dead",
+          "refId": "A"
+        }
+      ],
+      "title": "Dead Relays",
+      "type": "stat"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "thresholds" },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [{ "color": "blue", "value": null }]
+          },
+          "unit": "short"
+        }
+      },
+      "gridPos": { "h": 4, "w": 3, "x": 21, "y": 49 },
+      "id": 44,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "auto"
+      },
+      "targets": [
+        {
+          "expr": "ngit_sync_relays_connected_total",
+          "legendFormat": "Connected",
+          "refId": "A"
+        }
+      ],
+      "title": "Connected Relays",
+      "type": "stat"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "bars",
+            "fillOpacity": 50,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "normal" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": { "mode": "absolute", "steps": [] },
+          "unit": "short"
+        },
+        "overrides": [
+          {
+            "matcher": { "id": "byName", "options": "success" },
+            "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }]
+          },
+          {
+            "matcher": { "id": "byName", "options": "failure" },
+            "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }]
+          }
+        ]
+      },
+      "gridPos": { "h": 4, "w": 6, "x": 18, "y": 53 },
+      "id": 45,
+      "options": {
+        "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "none" }
+      },
+      "targets": [
+        {
+          "expr": "increase(ngit_sync_connection_attempts_total{result=\"success\"}[5m])",
+          "legendFormat": "success",
+          "refId": "A"
+        },
+        {
+          "expr": "increase(ngit_sync_connection_attempts_total{result=\"failure\"}[5m])",
+          "legendFormat": "failure",
+          "refId": "B"
+        }
+      ],
+      "title": "Connection Attempts (5m)",
+      "type": "timeseries"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "none" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": { "mode": "absolute", "steps": [] },
+          "unit": "short"
+        }
+      },
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 57 },
+      "id": 46,
+      "options": {
+        "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "none" }
+      },
+      "targets": [
+        {
+          "expr": "rate(ngit_sync_events_total[5m])",
+          "legendFormat": "{{source}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Synced Events by Source (5m)",
+      "type": "timeseries"
+    },
+    {
+      "datasource": { "type": "prometheus", "uid": "${datasource}" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "bars",
+            "fillOpacity": 50,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "normal" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": { "mode": "absolute", "steps": [] },
+          "unit": "short"
+        }
+      },
+      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 57 },
+      "id": 47,
+      "options": {
+        "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
+        "tooltip": { "mode": "multi", "sort": "none" }
+      },
+      "targets": [
+        {
+          "expr": "increase(ngit_sync_gap_events_total[1h])",
+          "legendFormat": "{{relay}}",
+          "refId": "A"
+        }
+      ],
+      "title": "Gap Events Filled by Relay (1h)",
+      "type": "timeseries"
     }
   ],
   "refresh": "30s",
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
index e2ec9aa..80ae45c 100644
--- a/docs/reference/configuration.md
+++ b/docs/reference/configuration.md
@@ -265,6 +265,143 @@ NGIT_DATABASE_BACKEND=lmdb
 
 ---
 
+### Proactive Sync Configuration (GRASP-02)
+
+These options configure the proactive sync feature that synchronizes events from other relays.
+
+#### `NGIT_SYNC_RELAY_URL`
+
+**Description:** URL of the primary relay to sync events from
+**Type:** String (WebSocket URL)
+**Default:** None (sync disabled)
+**Required:** No
+
+**Examples:**
+```bash
+# Sync from a public relay
+NGIT_SYNC_RELAY_URL=wss://relay.example.com
+
+# Sync from another GRASP relay
+NGIT_SYNC_RELAY_URL=wss://git.nostr.dev
+
+# Local testing
+NGIT_SYNC_RELAY_URL=ws://127.0.0.1:8081
+```
+
+**Notes:**
+- When set, enables proactive sync feature
+- The relay will discover additional relays from repository announcements
+- Synced events go through the same validation as directly-submitted events
+- Use WebSocket protocol (`ws://` or `wss://`)
+
+---
+
+#### `NGIT_SYNC_MAX_BACKOFF_SECS`
+
+**Description:** Maximum backoff time in seconds for sync relay reconnection
+**Type:** Integer (seconds)
+**Default:** `3600` (1 hour)
+**Required:** No
+
+**Examples:**
+```bash
+# Default: 1 hour max backoff
+NGIT_SYNC_MAX_BACKOFF_SECS=3600
+
+# Aggressive: 5 minute max backoff
+NGIT_SYNC_MAX_BACKOFF_SECS=300
+
+# Conservative: 2 hour max backoff
+NGIT_SYNC_MAX_BACKOFF_SECS=7200
+```
+
+**Notes:**
+- Backoff starts at 5 seconds and doubles on each failure
+- Capped at this maximum value
+- After 24 hours of failures, relay is marked "dead" and retried daily
+- Lower values mean more reconnection attempts
+
+---
+
+#### `NGIT_SYNC_STARTUP_DELAY_SECS`
+
+**Description:** Delay in seconds before running startup catchup
+**Type:** Integer (seconds)
+**Default:** `30`
+**Required:** No
+
+**Examples:**
+```bash
+# Default: 30 second delay
+NGIT_SYNC_STARTUP_DELAY_SECS=30
+
+# Quick startup (testing)
+NGIT_SYNC_STARTUP_DELAY_SECS=5
+
+# Production: longer warm-up
+NGIT_SYNC_STARTUP_DELAY_SECS=60
+```
+
+**Notes:**
+- Allows connections to stabilize before catchup
+- Reduces load on remote relays at startup
+- Set to 0 for immediate catchup (not recommended)
+
+---
+
+#### `NGIT_SYNC_RECONNECT_DELAY_SECS`
+
+**Description:** Delay in seconds before running catchup after reconnection
+**Type:** Integer (seconds)
+**Default:** `10`
+**Required:** No
+
+**Examples:**
+```bash
+# Default: 10 second delay
+NGIT_SYNC_RECONNECT_DELAY_SECS=10
+
+# Quick reconnect catchup
+NGIT_SYNC_RECONNECT_DELAY_SECS=5
+
+# Conservative
+NGIT_SYNC_RECONNECT_DELAY_SECS=30
+```
+
+**Notes:**
+- Prevents rate limiting from remote relays
+- Applied after each successful reconnection
+- Only catches up on recent events (see lookback days)
+
+---
+
+#### `NGIT_SYNC_RECONNECT_LOOKBACK_DAYS`
+
+**Description:** Number of days to look back for reconnect catchup
+**Type:** Integer (days)
+**Default:** `3`
+**Required:** No
+
+**Examples:**
+```bash
+# Default: 3 days lookback
+NGIT_SYNC_RECONNECT_LOOKBACK_DAYS=3
+
+# Short lookback (frequent reconnects expected)
+NGIT_SYNC_RECONNECT_LOOKBACK_DAYS=1
+
+# Extended lookback
+NGIT_SYNC_RECONNECT_LOOKBACK_DAYS=7
+```
+
+**Notes:**
+- Limits catchup queries to recent events only
+- Reduces load compared to full historical sync
+- Balance between completeness and performance
+- Longer lookback useful for less reliable connections
+
+---
+
 ### Logging Configuration
 
 #### `RUST_LOG`
-- 
cgit v1.2.3
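
Reviewer sketch (not part of the patch): the backoff and health rules documented for `NGIT_SYNC_MAX_BACKOFF_SECS` and in the Phase 3 notes can be illustrated in Rust as below. This is a minimal sketch under stated assumptions, not the code in `src/sync/health.rs`. The names `backoff_delay`, `classify`, `HealthState::as_metric`, and both constants are hypothetical. The 5 s start, the doubling, the cap, the 24 h dead threshold, and the 1/2/3 gauge encoding come from the patch text. Treating any failure as `Degraded` is an assumption, since the patch does not say when a relay leaves `Healthy`. Startup jitter is left out.

```rust
use std::time::Duration;

/// Initial reconnect delay ("Backoff starts at 5 seconds").
const INITIAL_BACKOFF_SECS: u64 = 5;

/// Continuous failure window after which a relay counts as dead (24 hours).
const DEAD_AFTER: Duration = Duration::from_secs(24 * 60 * 60);

/// Health states named in the patch; the enum itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HealthState {
    Healthy,
    Degraded,
    Dead,
}

impl HealthState {
    /// Numeric encoding documented for `ngit_sync_relay_status`.
    fn as_metric(self) -> i64 {
        match self {
            HealthState::Healthy => 1,
            HealthState::Degraded => 2,
            HealthState::Dead => 3,
        }
    }
}

/// Hypothetical helper: delay before the next reconnect attempt after
/// `consecutive_failures` failures (1-based), capped at `max_backoff_secs`.
fn backoff_delay(consecutive_failures: u32, max_backoff_secs: u64) -> Duration {
    if consecutive_failures == 0 {
        return Duration::ZERO;
    }
    // Clamp the exponent so the shift below cannot overflow a u64.
    let exponent = (consecutive_failures - 1).min(32);
    let secs = INITIAL_BACKOFF_SECS.saturating_mul(1u64 << exponent);
    Duration::from_secs(secs.min(max_backoff_secs))
}

/// Hypothetical classifier. Assumption: any failure means Degraded until the
/// relay has been failing continuously for `DEAD_AFTER`.
fn classify(consecutive_failures: u32, failing_for: Duration) -> HealthState {
    if consecutive_failures == 0 {
        HealthState::Healthy
    } else if failing_for >= DEAD_AFTER {
        HealthState::Dead
    } else {
        HealthState::Degraded
    }
}
```

With the default cap of 3600 seconds, this sketch yields 5 s, 10 s, 20 s, and so on, and first hits the cap on the 11th consecutive failure, because 5 × 2^10 = 5120 s exceeds 3600 s.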