| author | DanConwayDev <DanConwayDev@protonmail.com> | 2025-12-04 16:54:38 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2025-12-04 16:54:38 +0000 |
| commit | fdbc8895e1e9e712882bd854908295a95e7afcb9 (patch) | |
| tree | 01a22f9d4b412d0099702afdd9272af2b7be3de5 /docs/explanation/monitoring.md | |
| parent | 8c129a4aeab3288f8193ccb820adf00860c50d74 (diff) | |
docs: update GRASP-02 proactive sync event sync approach
Diffstat (limited to 'docs/explanation/monitoring.md')
| -rw-r--r-- | docs/explanation/monitoring.md | 96 |
1 file changed, 91 insertions, 5 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 3b1b1ac..9368bf4 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
| @@ -90,10 +90,96 @@ For detailed per-repository investigation at scale, consider adding **Loki** (lo | |||
| 90 | - Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB) | 90 | - Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB) |
| 91 | - Pairs with Prometheus for long-term trends | 91 | - Pairs with Prometheus for long-term trends |
| 92 | 92 | ||
| 93 | ## Future: Sync Metrics (GRASP-02) | 93 | ## Sync Metrics (GRASP-02) |
| 94 | 94 | ||
| 95 | When GRASP-02 proactive sync is implemented, additional metrics will track: | 95 | When GRASP-02 proactive sync is implemented, the following metrics will be added to track relay synchronization health. These metrics use in-memory tracking with Prometheus for operator visibility (no database persistence needed for <100 relays). |
| 96 | 96 | ||
| 97 | - Events received from sync (live vs catchup) | 97 | ### Sync Metrics Overview |
| 98 | - Active outbound relay connections | 98 | |
| 99 | - Catchup gap (events found during catchup indicating sync failures) \ No newline at end of file | 99 | | Metric | Type | Labels | Description | |
| 100 | |--------|------|--------|-------------| | ||
| 101 | | `ngit_sync_relay_connected` | Gauge | relay | 1 if connected, 0 if not | | ||
| 102 | | `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes | | ||
| 103 | | `ngit_sync_relay_status` | Gauge | relay, status | 1 for current status, 0 otherwise | | ||
| 104 | | `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count | | ||
| 105 | | `ngit_sync_events_total` | Counter | source | Events received by source type | | ||
| 106 | | `ngit_sync_gap_events_total` | Counter | relay | Events found during catchup | | ||
| 107 | | `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered | | ||
| 108 | | `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count | | ||
| 109 | | `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead | | ||
| 110 | |||
| 111 | ### Event Sources | ||
| 112 | |||
| 113 | The `source` label on `ngit_sync_events_total` tracks how events were received: | ||
| 114 | |||
| 115 | - `direct` - Submitted directly to our relay by a user | ||
| 116 | - `live_sync` - Received via live WebSocket subscription (expected path) | ||
| 117 | - `catchup` - Found during negentropy catchup after reconnect | ||
| 118 | - `daily_catchup` - Found during daily reconciliation | ||
| 119 | |||
| 120 | **Catchup events indicate sync failures** - these should have been received via live sync. High catchup rates suggest connectivity issues or filter mismatches. | ||
| 121 | |||
| 122 | ### Relay Health States | ||
| 123 | |||
| 124 | The `status` label on `ngit_sync_relay_status` tracks relay health: | ||
| 125 | |||
| 126 | - `healthy` - Normal operation, connections working | ||
| 127 | - `backoff` - Exponential backoff after failures (5s → 10s → ... → 1h) | ||
| 128 | - `dead` - 24h of continuous failures, daily retry only | ||
| 129 | |||
| 130 | ### Example Grafana Queries | ||
| 131 | |||
| 132 | ```promql | ||
| 133 | # Relay health overview - count by status | ||
| 134 | sum by (status) (ngit_sync_relay_status == 1) | ||
| 135 | |||
| 136 | # Connection success rate over last hour | ||
| 137 | sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h])) | ||
| 138 | / sum(rate(ngit_sync_connection_attempts_total[1h])) | ||
| 139 | |||
| 140 | # Sync gap detection - events that should have been live synced | ||
| 141 | sum(rate(ngit_sync_gap_events_total[1h])) by (relay) | ||
| 142 | |||
| 143 | # Share of events that arrived via catchup instead of live sync (lower is better - fewer gaps) | ||
| 144 | sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h])) | ||
| 145 | / sum(rate(ngit_sync_events_total[1h])) | ||
| 146 | |||
| 147 | # Relays with high failure counts (potential issues) | ||
| 148 | topk(10, ngit_sync_relay_failures) | ||
| 149 | ``` | ||
| 150 | |||
| 151 | ### Example Alerts | ||
| 152 | |||
| 153 | ```yaml | ||
| 154 | # Alert if relay stuck in dead state for > 1 day | ||
| 155 | - alert: SyncRelayDead | ||
| 156 |   expr: ngit_sync_relay_status{status="dead"} == 1 | ||
| 157 |   for: 1d | ||
| 158 |   labels: | ||
| 159 |     severity: warning | ||
| 160 |   annotations: | ||
| 161 |     summary: "Sync relay {{ $labels.relay }} is dead" | ||
| 162 | |||
| 163 | # Alert if sync gap rate is high (>10% of events from catchup) | ||
| 164 | - alert: SyncGapHigh | ||
| 165 |   expr: > | ||
| 166 |     sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h])) | ||
| 167 |     / sum(rate(ngit_sync_events_total[1h])) > 0.1 | ||
| 168 |   for: 30m | ||
| 169 |   labels: | ||
| 170 |     severity: warning | ||
| 171 |   annotations: | ||
| 172 |     summary: "High sync gap rate - {{ $value | humanizePercentage }} of events from catchup" | ||
| 173 | ``` | ||
| 174 | |||
| 175 | ### Design Rationale | ||
| 176 | |||
| 177 | **In-memory health tracking with Prometheus visibility** was chosen over database persistence because: | ||
| 178 | |||
| 179 | 1. **Scale**: <100 relays means per-relay labels have acceptable cardinality | ||
| 180 | 2. **Simplicity**: No database schema, migrations, or cleanup needed | ||
| 181 | 3. **Operator visibility**: Prometheus + Grafana provide better dashboards than custom queries | ||
| 182 | 4. **Restart behavior**: In-memory health state is lost on restart, but a conservative initial backoff (5s + jitter) avoids a thundering herd when reconnecting | ||
| 183 | 5. **Historical data**: Prometheus retains health history; in-memory state only needs current status | ||
| 184 | |||
| 185 | See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details. \ No newline at end of file | ||
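
The relay health states and backoff schedule described in the added section can be illustrated with a minimal in-memory sketch. This is not the ngit implementation: the class `RelayHealth`, its method names, the doubling factor (inferred from "5s → 10s → ... → 1h"), and the 10% jitter range are all assumptions made for illustration. Only the 5s initial delay, the 1h cap, the 24h dead threshold, and the daily retry come from the text above.

```python
import random
from dataclasses import dataclass
from typing import Optional

INITIAL_BACKOFF_S = 5.0        # from the doc: backoff starts at 5s
MAX_BACKOFF_S = 3600.0         # from the doc: backoff is capped at 1h
DEAD_AFTER_S = 24 * 3600.0     # from the doc: dead after 24h of continuous failures
DEAD_RETRY_S = 24 * 3600.0     # from the doc: dead relays get a daily retry only
MAX_EXPONENT = 10              # assumption: 5s * 2**10 already exceeds the 1h cap


@dataclass
class RelayHealth:
    """Hypothetical per-relay health record kept only in memory."""

    failures: int = 0                         # consecutive failure count
    first_failure_at: Optional[float] = None  # start of the current failure streak

    def record_success(self) -> None:
        # Any successful connection resets the relay to healthy.
        self.failures = 0
        self.first_failure_at = None

    def record_failure(self, now: float) -> None:
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.failures += 1

    def status(self, now: float) -> str:
        # Maps onto the `status` label values: healthy, backoff, dead.
        if self.failures == 0 or self.first_failure_at is None:
            return "healthy"
        if now - self.first_failure_at >= DEAD_AFTER_S:
            return "dead"
        return "backoff"

    def base_delay(self) -> float:
        # Assumed doubling schedule: 5s, 10s, 20s, ... capped at 1h.
        if self.failures == 0:
            return 0.0
        exponent = min(self.failures - 1, MAX_EXPONENT)
        return min(INITIAL_BACKOFF_S * 2 ** exponent, MAX_BACKOFF_S)

    def retry_delay(self, now: float) -> float:
        # Dead relays are retried daily; others use backoff plus jitter.
        if self.status(now) == "dead":
            return DEAD_RETRY_S
        base = self.base_delay()
        # Assumption: up to 10% random jitter to spread out reconnects.
        return base + random.uniform(0.0, base * 0.1)
```

In a design like this, `failures` would feed the `ngit_sync_relay_failures` gauge and `status()` would decide which `status` label on `ngit_sync_relay_status` is set to 1.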