1 files changed, 91 insertions, 5 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 3b1b1ac..9368bf4 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -90,10 +90,96 @@ For detailed per-repository investigation at scale, consider adding **Loki** (lo
 - Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB)
 - Pairs with Prometheus for long-term trends
-## Future: Sync Metrics (GRASP-02)
+## Sync Metrics (GRASP-02)
-When GRASP-02 proactive sync is implemented, additional metrics will track:
+When GRASP-02 proactive sync is implemented, the following metrics will be added to track relay synchronization health. Relay health is tracked in memory and exposed through Prometheus for operator visibility; with fewer than 100 relays, no database persistence is needed.
-- Events received from sync (live vs catchup)
-- Active outbound relay connections
-- Catchup gap (events found during catchup indicating sync failures)
\ No newline at end of file
+### Sync Metrics Overview
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `ngit_sync_relay_connected` | Gauge | relay | 1 if connected, 0 if not |
+| `ngit_sync_connection_attempts_total` | Counter | relay, result | Connection attempt outcomes |
+| `ngit_sync_relay_status` | Gauge | relay, status | 1 for current status, 0 otherwise |
+| `ngit_sync_relay_failures` | Gauge | relay | Current consecutive failure count |
+| `ngit_sync_events_total` | Counter | source | Events received by source type |
+| `ngit_sync_gap_events_total` | Counter | relay | Events found during catchup |
+| `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered |
+| `ngit_sync_relays_connected_total` | Gauge | - | Currently connected relay count |
+| `ngit_sync_relays_dead_total` | Gauge | - | Relays marked as dead |
+### Event Sources
+The `source` label on `ngit_sync_events_total` tracks how events were received:
+- `direct` - Submitted directly to our relay by a user
+- `live_sync` - Received via live WebSocket subscription (expected path)
+- `catchup` - Found during negentropy catchup after reconnect
+- `daily_catchup` - Found during daily reconciliation
+**Catchup events indicate sync failures**: they should have been received via live sync. A high catchup rate suggests connectivity issues or filter mismatches.
+### Relay Health States
+The `status` label on `ngit_sync_relay_status` tracks relay health:
+- `healthy` - Normal operation, connections working
+- `backoff` - Exponential backoff after failures (5s → 10s → ... → 1h)
+- `dead` - 24h of continuous failures, daily retry only
+### Example Grafana Queries
+```promql
+# Relay health overview - count by status
+sum by (status) (ngit_sync_relay_status == 1)
+# Connection success rate over last hour
+sum(rate(ngit_sync_connection_attempts_total{result="success"}[1h]))
+/ sum(rate(ngit_sync_connection_attempts_total[1h]))
+# Sync gap detection - events that should have been live synced
+sum by (relay) (rate(ngit_sync_gap_events_total[1h]))
+# Catchup ratio: share of events that missed live sync (lower is better)
+sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
+/ sum(rate(ngit_sync_events_total[1h]))
+# Relays with high failure counts (potential issues)
+topk(10, ngit_sync_relay_failures)
+```
+### Example Alerts
+```yaml
+# Alert if relay stuck in dead state for > 1 day
+- alert: SyncRelayDead
+  expr: ngit_sync_relay_status{status="dead"} == 1
+  for: 1d
+  labels:
+    severity: warning
+  annotations:
+    summary: "Sync relay {{ $labels.relay }} is dead"
+# Alert if sync gap rate is high (>10% of events from catchup)
+- alert: SyncGapHigh
+  expr: >
+    sum(rate(ngit_sync_events_total{source=~"catchup|daily_catchup"}[1h]))
+    / sum(rate(ngit_sync_events_total[1h])) > 0.1
+  for: 30m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High sync gap rate - {{ $value | humanizePercentage }} of events from catchup"
+```
+### Design Rationale
+**In-memory health tracking with Prometheus visibility** was chosen over database persistence because:
+1. **Scale**: <100 relays means per-relay labels have acceptable cardinality
+2. **Simplicity**: No database schema, migrations, or cleanup needed
+3. **Operator visibility**: Prometheus + Grafana provide better dashboards than custom queries
+4. **Restart behavior**: Conservative initial backoff (5s + jitter) avoids a thundering herd on restart
+5. **Historical data**: Prometheus retains health history; in-memory state only needs current status
+See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details.
\ No newline at end of file
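
Review note: the relay health states and the `ngit_sync_relay_status` / `ngit_sync_relay_failures` gauges described in this diff could be backed by a small in-memory record per relay. The following is a minimal sketch only, not the actual implementation. The names `RelayHealth`, `base_delay`, `status_gauge`, and `with_jitter` are hypothetical, and so are the doubling schedule between 10s and 1h (the doc elides it) and the 20% jitter fraction. The 5s initial delay, the 1h cap, and the 24h dead threshold come from the doc.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional

INITIAL_BACKOFF_S = 5.0      # from the doc: first retry after 5s
MAX_BACKOFF_S = 3600.0       # from the doc: backoff is capped at 1h
DEAD_AFTER_S = 24 * 3600.0   # from the doc: 24h of continuous failures
STATUSES = ("healthy", "backoff", "dead")


@dataclass
class RelayHealth:
    """Hypothetical in-memory health record for a single relay."""

    failures: int = 0                         # consecutive failure count
    first_failure_at: Optional[float] = None  # start of the current failure streak

    def record_success(self) -> None:
        # Any success ends the failure streak.
        self.failures = 0
        self.first_failure_at = None

    def record_failure(self, now: float) -> None:
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.failures += 1

    def status(self, now: float) -> str:
        if self.failures == 0 or self.first_failure_at is None:
            return "healthy"
        if now - self.first_failure_at >= DEAD_AFTER_S:
            return "dead"
        return "backoff"

    def base_delay(self) -> float:
        """Retry delay without jitter: 5s, 10s, 20s, ... capped at 1h.

        Doubling is an assumption; the doc only shows 5s -> 10s -> ... -> 1h.
        """
        if self.failures == 0:
            return 0.0
        # Clamp the exponent so very long streaks cannot overflow a float.
        exponent = min(self.failures - 1, 16)
        return min(INITIAL_BACKOFF_S * (2 ** exponent), MAX_BACKOFF_S)

    def status_gauge(self, now: float) -> Dict[str, int]:
        """One-hot values for the status label: 1 for the current status, else 0."""
        current = self.status(now)
        return {s: int(s == current) for s in STATUSES}


def with_jitter(delay: float, fraction: float = 0.2) -> float:
    """Add up to `fraction` of the delay as random jitter (fraction is assumed)."""
    return delay + random.uniform(0.0, delay * fraction)
```

On each scrape or state change, `status_gauge` would supply the three values for one relay's `status` label and `failures` the value for the failure gauge. Because the record is rebuilt from scratch after a restart, every relay starts as `healthy` and the first failure waits `with_jitter(5.0)` seconds, which matches the restart behavior described in the design rationale.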
