upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

Diffstat (limited to 'docs/explanation/monitoring.md')
-rw-r--r-- docs/explanation/monitoring.md 96
1 files changed, 95 insertions, 1 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
index 7520813..bd652be 100644
--- a/docs/explanation/monitoring.md
+++ b/docs/explanation/monitoring.md
@@ -204,4 +204,98 @@ sum(ngit_sync_relay_status == 5) # RateLimited
4. **Restart behavior**: Conservative initial backoff (5s + jitter) avoids thundering herd on restart
5. **Historical data**: Prometheus retains health history; in-memory state only needs current status

See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details.

## Rejected Events Index Metrics

The rejected events index tracks rejected repository announcements and state events. It prevents wasteful re-fetching during negentropy sync and enables race condition resolution.

### Rejected Events Metrics

All metrics carry an `event_type` label whose value is `"announcement"` or `"state"`:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ngit_rejected_hot_cache_current` | Gauge | event_type | Current number of entries in hot cache |
| `ngit_rejected_cold_index_current` | Gauge | event_type | Current number of entries in cold index |
| `ngit_rejected_hot_cache_hits` | Counter | event_type | Events successfully retrieved from hot cache for re-processing |
| `ngit_rejected_hot_cache_misses` | Counter | event_type | Events expired from hot cache before dependency arrived |
| `ngit_rejected_hot_cache_expired` | Counter | event_type | Entries cleaned up from hot cache (2-minute expiry) |
| `ngit_rejected_cold_index_expired` | Counter | event_type | Entries cleaned up from cold index (7-day expiry) |
| `ngit_rejected_invalidated` | Counter | event_type | Entries invalidated when dependency was satisfied |

### Example Grafana Queries

```promql
# Hot cache efficiency per event_type: how often events are re-processed from the cache
rate(ngit_rejected_hot_cache_hits_total[5m])
/ (rate(ngit_rejected_hot_cache_hits_total[5m]) + rate(ngit_rejected_hot_cache_misses_total[5m]))

# Current rejected events by type
ngit_rejected_hot_cache_current{event_type="announcement"}
ngit_rejected_hot_cache_current{event_type="state"}
ngit_rejected_cold_index_current{event_type="announcement"}
ngit_rejected_cold_index_current{event_type="state"}

# Race condition resolution rate: invalidations indicate successful dependency arrival
rate(ngit_rejected_invalidated_total[5m])

# Cache hit ratio aggregated across event types (higher is better; dependencies are arriving quickly)
sum(rate(ngit_rejected_hot_cache_hits_total[5m]))
/ sum(rate(ngit_rejected_hot_cache_hits_total[5m]) + rate(ngit_rejected_hot_cache_misses_total[5m]))
```

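To make the ratio in the queries above concrete, here is a minimal Python sketch of the same arithmetic applied to counter increases over one window. The function name and its inputs are hypothetical and not part of the project; it only mirrors the hits / (hits + misses) expression, including the case where no lookups happened (PromQL evaluates 0/0 to NaN, which this sketch represents as `None`).

```python
from typing import Optional


def hit_ratio(hits_delta: float, misses_delta: float) -> Optional[float]:
    """Fraction of hot cache lookups that were hits within one window.

    hits_delta and misses_delta are the increases of the hits and misses
    counters over the window (hypothetical inputs for illustration).
    """
    total = hits_delta + misses_delta
    if total == 0:
        # No lookups in the window: the ratio is undefined.
        return None
    return hits_delta / total


# 30 hits and 10 misses in the window give a hit ratio of 0.75.
example = hit_ratio(30, 10)
```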
### Example Alerts

```yaml
# Alert if the hot cache miss rate is too high (suggests timing issues)
- alert: RejectedEventsCacheMissRate
  expr: |
    sum(rate(ngit_rejected_hot_cache_misses_total[5m]))
    / sum(rate(ngit_rejected_hot_cache_hits_total[5m]) + rate(ngit_rejected_hot_cache_misses_total[5m]))
    > 0.8
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High rejected events cache miss rate ({{ $value | humanizePercentage }})"
    description: "Most rejected events are expiring before dependencies arrive"

# Alert if the cold index grows too large
- alert: RejectedEventsColdIndexSize
  expr: ngit_rejected_cold_index_current > 10000
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Rejected events cold index has {{ $value }} entries"
    description: "Consider investigating why many events are being rejected"
```

### Two-Tier Architecture

**Hot Cache (2 minutes):**
- Stores full event objects
- Enables immediate re-processing when dependencies arrive
- Cleaned up every 60 seconds
- Memory: ~200 KB typical, ~20 MB worst case

**Cold Index (7 days):**
- Stores metadata only (event ID, pubkey, identifier, reason)
- Prevents re-downloading during negentropy sync
- Cleaned up daily
- Memory: ~1 MB typical

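The two tiers described above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the project's implementation: the class name, method names, and event field names (`id`, `pubkey`, `identifier`) are hypothetical, and only the expiry windows and the split between full events and metadata come from the text.

```python
from dataclasses import dataclass, field

HOT_TTL_SECS = 2 * 60              # hot cache expiry (2 minutes, from the text)
COLD_TTL_SECS = 7 * 24 * 60 * 60   # cold index expiry (7 days, from the text)


@dataclass
class RejectedIndex:
    """Hypothetical two-tier index keyed by event ID."""
    hot: dict = field(default_factory=dict)   # id -> (full event, rejected_at)
    cold: dict = field(default_factory=dict)  # id -> (metadata, rejected_at)

    def reject(self, event: dict, reason: str, now: float) -> None:
        # The hot tier keeps the full event so it can be re-processed later.
        self.hot[event["id"]] = (event, now)
        # The cold tier keeps metadata only.
        meta = {
            "pubkey": event["pubkey"],
            "identifier": event["identifier"],
            "reason": reason,
        }
        self.cold[event["id"]] = (meta, now)

    @staticmethod
    def _expire(tier: dict, ttl: float, now: float) -> int:
        expired = [eid for eid, (_, ts) in tier.items() if now - ts >= ttl]
        for eid in expired:
            del tier[eid]
        return len(expired)

    def cleanup_hot(self, now: float) -> int:
        """Intended to run every 60 seconds; returns the number of entries removed."""
        return self._expire(self.hot, HOT_TTL_SECS, now)

    def cleanup_cold(self, now: float) -> int:
        """Intended to run daily; returns the number of entries removed."""
        return self._expire(self.cold, COLD_TTL_SECS, now)
```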
### Use Cases

**Race Condition Resolution:**
When a maintainer announcement arrives before the owner announcement:
1. Maintainer event rejected → hot cache + cold index
2. Owner announcement accepted → invalidate from cold index
3. If still in hot cache → immediate re-processing (<1 second)
4. If expired from hot cache → will be re-fetched on next sync

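Steps 2 to 4 above can be sketched as a single function. This is a hedged illustration: `on_owner_accepted`, its parameters, and the way dependent event IDs are supplied are all assumptions made for the example; the source does not say how dependents are looked up.

```python
def on_owner_accepted(hot: dict, cold: set, dependent_ids: list, process) -> list:
    """Handle acceptance of an owner announcement (hypothetical helper).

    hot maps event ID -> full event, cold is the set of rejected event IDs,
    dependent_ids are the rejected events that were waiting on this owner,
    and process is a callback that re-processes one event.
    Returns the IDs that were re-processed immediately.
    """
    reprocessed = []
    for event_id in dependent_ids:
        # Step 2: invalidate from the cold index so sync no longer skips it.
        cold.discard(event_id)
        # Step 3: if the full event is still in the hot cache, re-process now.
        event = hot.pop(event_id, None)
        if event is not None:
            process(event)
            reprocessed.append(event_id)
        # Step 4: otherwise nothing to do here; the ID is no longer excluded,
        # so the event is picked up again on the next sync.
    return reprocessed
```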
**Negentropy Sync Efficiency:**
During sync, cold index IDs are excluded from the "missing events" calculation, preventing wasteful re-download of events that will be rejected again.

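The exclusion amounts to a set difference. The sketch below assumes the reconciliation step has already produced the set of event IDs the remote relay holds; negentropy itself reconciles sets without exchanging full ID lists, so this only illustrates the final filtering. The function and parameter names are hypothetical.

```python
def ids_to_fetch(remote_ids: set, local_ids: set, cold_index_ids: set) -> set:
    """Event IDs worth downloading: held remotely, not stored locally,
    and not already known to be rejected (present in the cold index)."""
    return remote_ids - local_ids - cold_index_ids


remote = {"a", "b", "c", "d"}
local = {"a"}
rejected = {"c"}
missing = ids_to_fetch(remote, local, rejected)  # "c" is skipped, not re-downloaded
```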
See [work/rejected-events-index-summary.md](../../work/rejected-events-index-summary.md) for complete implementation details.