| author | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 21:12:51 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 21:12:51 +0000 |
| commit | 5fed2e2f32cfb15fff042a39f3ac82abe8948ca0 (patch) | |
| tree | 9eeabc12bcadc43d18c772d9705dbf4b65d03ed2 /docs/explanation/monitoring.md | |
| parent | a68e23733e78d33ca1d48b83414a8db63ca3d5fd (diff) | |
docs: integrate rejected events index into architecture documentation
- Add rejected events index to architecture.md with two-tier system explanation
- Document NGIT_REJECTED_HOT_CACHE_DURATION_SECS and NGIT_REJECTED_COLD_INDEX_EXPIRY_SECS in configuration.md
- Add comprehensive rejected events metrics section to monitoring.md with Grafana queries and alerts
- Explain negentropy integration with rejected index in grasp-02-proactive-sync.md
- Document state event authorization defense-in-depth and rejection tracking in inline-authorization.md
This integrates information from work/rejected-events-index-summary.md into the main documentation,
ensuring architecture docs accurately reflect the implemented rejected events index system.
Diffstat (limited to 'docs/explanation/monitoring.md')
| -rw-r--r-- | docs/explanation/monitoring.md | 96 |
1 files changed, 95 insertions, 1 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md index 7520813..bd652be 100644 --- a/docs/explanation/monitoring.md +++ b/docs/explanation/monitoring.md | |||
| @@ -204,4 +204,98 @@ sum(ngit_sync_relay_status == 5) # RateLimited | |||
| 204 | 4. **Restart behavior**: Conservative initial backoff (5s + jitter) avoids thundering herd on restart | 204 | 4. **Restart behavior**: Conservative initial backoff (5s + jitter) avoids thundering herd on restart |
| 205 | 5. **Historical data**: Prometheus retains health history; in-memory state only needs current status | 205 | 5. **Historical data**: Prometheus retains health history; in-memory state only needs current status |
| 206 | 206 | ||
| 207 | See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details. \ No newline at end of file | 207 | See [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) for full architecture details. |
| 208 | |||
| 209 | ## Rejected Events Index Metrics | ||
| 210 | |||
| 211 | The rejected events index tracks rejected repository announcements and state events to prevent wasteful re-fetching during negentropy sync and enable race condition resolution. | ||
| 212 | |||
| 213 | ### Rejected Events Metrics | ||
| 214 | |||
| 215 | All metrics carry an `event_type` label whose value is either `announcement` or `state`: | ||
| 216 | |||
| 217 | | Metric | Type | Labels | Description | | ||
| 218 | |--------|------|--------|-------------| | ||
| 219 | | `ngit_rejected_hot_cache_current` | Gauge | event_type | Current number of entries in hot cache | | ||
| 220 | | `ngit_rejected_cold_index_current` | Gauge | event_type | Current number of entries in cold index | | ||
| 221 | | `ngit_rejected_hot_cache_hits` | Counter | event_type | Events successfully retrieved from hot cache for re-processing | | ||
| 222 | | `ngit_rejected_hot_cache_misses` | Counter | event_type | Events expired from hot cache before dependency arrived | | ||
| 223 | | `ngit_rejected_hot_cache_expired` | Counter | event_type | Entries cleaned up from hot cache (2 min expiry) | | ||
| 224 | | `ngit_rejected_cold_index_expired` | Counter | event_type | Entries cleaned up from cold index (7 day expiry) | | ||
| 225 | | `ngit_rejected_invalidated` | Counter | event_type | Entries invalidated when dependency was satisfied | | ||
| 226 | |||
| 227 | ### Example Grafana Queries | ||
| 228 | |||
| 229 | ```promql | ||
| 230 | # Hot cache efficiency - how often we successfully re-process from cache | ||
| 231 | rate(ngit_rejected_hot_cache_hits_total[5m]) | ||
| 232 | / (rate(ngit_rejected_hot_cache_hits_total[5m]) + rate(ngit_rejected_hot_cache_misses_total[5m])) | ||
| 233 | |||
| 234 | # Current rejected events by type | ||
| 235 | ngit_rejected_hot_cache_current{event_type="announcement"} | ||
| 236 | ngit_rejected_hot_cache_current{event_type="state"} | ||
| 237 | ngit_rejected_cold_index_current{event_type="announcement"} | ||
| 238 | ngit_rejected_cold_index_current{event_type="state"} | ||
| 239 | |||
| 240 | # Race condition resolution rate - invalidations indicate successful dependency arrival | ||
| 241 | rate(ngit_rejected_invalidated_total[5m]) | ||
| 242 | |||
| 243 | # Overall cache hit ratio, aggregated across event types (higher is better: dependencies are arriving quickly) | ||
| 244 | sum(rate(ngit_rejected_hot_cache_hits_total[5m])) | ||
| 245 | / sum(rate(ngit_rejected_hot_cache_hits_total[5m]) + rate(ngit_rejected_hot_cache_misses_total[5m])) | ||
| 246 | ``` | ||
| 247 | |||
| 248 | ### Example Alerts | ||
| 249 | |||
| 250 | ```yaml | ||
| 251 | # Alert if the hot cache miss ratio stays above 80% (suggests timing issues) | ||
| 252 | - alert: RejectedEventsCacheMissRate | ||
| 253 | expr: | | ||
| 254 | sum(rate(ngit_rejected_hot_cache_misses_total[5m])) | ||
| 255 | / sum(rate(ngit_rejected_hot_cache_hits_total[5m]) + rate(ngit_rejected_hot_cache_misses_total[5m])) | ||
| 256 | > 0.8 | ||
| 257 | for: 15m | ||
| 258 | labels: | ||
| 259 | severity: warning | ||
| 260 | annotations: | ||
| 261 | summary: "High rejected events cache miss rate ({{ $value | humanizePercentage }})" | ||
| 262 | description: "Most rejected events are expiring before dependencies arrive" | ||
| 263 | |||
| 264 | # Alert if cold index growing too large | ||
| 265 | - alert: RejectedEventsColdIndexSize | ||
| 266 | expr: ngit_rejected_cold_index_current > 10000 | ||
| 267 | for: 1h | ||
| 268 | labels: | ||
| 269 | severity: info | ||
| 270 | annotations: | ||
| 271 | summary: "Rejected events cold index has {{ $value }} entries" | ||
| 272 | description: "Consider investigating why many events are being rejected" | ||
| 273 | ``` | ||
| 274 | |||
| 275 | ### Two-Tier Architecture | ||
| 276 | |||
| 277 | **Hot Cache (2 minutes):** | ||
| 278 | - Stores full event objects | ||
| 279 | - Enables immediate re-processing when dependencies arrive | ||
| 280 | - Cleaned up every 60 seconds | ||
| 281 | - Memory: ~200 KB typical, ~20 MB worst case | ||
| 282 | |||
| 283 | **Cold Index (7 days):** | ||
| 284 | - Stores metadata only (event ID, pubkey, identifier, reason) | ||
| 285 | - Prevents re-downloading during negentropy sync | ||
| 286 | - Cleaned up daily | ||
| 287 | - Memory: ~1 MB typical | ||
| 288 | |||
| 289 | ### Use Cases | ||
| 290 | |||
| 291 | **Race Condition Resolution:** | ||
| 292 | When a maintainer announcement arrives before the owner announcement: | ||
| 293 | 1. Maintainer event rejected → hot cache + cold index | ||
| 294 | 2. Owner announcement accepted → invalidate from cold index | ||
| 295 | 3. If still in hot cache → immediate re-processing (<1 second) | ||
| 296 | 4. If expired from hot cache → will be re-fetched on next sync | ||
| 297 | |||
| 298 | **Negentropy Sync Efficiency:** | ||
| 299 | During sync, cold index IDs are excluded from "missing events" calculation, preventing wasteful re-download of events that will be rejected again. | ||
| 300 | |||
| 301 | See [work/rejected-events-index-summary.md](../../work/rejected-events-index-summary.md) for complete implementation details. \ No newline at end of file | ||
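
The two-tier behaviour documented in the added section (a hot cache holding full events, a cold index holding metadata only, invalidation when a dependency arrives, and exclusion of cold IDs during sync) can be illustrated with a minimal Python sketch. This is not the project's implementation: the class `RejectedIndex`, its method names, the dict-based event shape, and keying invalidation by event ID are all assumptions made for illustration. Only the 2 minute and 7 day expiries are taken from the documentation above.

```python
from dataclasses import dataclass, field

HOT_TTL_SECS = 120               # hot cache expiry (2 minutes, per the doc)
COLD_TTL_SECS = 7 * 24 * 3600    # cold index expiry (7 days, per the doc)


@dataclass
class RejectedIndex:
    """Hypothetical two-tier index of rejected events."""
    hot: dict = field(default_factory=dict)    # event_id -> (full event, inserted_at)
    cold: dict = field(default_factory=dict)   # event_id -> (metadata, inserted_at)
    hits: int = 0
    misses: int = 0

    def reject(self, event, reason, now):
        """Record a rejected event in both tiers."""
        self.hot[event["id"]] = (event, now)
        meta = {
            "pubkey": event["pubkey"],
            "identifier": event["identifier"],
            "reason": reason,
        }
        self.cold[event["id"]] = (meta, now)

    def dependency_satisfied(self, event_id, now):
        """Invalidate the cold entry; return the full event if it is still hot."""
        was_rejected = self.cold.pop(event_id, None) is not None
        entry = self.hot.pop(event_id, None)
        if entry is not None and now - entry[1] <= HOT_TTL_SECS:
            self.hits += 1       # can be re-processed immediately
            return entry[0]
        if was_rejected:
            self.misses += 1     # expired from hot cache; needs a re-fetch on next sync
        return None

    def cleanup(self, now):
        """Drop entries older than each tier's expiry."""
        self.hot = {k: v for k, v in self.hot.items() if now - v[1] <= HOT_TTL_SECS}
        self.cold = {k: v for k, v in self.cold.items() if now - v[1] <= COLD_TTL_SECS}

    def missing_for_sync(self, remote_ids):
        """Exclude IDs in the cold index from the set of events to download."""
        return [i for i in remote_ids if i not in self.cold]
```

In this sketch, `hits` and `misses` play the role of the `ngit_rejected_hot_cache_hits` and `ngit_rejected_hot_cache_misses` counters: a hit means the rejected event was still in the hot cache when its dependency arrived, and a miss means it had already expired and must wait for the next sync.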