| author | DanConwayDev <DanConwayDev@protonmail.com> | 2025-12-04 15:17:04 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2025-12-04 15:24:19 +0000 |
| commit | fd0c87c787d0626b3546fa571541c9c809711821 (patch) | |
| tree | 934f20d973127f380b807d2bd44b25c197cf349c /docs/explanation/monitoring.md | |
| parent | 762cd8e815e797f173f541795de774fbbf978fc3 (diff) | |
add prometheus metrics
Diffstat (limited to 'docs/explanation/monitoring.md')
| -rw-r--r-- | docs/explanation/monitoring.md | 99 |
1 files changed, 99 insertions, 0 deletions
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
new file mode 100644
index 0000000..3b1b1ac
--- /dev/null
+++ b/docs/explanation/monitoring.md
| @@ -0,0 +1,99 @@ | |||
| 1 | # Monitoring | ||
| 2 | |||
| 3 | ngit-grasp exposes Prometheus metrics at `/metrics` for monitoring WebSocket connections, Git operations, Nostr events, and system health. | ||
| 4 | |||
| 5 | ## Architecture | ||
| 6 | |||
| 7 | ```mermaid | ||
| 8 | flowchart TB | ||
| 9 | subgraph ngit-grasp | ||
| 10 | HTTP[HTTP Service] | ||
| 11 | WS[WebSocket Handler] | ||
| 12 | GIT[Git Handlers] | ||
| 13 | RELAY[Nostr Relay] | ||
| 14 | |||
| 15 | subgraph Metrics Module | ||
| 16 | REG[Prometheus Registry] | ||
| 17 | CT[ConnectionTracker] | ||
| 18 | MC[Metric Counters] | ||
| 19 | end | ||
| 20 | |||
| 21 | ME["/metrics endpoint"] | ||
| 22 | end | ||
| 23 | |||
| 24 | subgraph External | ||
| 25 | PROM[Prometheus Server] | ||
| 26 | GRAF[Grafana] | ||
| 27 | ADMIN[Admin Browser] | ||
| 28 | end | ||
| 29 | |||
| 30 | HTTP --> ME | ||
| 31 | WS --> CT | ||
| 32 | WS --> MC | ||
| 33 | GIT --> MC | ||
| 34 | RELAY --> MC | ||
| 35 | |||
| 36 | CT --> REG | ||
| 37 | MC --> REG | ||
| 38 | REG --> ME | ||
| 39 | |||
| 40 | PROM -->|scrape /metrics| ME | ||
| 41 | GRAF -->|query| PROM | ||
| 42 | ADMIN -->|view dashboards| GRAF | ||
| 43 | ``` | ||
| 44 | |||
| 45 | ## Configuration | ||
| 46 | |||
| 47 | | Option | CLI Flag | Environment Variable | Default | Description | | ||
| 48 | |--------|----------|---------------------|---------|-------------| | ||
| 49 | | Metrics enabled | `--metrics-enabled` | `NGIT_METRICS_ENABLED` | `true` | Enable /metrics endpoint | | ||
| 50 | | Abuse threshold | `--abuse-threshold` | `NGIT_ABUSE_THRESHOLD` | `10` | Max connections per IP before flagging | | ||
| 51 | | Top N repos | `--top-n-repos` | `NGIT_TOP_N_REPOS` | `10` | Number of top bandwidth repos to track | | ||
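To make the defaults in the table concrete, here is a minimal sketch of resolving the three settings from environment variables. It is written in Python for brevity and is not ngit-grasp's actual parser: the `MetricsConfig` type, the `config_from_env` helper, and the accepted boolean spellings are assumptions made for illustration. Only the variable names and default values come from the table.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MetricsConfig:
    """Hypothetical container; defaults mirror the table above."""
    metrics_enabled: bool = True
    abuse_threshold: int = 10
    top_n_repos: int = 10


def _parse_bool(raw: str) -> bool:
    # Assumption: these spellings are illustrative, not the real parser's rules.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def config_from_env(env: Mapping[str, str]) -> MetricsConfig:
    """Build a config from environment variables, falling back to the defaults."""
    defaults = MetricsConfig()
    enabled = env.get("NGIT_METRICS_ENABLED")
    threshold = env.get("NGIT_ABUSE_THRESHOLD")
    top_n = env.get("NGIT_TOP_N_REPOS")
    return MetricsConfig(
        metrics_enabled=defaults.metrics_enabled if enabled is None else _parse_bool(enabled),
        abuse_threshold=defaults.abuse_threshold if threshold is None else int(threshold),
        top_n_repos=defaults.top_n_repos if top_n is None else int(top_n),
    )
```

Calling `config_from_env(os.environ)` would read the live environment. How CLI flags and environment variables take precedence over each other is not stated here, so the sketch leaves flags out.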
| 52 | |||
| 53 | ## Privacy Model | ||
| 54 | |||
| 55 | IP addresses are **never exposed in Prometheus metrics**. The connection tracker maintains per-IP counts internally only for abuse detection: | ||
| 56 | |||
| 57 | | Data | Exposed in Metrics? | | ||
| 58 | |------|---------------------| | ||
| 59 | | Total connections | ✅ Yes | | ||
| 60 | | Unique IP count | ✅ Yes | | ||
| 61 | | Flagged abuser count | ✅ Yes | | ||
| 62 | | Actual IP addresses | ❌ No (internal only) | | ||
| 63 | | IP + abuse flag | ⚠️ Logs only (when flagged) | | ||
| 64 | |||
| 65 | When an IP exceeds the abuse threshold, a warning is logged but the IP is never exposed via Prometheus. | ||
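The privacy model can be illustrated with a small sketch: per-IP counts live only inside the tracker, the exported snapshot contains aggregates, and the IP appears only in a log warning once it exceeds the threshold. This is a Python illustration, not the project's `ConnectionTracker`; the method names, metric keys, and logger name are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("connection_tracker")  # hypothetical logger name


class ConnectionTracker:
    """Illustrative tracker: per-IP state stays internal, only aggregates leave."""

    def __init__(self, abuse_threshold: int = 10) -> None:
        self._abuse_threshold = abuse_threshold
        self._per_ip = Counter()  # internal only, never exported

    def connect(self, ip: str) -> None:
        self._per_ip[ip] += 1
        # Log once, at the moment the IP first exceeds the threshold.
        if self._per_ip[ip] == self._abuse_threshold + 1:
            logger.warning("abuse threshold exceeded by %s", ip)

    def disconnect(self, ip: str) -> None:
        if self._per_ip[ip] <= 1:
            self._per_ip.pop(ip, None)  # drop idle IPs so unique count stays accurate
        else:
            self._per_ip[ip] -= 1

    def exported_metrics(self) -> dict:
        """Aggregates only; the keys are assumed names, not real metric names."""
        counts = self._per_ip.values()
        return {
            "total_connections": sum(counts),
            "unique_ips": len(self._per_ip),
            "flagged_abusers": sum(1 for n in counts if n > self._abuse_threshold),
        }
```

The design point is that `exported_metrics` has no code path that can return an address, so exposing it at `/metrics` cannot leak one.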
| 66 | |||
| 67 | ## Deployment | ||
| 68 | |||
| 69 | See [Prometheus Setup Guide](../how-to/prometheus-setup.md) for NixOS configuration and Grafana dashboard provisioning. | ||
| 70 | |||
| 71 | ## Future: Load-Based Sync Scheduling (GRASP-02) | ||
| 72 | |||
| 73 | The metrics infrastructure enables future load-based scheduling for GRASP-02 sync jobs: | ||
| 74 | |||
| 75 | ```mermaid | ||
| 76 | flowchart TD | ||
| 77 | SYNC[Sync Manager] --> CHECK{Check Load} | ||
| 78 | CHECK --> MET[Query Metrics] | ||
| 79 | MET --> CONN{Connections > N?} | ||
| 80 | CONN -->|Yes| DELAY[Delay 5 min] | ||
| 81 | CONN -->|No| RUN[Run Sync Job] | ||
| 82 | DELAY --> CHECK | ||
| 83 | ``` | ||
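The flowchart above can be sketched as a simple loop. Since this feature is not implemented yet, everything here is an assumption: the `run_when_idle` helper, its parameters, and the idea that the connection count is available through a callable. Only the 5-minute delay and the "connections > N" check come from the diagram.

```python
import time
from typing import Callable

DELAY_SECONDS = 5 * 60  # "Delay 5 min" from the flowchart


def run_when_idle(
    current_connections: Callable[[], int],
    run_sync_job: Callable[[], None],
    max_connections: int,
    sleep: Callable[[float], None] = time.sleep,
    delay_seconds: float = DELAY_SECONDS,
) -> int:
    """Delay the sync job while load is above the limit; return the number of delays."""
    delays = 0
    # Re-check the load after every delay, as the DELAY --> CHECK edge indicates.
    while current_connections() > max_connections:
        sleep(delay_seconds)
        delays += 1
    run_sync_job()
    return delays
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time. A production version would likely also want an upper bound on the total delay, which the diagram does not specify.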
| 84 | |||
| 85 | ## Future: Loki for Detailed Logging | ||
| 86 | |||
| 87 | For detailed per-repository investigation at scale, consider adding **Loki** (log aggregation): | ||
| 88 | |||
| 89 | - Structured logging with the `tracing` crate is already in place | ||
| 90 | - Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB) | ||
| 91 | - Pairs with Prometheus for long-term trends | ||
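As a rough picture of what such an ad-hoc query computes, the sketch below filters JSON-formatted log lines for transfers above 10 MB. It is plain Python rather than a Loki query, and the `bytes_transferred` field name and the binary interpretation of "10MB" are assumptions. The actual log schema is not described here.

```python
import json
from typing import Iterable, List

TEN_MB = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB


def large_transfers(log_lines: Iterable[str], min_bytes: int = TEN_MB) -> List[dict]:
    """Return parsed log records whose transfer size exceeds min_bytes."""
    hits = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(record, dict):
            continue
        size = record.get("bytes_transferred")  # hypothetical field name
        if isinstance(size, int) and size > min_bytes:
            hits.append(record)
    return hits
```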
| 92 | |||
| 93 | ## Future: Sync Metrics (GRASP-02) | ||
| 94 | |||
| 95 | When GRASP-02 proactive sync is implemented, additional metrics will track: | ||
| 96 | |||
| 97 | - Events received from sync (live vs catchup) | ||
| 98 | - Active outbound relay connections | ||
| 99 | - Catchup gap (events found during catchup indicating sync failures) \ No newline at end of file | ||