upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

author: DanConwayDev <DanConwayDev@protonmail.com> 2025-12-04 15:17:04 +0000
committer: DanConwayDev <DanConwayDev@protonmail.com> 2025-12-04 15:24:19 +0000
commit: fd0c87c787d0626b3546fa571541c9c809711821
tree: 934f20d973127f380b807d2bd44b25c197cf349c /docs
parent: 762cd8e815e797f173f541795de774fbbf978fc3
add prometheus metrics
Diffstat (limited to 'docs')
-rw-r--r-- docs/explanation/monitoring-strategy.md  462
-rw-r--r-- docs/explanation/monitoring.md            99
-rw-r--r-- docs/grafana/ngit-grasp-dashboard.json   675
-rw-r--r-- docs/how-to/prometheus-setup.md          178
4 files changed, 952 insertions, 462 deletions
diff --git a/docs/explanation/monitoring-strategy.md b/docs/explanation/monitoring-strategy.md
deleted file mode 100644
index 4668305..0000000
--- a/docs/explanation/monitoring-strategy.md
+++ /dev/null
@@ -1,462 +0,0 @@
1# Monitoring Strategy - Design Document
2
3## Overview
4
5This document describes the logging and monitoring strategy for ngit-grasp, designed to help administrators:
6
71. Monitor WebSocket connections per unique IP
82. Correlate resource spikes (memory, CPU) with usage patterns
93. Detect potential abuse (too many connections from single IP)
104. Support future load-based scheduling of background jobs (GRASP-02 sync)
11
12## Architecture
13
14```mermaid
15flowchart TB
16 subgraph ngit-grasp
17 HTTP[HTTP Service]
18 WS[WebSocket Handler]
19 GIT[Git Handlers]
20 RELAY[Nostr Relay]
21
22 subgraph Metrics Module
23 REG[Prometheus Registry]
24 CT[ConnectionTracker]
25 MC[Metric Counters]
26 end
27
28 ME[/metrics endpoint]
29 end
30
31 subgraph External
32 PROM[Prometheus Server]
33 GRAF[Grafana]
34 ADMIN[Admin Browser]
35 end
36
37 HTTP --> ME
38 WS --> CT
39 WS --> MC
40 GIT --> MC
41 RELAY --> MC
42
43 CT --> REG
44 MC --> REG
45 REG --> ME
46
47 PROM -->|scrape /metrics| ME
48 GRAF -->|query| PROM
49 ADMIN -->|view dashboards| GRAF
50```
51
52## Metric Categories
53
54### 1. WebSocket Connection Metrics
55
56| Metric Name | Type | Labels | Description |
57|------------|------|--------|-------------|
58| `ngit_websocket_connections_total` | Counter | - | Total WebSocket connections since startup |
59| `ngit_websocket_connections_active` | Gauge | - | Current active WebSocket connections |
60| `ngit_websocket_unique_ips` | Gauge | - | Number of unique IP addresses connected (NOT the IPs themselves) |
61| `ngit_websocket_flagged_abusers` | Gauge | - | Number of IPs exceeding connection threshold |
62| `ngit_websocket_connection_duration_seconds` | Histogram | - | Duration of WebSocket connections |
63| `ngit_websocket_messages_received_total` | Counter | `type` | Messages received (REQ, EVENT, CLOSE) |
64| `ngit_websocket_messages_sent_total` | Counter | `type` | Messages sent (EVENT, EOSE, OK, NOTICE) |
65
66**Privacy Note:** IP addresses are NEVER exposed in metrics. The `ConnectionTracker` maintains per-IP counts internally only for abuse detection, logging warnings when thresholds are exceeded.
67
68### 2. Git Operation Metrics
69
70| Metric Name | Type | Labels | Description |
71|------------|------|--------|-------------|
72| `ngit_git_operations_total` | Counter | `operation`, `status` | Git operations (clone, fetch, push) |
73| `ngit_git_operation_duration_seconds` | Histogram | `operation` | Duration of git operations |
74| `ngit_git_bytes_total` | Counter | `direction` | Total bytes in/out for git operations |
75| `ngit_git_push_authorization_total` | Counter | `result` | Push auth results (allowed, denied, error) |
76
77### 3. Top-N Repository Bandwidth Tracking
78
To identify high-bandwidth repositories without a label-cardinality explosion (one time series per repository does not scale to 1000+ repos), we use a hybrid approach:
80
81| Metric Name | Type | Labels | Description |
82|------------|------|--------|-------------|
83| `ngit_git_top_repos_bytes` | Gauge | `repo` | Top 10 repositories by bandwidth (refreshed every 60s) |
84
85**How it works:**
- All per-repo bandwidth is tracked internally in a `DashMap<String, u64>` keyed by repository ID
87- Every 60 seconds, the top 10 are calculated and exposed to Prometheus
88- Previous repo labels are cleared before setting new ones
89- Prometheus only ever sees ~10 label values, keeping cardinality low
90
```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

use dashmap::DashMap;
use prometheus::IntGaugeVec;

struct BandwidthTracker {
    // Internal: tracks ALL repos (memory only, not exposed)
    all_repos: DashMap<String, u64>,

    // Exposed to Prometheus: only top 10 (integer gauge, since values are byte counts)
    top_repos_gauge: IntGaugeVec,

    // Time of the last refresh; behind a Mutex so it can be updated through &self
    last_refresh: Mutex<Instant>,
}

impl BandwidthTracker {
    fn record_transfer(&self, repo_id: &str, bytes: u64) {
        self.all_repos
            .entry(repo_id.to_string())
            .and_modify(|v| *v += bytes)
            .or_insert(bytes);
    }

    fn maybe_refresh_top_n(&self) {
        let mut last_refresh = self.last_refresh.lock().unwrap();
        if last_refresh.elapsed() > Duration::from_secs(60) {
            self.refresh_top_n();
            // Restart the 60s window; without this every later call would refresh
            *last_refresh = Instant::now();
        }
    }

    fn refresh_top_n(&self) {
        let mut sorted: Vec<_> = self.all_repos.iter()
            .map(|r| (r.key().clone(), *r.value()))
            .collect();
        sorted.sort_by(|a, b| b.1.cmp(&a.1));

        // Clear old labels, set new top 10
        self.top_repos_gauge.reset();
        for (repo, bytes) in sorted.into_iter().take(10) {
            self.top_repos_gauge
                .with_label_values(&[repo.as_str()])
                .set(bytes as i64);
        }
    }
}
```
133
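The top-N selection step can also be sketched with the standard library alone, which makes it easy to test in isolation. The function name `top_n` and the tie-break by repository name are assumptions made for this illustration; they are not taken from the ngit-grasp source.

```rust
use std::collections::HashMap;

/// Returns the `n` largest entries by byte count, largest first.
/// Hypothetical helper: ties are broken by name so the result is deterministic.
pub fn top_n(totals: &HashMap<String, u64>, n: usize) -> Vec<(String, u64)> {
    let mut sorted: Vec<(String, u64)> = totals
        .iter()
        .map(|(repo, bytes)| (repo.clone(), *bytes))
        .collect();
    // Descending by bytes, then ascending by repo name.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    // Only these entries would ever become Prometheus label values.
    sorted.truncate(n);
    sorted
}
```

Sorting the whole map is O(m log m) in the number of repositories; at a 60-second refresh interval that is likely acceptable, though a bounded heap would avoid the full sort if the map grows very large.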
134### 4. Nostr Event Metrics
135
136| Metric Name | Type | Labels | Description |
137|------------|------|--------|-------------|
138| `ngit_events_received_total` | Counter | `kind` | Events received by kind |
139| `ngit_events_stored_total` | Counter | `kind` | Events successfully stored |
140| `ngit_events_rejected_total` | Counter | `kind`, `reason` | Events rejected and why |
141
142### 5. Repository Metrics
143
144| Metric Name | Type | Labels | Description |
145|------------|------|--------|-------------|
146| `ngit_repositories_total` | Gauge | - | Total repositories hosted |
147
148### 6. System Health Metrics
149
150| Metric Name | Type | Labels | Description |
151|------------|------|--------|-------------|
152| `ngit_uptime_seconds` | Counter | - | Seconds since startup |
153| `ngit_build_info` | Gauge | `version`, `commit` | Build information |
154
155### 7. Future: Sync Metrics (GRASP-02)
156
157| Metric Name | Type | Labels | Description |
158|------------|------|--------|-------------|
159| `ngit_sync_events_received_total` | Counter | `source` | Events from sync (live vs catchup) |
160| `ngit_sync_relay_connections_active` | Gauge | - | Active outbound relay connections |
161| `ngit_sync_catchup_gap_total` | Counter | - | Events found during catchup (sync failures) |
162
163## Connection Tracker Design
164
165The `ConnectionTracker` maintains per-IP connection counts internally for abuse detection. **IP addresses are never exposed in metrics** - only aggregate counts.
166
167```mermaid
168flowchart LR
169 subgraph ConnectionTracker
170 HM[Internal: HashMap IP to Count]
171 TH[Abuse Threshold]
172 CNT[Exposed: Unique IP Count]
173 FLAG[Exposed: Abuse Flag Count]
174 end
175
176 CONN[New Connection] --> CHECK{Count >= Threshold?}
177 CHECK -->|No| INC[Increment Count]
178 CHECK -->|Yes| FLAG_IT[Flag as Abuse]
179 FLAG_IT --> LOG[Log Warning - IP in log only]
180 FLAG_IT --> FLAG
181
182 DISC[Disconnection] --> DEC[Decrement Count]
183 DEC --> CLEAN{Count == 0?}
184 CLEAN -->|Yes| RM[Remove from Map]
185
186 HM --> CNT
187```
188
189### Data Structure
190
```rust
use std::net::IpAddr;
use std::time::Instant;

use dashmap::DashMap;
use prometheus::IntGauge;

pub struct ConnectionTracker {
    /// Active connections per IP (INTERNAL ONLY - never exposed to metrics)
    connections: DashMap<IpAddr, ConnectionInfo>,
    /// Threshold for abuse flagging
    abuse_threshold: u32,
    /// Prometheus gauges (aggregate counts only, no IPs)
    active_connections: IntGauge, // Total connections
    unique_ips: IntGauge,         // connections.len()
    flagged_abusers: IntGauge,    // Count where flagged_as_abuse == true
}

struct ConnectionInfo {
    count: u32,
    first_seen: Instant,
    flagged_as_abuse: bool,
}
```
209
210### What Gets Exposed vs Internal
211
212| Data | Location | Exposed? |
213|------|----------|----------|
214| Total connections | Prometheus | ✅ Yes |
215| Unique IP count | Prometheus | ✅ Yes |
216| Flagged abuser count | Prometheus | ✅ Yes |
217| Actual IP addresses | Internal HashMap | ❌ No |
218| IP + abuse flag | Logs (when flagged) | ⚠️ Logs only |
219
220### Thread Safety
221
`DashMap` is used for concurrent access because connection tracking happens across multiple tokio tasks. It is not lock-free: it shards the map and takes a lock per shard, so unrelated keys rarely contend.
223
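The connect/disconnect flow in the diagram above can be sketched with the standard library only. This is an illustration under assumptions, not the ngit-grasp implementation: `SketchTracker` and `Snapshot` are hypothetical names, a `Mutex<HashMap>` stands in for `DashMap`, and an IP is flagged once its count goes above the threshold.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::Mutex;

/// Per-IP state, kept in memory only.
struct ConnectionInfo {
    count: u32,
    flagged_as_abuse: bool,
}

/// Aggregate view that is safe to export: it contains no IP addresses.
#[derive(Debug, PartialEq)]
pub struct Snapshot {
    pub active_connections: u64,
    pub unique_ips: usize,
    pub flagged_abusers: usize,
}

/// Hypothetical stand-in for `ConnectionTracker`.
pub struct SketchTracker {
    connections: Mutex<HashMap<IpAddr, ConnectionInfo>>,
    abuse_threshold: u32,
}

impl SketchTracker {
    pub fn new(abuse_threshold: u32) -> Self {
        Self { connections: Mutex::new(HashMap::new()), abuse_threshold }
    }

    /// Records a connection and returns true if the IP is flagged.
    pub fn on_connect(&self, ip: IpAddr) -> bool {
        let mut map = self.connections.lock().unwrap();
        let info = map
            .entry(ip)
            .or_insert(ConnectionInfo { count: 0, flagged_as_abuse: false });
        info.count += 1;
        if info.count > self.abuse_threshold {
            // A real tracker would log a warning here; the IP goes to the log only.
            info.flagged_as_abuse = true;
        }
        info.flagged_as_abuse
    }

    /// Records a disconnection and drops the entry once its count reaches zero.
    pub fn on_disconnect(&self, ip: IpAddr) {
        let mut map = self.connections.lock().unwrap();
        let now_empty = match map.get_mut(&ip) {
            Some(info) => {
                info.count = info.count.saturating_sub(1);
                info.count == 0
            }
            None => false,
        };
        if now_empty {
            map.remove(&ip);
        }
    }

    /// Computes the three aggregate values that would back the gauges.
    pub fn snapshot(&self) -> Snapshot {
        let map = self.connections.lock().unwrap();
        Snapshot {
            active_connections: map.values().map(|i| u64::from(i.count)).sum(),
            unique_ips: map.len(),
            flagged_abusers: map.values().filter(|i| i.flagged_as_abuse).count(),
        }
    }
}
```

Only `Snapshot` crosses the boundary to the metrics layer, which is what keeps IP addresses out of Prometheus.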
224## /metrics Endpoint
225
226The `/metrics` endpoint returns Prometheus text format:
227
```
# HELP ngit_websocket_connections_active Current active WebSocket connections
# TYPE ngit_websocket_connections_active gauge
ngit_websocket_connections_active 23

# HELP ngit_websocket_unique_ips Number of unique IP addresses connected
# TYPE ngit_websocket_unique_ips gauge
ngit_websocket_unique_ips 2

# HELP ngit_git_operations_total Git operations by type and status
# TYPE ngit_git_operations_total counter
ngit_git_operations_total{operation="clone",status="success"} 1247
ngit_git_operations_total{operation="push",status="denied"} 12
```
243
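To make the text format concrete, here is a minimal sketch that renders a gauge and a single-label counter by hand. In practice the `prometheus` crate's text encoder would do this; the function names below are hypothetical and exist only for illustration.

```rust
/// Escapes a label value for the Prometheus text format:
/// backslash, double quote and newline are backslash-escaped.
pub fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Renders one unlabelled gauge: HELP line, TYPE line, then the sample.
/// Note: HELP text is written as-is; this sketch does not escape it.
pub fn render_gauge(name: &str, help: &str, value: i64) -> String {
    format!("# HELP {0} {1}\n# TYPE {0} gauge\n{0} {2}\n", name, help, value)
}

/// Renders a counter with a single label, one sample line per label value.
pub fn render_counter(name: &str, help: &str, label: &str, samples: &[(&str, u64)]) -> String {
    let mut out = format!("# HELP {0} {1}\n# TYPE {0} counter\n", name, help);
    for (label_value, count) in samples {
        out.push_str(&format!(
            "{}{{{}=\"{}\"}} {}\n",
            name,
            label,
            escape_label_value(label_value),
            count
        ));
    }
    out
}
```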
244## Integration Points
245
246### HTTP Service Integration
247
248In [`src/http/mod.rs`](../../src/http/mod.rs):
249
250```rust
251// Add to HttpService
252struct HttpService {
253 // ... existing fields ...
254 metrics: Arc<Metrics>,
255}
256
257// Add /metrics route handling
258if path == "/metrics" {
259 let metrics_output = self.metrics.render();
260 return Ok(Response::builder()
261 .status(200)
262 .header("content-type", "text/plain; version=0.0.4")
263 .body(Full::new(Bytes::from(metrics_output)))
264 .unwrap());
265}
266```
267
268### WebSocket Connection Tracking
269
270In the WebSocket upgrade handler:
271
272```rust
273// On connection
274let ip = addr.ip();
275metrics.connection_tracker.on_connect(ip);
276
277// Spawn connection handler
278tokio::spawn(async move {
279 // ... handle connection ...
280 // On disconnect
281 metrics.connection_tracker.on_disconnect(ip);
282});
283```
284
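One caveat with calling `on_disconnect` at the end of the task: if the handler returns early or panics, that line never runs and the count drifts upward. A common Rust remedy is an RAII guard whose `Drop` does the decrement. The sketch below uses a plain atomic counter as a stand-in for the tracker; `ActiveCounter` and `ConnectionGuard` are hypothetical names, not part of ngit-grasp.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Stand-in for the tracker: only counts active connections.
#[derive(Default)]
pub struct ActiveCounter {
    active: AtomicI64,
}

impl ActiveCounter {
    pub fn active(&self) -> i64 {
        self.active.load(Ordering::SeqCst)
    }
}

/// Increments the counter when created and decrements it when dropped,
/// including on early return and during panic unwinding.
pub struct ConnectionGuard {
    counter: Arc<ActiveCounter>,
}

impl ConnectionGuard {
    pub fn new(counter: Arc<ActiveCounter>) -> Self {
        counter.active.fetch_add(1, Ordering::SeqCst);
        Self { counter }
    }
}

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        self.counter.active.fetch_sub(1, Ordering::SeqCst);
    }
}
```

The guard would be created before the task is spawned and moved into it, so the decrement happens whenever the task ends.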
285### Git Handler Integration
286
287In [`src/git/handlers.rs`](../../src/git/handlers.rs):
288
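```rust
// Wrap git operations with metrics.
// The duration histogram carries an `operation` label, so select it first.
let timer = metrics
    .git_operation_duration
    .with_label_values(&["clone"])
    .start_timer();
let result = git::handlers::handle_upload_pack(repo_path, body_bytes).await;
timer.observe_duration();

// Derive the `status` label from the outcome (assumes the handler returns a `Result`)
let result_status = if result.is_ok() { "success" } else { "error" };
metrics.git_operations_total
    .with_label_values(&["clone", result_status])
    .inc();
```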
299
300## Configuration
301
302New configuration options in [`src/config.rs`](../../src/config.rs):
303
304| Option | CLI Flag | Environment Variable | Default | Description |
305|--------|----------|---------------------|---------|-------------|
306| Metrics enabled | `--metrics-enabled` | `NGIT_METRICS_ENABLED` | `true` | Enable /metrics endpoint |
307| Abuse threshold | `--abuse-threshold` | `NGIT_ABUSE_THRESHOLD` | `10` | Max connections per IP before flagging |
308| Metrics path | `--metrics-path` | `NGIT_METRICS_PATH` | `/metrics` | Path for metrics endpoint |
309
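As an illustration of how the environment variables and defaults in the table could be resolved, here is a small sketch. `MetricsConfig` and `from_lookup` are hypothetical names; the actual parsing in `src/config.rs` (including the CLI flags) is not shown in this document and may differ.

```rust
/// Hypothetical config struct mirroring the three options in the table.
#[derive(Debug, PartialEq)]
pub struct MetricsConfig {
    pub enabled: bool,
    pub abuse_threshold: u32,
    pub metrics_path: String,
}

impl MetricsConfig {
    /// Builds the config from a lookup function, falling back to the
    /// documented default when a variable is unset or fails to parse.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        Self {
            enabled: lookup("NGIT_METRICS_ENABLED")
                .and_then(|v| v.parse::<bool>().ok())
                .unwrap_or(true),
            abuse_threshold: lookup("NGIT_ABUSE_THRESHOLD")
                .and_then(|v| v.parse::<u32>().ok())
                .unwrap_or(10),
            metrics_path: lookup("NGIT_METRICS_PATH")
                .unwrap_or_else(|| "/metrics".to_string()),
        }
    }

    /// Reads the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Taking a lookup closure rather than reading `std::env` directly keeps the defaults testable without mutating the process environment.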
310## Crate Dependencies
311
312Add to `Cargo.toml`:
313
```toml
# Metrics
prometheus = "0.13"
dashmap = "5"       # Concurrent HashMap (sharded locking)
lazy_static = "1.4" # For static metric registration
```
320
321## Module Structure
322
323```
324src/
325├── metrics/
326│ ├── mod.rs # Module exports, Metrics struct
327│ ├── connection.rs # ConnectionTracker implementation
328│ ├── definitions.rs # Metric definitions (lazy_static!)
329│ └── render.rs # Prometheus format rendering
330├── http/
331│ └── mod.rs # Add /metrics route
332└── ...
333```
334
335## Grafana Dashboard
336
A pre-built Grafana dashboard will be provided at `docs/grafana/ngit-grasp-dashboard.json` with panels for:

1. **Overview Row**
   - Active connections (gauge)
   - Requests per second (graph)
   - Git operations per minute (graph)

2. **Connections Row**
   - Active connections over time
   - Unique IP count over time
   - Flagged abuser count (aggregate only; IPs are never shown)

3. **Git Operations Row**
   - Clone/fetch/push rates
   - Push authorization results (pie chart)
   - Operation duration percentiles

4. **Events Row**
   - Events received by kind
   - Events rejected by reason
   - Active subscriptions
358
359## Deployment: Prometheus on NixOS
360
361Example NixOS configuration for Prometheus:
362
363```nix
364services.prometheus = {
365 enable = true;
366 scrapeConfigs = [
367 {
368 job_name = "ngit-grasp";
369 static_configs = [{
370 targets = [ "localhost:8080" ]; # ngit-grasp bind address
371 }];
372 scrape_interval = "15s";
373 metrics_path = "/metrics";
374 }
375 ];
376};
377
378services.grafana = {
379 enable = true;
380 settings.server.http_port = 3000;
381 provision.datasources.settings.datasources = [{
382 name = "Prometheus";
383 type = "prometheus";
384 url = "http://localhost:9090";
385 }];
386};
387```
388
389## Future: Load-Based Sync Scheduling
390
391The metrics infrastructure enables future load-based scheduling for GRASP-02 sync jobs:
392
393```mermaid
394flowchart TD
395 SYNC[Sync Manager] --> CHECK{Check Load}
396 CHECK --> MET[Query Metrics]
397 MET --> CPU{CPU > 80%?}
398 CPU -->|Yes| DELAY[Delay 5 min]
399 CPU -->|No| CONN{Connections > N?}
400 CONN -->|Yes| DELAY
401 CONN -->|No| RUN[Run Sync Job]
402 DELAY --> CHECK
403```
404
405The `Metrics` struct will expose a method for checking load:
406
407```rust
408impl Metrics {
409 /// Check if system is under high load
410 pub fn is_high_load(&self) -> bool {
411 let active = self.websocket_connections_active.get();
412 active > self.config.high_load_threshold
413 }
414}
415```
416
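The decision in the flowchart above can be written as a pure function, which keeps the scheduling policy separate from how load is measured. Everything here is a sketch under assumptions: `LoadSample`, `LoadLimits`, and `decide_sync` are hypothetical names, and the CPU figure is assumed to come from some external source since the metrics listed in this document do not include it.

```rust
use std::time::Duration;

/// Outcome of one scheduling check.
#[derive(Debug, PartialEq)]
pub enum SyncDecision {
    Run,
    Delay(Duration),
}

/// Point-in-time load reading (hypothetical shape).
pub struct LoadSample {
    pub cpu_percent: f64,
    pub active_connections: i64,
}

/// Limits above which sync is postponed.
pub struct LoadLimits {
    pub max_cpu_percent: f64,
    pub max_connections: i64,
    pub retry_after: Duration,
}

/// Delays the sync job if either CPU or connection count is over its limit.
pub fn decide_sync(sample: &LoadSample, limits: &LoadLimits) -> SyncDecision {
    let cpu_high = sample.cpu_percent > limits.max_cpu_percent;
    let connections_high = sample.active_connections > limits.max_connections;
    if cpu_high || connections_high {
        SyncDecision::Delay(limits.retry_after)
    } else {
        SyncDecision::Run
    }
}
```

With the values from the diagram, `max_cpu_percent` would be `80.0` and `retry_after` would be `Duration::from_secs(300)`.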
417## Future Enhancement: Loki for Detailed Logging
418
419For detailed per-repository investigation at scale, consider adding **Loki** (log aggregation) in a future iteration:
420
421```rust
422// Structured logging with tracing
423tracing::info!(
424 repo = %repo_id,
425 npub = %npub,
426 bytes = bytes_transferred,
427 operation = "clone",
428 duration_ms = elapsed.as_millis(),
429 "git_transfer_complete"
430);
431```
432
433Loki query examples:
```logql
# Find all transfers > 10MB
{job="ngit-grasp"} |= "git_transfer_complete" | json | bytes > 10000000

# Sum bytes by repo in last hour
# (unwrap needs a range aggregation such as sum_over_time with a [1h] window)
sum by (repo) (
  sum_over_time({job="ngit-grasp"} |= "git_transfer_complete" | json | unwrap bytes [1h])
)
```
443
444This pairs with Prometheus for long-term trends while enabling ad-hoc deep dives.
445
446## Privacy Considerations
447
- IP addresses are held only in memory; they are written to logs only when an IP is flagged for abuse
- No per-IP metrics are exported; Prometheus only ever sees aggregate counts
- Consider IP anonymization for GDPR compliance if needed
451
452## Summary
453
454| Component | Purpose |
455|-----------|---------|
456| `Metrics` struct | Central registry and access point |
457| `ConnectionTracker` | Per-IP tracking with abuse detection |
458| `/metrics` endpoint | Prometheus scraping interface |
459| Grafana dashboard | Visualization and analysis |
460| NixOS config | Easy deployment for operators |
461
This strategy provides comprehensive observability without requiring a separate database - Prometheus handles all time-series storage and Grafana provides the visualization layer.
\ No newline at end of file
diff --git a/docs/explanation/monitoring.md b/docs/explanation/monitoring.md
new file mode 100644
index 0000000..3b1b1ac
--- /dev/null
+++ b/docs/explanation/monitoring.md
@@ -0,0 +1,99 @@
1# Monitoring
2
3ngit-grasp exposes Prometheus metrics at `/metrics` for monitoring WebSocket connections, Git operations, Nostr events, and system health.
4
5## Architecture
6
7```mermaid
8flowchart TB
9 subgraph ngit-grasp
10 HTTP[HTTP Service]
11 WS[WebSocket Handler]
12 GIT[Git Handlers]
13 RELAY[Nostr Relay]
14
15 subgraph Metrics Module
16 REG[Prometheus Registry]
17 CT[ConnectionTracker]
18 MC[Metric Counters]
19 end
20
21 ME[/metrics endpoint]
22 end
23
24 subgraph External
25 PROM[Prometheus Server]
26 GRAF[Grafana]
27 ADMIN[Admin Browser]
28 end
29
30 HTTP --> ME
31 WS --> CT
32 WS --> MC
33 GIT --> MC
34 RELAY --> MC
35
36 CT --> REG
37 MC --> REG
38 REG --> ME
39
40 PROM -->|scrape /metrics| ME
41 GRAF -->|query| PROM
42 ADMIN -->|view dashboards| GRAF
43```
44
45## Configuration
46
47| Option | CLI Flag | Environment Variable | Default | Description |
48|--------|----------|---------------------|---------|-------------|
49| Metrics enabled | `--metrics-enabled` | `NGIT_METRICS_ENABLED` | `true` | Enable /metrics endpoint |
50| Abuse threshold | `--abuse-threshold` | `NGIT_ABUSE_THRESHOLD` | `10` | Max connections per IP before flagging |
51| Top N repos | `--top-n-repos` | `NGIT_TOP_N_REPOS` | `10` | Number of top bandwidth repos to track |
52
53## Privacy Model
54
55IP addresses are **never exposed in Prometheus metrics**. The connection tracker maintains per-IP counts internally only for abuse detection:
56
57| Data | Exposed in Metrics? |
58|------|---------------------|
59| Total connections | ✅ Yes |
60| Unique IP count | ✅ Yes |
61| Flagged abuser count | ✅ Yes |
62| Actual IP addresses | ❌ No (internal only) |
63| IP + abuse flag | ⚠️ Logs only (when flagged) |
64
65When an IP exceeds the abuse threshold, a warning is logged but the IP is never exposed via Prometheus.
66
67## Deployment
68
69See [Prometheus Setup Guide](../how-to/prometheus-setup.md) for NixOS configuration and Grafana dashboard provisioning.
70
71## Future: Load-Based Sync Scheduling (GRASP-02)
72
73The metrics infrastructure enables future load-based scheduling for GRASP-02 sync jobs:
74
75```mermaid
76flowchart TD
77 SYNC[Sync Manager] --> CHECK{Check Load}
78 CHECK --> MET[Query Metrics]
79 MET --> CONN{Connections > N?}
80 CONN -->|Yes| DELAY[Delay 5 min]
81 CONN -->|No| RUN[Run Sync Job]
82 DELAY --> CHECK
83```
84
85## Future: Loki for Detailed Logging
86
87For detailed per-repository investigation at scale, consider adding **Loki** (log aggregation):
88
- Structured logging with the `tracing` crate is already in place
90- Loki queries enable ad-hoc deep dives (e.g., find all transfers > 10MB)
91- Pairs with Prometheus for long-term trends
92
93## Future: Sync Metrics (GRASP-02)
94
95When GRASP-02 proactive sync is implemented, additional metrics will track:
96
97- Events received from sync (live vs catchup)
98- Active outbound relay connections
- Catchup gap (events found during catchup indicating sync failures)
\ No newline at end of file
diff --git a/docs/grafana/ngit-grasp-dashboard.json b/docs/grafana/ngit-grasp-dashboard.json
new file mode 100644
index 0000000..bd1b6fe
--- /dev/null
+++ b/docs/grafana/ngit-grasp-dashboard.json
@@ -0,0 +1,675 @@
1{
2 "annotations": {
3 "list": []
4 },
5 "editable": true,
6 "fiscalYearStartMonth": 0,
7 "graphTooltip": 0,
8 "id": null,
9 "links": [],
10 "liveNow": false,
11 "panels": [
12 {
13 "collapsed": false,
14 "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
15 "id": 1,
16 "title": "Overview",
17 "type": "row"
18 },
19 {
20 "datasource": { "type": "prometheus", "uid": "${datasource}" },
21 "fieldConfig": {
22 "defaults": {
23 "color": { "mode": "thresholds" },
24 "mappings": [],
25 "thresholds": {
26 "mode": "absolute",
27 "steps": [
28 { "color": "green", "value": null },
29 { "color": "yellow", "value": 50 },
30 { "color": "red", "value": 100 }
31 ]
32 },
33 "unit": "short"
34 }
35 },
36 "gridPos": { "h": 4, "w": 4, "x": 0, "y": 1 },
37 "id": 2,
38 "options": {
39 "colorMode": "value",
40 "graphMode": "area",
41 "justifyMode": "auto",
42 "orientation": "auto",
43 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
44 "textMode": "auto"
45 },
46 "pluginVersion": "10.0.0",
47 "targets": [
48 {
49 "expr": "ngit_websocket_connections_active",
50 "legendFormat": "Active",
51 "refId": "A"
52 }
53 ],
54 "title": "Active Connections",
55 "type": "stat"
56 },
57 {
58 "datasource": { "type": "prometheus", "uid": "${datasource}" },
59 "fieldConfig": {
60 "defaults": {
61 "color": { "mode": "thresholds" },
62 "mappings": [],
63 "thresholds": {
64 "mode": "absolute",
65 "steps": [
66 { "color": "green", "value": null }
67 ]
68 },
69 "unit": "short"
70 }
71 },
72 "gridPos": { "h": 4, "w": 4, "x": 4, "y": 1 },
73 "id": 3,
74 "options": {
75 "colorMode": "value",
76 "graphMode": "none",
77 "justifyMode": "auto",
78 "orientation": "auto",
79 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
80 "textMode": "auto"
81 },
82 "targets": [
83 {
84 "expr": "ngit_websocket_unique_ips",
85 "legendFormat": "Unique IPs",
86 "refId": "A"
87 }
88 ],
89 "title": "Unique IPs",
90 "type": "stat"
91 },
92 {
93 "datasource": { "type": "prometheus", "uid": "${datasource}" },
94 "fieldConfig": {
95 "defaults": {
96 "color": { "mode": "thresholds" },
97 "mappings": [],
98 "thresholds": {
99 "mode": "absolute",
100 "steps": [
101 { "color": "green", "value": null },
102 { "color": "red", "value": 1 }
103 ]
104 },
105 "unit": "short"
106 }
107 },
108 "gridPos": { "h": 4, "w": 4, "x": 8, "y": 1 },
109 "id": 4,
110 "options": {
111 "colorMode": "value",
112 "graphMode": "none",
113 "justifyMode": "auto",
114 "orientation": "auto",
115 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
116 "textMode": "auto"
117 },
118 "targets": [
119 {
120 "expr": "ngit_websocket_flagged_abusers",
121 "legendFormat": "Flagged",
122 "refId": "A"
123 }
124 ],
125 "title": "Flagged Abusers",
126 "type": "stat"
127 },
128 {
129 "datasource": { "type": "prometheus", "uid": "${datasource}" },
130 "fieldConfig": {
131 "defaults": {
132 "color": { "mode": "thresholds" },
133 "mappings": [],
134 "thresholds": {
135 "mode": "absolute",
136 "steps": [{ "color": "blue", "value": null }]
137 },
138 "unit": "short"
139 }
140 },
141 "gridPos": { "h": 4, "w": 4, "x": 12, "y": 1 },
142 "id": 5,
143 "options": {
144 "colorMode": "value",
145 "graphMode": "none",
146 "justifyMode": "auto",
147 "orientation": "auto",
148 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
149 "textMode": "auto"
150 },
151 "targets": [
152 {
153 "expr": "ngit_repositories_total",
154 "legendFormat": "Repos",
155 "refId": "A"
156 }
157 ],
158 "title": "Total Repositories",
159 "type": "stat"
160 },
161 {
162 "datasource": { "type": "prometheus", "uid": "${datasource}" },
163 "fieldConfig": {
164 "defaults": {
165 "color": { "mode": "thresholds" },
166 "mappings": [],
167 "thresholds": {
168 "mode": "absolute",
169 "steps": [{ "color": "green", "value": null }]
170 },
171 "unit": "s"
172 }
173 },
174 "gridPos": { "h": 4, "w": 4, "x": 16, "y": 1 },
175 "id": 6,
176 "options": {
177 "colorMode": "value",
178 "graphMode": "none",
179 "justifyMode": "auto",
180 "orientation": "auto",
181 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
182 "textMode": "auto"
183 },
184 "targets": [
185 {
186 "expr": "ngit_uptime_seconds",
187 "legendFormat": "Uptime",
188 "refId": "A"
189 }
190 ],
191 "title": "Uptime",
192 "type": "stat"
193 },
194 {
195 "datasource": { "type": "prometheus", "uid": "${datasource}" },
196 "fieldConfig": {
197 "defaults": {
198 "color": { "mode": "thresholds" },
199 "mappings": [],
200 "thresholds": {
201 "mode": "absolute",
202 "steps": [{ "color": "purple", "value": null }]
203 },
204 "unit": "short"
205 }
206 },
207 "gridPos": { "h": 4, "w": 4, "x": 20, "y": 1 },
208 "id": 7,
209 "options": {
210 "colorMode": "value",
211 "graphMode": "none",
212 "justifyMode": "auto",
213 "orientation": "auto",
214 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
215 "textMode": "value_and_name"
216 },
217 "targets": [
218 {
219 "expr": "ngit_build_info",
220 "legendFormat": "{{version}}",
221 "refId": "A"
222 }
223 ],
224 "title": "Version",
225 "type": "stat"
226 },
227 {
228 "collapsed": false,
229 "gridPos": { "h": 1, "w": 24, "x": 0, "y": 5 },
230 "id": 10,
231 "title": "WebSocket Connections",
232 "type": "row"
233 },
234 {
235 "datasource": { "type": "prometheus", "uid": "${datasource}" },
236 "fieldConfig": {
237 "defaults": {
238 "color": { "mode": "palette-classic" },
239 "custom": {
240 "axisCenteredZero": false,
241 "axisColorMode": "text",
242 "axisLabel": "",
243 "axisPlacement": "auto",
244 "barAlignment": 0,
245 "drawStyle": "line",
246 "fillOpacity": 10,
247 "gradientMode": "none",
248 "hideFrom": { "legend": false, "tooltip": false, "viz": false },
249 "lineInterpolation": "linear",
250 "lineWidth": 1,
251 "pointSize": 5,
252 "scaleDistribution": { "type": "linear" },
253 "showPoints": "never",
254 "spanNulls": false,
255 "stacking": { "group": "A", "mode": "none" },
256 "thresholdsStyle": { "mode": "off" }
257 },
258 "mappings": [],
259 "thresholds": { "mode": "absolute", "steps": [] },
260 "unit": "short"
261 }
262 },
263 "gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
264 "id": 11,
265 "options": {
266 "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
267 "tooltip": { "mode": "multi", "sort": "none" }
268 },
269 "targets": [
270 {
271 "expr": "ngit_websocket_connections_active",
272 "legendFormat": "Active Connections",
273 "refId": "A"
274 },
275 {
276 "expr": "ngit_websocket_unique_ips",
277 "legendFormat": "Unique IPs",
278 "refId": "B"
279 }
280 ],
281 "title": "Connections Over Time",
282 "type": "timeseries"
283 },
284 {
285 "datasource": { "type": "prometheus", "uid": "${datasource}" },
286 "fieldConfig": {
287 "defaults": {
288 "color": { "mode": "palette-classic" },
289 "custom": {
290 "axisCenteredZero": false,
291 "axisColorMode": "text",
292 "axisLabel": "",
293 "axisPlacement": "auto",
294 "barAlignment": 0,
295 "drawStyle": "line",
296 "fillOpacity": 10,
297 "gradientMode": "none",
298 "hideFrom": { "legend": false, "tooltip": false, "viz": false },
299 "lineInterpolation": "linear",
300 "lineWidth": 1,
301 "pointSize": 5,
302 "scaleDistribution": { "type": "linear" },
303 "showPoints": "never",
304 "spanNulls": false,
305 "stacking": { "group": "A", "mode": "none" },
306 "thresholdsStyle": { "mode": "off" }
307 },
308 "mappings": [],
309 "thresholds": { "mode": "absolute", "steps": [] },
310 "unit": "short"
311 }
312 },
313 "gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
314 "id": 12,
315 "options": {
316 "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
317 "tooltip": { "mode": "multi", "sort": "none" }
318 },
319 "targets": [
320 {
321 "expr": "rate(ngit_websocket_messages_received_total[5m])",
322 "legendFormat": "Received: {{type}}",
323 "refId": "A"
324 },
325 {
326 "expr": "rate(ngit_websocket_messages_sent_total[5m])",
327 "legendFormat": "Sent: {{type}}",
328 "refId": "B"
329 }
330 ],
331 "title": "Message Rate (5m)",
332 "type": "timeseries"
333 },
334 {
335 "datasource": { "type": "prometheus", "uid": "${datasource}" },
336 "fieldConfig": {
337 "defaults": {
338 "color": { "mode": "palette-classic" },
339 "custom": { "hideFrom": { "legend": false, "tooltip": false, "viz": false } },
340 "mappings": [],
341 "unit": "s"
342 }
343 },
344 "gridPos": { "h": 8, "w": 12, "x": 0, "y": 14 },
345 "id": 13,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "value_and_name"
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.5, rate(ngit_websocket_connection_duration_seconds_bucket[1h]))",
          "legendFormat": "p50",
          "refId": "A"
        },
        {
          "expr": "histogram_quantile(0.95, rate(ngit_websocket_connection_duration_seconds_bucket[1h]))",
          "legendFormat": "p95",
          "refId": "B"
        },
        {
          "expr": "histogram_quantile(0.99, rate(ngit_websocket_connection_duration_seconds_bucket[1h]))",
          "legendFormat": "p99",
          "refId": "C"
        }
      ],
      "title": "Connection Duration Percentiles",
      "type": "stat"
371 },
372 {
373 "collapsed": false,
374 "gridPos": { "h": 1, "w": 24, "x": 0, "y": 22 },
375 "id": 20,
376 "title": "Git Operations",
377 "type": "row"
378 },
379 {
380 "datasource": { "type": "prometheus", "uid": "${datasource}" },
381 "fieldConfig": {
382 "defaults": {
383 "color": { "mode": "palette-classic" },
384 "custom": {
385 "axisCenteredZero": false,
386 "axisColorMode": "text",
387 "axisLabel": "",
388 "axisPlacement": "auto",
389 "barAlignment": 0,
390 "drawStyle": "bars",
391 "fillOpacity": 50,
392 "gradientMode": "none",
393 "hideFrom": { "legend": false, "tooltip": false, "viz": false },
394 "lineInterpolation": "linear",
395 "lineWidth": 1,
396 "pointSize": 5,
397 "scaleDistribution": { "type": "linear" },
398 "showPoints": "never",
399 "spanNulls": false,
400 "stacking": { "group": "A", "mode": "normal" },
401 "thresholdsStyle": { "mode": "off" }
402 },
403 "mappings": [],
404 "thresholds": { "mode": "absolute", "steps": [] },
405 "unit": "ops"
406 }
407 },
408 "gridPos": { "h": 8, "w": 12, "x": 0, "y": 23 },
409 "id": 21,
410 "options": {
411 "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
412 "tooltip": { "mode": "multi", "sort": "none" }
413 },
414 "targets": [
415 {
416 "expr": "rate(ngit_git_operations_total{status=\"success\"}[5m])",
417 "legendFormat": "{{operation}} (success)",
418 "refId": "A"
419 },
420 {
421 "expr": "rate(ngit_git_operations_total{status=\"error\"}[5m])",
422 "legendFormat": "{{operation}} (error)",
423 "refId": "B"
424 }
425 ],
426 "title": "Git Operations Rate (5m)",
427 "type": "timeseries"
428 },
429 {
430 "datasource": { "type": "prometheus", "uid": "${datasource}" },
431 "fieldConfig": {
432 "defaults": {
433 "color": { "mode": "palette-classic" },
434 "custom": {
435 "axisCenteredZero": false,
436 "axisColorMode": "text",
437 "axisLabel": "",
438 "axisPlacement": "auto",
439 "barAlignment": 0,
440 "drawStyle": "line",
441 "fillOpacity": 10,
442 "gradientMode": "none",
443 "hideFrom": { "legend": false, "tooltip": false, "viz": false },
444 "lineInterpolation": "linear",
445 "lineWidth": 1,
446 "pointSize": 5,
447 "scaleDistribution": { "type": "linear" },
448 "showPoints": "never",
449 "spanNulls": false,
450 "stacking": { "group": "A", "mode": "none" },
451 "thresholdsStyle": { "mode": "off" }
452 },
453 "mappings": [],
454 "thresholds": { "mode": "absolute", "steps": [] },
455 "unit": "bytes"
456 }
457 },
458 "gridPos": { "h": 8, "w": 12, "x": 12, "y": 23 },
459 "id": 22,
460 "options": {
461 "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
462 "tooltip": { "mode": "multi", "sort": "none" }
463 },
464 "targets": [
465 {
466 "expr": "rate(ngit_git_bytes_total[5m])",
467 "legendFormat": "{{direction}}",
468 "refId": "A"
469 }
470 ],
471 "title": "Git Bandwidth (5m)",
472 "type": "timeseries"
473 },
474 {
475 "datasource": { "type": "prometheus", "uid": "${datasource}" },
476 "fieldConfig": {
477 "defaults": {
478 "color": { "mode": "palette-classic" },
479 "custom": { "hideFrom": { "legend": false, "tooltip": false, "viz": false } },
480 "mappings": [],
481 "unit": "short"
482 },
483 "overrides": [
484 {
485 "matcher": { "id": "byName", "options": "denied" },
486 "properties": [{ "id": "color", "value": { "fixedColor": "red", "mode": "fixed" } }]
487 },
488 {
489 "matcher": { "id": "byName", "options": "allowed" },
490 "properties": [{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }]
491 }
492 ]
493 },
494 "gridPos": { "h": 8, "w": 6, "x": 0, "y": 31 },
495 "id": 23,
496 "options": {
497 "legend": { "displayMode": "list", "placement": "right", "showLegend": true },
498 "pieType": "pie",
499 "reduceOptions": { "calcs": ["sum"], "fields": "", "values": false },
500 "tooltip": { "mode": "single", "sort": "none" }
501 },
502 "targets": [
503 {
504 "expr": "increase(ngit_git_push_authorization_total[24h])",
505 "legendFormat": "{{result}}",
506 "refId": "A"
507 }
508 ],
509 "title": "Push Authorization (24h)",
510 "type": "piechart"
511 },
512 {
513 "datasource": { "type": "prometheus", "uid": "${datasource}" },
514 "fieldConfig": {
515 "defaults": {
516 "color": { "mode": "palette-classic" },
517 "mappings": [],
518 "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
519 "unit": "bytes"
520 }
521 },
522 "gridPos": { "h": 8, "w": 18, "x": 6, "y": 31 },
523 "id": 24,
524 "options": {
525 "displayMode": "gradient",
526 "minVizHeight": 10,
527 "minVizWidth": 0,
528 "orientation": "horizontal",
529 "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
530 "showUnfilled": true,
531 "valueMode": "color"
532 },
533 "targets": [
534 {
535 "expr": "topk(10, ngit_git_top_repos_bytes)",
536 "legendFormat": "{{repo}}",
537 "refId": "A"
538 }
539 ],
540 "title": "Top Repositories by Bandwidth",
541 "type": "bargauge"
542 },
543 {
544 "collapsed": false,
545 "gridPos": { "h": 1, "w": 24, "x": 0, "y": 39 },
546 "id": 30,
547 "title": "Nostr Events",
548 "type": "row"
549 },
550 {
551 "datasource": { "type": "prometheus", "uid": "${datasource}" },
552 "fieldConfig": {
553 "defaults": {
554 "color": { "mode": "palette-classic" },
555 "custom": {
556 "axisCenteredZero": false,
557 "axisColorMode": "text",
558 "axisLabel": "",
559 "axisPlacement": "auto",
560 "barAlignment": 0,
561 "drawStyle": "line",
562 "fillOpacity": 10,
563 "gradientMode": "none",
564 "hideFrom": { "legend": false, "tooltip": false, "viz": false },
565 "lineInterpolation": "linear",
566 "lineWidth": 1,
567 "pointSize": 5,
568 "scaleDistribution": { "type": "linear" },
569 "showPoints": "never",
570 "spanNulls": false,
571 "stacking": { "group": "A", "mode": "none" },
572 "thresholdsStyle": { "mode": "off" }
573 },
574 "mappings": [],
575 "thresholds": { "mode": "absolute", "steps": [] },
576 "unit": "short"
577 }
578 },
579 "gridPos": { "h": 8, "w": 12, "x": 0, "y": 40 },
580 "id": 31,
581 "options": {
582 "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
583 "tooltip": { "mode": "multi", "sort": "none" }
584 },
585 "targets": [
586 {
587 "expr": "rate(ngit_events_received_total[5m])",
588 "legendFormat": "Kind {{kind}}",
589 "refId": "A"
590 }
591 ],
592 "title": "Events Received by Kind (5m)",
593 "type": "timeseries"
594 },
595 {
596 "datasource": { "type": "prometheus", "uid": "${datasource}" },
597 "fieldConfig": {
598 "defaults": {
599 "color": { "mode": "palette-classic" },
600 "custom": {
601 "axisCenteredZero": false,
602 "axisColorMode": "text",
603 "axisLabel": "",
604 "axisPlacement": "auto",
605 "barAlignment": 0,
606 "drawStyle": "line",
607 "fillOpacity": 10,
608 "gradientMode": "none",
609 "hideFrom": { "legend": false, "tooltip": false, "viz": false },
610 "lineInterpolation": "linear",
611 "lineWidth": 1,
612 "pointSize": 5,
613 "scaleDistribution": { "type": "linear" },
614 "showPoints": "never",
615 "spanNulls": false,
616 "stacking": { "group": "A", "mode": "none" },
617 "thresholdsStyle": { "mode": "off" }
618 },
619 "mappings": [],
620 "thresholds": { "mode": "absolute", "steps": [] },
621 "unit": "short"
622 }
623 },
624 "gridPos": { "h": 8, "w": 12, "x": 12, "y": 40 },
625 "id": 32,
626 "options": {
627 "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "right", "showLegend": true },
628 "tooltip": { "mode": "multi", "sort": "none" }
629 },
630 "targets": [
631 {
632 "expr": "rate(ngit_events_stored_total[5m])",
633 "legendFormat": "Stored: Kind {{kind}}",
634 "refId": "A"
635 },
636 {
637 "expr": "rate(ngit_events_rejected_total[5m])",
638 "legendFormat": "Rejected: {{reason}}",
639 "refId": "B"
640 }
641 ],
642 "title": "Events Stored vs Rejected (5m)",
643 "type": "timeseries"
644 }
645 ],
646 "refresh": "30s",
647 "schemaVersion": 38,
648 "style": "dark",
649 "tags": ["ngit-grasp", "nostr", "git"],
650 "templating": {
651 "list": [
652 {
653 "current": { "selected": false, "text": "Prometheus", "value": "Prometheus" },
654 "hide": 0,
655 "includeAll": false,
656 "label": "Datasource",
657 "multi": false,
658 "name": "datasource",
659 "options": [],
660 "query": "prometheus",
661 "refresh": 1,
662 "regex": "",
663 "skipUrlSync": false,
664 "type": "datasource"
665 }
666 ]
667 },
668 "time": { "from": "now-6h", "to": "now" },
669 "timepicker": {},
670 "timezone": "browser",
671 "title": "ngit-grasp",
672 "uid": "ngit-grasp",
673 "version": 1,
674 "weekStart": ""
675}
\ No newline at end of file
diff --git a/docs/how-to/prometheus-setup.md b/docs/how-to/prometheus-setup.md
new file mode 100644
index 0000000..741255b
--- /dev/null
+++ b/docs/how-to/prometheus-setup.md
@@ -0,0 +1,178 @@
1# Prometheus and Grafana Setup
2
3This guide shows how to configure Prometheus and Grafana to monitor ngit-grasp.
4
5## Prerequisites
6
7- ngit-grasp running with metrics enabled (default: `--metrics-enabled true`)
8- Prometheus server
9- Grafana (optional, for dashboards)
10
11## Verify Metrics Endpoint
12
13First, verify that ngit-grasp is exposing metrics:
14
15```bash
16curl http://localhost:8080/metrics
17```
18
19You should see Prometheus-formatted metrics like:
20
21```
22# HELP ngit_websocket_connections_active Current active WebSocket connections
23# TYPE ngit_websocket_connections_active gauge
24ngit_websocket_connections_active 5
25
26# HELP ngit_git_operations_total Git operations by type and status
27# TYPE ngit_git_operations_total counter
28ngit_git_operations_total{operation="clone",status="success"} 42
29```
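
To check the endpoint from a script rather than by eye, the text format shown above can be parsed with a few lines of Python. This is a minimal sketch, not part of ngit-grasp: `parse_metrics` is a hypothetical helper, and the parser is deliberately simplified (it ignores timestamps and escaped quotes inside label values).

```python
import re

# Sample text copied from the expected output above.
SAMPLE = """\
# HELP ngit_websocket_connections_active Current active WebSocket connections
# TYPE ngit_websocket_connections_active gauge
ngit_websocket_connections_active 5

# HELP ngit_git_operations_total Git operations by type and status
# TYPE ngit_git_operations_total counter
ngit_git_operations_total{operation="clone",status="success"} 42
"""

# Metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples (hypothetical helper)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In practice the body returned by the `curl` command above would be passed to `parse_metrics` in place of `SAMPLE`.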
30
31## NixOS Configuration
32
33### Prometheus
34
35Add ngit-grasp as a scrape target:
36
37```nix
38services.prometheus = {
39 enable = true;
40 scrapeConfigs = [
41 {
42 job_name = "ngit-grasp";
43 static_configs = [{
44 targets = [ "localhost:8080" ]; # ngit-grasp bind address
45 }];
46 scrape_interval = "15s";
47 metrics_path = "/metrics";
48 }
49 ];
50};
51```
52
53### Grafana with Prometheus Datasource
54
55```nix
56services.grafana = {
57 enable = true;
58 settings.server.http_port = 3000;
59
60 provision.datasources.settings.datasources = [{
61 name = "Prometheus";
62 type = "prometheus";
63 url = "http://localhost:9090";
64 isDefault = true;
65 }];
66
67 # Optional: provision the ngit-grasp dashboard
68 provision.dashboards.settings.providers = [{
69 name = "ngit-grasp";
70 options.path = "/path/to/ngit-grasp/docs/grafana";
71 }];
72};
73```
74
75## Docker Compose Configuration
76
77For non-NixOS deployments:
78
79```yaml
80version: '3.8'
81services:
82 prometheus:
83 image: prom/prometheus:latest
84 volumes:
85 - ./prometheus.yml:/etc/prometheus/prometheus.yml
86 ports:
87 - "9090:9090"
88
89 grafana:
90 image: grafana/grafana:latest
91 ports:
92 - "3000:3000"
93 volumes:
94 - ./docs/grafana:/var/lib/grafana/dashboards
95 environment:
96 - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/ngit-grasp-dashboard.json
97```
98
99With `prometheus.yml`:
100
101```yaml
102global:
103 scrape_interval: 15s
104
105scrape_configs:
106 - job_name: 'ngit-grasp'
107 static_configs:
108 - targets: ['host.docker.internal:8080'] # or your ngit-grasp host
109 metrics_path: /metrics
110```
111
112## Import Dashboard
113
1141. Open Grafana at `http://localhost:3000`
1152. Go to **Dashboards** → **Import**
1163. Upload `docs/grafana/ngit-grasp-dashboard.json`
1174. Select your Prometheus datasource
1185. Click **Import**
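
Before importing, it can help to list which PromQL queries the dashboard file runs, so they can be compared with the metrics your instance exposes. The sketch below is an illustration only: `panel_queries` is a hypothetical helper, and it assumes the dashboard keeps its panels under a top-level `panels` key, as Grafana dashboard JSON normally does.

```python
import json


def panel_queries(dashboard):
    """Return (panel title, expr) pairs for every query target (hypothetical helper)."""
    pairs = []
    # Assumption: panels live under a top-level "panels" key.
    for panel in dashboard.get("panels", []):
        # Row panels have no "targets", so they contribute nothing.
        for target in panel.get("targets", []):
            pairs.append((panel.get("title", ""), target.get("expr", "")))
    return pairs


# Tiny stand-in for docs/grafana/ngit-grasp-dashboard.json.
EXAMPLE = json.loads("""
{
  "panels": [
    {"title": "Nostr Events", "type": "row"},
    {"title": "Git Bandwidth (5m)", "type": "timeseries",
     "targets": [{"expr": "rate(ngit_git_bytes_total[5m])", "refId": "A"}]}
  ]
}
""")

if __name__ == "__main__":
    for title, expr in panel_queries(EXAMPLE):
        print(f"{title}: {expr}")
```

To run it against the real file, load it with `json.load(open("docs/grafana/ngit-grasp-dashboard.json"))` and pass the result to `panel_queries`.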
119
120## Key Metrics to Monitor
121
122### Connection Health
123- `ngit_websocket_connections_active` - Current active connections
124- `ngit_websocket_unique_ips` - Number of unique client IPs
125- `ngit_websocket_flagged_abusers` - IPs exceeding connection threshold
126
127### Git Operations
128- `ngit_git_operations_total` - Operations by type (clone/fetch/push) and status
129- `ngit_git_bytes_total` - Bandwidth by direction (in/out)
130- `ngit_git_top_repos_bytes` - Top N repositories by bandwidth
131
132### Nostr Events
133- `ngit_events_received_total` - Events received by kind
134- `ngit_events_stored_total` - Events successfully stored
135- `ngit_events_rejected_total` - Events rejected by reason
136
137### System
138- `ngit_uptime_seconds` - Server uptime
139- `ngit_build_info` - Version and commit info
140- `ngit_repositories_total` - Total hosted repositories
141
142## Example Alerts
143
144Add to your Prometheus alerting rules:
145
146```yaml
147groups:
148 - name: ngit-grasp
149 rules:
150 - alert: HighConnectionCount
151 expr: ngit_websocket_connections_active > 100
152 for: 5m
153 labels:
154 severity: warning
155 annotations:
156 summary: "High number of WebSocket connections"
157
158 - alert: AbusiveIPs
159 expr: ngit_websocket_flagged_abusers > 0
160 for: 1m
161 labels:
162 severity: warning
163 annotations:
164 summary: "{{ $value }} IPs flagged for excessive connections"
165
166 - alert: PushAuthorizationFailures
167 expr: rate(ngit_git_operations_total{operation="push",status="denied"}[5m]) > 0.1
168 for: 5m
169 labels:
170 severity: info
171 annotations:
172 summary: "Elevated push authorization failures"
173```
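
To get a feel for the `PushAuthorizationFailures` threshold: `rate(...[5m]) > 0.1` compares an average per-second rate over five minutes against 0.1, which corresponds to roughly 30 denied pushes in that window. The sketch below illustrates the arithmetic only. `simple_rate` is a hypothetical helper that ignores counter resets and the extrapolation Prometheus applies at window edges, so it approximates `rate()` rather than reimplementing it.

```python
WINDOW_SECONDS = 5 * 60   # the [5m] range in the alert expression
THRESHOLD = 0.1           # events per second, from the alert rule


def simple_rate(first_value, last_value, window_seconds=WINDOW_SECONDS):
    """Average per-second increase of a counter across the window (simplified)."""
    return (last_value - first_value) / window_seconds


def would_fire(first_value, last_value):
    """True when the simplified rate is above the alert threshold."""
    return simple_rate(first_value, last_value) > THRESHOLD


if __name__ == "__main__":
    # 20 denied pushes in five minutes stays below the threshold,
    # 31 denied pushes in five minutes crosses it.
    print(would_fire(100, 120))
    print(would_fire(100, 131))
```

The rule's `for: 5m` clause additionally requires the condition to hold continuously for five minutes before the alert fires, so a single short burst of denied pushes may not trigger it.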
174
175## See Also
176
177- [Monitoring Overview](../explanation/monitoring.md) - Architecture and design
178- [Configuration Reference](../reference/configuration.md) - All config options
\ No newline at end of file