upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

summaryrefslogtreecommitdiff
path: root/docs/how-to/production-sync-testing.md
blob: 3b1d4e4481776b3f655c7f9e920a2638da5d060b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
# How-To: Test Sync Against Production Data

> **Quick Start Prompt:** Check work/active-issues/ for existing issues. If issues exist, pick the most important, fix it, test with cargo test, run clippy and fmt, commit, and report back with a brief 1-2 sentence summary of each issue you identified. If no issues exist, run a 30-second production sync test, analyze logs, create individual issue files in work/active-issues/ (one per issue with minimal description), then report summary listing each issue in 1-2 sentences.

**Problem:** Debug and improve sync behavior using real-world data from production relays  
**Difficulty:** Intermediate  
**Time:** 30 minutes per iteration

## Two-Mode Workflow

This guide operates in two modes:

### Mode 1: Fix Existing Issues
**When:** There are files in `work/active-issues/` (excluding README.md)

1. Check for active issues: `ls work/active-issues/`
2. Pick the most important issue to fix
3. **Review proposed fix and ask for permission before implementing**
4. Implement the fix (after approval)
5. Run `cargo test` to verify tests pass
6. Run `cargo clippy` to check for warnings
7. Run `cargo fmt` to format code
8. Commit changes with descriptive message
9. Report back - **DO NOT** do another issue or run more tests

### Mode 2: Discover New Issues
**When:** No active issues in `work/active-issues/`

1. Run 30-second production sync test (logs saved to `tmp/run-{timestamp}/`)
2. Analyze logs for errors, warnings, unexpected patterns
3. Document each issue as a separate markdown file in `work/active-issues/`
4. Keep issue files minimal - just enough to identify the issue
5. Report brief summary listing each issue in 1-2 sentences
6. **DO NOT** create separate detailed analysis files
7. **DO NOT** do thorough investigation or root cause analysis

## Overview

This guide helps you run ngit-grasp's sync system against production relays to discover unexpected errors, inefficiencies, and edge cases that don't appear in controlled tests.

**Why production testing matters:**
- Real data has inconsistencies, malformed events, and edge cases
- Production relays may behave differently (rate limiting, timeouts, partial NIP-77 support)
- Volume and patterns reveal performance bottlenecks
- Sync discovery leads to cascading subscriptions we can't predict in tests

## Prerequisites

- ngit-grasp compiles successfully (`cargo build`)
- Familiarity with [GRASP-02 Proactive Sync](../explanation/grasp-02-proactive-sync.md)
- Understanding of log levels and tracing

## Test Setup

### 1. Choose a Test Identity

Pick a domain with manageable sync volume. Smaller domains mean fewer repos to sync, making logs tractable.

**Recommended starting point:**
```bash
--domain ngit.danconwaydev.com
```

This domain has few repo announcements listing it, so sync stays manageable.

### 2. Choose a Bootstrap Relay

The bootstrap relay provides the initial set of announcements to discover repos:

```bash
--sync-bootstrap-relay-url wss://git.shakespeare.diy
```

### 3. Run with Time Limit

Start with short runs (30 seconds) to capture manageable log volumes. Each run creates its own subdirectory in `tmp/` to keep data and logs isolated:

```bash
# Create run directory with timestamp
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run for 30 seconds, saving both raw and sanitized logs
timeout 30s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

**Note:** The `timeout` command returns exit code 124, which is expected.

**Directory structure after run:**
```
tmp/
└── run-20260109-143022/
    ├── git-data/       # Git repository data
    ├── relay-data/     # Relay database
    ├── sync.log        # Sanitized log (for quick analysis)
    └── sync-raw.log    # Raw log (for full details when needed)
```

**When to use which log:**
- **sync.log** - Use for quick scanning and pattern recognition (long lines truncated)
- **sync-raw.log** - Use when you need full details (e.g., complete rejection reasons, full event data)

## Log Sanitization

Raw logs include full events and hundreds of event IDs per line, making them unwieldy for analysis. The sanitizer truncates long lines:

```bash
./scripts/sanitize-logs.sh < raw.log > sanitized.log

# Or pipe directly
cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh
```

**Options:**
- `--head-chars N` - First N characters to show (default: 200)
- `--tail-chars N` - Last N characters to show (default: 100)

Example output:
```
2024-01-09T10:00:00Z DEBUG sync: Processing events ids=[abc123, def456, ghi789, jkl012...<1847 chars>...xyz999, end123]
```

### Retrieving Full Details from Raw Logs

When sanitized logs show truncated messages (e.g., rejection reasons), use the raw log to see the complete content:

```bash
# Find specific error in raw log
grep "Rejected repository announcement" "$RUN_DIR/sync-raw.log"

# Extract full line for specific event ID
grep "note1z5ys7wf3ms5yxhnp3kfw7hpu5asfkx4jngzt5zgs4tm4tnvggnsqjfqeyt" "$RUN_DIR/sync-raw.log"

# View context around a truncated warning
grep -A 2 -B 2 "pattern from sanitized log" "$RUN_DIR/sync-raw.log"
```

The raw log contains complete, untruncated messages including full rejection reasons, event data, and debug details.

## What to Look For

### Phase 1: Connection & Bootstrap (0-5 seconds)

**Expected behavior:**
- Connection to bootstrap relay succeeds
- Layer 1 (announcement) subscription starts
- First batch of 30617/30618 events received

**Red flags:**
- Connection timeout or failure
- NIP-77 negentropy errors (should fall back gracefully)
- Immediate rate limiting

### Phase 2: Discovery Cascade (5-15 seconds)

**Expected behavior:**
- Self-subscriber batches fire as announcements are processed
- New relays discovered from announcement `relays` tags
- Layer 2 (repo tags) subscriptions created

**Red flags:**
- Excessive relay discovery (>10 relays rapidly)
- Filter consolidation warnings (>70 filters)
- Missing self-subscriber batch logs

### Phase 3: Steady State (15+ seconds)

**Expected behavior:**
- Historic sync batches completing (EOSE received)
- Periodic health checks running
- Events being saved to database

**Red flags:**
- Pending batches never confirming
- Repeated connection/disconnect cycles
- Memory growth (check with `top` in another terminal)

## Debugging Checklist

When analyzing logs, look for these patterns:

### Errors to Investigate

| Pattern | Possible Cause | Action |
|---------|----------------|--------|
| `error` (any) | Unexpected failure | Investigate immediately |
| `connection failed` | Network/relay issue | Check relay URL, try different relay |
| `rate limit` | Too many requests | Check consolidation, increase backoff |
| `negentropy` + `error` | NIP-77 incompatibility | Should fall back - verify it does |
| `timeout` | Slow relay or large sync | Increase timeouts or reduce scope |

### Warnings to Monitor

| Pattern | Meaning | Action |
|---------|---------|--------|
| `consolidating filters` | Filter count high | Expected, but frequent = problem |
| `backing off` | Health tracker retry | Normal, but watch for excessive |
| `batch failed` | Historic sync incomplete | Check which batches, why |

### Debug Patterns to Verify

| Pattern | What it shows |
|---------|---------------|
| `fresh_start` | Full sync initiated |
| `quick_reconnect` | Incremental sync (<15min gap) |
| `historic sync complete` | Sync finished successfully |
| `sync_live` | Live subscriptions active |
| `PendingBatch` | Items awaiting EOSE confirmation |

## Mode 1: Fix Existing Issues (Detailed)

When `work/active-issues/` contains issue files:

### Step 1: Check for Active Issues

```bash
ls work/active-issues/
```

If any `.md` files exist (excluding README.md), you're in Mode 1.

### Step 2: Pick Most Important Issue

Review issue files and select based on:
- Severity (errors > warnings > log quality)
- Impact (functionality > performance > UX)
- Complexity (quick fixes first to clear backlog)

### Step 3: Review Proposed Fix and Get Permission

**IMPORTANT:** Before implementing any changes:

1. Read relevant code files to understand the issue
2. Analyze the root cause
3. Propose a fix with explanation of what will change and why
4. Summarize the proposed fix in 2-3 sentences
5. **Ask for user permission to proceed**

**Do NOT implement changes without explicit approval.**

### Step 4: Implement the Fix

After receiving permission, make the necessary code changes based on the issue description and approved plan.

### Step 5: Test, Lint, Format

```bash
# Run tests
cargo test

# Check for warnings
cargo clippy

# Format code
cargo fmt
```

### Step 6: Commit

```bash
git add .
git commit -m "fix: [brief description of what was fixed]"
```

### Step 7: Report Back

**STOP HERE.** Report what was fixed. Do NOT:
- Fix another issue
- Run production sync test
- Do additional investigation

The workflow will cycle back through Mode 1 if more issues remain.

## Mode 2: Discover New Issues (Detailed)

When `work/active-issues/` is empty (or only contains README.md):

### Step 1: Run Production Sync Test

```bash
# Create run directory with timestamp
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run 30-second test, saving both raw and sanitized logs
timeout 30s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

Each run is isolated in its own timestamped directory under `tmp/`, keeping data and logs organized. Both raw and sanitized logs are saved for flexible analysis.

### Step 2: Analyze Logs

Scan for errors and unexpected patterns:
```bash
# Find the most recent run
LATEST_RUN=$(ls -1t tmp/run-*/sync.log | head -n1)
LATEST_RAW=$(ls -1t tmp/run-*/sync-raw.log | head -n1)

# Analyze sanitized log for quick scanning
grep -i error "$LATEST_RUN"
grep -i warn "$LATEST_RUN"
grep -i panic "$LATEST_RUN"

# If you find truncated messages, check the raw log for full details
grep "pattern from truncated message" "$LATEST_RAW"
```

### Step 3: Document Issues

Create **one markdown file per issue** in `work/active-issues/`:

```bash
# Example: Minimal issue documentation
cat > work/active-issues/bootstrap-disconnect.md <<'EOF'
# Bootstrap relay disconnects when empty

Bootstrap relay wss://git.shakespeare.diy disconnects after sync finds 0 events. Should persist since user-specified.

Log: "Disconnecting empty relay relay=wss://git.shakespeare.diy"
File: src/sync/mod.rs (check_disconnects function)
EOF
```

**Keep each file brief:**
- Descriptive title (one line)
- What happens (1-2 sentences max)
- Relevant log excerpt (one line)
- File/function location if obvious (one line)
- **NO** separate detailed analysis files
- **NO** root cause analysis
- **NO** proposed solutions (unless immediately obvious)

### Step 4: Report Summary

Provide a brief closing message with 1-2 sentence summary of **each issue** identified:
- State what the issue is
- Where it occurs (file/component)
- Keep it concise

**STOP HERE.** Do NOT:
- Fix the issues immediately
- Create separate detailed analysis markdown files
- Do thorough investigations
- Write lengthy explanations

The workflow will cycle back through Mode 1 to fix issues one at a time.

## Logging Improvements

If the logs aren't helpful enough, improve them. Common needs:

### Add Context to Existing Logs

```rust
// Before
tracing::debug!("Processing events");

// After
tracing::debug!(
    relay = %relay_url,
    event_count = events.len(),
    "Processing events"
);
```

### Add New Log Points

Key places that may need more logging:
- `src/sync/mod.rs` - SyncManager state transitions
- `src/sync/relay_connection.rs` - Connection lifecycle
- `src/sync/self_subscriber.rs` - Batch processing

### Reduce Noise

If a log line appears too frequently:
```rust
// Change from debug! to trace!
tracing::trace!("Per-event detail that's too noisy");
```

## Managing Active Issues

Issues are tracked in `work/active-issues/` as individual markdown files.

**Check for active issues:**
```bash
ls work/active-issues/
```

**After fixing an issue:**
```bash
# Delete the resolved issue file
rm work/active-issues/issue-name.md

# Or archive if important for future reference
mv work/active-issues/issue-name.md docs/archive/2026-01-09-issue-name.md
```

**Issue file format (minimal):**
```markdown
# Brief title

What happens (1-2 sentences).

Log evidence: "relevant log line"
File: src/path/to/file.rs (function_name if known)
```

Keep documentation minimal - just enough to identify and locate the issue.

---

## Workflow Summary

```
Check work/active-issues/
    │
    ├─ Has issues? ──► Mode 1: Pick one issue
    │                          │
    │                          ├─ Review & propose fix
    │                          ├─ Ask permission
    │                          ├─ Fix code (after approval)
    │                          ├─ cargo test
    │                          ├─ cargo clippy
    │                          ├─ cargo fmt
    │                          ├─ git commit
    │                          └─ Report & STOP
    │
    └─ No issues? ──► Mode 2: Run production sync
                               │
                               ├─ timeout 30s cargo run ...
                               ├─ Analyze logs
                               ├─ Document issues (minimal)
                               └─ Report summary & STOP
```

**Key Rules:**
- Only do ONE thing per cycle (fix one issue OR discover issues)
- Always stop after reporting
- Keep issue documentation minimal
- No root cause analysis during discovery

---

## Quick Reference

### Minimal Test Command

```bash
# Create run directory
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run test with both raw and sanitized logs
timeout 30s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

### With Metrics Endpoint

```bash
# Create run directory
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run with metrics and both log formats
timeout 30s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    --metrics-address 127.0.0.1:9090 \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

Then in another terminal: `curl http://127.0.0.1:9090/metrics`

### Cleanup Old Runs

```bash
# Remove runs older than 7 days
find tmp/run-* -type d -mtime +7 -exec rm -rf {} +

# Remove all test runs
rm -rf tmp/run-*
```

### Different Log Level

The default is DEBUG. For more detail:
```bash
RUST_LOG=trace cargo run -- [args]
```

For less noise:
```bash
RUST_LOG=info cargo run -- [args]
```

---

*Part of the [ngit-grasp documentation](../README.md) using the [Diátaxis](https://diataxis.fr/) framework.*