diff options
| author | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 21:41:45 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2026-01-09 21:41:45 +0000 |
| commit | 6a8346441ea605248e09858d0e6b9096a1c87259 (patch) | |
| tree | a24768a427f1d4b7e9adaff2b37f1385440f1c29 /docs/how-to/production-sync-testing.md | |
| parent | 5fed2e2f32cfb15fff042a39f3ac82abe8948ca0 (diff) | |
docs: add production sync testing guide and log sanitizer script
Add infrastructure for iterative debugging of sync against production data:
- scripts/sanitize-logs.sh: Truncates verbose log lines for LLM analysis
- docs/how-to/production-sync-testing.md: Step-by-step guide for testing
sync against real relays, identifying issues, and improving logging
Diffstat (limited to 'docs/how-to/production-sync-testing.md')
| -rw-r--r-- | docs/how-to/production-sync-testing.md | 283 |
1 files changed, 283 insertions, 0 deletions
diff --git a/docs/how-to/production-sync-testing.md b/docs/how-to/production-sync-testing.md new file mode 100644 index 0000000..dfff20d --- /dev/null +++ b/docs/how-to/production-sync-testing.md | |||
| @@ -0,0 +1,283 @@ | |||
| 1 | # How-To: Test Sync Against Production Data | ||
| 2 | |||
| 3 | **Problem:** Debug and improve sync behavior using real-world data from production relays | ||
| 4 | **Difficulty:** Intermediate | ||
| 5 | **Time:** 30 minutes per iteration | ||
| 6 | |||
| 7 | ## Overview | ||
| 8 | |||
| 9 | This guide helps you run ngit-grasp's sync system against production relays to discover unexpected errors, inefficiencies, and edge cases that don't appear in controlled tests. | ||
| 10 | |||
| 11 | **Why production testing matters:** | ||
| 12 | - Real data has inconsistencies, malformed events, and edge cases | ||
| 13 | - Production relays may behave differently (rate limiting, timeouts, partial NIP-77 support) | ||
| 14 | - Volume and patterns reveal performance bottlenecks | ||
| 15 | - Sync discovery leads to cascading subscriptions we can't predict in tests | ||
| 16 | |||
| 17 | ## Prerequisites | ||
| 18 | |||
| 19 | - ngit-grasp compiles successfully (`cargo build`) | ||
| 20 | - Familiarity with [GRASP-02 Proactive Sync](../explanation/grasp-02-proactive-sync.md) | ||
| 21 | - Understanding of log levels and tracing | ||
| 22 | |||
| 23 | ## Test Setup | ||
| 24 | |||
| 25 | ### 1. Choose a Test Identity | ||
| 26 | |||
| 27 | Pick a domain with manageable sync volume. Smaller domains mean fewer repos to sync, making logs tractable. | ||
| 28 | |||
| 29 | **Recommended starting point:** | ||
| 30 | ```bash | ||
| 31 | --domain ngit.danconwaydev.com | ||
| 32 | ``` | ||
| 33 | |||
| 34 | This domain has few repo announcements listing it, so sync stays manageable. | ||
| 35 | |||
| 36 | ### 2. Choose a Bootstrap Relay | ||
| 37 | |||
| 38 | The bootstrap relay provides the initial set of announcements to discover repos: | ||
| 39 | |||
| 40 | ```bash | ||
| 41 | --sync-bootstrap-relay wss://git.shakespeare.diy | ||
| 42 | ``` | ||
| 43 | |||
| 44 | ### 3. Run with Time Limit | ||
| 45 | |||
| 46 | Start with short runs (30 seconds) to capture manageable log volumes: | ||
| 47 | |||
| 48 | ```bash | ||
| 49 | # Clear any existing data for clean state | ||
| 50 | rm -rf /tmp/ngit-test-* | ||
| 51 | |||
| 52 | # Run for 30 seconds with sanitized output | ||
| 53 | timeout 30s cargo run -- \ | ||
| 54 | --sync-bootstrap-relay wss://git.shakespeare.diy \ | ||
| 55 | --domain ngit.danconwaydev.com \ | ||
| 56 | --git-path /tmp/ngit-test-git \ | ||
| 57 | --relay-data-path /tmp/ngit-test-relay \ | ||
| 58 | 2>&1 | ./scripts/sanitize-logs.sh | tee sync-test.log | ||
| 59 | ``` | ||
| 60 | |||
| 61 | **Note:** The `timeout` command returns exit code 124, which is expected. | ||
| 62 | |||
| 63 | ## Log Sanitization | ||
| 64 | |||
| 65 | Raw logs include full events and hundreds of event IDs per line, making them unwieldy for analysis. The sanitizer truncates long lines: | ||
| 66 | |||
| 67 | ```bash | ||
| 68 | ./scripts/sanitize-logs.sh < raw.log > sanitized.log | ||
| 69 | |||
| 70 | # Or pipe directly | ||
| 71 | cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh | ||
| 72 | ``` | ||
| 73 | |||
| 74 | **Options:** | ||
| 75 | - `--head-chars N` - First N characters to show (default: 100) | ||
| 76 | - `--tail-chars N` - Last N characters to show (default: 20) | ||
| 77 | |||
| 78 | Example output: | ||
| 79 | ``` | ||
| 80 | 2024-01-09T10:00:00Z DEBUG sync: Processing events ids=[abc123, def456, ghi789, jkl012...<1847 chars>...xyz999, end123] | ||
| 81 | ``` | ||
| 82 | |||
| 83 | ## What to Look For | ||
| 84 | |||
| 85 | ### Phase 1: Connection & Bootstrap (0-5 seconds) | ||
| 86 | |||
| 87 | **Expected behavior:** | ||
| 88 | - Connection to bootstrap relay succeeds | ||
| 89 | - Layer 1 (announcement) subscription starts | ||
| 90 | - First batch of 30617/30618 events received | ||
| 91 | |||
| 92 | **Red flags:** | ||
| 93 | - Connection timeout or failure | ||
| 94 | - NIP-77 negentropy errors (should fall back gracefully) | ||
| 95 | - Immediate rate limiting | ||
| 96 | |||
| 97 | ### Phase 2: Discovery Cascade (5-15 seconds) | ||
| 98 | |||
| 99 | **Expected behavior:** | ||
| 100 | - Self-subscriber batches fire as announcements are processed | ||
| 101 | - New relays discovered from announcement `relays` tags | ||
| 102 | - Layer 2 (repo tags) subscriptions created | ||
| 103 | |||
| 104 | **Red flags:** | ||
| 105 | - Excessive relay discovery (>10 relays rapidly) | ||
| 106 | - Filter consolidation warnings (>70 filters) | ||
| 107 | - Missing self-subscriber batch logs | ||
| 108 | |||
| 109 | ### Phase 3: Steady State (15+ seconds) | ||
| 110 | |||
| 111 | **Expected behavior:** | ||
| 112 | - Historic sync batches completing (EOSE received) | ||
| 113 | - Periodic health checks running | ||
| 114 | - Events being saved to database | ||
| 115 | |||
| 116 | **Red flags:** | ||
| 117 | - Pending batches never confirming | ||
| 118 | - Repeated connection/disconnect cycles | ||
| 119 | - Memory growth (check with `top` in another terminal) | ||
| 120 | |||
| 121 | ## Debugging Checklist | ||
| 122 | |||
| 123 | When analyzing logs, look for these patterns: | ||
| 124 | |||
| 125 | ### Errors to Investigate | ||
| 126 | |||
| 127 | | Pattern | Possible Cause | Action | | ||
| 128 | |---------|----------------|--------| | ||
| 129 | | `error` (any) | Unexpected failure | Investigate immediately | | ||
| 130 | | `connection failed` | Network/relay issue | Check relay URL, try different relay | | ||
| 131 | | `rate limit` | Too many requests | Check consolidation, increase backoff | | ||
| 132 | | `negentropy` + `error` | NIP-77 incompatibility | Should fall back - verify it does | | ||
| 133 | | `timeout` | Slow relay or large sync | Increase timeouts or reduce scope | | ||
| 134 | |||
| 135 | ### Warnings to Monitor | ||
| 136 | |||
| 137 | | Pattern | Meaning | Action | | ||
| 138 | |---------|---------|--------| | ||
| 139 | | `consolidating filters` | Filter count high | Expected, but frequent = problem | | ||
| 140 | | `backing off` | Health tracker retry | Normal, but watch for excessive | | ||
| 141 | | `batch failed` | Historic sync incomplete | Check which batches, why | | ||
| 142 | |||
| 143 | ### Debug Patterns to Verify | ||
| 144 | |||
| 145 | | Pattern | What it shows | | ||
| 146 | |---------|---------------| | ||
| 147 | | `fresh_start` | Full sync initiated | | ||
| 148 | | `quick_reconnect` | Incremental sync (<15min gap) | | ||
| 149 | | `historic sync complete` | Sync finished successfully | | ||
| 150 | | `sync_live` | Live subscriptions active | | ||
| 151 | | `PendingBatch` | Items awaiting EOSE confirmation | | ||
| 152 | |||
| 153 | ## Iterative Improvement Process | ||
| 154 | |||
| 155 | ### Step 1: Run and Capture | ||
| 156 | |||
| 157 | ```bash | ||
| 158 | timeout 30s cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh > iteration-1.log | ||
| 159 | ``` | ||
| 160 | |||
| 161 | ### Step 2: Identify Issues | ||
| 162 | |||
| 163 | Scan logs for errors and unexpected patterns: | ||
| 164 | ```bash | ||
| 165 | grep -i error iteration-1.log | ||
| 166 | grep -i warn iteration-1.log | ||
| 167 | grep -i panic iteration-1.log | ||
| 168 | ``` | ||
| 169 | |||
| 170 | ### Step 3: Document Findings | ||
| 171 | |||
| 172 | Add findings to this file's [Known Issues](#known-issues) section or create GitHub issues. | ||
| 173 | |||
| 174 | ### Step 4: Fix and Re-test | ||
| 175 | |||
| 176 | After code changes, run again to verify the fix. | ||
| 177 | |||
| 178 | ### Step 5: Extend Duration | ||
| 179 | |||
| 180 | Once 30-second runs are clean, extend to 2 minutes, then 5 minutes: | ||
| 181 | ```bash | ||
| 182 | timeout 120s cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh > iteration-2.log | ||
| 183 | ``` | ||
| 184 | |||
| 185 | ## Logging Improvements | ||
| 186 | |||
| 187 | If the logs aren't helpful enough, improve them. Common needs: | ||
| 188 | |||
| 189 | ### Add Context to Existing Logs | ||
| 190 | |||
| 191 | ```rust | ||
| 192 | // Before | ||
| 193 | tracing::debug!("Processing events"); | ||
| 194 | |||
| 195 | // After | ||
| 196 | tracing::debug!( | ||
| 197 | relay = %relay_url, | ||
| 198 | event_count = events.len(), | ||
| 199 | "Processing events" | ||
| 200 | ); | ||
| 201 | ``` | ||
| 202 | |||
| 203 | ### Add New Log Points | ||
| 204 | |||
| 205 | Key places that may need more logging: | ||
| 206 | - `src/sync/mod.rs` - SyncManager state transitions | ||
| 207 | - `src/sync/relay_connection.rs` - Connection lifecycle | ||
| 208 | - `src/sync/self_subscriber.rs` - Batch processing | ||
| 209 | |||
| 210 | ### Reduce Noise | ||
| 211 | |||
| 212 | If a log line appears too frequently: | ||
| 213 | ```rust | ||
| 214 | // Change from debug! to trace! | ||
| 215 | tracing::trace!("Per-event detail that's too noisy"); | ||
| 216 | ``` | ||
| 217 | |||
| 218 | ## Known Issues | ||
| 219 | |||
| 220 | *Document issues discovered during testing here. Delete this section when empty.* | ||
| 221 | |||
| 222 | ### Template for New Issues | ||
| 223 | |||
| 224 | ```markdown | ||
| 225 | ### Issue: [Short description] | ||
| 226 | |||
| 227 | **Discovered:** [Date] | ||
| 228 | **Status:** [Open/Fixed in PR#xxx] | ||
| 229 | |||
| 230 | **Symptoms:** | ||
| 231 | - Log pattern observed | ||
| 232 | |||
| 233 | **Root cause:** | ||
| 234 | - [If known] | ||
| 235 | |||
| 236 | **Fix:** | ||
| 237 | - [If known] | ||
| 238 | ``` | ||
| 239 | |||
| 240 | --- | ||
| 241 | |||
| 242 | ## Quick Reference | ||
| 243 | |||
| 244 | ### Minimal Test Command | ||
| 245 | |||
| 246 | ```bash | ||
| 247 | timeout 30s cargo run -- \ | ||
| 248 | --sync-bootstrap-relay wss://git.shakespeare.diy \ | ||
| 249 | --domain ngit.danconwaydev.com \ | ||
| 250 | --git-path /tmp/ngit-test-git \ | ||
| 251 | --relay-data-path /tmp/ngit-test-relay \ | ||
| 252 | 2>&1 | ./scripts/sanitize-logs.sh | ||
| 253 | ``` | ||
| 254 | |||
| 255 | ### With Metrics Endpoint | ||
| 256 | |||
| 257 | ```bash | ||
| 258 | timeout 30s cargo run -- \ | ||
| 259 | --sync-bootstrap-relay wss://git.shakespeare.diy \ | ||
| 260 | --domain ngit.danconwaydev.com \ | ||
| 261 | --git-path /tmp/ngit-test-git \ | ||
| 262 | --relay-data-path /tmp/ngit-test-relay \ | ||
| 263 | --metrics-address 127.0.0.1:9090 \ | ||
| 264 | 2>&1 | ./scripts/sanitize-logs.sh | ||
| 265 | ``` | ||
| 266 | |||
| 267 | Then in another terminal: `curl http://127.0.0.1:9090/metrics` | ||
| 268 | |||
| 269 | ### Different Log Level | ||
| 270 | |||
| 271 | The default is DEBUG. For more detail: | ||
| 272 | ```bash | ||
| 273 | RUST_LOG=trace cargo run -- [args] | ||
| 274 | ``` | ||
| 275 | |||
| 276 | For less noise: | ||
| 277 | ```bash | ||
| 278 | RUST_LOG=info cargo run -- [args] | ||
| 279 | ``` | ||
| 280 | |||
| 281 | --- | ||
| 282 | |||
| 283 | *Part of the [ngit-grasp documentation](../README.md) using the [Diátaxis](https://diataxis.fr/) framework.* | ||