From f1579d1c099869de67b1741b7775cbf651b34ef0 Mon Sep 17 00:00:00 2001
From: DanConwayDev
Date: Fri, 9 Jan 2026 22:23:18 +0000
Subject: docs: update production sync testing workflow to two-mode process

- Mode 1: Fix one existing issue, test, commit, report
- Mode 2: Discover new issues with minimal documentation
- Emphasize stopping after each cycle
- Remove detailed investigation requirements
- Simplify issue documentation format
---
 docs/how-to/production-sync-testing.md | 290 ++++++++++++++++++++++++---------
 1 file changed, 210 insertions(+), 80 deletions(-)

diff --git a/docs/how-to/production-sync-testing.md b/docs/how-to/production-sync-testing.md
index b0f93b0..d5c11ea 100644
--- a/docs/how-to/production-sync-testing.md
+++ b/docs/how-to/production-sync-testing.md
@@ -1,11 +1,38 @@
 # How-To: Test Sync Against Production Data
 
-> **Quick Start Prompt:** Run a 30-second production sync test following docs/how-to/production-sync-testing.md. Use the minimal test command with sanitized output. Analyze the log for errors, warnings, and unexpected patterns. Document findings as individual markdown files in work/active-issues/ and suggest code fixes or logging improvements.
+> **Quick Start Prompt:** Check work/active-issues/ for existing issues. If issues exist, pick the most important, fix it, test with cargo test, run clippy and fmt, commit, and report back with a brief 1-2 sentence summary of each issue you identified. If no issues exist, run a 30-second production sync test, analyze logs, create individual issue files in work/active-issues/ (one per issue with minimal description), then report summary listing each issue in 1-2 sentences.
 
 **Problem:** Debug and improve sync behavior using real-world data from production relays
 **Difficulty:** Intermediate
 **Time:** 30 minutes per iteration
 
+## Two-Mode Workflow
+
+This guide operates in two modes:
+
+### Mode 1: Fix Existing Issues
+**When:** There are files in `work/active-issues/` (excluding README.md)
+
+1. Check for active issues: `ls work/active-issues/`
+2. Pick the most important issue to fix
+3. Implement the fix
+4. Run `cargo test` to verify tests pass
+5. Run `cargo clippy` to check for warnings
+6. Run `cargo fmt` to format code
+7. Commit changes with descriptive message
+8. Report back - **DO NOT** do another issue or run more tests
+
+### Mode 2: Discover New Issues
+**When:** No active issues in `work/active-issues/`
+
+1. Run 30-second production sync test (logs saved to `tmp/run-{timestamp}/`)
+2. Analyze logs for errors, warnings, unexpected patterns
+3. Document each issue as a separate markdown file in `work/active-issues/`
+4. Keep issue files minimal - just enough to identify the issue
+5. Report brief summary listing each issue in 1-2 sentences
+6. **DO NOT** create separate detailed analysis files
+7. **DO NOT** do thorough investigation or root cause analysis
+
 ## Overview
 
 This guide helps you run ngit-grasp's sync system against production relays to discover unexpected errors, inefficiencies, and edge cases that don't appear in controlled tests.
@@ -45,23 +72,33 @@ The bootstrap relay provides the initial set of announcements to discover repos:
 
 ### 3. Run with Time Limit
 
-Start with short runs (30 seconds) to capture manageable log volumes:
+Start with short runs (30 seconds) to capture manageable log volumes. Each run creates its own subdirectory in `tmp/` to keep data and logs isolated:
 
 ```bash
-# Clear any existing data for clean state
-rm -rf /tmp/ngit-test-*
+# Create run directory with timestamp
+RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$RUN_DIR"
 
 # Run for 30 seconds with sanitized output
 timeout 30s cargo run -- \
   --sync-bootstrap-relay-url wss://git.shakespeare.diy \
   --domain ngit.danconwaydev.com \
-  --git-data-path /tmp/ngit-test-git \
-  --relay-data-path /tmp/ngit-test-relay \
-  2>&1 | ./scripts/sanitize-logs.sh | tee sync-test.log
+  --git-data-path "$RUN_DIR/git-data" \
+  --relay-data-path "$RUN_DIR/relay-data" \
+  2>&1 | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
 ```
 
 **Note:** The `timeout` command returns exit code 124, which is expected.
 
+**Directory structure after run:**
+```
+tmp/
+└── run-20260109-143022/
+    ├── git-data/      # Git repository data
+    ├── relay-data/    # Relay database
+    └── sync.log       # Sanitized log output
+```
+
 ## Log Sanitization
 
 Raw logs include full events and hundreds of event IDs per line, making them unwieldy for analysis. The sanitizer truncates long lines:
@@ -152,73 +189,132 @@ When analyzing logs, look for these patterns:
 | `sync_live` | Live subscriptions active |
 | `PendingBatch` | Items awaiting EOSE confirmation |
 
-## Iterative Improvement Process
+## Mode 1: Fix Existing Issues (Detailed)
+
+When `work/active-issues/` contains issue files:
+
+### Step 1: Check for Active Issues
+
+```bash
+ls work/active-issues/
+```
+
+If any `.md` files exist (excluding README.md), you're in Mode 1.
+
+### Step 2: Pick Most Important Issue
+
+Review issue files and select based on:
+- Severity (errors > warnings > log quality)
+- Impact (functionality > performance > UX)
+- Complexity (quick fixes first to clear backlog)
+
+### Step 3: Implement the Fix
+
+Make the necessary code changes based on the issue description.
 
-### Step 1: Run and Capture
+### Step 4: Test, Lint, Format
 
 ```bash
-timeout 30s cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh > iteration-1.log
+# Run tests
+cargo test
+
+# Check for warnings
+cargo clippy
+
+# Format code
+cargo fmt
 ```
 
-### Step 2: Identify Issues
+### Step 5: Commit
 
-Scan logs for errors and unexpected patterns:
 ```bash
-grep -i error iteration-1.log
-grep -i warn iteration-1.log
-grep -i panic iteration-1.log
+git add .
+git commit -m "fix: [brief description of what was fixed]"
 ```
 
-### Step 3: Document Findings
+### Step 6: Report Back
+
+**STOP HERE.** Report what was fixed. Do NOT:
+- Fix another issue
+- Run production sync test
+- Do additional investigation
+
+The workflow will cycle back through Mode 1 if more issues remain.
+
+## Mode 2: Discover New Issues (Detailed)
+
+When `work/active-issues/` is empty (or only contains README.md):
 
-Create individual markdown files in `work/active-issues/` for each issue discovered:
+### Step 1: Run Production Sync Test
 
 ```bash
-# Example: Document a connection timeout issue
-cat > work/active-issues/connection-timeout-bootstrap.md <<'EOF'
-# Issue: Connection Timeout on Bootstrap Relay
+# Create run directory with timestamp
+RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$RUN_DIR"
 
-**Discovered:** 2026-01-09
-**Status:** Open
+# Run 30-second test
+timeout 30s cargo run -- \
+  --sync-bootstrap-relay-url wss://git.shakespeare.diy \
+  --domain ngit.danconwaydev.com \
+  --git-data-path "$RUN_DIR/git-data" \
+  --relay-data-path "$RUN_DIR/relay-data" \
+  2>&1 | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
+```
 
-## Symptoms
+Each run is isolated in its own timestamped directory under `tmp/`, keeping data and logs organized.
 
-- Connection to wss://git.shakespeare.diy fails after 10s timeout
-- Log shows: `error: connection failed: timeout`
-- Occurs 100% of time with this relay
+### Step 2: Analyze Logs
 
-## Root Cause
+Scan for errors and unexpected patterns:
+```bash
+# Find the most recent run
+LATEST_RUN=$(ls -1t tmp/run-*/sync.log | head -n1)
 
-[To be determined]
+# Analyze for issues
+grep -i error "$LATEST_RUN"
+grep -i warn "$LATEST_RUN"
+grep -i panic "$LATEST_RUN"
+```
 
-## Proposed Fix
+### Step 3: Document Issues
 
-- Increase connection timeout from 10s to 30s for initial bootstrap
-- Add retry logic with exponential backoff
-- Consider fallback bootstrap relays
+Create **one markdown file per issue** in `work/active-issues/`:
 
-## Code Location
+```bash
+# Example: Minimal issue documentation
+cat > work/active-issues/bootstrap-disconnect.md <<'EOF'
+# Bootstrap relay disconnects when empty
 
-- `src/sync/relay_connection.rs:45` - connection timeout constant
+Bootstrap relay wss://git.shakespeare.diy disconnects after sync finds 0 events. Should persist since user-specified.
+
+Log: "Disconnecting empty relay relay=wss://git.shakespeare.diy"
+File: src/sync/mod.rs (check_disconnects function)
 EOF
 ```
 
-**Why individual files?**
-- Keeps the how-to guide clean and focused
-- Prevents accidental commits of transient issues to tracked files
-- Easy to delete resolved issues or archive important ones
-- Each file can be worked on independently
+**Keep each file brief:**
+- Descriptive title (one line)
+- What happens (1-2 sentences max)
+- Relevant log excerpt (one line)
+- File/function location if obvious (one line)
+- **NO** separate detailed analysis files
+- **NO** root cause analysis
+- **NO** proposed solutions (unless immediately obvious)
 
-### Step 4: Fix and Re-test
+### Step 4: Report Summary
 
-After code changes, run again to verify the fix.
+Provide a brief closing message with 1-2 sentence summary of **each issue** identified:
+- State what the issue is
+- Where it occurs (file/component)
+- Keep it concise
 
-### Step 5: Extend Duration
+**STOP HERE.** Do NOT:
+- Fix the issues immediately
+- Create separate detailed analysis markdown files
+- Do thorough investigations
+- Write lengthy explanations
 
-Once 30-second runs are clean, extend to 2 minutes, then 5 minutes:
-```bash
-timeout 120s cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh > iteration-2.log
-```
+The workflow will cycle back through Mode 1 to fix issues one at a time.
 
 ## Logging Improvements
 
@@ -253,52 +349,66 @@ If a log line appears too frequently:
 tracing::trace!("Per-event detail that's too noisy");
 ```
 
-## Active Issues
+## Managing Active Issues
 
-Issues discovered during production sync testing are tracked in `work/active-issues/` as individual markdown files.
+Issues are tracked in `work/active-issues/` as individual markdown files.
 
-**View current issues:**
+**Check for active issues:**
 ```bash
 ls work/active-issues/
 ```
 
-**Create a new issue:**
+**After fixing an issue:**
 ```bash
-# Use kebab-case filename describing the issue
-cat > work/active-issues/[issue-name].md <<'EOF'
-# Issue: [Short Description]
+# Delete the resolved issue file
+rm work/active-issues/issue-name.md
 
-**Discovered:** [Date]
-**Status:** Open
-
-## Symptoms
+# Or archive if important for future reference
+mv work/active-issues/issue-name.md docs/archive/2026-01-09-issue-name.md
+```
 
-- Log patterns observed
-- Reproduction steps if known
+**Issue file format (minimal):**
+```markdown
+# Brief title
 
-## Root Cause
+What happens (1-2 sentences).
 
-[To be determined / Known cause]
+Log evidence: "relevant log line"
+File: src/path/to/file.rs (function_name if known)
+```
 
-## Proposed Fix
+Keep documentation minimal - just enough to identify and locate the issue.
 
-- Suggested code changes
-- Alternative approaches
+---
 
-## Code Location
+## Workflow Summary
 
-- File paths and line numbers where changes are needed
-EOF
 ```
-
-**Resolve an issue:**
-```bash
-# After fixing, either delete or move to archive
-rm work/active-issues/resolved-issue.md
-# OR
-mv work/active-issues/important-issue.md docs/archive/2026-01-09-important-issue.md
+Check work/active-issues/
+    │
+    ├─ Has issues? ──► Mode 1: Pick one issue
+    │                   │
+    │                   ├─ Fix code
+    │                   ├─ cargo test
+    │                   ├─ cargo clippy
+    │                   ├─ cargo fmt
+    │                   ├─ git commit
+    │                   └─ Report & STOP
+    │
+    └─ No issues? ──► Mode 2: Run production sync
+                       │
+                       ├─ timeout 30s cargo run ...
+                       ├─ Analyze logs
+                       ├─ Document issues (minimal)
+                       └─ Report summary & STOP
 ```
 
+**Key Rules:**
+- Only do ONE thing per cycle (fix one issue OR discover issues)
+- Always stop after reporting
+- Keep issue documentation minimal
+- No root cause analysis during discovery
+
 ---
 
 ## Quick Reference
@@ -306,28 +416,48 @@ mv work/active-issues/important-issue.md docs/archive/2026-01-09-important-issue
 ### Minimal Test Command
 
 ```bash
+# Create run directory
+RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$RUN_DIR"
+
+# Run test
 timeout 30s cargo run -- \
   --sync-bootstrap-relay-url wss://git.shakespeare.diy \
   --domain ngit.danconwaydev.com \
-  --git-data-path /tmp/ngit-test-git \
-  --relay-data-path /tmp/ngit-test-relay \
-  2>&1 | ./scripts/sanitize-logs.sh
+  --git-data-path "$RUN_DIR/git-data" \
+  --relay-data-path "$RUN_DIR/relay-data" \
+  2>&1 | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
 ```
 
 ### With Metrics Endpoint
 
 ```bash
+# Create run directory
+RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$RUN_DIR"
+
+# Run with metrics
 timeout 30s cargo run -- \
   --sync-bootstrap-relay-url wss://git.shakespeare.diy \
   --domain ngit.danconwaydev.com \
-  --git-data-path /tmp/ngit-test-git \
-  --relay-data-path /tmp/ngit-test-relay \
+  --git-data-path "$RUN_DIR/git-data" \
+  --relay-data-path "$RUN_DIR/relay-data" \
   --metrics-address 127.0.0.1:9090 \
-  2>&1 | ./scripts/sanitize-logs.sh
+  2>&1 | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
 ```
 
 Then in another terminal: `curl http://127.0.0.1:9090/metrics`
 
+### Cleanup Old Runs
+
+```bash
+# Remove runs older than 7 days
+find tmp/run-* -type d -mtime +7 -exec rm -rf {} +
+
+# Remove all test runs
+rm -rf tmp/run-*
+```
+
 ### Different Log Level
 
 The default is DEBUG. For more detail:
-- 
cgit v1.2.3
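
The mode check that the patched guide describes (Mode 1 when `work/active-issues/` holds issue files other than README.md, Mode 2 otherwise) could be scripted roughly as follows. This is only a sketch and not part of the patch: the function name `select_mode` and the `mode1`/`mode2` output strings are hypothetical, and it assumes issue files are `.md` files sitting directly in that directory.

```bash
# Hypothetical helper (not in the repository): print "mode1" when the issues
# directory contains at least one .md file other than README.md, else "mode2".
select_mode() {
    issues_dir="${1:-work/active-issues}"
    count=0
    for f in "$issues_dir"/*.md; do
        # When the glob matches nothing, the literal pattern is left; skip it.
        if [ ! -e "$f" ]; then
            continue
        fi
        # README.md is not an issue, per the guide.
        if [ "$(basename "$f")" = "README.md" ]; then
            continue
        fi
        count=$((count + 1))
    done
    if [ "$count" -gt 0 ]; then
        echo "mode1"
    else
        echo "mode2"
    fi
}
```

Calling `select_mode` with no argument checks `work/active-issues` relative to the current directory, so it would be run from the repository root; a missing directory is treated the same as an empty one.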