# How-To: Test Sync Against Production Data

> **Quick Start Prompt:** Check work/active-issues/ for existing issues. If issues exist, pick the most important, fix it, test with cargo test, run clippy and fmt, commit, and report back with a brief 1-2 sentence summary of each issue you identified. If no issues exist, run a 60-second production sync test, analyze logs, create individual issue files in work/active-issues/ (one per issue with minimal description), then report summary listing each issue in 1-2 sentences.

**Problem:** Debug and improve sync behavior using real-world data from production relays  
**Difficulty:** Intermediate  
**Time:** 30 minutes per iteration

## Two-Mode Workflow

This guide operates in two modes:

### Mode 1: Fix Existing Issues
**When:** There are files in `work/active-issues/` (excluding README.md)

1. Check for active issues: `ls work/active-issues/`
2. Pick the most important issue to fix
3. **Review proposed fix and ask for permission before implementing**
4. Implement the fix (after approval)
5. Run `cargo test` to verify tests pass
6. Run `cargo clippy` to check for warnings
7. Run `cargo fmt` to format code
8. Commit changes with descriptive message
9. Report back - **DO NOT** do another issue or run more tests

### Mode 2: Discover New Issues
**When:** No active issues in `work/active-issues/`

1. Run 60-second production sync test (logs saved to `tmp/run-{timestamp}/`)
2. Analyze logs for errors, warnings, unexpected patterns
3. Document each issue as a separate markdown file in `work/active-issues/`
4. Keep issue files minimal - just enough to identify the issue
5. Report brief summary listing each issue in 1-2 sentences
6. **DO NOT** create separate detailed analysis files
7. **DO NOT** do thorough investigation or root cause analysis

## Overview

This guide helps you run ngit-grasp's sync system against production relays to discover unexpected errors, inefficiencies, and edge cases that don't appear in controlled tests.

**Why production testing matters:**
- Real data has inconsistencies, malformed events, and edge cases
- Production relays may behave differently (rate limiting, timeouts, partial NIP-77 support)
- Volume and patterns reveal performance bottlenecks
- Sync discovery leads to cascading subscriptions we can't predict in tests

## Prerequisites

- ngit-grasp compiles successfully (`cargo build`)
- Familiarity with [GRASP-02 Proactive Sync](../explanation/grasp-02-proactive-sync.md)
- Understanding of log levels and tracing

## Test Setup

### 1. Choose a Test Identity

Pick a domain with manageable sync volume. Smaller domains mean fewer repos to sync, making logs tractable.

**Recommended starting point:**
```bash
--domain ngit.danconwaydev.com
```

This domain has few repo announcements listing it, so sync stays manageable.

### 2. Choose a Bootstrap Relay

The bootstrap relay provides the initial set of announcements to discover repos:

```bash
--sync-bootstrap-relay-url wss://git.shakespeare.diy
```

### 3. Run with Time Limit

Run for **60 seconds** to allow the full sync cascade to complete. This duration allows:
- Layer 1: Discovery of repo announcements (0-5s)
- Layer 2: Sending `#a`, `#A`, `#q` filters for repos (5-30s)
- Layer 3: Receiving issues, patches, PRs (30-60s)
- Layer 4: Receiving comments on root events (40-60s)

Each run creates its own subdirectory in `tmp/` to keep data and logs isolated:

```bash
# Create run directory with timestamp
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run for 60 seconds, saving both raw and sanitized logs
timeout 60s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

**Why 60 seconds?** The sync system uses a 5-second batch window for aggregating discovered repos. Repos discovered late in the sync need time for:
1. Batch window to expire (5s)
2. Layer 2 filters to be sent and processed
3. Layer 3 events (issues/patches/PRs) to be returned
4. Layer 4 filters for root events to be sent
5. Comments and threaded replies to be returned

Testing shows 30 seconds is too short for late-discovered repos to complete the Layer 2→3→4 cascade.

**Note:** The `timeout` command returns exit code 124, which is expected.

**Directory structure after run:**
```
tmp/
└── run-20260109-143022/
    ├── git-data/       # Git repository data
    ├── relay-data/     # Relay database
    ├── sync.log        # Sanitized log (for quick analysis)
    └── sync-raw.log    # Raw log (for full details when needed)
```

**When to use which log:**
- **sync.log** - Use for quick scanning and pattern recognition (long lines truncated)
- **sync-raw.log** - Use when you need full details (e.g., complete rejection reasons, full event data)

## Log Sanitization

Raw logs include full events and hundreds of event IDs per line, making them unwieldy for analysis. The sanitizer truncates long lines:

```bash
./scripts/sanitize-logs.sh < raw.log > sanitized.log

# Or pipe directly
cargo run -- [args] 2>&1 | ./scripts/sanitize-logs.sh
```

**Options:**
- `--head-chars N` - First N characters to show (default: 200)
- `--tail-chars N` - Last N characters to show (default: 100)

Example output:
```
2024-01-09T10:00:00Z DEBUG sync: Processing events ids=[abc123, def456, ghi789, jkl012...<1847 chars>...xyz999, end123]
```

### Retrieving Full Details from Raw Logs

When sanitized logs show truncated messages (e.g., rejection reasons), use the raw log to see the complete content:

```bash
# Find specific error in raw log
grep "Rejected repository announcement" "$RUN_DIR/sync-raw.log"

# Extract full line for specific event ID
grep "note1z5ys7wf3ms5yxhnp3kfw7hpu5asfkx4jngzt5zgs4tm4tnvggnsqjfqeyt" "$RUN_DIR/sync-raw.log"

# View context around a truncated warning
grep -A 2 -B 2 "pattern from sanitized log" "$RUN_DIR/sync-raw.log"
```

The raw log contains complete, untruncated messages including full rejection reasons, event data, and debug details.

## What to Look For

### Phase 1: Connection & Bootstrap (0-5 seconds)

**Expected behavior:**
- Connection to bootstrap relay succeeds
- Layer 1 (announcement) subscription starts
- First batch of 30617/30618 events received

**Red flags:**
- Connection timeout or failure
- NIP-77 negentropy errors (should fall back gracefully)
- Immediate rate limiting

### Phase 2: Discovery Cascade (5-15 seconds)

**Expected behavior:**
- Self-subscriber batches fire as announcements are processed
- New relays discovered from announcement `relays` tags
- Layer 2 (repo tags) subscriptions created

**Red flags:**
- Excessive relay discovery (>10 relays rapidly)
- Filter consolidation warnings (>70 filters)
- Missing self-subscriber batch logs

### Phase 3: Steady State (15+ seconds)

**Expected behavior:**
- Historic sync batches completing (EOSE received)
- Periodic health checks running
- Events being saved to database

**Red flags:**
- Pending batches never confirming
- Repeated connection/disconnect cycles
- Memory growth (check with `top` in another terminal)

## Debugging Checklist

When analyzing logs, look for these patterns:

### Errors to Investigate

| Pattern | Possible Cause | Action |
|---------|----------------|--------|
| `error` (any) | Unexpected failure | Investigate immediately |
| `connection failed` | Network/relay issue | Check relay URL, try different relay |
| `rate limit` | Too many requests | Check consolidation, increase backoff |
| `negentropy` + `error` | NIP-77 incompatibility | Should fall back - verify it does |
| `timeout` | Slow relay or large sync | Increase timeouts or reduce scope |

### Warnings to Monitor

| Pattern | Meaning | Action |
|---------|---------|--------|
| `consolidating filters` | Filter count high | Expected, but frequent = problem |
| `backing off` | Health tracker retry | Normal, but watch for excessive |
| `batch failed` | Historic sync incomplete | Check which batches, why |

### Debug Patterns to Verify

| Pattern | What it shows |
|---------|---------------|
| `fresh_start` | Full sync initiated |
| `quick_reconnect` | Incremental sync (<15min gap) |
| `historic sync complete` | Sync finished successfully |
| `sync_live` | Live subscriptions active |
| `PendingBatch` | Items awaiting EOSE confirmation |

## Mode 1: Fix Existing Issues (Detailed)

When `work/active-issues/` contains issue files:

### Step 1: Check for Active Issues

```bash
ls work/active-issues/
```

If any `.md` files exist (excluding README.md), you're in Mode 1.

### Step 2: Pick Most Important Issue

Review issue files and select based on:
- Severity (errors > warnings > log quality)
- Impact (functionality > performance > UX)
- Complexity (quick fixes first to clear backlog)

### Step 3: Review Proposed Fix and Get Permission

**IMPORTANT:** Before implementing any changes:

1. Read relevant code files to understand the issue
2. Analyze the root cause
3. Propose a fix with explanation of what will change and why
4. Summarize the proposed fix in 2-3 sentences
5. **Ask for user permission to proceed**

**Do NOT implement changes without explicit approval.**

### Step 4: Implement the Fix

After receiving permission, make the necessary code changes based on the issue description and approved plan.

### Step 5: Test, Lint, Format

```bash
# Run tests
cargo test

# Check for warnings
cargo clippy

# Format code
cargo fmt
```

### Step 6: Commit

```bash
git add .
git commit -m "fix: [brief description of what was fixed]"
```

### Step 7: Report Back

**STOP HERE.** Report what was fixed. Do NOT:
- Fix another issue
- Run production sync test
- Do additional investigation

The workflow will cycle back through Mode 1 if more issues remain.

## Mode 2: Discover New Issues (Detailed)

When `work/active-issues/` is empty (or only contains README.md):

### Step 1: Run Production Sync Test

```bash
# Create run directory with timestamp
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run 60-second test, saving both raw and sanitized logs
timeout 60s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

Each run is isolated in its own timestamped directory under `tmp/`, keeping data and logs organized. Both raw and sanitized logs are saved for flexible analysis.

**Note:** 60 seconds allows the full sync cascade (Layer 1→2→3→4) to complete for late-discovered repos.

### Step 2: Analyze Logs

Scan for errors and unexpected patterns:
```bash
# Find the most recent run
LATEST_RUN=$(ls -1t tmp/run-*/sync.log | head -n1)
LATEST_RAW=$(ls -1t tmp/run-*/sync-raw.log | head -n1)

# Analyze sanitized log for quick scanning
grep -i error "$LATEST_RUN"
grep -i warn "$LATEST_RUN"
grep -i panic "$LATEST_RUN"

# If you find truncated messages, check the raw log for full details
grep "pattern from truncated message" "$LATEST_RAW"
```

### Step 3: Document Issues

Create **one markdown file per issue** in `work/active-issues/`:

```bash
# Example: Minimal issue documentation
cat > work/active-issues/bootstrap-disconnect.md <<'EOF'
# Bootstrap relay disconnects when empty

Bootstrap relay wss://git.shakespeare.diy disconnects after sync finds 0 events. Should persist since user-specified.

Log: "Disconnecting empty relay relay=wss://git.shakespeare.diy"
File: src/sync/mod.rs (check_disconnects function)
EOF
```

**Keep each file brief:**
- Descriptive title (one line)
- What happens (1-2 sentences max)
- Relevant log excerpt (one line)
- File/function location if obvious (one line)
- **NO** separate detailed analysis files
- **NO** root cause analysis
- **NO** proposed solutions (unless immediately obvious)

### Step 4: Report Summary

Provide a brief closing message with 1-2 sentence summary of **each issue** identified:
- State what the issue is
- Where it occurs (file/component)
- Keep it concise

**STOP HERE.** Do NOT:
- Fix the issues immediately
- Create separate detailed analysis markdown files
- Do thorough investigations
- Write lengthy explanations

The workflow will cycle back through Mode 1 to fix issues one at a time.

## Logging Improvements

If the logs aren't helpful enough, improve them. Common needs:

### Add Context to Existing Logs

```rust
// Before
tracing::debug!("Processing events");

// After
tracing::debug!(
    relay = %relay_url,
    event_count = events.len(),
    "Processing events"
);
```

### Add New Log Points

Key places that may need more logging:
- `src/sync/mod.rs` - SyncManager state transitions
- `src/sync/relay_connection.rs` - Connection lifecycle
- `src/sync/self_subscriber.rs` - Batch processing

### Reduce Noise

If a log line appears too frequently:
```rust
// Change from debug! to trace!
tracing::trace!("Per-event detail that's too noisy");
```

## Managing Active Issues

Issues are tracked in `work/active-issues/` as individual markdown files.

**Check for active issues:**
```bash
ls work/active-issues/
```

**After fixing an issue:**
```bash
# Delete the resolved issue file
rm work/active-issues/issue-name.md

# Or archive if important for future reference
mv work/active-issues/issue-name.md docs/archive/2026-01-09-issue-name.md
```

**Issue file format (minimal):**
```markdown
# Brief title

What happens (1-2 sentences).

Log evidence: "relevant log line"
File: src/path/to/file.rs (function_name if known)
```

Keep documentation minimal - just enough to identify and locate the issue.

---

## Workflow Summary

```
Check work/active-issues/
    │
    ├─ Has issues? ──► Mode 1: Pick one issue
    │                          │
    │                          ├─ Review & propose fix
    │                          ├─ Ask permission
    │                          ├─ Fix code (after approval)
    │                          ├─ cargo test
    │                          ├─ cargo clippy
    │                          ├─ cargo fmt
    │                          ├─ git commit
    │                          └─ Report & STOP
    │
    └─ No issues? ──► Mode 2: Run production sync
                               │
                               ├─ timeout 60s cargo run ...
                               ├─ Analyze logs
                               ├─ Document issues (minimal)
                               └─ Report summary & STOP
```

**Key Rules:**
- Only do ONE thing per cycle (fix one issue OR discover issues)
- Always stop after reporting
- Keep issue documentation minimal
- No root cause analysis during discovery

---

## Quick Reference

### Minimal Test Command

```bash
# Create run directory
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run test with both raw and sanitized logs (60s for full cascade)
timeout 60s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

### With Metrics Endpoint

```bash
# Create run directory
RUN_DIR="tmp/run-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RUN_DIR"

# Run with metrics and both log formats (60s for full cascade)
timeout 60s cargo run -- \
    --sync-bootstrap-relay-url wss://git.shakespeare.diy \
    --domain ngit.danconwaydev.com \
    --git-data-path "$RUN_DIR/git-data" \
    --relay-data-path "$RUN_DIR/relay-data" \
    --metrics-address 127.0.0.1:9090 \
    2>&1 | tee "$RUN_DIR/sync-raw.log" | ./scripts/sanitize-logs.sh | tee "$RUN_DIR/sync.log"
```

Then in another terminal: `curl http://127.0.0.1:9090/metrics`

### Cleanup Old Runs

```bash
# Remove runs older than 7 days
find tmp/run-* -type d -mtime +7 -exec rm -rf {} +

# Remove all test runs
rm -rf tmp/run-*
```

### Different Log Level

The default is DEBUG. For more detail:
```bash
RUST_LOG=trace cargo run -- [args]
```

For less noise:
```bash
RUST_LOG=info cargo run -- [args]
```

---

*Part of the [ngit-grasp documentation](../README.md) using the [Diátaxis](https://diataxis.fr/) framework.*