From 543d9e66dd44b70ed467c61635e6c8056fef8555 Mon Sep 17 00:00:00 2001 From: DanConwayDev Date: Thu, 8 Jan 2026 00:26:51 +0000 Subject: docs: update docs with sync and purgatory and git data sync --- .../grasp-02-proactive-sync-purgatory-git-data.md | 675 +++++++++++++++++++++ 1 file changed, 675 insertions(+) create mode 100644 docs/explanation/grasp-02-proactive-sync-purgatory-git-data.md (limited to 'docs/explanation/grasp-02-proactive-sync-purgatory-git-data.md') diff --git a/docs/explanation/grasp-02-proactive-sync-purgatory-git-data.md b/docs/explanation/grasp-02-proactive-sync-purgatory-git-data.md new file mode 100644 index 0000000..31c3e46 --- /dev/null +++ b/docs/explanation/grasp-02-proactive-sync-purgatory-git-data.md @@ -0,0 +1,675 @@ +# GRASP-02 Proactive Sync: Purgatory Git Data Fetching + +**Status**: ✅ Implemented +**Implementation**: [`src/purgatory/sync/`](../../src/purgatory/sync/) +**Related**: + +- [Purgatory Design](purgatory-design.md) - Core purgatory concepts +- [GRASP-02 Proactive Sync](grasp-02-proactive-sync.md) - Full GRASP-02 implementation +- [Unified Git Data Sync](unify-git-data-sync.md) - Shared processing logic + +--- + +## Overview + +When Nostr events arrive before their git data, they enter **purgatory** waiting to be served. But they don't wait passively: ngit-grasp **actively hunts** for the missing git data across all git servers associated with the repo until it finds what it needs or the events expire. + +### How It Works + +**If the data exists, we'll find it.** + +The system scours git servers listed in repository announcements and PR events, retrying with exponential backoff (20 seconds at first, at most every **2 minutes**) for up to **30 minutes**. If we find the data, events are released immediately. If not, they expire from purgatory after 30 minutes. 
+ +**Smart timing based on how events arrive:** + +- **User-submitted events**: Wait **3 minutes** before hunting, since we expect a `git push` to follow shortly +- **Sync-received events**: Start hunting after just **500ms**, long enough to batch burst arrivals + +**Playing nicely with other servers:** + +We respect remote server capacity with: + +- **Throttling**: Max 5 concurrent requests per domain, 30 requests/minute +- **Backoff**: Start at 20 seconds, double each attempt, cap at 2 minutes +- **Round-robin**: Fair distribution across repositories waiting for the same domain +- **Fresh start**: New events reset the retry count, since recent updates often mean fresh data + +**The result**: If git data is available anywhere in the clone URL list, we'll find it within minutes. If it's not available within 30 minutes, the events expire cleanly. + +### Key Features + +✅ **Proactive hunting** - Retries git servers with backoff (20s up to 2 min), finds data automatically +✅ **Respectful throttling** - 5 concurrent + 30/min per domain, plays nice with other implementations +✅ **Smart timing** - 3min delay for user pushes, 500ms for synced events +✅ **30min expiry** - Auto-cleanup of events when data never arrives +✅ **Fully testable** - Mock-based architecture for reliable unit tests + +--- + +## The Problem: Out-of-Order Arrival + +In a distributed system, git data and Nostr events can arrive in any order: + +``` +Timeline A: Event arrives first (user push expected) + t=0s: State event received → enters purgatory (3min wait, expecting git push) + t=30s: Git push arrives → event released ✅ (before the 3min wait ends) + +Timeline B: Git arrives first + t=0s: Git push received → data available + t=30s: State event received → immediately served ✅ + +Timeline C: Sync scenario (hunt for data) + t=0s: State event received from relay X → enters purgatory + t=0.5s: (500ms delay to batch bursts) + t=0.5s: Start hunting git servers → check server1, server2, server3... 
+ t=45s: Git data found on server2 → event released ✅ + +Timeline D: Data never arrives + t=0s: State event received → enters purgatory + t=0.5s: Start hunting → server1 (not found), server2 (timeout), server3 (not found) + t=20s: Retry → server1 (not found), server2 (not found), server3 (not found) + t=60s: Retry → all servers checked, no data + ... + t=1800s: 30 minutes expired → event discarded, purgatory cleaned up 🗑️ +``` + +**Without proactive sync**: Events in Timeline C would wait indefinitely (or until manual git push). +**With proactive sync**: System automatically hunts for data across all known servers, releasing events as soon as the data is found. + +--- + +## Architecture: Two-Path Sync Design + +The system uses **two independent execution paths** that work together: + +### Path 1: Main Sync Loop (Non-Throttled URLs) + +Runs every **1 second**, processes identifiers ready for sync: + +1. Find ready identifiers (where `!in_progress && next_attempt <= now`) +2. Spawn parallel tasks for each identifier +3. Each task tries non-throttled URLs until: + - ✅ All OIDs fetched (complete) → remove from queue + - ⏸️ Only throttled URLs remain → enqueue with throttled domains, apply backoff + - ❌ No URLs left (all tried/throttled) → apply backoff, retry later + +**Key insight**: Main loop doesn't wait for throttled domains. It quickly tries available servers, then hands off to domain queues for rate-limited processing. + +### Path 2: Domain Throttle Queues (Throttled URLs) + +**Trigger-based** (no polling), processes when capacity frees: + +1. Identifier enqueued with throttled domain (from main loop) +2. When domain has capacity (slot frees or rate limit window passes): + - Pick next identifier (round-robin for fairness) + - Try one URL from that domain + - Mark URL as tried, release slot +3. 
Trigger repeats until queue empty or capacity exhausted + +**Key insight**: Each domain independently manages its queue, ensuring we respect rate limits while maximizing throughput. + +--- + +## Data Flow: From Event to Release + +```mermaid +graph TB + A[Event Arrives] --> B{Git Data<br/>Available?} + B -->|Yes| C[Serve Immediately] + B -->|No| D[Enter Purgatory] + + D --> E[Enqueue for Sync] + E --> F{Event Source?} + F -->|User Submit| G[3min Delay<br/>expect push] + F -->|Relay Sync| H[500ms Delay<br/>batch burst] + + G --> I[Main Sync Loop<br/>1s interval] + H --> I + + I --> J{Ready?} + J -->|Not Yet| I + J -->|Yes| K[Spawn Sync Task] + + K --> L[Try Non-Throttled URLs] + L --> M{Got All OIDs?} + M -->|Yes| N[Process & Release] + M -->|Partial| O[Enqueue Throttled Domains] + M -->|None| P[Apply Backoff] + + O --> Q[Domain Queue] + Q --> R{Has Capacity?} + R -->|No| Q + R -->|Yes| S[Try Domain URL] + S --> T{Got OIDs?} + T -->|Yes| N + T -->|No| U[Try Next in Queue] + + P --> I + N --> V[Event Served] + + style D fill:#fff3cd + style N fill:#d4edda + style V fill:#d1ecf1 +``` + +--- + +## Retry Strategy: Exponential Backoff with Fresh Start + +### Backoff Schedule + +When sync attempts don't complete (OIDs still needed), backoff increases: + +| Attempt | Delay | Formula | +| ------- | ------------- | ---------------------- | +| 1 | 20s | `20s * 2^0` | +| 2 | 40s | `20s * 2^1` | +| 3 | 80s | `20s * 2^2` | +| 4+ | 120s (capped) | `min(20s * 2^n, 120s)` | + +**Implementation**: [`src/purgatory/sync/queue.rs:SyncQueueEntry::backoff()`](../../src/purgatory/sync/queue.rs) + +### Fresh Start on New Events + +**Critical feature**: When a new event arrives for an identifier already in the sync queue, the `attempt_count` resets to 0. + +**Why?** New events often mean: + +- A maintainer just updated the repository +- Fresh git data might be available at new clone URLs +- Previous failures might have been temporary + +**Example**: + +``` +t=0s: State A arrives → queue with 3min delay, attempt_count=0 +t=180s: First sync attempt fails → backoff 20s, attempt_count=1 +t=200s: Second attempt fails → backoff 40s, attempt_count=2 +t=210s: State B arrives (same identifier) → attempt_count=0 ✨ +t=210s: Immediate retry (new event delay) → success! 
+``` + +--- + +## Debounced Delays: Smart Timing + +### User-Submitted Events: 3 Minutes + +When a user submits an event via `EVENT` message, we expect a `git push` to follow shortly: + +``` +t=0s: User submits state event → purgatory + 3min delay +t=30s: User runs `git push` → data arrives → event released ✅ +``` + +**Why 3 minutes?** Gives users time to: + +- Finish composing their commit message +- Run `git push` command +- Handle network delays + +**Configuration**: Hardcoded in [`src/purgatory/mod.rs:DEFAULT_SYNC_DELAY`](../../src/purgatory/mod.rs) + +### Sync-Triggered Events: 500ms + +When events arrive during relay sync (e.g., negentropy catchup), they often come in bursts: + +``` +t=0s: State A arrives → purgatory + 500ms delay +t=0.1s: State B arrives → purgatory + 500ms delay (same repo) +t=0.2s: State C arrives → purgatory + 500ms delay (same repo) +t=0.5s: Single sync attempt fetches data for all three ✅ +``` + +**Why 500ms?** Batches burst arrivals without excessive delay. + +**Configuration**: Hardcoded in [`src/purgatory/mod.rs:IMMEDIATE_SYNC_DELAY`](../../src/purgatory/mod.rs) + +### Debouncing Mechanism + +Multiple events for the same identifier **don't create multiple sync tasks**. The `enqueue_sync` method: + +1. If identifier not in queue → create new entry with delay +2. If identifier already queued → reset `attempt_count`, update `next_attempt` if sooner + +**Result**: Rapid event arrivals → single sync attempt after debounce window. + +**Implementation**: [`src/purgatory/mod.rs:Purgatory::enqueue_sync()`](../../src/purgatory/mod.rs) + +--- + +## Domain Throttling: Respectful Rate Limiting + +### Why Throttle? + +Git servers have finite resources. 
Without throttling: + +- ❌ We could overwhelm small servers with concurrent requests +- ❌ Servers might rate-limit or ban us +- ❌ Other clients sharing the server suffer degraded performance + +With throttling: + +- ✅ Respect server capacity (5 concurrent max per domain) +- ✅ Stay under rate limits (30 requests/min per domain) +- ✅ Fair access for all clients + +### Two-Level Limits + +Each domain has **two independent limits**: + +#### 1. Concurrent Request Limit (Default: 5) + +Maximum in-flight requests to a domain at any moment. + +**Example**: + +``` +Domain: github.com +In-flight: [fetch-1, fetch-2, fetch-3, fetch-4, fetch-5] +Status: AT CAPACITY (throttled) + +fetch-3 completes → in-flight: 4 +Status: HAS CAPACITY (process next queued identifier) +``` + +#### 2. Rate Limit (Default: 30/min) + +Maximum requests in any 60-second sliding window. + +**Example**: + +``` +t=0s: Request 1 → request_times: [0s] +t=1s: Request 2 → request_times: [0s, 1s] +... +t=29s: Request 30 → request_times: [0s, 1s, ..., 29s] +t=30s: Request 31? → THROTTLED (30 requests in last 60s) +t=60s: Request at t=0s aged out → request_times: [1s, ..., 29s] +t=60s: Request 31 → ALLOWED (only 29 in last 60s) +``` + +**Implementation**: [`src/purgatory/sync/throttle.rs:DomainThrottle::has_capacity()`](../../src/purgatory/sync/throttle.rs) + +### Round-Robin Fairness + +When multiple identifiers are queued for a throttled domain, we use **round-robin** to ensure fairness: + +``` +Queue: [repo-A, repo-B, repo-C] +Round-robin index: 0 + +Attempt 1: Try repo-A (index=0) → fetch → index=1 +Attempt 2: Try repo-B (index=1) → fetch → index=2 +Attempt 3: Try repo-C (index=2) → fetch → index=0 +Attempt 4: Try repo-A (index=0) → ... +``` + +**Why round-robin?** Prevents head-of-line blocking. Without it, repo-A might consume all slots while repo-B and repo-C wait indefinitely. 
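The two limits and the round-robin cursor described above could be combined roughly as in the following sketch. It is not the code in `throttle.rs`: the type `DomainThrottleSketch` and the methods `start_request`, `enqueue`, and `next_identifier` are hypothetical names chosen for illustration, and the sketch assumes a request leaves the rate-limit window once it is at least 60 seconds old.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Illustrative per-domain throttle (hypothetical; not the real `DomainThrottle`).
pub struct DomainThrottleSketch {
    max_concurrent: usize,            // e.g. 5 in-flight requests
    max_per_window: usize,            // e.g. 30 requests per window
    window: Duration,                 // sliding window length (60s in the text)
    in_flight: usize,                 // requests currently running
    request_times: VecDeque<Instant>, // start times of recent requests, oldest first
    queue: Vec<String>,               // identifiers waiting on this domain
    rr_index: usize,                  // next queue position to serve
}

impl DomainThrottleSketch {
    pub fn new(max_concurrent: usize, max_per_window: usize) -> Self {
        Self {
            max_concurrent,
            max_per_window,
            window: Duration::from_secs(60),
            in_flight: 0,
            request_times: VecDeque::new(),
            queue: Vec::new(),
            rr_index: 0,
        }
    }

    /// Both limits must allow another request at `now`.
    pub fn has_capacity(&mut self, now: Instant) -> bool {
        // Forget requests that have aged out of the sliding window.
        while let Some(&oldest) = self.request_times.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                self.request_times.pop_front();
            } else {
                break;
            }
        }
        self.in_flight < self.max_concurrent && self.request_times.len() < self.max_per_window
    }

    /// Claim a slot; returns false when the domain is throttled.
    pub fn start_request(&mut self, now: Instant) -> bool {
        if !self.has_capacity(now) {
            return false;
        }
        self.in_flight += 1;
        self.request_times.push_back(now);
        true
    }

    /// Release the concurrency slot (the timestamp stays in the rate window).
    pub fn complete_request(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }

    /// Add an identifier to the domain queue unless it is already waiting.
    pub fn enqueue(&mut self, identifier: &str) {
        if !self.queue.iter().any(|queued| queued == identifier) {
            self.queue.push(identifier.to_string());
        }
    }

    /// Round-robin: return the identifier at the cursor, then advance the cursor.
    pub fn next_identifier(&mut self) -> Option<String> {
        if self.queue.is_empty() {
            return None;
        }
        let index = self.rr_index % self.queue.len();
        self.rr_index = (index + 1) % self.queue.len();
        Some(self.queue[index].clone())
    }
}
```

Passing `now` explicitly rather than calling `Instant::now()` inside the methods is a choice made for this sketch so the window logic can be exercised without sleeping; the real implementation may obtain the time differently.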
+ +**Implementation**: [`src/purgatory/sync/throttle.rs:DomainThrottle::next_ready_identifier()`](../../src/purgatory/sync/throttle.rs) + +### Trigger-Based Processing (Not Polling) + +Domain queues **don't poll** for capacity. Instead, processing is triggered by two events: + +1. **`complete_request()`** - A request finishes, slot frees +2. **`enqueue_identifier()`** - New identifier added to queue + +Both methods check `has_capacity()` and trigger `try_process_next()` if true. + +**Why trigger-based?** + +- ✅ Lower CPU usage (no busy-waiting) +- ✅ Instant response when capacity frees +- ✅ Simpler reasoning (event-driven) + +**Implementation**: [`src/purgatory/sync/throttle.rs:ThrottleManager`](../../src/purgatory/sync/throttle.rs) + +--- + +## 30-Minute Purgatory Expiry + +Purgatory entries **automatically expire** after 30 minutes to prevent unbounded memory growth. + +### Why 30 Minutes? + +From the [GRASP-01 spec](https://github.com/DanConwayDev/grasp/blob/main/01.md#purgatory): + +> Events should be kept in purgatory and otherwise discarded after 30 minutes. + +This balances: + +- ⏰ **Long enough** for typical sync scenarios (git data usually arrives within minutes) +- 🧹 **Short enough** to prevent memory leaks from abandoned events +- 🔄 **Recoverable** events are still on other relays and can be re-submitted + +### Implementation + +Each purgatory entry tracks: + +- `created_at: Instant` - When added to purgatory +- `expires_at: Instant` - When to discard (created_at + 30min) + +The main sync loop checks expiry before processing: + +```rust +if !self.has_pending_events(&identifier) { + // No events remain (expired or released) → remove from sync queue + self.sync_queue.remove(&identifier); +} +``` + +**Note**: Expiry is checked implicitly via `has_pending_events()`. If all events for an identifier have expired, the identifier is removed from the sync queue. 
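As a rough illustration of the expiry bookkeeping described above, the sketch below stores `created_at` and `expires_at` per entry and treats an identifier as pending only while at least one entry is unexpired. The names `PurgatorySketch`, `PurgatoryEntry`, `add_event`, and `prune_expired` are hypothetical, and `has_pending_events` here takes an explicit `now` argument (unlike the trait method shown later in this document) purely so the example is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// 30-minute expiry, matching the value quoted from GRASP-01 above.
pub const EXPIRY: Duration = Duration::from_secs(30 * 60);

/// Hypothetical purgatory entry; the real struct likely carries the event itself.
pub struct PurgatoryEntry {
    pub event_id: String,
    pub created_at: Instant,
    pub expires_at: Instant,
}

/// Hypothetical container mapping repository identifier to waiting events.
#[derive(Default)]
pub struct PurgatorySketch {
    pub entries: HashMap<String, Vec<PurgatoryEntry>>,
}

impl PurgatorySketch {
    /// Record an event for `identifier`, stamping its expiry at creation time.
    pub fn add_event(&mut self, identifier: &str, event_id: &str, now: Instant) {
        self.entries
            .entry(identifier.to_string())
            .or_default()
            .push(PurgatoryEntry {
                event_id: event_id.to_string(),
                created_at: now,
                expires_at: now + EXPIRY,
            });
    }

    /// True while at least one event for `identifier` has not yet expired.
    pub fn has_pending_events(&self, identifier: &str, now: Instant) -> bool {
        self.entries
            .get(identifier)
            .map_or(false, |events| events.iter().any(|e| e.expires_at > now))
    }

    /// Drop expired events and empty identifiers; returns how many events were removed.
    pub fn prune_expired(&mut self, now: Instant) -> usize {
        let mut removed = 0;
        self.entries.retain(|_, events| {
            let before = events.len();
            events.retain(|e| e.expires_at > now);
            removed += before - events.len();
            !events.is_empty()
        });
        removed
    }
}
```

With a structure like this, the sync loop only needs the boolean answer from `has_pending_events` to decide whether an identifier still belongs in the sync queue.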
+ +**Implementation**: [`src/purgatory/mod.rs:DEFAULT_EXPIRY`](../../src/purgatory/mod.rs) + +--- + +## Testability: Mock-Based Architecture + +A key design goal was **100% unit test coverage** without requiring real git servers or databases. + +### SyncContext Trait + +All external dependencies are abstracted behind the `SyncContext` trait: + +```rust +#[async_trait] +pub trait SyncContext: Send + Sync { + async fn fetch_repository_data(&self, identifier: &str) -> Result; + fn collect_needed_oids(&self, identifier: &str) -> HashSet; + async fn oid_exists(&self, repo_path: &Path, oid: &str) -> bool; + async fn fetch_oids(&self, repo_path: &Path, url: &str, oids: &[String]) -> Result>; + async fn process_newly_available_git_data(&self, ...) -> Result; + fn has_pending_events(&self, identifier: &str) -> bool; + fn find_target_repo(&self, data: &RepositoryData) -> Option; + fn our_domain(&self) -> Option<&str>; +} +``` + +**Two Implementations**: + +1. **`RealSyncContext`** - Production implementation connecting to real systems +2. 
**`MockSyncContext`** - Test implementation with configurable behavior + +### MockSyncContext Features + +The mock supports builder-pattern configuration: + +```rust +let mock = MockSyncContext::new() + .with_repository_data("test-repo", RepositoryData { + announcements: vec![...], + clone_urls: vec!["https://server1.com/repo.git".to_string()], + }) + .with_needed_oids("test-repo", hashset!["abc123", "def456"]) + .with_fetch_result("https://server1.com/repo.git", Ok(vec!["abc123"])) + .with_fetch_result("https://server2.com/repo.git", Ok(vec!["def456"])); +``` + +**Test Example** (from [`src/purgatory/sync/functions.rs`](../../src/purgatory/sync/functions.rs)): + +```rust +#[tokio::test] +async fn test_sync_identifier_partial_success() { + let mock = MockSyncContext::new() + .with_repository_data("repo", RepositoryData { + clone_urls: vec![ + "https://server1.com/repo.git".to_string(), + "https://server2.com/repo.git".to_string(), + ], + ..Default::default() + }) + .with_needed_oids("repo", hashset!["oid1", "oid2"]) + .with_fetch_result("https://server1.com/repo.git", Ok(vec!["oid1"])) + .with_fetch_result("https://server2.com/repo.git", Ok(vec!["oid2"])); + + let throttle = Arc::new(ThrottleManager::new(5, 30)); + let complete = sync_identifier(&mock, "repo", &throttle).await; + + assert!(complete); // Both OIDs fetched +} +``` + +**Why this matters**: + +- ✅ Tests run **instantly** (no network I/O) +- ✅ Tests are **deterministic** (no flaky failures) +- ✅ Tests cover **edge cases** easily (network errors, partial success, etc.) 
+- ✅ Tests are **isolated** (no shared state between tests) + +**Implementation**: [`src/purgatory/sync/context.rs:MockSyncContext`](../../src/purgatory/sync/context.rs) + +--- + +## Configuration + +Purgatory sync settings are currently fixed at the defaults below; CLI flags and environment variables are planned only for the throttle limits: + +| Setting | CLI Flag | Environment Variable | Default | Description | +| ----------------------- | -------- | -------------------- | ------- | ---------------------------------------------------- | +| Domain concurrent limit | (future) | (future) | `5` | Max concurrent requests per domain | +| Domain rate limit | (future) | (future) | `30` | Max requests per minute per domain | +| Sync loop interval | N/A | N/A | `1s` | How often to check for ready identifiers (hardcoded) | +| Default sync delay | N/A | N/A | `180s` | Delay for user-submitted events (hardcoded) | +| Immediate sync delay | N/A | N/A | `500ms` | Delay for sync-triggered events (hardcoded) | +| Purgatory expiry | N/A | N/A | `30min` | How long events wait before expiring (hardcoded) | + +**Note**: Currently, throttle limits and delays are hardcoded constants. Future work may expose these as configuration options if needed. + +--- + +## Key Design Decisions + +### 1. Identifier-Based, Not Event-Based + +**Decision**: Sync by repository identifier, not individual events. + +**Rationale**: Multiple events for the same repository should trigger a single fetch operation, not N separate fetches. + +**Impact**: Batches events efficiently, reduces server load. + +### 2. Two Separate `tried_urls` Tracking + +**Decision**: Main sync loop and domain queues track tried URLs independently. 
+ +**Main sync**: Local `HashSet` for current attempt (all domains) +**Domain queue**: Per-identifier `HashSet` for this domain only + +**Rationale**: + +- Main sync skips throttled domains entirely (doesn't need their tried URLs) +- Domain queue only cares about URLs from its own domain +- No coordination needed → simpler code + +**Impact**: Clean separation of concerns, easier to reason about. + +### 3. Trigger-Based Domain Processing + +**Decision**: Domain queues process on triggers (capacity freed, new enqueue), not polling. + +**Rationale**: + +- Polling wastes CPU cycles checking capacity every interval +- Triggers provide instant response when capacity frees +- Event-driven design is easier to test and debug + +**Impact**: Lower CPU usage, faster response times. + +### 4. Fresh Start on New Events + +**Decision**: Reset `attempt_count` to 0 when new events arrive for an identifier. + +**Rationale**: + +- New events often mean fresh git data is available +- Previous failures might have been temporary +- Gives repositories a "second chance" without waiting for full backoff + +**Impact**: Faster recovery from transient failures, better UX. + +### 5. OID Copying in `process_newly_available_git_data` + +**Decision**: Copy OIDs and release events **per successful fetch**, not at end of sync. + +**Rationale**: + +- Events can be released as soon as their specific OIDs are available +- Partial success scenarios work correctly (some events release, others stay) +- Handles multiple state events for same identifier independently + +**Impact**: Events release faster, better handling of partial success. 
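Decisions 1 and 4 (one queue entry per identifier, reset on new events) interact with the backoff schedule roughly as in this sketch. It is an illustration under assumptions, not the code in `queue.rs` or `mod.rs`: `SyncQueueEntrySketch`, `record_failure`, and the free function `enqueue_sync` with its `HashMap` argument are hypothetical stand-ins for `SyncQueueEntry` and `Purgatory::enqueue_sync()`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const BASE_BACKOFF: Duration = Duration::from_secs(20);
const MAX_BACKOFF: Duration = Duration::from_secs(120);

/// Hypothetical queue entry: one per repository identifier, not per event.
pub struct SyncQueueEntrySketch {
    pub attempt_count: u32,
    pub next_attempt: Instant,
}

impl SyncQueueEntrySketch {
    /// Delay after `attempt_count` failed attempts: 20s, 40s, 80s, then capped at 120s.
    pub fn backoff(&self) -> Duration {
        // Clamp the exponent so the multiplication cannot overflow.
        let exponent = self.attempt_count.saturating_sub(1).min(16);
        (BASE_BACKOFF * 2u32.pow(exponent)).min(MAX_BACKOFF)
    }

    /// Count a failed attempt and schedule the next one.
    pub fn record_failure(&mut self, now: Instant) {
        self.attempt_count += 1;
        self.next_attempt = now + self.backoff();
    }
}

/// Debounced enqueue: a new event never creates a second entry for the same
/// identifier. It resets the retry count and only ever moves the next attempt earlier.
pub fn enqueue_sync(
    queue: &mut HashMap<String, SyncQueueEntrySketch>,
    identifier: &str,
    delay: Duration,
    now: Instant,
) {
    let candidate = now + delay;
    queue
        .entry(identifier.to_string())
        .and_modify(|entry| {
            entry.attempt_count = 0; // fresh start
            entry.next_attempt = entry.next_attempt.min(candidate);
        })
        .or_insert(SyncQueueEntrySketch {
            attempt_count: 0,
            next_attempt: candidate,
        });
}
```

Keying the queue by identifier is what makes the debounce cheap here: a burst of events collapses into updates of a single entry rather than a growing list of tasks.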
+ +--- + +## Observability + +### Logging + +Sync operations produce structured logs at different levels: + +**INFO**: Major events + +``` +Starting purgatory sync loop (interval: 1s) +Sync complete - removed from sync queue (identifier=test-repo, complete=true) +``` + +**DEBUG**: Detailed progress + +``` +Added new sync queue entry (identifier=test-repo, delay_secs=180) +Starting sync task for identifier (identifier=test-repo) +Sync incomplete - applying backoff (identifier=test-repo, attempt_count=2, next_backoff_secs=40) +``` + +**WARN**: Errors and failures + +``` +Failed to fetch OIDs (url=https://server.com/repo.git, error=connection timeout) +``` + +### Metrics (Future) + +Planned Prometheus metrics for observability: + +- `purgatory_sync_queue_size` - Number of identifiers pending sync +- `purgatory_sync_attempts_total{identifier}` - Total sync attempts per identifier +- `purgatory_sync_oids_fetched_total{identifier}` - OIDs successfully fetched +- `purgatory_domain_in_flight{domain}` - Current in-flight requests per domain +- `purgatory_domain_requests_total{domain}` - Total requests per domain + +--- + +## Testing Strategy + +### Unit Tests + +Core sync functions have comprehensive unit tests using `MockSyncContext`: + +**`sync_identifier_next_url`** (3 tests): + +- Skips throttled domains +- Skips tried URLs +- Returns None when all URLs exhausted + +**`sync_identifier_from_url`** (2 tests): + +- Successful fetch triggers processing +- Failed fetch doesn't trigger processing + +**`sync_identifier`** (3 tests): + +- Tries multiple URLs until complete +- Enqueues throttled domains when incomplete +- Handles partial success correctly + +**`SyncQueueEntry`** (3 tests): + +- Backoff calculation correct +- Fresh start on new events +- Ready state logic correct + +**`DomainThrottle`** (4 tests): + +- Concurrent limit enforced +- Rate limit enforced +- Round-robin fairness +- Queue management correct + +**Total**: 15+ unit tests covering all core logic + 
+**Location**: [`src/purgatory/sync/`](../../src/purgatory/sync/) (various `#[cfg(test)]` modules) + +### Integration Tests + +End-to-end tests verify sync behavior with real relay instances: + +**Planned tests**: + +- State event syncs from remote server +- PR event syncs from remote server +- Partial OID aggregation across multiple servers +- Throttling prevents overwhelming servers +- Backoff retry after failures + +**Location**: [`tests/purgatory_sync.rs`](../../tests/purgatory_sync.rs) (planned) + +--- + +## Future Enhancements + +### 1. Configurable Throttle Limits + +**Current**: Hardcoded to 5 concurrent, 30/min per domain +**Future**: CLI flags `--sync-domain-concurrent` and `--sync-domain-rate-limit` + +**Use case**: Operators might want stricter limits for public servers or looser limits for trusted servers. + +### 2. Per-Domain Throttle Configuration + +**Current**: Same limits for all domains +**Future**: Domain-specific overrides (e.g., `github.com:10,60` for higher limits) + +**Use case**: Popular forges like GitHub/GitLab can handle more load than small personal servers. + +### 3. Prometheus Metrics + +**Current**: Structured logging only +**Future**: Export metrics for monitoring dashboards + +**Use case**: Operators want visibility into sync performance, throttle effectiveness, success rates. + +### 4. Negentropy Integration + +**Current**: Sync triggered by event arrival +**Future**: Proactive sync discovers missing events via negentropy + +**Use case**: Catch up with repositories after downtime without waiting for event re-submission. 
+ +--- + +## Related Documentation + +- **[Purgatory Design](purgatory-design.md)** - Core purgatory concepts and event flows +- **[GRASP-02 Proactive Sync](grasp-02-proactive-sync.md)** - Full GRASP-02 implementation (relay sync) +- **[Unified Git Data Sync](unify-git-data-sync.md)** - Shared processing for push and sync paths +- **[Architecture Overview](architecture.md)** - System-wide architecture + +--- + +## Summary + +The purgatory sync system is a sophisticated, production-ready implementation that: + +✅ **Batches intelligently** - Groups events by identifier for efficient fetching +✅ **Retries smartly** - Exponential backoff with fresh start on new events +✅ **Throttles respectfully** - 5 concurrent + 30/min per domain, round-robin fairness +✅ **Times strategically** - 3min for user events, 500ms for synced events +✅ **Expires responsibly** - 30min auto-cleanup prevents memory leaks +✅ **Tests thoroughly** - Mock-based architecture enables comprehensive unit tests + +This design ensures ngit-grasp can serve repositories reliably even when git data and Nostr events arrive out-of-order or from different sources, while respecting remote server capacity and providing excellent observability. -- cgit v1.2.3