diff options
Diffstat (limited to 'docs')
| -rw-r--r-- | docs/explanation/grasp-02-proactive-sync.md | 402 |
1 files changed, 160 insertions, 242 deletions
diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md index 64193d3..f13050e 100644 --- a/docs/explanation/grasp-02-proactive-sync.md +++ b/docs/explanation/grasp-02-proactive-sync.md | |||
| @@ -2,17 +2,44 @@ | |||
| 2 | 2 | ||
| 3 | ## Overview | 3 | ## Overview |
| 4 | 4 | ||
| 5 | This document explains the proactive sync system that synchronizes repository data from external relays based on relay URLs listed in 30617 repository announcements. Key principles: | 5 | Proactively Sync Nostr Events from other relays listed in accepted repository announcements. |
| 6 | |||
| 7 | Features: | ||
| 8 | |||
| 9 | - Fetches all repository announcements from connected relays to discover new repos listing our service | ||
| 10 | - Discovers and dynamically connects to new relays listed by repository announcements we have accepted (with optional bootstrap relay to get started) | ||
| 11 | - Fetches events tagging repositories we are interested in, as well as events tagging Issues, Patches and PRs of these repositories | ||
| 12 | - Supports live sync and historic sync (tries NIP-77 negentropy but falls back to REQ+EOSE with 'until' based pagination) | ||
| 13 | - Plays nicely with other relays - connection backoff and rate-limiting detection with cooldown | ||
| 14 | - Does a full reconciliation daily | ||
| 15 | - Prometheus metrics | ||
| 16 | |||
| 17 | Key Architectural Points: | ||
| 18 | |||
| 19 | - **Simple data model** for tracking target, pending and actual filter state against relays | ||
| 20 | - **Self-subscription** enables a deduplicated feed of all accepted events which leads to an updated target sync state | ||
| 21 | - **Clear separation** between Live sync (using `limit:0`) and Historic Sync (handled via negentropy falling back to REQ+EOSE with 'until' based pagination support) | ||
| 22 | - **Discovery management**: The nature of discovery inherently leads to a drip feed of root_events (e.g., Repo Announcements, Issues, Patches and PRs) that require additional subscriptions. Without careful management this can lead to large numbers of subscriptions and potentially rate limiting. Mitigation strategies: | ||
| 23 | - Self-subscriber waits for 5s to batch updates before creating new filters / subscriptions, allowing time for most events to be received from outstanding subscriptions from connected relays | ||
| 24 | - PendingBatch tracks each new set of filters that may require pagination until they are complete | ||
| 25 | - Avoid long awaits - recompute desired filters when connection is established to ensure filters are as consolidated as possible | ||
| 26 | - Consolidation function ensures number of live_sync subscriptions don't reach rate-limiting limits (threshold: 70 filters) | ||
| 27 | - **Quick Reconnect** (< 15mins) - doesn't do a full reconciliation vs fresh start (longer disconnect or relaunch binary) | ||
| 28 | - **Background timers** handle relay connection health and metrics, handling reconnects after backoff and recovery after rate-limiting | ||
| 29 | |||
| 30 | Sections: | ||
| 31 | |||
| 32 | - Data Model | ||
| 33 | - Connection Lifecycle | ||
| 34 | - Live vs Historic Sync | ||
| 35 | - Triggers and Flow | ||
| 36 | - Background Tasks | ||
| 6 | 37 | ||
| 7 | 1. **Two paths to AddFilters → handle_new_sync_filters** - Self-subscriber sends directly via channel; connect/reconnect uses `recompute_new_sync_filters_for_relay` | 38 | ## Data Model |
| 8 | 2. **Clear separation of live vs historic sync** - Two distinct primitives with different purposes | ||
| 9 | 3. **Layer 1 on connect, Layer 2+3 via AddFilters** - L1 handled at connection time, L2+L3 flow through AddFilters | ||
| 10 | 4. **Always clear PendingSyncIndex first** - Before any reconnect/consolidate operation | ||
| 11 | 5. **NIP-77 negentropy for historical sync** - Efficient set reconciliation, fallback to REQ if unsupported | ||
| 12 | 39 | ||
| 13 | --- | 40 | The state of which relays we want to connect to, the progress of historic sync, and the active live filters is captures in this simple data model. |
| 14 | 41 | ||
| 15 | ## Data Model | 42 | This state starts afresh when the binary loads. |
| 16 | 43 | ||
| 17 | ### RepoSyncIndex (Source of Truth) | 44 | ### RepoSyncIndex (Source of Truth) |
| 18 | 45 | ||
| @@ -26,7 +53,7 @@ pub type RepoSyncIndex = Arc<RwLock<HashMap<String, RepoSyncNeeds>>>; | |||
| 26 | pub struct RepoSyncNeeds { | 53 | pub struct RepoSyncNeeds { |
| 27 | /// Relay URLs listed in this repo's 30617 announcement | 54 | /// Relay URLs listed in this repo's 30617 announcement |
| 28 | pub relays: HashSet<String>, | 55 | pub relays: HashSet<String>, |
| 29 | /// Root event IDs - 1617/1618/1619/1621 - that reference this repo | 56 | /// Root event IDs - 1617/1618/1621 - that reference this repo |
| 30 | pub root_events: HashSet<EventId>, | 57 | pub root_events: HashSet<EventId>, |
| 31 | } | 58 | } |
| 32 | ``` | 59 | ``` |
| @@ -180,13 +207,13 @@ stateDiagram-v2 | |||
| 180 | 207 | ||
| 181 | ### Connection Flow Methods | 208 | ### Connection Flow Methods |
| 182 | 209 | ||
| 183 | | Method | Purpose | When Called | Actions | | 210 | | Method | Purpose | When Called | Actions | |
| 184 | | ------------------------------- | ------------------------- | ------------------------------- | --------------------------------------------------------------- | | 211 | | ------------------------------- | ------------------------- | --------------------------------- | --------------------------------------------------------------- | |
| 185 | | `register_relay()` | Initialize relay tracking | Discovery via RepoSyncIndex | Creates RelayConnection, stores in HashMap, returns immediately | | 212 | | `register_relay()` | Initialize relay tracking | Discovery via RepoSyncIndex | Creates RelayConnection, stores in HashMap, returns immediately | |
| 186 | | `try_connect_relay()` | Attempt connection | Periodic retry (500ms) | Calls connect_and_subscribe, sends notification on success | | 213 | | `try_connect_relay()` | Attempt connection | Health tracker allows retry | Calls connection.connect(), sends notification on success | |
| 187 | | `handle_connect_or_reconnect()` | Setup after connection | ConnectNotification received | Spawns event loop, updates state, decides sync strategy | | 214 | | `handle_connect_or_reconnect()` | Setup after connection | ConnectNotification received | Spawns event loop, updates state, decides sync strategy | |
| 188 | | `handle_disconnect()` | Cleanup after disconnect | DisconnectNotification received | Updates state, clears pending, KEEPS RelayConnection | | 215 | | `handle_disconnect()` | Cleanup after disconnect | DisconnectNotification received | Updates state, clears pending, KEEPS RelayConnection | |
| 189 | | `retry_disconnected_relays()` | Periodic reconnection | Every 500ms | For each ready relay: try_connect_relay() | | 216 | | `retry_disconnected_relays()` | Periodic reconnection | Every 2s (health & metrics timer) | For each ready relay: try_connect_relay() | |
| 190 | 217 | ||
| 191 | ### Event Loop Lifecycle | 218 | ### Event Loop Lifecycle |
| 192 | 219 | ||
| @@ -212,6 +239,53 @@ flowchart LR | |||
| 212 | 239 | ||
| 213 | --- | 240 | --- |
| 214 | 241 | ||
| 242 | ## Background Tasks | ||
| 243 | |||
| 244 | The sync system uses three background tasks that run continuously: | ||
| 245 | |||
| 246 | ### 1. Daily Timer (`run_daily_timer`) | ||
| 247 | |||
| 248 | **Purpose**: Periodic full reconciliation to detect state drift | ||
| 249 | |||
| 250 | **Interval**: Random 23-25 hours (prevents thundering herd) | ||
| 251 | |||
| 252 | **Actions**: | ||
| 253 | |||
| 254 | - Triggers `daily_sync()` for all connected relays | ||
| 255 | - Same as `fresh_start()` but without recording disconnect metrics | ||
| 256 | - Ensures consistency over time | ||
| 257 | |||
| 258 | ### 2. Health and Metrics Checker (`run_health_and_metrics_checker`) | ||
| 259 | |||
| 260 | **Purpose**: Combined health management and metrics updates | ||
| 261 | |||
| 262 | **Interval**: 2 seconds | ||
| 263 | |||
| 264 | **Actions**: | ||
| 265 | |||
| 266 | 1. **Disconnect checking**: Calls `check_disconnects()` to remove relays with no repos/events (except bootstrap) | ||
| 267 | 2. **Retry disconnected**: Calls `retry_disconnected_relays()` to attempt reconnection per health tracker backoff | ||
| 268 | 3. **Rate limit recovery**: Calls `check_rate_limit_recovery()` to clear expired rate limits | ||
| 269 | 4. **Metrics update**: Updates Prometheus metrics with current health states | ||
| 270 | |||
| 271 | **Why combined**: The 2-second interval provides good responsiveness for health changes while minimizing overhead. All operations are lightweight (index checks, no I/O except actual connection attempts). | ||
| 272 | |||
| 273 | ### 3. Self-Subscriber (`SelfSubscriber::run`) | ||
| 274 | |||
| 275 | **Purpose**: Monitor own relay for repository announcements and root events | ||
| 276 | |||
| 277 | **Subscribed kinds**: 30617, 1617, 1618, 1621 (NOT 30618) | ||
| 278 | |||
| 279 | **Batching**: 5-second window (configurable via `NGIT_SYNC_BATCH_WINDOW_MS`) | ||
| 280 | |||
| 281 | **Flow**: | ||
| 282 | |||
| 283 | 1. Queue events to `PendingUpdates` | ||
| 284 | 2. Timer fires (interval, does not reset on events) | ||
| 285 | 3. Process batch: update RepoSyncIndex → derive targets → send AddFilters to SyncManager | ||
| 286 | |||
| 287 | --- | ||
| 288 | |||
| 215 | ## Core Architecture: Live vs Historic Sync | 289 | ## Core Architecture: Live vs Historic Sync |
| 216 | 290 | ||
| 217 | The sync system is built on two fundamental primitives that are clearly separated: | 291 | The sync system is built on two fundamental primitives that are clearly separated: |
| @@ -223,19 +297,6 @@ The sync system is built on two fundamental primitives that are clearly separate | |||
| 223 | | `sync_live()` | Ongoing event stream | `limit: 0` | Not tracked | | 297 | | `sync_live()` | Ongoing event stream | `limit: 0` | Not tracked | |
| 224 | | `historic_sync()` | Catch up on past events | Optional `since` | PendingSyncIndex | | 298 | | `historic_sync()` | Catch up on past events | Optional `since` | PendingSyncIndex | |
| 225 | 299 | ||
| 226 | ### Why `limit: 0` for Live Sync? | ||
| 227 | |||
| 228 | | Approach | Pros | Cons | | ||
| 229 | | ------------ | --------------------------------------- | --------------------------------- | | ||
| 230 | | `since: now` | Intuitive | Time-sensitive, clock skew issues | | ||
| 231 | | `limit: 0` | Deterministic, mirrors filter structure | Less intuitive name | | ||
| 232 | |||
| 233 | `limit: 0` is better because: | ||
| 234 | |||
| 235 | 1. **No time dependency**: Doesn't depend on synchronized clocks | ||
| 236 | 2. **Mirrors historic filters**: Same tag structure, just different limit | ||
| 237 | 3. **State reconstruction**: Can rebuild from repo/event lists without timestamps | ||
| 238 | |||
| 239 | ### Layer Strategy | 300 | ### Layer Strategy |
| 240 | 301 | ||
| 241 | | Layer | Content | When Subscribed | Managed By | | 302 | | Layer | Content | When Subscribed | Managed By | |
| @@ -261,7 +322,7 @@ The system has **two independent paths** that create and process AddFilters acti | |||
| 261 | 322 | ||
| 262 | **Path 1: Self-Subscriber (direct AddFilters construction)** | 323 | **Path 1: Self-Subscriber (direct AddFilters construction)** |
| 263 | 324 | ||
| 264 | The [`SelfSubscriber::process_batch()`](src/sync/self_subscriber.rs:452) method: | 325 | The [`SelfSubscriber::process_batch()`](src/sync/self_subscriber.rs:448) method: |
| 265 | 326 | ||
| 266 | 1. Updates `RepoSyncIndex` with discovered repos | 327 | 1. Updates `RepoSyncIndex` with discovered repos |
| 267 | 2. Calls `derive_relay_targets()` to get per-relay targets | 328 | 2. Calls `derive_relay_targets()` to get per-relay targets |
| @@ -271,7 +332,7 @@ The [`SelfSubscriber::process_batch()`](src/sync/self_subscriber.rs:452) method: | |||
| 271 | 332 | ||
| 272 | **Path 2: Connect/Reconnect (via compute_actions)** | 333 | **Path 2: Connect/Reconnect (via compute_actions)** |
| 273 | 334 | ||
| 274 | The [`SyncManager::recompute_new_sync_filters_for_relay()`](src/sync/mod.rs:1374) method: | 335 | The `SyncManager::recompute_new_sync_filters_for_relay()` method: |
| 275 | 336 | ||
| 276 | 1. Calls `derive_relay_targets()` from `RepoSyncIndex` | 337 | 1. Calls `derive_relay_targets()` from `RepoSyncIndex` |
| 277 | 2. Calls `compute_actions(targets, pending, confirmed)` - three-way diff | 338 | 2. Calls `compute_actions(targets, pending, confirmed)` - three-way diff |
| @@ -527,183 +588,37 @@ fn compute_actions( | |||
| 527 | 588 | ||
| 528 | --- | 589 | --- |
| 529 | 590 | ||
| 530 | ## Method Specifications | 591 | ## Key Implementation Methods |
| 531 | |||
| 532 | ### Primitives | ||
| 533 | |||
| 534 | #### `sync_live()` - Live Subscriptions | ||
| 535 | |||
| 536 | ```rust | ||
| 537 | /// Set up live subscription (filters with limit: 0) | ||
| 538 | /// | ||
| 539 | /// - Uses `limit: 0` to receive only new events | ||
| 540 | /// - NOT tracked in PendingSyncIndex (state reconstructable) | ||
| 541 | async fn sync_live(&self, relay_url: &str, filters: &[Filter]) | ||
| 542 | ``` | ||
| 543 | |||
| 544 | #### `historic_sync()` - Historical Sync Dispatcher | ||
| 545 | |||
| 546 | ```rust | ||
| 547 | /// Dispatch to appropriate historic sync method based on relay capabilities | ||
| 548 | /// | ||
| 549 | /// Both paths update PendingSyncIndex to ensure consistent lifecycle tracking. | ||
| 550 | async fn historic_sync( | ||
| 551 | &mut self, | ||
| 552 | relay_url: &str, | ||
| 553 | filters: Vec<Filter>, | ||
| 554 | items: PendingItems, | ||
| 555 | since: Option<Timestamp>, | ||
| 556 | ) -> Option<u64> // Returns batch_id | ||
| 557 | ``` | ||
| 558 | |||
| 559 | Dispatches to: | ||
| 560 | |||
| 561 | - `historic_sync_negentropy()` - NIP-77 parallel sync (if supported) - no pagination needed | ||
| 562 | - `historic_sync_legacy()` - REQ+EOSE fallback with automatic pagination for large result sets | ||
| 563 | 592 | ||
| 564 | ### Building Blocks | 593 | ### Connection Lifecycle |
| 565 | 594 | ||
| 566 | #### `handle_new_sync_filters()` - Handle New AddFilters | 595 | - **`register_relay()`**: Creates RelayConnection object, stores in HashMap, returns immediately |
| 596 | - **`try_connect_relay()`**: Attempts connection using `connection.connect()` with timeout | ||
| 597 | - **`handle_connect_or_reconnect()`**: Spawns event loop, updates state, decides sync strategy (fresh_start/quick_reconnect) | ||
| 598 | - **`handle_disconnect()`**: Updates state to Disconnected, clears pending batches, keeps RelayConnection object | ||
| 599 | - **`retry_disconnected_relays()`**: Called every 2s, retries relays that pass health tracker checks | ||
| 567 | 600 | ||
| 568 | ```rust | 601 | ### Sync Entry Points |
| 569 | /// Handle AddFilters action (from self-subscriber channel OR compute_actions) | ||
| 570 | /// | ||
| 571 | /// Orchestrates both live and historic sync for NEW items: | ||
| 572 | /// 1. Check/spawn connection if needed (for unknown relays) | ||
| 573 | /// 2. maybe_consolidate() - check filter threshold | ||
| 574 | /// 3. sync_live() - set up permanent L2+L3 subscriptions | ||
| 575 | /// 4. historic_sync() - catch up on past events | ||
| 576 | /// | ||
| 577 | /// This is the SINGLE entry point for processing AddFilters from BOTH paths. | ||
| 578 | async fn handle_new_sync_filters(&mut self, action: AddFilters) | ||
| 579 | ``` | ||
| 580 | 602 | ||
| 581 | ### Top-Level Entry Points | 603 | - **`fresh_start()`**: Full sync - clears all state, L1 historic (with negentropy if available), then L2+L3 via recompute |
| 604 | - **`quick_reconnect()`**: Incremental sync - preserves confirmed state, L1 historic with `since`, L2+L3 rebuild with `since`, then recompute for new items | ||
| 605 | - **`daily_sync()`**: Wrapper around `fresh_start()` without disconnect metrics | ||
| 606 | - **`consolidate()`**: Reduces filter count - clears pending, unsubscribes all, rebuilds live subscriptions only, then recompute for new items | ||
| 582 | 607 | ||
| 583 | #### `fresh_start()` - Clean Slate Sync | 608 | ### Sync Primitives |
| 584 | |||
| 585 | ```rust | ||
| 586 | /// Fresh start - clears state and does full sync | ||
| 587 | /// | ||
| 588 | /// Called by: initial connect, long_reconnect, daily_sync | ||
| 589 | /// | ||
| 590 | /// Flow: | ||
| 591 | /// 1. Clear PendingSyncIndex | ||
| 592 | /// 2. Clear RelaySyncIndex | ||
| 593 | /// 3. L1 live + L1 historic (negentropy if available) | ||
| 594 | /// 4. recompute_new_sync_filters_for_relay → compute_actions → handle_new_sync_filters for L2+L3 | ||
| 595 | async fn fresh_start(&mut self, relay_url: &str) | ||
| 596 | ``` | ||
| 597 | |||
| 598 | #### `quick_reconnect()` - Short Disconnection Recovery | ||
| 599 | |||
| 600 | ```rust | ||
| 601 | /// Quick reconnect - for disconnections < 15 minutes | ||
| 602 | /// | ||
| 603 | /// Flow: | ||
| 604 | /// 1. Clear PendingSyncIndex | ||
| 605 | /// 2. L1 live + L1 historic(since) | ||
| 606 | /// 3. reconstruct_filters → L2+L3 live + L2+L3 historic(since) | ||
| 607 | /// 4. compute_actions for any new items | ||
| 608 | async fn quick_reconnect(&mut self, relay_url: &str, since: Timestamp) | ||
| 609 | ``` | ||
| 610 | |||
| 611 | #### `long_reconnect()` - Extended Disconnection Recovery | ||
| 612 | |||
| 613 | ```rust | ||
| 614 | /// Long reconnect - for disconnections > 15 minutes | ||
| 615 | /// | ||
| 616 | /// Flow: | ||
| 617 | /// 1. Record disconnect/reconnect metric | ||
| 618 | /// 2. fresh_start() | ||
| 619 | async fn long_reconnect(&mut self, relay_url: &str) | ||
| 620 | ``` | ||
| 621 | |||
| 622 | #### `daily_sync()` - Scheduled Full Refresh | ||
| 623 | |||
| 624 | ```rust | ||
| 625 | /// Daily sync - full refresh without disconnect metrics | ||
| 626 | /// | ||
| 627 | /// Flow: fresh_start() (no disconnect metric recorded) | ||
| 628 | async fn daily_sync(&mut self, relay_url: &str) | ||
| 629 | ``` | ||
| 630 | 609 | ||
| 631 | #### `consolidate()` - Filter Count Reduction | 610 | - **`sync_live()`**: Creates subscriptions with `limit: 0` for ongoing event stream (not tracked in PendingSyncIndex) |
| 611 | - **`historic_sync()`**: Dispatches to negentropy or REQ+EOSE based on relay capability, creates PendingBatch, returns batch_id | ||
| 632 | 612 | ||
| 633 | ```rust | 613 | ### Filter Processing |
| 634 | /// Consolidate subscriptions when filter count exceeds threshold | ||
| 635 | /// | ||
| 636 | /// Flow: | ||
| 637 | /// 1. Clear PendingSyncIndex | ||
| 638 | /// 2. unsubscribe_all | ||
| 639 | /// 3. reconstruct_filters → sync_live only (L1+L2+L3) | ||
| 640 | /// 4. compute_actions for any new items | ||
| 641 | /// | ||
| 642 | /// NO historic sync - items already synced, just reducing subscriptions | ||
| 643 | async fn consolidate(&mut self, relay_url: &str) | ||
| 644 | ``` | ||
| 645 | 614 | ||
| 646 | #### `handle_new_sync_filters()` - New Filter Discovery | 615 | - **`handle_new_sync_filters()`**: Single entry point for AddFilters from both paths (self-subscriber OR recompute), orchestrates live+historic sync |
| 647 | 616 | - **`recompute_new_sync_filters_for_relay()`**: Calls derive_relay_targets → compute_actions → handle_new_sync_filters for each resulting action | |
| 648 | ```rust | ||
| 649 | /// Handle AddFilters action from compute_actions | ||
| 650 | /// | ||
| 651 | /// Flow: | ||
| 652 | /// 1. Check/spawn connection if needed | ||
| 653 | /// 2. maybe_consolidate (check filter threshold) | ||
| 654 | /// 3. recompute_new_sync_filters_for_relay | ||
| 655 | async fn handle_new_sync_filters(&mut self, action: AddFilters) | ||
| 656 | ``` | ||
| 657 | 617 | ||
| 658 | --- | 618 | --- |
| 659 | 619 | ||
| 660 | ## Method Relationships Summary | 620 | ## Method Relationships Summary |
| 661 | 621 | ||
| 662 | ``` | ||
| 663 | fresh_start(relay_url) // Initial/long_reconnect/daily | ||
| 664 | ├──> Clear PendingSyncIndex | ||
| 665 | ├──> Clear RelaySyncIndex | ||
| 666 | ├──> L1: sync_live(announcement_filter) | ||
| 667 | ├──> L1: historic_sync(announcement_filter, None) | ||
| 668 | └──> compute_actions → AddFilters → recompute_new_sync_filters_for_relay (L2+L3) | ||
| 669 | |||
| 670 | quick_reconnect(relay_url, since) // Disconnected < 15 min | ||
| 671 | ├──> Clear PendingSyncIndex | ||
| 672 | ├──> L1: sync_live(announcement_filter) | ||
| 673 | ├──> L1: historic_sync(announcement_filter, since) | ||
| 674 | ├──> reconstruct_filters() → L2+L3 filters | ||
| 675 | ├──> L2+L3: sync_live(filters) | ||
| 676 | ├──> L2+L3: historic_sync(filters, since) | ||
| 677 | └──> compute_actions → AddFilters → recompute_new_sync_filters_for_relay (new items only) | ||
| 678 | |||
| 679 | long_reconnect(relay_url) // Disconnected > 15 min | ||
| 680 | ├──> Record disconnect/reconnect metric | ||
| 681 | └──> fresh_start() | ||
| 682 | |||
| 683 | daily_sync(relay_url) // Timer fires | ||
| 684 | └──> fresh_start() // No disconnect metric | ||
| 685 | |||
| 686 | consolidate(relay_url) // Filter count > threshold | ||
| 687 | ├──> Clear PendingSyncIndex | ||
| 688 | ├──> unsubscribe_all() | ||
| 689 | ├──> reconstruct_filters() → L1+L2+L3 filters | ||
| 690 | ├──> sync_live(filters) // Live only, NO historic | ||
| 691 | └──> compute_actions → AddFilters → recompute_new_sync_filters_for_relay (new items only) | ||
| 692 | |||
| 693 | handle_new_sync_filters(action) // New filter discovery | ||
| 694 | ├──> Check/spawn connection | ||
| 695 | ├──> maybe_consolidate() | ||
| 696 | └──> recompute_new_sync_filters_for_relay(action, None) | ||
| 697 | |||
| 698 | recompute_new_sync_filters_for_relay(action, since) // Process AddFilters | ||
| 699 | ├──> sync_live(action.filters) // L2+L3 live | ||
| 700 | └──> historic_sync(action.filters, since) // L2+L3 historic | ||
| 701 | ├── historic_sync_negentropy() // Parallel, updates Pending | ||
| 702 | └── historic_sync_legacy() // REQ+EOSE, updates Pending | ||
| 703 | ``` | ||
| 704 | |||
| 705 | --- | ||
| 706 | |||
| 707 | ## Filter Building (Three-Layer Strategy) | 622 | ## Filter Building (Three-Layer Strategy) |
| 708 | 623 | ||
| 709 | ### Layer 1: Announcements | 624 | ### Layer 1: Announcements |
| @@ -746,7 +661,6 @@ Negentropy sync is attempted for: | |||
| 746 | 661 | ||
| 747 | - **fresh_start()** - Full sync without `since` | 662 | - **fresh_start()** - Full sync without `since` |
| 748 | - **daily_sync()** - Periodic full refresh (via fresh_start) | 663 | - **daily_sync()** - Periodic full refresh (via fresh_start) |
| 749 | - **long_reconnect()** - Via fresh_start | ||
| 750 | 664 | ||
| 751 | Negentropy is NOT used for: | 665 | Negentropy is NOT used for: |
| 752 | 666 | ||
| @@ -864,22 +778,6 @@ flowchart TB | |||
| 864 | 778 | ||
| 865 | --- | 779 | --- |
| 866 | 780 | ||
| 867 | ## Key Design Decisions | ||
| 868 | |||
| 869 | | Decision | Choice | Rationale | | ||
| 870 | | ----------------------------- | ---------------------------------------------------------- | ------------------------------------------------------------------ | | ||
| 871 | | Live vs Historic separation | Two distinct primitives | Clear responsibilities, easier reasoning about state | | ||
| 872 | | Live sync method | `limit: 0` not `since: now` | No clock dependency, deterministic, mirrors filter structure | | ||
| 873 | | Layer 1 handling | On connect, separate from AddFilters | Connection-level concern, not item-level | | ||
| 874 | | Layer 2+3 handling | Via compute_actions → recompute_new_sync_filters_for_relay | Item-level, proper pending tracking | | ||
| 875 | | Clear PendingSyncIndex | Always first | Old subscriptions are dead, must clear before any operation | | ||
| 876 | | fresh_start vs long_reconnect | Same flow, different metrics | Reuse logic, distinguish intentional refresh from failure recovery | | ||
| 877 | | Consolidation | Live only, no historic | Items already synced, just reducing subscription count | | ||
| 878 | | compute_actions role | ONLY decision point for new work | Single place to reason about what needs syncing | | ||
| 879 | | NIP-77 negentropy | Try first on full sync, fallback | Efficient for large sets, graceful degradation | | ||
| 880 | |||
| 881 | --- | ||
| 882 | |||
| 883 | ## Module Structure | 781 | ## Module Structure |
| 884 | 782 | ||
| 885 | ``` | 783 | ``` |
| @@ -897,50 +795,70 @@ src/sync/ | |||
| 897 | 795 | ||
| 898 | ## Health Tracking | 796 | ## Health Tracking |
| 899 | 797 | ||
| 900 | The `RelayHealthTracker` manages connection health with exponential backoff: | 798 | The [`RelayHealthTracker`](src/sync/health.rs:209) manages connection health with exponential backoff and state transitions: |
| 901 | 799 | ||
| 902 | - **States**: Healthy, Degraded, Dead | 800 | ### Health States |
| 903 | - **Backoff**: `base * 2^(failures-1)`, capped at max_backoff | 801 | |
| 802 | 1. **Healthy**: Working connection, no recent failures, proven stable (past 5-minute stability period) | ||
| 803 | 2. **Disconnected**: Not currently connected, but no recent failures or issues | ||
| 804 | 3. **Degraded**: Connection problems (actively failing to connect) OR recently recovered but not yet stable | ||
| 805 | 4. **Dead**: 24+ hours of continuous failures, minimal retry (once per 24 hours) | ||
| 806 | 5. **RateLimited**: Rate limited by relay, 65-second cooldown active | ||
| 807 | |||
| 808 | ### State Transitions | ||
| 809 | |||
| 810 | ``` | ||
| 811 | Healthy <-> Disconnected: Normal connection/disconnection | ||
| 812 | Disconnected -> Degraded: Connection failure | ||
| 813 | Degraded -> Dead: 24h+ of continuous failures | ||
| 814 | Degraded -> Disconnected: Recovery (enters 5min stability period) | ||
| 815 | Disconnected -> Healthy: Stable for 5 minutes after recovery | ||
| 816 | Any -> RateLimited: NOTICE message from relay indicating rate limiting | ||
| 817 | RateLimited -> previous state: After 65-second cooldown expires | ||
| 818 | ``` | ||
| 819 | |||
| 820 | ### Backoff Configuration | ||
| 821 | |||
| 822 | - **Formula**: `base_backoff * 2^(failures-1)`, capped at `max_backoff` | ||
| 823 | - **Default base**: 5 seconds (configurable via `sync_base_backoff_secs`) | ||
| 824 | - **Default max**: 1 hour (configurable via `sync_max_backoff_secs`) | ||
| 904 | - **Dead threshold**: 24 hours of continuous failures | 825 | - **Dead threshold**: 24 hours of continuous failures |
| 905 | - **Dead relay retry**: Once per 24 hours | 826 | - **Dead retry interval**: Once per 24 hours |
| 827 | - **Rate limit cooldown**: Fixed 65 seconds (60s typical limit + 5s buffer) | ||
| 828 | - **Stability period**: 5 minutes after recovery before marking as Healthy | ||
| 829 | |||
| 830 | ### Special Behaviors | ||
| 906 | 831 | ||
| 907 | Bootstrap relays are never disconnected by the cleanup system, even if empty. | 832 | - **Bootstrap relays**: Never disconnected by cleanup system, even if empty |
| 833 | - **Rate limiting**: Distinct from connection failures - triggered by relay NOTICE messages | ||
| 834 | - **Connection timeout**: Set to `base_backoff_secs` to ensure retry timing works correctly | ||
| 908 | 835 | ||
| 909 | --- | 836 | --- |
| 910 | 837 | ||
| 911 | ## Self-Subscriber | 838 | ## Prometheus Metrics |
| 912 | 839 | ||
| 913 | The `SelfSubscriber` monitors our own relay for repository announcements and root events, updating the `RepoSyncIndex`. | 840 | The [`SyncMetrics`](src/sync/metrics.rs:18) module provides comprehensive monitoring via Prometheus: |
| 914 | 841 | ||
| 915 | ### Event Kinds Monitored | 842 | ### Connection Metrics |
| 916 | 843 | ||
| 917 | - **30617** - Repository Announcements (triggers discovery of repos listing our relay) | 844 | - `ngit_sync_relay_connected`: Per-relay connection status (1=connected, 0=disconnected) |
| 918 | - **1617** - Patches (root events referencing repos) | 845 | - `ngit_sync_connection_attempts_total`: Total connection attempts by relay and result (success/failure) |
| 919 | - **1618** - Issues | ||
| 920 | - **1619** - Replies/Status | ||
| 921 | - **1621** - Pull Requests | ||
| 922 | 846 | ||
| 923 | Note: 30618 (Maintainer Lists) is NOT self-subscribed - only synced from remote relays. | 847 | ### Health Metrics |
| 924 | 848 | ||
| 925 | ### Batching Flow | 849 | - `ngit_sync_relay_status`: Per-relay health status (1=healthy, 2=disconnected, 3=degraded, 4=dead, 5=rate_limited) |
| 850 | - `ngit_sync_relay_failures`: Consecutive failure count per relay | ||
| 926 | 851 | ||
| 927 | 1. **Receive events** from own relay subscription | 852 | ### Event Metrics |
| 928 | 2. **Queue to pending** - announcements get repo ID + relay URLs; root events get repo ref + event ID | ||
| 929 | 3. **Timer fires** (configurable window, default 5 seconds) - does NOT reset on new events | ||
| 930 | 4. **Process batch**: | ||
| 931 | - Update `RepoSyncIndex` with discovered repos and root events | ||
| 932 | - Call `compute_actions()` | ||
| 933 | - Send `AddFilters` actions to SyncManager → `recompute_new_sync_filters_for_relay()` | ||
| 934 | 853 | ||
| 935 | --- | 854 | - `ngit_sync_events_synced_total`: Total events synced (newly saved events only, not duplicates or rejected) |
| 936 | 855 | ||
| 937 | ## Disconnect Handling | 856 | ### Summary Metrics |
| 938 | 857 | ||
| 939 | The disconnect checker runs periodically (default: 60 seconds) to clean up empty relays: | 858 | - `ngit_sync_relays_tracked_total`: Total number of relays discovered and tracked |
| 859 | - `ngit_sync_relays_connected_total`: Number of currently connected relays | ||
| 860 | - `ngit_sync_relays_dead_total`: Number of relays marked as dead | ||
| 940 | 861 | ||
| 941 | - Finds relays with `repos.is_empty() && root_events.is_empty()` | 862 | All metrics follow the `ngit_sync_` prefix convention and are updated by the health and metrics checker every 2 seconds. |
| 942 | - Skips bootstrap relays (`is_bootstrap == true`) | ||
| 943 | - Removes from relay_sync_index, pending_sync_index, and connections | ||
| 944 | - Disconnects the WebSocket connection | ||
| 945 | 863 | ||
| 946 | Also triggers reconnection attempts for disconnected relays that have pending work. | 864 | --- |