| author | DanConwayDev <DanConwayDev@protonmail.com> | 2025-12-04 18:43:49 +0000 |
|---|---|---|
| committer | DanConwayDev <DanConwayDev@protonmail.com> | 2025-12-04 18:43:49 +0000 |
| commit | dd403b17e7c74db9443d0891a9de1f0f0f9f89eb (patch) | |
| tree | 177dd9f664dde3565492c1d11016dabfeda28bbc /docs/explanation | |
| parent | 950c2e4e68448d2abcad90a31bfffaca6d7bc47e (diff) | |
feat(sync): Phase 6 - observability and production readiness
- Add SyncMetrics with full Prometheus integration
- Track sync gaps via catchup events
- Update Grafana dashboard with sync panels
- Document all sync configuration options
- Update design doc with implementation notes
Diffstat (limited to 'docs/explanation')
| -rw-r--r-- | docs/explanation/grasp-02-proactive-sync.md | 128 |
1 files changed, 128 insertions, 0 deletions
diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md
index a8af3f4..98531ec 100644
--- a/docs/explanation/grasp-02-proactive-sync.md
+++ b/docs/explanation/grasp-02-proactive-sync.md
| @@ -745,3 +745,131 @@ pub struct SyncConfig { | |||
| 745 | 8. **Dynamic subscription addition** with periodic consolidation | 745 | 8. **Dynamic subscription addition** with periodic consolidation |
| 746 | 9. **Custom acceptance policy** excluding rate limiting defaults | 746 | 9. **Custom acceptance policy** excluding rate limiting defaults |
| 747 | 10. **Catchup as failure signal** - events found during catchup/daily indicate live sync gaps, tracked in Prometheus | 747 | 10. **Catchup as failure signal** - events found during catchup/daily indicate live sync gaps, tracked in Prometheus |
| 748 | |||
| 749 | --- | ||
| 750 | |||
| 751 | ## Implementation Notes (Phase 6) | ||
| 752 | |||
| 753 | This section documents the final implementation as of Phase 6 (Observability & Production Readiness). | ||
| 754 | |||
| 755 | ### What Was Actually Built | ||
| 756 | |||
| 757 | The implementation closely follows the design document. The completed components, by phase: | ||
| 758 | |||
| 759 | #### Phase 1: Basic Sync (commit b167f1b) | ||
| 760 | - [`SyncManager`](../../src/sync/manager.rs) - Main coordinator for proactive sync | ||
| 761 | - Single relay sync via `NGIT_SYNC_RELAY_URL` configuration | ||
| 762 | - Event validation through existing [`Nip34WritePolicy`](../../src/nostr/builder.rs) | ||
| 763 | |||
| 764 | #### Phase 2: Three-Layer Filters (commit bf558b0) | ||
| 765 | - [`FilterService`](../../src/sync/filter.rs) - Builds three-layer filter strategy | ||
| 766 | - Layer 1: All events of kinds 30617 and 30618 (announcements) | ||
| 767 | - Layer 2: A/a tag filters for repository events | ||
| 768 | - Layer 3: E/e tag filters for related events (PRs, Issues) | ||
| 769 | - Multi-relay discovery from stored announcements | ||
| 770 | |||
| 771 | #### Phase 3: Health Tracking (commit f639ecf) | ||
| 772 | - [`RelayHealthTracker`](../../src/sync/health.rs) - DashMap-based health tracking | ||
| 773 | - Three states: Healthy → Degraded → Dead | ||
| 774 | - Exponential backoff: 5s → 10s → 20s → ... → max (default 1h) | ||
| 775 | - Dead relay detection after 24h of continuous failures | ||
| 776 | - Startup jitter (0-10s) to prevent thundering herd | ||
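
The backoff sequence listed for Phase 3 (5s → 10s → 20s → ... → max) can be sketched as below. This is a minimal illustration, not the code in `health.rs`: the function name `backoff_delay`, the constant `BASE_BACKOFF_SECS`, and the zero-based failure index are all assumptions made for the example.

```rust
use std::time::Duration;

/// Base delay before the first retry (5s, per the sequence in the doc).
const BASE_BACKOFF_SECS: u64 = 5;

/// Delay before the next reconnect attempt, doubling per prior failure and
/// capped at `max_backoff_secs`. Hypothetical helper: `consecutive_failures`
/// is assumed to be zero-based (0 => 5s, 1 => 10s, 2 => 20s).
pub fn backoff_delay(consecutive_failures: u32, max_backoff_secs: u64) -> Duration {
    // 2^failures; a shift of 64 or more would overflow, so saturate instead.
    let multiplier = 1u64.checked_shl(consecutive_failures).unwrap_or(u64::MAX);
    let secs = BASE_BACKOFF_SECS
        .saturating_mul(multiplier)
        .min(max_backoff_secs);
    Duration::from_secs(secs)
}
```

With the documented default cap of 3600 seconds, the delay stops growing once the doubled value would exceed one hour. Saturating arithmetic keeps the result at the cap even for very large failure counts.
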
| 777 | |||
| 778 | #### Phase 4: Dynamic Subscriptions (commit a19ff57) | ||
| 779 | - [`SubscriptionManager`](../../src/sync/subscription.rs) - Per-connection subscription tracking | ||
| 780 | - Dynamic Layer 2 subscriptions when new announcements arrive | ||
| 781 | - Dynamic Layer 3 subscriptions when new PRs/Issues arrive | ||
| 782 | - Filter consolidation at threshold (150 filters) | ||
| 783 | |||
| 784 | #### Phase 5: Catchup & Gap Detection (commit 950c2e4) | ||
| 785 | - [`NegentropyService`](../../src/sync/negentropy.rs) - Gap-filling catchup operations | ||
| 786 | - Startup catchup (configurable delay) | ||
| 787 | - Reconnection catchup (limited lookback) | ||
| 788 | - Daily catchup (not yet implemented - placeholder) | ||
| 789 | |||
| 790 | #### Phase 6: Observability (this phase) | ||
| 791 | - [`SyncMetrics`](../../src/sync/metrics.rs) - Full Prometheus integration | ||
| 792 | - Grafana dashboard panels for sync monitoring | ||
| 793 | - Documentation updates | ||
| 794 | |||
| 795 | ### Differences from Original Design | ||
| 796 | |||
| 797 | 1. **Negentropy (NIP-77)**: Simplified gap-filling was used instead of full NIP-77 negentropy reconciliation, as nostr-sdk 0.44 lacks built-in negentropy support. The current implementation uses timestamp-based catchup queries. | ||
| 798 | |||
| 799 | 2. **Filter Consolidation Threshold**: Set at 150 filters (as designed) based on typical relay filter limits. | ||
| 800 | |||
| 801 | 3. **Health Tracking**: Implemented exactly as designed - in-memory only (not persisted to database), which is acceptable for production as health state rebuilds quickly on restart. | ||
| 802 | |||
| 803 | 4. **Metric Label Strategy**: Used a simpler numeric encoding for health status (1=healthy, 2=degraded, 3=dead) instead of multiple label values per relay, reducing cardinality. | ||
| 804 | |||
| 805 | 5. **Event Source Tracking**: Implemented four source types (`live`, `startup`, `reconnect`, `daily`) instead of the original four (`direct`, `live_sync`, `catchup`, `daily_catchup`). | ||
| 806 | |||
| 807 | ### Three-Layer Filter Strategy (As Implemented) | ||
| 808 | |||
| 809 | ``` | ||
| 810 | Layer 1: Discovery Layer | ||
| 811 | ├── Query: kinds [30617, 30618] (announcements) | ||
| 812 | ├── Applied: At startup and during sync | ||
| 813 | └── Purpose: Discover all repositories across network | ||
| 814 | |||
| 815 | Layer 2: Repository Events | ||
| 816 | ├── Query: Events with A/a tags pointing to tracked repos | ||
| 817 | ├── Format: A tag = "30617:<pubkey>:<identifier>" | ||
| 818 | ├── Triggered: When new announcement is accepted | ||
| 819 | └── Purpose: Get PRs, issues, patches for repositories | ||
| 820 | |||
| 821 | Layer 3: Related Events | ||
| 822 | ├── Query: Events with E/e tags pointing to tracked PRs/Issues | ||
| 823 | ├── Triggered: When new PR/Issue is accepted | ||
| 824 | └── Purpose: Get comments, reviews, status updates | ||
| 825 | ``` | ||
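
The Layer 2 tag value format shown above, `30617:<pubkey>:<identifier>`, can be built and parsed as in the sketch below. The helper names `repo_coordinate` and `parse_coordinate` are hypothetical and are not taken from `filter.rs`. The sketch also assumes the identifier may itself contain colons, so only the first two separators are treated as delimiters.

```rust
/// Kind number of a repository announcement, as listed for Layer 1.
const REPO_ANNOUNCEMENT_KIND: u16 = 30617;

/// Builds the value matched by Layer 2 `A`/`a` tag filters.
/// Hypothetical helper for illustration only.
pub fn repo_coordinate(pubkey_hex: &str, identifier: &str) -> String {
    format!("{}:{}:{}", REPO_ANNOUNCEMENT_KIND, pubkey_hex, identifier)
}

/// Splits a coordinate into (kind, pubkey, identifier).
/// Returns `None` if a part is missing or the kind is not a number.
pub fn parse_coordinate(value: &str) -> Option<(u16, &str, &str)> {
    // Split into at most three parts so colons in the identifier survive.
    let mut parts = value.splitn(3, ':');
    let kind = parts.next()?.parse::<u16>().ok()?;
    let pubkey = parts.next()?;
    let identifier = parts.next()?;
    Some((kind, pubkey, identifier))
}
```

A dynamic Layer 2 subscription would use one such value per tracked repository. How those values are batched into filters is not shown here.
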
| 826 | |||
| 827 | ### Prometheus Metrics (As Implemented) | ||
| 828 | |||
| 829 | | Metric | Type | Labels | Description | | ||
| 830 | |--------|------|--------|-------------| | ||
| 831 | | `ngit_sync_relay_connected` | Gauge | relay | Connection status (1/0) | | ||
| 832 | | `ngit_sync_connection_attempts_total` | Counter | relay, result | Attempts by outcome | | ||
| 833 | | `ngit_sync_relay_status` | Gauge | relay | Health state (1/2/3) | | ||
| 834 | | `ngit_sync_relay_failures` | Gauge | relay | Consecutive failures | | ||
| 835 | | `ngit_sync_events_total` | Counter | source | Events by source type | | ||
| 836 | | `ngit_sync_gap_events_total` | Counter | relay | Gap events filled | | ||
| 837 | | `ngit_sync_relays_tracked_total` | Gauge | - | Total relays discovered | | ||
| 838 | | `ngit_sync_relays_connected_total` | Gauge | - | Currently connected | | ||
| 839 | | `ngit_sync_relays_dead_total` | Gauge | - | Dead relay count | | ||
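
The numeric encoding behind `ngit_sync_relay_status` (1=healthy, 2=degraded, 3=dead) could be expressed as a small enum mapping, sketched below. The type name `RelayHealth` and its methods are assumptions for the example. The actual types in `health.rs` and `metrics.rs` may differ.

```rust
/// The three health states from Phase 3. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelayHealth {
    Healthy,
    Degraded,
    Dead,
}

impl RelayHealth {
    /// Value written to the per-relay status gauge.
    pub fn gauge_value(self) -> i64 {
        match self {
            RelayHealth::Healthy => 1,
            RelayHealth::Degraded => 2,
            RelayHealth::Dead => 3,
        }
    }

    /// Inverse mapping, e.g. for decoding a scraped value in tests.
    pub fn from_gauge_value(value: i64) -> Option<Self> {
        match value {
            1 => Some(RelayHealth::Healthy),
            2 => Some(RelayHealth::Degraded),
            3 => Some(RelayHealth::Dead),
            _ => None,
        }
    }
}
```

Encoding the state as one gauge value per relay yields one time series per relay. A `state` label would instead yield one series per relay and state.
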
| 840 | |||
| 841 | ### Configuration Options (As Implemented) | ||
| 842 | |||
| 843 | All options are set via environment variables or CLI flags: | ||
| 844 | |||
| 845 | | Option | Type | Default | Description | | ||
| 846 | |--------|------|---------|-------------| | ||
| 847 | | `NGIT_SYNC_RELAY_URL` | String | None | Primary sync relay URL | | ||
| 848 | | `NGIT_SYNC_MAX_BACKOFF_SECS` | u64 | 3600 | Max backoff delay (seconds) | | ||
| 849 | | `NGIT_SYNC_STARTUP_DELAY_SECS` | u64 | 30 | Catchup delay after startup | | ||
| 850 | | `NGIT_SYNC_RECONNECT_DELAY_SECS` | u64 | 10 | Catchup delay after reconnect | | ||
| 851 | | `NGIT_SYNC_RECONNECT_LOOKBACK_DAYS` | u64 | 3 | Days to look back on reconnect | | ||
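
The defaults in the table above can be illustrated with a standard-library-only loader. This is a sketch under stated assumptions: `SyncSettings`, its field names, and `from_vars` are hypothetical and do not describe the project's `SyncConfig` or its CLI parsing. Variables are passed in as a map so the example can be exercised without touching the process environment.

```rust
use std::collections::HashMap;

/// Illustrative settings struct; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct SyncSettings {
    pub relay_url: Option<String>,
    pub max_backoff_secs: u64,
    pub startup_delay_secs: u64,
    pub reconnect_delay_secs: u64,
    pub reconnect_lookback_days: u64,
}

/// Reads `key` as a u64, falling back to `default` when the variable is unset.
fn u64_or_default(
    vars: &HashMap<String, String>,
    key: &str,
    default: u64,
) -> Result<u64, String> {
    match vars.get(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("{}: invalid value {:?}: {}", key, raw, e)),
    }
}

impl SyncSettings {
    /// Builds settings from a variable map, applying the documented defaults.
    pub fn from_vars(vars: &HashMap<String, String>) -> Result<Self, String> {
        Ok(Self {
            relay_url: vars.get("NGIT_SYNC_RELAY_URL").cloned(),
            max_backoff_secs: u64_or_default(vars, "NGIT_SYNC_MAX_BACKOFF_SECS", 3600)?,
            startup_delay_secs: u64_or_default(vars, "NGIT_SYNC_STARTUP_DELAY_SECS", 30)?,
            reconnect_delay_secs: u64_or_default(vars, "NGIT_SYNC_RECONNECT_DELAY_SECS", 10)?,
            reconnect_lookback_days: u64_or_default(vars, "NGIT_SYNC_RECONNECT_LOOKBACK_DAYS", 3)?,
        })
    }
}
```

To read the real environment, collect `std::env::vars()` into a `HashMap<String, String>` and pass a reference to `from_vars`. A malformed number is reported as an error rather than silently replaced by the default.
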
| 852 | |||
| 853 | ### Module Structure (As Implemented) | ||
| 854 | |||
| 855 | ``` | ||
| 856 | src/sync/ | ||
| 857 | ├── mod.rs # Module exports, constants | ||
| 858 | ├── manager.rs # SyncManager - orchestrates sync | ||
| 859 | ├── connection.rs # SyncConnection - per-relay WebSocket | ||
| 860 | ├── filter.rs # FilterService - three-layer filters | ||
| 861 | ├── health.rs # RelayHealthTracker - health states | ||
| 862 | ├── metrics.rs # SyncMetrics - Prometheus integration | ||
| 863 | ├── negentropy.rs # NegentropyService - gap-filling | ||
| 864 | └── subscription.rs # SubscriptionManager - dynamic subs | ||
| 865 | ``` | ||
| 866 | |||
| 867 | ### Production Readiness Checklist | ||
| 868 | |||
| 869 | - [x] All metrics exposed at the `/metrics` endpoint | ||
| 870 | - [x] Health state tracking with configurable backoff | ||
| 871 | - [x] Dead relay detection and minimal retry | ||
| 872 | - [x] Startup jitter to prevent thundering herd | ||
| 873 | - [x] Grafana dashboard with sync panels | ||
| 874 | - [x] Configuration documented | ||
| 875 | - [x] Integration tests passing | ||