| Age | Commit message | Author |
|
Refactor internal code to use the mark_negentropy_unsupported() method
instead of direct field access for improved readability.
|
|
A negentropy retry that makes no progress (the relay returns zero events)
indicates that the relay's negentropy implementation is broken. Instead
of marking the batch as failed, we now:
1. Mark the relay as not supporting NIP-77 so future batches skip
   negentropy and use REQ+EOSE directly
2. Fall back to REQ+EOSE using semantic filters (kind/author/tags)
   for the current batch, which may succeed where ID-based queries fail
This addresses the issue where some relays (e.g., azzamo.net, snort.social)
return event IDs during negentropy diff but fail to serve those events
when requested by ID.
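The two-step reaction described above can be sketched as a small decision function. This is a minimal illustration, not the project's code: `BatchQuery`, `RetryDecision`, and `after_negentropy_retry` are hypothetical names, and continuing with ID-based requests when progress was made is an assumption.

```rust
/// Hypothetical description of how the current batch is queried next.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BatchQuery {
    /// Request the events reported by the negentropy diff by their IDs.
    ByIds(Vec<String>),
    /// Plain REQ+EOSE with kind/author filters instead of IDs.
    Semantic { kinds: Vec<u16>, authors: Vec<String> },
}

/// Hypothetical outcome of evaluating one negentropy retry.
#[derive(Debug, PartialEq, Eq)]
pub struct RetryDecision {
    /// True when the relay should be flagged as not supporting NIP-77.
    pub mark_nip77_unsupported: bool,
    /// The query to run next for the current batch.
    pub next_query: BatchQuery,
}

/// Zero events after a retry means "no progress": flag the relay and
/// switch the batch to the semantic REQ+EOSE query.
pub fn after_negentropy_retry(
    events_received: usize,
    missing_ids: Vec<String>,
    semantic: BatchQuery,
) -> RetryDecision {
    if events_received == 0 {
        RetryDecision {
            mark_nip77_unsupported: true,
            next_query: semantic,
        }
    } else {
        // Assumption: with progress, the remaining IDs are still requested by ID.
        RetryDecision {
            mark_nip77_unsupported: false,
            next_query: BatchQuery::ByIds(missing_ids),
        }
    }
}
```

The point of returning the semantic query rather than an error is that the batch is not counted as failed; it simply takes the slower path.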
|
|
Previously, when a relay didn't support NIP-77, the negentropy_sync_diff
function would wait for the full client.sync() timeout even after receiving
a NOTICE message that marked the relay as not supporting NIP-77.
This change uses tokio::select! to race the sync operation against a
polling task that checks the nip77_supported flag every 10ms. When a NOTICE
is received (detected in the message handler), the poll task detects the
status change and immediately returns an error, allowing quick fallback to
REQ+EOSE without waiting for timeouts.
Benefits:
- Fast failure (within roughly 10ms of the NOTICE being handled) when relay sends NIP-77 NOTICE
- No artificial timeout reduction that could hurt legitimate operations
- Maintains full timeout for relays that actually support NIP-77
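The commit races the sync future against a 10ms poll with tokio::select!. The sketch below is not that code: it is a std-only, thread-and-channel analogue of the same idea, so it can run without an async runtime. `wait_for_sync` and `SyncError` are hypothetical names, and the shared flag is modelled as an `AtomicBool`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum SyncError {
    /// The message handler cleared the flag (e.g. after a NOTICE).
    Nip77Unsupported,
    /// The full timeout elapsed without a result.
    Timeout,
    /// The worker producing the result went away.
    WorkerGone,
}

/// Wait for a sync result, but re-check the support flag every 10ms so a
/// rejection is noticed long before the full timeout expires.
pub fn wait_for_sync<T>(
    result_rx: &Receiver<T>,
    nip77_supported: &AtomicBool,
    full_timeout: Duration,
) -> Result<T, SyncError> {
    let deadline = Instant::now() + full_timeout;
    let poll_interval = Duration::from_millis(10);
    loop {
        if !nip77_supported.load(Ordering::Acquire) {
            return Err(SyncError::Nip77Unsupported);
        }
        let now = Instant::now();
        if now >= deadline {
            return Err(SyncError::Timeout);
        }
        // Never sleep past the overall deadline.
        let wait = poll_interval.min(deadline - now);
        match result_rx.recv_timeout(wait) {
            Ok(value) => return Ok(value),
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return Err(SyncError::WorkerGone),
        }
    }
}
```

The full timeout is left untouched for relays that do support NIP-77; only the flag check adds an early exit.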
|
|
When negentropy sync times out or has other failures, it now properly
returns Err() instead of Ok() with an empty reconciliation. This ensures
historic_sync increments failed_count and triggers fallback to REQ+EOSE
instead of treating it as a successful sync with 0 events.
Resolves an issue where bootstrap relay timeouts were marked as complete
instead of falling back to traditional sync.
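Why the Err/Ok distinction matters can be shown with a reduced model. The types below are hypothetical stand-ins; only `historic_sync` and `failed_count` are names taken from the message, and their real signatures are not known here.

```rust
/// Hypothetical stand-in for a negentropy reconciliation result.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Reconciliation {
    pub received: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SyncOutcome {
    Completed { events: usize },
    FellBackToReqEose,
}

/// Assumed shape of the fix: a timeout surfaces as Err, never as Ok(empty).
pub fn negentropy_attempt(timed_out: bool, received: usize) -> Result<Reconciliation, String> {
    if timed_out {
        Err("negentropy sync timed out".to_string())
    } else {
        Ok(Reconciliation { received })
    }
}

/// Only an Err bumps the failure counter and triggers the fallback path.
pub fn historic_sync(
    attempt: Result<Reconciliation, String>,
    failed_count: &mut u32,
) -> SyncOutcome {
    match attempt {
        Ok(r) => SyncOutcome::Completed { events: r.received },
        Err(_) => {
            *failed_count += 1;
            SyncOutcome::FellBackToReqEose
        }
    }
}
```

A genuine sync that finds nothing new still completes with zero events, while a timeout now takes the fallback branch.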
|
|
- Upgrade NOTICE log level to INFO when relay rejects negentropy (envelope/NEG- errors)
- Track NIP-77 support status per relay connection to avoid repeated failed attempts
- Mark relay as unsupported when NOTICE rejection or timeout occurs
- Skip negentropy on subsequent syncs during same connection session
- Reset support status on reconnect to allow retry after relay upgrades
This reduces log noise and eliminates 10-second timeout delays on each historic
sync attempt for relays that don't support NIP-77 negentropy.
Fixes negentropy-timeout-10-seconds issue by learning from relay behavior.
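A per-connection tracker along these lines might look as follows. This is a sketch under assumptions: `Nip77Status` and `is_negentropy_rejection` are hypothetical names, and matching on "unknown envelope" or "NEG-" is a guess at the heuristic based on the wording above, not the project's actual check.

```rust
/// Hypothetical per-connection record of whether negentropy is worth trying.
#[derive(Debug)]
pub struct Nip77Status {
    supported: bool,
}

impl Nip77Status {
    /// A new connection starts optimistic: negentropy is attempted once.
    pub fn new() -> Self {
        Nip77Status { supported: true }
    }

    pub fn should_try_negentropy(&self) -> bool {
        self.supported
    }

    /// Called on a NOTICE rejection or on a negentropy timeout.
    pub fn mark_negentropy_unsupported(&mut self) {
        self.supported = false;
    }

    /// Returns true if the NOTICE was treated as a negentropy rejection.
    pub fn on_notice(&mut self, message: &str) -> bool {
        let rejected = is_negentropy_rejection(message);
        if rejected {
            self.mark_negentropy_unsupported();
        }
        rejected
    }

    /// A reconnect resets the status so an upgraded relay gets a new chance.
    pub fn on_reconnect(&mut self) {
        self.supported = true;
    }
}

/// Assumed heuristic: envelope or NEG- related text marks a rejection.
pub fn is_negentropy_rejection(message: &str) -> bool {
    let lowered = message.to_ascii_lowercase();
    lowered.contains("unknown envelope") || lowered.contains("neg-")
}
```

Keeping the flag behind methods means the reset-on-reconnect rule lives in one place.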
|
|
Negentropy diff timeouts are expected when relays don't support NIP-77.
The relay responds with NOTICE 'unknown envelope label' and the timeout
is hit before we recognize this is unsupported rather than a failure.
Changes:
- Downgrade from warn! to debug! in negentropy_sync_filter()
(src/sync/relay_connection.rs:493)
- Add comment explaining timeouts are common for non-NIP-77 relays
- Update message to clarify timeout typically means no NIP-77 support
The existing fallback mechanism (lines 505-509) properly handles this
case and logs a one-time warning about falling back to REQ+EOSE.
Discovered via production sync testing against wss://git.shakespeare.diy
|
|
Replace the owner-npub configuration option with relay-owner-nsec to provide
a persistent cryptographic identity for the relay operator. This addresses
NIP-42 authentication requirements discovered during sync debugging.
Motivation:
- Some relays (e.g., relay.damus.io) require NIP-42 authentication for
advanced features like NIP-77 negentropy sync
- Connections previously used random ephemeral keys, providing no
persistent identity
- Other relays can now recognize us by pubkey for reputation-based rate
limiting
- Ensures consistency between NIP-11 pubkey and authentication key
Changes:
- Config: relay_owner_nsec with auto-load/generate from .relay-owner.nsec
- NIP-11: Pubkey derived from nsec instead of separate npub field
- Sync: RelayConnection now uses operator keys for NIP-42 auth
- Docs: Updated README, .env.example, and added .relay-owner.nsec to gitignore
Key Features:
- Auto-generates key on first run and saves to .relay-owner.nsec
- Loads existing key from file on subsequent runs
- Can override via CLI flag or environment variable
- Enables reputation building across relay network
- Future-ready for event signing and WoT calculations
Testing:
- 225/232 tests passing (7 pre-existing purgatory failures unrelated)
- Verified key generation, loading, and NIP-11 derivation
- Release build successful
Related: work/sync-debug-analysis.md, work/relay-owner-nsec-implementation.md
|
|
Add automatic pagination support for non-Negentropy historic sync to handle
large result sets efficiently. When a subscription receives >= 75 events,
the system automatically fetches the next page using the 'until' parameter.
Changes:
- Add PaginationState struct to track event counts and min timestamps
- Add pagination_state HashMap to PendingBatch for per-subscription tracking
- Add PAGINATION_THRESHOLD constant (75 events)
- Pass pending_sync_index to event processor for state updates
- Track events and timestamps as they arrive
- Check threshold on EOSE and launch follow-up subscriptions
- Initialize pagination state when creating historic sync subscriptions
- Update test fixtures in algorithms.rs
The pagination continues recursively until a page returns fewer than 75 events,
ensuring complete historic data retrieval without overwhelming relay limits.
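The bookkeeping described above can be sketched as follows. `PaginationState` and `PAGINATION_THRESHOLD` are names from the message, but the field names, `record_event`, and `next_until` are assumptions made for illustration.

```rust
/// Page size at which another page is assumed to exist.
pub const PAGINATION_THRESHOLD: usize = 75;

/// Per-subscription counters; field names are hypothetical.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct PaginationState {
    pub event_count: usize,
    pub min_created_at: Option<u64>,
}

impl PaginationState {
    /// Track every event as it arrives on the subscription.
    pub fn record_event(&mut self, created_at: u64) {
        self.event_count += 1;
        self.min_created_at = Some(match self.min_created_at {
            Some(current) => current.min(created_at),
            None => created_at,
        });
    }

    /// On EOSE: Some(until) if a follow-up subscription should be launched.
    /// The oldest timestamp is reused as-is. NIP-01 treats `until` as
    /// inclusive, so the next page can repeat events at that timestamp and
    /// the receiver has to de-duplicate them.
    pub fn next_until(&self) -> Option<u64> {
        if self.event_count >= PAGINATION_THRESHOLD {
            self.min_created_at
        } else {
            None
        }
    }
}
```

A page with fewer than 75 events yields `None`, which is what ends the chain of follow-up subscriptions.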
|
|
Separated connection from subscription logic. The RelayConnection.connect()
method now only handles WebSocket connection establishment. Subscriptions
are managed separately via handle_connect_or_reconnect.
Changes:
- Renamed RelayConnection::connect_and_subscribe() to connect()
- Removed subscription logic from connect method
- Updated call site in try_connect_relay()
- Removed unused build_announcement_filter import
|
|
The system was incorrectly treating subscription-specific CLOSED messages
as connection-wide disconnects, causing live subscriptions to be terminated
immediately after historic_sync completed.
Two bugs fixed:
1. relay_connection.rs: Removed break on RelayMessage::Closed - it's
subscription-specific, not connection-wide
2. mod.rs: Removed disconnect handling for RelayEvent::Closed - only log
at DEBUG level and continue
All 41 sync tests now pass including previously failing live sync tests.
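The difference between the two scopes can be shown with a reduced message loop. The enum below is a simplified stand-in, not nostr-sdk's `RelayMessage`, and `run_message_loop` is a hypothetical name.

```rust
/// Simplified stand-in for the messages a relay connection can deliver.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RelayMessage {
    Event { subscription_id: String },
    /// CLOSED ends one subscription only.
    Closed { subscription_id: String },
    /// The connection itself is gone.
    ConnectionLost,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct LoopSummary {
    pub events: usize,
    pub closed_subscriptions: Vec<String>,
}

/// Process messages until the connection is lost. A CLOSED message is
/// recorded and the loop keeps going, so other subscriptions stay alive.
pub fn run_message_loop<I>(messages: I) -> LoopSummary
where
    I: IntoIterator<Item = RelayMessage>,
{
    let mut summary = LoopSummary::default();
    for message in messages {
        match message {
            RelayMessage::Event { .. } => summary.events += 1,
            RelayMessage::Closed { subscription_id } => {
                // Before the fix, the equivalent branch broke out of the loop.
                summary.closed_subscriptions.push(subscription_id);
            }
            RelayMessage::ConnectionLost => break,
        }
    }
    summary
}
```

With this shape, closing the historic subscription no longer stops events arriving on the live one.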
|
|
Main lib (src/):
- Add #[allow(dead_code)] for build_info field (stored to prevent Prometheus unregistration)
- Add #[allow(dead_code)] for first_seen field (reserved for future rate limiting)
- Replace .or_insert_with(RelaySyncNeeds::default) with .or_default()
- Replace manual div_ceil implementations with .div_ceil(100)
Test code (tests/):
- Replace .expect(&format!(...)) with .unwrap_or_else(|_| panic!(...))
- Remove needless borrows in fetch_metrics() calls
- Add #[allow(dead_code)] and #[allow(unused_imports)] to test helpers module
grasp-audit:
- Apply cargo fmt to fix formatting
|
|
Replace EOSE-based sync completion with negentropy reconciliation for:
- Initial connect (fresh sync)
- Daily sync (Layer 1 announcements)
- Stale reconnect (>15 min)
Key changes:
- Add NegentropySyncResult struct with remote_only, local_only, received fields
- Add supports_negentropy() using try-and-fallback approach
- Add negentropy_sync_filter() using nostr-sdk client.sync() API
- Modify handle_connect_or_reconnect() to use negentropy for fresh/stale sync
- Modify daily_sync() to use negentropy for Layer 1
- Single-warning logging per relay when negentropy fails
Quick reconnects (<15 min) are unchanged and still use REQ with a since filter.
If negentropy is unsupported, the sync gracefully falls back to the REQ+EOSE flow.
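The choice between the sync paths can be summarized in a small function. This is a sketch: `SyncStrategy` and `choose_strategy` are hypothetical names, and treating a gap of exactly 15 minutes as stale is an assumption, since the message only specifies ">15 min" and "<15 min".

```rust
use std::time::Duration;

/// Gap after which a reconnect is treated as stale.
pub const STALE_AFTER: Duration = Duration::from_secs(15 * 60);

#[derive(Debug, PartialEq, Eq)]
pub enum SyncStrategy {
    /// NIP-77 reconciliation (fresh connect or stale reconnect).
    Negentropy,
    /// Quick reconnect: REQ with a `since` filter.
    ReqSince,
    /// Negentropy unavailable: full REQ+EOSE flow.
    ReqEoseFull,
}

/// `offline_for` is None on the first connect to a relay.
pub fn choose_strategy(offline_for: Option<Duration>, negentropy_supported: bool) -> SyncStrategy {
    match offline_for {
        // Short gaps never use negentropy, supported or not.
        Some(gap) if gap < STALE_AFTER => SyncStrategy::ReqSince,
        _ if negentropy_supported => SyncStrategy::Negentropy,
        _ => SyncStrategy::ReqEoseFull,
    }
}
```

The daily Layer 1 sync would take the same negentropy-or-fallback branch; it is left out here to keep the sketch short.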
|
|
nostr-sdk 0.44's Relay::new() is pub(crate), making it impossible to
construct a Relay directly from outside the crate. Relays can only be
created through Client::add_relay() or RelayPool::add_relay().
This commit:
- Adds 'Why Client instead of Relay directly?' section to struct docs
- Updates run_event_loop() docs to explain the API constraint
- Removes outdated 'Future Refactoring' suggestion (not feasible)
|
|
Replace the 1-second polling loop with nostr-sdk's relay-level notification
system that provides immediate disconnect detection via RelayNotification::RelayStatus.
Key changes:
- Use relay.notifications() instead of client.notifications()
- Handle RelayNotification::RelayStatus { Disconnected | Terminated } to detect
connection loss immediately without polling
- Remove tokio::select! with interval timer - now uses simple match loop
- Handle additional notification types (Authenticated, AuthenticationFailed)
Why this is better:
- Event-driven vs polling: no wasted CPU cycles checking every second
- Immediate detection: disconnect triggers notification instantly
- Uses nostr-sdk's built-in mechanism that was previously inaccessible at pool level
(RelayStatus notifications are filtered out in RelayPoolNotification)
Technical note: RelayNotification::RelayStatus is only available via
Relay::notifications(), not Client::notifications(), because the pool-level
broadcast filters out status change events.
Future refactoring opportunity: Consider restructuring RelayConnection to hold
a Relay directly instead of wrapping a Client, since we only manage one relay
per connection anyway.
|
|
- Add periodic health check in RelayConnection::run_event_loop that polls
nostr-sdk's relay.is_connected() every second to detect dead connections
- When event channel closes without explicit Closed/Shutdown, send
DisconnectNotification to SyncManager (fixes case where TCP drops silently)
- Enable test_relay_connected_status test which validates the
ngit_sync_relay_connected metric correctly reflects connection state
The issue was that when a remote relay stops abruptly, nostr-sdk's
notification receiver blocks indefinitely waiting for data. TCP disconnect
detection without keepalive can take minutes. The health check polls
nostr-sdk's internal relay status which detects disconnection promptly.
|
|
Changes:
- Fix connection attempt metrics: record success/failure based on actual
connection result instead of pre-emptively recording failure
- Add health tracker integration on connection failure: call
record_failure() and record_health_state() in error path
- Add connection verification in relay_connection.rs: wait 500ms after
connect() then verify is_connected() to detect silent failures
- Add configurable disconnect check interval via
NGIT_SYNC_DISCONNECT_CHECK_INTERVAL_SECS env var
- Update TestRelay with fast test settings: startup_delay=0, jitter=0,
disconnect_check_interval=1s
- Add debug output to metrics tests for investigation
Note: Tests may still fail due to the 5-second base backoff in the health tracker.
A follow-up task will add an NGIT_SYNC_BASE_BACKOFF_SECS config parameter
to allow faster test cycles.
Related: metrics-wiring-plan.md Tasks 1 & 2
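Two of the changes above lend themselves to a short sketch: reading the interval from the environment, and recording the metric from the actual connection result. The env var name comes from the message; everything else (`parse_interval`, `ConnectionMetrics`, rejecting a zero interval, the caller-supplied default) is an assumption.

```rust
use std::time::Duration;

pub const DISCONNECT_CHECK_ENV: &str = "NGIT_SYNC_DISCONNECT_CHECK_INTERVAL_SECS";

/// Parse a whole number of seconds; fall back to `default` when the value
/// is missing, not a number, or zero (assumed to be invalid).
pub fn parse_interval(raw: Option<&str>, default: Duration) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(default)
}

/// Read the interval from the process environment.
pub fn disconnect_check_interval(default: Duration) -> Duration {
    let raw = std::env::var(DISCONNECT_CHECK_ENV).ok();
    parse_interval(raw.as_deref(), default)
}

/// Hypothetical counters standing in for the real metrics.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ConnectionMetrics {
    pub successes: u64,
    pub failures: u64,
}

impl ConnectionMetrics {
    /// Record after the attempt finishes, based on what actually happened,
    /// rather than recording a failure up front.
    pub fn record_attempt<E>(&mut self, result: &Result<(), E>) {
        match result {
            Ok(()) => self.successes += 1,
            Err(_) => self.failures += 1,
        }
    }
}
```

Splitting the parsing from the environment lookup keeps the parsing testable without touching process-wide state.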
|