| Age | Commit message (Collapse) | Author |
|
Fix pre-existing clippy lints:
- &PathBuf -> &Path in audit_cleanup.rs
- too_many_arguments on process_newly_available_git_data,
process_purgatory_announcements, and HttpService::new
- clone_on_copy for PublicKey (Copy type) in purgatory cleanup loop
|
|
|
|
Remove the redundant inline kind-30617 registration block from the sync
event loop and the three is_generic/recompute_new_sync_filters_for_relay
calls from confirm_batch error paths. The purgatory announcement sync
timer (run_purgatory_announcement_sync) is now the sole registration path.
Consolidate NGIT_SYNC_BATCH_WINDOW_MS and NGIT_PURGATORY_SYNC_INTERVAL_MS
into a single NGIT_TEST=1 flag that sets both timers to 200ms, replacing
two ad-hoc env vars with one reusable test-mode flag.
|
|
Instead of threading repo_sync_index through PolicyContext/builder.rs/main.rs
to handle user-submitted purgatory announcements, add a simple background
timer (run_purgatory_announcement_sync, every 5s) that scans the purgatory
for announcement entries and registers them in repo_sync_index as StateOnly.
This is simpler and covers both flows:
- Sync-path announcements: inline registration still happens during event
processing (sync/mod.rs:1839+), timer provides a safety net
- User-submitted announcements: SelfSubscriber never sees them (rejected
from DB), timer is the primary registration path
The timer calls sync_purgatory_announcements_to_index() which:
1. Snapshots purgatory via new announcements_for_sync() public method
2. Or_inserts StateOnly entries (never downgrades Full entries)
3. Detects newly added relay URLs and calls handle_new_sync_filters to
connect and subscribe - fixing the failing test that expected relay
discovery from a user-submitted purgatory announcement
Removes: repo_sync_index field from PolicyContext, set/get_repo_sync_index
methods, set_repo_sync_index on Nip34WritePolicy, wiring in main.rs, and
the inline AcceptPurgatory registration block in builder.rs.
|
|
negentropy fallback
Three targeted fixes for purgatory announcement sync:
1. SelfSubscriber sync_level upgrade: After or_insert_with in process_batch,
always set entry.sync_level = SyncLevel::Full so that when a promoted
announcement is broadcast via notify_event and SelfSubscriber receives it,
an existing StateOnly entry gets upgraded to Full and PR event subscriptions
are triggered immediately (not delayed up to 24h).
2. Negentropy fallback filter split: In handle_eose, when falling back from
negentropy to REQ+EOSE, split batch_repos by SyncLevel and call
build_sync_level_aware_filters instead of build_layer2_and_layer3_filters.
Prevents StateOnly (purgatory) repos from getting Layer 2 #a/#A/#q filters
prematurely, which caused nostr-sdk client deduplication to permanently
drop PR events after orphan rejection.
3. Recompute sync filters after announcement batch EOSE: Add
recompute_new_sync_filters_for_relay calls at all three batch-completion
paths in handle_eose for generic filter (announcement) batches. This
triggers state-only subscriptions for any purgatory repos registered during
that batch, fixing the 24h delay before state event sync starts.
4. User-submitted purgatory announcements: Add repo_sync_index field to
PolicyContext with setter/getter, wire in main.rs after SyncManager
creation, and register in AcceptPurgatory handler so user-submitted
announcements get StateOnly sync started immediately.
5. Update archive tests: test_archive_without_state_events_does_not_sync_git
updated to reflect that StateOnly subscription now proactively fetches
state events from source relays. test_archive_read_only_creates_bare_repo
un-ignored as it now works end-to-end.
|
|
premature PR event delivery"
This reverts commit 806936e7d1aab5dfd0c2ad6b98a115122dc1785c.
|
|
premature PR event delivery
StateOnly repos in a pending batch had their repo IDs included in the
negentropy REQ+EOSE fallback, which called build_layer2_and_layer3_filters.
This generated #a/#A/#q tag filters for repos whose announcements were
still in purgatory (not yet promoted to the database).
When the remote relay responded with PR events matching those filters,
the write policy correctly rejected them as 'orphan' (no accepted repo
in DB yet). However, nostr-sdk's client-level deduplication then silently
dropped the same event on all subsequent deliveries, making it permanently
unavailable even after the announcement was promoted.
Fix: split batch_repos into full vs state-only by consulting repo_sync_index
at fallback time, then call build_sync_level_aware_filters which only
generates #a/#A/#q filters for Full repos. StateOnly repos only get
the kind 30618 + #d filter they were originally subscribed with.
|
|
Purgatory announcements need state events (kind 30618) synced from
external relays, but not full L2/L3 events (patches, issues, PRs)
which would be rejected anyway. This implements the SyncLevel concept
from the design doc (decision #6):
- Add SyncLevel enum (Full vs StateOnly) to RepoSyncNeeds
- When announcement enters purgatory during sync, register in
RepoSyncIndex with SyncLevel::StateOnly
- Add build_sync_level_aware_filters() that partitions repos by level:
StateOnly repos only get state event filters (kind 30618)
- Update derive_relay_targets to track state_only_repos separately
- Update compute_actions to handle both repo sets
- SelfSubscriber always uses SyncLevel::Full (promoted repos)
|
|
The partial fix treating ProcessResult::Purgatory as confirmed in
pending_sync_index would trigger full L2/L3 sync for purgatory
announcements. Per design (decision #6), purgatory announcements
should only sync state events via SyncLevel::StateOnly (not yet
implemented).
Ignore test_archive_read_only_creates_bare_repo until SyncLevel
is implemented in Phase 3.
|
|
Route new announcements to purgatory instead of accepting immediately.
Announcements are promoted to the database when git data arrives,
ensuring we only serve announcements for repos with actual content.
Implemented:
- AnnouncementPurgatoryEntry type and DashMap store
- Route new announcements to purgatory (replacement announcements skip)
- Promote announcements on git data arrival (process_purgatory_announcements)
- Authorization checks purgatory announcements (fetch_repository_data_with_purgatory)
- State policy uses purgatory announcements for maintainer validation
- Cleanup task handles announcement expiry
- Updated count()/cleanup() to 3-tuples
Known broken:
- test_archive_read_only_creates_bare_repo fails: sync module does not
treat purgatory announcements as confirmed repos, so per-repo sync
(state events, PRs) is never triggered for purgatory announcements
- Announcement persistence (save/restore) not implemented
- SyncLevel (StateOnly vs Full) not implemented
- Soft expiry two-phase not implemented
- Expiry extension on state event / git auth not wired up
|
|
- Derive Default for config structs instead of manual impl
- Fix doc comment formatting in ArchiveConfig::matches
- Collapse nested if statement in validate_announcement
- Allow too_many_arguments for SyncManager::new
|
|
This merge includes critical bug fixes and comprehensive migration tooling
developed during the relay.ngit.dev migration effort.
Bug Fixes:
- Fix git protocol error handling to return HTTP 200 with ERR pkt-line
- Fix naughty list false positives and DNS failure identification
- Fix database query filters in load_existing_events (remove .since())
- Fix OID fetch tracking to distinguish 0 OIDs from successful fetches
- Fix purgatory event source tracking for filtered expiry logging
- Implement OID retry logic for 'not our ref' errors
Migration Tools & Documentation:
- Complete 5-phase migration analysis pipeline with orchestration script
- Phase 1: Event fetching from source relay
- Phase 2: Git sync verification
- Phase 3: Categorization and relay comparison
- Phase 4: Log extraction (parse failures, purgatory expiry)
- Phase 5: Action classification for migration decisions
- Comprehensive migration guide with lessons learned
- Troubleshooting guide for permission and corruption issues
Configuration:
- Add NGIT_LOG_LEVEL configuration option
- Update git throttle limits to 60/minute
- Improve logging throughout for better observability
|
|
Change protocol error detection to only match WebSocket-specific errors
(websocket, invalid frame) instead of generic 'protocol' keyword which
was incorrectly catching transient git protocol errors.
Git protocol errors like 'fatal: protocol error: bad line length' are
transient network issues that should use backoff/retry, not permanent
naughty list blocking. Only WebSocket/Nostr protocol violations indicate
persistent infrastructure problems.
Fixes production false positive:
- relay.ngit.dev: git protocol error + remote warning misclassified
Add production test cases for git protocol errors and warning combinations.
|
|
Strip URLs (http://, https://, git://, ws://, wss://) from error messages
before classification to prevent false positives from repository names,
paths, or identifiers containing keywords like 'ssl', 'certificate', etc.
- Add strip_urls() function to remove URLs before pattern matching
- Add WebSocket protocol support (ws://, wss://) for relay errors
- Filter remote warnings that don't indicate infrastructure problems
- Use more specific SSL/TLS patterns to avoid npub substring matches
- Reduce test suite from 40 to 13 tests, keeping only edge cases
Fixes false positives seen in production:
- git.shakespeare.diy: 'repository not found' with npub containing 'ssl'
- relay.ngit.dev: HTTP 500 error with npub containing 'ssl'
- gitnostr.com: remote permission warning misclassified as protocol error
|
|
failures
|
|
load_existing_events()
Root cause: `last_connected` was set to Timestamp::now() BEFORE
load_existing_events() was called (line 425), causing the database
query to filter out all existing events with .since(current_time).
The query became: SELECT * FROM events WHERE created_at >= <now>
Result: 0 events returned (nothing has created_at in the future)
Solution: Remove .since() filter from database queries entirely.
The `last_connected` field is now only used for WebSocket subscription
filters to avoid re-fetching events from remote relays on reconnect.
Rationale for this approach over reordering operations:
- Database queries are fast (indexed by kind and created_at)
- Loading all events on startup ensures consistency
- Eliminates subtle ordering dependency that could break in refactoring
- Cleaner mental model: database = full load, WebSocket = incremental
This fixes the issue where ~190 state events weren't being fetched
after deploying the database query fix (commit 4162c90).
Evidence: Production logs showed "Loaded announcements from database
count=0" when there should have been hundreds of announcements.
|
|
Previously, SelfSubscriber only saw events returned by the WebSocket
subscription to the local relay, which has limits on the number of
events returned. This caused repos with announcements in the database
to never get Layer 2/3 filters created, resulting in missing state events.
Now, on startup, we query the database directly with two separate queries:
1. Query announcements (30617) to populate repo_sync_index
2. Query root events (1617/1618/1621) to create Layer 3 filters
Both queries use .since(last_connected) if available for incremental
loading on reconnect.
Filters are created inline and made mutable to support the .since()
clause, rather than using a shared create_event_filter() method.
Fixes the issue where state events were missing for repos like cashbird
and creative-space that had announcements in the database but weren't
returned by the WebSocket subscription.
|
|
caught a production bug where npub in url string contained "dns"
triggering false positive
|
|
Refactor internal code to use the mark_negentropy_unsupported() method
instead of direct field access for improved readability.
|
|
When negentropy retry makes no progress (relay returns zero events),
this indicates the relay's negentropy implementation is broken. Instead
of marking the batch as failed, we now:
1. Mark the relay as not supporting NIP-77 so future batches skip
negentropy and use REQ+EOSE directly
2. Fall back to REQ+EOSE using semantic filters (kind/author/tags)
for the current batch, which may succeed where ID-based queries fail
This addresses the issue where some relays (e.g., azzamo.net, snort.social)
return event IDs during negentropy diff but fail to serve those events
when requested by ID.
|
|
shutdown/startup
Implement save/restore functionality for rejected events cache and
integrate persistence with relay shutdown/startup lifecycle. Both
purgatory and rejected cache now survive relay restarts.
Key features:
- Serialize rejected events cache to JSON (rejected-events-cache.json)
- Save both hot cache (2min, full events) and cold index (7day, metadata)
- Restore with downtime adjustment (preserves remaining TTL)
- Graceful degradation (missing/corrupted files don't crash)
- File cleanup after successful restore
- Automatic restoration in SyncManager::new()
Integration:
- Shutdown hook saves both purgatory and rejected cache
- Startup hook restores both and re-queues repositories
- Non-fatal errors (logs warnings, continues on failure)
Files:
- src/sync/rejected_index.rs: save_to_disk/restore_from_disk methods
- src/sync/mod.rs: SyncManager integration and auto-restore
- src/main.rs: Shutdown/startup hooks for both caches
- tests/purgatory_persistence.rs: 17 integration tests
Tests: 13 unit tests + 17 integration tests covering full lifecycle
|
|
The bug: SelfSubscriber filtered announcements with lists_our_relay() check,
preventing archive_all mode from discovering relays in announcements that
don't list our relay domain.
The insight: SelfSubscriber only receives events that ALREADY passed
write policy validation (archive_all, archive_whitelist, blacklist, etc.)
via admit_event() before being saved to the database. The event flow:
External relay → process_event_static() → write_policy.admit_event()
→ (validation happens here) → save to DB → notify_event()
→ SelfSubscriber receives via WebSocket
So the lists_our_relay() check was redundant double-validation that
broke archive_all mode by filtering events that had already been
accepted by the write policy.
The fix: Simply remove the lists_our_relay() filtering. Events reaching
SelfSubscriber are pre-validated and should all be processed for relay
discovery according to the configured archive policy.
Changes:
- Removed lists_our_relay() check from process_notification() (4 lines)
- Removed unused lists_our_relay() helper function (9 lines)
- Added comment explaining events are pre-validated (3 lines)
- Total: 13 lines removed, 3 lines added
Fixes #194d
|
|
Add comprehensive comment explaining why some relays (azzamo.net,
snort.social) return zero events during negentropy retry even when
they have the events. Documents infinite loop prevention logic and
suggests future REQ+EOSE fallback strategy.
|
|
Implement domain-level naughty list tracking for git remotes, reusing the
existing NaughtyListTracker from relay sync. This prevents repeated attempts
to fetch from git domains with persistent infrastructure issues (SSL/TLS
certificate errors, DNS failures).
Changes:
- Updated NaughtyListTracker to track both relay URLs and git domains
- Added git_naughty_list field to RealSyncContext for error classification
- Modified fetch_oids() to classify git fetch errors and record naughty domains
- Updated sync_identifier_next_url() to filter out naughty domains during URL selection
- Added git_naughty_list parameter to ThrottleManager for domain queue processing
- Threaded naughty list through start_sync_loop and all sync functions
- Updated all tests to pass naughty list parameter
The naughty list uses 12-hour expiration (configurable) to allow domains to
recover from infrastructure issues. First occurrence logs WARN, repeats log DEBUG.
|
|
When negentropy sync fails (one or more filters fail during diff), the
code previously left a pending batch and returned early, preventing any
sync from happening. This caused the "No sync targets found" issue.
Changes:
- Track negentropy success with a boolean flag
- On negentropy failure: clean up pending batch and fall through to REQ+EOSE
- Log the fallback at info level for visibility
- Restructure control flow so REQ+EOSE path executes after negentropy failure
This ensures sync always completes using traditional REQ+EOSE when
NIP-77 negentropy is unavailable or fails.
|
|
Add naughty list tracking for relays with persistent infrastructure issues
(DNS failures, TLS certificate errors, protocol violations) to reduce log
noise and provide better visibility via metrics.
Key features:
- Classify errors into naughty (persistent) vs transient (temporary)
- Track naughty relays with category, reason, and occurrence count
- Log WARN on first naughty occurrence, DEBUG on repeats
- Automatic expiration after 12 hours (configurable)
- Prometheus metrics for monitoring naughty relays by category
- Periodic cleanup task integrated with health checker
Components added:
- src/sync/naughty_list.rs: Core naughty list tracker with error classification
- NaughtyListTracker integration in RelayHealthTracker
- Connection error handling updates in sync manager
- Naughty list metrics (total by category, detailed info per relay)
- Config option for naughty_list_expiration_hours (default: 12)
Closes DNS lookup failures and TLS certificate errors tracking issues.
|
|
Live subscriptions (limit:0, no auto-close) are not tracked in
PendingBatch because they stay open indefinitely for new events.
When they receive EOSE (immediately, since no historic events),
handle_eose can't find them in outstanding_subs.
This is expected behavior, not an error. Changed log level from
warn to trace to reduce noise.
Observed in production logs: sync_live() subscriptions with limit:0
complete immediately and trigger this path.
Issue: work/active-issues/eose-unknown-subscription.md
|
|
Removes kind 30618 (state events) from Layer 1 announcement filter and
adds targeted subscriptions using #d (identifier) tags in Layer 2.
Problem: Layer 1 was receiving ALL state events from all relays,
causing 1000+ rejections for repositories we don't host.
Solution:
- Remove Kind::RepoState from build_announcement_filter (Layer 1)
- Add state_event_filters_for_our_repos() function that creates filters
with kind 30618 and #d tags for only our hosted repo identifiers
- Integrate state filters into build_layer2_and_layer3_filters
- Extract unique identifiers from repo refs and batch by 100 per filter
Benefits:
- Dramatically reduces bandwidth and rejection noise (1000+ → ~0)
- More efficient: one filter with multiple identifiers vs broadcast
- Only receive state events for repositories we actually care about
Resolves: work/active-issues/layer1-state-event-oversubscription.md
|
|
Previously, when a relay didn't support NIP-77, the negentropy_sync_diff
function would wait for the full client.sync() timeout even after receiving
a NOTICE message that marked the relay as not supporting NIP-77.
This change uses tokio::select! to race the sync operation against a
polling task that checks the nip77_supported flag every 10ms. When a NOTICE
is received (detected in the message handler), the poll task detects the
status change and immediately returns an error, allowing quick fallback to
REQ+EOSE without waiting for timeouts.
Benefits:
- Fast failure (within 10ms) when relay sends NIP-77 NOTICE
- No artificial timeout reduction that could hurt legitimate operations
- Maintains full timeout for relays that actually support NIP-77
|
|
When negentropy sync times out or has other failures, it now properly
returns Err() instead of Ok() with empty reconciliation. This ensures
historic_sync increments failed_count and triggers fallback to REQ+EOSE
instead of treating it as a successful sync with 0 events.
Resolves issue where bootstrap relay timeouts were marked as complete
instead of falling back to traditional sync.
|
|
Change relay NOTICE logging from DEBUG to TRACE level to avoid
duplicate logs (nostr-sdk already logs all NOTICEs at DEBUG level).
Negentropy-specific NOTICEs remain at INFO level as they indicate
important NIP-77 support information.
|
|
EOSE messages can arrive after batch completion due to:
1. Late/duplicate EOSE from relay (e.g., live_sync REQ subscriptions)
2. Race condition between batch confirmation and EOSE arrival
3. EOSE during intentional disconnect cleanup
Since this is expected behavior, downgrade from debug to trace level
to reduce log noise. Added detailed code comment explaining the
scenarios and suggesting how to investigate if needed (tracking
recently-completed subscription IDs).
Resolves issue where duplicate EOSE from live_sync subscriptions
appeared as confusing 'unknown relay' debug messages.
|
|
Previously, disconnect_relay() would immediately remove RelayState and
pending batches before the event loop finished draining messages. This
caused confusing 'unknown relay' debug messages for EOSE and other
events that arrived after state removal but were expected during
normal shutdown.
Changes:
- Add ConnectionStatus::Disconnecting to track intentional disconnects
- disconnect_relay() now marks relay as Disconnecting (keeps state)
- Event loop drains messages while state exists
- handle_disconnect() detects intentional vs unexpected disconnects:
- Intentional: Completes cleanup by removing state/connections
- Unexpected: Updates to Disconnected, keeps connection for retry
- handle_eose() suppresses logs for Disconnecting relays (TRACE level)
- check_disconnects() skips relays already in Disconnecting state
This ensures proper sequencing: mark->drain->cleanup instead of
remove->drain->confusion. Fixes the root cause instead of just
hiding log messages.
|
|
- Upgrade NOTICE log level to INFO when relay rejects negentropy (envelope/NEG- errors)
- Track NIP-77 support status per relay connection to avoid repeated failed attempts
- Mark relay as unsupported when NOTICE rejection or timeout occurs
- Skip negentropy on subsequent syncs during same connection session
- Reset support status on reconnect to allow retry after relay upgrades
This reduces log noise and eliminates 10-second timeout delays on each historic
sync attempt for relays that don't support NIP-77 negentropy.
Fixes negentropy-timeout-10-seconds issue by learning from relay behavior.
|
|
The bootstrap relay was being registered with is_bootstrap=false, causing
it to be disconnected when empty. This change adds an is_bootstrap parameter
to register_relay() and passes true when registering the bootstrap relay.
The existing check_disconnects() logic already skips bootstrap relays,
but the flag was never being set correctly.
|
|
When bootstrap sync completes with zero announcements, users may not
know if this is expected or indicates a configuration problem (wrong
domain or wrong bootstrap relay).
Changes:
- Add INFO-level message after bootstrap announcement sync completes
- If zero announcements: suggest verifying domain/relay configuration
- If announcements found: report count for user awareness
- Only applies to bootstrap relay (is_bootstrap flag)
This helps users quickly diagnose configuration issues during initial
setup and testing.
Discovered via production sync testing against wss://git.shakespeare.diy
|
|
Negentropy diff timeouts are expected when relays don't support NIP-77.
The relay responds with NOTICE 'unknown envelope label' and the timeout
is hit before we recognize this is unsupported rather than a failure.
Changes:
- Downgrade from warn! to debug! in negentropy_sync_filter()
(src/sync/relay_connection.rs:493)
- Add comment explaining timeouts are common for non-NIP-77 relays
- Update message to clarify timeout typically means no NIP-77 support
The existing fallback mechanism (lines 505-509) properly handles this
case and logs a one-time warning about falling back to REQ+EOSE.
Discovered via production sync testing against wss://git.shakespeare.diy
|
|
During relay disconnect, EOSE messages may arrive after the relay has
been removed from pending_sync_index. This creates a benign race
condition that was logged as a warning.
Changes:
- Downgrade from warn! to debug! in handle_eose() (src/sync/mod.rs:632)
- Add clarifying comment explaining this occurs during disconnect
- Update message to indicate this is expected behavior
Discovered via production sync testing against wss://git.shakespeare.diy
|
|
Remove rejected_states_index and use single rejected_events_index for both
announcement and state events. Extract duplicate re-processing logic into
a consolidated helper function.
Changes:
- Eliminate duplicate RepositoryAnnouncement::from_event() call
- Remove rejected_states_index field from SyncManager
- Update cleanup loop to process both event types via single index
- Add ReprocessingStats struct to track re-processing outcomes
- Add reprocess_events_from_hot_cache() helper that handles:
- Logging re-processing attempts with context
- Calling process_event_static recursively
- Tracking saved/duplicate/purgatory/rejected counts
- Replace three nearly-identical re-processing loops with helper calls
Consolidates phases 1, 5, and 6 of rejected events index refactoring.
|
|
Replace duplicate metrics methods (announcements vs states) with unified
methods using IntGaugeVec/IntCounterVec with an event_type label:
- update_rejected_hot_cache_size(event_type, size)
- record_rejected_hot_cache_hit(event_type)
- record_rejected_hot_cache_miss(event_type)
- record_rejected_hot_cache_expired(event_type, count)
- update_rejected_cold_index_size(event_type, size)
- record_rejected_cold_index_expired(event_type, count)
- record_rejected_invalidation(event_type, count)
Prometheus labels remain separate (event_type="announcement" vs
event_type="state") but implementation is now unified.
Phase 4 of rejected events index refactoring.
|
|
Add EventType enum (Announcement, State) to distinguish event types within
RejectedEventsIndex. This consolidates the two-tier index design into a
single unified interface.
Changes:
- Add EventType enum with Announcement and State variants
- Add event_type field to HotCacheEntry and ColdIndexEntry
- Create unified invalidate_and_get() with optional event_type filter
- Update cleanup_expired_for_type() to handle both types
- Remove deprecated wrapper methods (invalidate_and_get_events,
invalidate_and_get_state_events, cleanup_expired, cleanup_states_expired)
Consolidates phases 2, 3, and 7 of rejected events index refactoring.
|
|
|
|
Replace PR-specific references (PR3, PR4.1, PR4.2) with problem-focused
documentation that explains what the code does and why.
Changes:
- Maintainer re-processing: Explain race condition handling
- State event re-processing (announcement): Clarify timing issue
- State event re-processing (state): Describe multi-event scenario
Why: PR numbers are ephemeral and meaningless to future readers.
Comments should explain the problem being solved, not when code was added.
All tests pass: 248 library tests passing
|
|
**Problem:**
Integration test `test_concurrent_state_and_pr_sync` was timing out because
of a race condition: when syncing from remote relays, state events can arrive
BEFORE their announcements (no ordering guarantee). The system was rejecting
these state events with "no announcement exists" but NOT tracking them for
re-processing when the announcement later arrived.
**Solution:**
Implemented announcement → state event re-processing (GRASP-02 PR4.1) to
handle the race condition, mirroring the existing maintainer announcement
re-processing logic (GRASP-02 PR3).
**What Changed:**
1. **Announcement → State Event Re-processing (GRASP-02 PR4.1)**: When a
repository announcement is accepted, the system now invalidates and
re-processes state events that were rejected with "no announcement exists".
This ensures state events arriving before their announcements are eventually
processed correctly.
2. **State Event → State Event Re-processing (GRASP-02 PR4.2)**: When a state
event is accepted (git data arrives), the system invalidates and re-processes
other rejected state events for the same repository from the hot cache.
(Renamed from PR4 for clarity - this was already implemented in previous commit)
3. **Proper Rejection Tracking**: Extended rejection reason detection to include
"no announcement exists" and "not authorized" messages, ensuring these state
events are properly tracked in the rejected events index for re-processing.
4. **Proper State Event Metrics**: State events now use `add_state()` instead
of `add_announcement()` when rejected, ensuring correct metrics tracking.
5. **Removed Redundant Field**: Removed `event_id` field from `ColdIndexEntry`
since it's already stored as the HashMap key. This eliminates dead code while
preserving the cold index's core purpose: preventing re-fetch of rejected
events during negentropy sync via `get_all_event_ids()`.
6. **Fixed Doc Test**: Changed doc test from `no_run` to `ignore` since it uses
undefined variables for illustration purposes.
7. **Fixed Clippy Warnings**:
- Added `#[allow(dead_code)]` for `reason` fields (reserved for future metrics)
- Fixed unused variable warning
- Collapsed nested if statement
**Why:**
The two-tier rejected events index was handling two scenarios:
- GRASP-02 PR3: Maintainer announcement arrives → re-process announcements
- GRASP-02 PR4.2: State event with git data arrives → re-process state events
But it was missing:
- GRASP-02 PR4.1: Repository announcement arrives → re-process state events
This created a race condition where state events arriving before their
announcements would be rejected and never re-processed.
**Implementation Details:**
The fix follows the same pattern as maintainer re-processing:
1. When announcement accepted, parse it to get pubkey + identifier
2. Call `invalidate_and_get_state_events()` to get rejected state events
3. Re-process each state event from hot cache using `process_event_static()`
4. Log results (Saved, Purgatory, Duplicate, or still rejected)
**Test Results:**
✅ All tests pass (578 total):
- 248 unit tests pass
- 330 integration tests pass (including the previously failing test)
- All clippy warnings fixed
- Doc tests pass
✅ Target test now passes consistently:
- `test_concurrent_state_and_pr_sync` completes in ~2.7s (was timing out at 30s)
**Impact:**
- Fixes race condition in sync ordering (state before announcement)
- No breaking changes - only adds re-processing capability
- Follows existing patterns - mirrors GRASP-02 PR3 maintainer re-processing
- Minimal code changes - ~86 lines added to handle new re-processing path
**Files Changed:**
```
src/sync/mod.rs | 86 +++++++++++++++++++++++++++++++++++++++++++++
src/sync/rejected_index.rs | 6 ++--
2 files changed, 87 insertions(+), 5 deletions(-)
```
Co-authored-by: Assistant <assistant@anthropic.com>
|
|
Add comprehensive authorization checks to ensure state events are only
accepted from maintainers of accepted repository announcements. This
implements the core GRASP-01 requirement that pushes must match the
latest state announcement "respecting the maintainer set."
Changes:
1. StatePolicy authorization (src/nostr/policy/state.rs):
- Check authorization BEFORE git data validation (fail-fast)
- Reject if no announcement exists for repository
- Reject if author not in maintainer set
- Use existing helpers: fetch_repository_data() and
pubkey_authorised_for_repo_owners()
- Structured logging for all rejections
2. Purgatory invalidation (src/nostr/builder.rs):
- New method: check_purgatory_state_events_for_identifier()
- Called when announcements accepted (Accept and AcceptMaintainer)
- Re-evaluates state events in purgatory for the identifier
- Processes newly-authorized events (releases from purgatory)
- Keeps unauthorized events for natural expiry (30 min)
- Enables retroactive authorization when announcements arrive late
3. Purgatory sync authorization (src/git/sync.rs):
- Check authorization BEFORE processing git data
- Remove unauthorized events from purgatory (permanent rejection)
- Prevents processing even if git data arrives first
- Structured logging for monitoring
4. Rejected events tracking (src/sync/rejected_index.rs):
- Add support for tracking rejected state events
- New methods: add_state(), contains_state()
- Separate metrics for state rejections
- Enables sync to avoid re-fetching rejected states
5. Sync metrics (src/sync/metrics.rs, src/sync/mod.rs):
- Add state-specific metrics (hot cache, cold index)
- Track rejected states separately from announcements
- Support monitoring of authorization rejections
6. Comprehensive tests (tests/state_authorization.rs):
- test_reject_state_without_announcement
- test_reject_state_from_unauthorized_author
- test_accept_state_from_announcement_author
- test_accept_state_from_maintainer
Security Impact:
- Before: State events could be published by anyone
- After: Only maintainers can publish state events
- Defense-in-depth: Authorization checked at 3 points:
1. On arrival (StatePolicy)
2. On announcement acceptance (purgatory re-evaluation)
3. On git data arrival (purgatory sync)
All tests pass:
- 248 unit tests
- 51 NIP-34 announcement tests
- 4 new state authorization tests
- 9 rejected index tests
Closes: State authorization requirement from GRASP-01 spec
|
|
Add automatic cleanup and Prometheus metrics for the two-tier rejected
events index that caches rejected announcements for re-processing.
Cleanup loops:
- Hot cache: Every 60 seconds (events expire after 2 minutes)
- Cold index: Every 24 hours (metadata expires after 7 days)
- Background task with graceful shutdown support
New Prometheus metrics (7):
- Gauges: hot_cache_current, cold_index_current
- Counters: hits, misses, hot_expired, cold_expired, invalidated
This completes the maintainer announcement re-processing feature,
reducing wait time from 24 hours to <1 second when a maintainer's
announcement arrives before the repository owner's announcement.
Memory is bounded through automatic cleanup, and comprehensive metrics
enable monitoring of hit rates, memory usage, and cleanup effectiveness.
Changes:
- src/sync/metrics.rs: Added 7 metrics with recording methods
- src/sync/rejected_index.rs: Added optional metrics support
- src/sync/mod.rs: Added cleanup background task
Tests: 248 library tests passing, 3 integration tests passing
|
|
- Add two-tier rejected events index (hot cache + cold index)
- Hot cache: 2-minute in-memory storage of full rejected events
- Cold index: 7-day metadata storage for deduplication
- Immediate re-processing when owner announcements list maintainers
- Fix rejection reason detection to match actual error messages
- Rewrite integration tests to use two-relay sync pattern
- All tests passing (3 passed, 1 ignored slow test)
|
|
Replaces the simple HashSet<EventId> with the sophisticated two-tier
RejectedEventsIndex from PR1, enabling future immediate re-processing
when maintainer dependencies resolve.
## Changes
### Config (src/config.rs)
- Add `rejected_hot_cache_duration_secs` (default: 120 = 2 minutes)
- Add `rejected_cold_index_expiry_secs` (default: 604800 = 7 days)
- Both configurable via CLI flags or environment variables
### SyncManager (src/sync/mod.rs)
**Type Change:**
- Before: `Arc<RwLock<HashSet<EventId>>>` (simple event ID set)
- After: `Arc<RejectedEventsIndex>` (two-tier storage)
**Initialization:**
- Pass config durations to RejectedEventsIndex::new()
- Creates hot cache (2 min) + cold index (7 days)
**Event Processing (process_event_static):**
- Extract identifier from 'd' tag
- Determine rejection reason from error message
- Call `add_announcement()` with full event + metadata
- Stores in both hot cache and cold index
**Negentropy Sync (derive_relay_targets):**
- Call `get_all_event_ids()` to get rejected IDs
- Returns union of hot cache + cold index event IDs
- Excludes from negentropy reconciliation
**Event Loop (relay_connection):**
- Use `contains()` method instead of direct HashSet access
- Simpler API, same skip-rejected behavior
### RejectedEventsIndex (src/sync/rejected_index.rs)
**New Method:**
- `get_all_event_ids()`: Returns HashSet<EventId> from both tiers
- Used for negentropy exclusion (replaces direct HashSet access)
### Tests Updated
**test_rejected_events_index_tracks_announcements:**
- Create RejectedEventsIndex with config durations
- Add 'd' tag to test announcement
- Use `add_announcement()` with full event
- Verify both hot cache and cold index populated
- Check lengths with `hot_cache_len()` and `cold_index_len()`
**test_rejected_events_excluded_from_negentropy:**
- Create RejectedEventsIndex instead of HashSet
- Build full event with 'd' tag
- Add to index with `add_announcement()`
- Get IDs with `get_all_event_ids()`
- Verify excluded from reconciliation
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ SyncManager │
│ │
│ rejected_events_index: Arc<RejectedEventsIndex> │
│ ├─ Hot Cache (2 min): Full events for re-processing │
│ └─ Cold Index (7 days): Metadata for dedup │
└─────────────────────────────────────────────────────────────┘
│
│ On rejection
▼
┌─────────────────────────────────────────────────────────────┐
│ add_announcement(event, pubkey, identifier, reason) │
│ ├─ Store full event in hot cache │
│ └─ Store metadata in cold index │
└─────────────────────────────────────────────────────────────┘
│
│ On negentropy sync
▼
┌─────────────────────────────────────────────────────────────┐
│ get_all_event_ids() → HashSet<EventId> │
│ ├─ Union of hot cache IDs │
│ └─ Union of cold index IDs │
└─────────────────────────────────────────────────────────────┘
```
## Benefits
### Immediate
- **Better tracking**: Store rejection reason + metadata
- **Configurable**: Tune cache/index durations per deployment
- **Observable**: Separate hot/cold metrics (future PR4)
### Future (PR3)
- **Immediate re-processing**: Get events from hot cache when valid
- **No 24h delay**: Maintainer announcements accepted in <1 second
- **Automatic recovery**: Hot cache for immediate, cold index for later
## Backward Compatibility
**No breaking changes:**
- Same rejection behavior (skip events in index)
- Same negentropy exclusion (union with purgatory IDs)
- Default config values match previous implicit behavior
**Migration:**
- Existing deployments continue working with defaults
- Optional: Tune durations via new config flags
## Testing
All tests passing:
- ✅ 9 rejected_index tests (hot cache, cold index, two-tier)
- ✅ 139 sync module tests (including updated integration tests)
- ✅ 247 total library tests
## Next Steps
**PR3: Add invalidation + immediate re-processing**
- Invalidate cold index when owner announcement accepted
- Get events from hot cache for re-processing
- Recursive call to process_event_static
- Integration tests for <1s maintainer acceptance
**PR4: Add cleanup + metrics**
- Hot cache cleanup task (every 60s)
- Cold index cleanup task (daily)
- Prometheus metrics for both tiers
- Monitor hot cache hits vs misses
## Configuration Examples
```bash
# Default (2 min hot cache, 7 day cold index)
ngit-grasp
# Longer hot cache for slow relays
ngit-grasp --rejected-hot-cache-duration-secs 300
# Shorter cold index for memory-constrained systems
ngit-grasp --rejected-cold-index-expiry-secs 86400
# Environment variables
export NGIT_REJECTED_HOT_CACHE_DURATION_SECS=180
export NGIT_REJECTED_COLD_INDEX_EXPIRY_SECS=259200
ngit-grasp
```
Part of: Maintainer chain discovery fix
See: work/SOLUTION-SUMMARY-V2.md for full design
Previous: PR1 (rejected_index.rs implementation)
Next: PR3 (invalidation + re-processing)
|
|
Implements a sophisticated two-tier storage system for rejected repository
announcements to enable immediate re-processing when dependencies resolve.
## Architecture
**Tier 1: Hot Cache (2 minutes)**
- Stores full event objects for immediate re-processing
- Enables <1 second re-processing vs 24 hour wait
- Auto-expires to prevent memory growth
- Memory: ~200 KB typical, ~20 MB worst case
**Tier 2: Cold Index (7 days)**
- Stores metadata only (event_id, pubkey, identifier)
- Prevents repeated downloads of rejected events
- Enables invalidation when circumstances change
- Memory: ~1 MB typical
## Problem Solved
Without this system, maintainer announcements face a timing gap:
00:00 - Maintainer announcement rejected → Event discarded
00:02 - Owner announcement accepted (lists maintainer) → Want to re-process
00:02 - ❌ Maintainer announcement GONE → Must wait 24h for next sync
With two-tier system:
00:00 - Maintainer announcement rejected → Stored in both tiers
00:02 - Owner announcement accepted → Invalidate + get from hot cache
00:02 - ✅ Re-process immediately → Accepted in <1 second
## Implementation
New module: src/sync/rejected_index.rs
- RejectedEventsIndex: Public API combining both tiers
- HotCache: Internal struct for full event storage
- ColdIndex: Internal struct for metadata storage
- RejectionReason: Enum for tracking why events were rejected
Key methods:
- add_announcement(): Add to both tiers
- contains(): Check if event is rejected
- invalidate_and_get_events(): Remove from cold index, get from hot cache
- cleanup_expired(): Remove expired entries from both tiers
## Testing
9 comprehensive unit tests covering:
- Hot cache storage and retrieval
- Hot cache expiration
- Cold index metadata tracking
- Cold index invalidation
- Two-tier integration
- Cleanup of expired entries
- Hot cache misses after expiry
- Multiple maintainer repositories
All tests passing.
## Next Steps
PR2: Switch SyncManager to use new RejectedEventsIndex
PR3: Add invalidation + immediate re-processing logic
PR4: Add cleanup task + Prometheus metrics
Part of: Maintainer chain discovery fix
See: work/SOLUTION-SUMMARY-V2.md for full design
|
|
- Fix relay_connected() helper to check v >= 2 (Syncing/Connected states)
- Fix unit test to use status value 3 (Connected) instead of 1 (Connecting)
- Fix clippy warning: use .to_vec() instead of .iter().cloned().collect()
All 61 sync integration tests now passing.
All 238 unit tests passing.
Clippy clean.
|
|
Resolves naming conflict with RelayHealthState::Degraded by using a more
explicit name that clearly indicates the connection status relates to
historic sync failures, not connection health degradation.
Changes:
- ConnectionStatus::ConnectedDegraded → ConnectedHistoricSyncFailures
- Updated all documentation and comments
- Updated Prometheus metric descriptions
- Metric value remains 4 for backward compatibility
This makes it clear that:
- ConnectedHistoricSyncFailures = connection lifecycle (missing historic data)
- RelayHealthState::Degraded = connection health (reliability issues)
These are orthogonal concerns - a relay can be ConnectedHistoricSyncFailures
but Healthy, or Connected but Degraded.
|
|
- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries
Fixes issue where relays with failed negentropy retries would
incorrectly transition to Connected status despite missing data.
Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
|
|
- Add ConnectionStatus::Syncing state between Connecting and Connected
- Track historic_sync_completed and historic_sync_completed_at in RelayState
- Auto-detect sync completion via check_and_complete_historic_sync()
- Update metrics: ngit_sync_relay_connected now shows 0-3 (disconnected/connecting/syncing/connected)
- Update Prometheus metric documentation with new status values
- Add state machine diagram showing Syncing transition
- Operators can now distinguish 'connected but catching up' vs 'fully synced'
|
|
Add retry protection to negentropy event validation:
- Track retry_count in PendingBatch (incremented on each retry attempt)
- Detect when retry makes zero progress (relay returns no requested events)
- Abort retry and complete batch with partial results when stuck
- Log error with full details when retry protection triggers
This prevents infinite loops when:
- Relay has bugs and returns wrong events for ID queries
- Relay is malicious and returns unrelated events
- Relay has eventual consistency issues
- Network corruption causes incorrect responses
The protection triggers when received_count == 0 on a retry (relay
returned nothing we asked for), indicating the relay will never
provide the missing events.
Future work: Track failed batches in Prometheus metrics
(sync_failed_batches_total) for monitoring and alerting.
|
|
Add validation that all events requested by ID during negentropy sync
are actually received from the relay. When events are missing:
- Log detailed information (requested/received/missing counts and IDs)
- Create retry subscriptions for missing events (chunked by 300)
- Update batch to track only missing events in next round
- Only complete batch after all events received or retry fails
This handles relays that have limits on ID-based queries (e.g., max 150
events per query) by automatically retrying in smaller chunks.
Also excludes purgatory and rejected announcement events from negentropy
requests to avoid re-requesting events we know we can't/won't store.
Note: Current implementation lacks retry limit - infinite loop protection
needed (tracked as future work).
|
|
Implement RejectedEventsIndex to prevent repeatedly fetching and
processing announcement events (kinds 30617/30618) that have been
rejected by the write policy.
Changes:
- Add RejectedEventsIndex to track rejected announcement EventIds
- Record rejections in process_event_static when announcements fail
write policy validation
- Exclude rejected events from negentropy sync (along with purgatory)
- Skip rejected events early in REQ+EOSE processing
- Add 2 tests verifying tracking and exclusion logic
Benefits:
- Reduced network traffic (no re-fetching of known-bad events)
- Lower CPU usage (no repeated validation)
- Faster sync (smaller negentropy diffs)
- Better observability (trace logging when skipping)
Scope limited to announcements as they are the primary source of
repeated rejection cycles during Layer 1 sync.
Closes: Reduces wasted bandwidth from continually fetching rejected events
|
|
Replace the owner-npub configuration option with relay-owner-nsec to provide
a persistent cryptographic identity for the relay operator. This addresses
NIP-42 authentication requirements discovered during sync debugging.
Motivation:
- Some relays (e.g., relay.damus.io) require NIP-42 authentication for
advanced features like NIP-77 negentropy sync
- Previously used random ephemeral keys per connection, providing no
persistent identity
- Other relays can now recognize us by pubkey for reputation-based rate
limiting
- Ensures consistency between NIP-11 pubkey and authentication key
Changes:
- Config: relay_owner_nsec with auto-load/generate from .relay-owner.nsec
- NIP-11: Pubkey derived from nsec instead of separate npub field
- Sync: RelayConnection now uses operator keys for NIP-42 auth
- Docs: Updated README, .env.example, and added .relay-owner.nsec to gitignore
Key Features:
- Auto-generates key on first run and saves to .relay-owner.nsec
- Loads existing key from file on subsequent runs
- Can override via CLI flag or environment variable
- Enables reputation building across relay network
- Future-ready for event signing and WoT calculations
Testing:
- 225/232 tests passing (7 pre-existing purgatory failures unrelated)
- Verified key generation, loading, and NIP-11 derivation
- Release build successful
Related: work/sync-debug-analysis.md, work/relay-owner-nsec-implementation.md
|
|
|
|
|
|
|
|
- Replace KIND_REPOSITORY_ANNOUNCEMENT with Kind::GitRepoAnnouncement
- Replace KIND_REPOSITORY_STATE with Kind::RepoState
- Replace KIND_PR with Kind::GitPullRequest
- Replace KIND_PR_UPDATE with Kind::GitPullRequestUpdate
- Replace KIND_USER_GRASP_LIST with Kind::GitUserGraspList
- Replace KIND_PATCH with Kind::GitPatch
- Replace KIND_ISSUE with Kind::GitIssue
- Replace KIND_COMMENT with Kind::Comment
- Replace all Kind::Custom(30617|30618|1617|1618|1619|1621|1111|10317) patterns
- Remove all hardcoded KIND_* constants from events.rs
- Update all match statements to use Kind enum directly
- Update all filter builders to use Kind variants
- Update all test helpers and assertions
Benefits:
- Type safety: compiler prevents wrong kind numbers
- Readability: Kind::GitRepoAnnouncement is self-documenting
- Maintainability: single source of truth (rust-nostr)
- IDE support: full autocompletion and refactoring
- Standards: aligns with rust-nostr best practices
Files modified: 21
Constants removed: 9
Patterns replaced: 100+
Tests passing: 222/222
|
|
|
|
Implements Phase 3 of the purgatory sync integration test plan.
Key changes:
- Add immediate sync triggering for sync-received events that go to
purgatory (instead of default 3-minute delay for user-submitted events)
- TestRelay now respects RUST_LOG environment variable for debugging
- New test verifies end-to-end flow: state event syncs from source relay,
enters purgatory, git data is fetched from source's clone URL, and
event is released and served
|
|
|
|
don't save new events destined for purgatory events directly to db
or serve on websockets
don't download events already in purgatory via negentropy sync
|
|
|
|
so we can more easily support grasp purgatory feature
|
|
|
|
|
|
|
|
|
|
Add automatic pagination support for non-Negentropy historic sync to handle
large result sets efficiently. When a subscription receives >= 75 events,
the system automatically fetches the next page using the 'until' parameter.
Changes:
- Add PaginationState struct to track event counts and min timestamps
- Add pagination_state HashMap to PendingBatch for per-subscription tracking
- Add PAGINATION_THRESHOLD constant (75 events)
- Pass pending_sync_index to event processor for state updates
- Track events and timestamps as they arrive
- Check threshold on EOSE and launch follow-up subscriptions
- Initialize pagination state when creating historic sync subscriptions
- Update test fixtures in algorithms.rs
The pagination continues recursively until a page returns fewer than 75 events,
ensuring complete historic data retrieval without overwhelming relay limits.
|
|
Replace broken event counting that occurred before duplicate/policy checks
with accurate tracking of events that are new, accepted, and saved.
Changes:
- Added ProcessResult enum to track event processing outcomes
- Modified process_event_static() to return ProcessResult
- Replaced events_total (with source labels) with events_synced_total
- Removed gap_events_total and event_source module
- Removed eose_received flag (EOSE is per-subscription, not suitable)
- Updated all tests to use new simplified API
The new ngit_sync_events_synced_total metric only counts events that:
1. Are new (not duplicates)
2. Pass write policy validation
3. Are successfully saved to database
All 165 tests pass (124 lib + 41 integration)
|
|
|
|
Separated connection from subscription logic. The RelayConnection.connect()
method now only handles WebSocket connection establishment. Subscriptions
are managed separately via handle_connect_or_reconnect.
Changes:
- Renamed RelayConnection::connect_and_subscribe() to connect()
- Removed subscription logic from connect method
- Updated call site in try_connect_relay()
- Removed unused build_announcement_filter import
|
|
Bug: handle_connect_or_reconnect() was incorrectly calling quick_reconnect()
on first connections instead of fresh_start().
Root cause: The code updated last_connected = Some(now) at line 808, then
immediately read it back at line 932 to make the reconnection decision.
This meant first connections saw elapsed = now - now = 0 seconds, which
triggered quick_reconnect() instead of fresh_start().
Fix: Capture old_last_connected BEFORE updating the state, then use that
value for the reconnection decision. Now first connections correctly see
None and call fresh_start().
Impact:
- First connections now properly use fresh_start() with full historic sync
- Short disconnections (< 15 min) use quick_reconnect() with since filter
- Long disconnections (> 15 min) use fresh_start() with full resync
All 41 sync tests passing.
|
|
The system was incorrectly treating subscription-specific CLOSED messages
as connection-wide disconnects, causing live subscriptions to be terminated
immediately after historic_sync completed.
Two bugs fixed:
1. relay_connection.rs: Removed break on RelayMessage::Closed - it's
subscription-specific, not connection-wide
2. mod.rs: Removed disconnect handling for RelayEvent::Closed - only log
at DEBUG level and continue
All 41 sync tests now pass including previously failing live sync tests.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The log message claimed 'will fall back to REQ+EOSE' but no such
fallback was implemented - the function simply returns 0 and exits.
|
|
When negentropy (NIP-77) sync was enabled, the RelaySyncIndex was never
updated to reflect historical sync completion. This caused the three-way
diff algorithm in compute_actions() to malfunction, leading to:
- Repeated sync attempts for the same items
- Incorrect filter counting for consolidation
- Potential premature relay disconnection
This fix unifies both sync paths (REQ+EOSE and Negentropy) through a
consistent PendingBatch flow:
1. Added SyncMethod enum to distinguish between sync types
2. Updated PendingBatch struct to include sync_method field
3. Extracted confirm_batch() method for unified batch confirmation
4. Modified negentropy_sync_and_process() to:
- Create a PendingBatch before sync
- Add batch to pending_sync_index
- On success: Remove batch and call confirm_batch()
- On failure: Remove batch without confirming
The confirm_batch() method moves repos and root_events from the batch
to the RelayState.repos and RelayState.root_events, ensuring the
three-way diff works correctly regardless of sync method.
Closes: negentropy-sync-state-tracking.md
|
|
they are legacy and not root events
|
|
Main lib (src/):
- Add #[allow(dead_code)] for build_info field (stored to prevent Prometheus unregistration)
- Add #[allow(dead_code)] for first_seen field (reserved for future rate limiting)
- Replace .or_insert_with(RelaySyncNeeds::default) with .or_default()
- Replace manual div_ceil implementations with .div_ceil(100)
Test code (tests/):
- Replace .expect(&format!(...)) with .unwrap_or_else(|_| panic!(...))
- Remove needless borrows in fetch_metrics() calls
- Add #[allow(dead_code)] and #[allow(unused_imports)] to test helpers module
grasp-audit:
- Apply cargo fmt to fix formatting
|
|
|
|
Replace EOSE-based sync completion with negentropy reconciliation for:
- Initial connect (fresh sync)
- Daily sync (Layer 1 announcements)
- Stale reconnect (>15 min)
Key changes:
- Add NegentropySyncResult struct with remote_only, local_only, received fields
- Add supports_negentropy() using try-and-fallback approach
- Add negentropy_sync_filter() using nostr-sdk client.sync() API
- Modify handle_connect_or_reconnect() to use negentropy for fresh/stale sync
- Modify daily_sync() to use negentropy for Layer 1
- Single-warning logging per relay when negentropy fails
Quick reconnects (<15 min) unchanged - still use REQ with since filter.
If negentropy unsupported, gracefully falls back to REQ+EOSE flow.
|
|
|
|
|
|
- Add Layer 1 (announcements) re-subscription in daily_sync() after
unsubscribe_all() to ensure kinds 30617+30618 are re-established
- Clarify comments in handle_connect_or_reconnect() explaining that
Layer 1 subscription is established during connect_and_subscribe()
Addresses implementation gaps from design vs implementation report:
- Gap 1: Comments clarified (Layer 1 handled by connect_and_subscribe)
- Gap 2: daily_sync() now re-subscribes to Layer 1 without since filter
- Gap 3: consolidate() already had Layer 1 re-subscription (no change)
All 125 unit tests and integration tests pass.
|
|
|
|
Previously, events were classified as 'startup' or 'live' based on whether
they came from a bootstrap relay (is_bootstrap flag). This meant ALL events
from bootstrap relays were counted as 'startup', even events received after
the initial sync completed.
Now events are classified based on whether EOSE (End Of Stored Events) has
been received for that connection:
- Events BEFORE EOSE → 'startup' (historical events during initial sync)
- Events AFTER EOSE → 'live' (new events via real-time subscription)
This enables the test_live_sync_event_count test which validates that events
received after sync connection is established are counted as live events.
Also removed the #[ignore] attribute from test_live_sync_event_count since
the metrics are now properly wired up.
|
|
nostr-sdk 0.44's Relay::new() is pub(crate), making it impossible to
construct a Relay directly from outside the crate. Relays can only be
created through Client::add_relay() or RelayPool::add_relay().
This commit:
- Adds 'Why Client instead of Relay directly?' section to struct docs
- Updates run_event_loop() docs to explain the API constraint
- Removes outdated 'Future Refactoring' suggestion (not feasible)
|
|
Replace the 1-second polling loop with nostr-sdk's relay-level notification
system that provides immediate disconnect detection via RelayNotification::RelayStatus.
Key changes:
- Use relay.notifications() instead of client.notifications()
- Handle RelayNotification::RelayStatus { Disconnected | Terminated } to detect
connection loss immediately without polling
- Remove tokio::select! with interval timer - now uses simple match loop
- Handle additional notification types (Authenticated, AuthenticationFailed)
Why this is better:
- Event-driven vs polling: no wasted CPU cycles checking every second
- Immediate detection: disconnect triggers notification instantly
- Uses nostr-sdk's built-in mechanism that was previously inaccessible at pool level
(RelayStatus notifications are filtered out in RelayPoolNotification)
Technical note: RelayNotification::RelayStatus is only available via
Relay::notifications(), not Client::notifications(), because the pool-level
broadcast filters out status change events.
Future refactoring opportunity: Consider restructuring RelayConnection to hold
a Relay directly instead of wrapping a Client, since we only manage one relay
per connection anyway.
|
|
- Add periodic health check in RelayConnection::run_event_loop that polls
nostr-sdk's relay.is_connected() every second to detect dead connections
- When event channel closes without explicit Closed/Shutdown, send
DisconnectNotification to SyncManager (fixes case where TCP drops silently)
- Enable test_relay_connected_status test which validates the
ngit_sync_relay_connected metric correctly reflects connection state
The issue was that when a remote relay stops abruptly, nostr-sdk's
notification receiver blocks indefinitely waiting for data. TCP disconnect
detection without keepalive can take minutes. The health check polls
nostr-sdk's internal relay status which detects disconnection promptly.
|
|
Root cause: Both Metrics::new() and SyncManager::new() were trying to register
SyncMetrics with the same Prometheus registry. The second registration failed
silently, leaving SyncManager.metrics = None, so record_connection_attempt()
calls were no-ops.
Changes:
- SyncManager::new() now accepts Option<SyncMetrics> instead of Option<&Registry>
- main.rs passes already-registered sync metrics from Metrics to SyncManager
- Simplified test_connection_failure_increments_counter assertion
- Marked 3 tests as #[ignore] pending relay tracking metrics wiring
Tests fixed:
- test_connection_failure_increments_counter (now counts failures)
- test_health_state_degrades_on_failure (now tracks health state)
- test_live_sync_layer3_events (already working, confirmed)
Tests ignored (future work):
- test_live_sync_event_count
- test_multi_source_aggregate_counts
- test_relay_connected_status
|
|
|
|
|