| Age | Commit message | Author |
|
Add naughty list tracking for relays with persistent infrastructure issues
(DNS failures, TLS certificate errors, protocol violations) to reduce log
noise and provide better visibility via metrics.
Key features:
- Classify errors into naughty (persistent) vs transient (temporary)
- Track naughty relays with category, reason, and occurrence count
- Log WARN on first naughty occurrence, DEBUG on repeats
- Automatic expiration after 12 hours (configurable)
- Prometheus metrics for monitoring naughty relays by category
- Periodic cleanup task integrated with health checker
Components added:
- src/sync/naughty_list.rs: Core naughty list tracker with error classification
- NaughtyListTracker integration in RelayHealthTracker
- Connection error handling updates in sync manager
- Naughty list metrics (total by category, detailed info per relay)
- Config option for naughty_list_expiration_hours (default: 12)
Closes the tracking issues for DNS lookup failures and TLS certificate errors.
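The tracker behaviour described above can be sketched roughly as follows. This is a minimal, std-only illustration, not the code in src/sync/naughty_list.rs: the type names `NaughtyCategory`, `LogLevel`, `NaughtyEntry` and `NaughtyList`, the substring-based `classify()` and the choice to expire entries relative to their last occurrence are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical categories for persistent infrastructure problems.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum NaughtyCategory {
    Dns,
    TlsCertificate,
    ProtocolViolation,
}

/// Log level the caller should use for this occurrence.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogLevel {
    Warn,
    Debug,
}

#[derive(Debug, Clone)]
pub struct NaughtyEntry {
    pub category: NaughtyCategory,
    pub reason: String,
    pub occurrences: u64,
    pub last_seen: Instant,
}

/// Returns Some(category) for persistent ("naughty") errors and None for
/// transient ones. Substring matching is an assumption of this sketch.
pub fn classify(error: &str) -> Option<NaughtyCategory> {
    let e = error.to_lowercase();
    if e.contains("dns") || e.contains("failed to lookup address") {
        Some(NaughtyCategory::Dns)
    } else if e.contains("certificate") || e.contains("tls") {
        Some(NaughtyCategory::TlsCertificate)
    } else if e.contains("protocol") {
        Some(NaughtyCategory::ProtocolViolation)
    } else {
        None
    }
}

pub struct NaughtyList {
    entries: HashMap<String, NaughtyEntry>,
    expiration: Duration,
}

impl NaughtyList {
    pub fn new(expiration_hours: u64) -> Self {
        Self {
            entries: HashMap::new(),
            expiration: Duration::from_secs(expiration_hours * 3600),
        }
    }

    /// Records an error for a relay. Returns the log level to use:
    /// Warn on the first naughty occurrence, Debug on repeats, and
    /// None when the error is transient and is not tracked at all.
    pub fn record(&mut self, relay: &str, error: &str, now: Instant) -> Option<LogLevel> {
        let category = classify(error)?;
        match self.entries.get_mut(relay) {
            Some(entry) => {
                entry.category = category;
                entry.reason = error.to_string();
                entry.occurrences += 1;
                entry.last_seen = now;
                Some(LogLevel::Debug)
            }
            None => {
                let entry = NaughtyEntry {
                    category,
                    reason: error.to_string(),
                    occurrences: 1,
                    last_seen: now,
                };
                self.entries.insert(relay.to_string(), entry);
                Some(LogLevel::Warn)
            }
        }
    }

    /// Periodic cleanup: drops entries older than the expiration window
    /// and returns how many were removed.
    pub fn cleanup(&mut self, now: Instant) -> usize {
        let before = self.entries.len();
        let expiration = self.expiration;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.last_seen) < expiration);
        before - self.entries.len()
    }

    /// Value a per-category gauge could be set to.
    pub fn count_by_category(&self, category: NaughtyCategory) -> usize {
        self.entries.values().filter(|e| e.category == category).count()
    }
}
```

Passing `now` in explicitly keeps the sketch testable without sleeping; a real tracker would more likely read the clock itself.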
|
|
Previously, disconnect_relay() would immediately remove RelayState and
pending batches before the event loop finished draining messages. This
caused confusing 'unknown relay' debug messages for EOSE and other
events that arrived after state removal but were expected during
normal shutdown.
Changes:
- Add ConnectionStatus::Disconnecting to track intentional disconnects
- disconnect_relay() now marks relay as Disconnecting (keeps state)
- Event loop drains messages while state exists
- handle_disconnect() detects intentional vs unexpected disconnects:
  - Intentional: Completes cleanup by removing state/connections
  - Unexpected: Updates to Disconnected, keeps connection for retry
- handle_eose() suppresses logs for Disconnecting relays (TRACE level)
- check_disconnects() skips relays already in Disconnecting state
This ensures proper sequencing: mark->drain->cleanup instead of
remove->drain->confusion. Fixes the root cause instead of just
hiding log messages.
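The mark->drain->cleanup sequencing can be illustrated with a small state map. This is only a sketch under assumptions: the `Relays` container, the `DisconnectKind` return type and the `is_known()` helper are hypothetical, and the real RelayState carries far more than a status.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Connected,
    Disconnecting,
    Disconnected,
}

/// Hypothetical result type describing what handle_disconnect() saw.
#[derive(Debug, PartialEq, Eq)]
pub enum DisconnectKind {
    Intentional,
    Unexpected,
    UnknownRelay,
}

#[derive(Default)]
pub struct Relays {
    states: HashMap<String, ConnectionStatus>,
}

impl Relays {
    pub fn connect(&mut self, url: &str) {
        self.states.insert(url.to_string(), ConnectionStatus::Connected);
    }

    /// Step 1 (mark): flag the relay as Disconnecting but keep its state.
    pub fn disconnect_relay(&mut self, url: &str) {
        if let Some(status) = self.states.get_mut(url) {
            *status = ConnectionStatus::Disconnecting;
        }
    }

    /// Step 2 (drain): late messages such as EOSE still find their relay
    /// while the state exists, so no 'unknown relay' message is needed.
    pub fn is_known(&self, url: &str) -> bool {
        self.states.contains_key(url)
    }

    /// Step 3 (cleanup): remove state only for intentional disconnects;
    /// unexpected ones keep their entry so a retry can reuse it.
    pub fn handle_disconnect(&mut self, url: &str) -> DisconnectKind {
        match self.states.get(url).copied() {
            Some(ConnectionStatus::Disconnecting) => {
                self.states.remove(url);
                DisconnectKind::Intentional
            }
            Some(_) => {
                self.states
                    .insert(url.to_string(), ConnectionStatus::Disconnected);
                DisconnectKind::Unexpected
            }
            None => DisconnectKind::UnknownRelay,
        }
    }

    pub fn status(&self, url: &str) -> Option<ConnectionStatus> {
        self.states.get(url).copied()
    }
}
```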
|
|
Replace duplicate metrics methods (announcements vs states) with unified
methods using IntGaugeVec/IntCounterVec with an event_type label:
- update_rejected_hot_cache_size(event_type, size)
- record_rejected_hot_cache_hit(event_type)
- record_rejected_hot_cache_miss(event_type)
- record_rejected_hot_cache_expired(event_type, count)
- update_rejected_cold_index_size(event_type, size)
- record_rejected_cold_index_expired(event_type, count)
- record_rejected_invalidation(event_type, count)
Prometheus labels remain separate (event_type="announcement" vs
event_type="state") but implementation is now unified.
Phase 4 of rejected events index refactoring.
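A dependency-free sketch of the unification idea: one method per metric, keyed by an event_type label, instead of one method per event type. Plain HashMaps stand in for IntGaugeVec/IntCounterVec here, and the `RejectedMetrics` struct and its read accessors are assumptions for illustration; only the two recording method names come from the list above.

```rust
use std::collections::HashMap;

/// Stand-in for a set of labelled Prometheus metrics. Each map plays the
/// role of one metric vector, with the key acting as the event_type label.
#[derive(Default)]
pub struct RejectedMetrics {
    hot_cache_size: HashMap<String, i64>,
    hot_cache_hits: HashMap<String, u64>,
}

impl RejectedMetrics {
    /// Gauge-style update: overwrite the current value for this label.
    pub fn update_rejected_hot_cache_size(&mut self, event_type: &str, size: i64) {
        self.hot_cache_size.insert(event_type.to_string(), size);
    }

    /// Counter-style update: increment the value for this label.
    pub fn record_rejected_hot_cache_hit(&mut self, event_type: &str) {
        *self
            .hot_cache_hits
            .entry(event_type.to_string())
            .or_default() += 1;
    }

    /// Hypothetical read accessors, used only to inspect the sketch.
    pub fn hot_cache_size(&self, event_type: &str) -> i64 {
        self.hot_cache_size.get(event_type).copied().unwrap_or(0)
    }

    pub fn hot_cache_hits(&self, event_type: &str) -> u64 {
        self.hot_cache_hits.get(event_type).copied().unwrap_or(0)
    }
}
```

Each label value still produces its own time series, which is why the exported Prometheus output stays separate per event type even though the code path is shared.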
|
|
|
|
Add comprehensive authorization checks to ensure state events are only
accepted from maintainers of accepted repository announcements. This
implements the core GRASP-01 requirement that pushes must match the
latest state announcement "respecting the maintainer set."
Changes:
1. StatePolicy authorization (src/nostr/policy/state.rs):
   - Check authorization BEFORE git data validation (fail-fast)
   - Reject if no announcement exists for repository
   - Reject if author not in maintainer set
   - Use existing helpers: fetch_repository_data() and
     pubkey_authorised_for_repo_owners()
   - Structured logging for all rejections
2. Purgatory invalidation (src/nostr/builder.rs):
   - New method: check_purgatory_state_events_for_identifier()
   - Called when announcements accepted (Accept and AcceptMaintainer)
   - Re-evaluates state events in purgatory for the identifier
   - Processes newly-authorized events (releases from purgatory)
   - Keeps unauthorized events for natural expiry (30 min)
   - Enables retroactive authorization when announcements arrive late
3. Purgatory sync authorization (src/git/sync.rs):
   - Check authorization BEFORE processing git data
   - Remove unauthorized events from purgatory (permanent rejection)
   - Prevents processing even if git data arrives first
   - Structured logging for monitoring
4. Rejected events tracking (src/sync/rejected_index.rs):
   - Add support for tracking rejected state events
   - New methods: add_state(), contains_state()
   - Separate metrics for state rejections
   - Enables sync to avoid re-fetching rejected states
5. Sync metrics (src/sync/metrics.rs, src/sync/mod.rs):
   - Add state-specific metrics (hot cache, cold index)
   - Track rejected states separately from announcements
   - Support monitoring of authorization rejections
6. Comprehensive tests (tests/state_authorization.rs):
   - test_reject_state_without_announcement
   - test_reject_state_from_unauthorized_author
   - test_accept_state_from_announcement_author
   - test_accept_state_from_maintainer
Security Impact:
- Before: State events could be published by anyone
- After: Only maintainers can publish state events
- Defense-in-depth: Authorization checked at 3 points:
  1. On arrival (StatePolicy)
  2. On announcement acceptance (purgatory re-evaluation)
  3. On git data arrival (purgatory sync)
All tests pass:
- 248 unit tests
- 51 NIP-34 announcement tests
- 4 new state authorization tests
- 9 rejected index tests
Closes: State authorization requirement from GRASP-01 spec
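The authorization rule and the purgatory re-evaluation can be sketched as below. This is a simplified model, not the StatePolicy code: `Announcement`, `StateEvent`, `StateDecision`, `authorize_state()` and `release_authorized()` are hypothetical names, and pubkeys are plain strings.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum StateDecision {
    Accept,
    RejectNoAnnouncement,
    RejectUnauthorizedAuthor,
}

/// Simplified accepted announcement: its author plus listed maintainers.
pub struct Announcement {
    pub author: String,
    pub maintainers: HashSet<String>,
}

impl Announcement {
    pub fn new(author: &str, maintainers: &[&str]) -> Self {
        Self {
            author: author.to_string(),
            maintainers: maintainers.iter().map(|m| m.to_string()).collect(),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct StateEvent {
    pub identifier: String,
    pub author: String,
}

/// Fail-fast check, run before any git data validation.
/// `announcements` maps repository identifier to its accepted announcement.
pub fn authorize_state(
    announcements: &HashMap<String, Announcement>,
    identifier: &str,
    author: &str,
) -> StateDecision {
    match announcements.get(identifier) {
        None => StateDecision::RejectNoAnnouncement,
        Some(a) if a.author == author || a.maintainers.contains(author) => {
            StateDecision::Accept
        }
        Some(_) => StateDecision::RejectUnauthorizedAuthor,
    }
}

/// Re-evaluates purgatory when an announcement for `identifier` is accepted.
/// Newly authorized events are returned for processing; everything else
/// stays in purgatory and is left to expire naturally.
pub fn release_authorized(
    purgatory: &mut Vec<StateEvent>,
    announcements: &HashMap<String, Announcement>,
    identifier: &str,
) -> Vec<StateEvent> {
    let mut released = Vec::new();
    let mut kept = Vec::new();
    for event in purgatory.drain(..) {
        let authorized = event.identifier == identifier
            && authorize_state(announcements, &event.identifier, &event.author)
                == StateDecision::Accept;
        if authorized {
            released.push(event);
        } else {
            kept.push(event);
        }
    }
    *purgatory = kept;
    released
}
```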
|
|
Add automatic cleanup and Prometheus metrics for the two-tier rejected
events index that caches rejected announcements for re-processing.
Cleanup loops:
- Hot cache: Every 60 seconds (events expire after 2 minutes)
- Cold index: Every 24 hours (metadata expires after 7 days)
- Background task with graceful shutdown support
New Prometheus metrics (7):
- Gauges: hot_cache_current, cold_index_current
- Counters: hits, misses, hot_expired, cold_expired, invalidated
This completes the maintainer announcement re-processing feature,
reducing wait time from 24 hours to <1 second when a maintainer's
announcement arrives before the repository owner's announcement.
Memory is bounded through automatic cleanup, and comprehensive metrics
enable monitoring of hit rates, memory usage, and cleanup effectiveness.
Changes:
- src/sync/metrics.rs: Added 7 metrics with recording methods
- src/sync/rejected_index.rs: Added optional metrics support
- src/sync/mod.rs: Added cleanup background task
Tests: 248 library tests passing, 3 integration tests passing
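A rough model of the two tiers and their expiry, assuming the hot cache holds the full rejected event and the cold index holds only a timestamp per event id. The struct layout and method names (`add`, `take_hot`, `is_known_rejected`, `cleanup_hot`, `cleanup_cold`, `sizes`) are assumptions for this sketch; only the 2 minute and 7 day lifetimes come from the description above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hot cache entries expire after 2 minutes.
pub const HOT_TTL: Duration = Duration::from_secs(2 * 60);
/// Cold index metadata expires after 7 days.
pub const COLD_TTL: Duration = Duration::from_secs(7 * 24 * 60 * 60);

pub struct RejectedIndex {
    /// event id -> (raw event, time of rejection)
    hot: HashMap<String, (String, Instant)>,
    /// event id -> time of rejection (metadata only)
    cold: HashMap<String, Instant>,
    /// Running totals that expired-entry counters could mirror.
    pub hot_expired: u64,
    pub cold_expired: u64,
}

impl RejectedIndex {
    pub fn new() -> Self {
        Self {
            hot: HashMap::new(),
            cold: HashMap::new(),
            hot_expired: 0,
            cold_expired: 0,
        }
    }

    /// Remember a rejected event in both tiers.
    pub fn add(&mut self, id: &str, raw_event: &str, now: Instant) {
        self.hot.insert(id.to_string(), (raw_event.to_string(), now));
        self.cold.insert(id.to_string(), now);
    }

    /// Take the full event back out for immediate re-processing.
    pub fn take_hot(&mut self, id: &str) -> Option<String> {
        self.hot.remove(id).map(|(raw, _)| raw)
    }

    /// Lets sync skip re-fetching an event that was already rejected.
    pub fn is_known_rejected(&self, id: &str) -> bool {
        self.cold.contains_key(id)
    }

    /// Intended to run every 60 seconds; returns the number removed.
    pub fn cleanup_hot(&mut self, now: Instant) -> usize {
        let before = self.hot.len();
        self.hot
            .retain(|_, v| now.saturating_duration_since(v.1) < HOT_TTL);
        let removed = before - self.hot.len();
        self.hot_expired += removed as u64;
        removed
    }

    /// Intended to run every 24 hours; returns the number removed.
    pub fn cleanup_cold(&mut self, now: Instant) -> usize {
        let before = self.cold.len();
        self.cold
            .retain(|_, t| now.saturating_duration_since(*t) < COLD_TTL);
        let removed = before - self.cold.len();
        self.cold_expired += removed as u64;
        removed
    }

    /// (hot size, cold size), the values the two gauges would report.
    pub fn sizes(&self) -> (usize, usize) {
        (self.hot.len(), self.cold.len())
    }
}
```

The background task would call the two cleanup methods on their respective intervals and stop when a shutdown signal arrives; that scheduling is left out here.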
|
|
Resolves naming conflict with RelayHealthState::Degraded by using a more
explicit name that clearly indicates the connection status relates to
historic sync failures, not connection health degradation.
Changes:
- ConnectionStatus::ConnectedDegraded → ConnectedHistoricSyncFailures
- Updated all documentation and comments
- Updated Prometheus metric descriptions
- Metric value remains 4 for backward compatibility
This makes it clear that:
- ConnectedHistoricSyncFailures = connection lifecycle (missing historic data)
- RelayHealthState::Degraded = connection health (reliability issues)
These are orthogonal concerns - a relay can be ConnectedHistoricSyncFailures
but Healthy, or Connected but Degraded.
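To make the orthogonality concrete, the two enums can be modelled side by side. The variant sets below are reduced to what the commit messages in this log mention, and `metric_value()`, `RelayView` and `describe()` are hypothetical helpers for the example.

```rust
/// Connection lifecycle. Values 0-3 follow the documented meaning of
/// ngit_sync_relay_connected; 4 is kept for backward compatibility.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedHistoricSyncFailures,
}

impl ConnectionStatus {
    pub fn metric_value(self) -> i64 {
        match self {
            ConnectionStatus::Disconnected => 0,
            ConnectionStatus::Connecting => 1,
            ConnectionStatus::Syncing => 2,
            ConnectionStatus::Connected => 3,
            ConnectionStatus::ConnectedHistoricSyncFailures => 4,
        }
    }
}

/// Connection health, tracked independently of the lifecycle.
/// Only the two variants named in the commit message are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelayHealthState {
    Healthy,
    Degraded,
}

/// Hypothetical combined view of one relay.
pub struct RelayView {
    pub connection: ConnectionStatus,
    pub health: RelayHealthState,
}

/// Operator-facing summary; every combination of the two axes is valid.
pub fn describe(view: &RelayView) -> String {
    let data = match view.connection {
        ConnectionStatus::ConnectedHistoricSyncFailures => "missing historic data",
        ConnectionStatus::Connected => "fully synced",
        _ => "not yet live",
    };
    let health = match view.health {
        RelayHealthState::Healthy => "reliable",
        RelayHealthState::Degraded => "unreliable",
    };
    format!("{data}, {health}")
}
```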
|
|
- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics)
- Track batch failures via PendingBatch.failed field
- Track relay-level failures via RelayState.historic_sync_had_failures
- Transition to ConnectedDegraded when any batch fails during historic sync
- Add is_live_sync_active() helper for cleaner match patterns
- Update state machine diagram with ConnectedDegraded transitions
- Update metrics docs with status=4 and example queries
Fixes issue where relays with failed negentropy retries would
incorrectly transition to Connected status despite missing data.
Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
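A sketch of how per-batch failures could feed the final status once historic sync completes. The field names `failed` and `historic_sync_had_failures` come from the list above; the `done` flag, `mark_batch()`, the body of `check_and_complete_historic_sync()` and the exact meaning of `is_live_sync_active()` are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Syncing,
    Connected,
    ConnectedDegraded,
}

#[derive(Debug, Default, Clone)]
pub struct PendingBatch {
    pub done: bool,
    pub failed: bool,
}

pub struct RelayState {
    pub status: ConnectionStatus,
    pub batches: Vec<PendingBatch>,
    pub historic_sync_had_failures: bool,
}

impl RelayState {
    pub fn new(batch_count: usize) -> Self {
        Self {
            status: ConnectionStatus::Syncing,
            batches: vec![PendingBatch::default(); batch_count],
            historic_sync_had_failures: false,
        }
    }

    /// Records the outcome of one batch. A single failure is remembered
    /// at relay level even if later batches succeed.
    pub fn mark_batch(&mut self, index: usize, failed: bool) {
        if let Some(batch) = self.batches.get_mut(index) {
            batch.done = true;
            batch.failed = failed;
            if failed {
                self.historic_sync_had_failures = true;
            }
        }
    }

    /// Leaves Syncing once every batch has finished. Returns true if the
    /// status changed in this call.
    pub fn check_and_complete_historic_sync(&mut self) -> bool {
        let still_pending = self.batches.iter().any(|b| !b.done);
        if self.status != ConnectionStatus::Syncing || still_pending {
            return false;
        }
        self.status = if self.historic_sync_had_failures {
            ConnectionStatus::ConnectedDegraded
        } else {
            ConnectionStatus::Connected
        };
        true
    }
}

/// Assumed meaning: live sync runs in both connected states, which lets
/// callers avoid repeating the two-variant match everywhere.
pub fn is_live_sync_active(status: ConnectionStatus) -> bool {
    matches!(
        status,
        ConnectionStatus::Connected | ConnectionStatus::ConnectedDegraded
    )
}
```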
|
|
- Add ConnectionStatus::Syncing state between Connecting and Connected
- Track historic_sync_completed and historic_sync_completed_at in RelayState
- Auto-detect sync completion via check_and_complete_historic_sync()
- Update metrics: ngit_sync_relay_connected now shows 0-3 (disconnected/connecting/syncing/connected)
- Update Prometheus metric documentation with new status values
- Add state machine diagram showing Syncing transition
- Operators can now distinguish 'connected but catching up' vs 'fully synced'
|
|
|
|
Replace broken event counting that occurred before duplicate/policy checks
with accurate tracking of events that are new, accepted, and saved.
Changes:
- Added ProcessResult enum to track event processing outcomes
- Modified process_event_static() to return ProcessResult
- Replaced events_total (with source labels) with events_synced_total
- Removed gap_events_total and event_source module
- Removed eose_received flag (EOSE is per-subscription, so the flag was not suitable)
- Updated all tests to use new simplified API
The new ngit_sync_events_synced_total metric only counts events that:
1. Are new (not duplicates)
2. Pass write policy validation
3. Are successfully saved to database
All 165 tests pass (124 lib + 41 integration)
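The counting rule can be sketched as follows: the counter moves only when processing reports that an event was actually saved. The `ProcessResult` variants, the `SyncState` struct and `handle_incoming()` are assumptions for illustration, and a HashSet stands in for the database.

```rust
use std::collections::HashSet;

/// Hypothetical outcome variants; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessResult {
    Saved,
    Duplicate,
    Rejected,
}

#[derive(Default)]
pub struct SyncState {
    /// Stand-in for the event database.
    saved_ids: HashSet<String>,
    /// Mirrors what ngit_sync_events_synced_total is described as counting.
    pub events_synced_total: u64,
}

impl SyncState {
    /// Checks run in the order listed above: duplicate, policy, save.
    fn process_event(&mut self, id: &str, passes_policy: bool) -> ProcessResult {
        if self.saved_ids.contains(id) {
            return ProcessResult::Duplicate;
        }
        if !passes_policy {
            return ProcessResult::Rejected;
        }
        self.saved_ids.insert(id.to_string());
        ProcessResult::Saved
    }

    /// Counting happens after processing, so duplicates and rejected
    /// events never inflate the metric.
    pub fn handle_incoming(&mut self, id: &str, passes_policy: bool) -> ProcessResult {
        let result = self.process_event(id, passes_policy);
        if result == ProcessResult::Saved {
            self.events_synced_total += 1;
        }
        result
    }
}
```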
|
|
|
|
Main lib (src/):
- Add #[allow(dead_code)] for build_info field (stored to prevent Prometheus unregistration)
- Add #[allow(dead_code)] for first_seen field (reserved for future rate limiting)
- Replace .or_insert_with(RelaySyncNeeds::default) with .or_default()
- Replace manual div_ceil implementations with .div_ceil(100)
Test code (tests/):
- Replace .expect(&format!(...)) with .unwrap_or_else(|_| panic!(...))
- Remove needless borrows in fetch_metrics() calls
- Add #[allow(dead_code)] and #[allow(unused_imports)] to test helpers module
grasp-audit:
- Apply cargo fmt to fix formatting
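For reference, the three code-level replacements look roughly like this in isolation. The `RelaySyncNeeds` field and the helper functions are hypothetical; only the idioms themselves (`.or_default()`, `.div_ceil(100)`, `.unwrap_or_else(|_| panic!(...))`) come from the list above. Note that `usize::div_ceil` needs Rust 1.73 or newer.

```rust
use std::collections::HashMap;

/// Hypothetical shape; only the Default impl matters for the example.
#[derive(Debug, Default, PartialEq)]
pub struct RelaySyncNeeds {
    pub pending: Vec<String>,
}

/// `.or_default()` replaces `.or_insert_with(RelaySyncNeeds::default)`.
pub fn add_need(map: &mut HashMap<String, RelaySyncNeeds>, relay: &str, id: &str) {
    map.entry(relay.to_string())
        .or_default()
        .pending
        .push(id.to_string());
}

/// `.div_ceil(100)` replaces the manual `(items + 99) / 100`.
pub fn batch_count(items: usize) -> usize {
    items.div_ceil(100)
}

/// `.unwrap_or_else(|_| panic!(...))` replaces `.expect(&format!(...))`,
/// so the message is only formatted when the failure actually happens.
pub fn parse_port(s: &str) -> u16 {
    s.parse()
        .unwrap_or_else(|_| panic!("invalid port: {s}"))
}
```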
|
|
|
|
|
|
- Add SyncMetrics with full Prometheus integration
- Track sync gaps via catchup events
- Update Grafana dashboard with sync panels
- Document all sync configuration options
- Update design doc with implementation notes
|