upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

summaryrefslogtreecommitdiff
path: root/src/sync/metrics.rs
AgeCommit message (Collapse)Author
2026-01-10Implement relay naughty list featureDanConwayDev
Add naughty list tracking for relays with persistent infrastructure issues (DNS failures, TLS certificate errors, protocol violations) to reduce log noise and provide better visibility via metrics. Key features: - Classify errors into naughty (persistent) vs transient (temporary) - Track naughty relays with category, reason, and occurrence count - Log WARN on first naughty occurrence, DEBUG on repeats - Automatic expiration after 12 hours (configurable) - Prometheus metrics for monitoring naughty relays by category - Periodic cleanup task integrated with health checker Components added: - src/sync/naughty_list.rs: Core naughty list tracker with error classification - NaughtyListTracker integration in RelayHealthTracker - Connection error handling updates in sync manager - Naughty list metrics (total by category, detailed info per relay) - Config option for naughty_list_expiration_hours (default: 12) Closes DNS lookup failures and TLS certificate errors tracking issues.
2026-01-09fix: eliminate disconnect race condition by adding Disconnecting stateDanConwayDev
Previously, disconnect_relay() would immediately remove RelayState and pending batches before the event loop finished draining messages. This caused confusing 'unknown relay' debug messages for EOSE and other events that arrived after state removal but were expected during normal shutdown. Changes: - Add ConnectionStatus::Disconnecting to track intentional disconnects - disconnect_relay() now marks relay as Disconnecting (keeps state) - Event loop drains messages while state exists - handle_disconnect() detects intentional vs unexpected disconnects: - Intentional: Completes cleanup by removing state/connections - Unexpected: Updates to Disconnected, keeps connection for retry - handle_eose() suppresses logs for Disconnecting relays (TRACE level) - check_disconnects() skips relays already in Disconnecting state This ensures proper sequencing: mark->drain->cleanup instead of remove->drain->confusion. Fixes the root cause instead of just hiding log messages.
2026-01-09refactor(sync): parameterize rejected index metrics by event typeDanConwayDev
Replace duplicate metrics methods (announcements vs states) with unified methods using IntGaugeVec/IntCounterVec with an event_type label: - update_rejected_hot_cache_size(event_type, size) - record_rejected_hot_cache_hit(event_type) - record_rejected_hot_cache_miss(event_type) - record_rejected_hot_cache_expired(event_type, count) - update_rejected_cold_index_size(event_type, size) - record_rejected_cold_index_expired(event_type, count) - record_rejected_invalidation(event_type, count) Prometheus labels remain separate (event_type="announcement" vs event_type="state") but implementation is now unified. Phase 4 of rejected events index refactoring.
2026-01-09chore: cargo fmtDanConwayDev
2026-01-09feat: implement state event authorization per GRASP-01 specDanConwayDev
Add comprehensive authorization checks to ensure state events are only accepted from maintainers of accepted repository announcements. This implements the core GRASP-01 requirement that pushes must match the latest state announcement "respecting the maintainer set." Changes: 1. StatePolicy authorization (src/nostr/policy/state.rs): - Check authorization BEFORE git data validation (fail-fast) - Reject if no announcement exists for repository - Reject if author not in maintainer set - Use existing helpers: fetch_repository_data() and pubkey_authorised_for_repo_owners() - Structured logging for all rejections 2. Purgatory invalidation (src/nostr/builder.rs): - New method: check_purgatory_state_events_for_identifier() - Called when announcements accepted (Accept and AcceptMaintainer) - Re-evaluates state events in purgatory for the identifier - Processes newly-authorized events (releases from purgatory) - Keeps unauthorized events for natural expiry (30 min) - Enables retroactive authorization when announcements arrive late 3. Purgatory sync authorization (src/git/sync.rs): - Check authorization BEFORE processing git data - Remove unauthorized events from purgatory (permanent rejection) - Prevents processing even if git data arrives first - Structured logging for monitoring 4. Rejected events tracking (src/sync/rejected_index.rs): - Add support for tracking rejected state events - New methods: add_state(), contains_state() - Separate metrics for state rejections - Enables sync to avoid re-fetching rejected states 5. Sync metrics (src/sync/metrics.rs, src/sync/mod.rs): - Add state-specific metrics (hot cache, cold index) - Track rejected states separately from announcements - Support monitoring of authorization rejections 6. Comprehensive tests (tests/state_authorization.rs): - test_reject_state_without_announcement - test_reject_state_from_unauthorized_author - test_accept_state_from_announcement_author - test_accept_state_from_maintainer Security Impact: - Before: State events could be published by anyone - After: Only maintainers can publish state events - Defense-in-depth: Authorization checked at 3 points: 1. On arrival (StatePolicy) 2. On announcement acceptance (purgatory re-evaluation) 3. On git data arrival (purgatory sync) All tests pass: - 248 unit tests - 51 NIP-34 announcement tests - 4 new state authorization tests - 9 rejected index tests Closes: State authorization requirement from GRASP-01 spec
2026-01-09feat(sync): add cleanup loops and metrics for rejected events indexDanConwayDev
Add automatic cleanup and Prometheus metrics for the two-tier rejected events index that caches rejected announcements for re-processing. Cleanup loops: - Hot cache: Every 60 seconds (events expire after 2 minutes) - Cold index: Every 24 hours (metadata expires after 7 days) - Background task with graceful shutdown support New Prometheus metrics (7): - Gauges: hot_cache_current, cold_index_current - Counters: hits, misses, hot_expired, cold_expired, invalidated This completes the maintainer announcement re-processing feature, reducing wait time from 24 hours to <1 second when a maintainer's announcement arrives before the repository owner's announcement. Memory is bounded through automatic cleanup, and comprehensive metrics enable monitoring of hit rates, memory usage, and cleanup effectiveness. Changes: - src/sync/metrics.rs: Added 7 metrics with recording methods - src/sync/rejected_index.rs: Added optional metrics support - src/sync/mod.rs: Added cleanup background task Tests: 248 library tests passing, 3 integration tests passing
2026-01-09refactor(sync): rename ConnectedDegraded to ConnectedHistoricSyncFailuresDanConwayDev
Resolves naming conflict with RelayHealthState::Degraded by using a more explicit name that clearly indicates the connection status relates to historic sync failures, not connection health degradation. Changes: - ConnectionStatus::ConnectedDegraded → ConnectedHistoricSyncFailures - Updated all documentation and comments - Updated Prometheus metric descriptions - Metric value remains 4 for backward compatibility This makes it clear that: - ConnectedHistoricSyncFailures = connection lifecycle (missing historic data) - RelayHealthState::Degraded = connection health (reliability issues) These are orthogonal concerns - a relay can be ConnectedHistoricSyncFailures but Healthy, or Connected but Degraded.
2026-01-09feat(sync): add ConnectedDegraded status for failed historic syncDanConwayDev
- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics) - Track batch failures via PendingBatch.failed field - Track relay-level failures via RelayState.historic_sync_had_failures - Transition to ConnectedDegraded when any batch fails during historic sync - Add is_live_sync_active() helper for cleaner match patterns - Update state machine diagram with ConnectedDegraded transitions - Update metrics docs with status=4 and example queries Fixes issue where relays with failed negentropy retries would incorrectly transition to Connected status despite missing data. Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
2026-01-09feat(sync): add Syncing connection status to track historic sync progressDanConwayDev
- Add ConnectionStatus::Syncing state between Connecting and Connected - Track historic_sync_completed and historic_sync_completed_at in RelayState - Auto-detect sync completion via check_and_complete_historic_sync() - Update metrics: ngit_sync_relay_connected now shows 0-3 (disconnected/connecting/syncing/connected) - Update Prometheus metric documentation with new status values - Add state machine diagram showing Syncing transition - Operators can now distinguish 'connected but catching up' vs 'fully synced'
2025-12-22sync: add req rate-limit detection and cooldownDanConwayDev
2025-12-19Simplify sync metrics to track only newly saved eventsDanConwayDev
Replace broken event counting that occurred before duplicate/policy checks with accurate tracking of events that are new, accepted, and saved. Changes: - Added ProcessResult enum to track event processing outcomes - Modified process_event_static() to return ProcessResult - Replaced events_total (with source labels) with events_synced_total - Removed gap_events_total and event_source module - Removed eose_received flag (EOSE is per-subscription, not suitable) - Updated all tests to use new simplified API The new ngit_sync_events_synced_total metric only counts events that: 1. Are new (not duplicates) 2. Pass write policy validation 3. Are successfully saved to database All 165 tests pass (124 lib + 41 integration)
2025-12-18sync: fix sync connectionDanConwayDev
2025-12-11fix: resolve all fmt and clippy warningsDanConwayDev
Main lib (src/): - Add #[allow(dead_code)] for build_info field (stored to prevent Prometheus unregistration) - Add #[allow(dead_code)] for first_seen field (reserved for future rate limiting) - Replace .or_insert_with(RelaySyncNeeds::default) with .or_default() - Replace manual div_ceil implementations with .div_ceil(100) Test code (tests/): - Replace .expect(&format!(...)) with .unwrap_or_else(|_| panic!(...)) - Remove needless borrows in fetch_metrics() calls - Add #[allow(dead_code)] and #[allow(unused_imports)] to test helpers module grasp-audit: - Apply cargo fmt to fix formatting
2025-12-10feat: create sync metrics module (Phase 1)DanConwayDev
2025-12-10stub of sync v4DanConwayDev
2025-12-04feat(sync): Phase 6 - observability and production readinessDanConwayDev
- Add SyncMetrics with full Prometheus integration - Track sync gaps via catchup events - Update Grafana dashboard with sync panels - Document all sync configuration options - Update design doc with implementation notes