upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

summaryrefslogtreecommitdiff
path: root/docs/explanation/monitoring.md
AgeCommit message (Collapse)Author
2026-01-09docs: integrate rejected events index into architecture documentationDanConwayDev
- Add rejected events index to architecture.md with two-tier system explanation - Document NGIT_REJECTED_HOT_CACHE_DURATION_SECS and NGIT_REJECTED_COLD_INDEX_EXPIRY_SECS in configuration.md - Add comprehensive rejected events metrics section to monitoring.md with Grafana queries and alerts - Explain negentropy integration with rejected index in grasp-02-proactive-sync.md - Document state event authorization defense-in-depth and rejection tracking in inline-authorization.md This integrates information from work/rejected-events-index-summary.md into the main documentation, ensuring architecture docs accurately reflect the implemented rejected events index system.
2026-01-09refactor(sync): rename ConnectedDegraded to ConnectedHistoricSyncFailuresDanConwayDev
Resolves naming conflict with RelayHealthState::Degraded by using a more explicit name that clearly indicates the connection status relates to historic sync failures, not connection health degradation. Changes: - ConnectionStatus::ConnectedDegraded → ConnectedHistoricSyncFailures - Updated all documentation and comments - Updated Prometheus metric descriptions - Metric value remains 4 for backward compatibility This makes it clear that: - ConnectedHistoricSyncFailures = connection lifecycle (missing historic data) - RelayHealthState::Degraded = connection health (reliability issues) These are orthogonal concerns - a relay can be ConnectedHistoricSyncFailures but Healthy, or Connected but Degraded.
2026-01-09feat(sync): add ConnectedDegraded status for failed historic syncDanConwayDev
- Add ConnectionStatus::ConnectedDegraded (status=4 in metrics) - Track batch failures via PendingBatch.failed field - Track relay-level failures via RelayState.historic_sync_had_failures - Transition to ConnectedDegraded when any batch fails during historic sync - Add is_live_sync_active() helper for cleaner match patterns - Update state machine diagram with ConnectedDegraded transitions - Update metrics docs with status=4 and example queries Fixes issue where relays with failed negentropy retries would incorrectly transition to Connected status despite missing data. Now operators can distinguish 'fully synced' vs 'degraded (partial data)'.
2026-01-09feat(sync): add Syncing connection status to track historic sync progressDanConwayDev
- Add ConnectionStatus::Syncing state between Connecting and Connected - Track historic_sync_completed and historic_sync_completed_at in RelayState - Auto-detect sync completion via check_and_complete_historic_sync() - Update metrics: ngit_sync_relay_connected now shows 0-3 (disconnected/connecting/syncing/connected) - Update Prometheus metric documentation with new status values - Add state machine diagram showing Syncing transition - Operators can now distinguish 'connected but catching up' vs 'fully synced'
2025-12-04docs: update GRASP-02 proactive sync event sync approachDanConwayDev
2025-12-04add prometheus metricsDanConwayDev