Diffstat (limited to 'docs/explanation/grasp-02-proactive-sync.md')
| -rw-r--r-- | docs/explanation/grasp-02-proactive-sync.md | 29 |
1 files changed, 23 insertions, 6 deletions
diff --git a/docs/explanation/grasp-02-proactive-sync.md b/docs/explanation/grasp-02-proactive-sync.md
index e1fb367..b17b8bf 100644
--- a/docs/explanation/grasp-02-proactive-sync.md
+++ b/docs/explanation/grasp-02-proactive-sync.md
| @@ -79,6 +79,8 @@ pub enum ConnectionStatus { | |||
| 79 | Syncing, | 79 | Syncing, |
| 80 | /// Successfully connected, historic sync completed | 80 | /// Successfully connected, historic sync completed |
| 81 | Connected, | 81 | Connected, |
| 82 | /// Successfully connected, historic sync failed but live sync active | ||
| 83 | ConnectedDegraded, | ||
| 82 | } | 84 | } |
| 83 | 85 | ||
| 84 | /// Complete state for a single relay - combines sync needs with connection lifecycle | 86 | /// Complete state for a single relay - combines sync needs with connection lifecycle |
| @@ -207,15 +209,19 @@ stateDiagram-v2 | |||
| 207 | Disconnected --> Connecting: retry_disconnected_relays → try_connect_relay | 209 | Disconnected --> Connecting: retry_disconnected_relays → try_connect_relay |
| 208 | Connecting --> Syncing: success → handle_connect_or_reconnect | 210 | Connecting --> Syncing: success → handle_connect_or_reconnect |
| 209 | Connecting --> Disconnected: failure + record in health tracker | 211 | Connecting --> Disconnected: failure + record in health tracker |
| 210 | Syncing --> Connected: all historic batches complete → check_and_complete_historic_sync | 212 | Syncing --> Connected: all batches succeed → check_and_complete_historic_sync |
| 213 | Syncing --> ConnectedDegraded: any batch failed → check_and_complete_historic_sync | ||
| 211 | Syncing --> Disconnected: connection lost → handle_disconnect | 214 | Syncing --> Disconnected: connection lost → handle_disconnect |
| 212 | Connected --> Disconnected: connection lost → handle_disconnect | 215 | Connected --> Disconnected: connection lost → handle_disconnect |
| 216 | ConnectedDegraded --> Disconnected: connection lost → handle_disconnect | ||
| 213 | Connected --> [*]: intentional disconnect via check_disconnects | 217 | Connected --> [*]: intentional disconnect via check_disconnects |
| 218 | ConnectedDegraded --> [*]: intentional disconnect via check_disconnects | ||
| 214 | 219 | ||
| 215 | note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap | 220 | note right of Disconnected: disconnected_at set for 15min rule<br/>RelayConnection kept in HashMap |
| 216 | note right of Connecting: connection attempt with timeout | 221 | note right of Connecting: connection attempt with timeout |
| 217 | note right of Syncing: historic sync in progress<br/>event loop spawned here | 222 | note right of Syncing: historic sync in progress<br/>event loop spawned here |
| 218 | note right of Connected: historic sync complete<br/>last_connected tracked for since filter | 223 | note right of Connected: historic sync complete<br/>last_connected tracked for since filter |
| 224 | note right of ConnectedDegraded: historic sync failed (missing events)<br/>live sync active, partial data | ||
| 219 | ``` | 225 | ``` |
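Review note: the edges visible in this hunk of the state diagram can be written down as a transition check. This is only a sketch. The variant names come from the diagram, but `is_allowed_transition` is a hypothetical helper that does not appear in the patch, and it ignores the intentional-disconnect exits to `[*]` as well as any edges outside this hunk.

```rust
/// Connection states as named in the state diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedDegraded,
}

/// Hypothetical helper: returns true for the state-to-state edges shown in
/// this hunk of the diagram. Terminal exits to `[*]` are not modeled.
pub fn is_allowed_transition(from: ConnectionStatus, to: ConnectionStatus) -> bool {
    use ConnectionStatus::*;
    matches!(
        (from, to),
        (Disconnected, Connecting)
            | (Connecting, Syncing)
            | (Connecting, Disconnected)
            | (Syncing, Connected)
            | (Syncing, ConnectedDegraded)
            | (Syncing, Disconnected)
            | (Connected, Disconnected)
            | (ConnectedDegraded, Disconnected)
    )
}
```

Note that `ConnectedDegraded` is reachable only from `Syncing` and, like `Connected`, leaves only through a disconnect.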
| 220 | 226 | ||
| 221 | ### Connection Flow Methods | 227 | ### Connection Flow Methods |
| @@ -240,17 +246,28 @@ When a relay first connects, it enters the **Syncing** state and begins historic | |||
| 240 | Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncIndex`. As EOSE messages arrive: | 246 | Each layer creates one or more `PendingBatch` entries tracked in `PendingSyncIndex`. As EOSE messages arrive: |
| 241 | 247 | ||
| 242 | - `handle_eose()` confirms each batch via `confirm_batch()` | 248 | - `handle_eose()` confirms each batch via `confirm_batch()` |
| 243 | - `confirm_batch()` moves items to confirmed state and calls `check_and_complete_historic_sync()` | 249 | - `confirm_batch()` moves items to confirmed state, tracks if batch failed, and calls `check_and_complete_historic_sync()` |
| 244 | - `check_and_complete_historic_sync()` checks if `PendingSyncIndex` is empty for this relay | 250 | - `check_and_complete_historic_sync()` uses a **double-check pattern** to avoid race conditions: |
| 245 | - When empty: transitions `Syncing` → `Connected`, sets `historic_sync_completed = true` | 251 | 1. First check: Are there pending batches? If yes, return early |
| 252 | 2. Wait 6 seconds (batch window + buffer) for self-subscriber to process in-flight events | ||
| 253 | 3. Second check: Are there pending batches now (for example, newly created Layer 2/3 batches)? If yes, return early | ||
| 254 | 4. If no pending batches after wait: | ||
| 255 | - If any batch failed: transition `Syncing` → `ConnectedDegraded` | ||
| 256 | - If all batches succeeded: transition `Syncing` → `Connected` | ||
| 257 | - Set `historic_sync_completed = true` | ||
| 258 | |||
| 259 | **Why the double-check?** There's an async gap between receiving EOSE and the self-subscriber processing events to create Layer 2/3 filters. The 6-second wait (5s batch window + 1s buffer) ensures we don't prematurely mark sync complete while Layer 2/3 batches are being created. | ||
| 260 | |||
| 261 | **Batch Failure Tracking**: When negentropy retry protection triggers (relay returns zero requested events on retry), the batch is marked as `failed = true`. This causes the relay to transition to `ConnectedDegraded` instead of `Connected`, signaling that live sync is active but historic sync is incomplete. | ||
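Review note: the double-check and the failure-aware transition described in the added lines can be sketched as a small function. This is a minimal illustration, not the patch's implementation. The closures `pending_batches`, `any_batch_failed`, and `wait` are hypothetical stand-ins for reading `PendingSyncIndex`, reading the failed-batch flag, and sleeping; the real method is presumably async, and this version is synchronous to keep the control flow easy to follow.

```rust
use std::time::Duration;

/// Only the states relevant to completing historic sync.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Syncing,
    Connected,
    ConnectedDegraded,
}

/// 5s batch window + 1s buffer, as stated in the documentation text.
pub const BATCH_WINDOW: Duration = Duration::from_secs(5);
pub const BUFFER: Duration = Duration::from_secs(1);

/// Hypothetical sketch of the double-check pattern.
/// Returns the state the relay should be in after the check.
pub fn check_and_complete(
    mut pending_batches: impl FnMut() -> usize,
    any_batch_failed: impl FnOnce() -> bool,
    wait: impl FnOnce(Duration),
) -> ConnectionStatus {
    // First check: batches still pending, so sync is not finished.
    if pending_batches() > 0 {
        return ConnectionStatus::Syncing;
    }
    // Give in-flight events time to create follow-up (Layer 2/3) batches.
    wait(BATCH_WINDOW + BUFFER);
    // Second check: new batches appeared during the wait, so keep syncing.
    if pending_batches() > 0 {
        return ConnectionStatus::Syncing;
    }
    // Nothing pending after the wait: historic sync is over, one way or another.
    if any_batch_failed() {
        ConnectionStatus::ConnectedDegraded
    } else {
        ConnectionStatus::Connected
    }
}
```

Passing the wait in as a closure keeps the sketch testable without actually sleeping for six seconds; a caller could supply `std::thread::sleep` or an async equivalent.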
| 246 | 262 | ||
| 247 | **Metrics tracking**: The `ngit_sync_relay_connected` metric shows: | 263 | **Metrics tracking**: The `ngit_sync_relay_connected` metric shows: |
| 248 | - `0` = Disconnected | 264 | - `0` = Disconnected |
| 249 | - `1` = Connecting | 265 | - `1` = Connecting |
| 250 | - `2` = Syncing (historic sync in progress) | 266 | - `2` = Syncing (historic sync in progress) |
| 251 | - `3` = Connected (historic sync complete, live sync active) | 267 | - `3` = Connected (historic sync complete, live sync active) |
| 268 | - `4` = ConnectedDegraded (historic sync failed, live sync active, partial data) | ||
| 252 | 269 | ||
| 253 | This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live". | 270 | This allows operators to monitor sync progress and distinguish between "connected but still catching up" vs "fully synced and live" vs "degraded (missing historic data)". |
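Review note: the metric values listed above amount to a total mapping from connection state to gauge value. A minimal sketch follows. The function name `relay_connected_gauge_value` is hypothetical, and how the value is actually exported to `ngit_sync_relay_connected` is not shown in this patch.

```rust
/// Connection states as named in the documentation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionStatus {
    Disconnected,
    Connecting,
    Syncing,
    Connected,
    ConnectedDegraded,
}

/// Hypothetical helper: the gauge value documented for each state.
pub fn relay_connected_gauge_value(status: ConnectionStatus) -> i64 {
    match status {
        ConnectionStatus::Disconnected => 0,
        ConnectionStatus::Connecting => 1,
        ConnectionStatus::Syncing => 2,
        ConnectionStatus::Connected => 3,
        // Live sync is active, but historic data is incomplete.
        ConnectionStatus::ConnectedDegraded => 4,
    }
}
```

Using an exhaustive `match` means that adding a further state later fails to compile until a metric value is chosen for it.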
| 254 | 271 | ||
| 255 | ### Event Loop Lifecycle | 272 | ### Event Loop Lifecycle |
| 256 | 273 | ||