Diffstat (limited to 'docs/E2E_FIX_PLAN.md')
| -rw-r--r-- | docs/E2E_FIX_PLAN.md | 177 |
1 files changed, 177 insertions, 0 deletions
diff --git a/docs/E2E_FIX_PLAN.md b/docs/E2E_FIX_PLAN.md new file mode 100644 index 0000000..52f8305 --- /dev/null +++ b/docs/E2E_FIX_PLAN.md | |||
| @@ -0,0 +1,177 @@ | |||
| 1 | # E2E Test Stability Fix Plan | ||
| 2 | |||
| 3 | ## Problem Statement | ||
| 4 | |||
| 5 | E2E tests on physical boards are failing due to five root causes: | ||
| 6 | 1. **LWIP socket exhaustion** (RC-0) — `LWIP_MAX_SOCKETS=10` was too low for two httpd servers + DNS + DoT + wifistr WebSockets | ||
| 7 | 2. **Over-tuned httpd settings** (RC-1) — setting `max_open_sockets=2` and `keep_alive_enable=false` caused socket leaks by interfering with ESP-IDF's internal session management | ||
| 8 | 3. **Owner auto-grant** (RC-2) — makes "no internet before auth" tests non-deterministic | ||
| 9 | 4. **No boot-ready probe** (RC-3) — tests start before HTTP servers are up | ||
| 10 | 5. **Serial monitoring resets** (RC-4) — Python `serial.Serial()` toggles DTR/RTS on USB-Serial/JTAG boards, causing chip resets mid-operation | ||
| 11 | |||
| 12 | ### Baseline Test Results (Board A, before fixes) | ||
| 13 | |||
| 14 | | Suite | Pass | Fail | Notes | | ||
| 15 | |---|---|---|---| | ||
| 16 | | Smoke | 2/6 | 4 | Port 80 unresponsive, cascading failures | | ||
| 17 | | Network | 4/7 | 3 | DNS forward + ping after auth (timing) | | ||
| 18 | | API | 16/20 | 4 | Portal port 80 slow/crashed, captive URIs | | ||
| 19 | | DNS+Firewall | 15/16 | 1 | Ping after auth (timing) | | ||
| 20 | | Reset-Auth | 12/15 | 3 | Allotment was 0 (fixed), 2nd payment | | ||
| 21 | | Session | 14/14 | 0 | Perfect | | ||
| 22 | | Phase 2 | 12/12 | 0 | Perfect | | ||
| 23 | |||
| 24 | ### Verified Test Results (Board C on `/dev/ttyACM2`, after all fixes, commit `144b48f`) | ||
| 25 | |||
| 26 | All API endpoints verified working on AP IP `10.192.45.1` with 2-3s delays between requests: | ||
| 27 | - `GET /usage` — returns session/client counts (50/50 sequential requests passed) | ||
| 28 | - `GET /portal-config` — returns `{priceSats, stepMs, mintUrl, metric, stepBytes}` | ||
| 29 | - `GET /whoami` — returns client IP | ||
| 30 | - `GET /grant_access` — grants firewall access | ||
| 31 | - `POST /` (payment) — accepts Cashu token, returns `kind:1022` | ||
| 32 | - `GET /` (port 80 portal) — returns 3829 bytes HTML | ||
| 33 | - `GET /reset_authentication` — clears all sessions and firewall rules | ||
| 34 | |||
| 35 | Full payment flow verified: check → pay → verify → grant → portal → reset → verify clean state. | ||
| 36 | |||
| 37 | --- | ||
| 38 | |||
| 39 | ## Root Causes | ||
| 40 | |||
| 41 | ### RC-0: LWIP socket exhaustion (FIXED) | ||
| 42 | |||
| 43 | `CONFIG_LWIP_MAX_SOCKETS=10` in sdkconfig. Socket budget at steady state: | ||
| 44 | |||
| 45 | | Component | Sockets | Notes | | ||
| 46 | |---|---|---| | ||
| 47 | | Captive portal (port 80) | 5 | 1 listen + 4 workers (default `max_open_sockets`) | | ||
| 48 | | API server (port 2121) | 5 | 1 listen + 4 workers | | ||
| 49 | | DNS server (UDP 53) | 1 | | | ||
| 50 | | DoT reject (TCP 853) | 1 | | | ||
| 51 | | wifistr WebSocket x2 | 2 | relay.damus.io + nos.lol | | ||
| 52 | | **Total** | **14** | **Exceeds LWIP_MAX_SOCKETS=10 by 4** | | ||
| 53 | |||
| 54 | **Fix** (commit `144b48f`): Set `CONFIG_LWIP_MAX_SOCKETS=20` (matching standalone tollgate). Use default `max_open_sockets=4` on both servers. A previous fix attempt set `max_open_sockets=2`, which caused worse problems (see RC-1). | ||
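
The budget arithmetic in the table above can be checked with a short sketch. The per-component counts are copied from the table; the dictionary keys and the `socket_headroom` helper are hypothetical names used only for illustration and do not exist in the firmware.

```python
# Steady-state socket budget, with values copied from the RC-0 table.
SOCKET_BUDGET = {
    "captive_portal_port_80": 5,  # 1 listen + 4 workers
    "api_server_port_2121": 5,    # 1 listen + 4 workers
    "dns_server_udp_53": 1,
    "dot_reject_tcp_853": 1,
    "wifistr_websockets": 2,      # relay.damus.io + nos.lol
}


def socket_headroom(max_sockets, budget=SOCKET_BUDGET):
    """Return sockets left over after the steady-state budget.

    A negative result means the budget exceeds the configured limit.
    """
    return max_sockets - sum(budget.values())


if __name__ == "__main__":
    for limit in (10, 20):
        print(f"LWIP_MAX_SOCKETS={limit}: headroom {socket_headroom(limit)}")
```

With the old limit of 10 the headroom is -4, matching the table; with 20 it is 6, which leaves room for short-lived client connections.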
| 55 | |||
| 56 | ### RC-1: Over-tuned httpd settings (FIXED) | ||
| 57 | |||
| 58 | The initial fix reduced `max_open_sockets` to 2 and added `keep_alive_enable=false` and `linger_timeout=0`. This caused socket leaks: ESP-IDF's httpd manages its own session pool internally, and overriding these settings interfered with its socket lifecycle management. | ||
| 59 | |||
| 60 | **Symptoms**: Board works for 10-20 requests, then all HTTP becomes unresponsive. Sockets accumulate in CLOSE_WAIT/TIME_WAIT and never get freed. | ||
| 61 | |||
| 62 | **Fix** (commit `144b48f`): Reverted to ESP-IDF defaults for all httpd settings except `stack_size=16384` and `max_uri_handlers`. Default `max_open_sockets=4` and `keep_alive_enable=true` (default) work correctly. | ||
| 63 | |||
| 64 | ### RC-2: Owner auto-grant (FIXED) | ||
| 65 | |||
| 66 | `tollgate_core_client_connected()` granted firewall access to the first WiFi client unconditionally. The call also passed the client IP as `0` (a bug), which made the resulting behavior nondeterministic. | ||
| 67 | |||
| 68 | **Fix** (commit `c89ab31`): Removed `tollgate_core_fw_grant()` call from `client_connected()`. Owner tracking kept for logging. | ||
| 69 | |||
| 70 | ### RC-3: No boot-ready probe (PENDING) | ||
| 71 | |||
| 72 | Tests use fixed sleeps after flashing and never poll for HTTP server readiness, so they can start before the servers are up. | ||
| 73 | |||
| 74 | **Fix**: Add `arch-wait-ready` Makefile target that polls `:2121/usage`. | ||
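
As a rough illustration of the polling idea, here is a minimal Python sketch that a Makefile target could invoke. It is not the planned `arch-wait-ready` implementation: the function names, the 60-second deadline, and the 2-second interval are all assumptions, and only the AP IP `10.192.45.1` and the `:2121/usage` endpoint come from this plan.

```python
import http.client
import time
import urllib.error
import urllib.request

USAGE_URL = "http://10.192.45.1:2121/usage"  # AP IP and endpoint taken from this plan


def http_ok(url, timeout_s=2.0):
    """Return True if a GET to `url` answers HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_ready(probe=lambda: http_ok(USAGE_URL), deadline_s=60.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it succeeds or `deadline_s` elapses.

    Returns True once the probe succeeds, False if the deadline passes first.
    `clock` and `sleep` are injectable so the loop can be tested without hardware.
    """
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A caller would exit non-zero when `wait_ready()` returns `False`, so that a test target depending on it stops before running against a board that never came up.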
| 75 | |||
| 76 | ### RC-4: Serial monitoring resets boards (DISCOVERED) | ||
| 77 | |||
| 78 | Opening a port with Python `serial.Serial()` on USB-Serial/JTAG ESP32-S3 boards toggles DTR/RTS during initialization, causing `rst:0x15 (USB_UART_CHIP_RESET)`. Setting `dtr=False, rts=False` after construction does not help, because the chip has already been reset by then. | ||
| 79 | |||
| 80 | **Symptoms**: | ||
| 81 | - Board boots successfully, services start, gets IP | ||
| 82 | - Python serial read causes immediate `ESP-ROM: boot:0x0 (DOWNLOAD)` or `rst:0x15` | ||
| 83 | - Board appears "dead" after testing — actually reset into download mode | ||
| 84 | - Earlier sessions attributed this to "socket exhaustion" or "WiFi instability" | ||
| 85 | |||
| 86 | **Fix**: Never use Python `serial.Serial()` for monitoring. Use `idf.py monitor` (which handles DTR/RTS correctly) or read-only tools. All hardware access must go through Makefile mutex targets. | ||
| 87 | |||
| 88 | --- | ||
| 89 | |||
| 90 | ## Fix Steps | ||
| 91 | |||
| 92 | ### Step 0: Fix LWIP socket exhaustion — DONE | ||
| 93 | - [x] Set `CONFIG_LWIP_MAX_SOCKETS=20` via sdkconfig (commit `144b48f`) | ||
| 94 | - [x] Use default `max_open_sockets` on both HTTP servers (removed override) | ||
| 95 | - [x] Verified: 50/50 sequential API requests pass on Board C (`/dev/ttyACM2`) | ||
| 96 | |||
| 97 | **Files**: `sdkconfig`, `main/captive_portal.c`, `main/tollgate_api.c` | ||
| 98 | |||
| 99 | ### Step 1: Kill owner auto-grant — DONE | ||
| 100 | - [x] Remove `tollgate_core_fw_grant()` from `tollgate_core_client_connected()` (commit `c89ab31`) | ||
| 101 | - [x] Keep owner tracking for logging | ||
| 102 | |||
| 103 | **Files**: `components/tollgate_core/src/tollgate_core.c` | ||
| 104 | |||
| 105 | ### Step 2: HTTP server robustness — DONE | ||
| 106 | - [x] Add `Connection: close` header to port 80 responses (commit `c89ab31`) | ||
| 107 | - [x] Increase captive portal stack to 16384 (commit `c89ab31`) | ||
| 108 | - [x] Use ESP-IDF default socket management (commit `144b48f`) | ||
| 109 | |||
| 110 | **Files**: `main/captive_portal.c`, `main/tollgate_api.c` | ||
| 111 | |||
| 112 | ### Step 3: Add API endpoints — DONE | ||
| 113 | - [x] `GET /portal-config` on port 2121 returning `{priceSats, mintUrl, ...}` (commit `c89ab31`) | ||
| 114 | - [x] `GET /grant_access` — manual firewall grant (commit `c89ab31`) | ||
| 115 | - [x] `GET /reset_authentication` — clear all auth (commit `c89ab31`) | ||
| 116 | - [x] CORS header on portal-config | ||
| 117 | |||
| 118 | **Files**: `main/tollgate_api.c` | ||
| 119 | |||
| 120 | ### Step 4: Remove NAPT flush from `fw_revoke_all()` — DONE | ||
| 121 | - [x] Remove `ip_napt_enable()` toggle that caused 30s hangs (commit `c89ab31`) | ||
| 122 | |||
| 123 | **Files**: `components/tollgate_core/src/tollgate_core_firewall.c` | ||
| 124 | |||
| 125 | ### Step 5: Boot-ready probe — PENDING | ||
| 126 | - [ ] Add `arch-wait-ready` Makefile target that polls `:2121/usage` | ||
| 127 | - [ ] Update `arch-test-full` to call `arch-wait-ready` first | ||
| 128 | - [ ] Add 2-3 second delays between test requests (burst rate mitigation) | ||
| 129 | |||
| 130 | **Files**: `physical-router-test-automation/esp32/Makefile` | ||
| 131 | |||
| 132 | ### Step 6: Hardware testing — BLOCKED | ||
| 133 | - [ ] Flash to working board via Makefile mutex targets | ||
| 134 | - [ ] Run `make arch-test-full` | ||
| 135 | - [ ] Document results | ||
| 136 | - [ ] Blocker: Board A is stuck in download mode (GPIO0 strapping pin) and needs a hardware fix | ||
| 137 | |||
| 138 | --- | ||
| 139 | |||
| 140 | ## Burst Rate Limitation | ||
| 141 | |||
| 142 | On USB-Serial/JTAG ESP32-S3 boards, back-to-back HTTP requests with no delay can | ||
| 143 | overwhelm the WiFi AP stack. With 2-3 second delays between requests, the board | ||
| 144 | handles 50+ sequential requests reliably. Without delays, rapid bursts of 10+ | ||
| 145 | requests can cause the WiFi AP to become unresponsive. | ||
| 146 | |||
| 147 | **Mitigation**: E2E tests should include a 2-3 second delay between HTTP requests. | ||
| 148 | This is a WiFi AP throughput limitation, not a firmware bug. | ||
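
The mitigation can be sketched as a small pacing helper for the test harness. This is an illustrative assumption rather than existing test code: `run_paced` and its parameters are hypothetical names, and the 2.5-second default is simply a value inside the 2-3 second range recommended above.

```python
import time


def run_paced(steps, delay_s=2.5, sleep=time.sleep):
    """Call each zero-argument step in order, pausing between consecutive steps.

    `steps` is a sequence of callables (for example, one per HTTP request).
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    results = []
    for index, step in enumerate(steps):
        if index > 0:
            sleep(delay_s)  # pause before every request except the first
        results.append(step())
    return results
```

Each step would wrap a single HTTP request, so a test issues its requests through `run_paced` instead of back to back.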
| 149 | |||
| 150 | ## Board Status | ||
| 151 | |||
| 152 | | Board | Port | MAC | Status | | ||
| 153 | |-------|------|-----|--------| | ||
| 154 | | Board A | `/dev/ttyACM0` | `94:a9:90:2e:37:7c` | **BROKEN** — stuck in download mode (`boot:0x0`), GPIO0 strapping pin issue, needs hardware fix | | ||
| 155 | | Board B | `/dev/ttyACM1` | `fc:01:2c:c5:50:50` | Unknown — newly discovered, needs firmware flash | | ||
| 156 | | Board C | `/dev/ttyACM2` | `20:6e:f1:98:d7:08` | **WORKING** — all endpoints verified, payment flow tested | | ||
| 157 | |||
| 158 | ## Key Architecture Decisions | ||
| 159 | |||
| 160 | - **Port 80**: Portal HTML + captive detection URIs only. No API, no state mutation. | ||
| 161 | - **Port 2121**: All API operations (discovery, payment, grant, reset, whoami, usage, wallet, portal-config). | ||
| 162 | - **Owner tracking**: Kept for logging/display, no longer grants free internet. | ||
| 163 | - **Connection: close**: Set on ALL port 80 responses as a hint that clients should not keep the connection open. | ||
| 164 | - **Default httpd settings**: ESP-IDF's built-in session management works correctly. Do not override `max_open_sockets`, `keep_alive_enable`, `linger_timeout`, or timeouts. | ||
| 165 | |||
| 166 | ## Execution Order | ||
| 167 | |||
| 168 | Steps 0-4 are DONE (commits `c89ab31`, `144b48f`). | ||
| 169 | Step 5 (boot-ready probe) is next — code only, no hardware needed. | ||
| 170 | Step 6 (validation) requires a working board, accessed via the Makefile mutex targets. | ||
| 171 | |||
| 172 | ## Hardware Access Rules | ||
| 173 | |||
| 174 | - **ALWAYS** use Makefile mutex targets (`make arch-flash-a`, etc.) for hardware access | ||
| 175 | - **NEVER** call `esptool.py` directly — bypasses mutex and conflicts with other sessions | ||
| 176 | - **NEVER** use Python `serial.Serial()` for monitoring — causes DTR/RTS resets on USB-Serial/JTAG | ||
| 177 | - Multiple opencode sessions may be active — mutex prevents board conflicts | ||