upleb.uk

Public git repos — served from a NIP-34 GRASP relay at git.upleb.uk

author: Your Name <you@example.com>, 2026-05-20 02:10:01 +0530
committer: Your Name <you@example.com>, 2026-05-20 02:10:01 +0530
commit: 82f1fc0d5535eda3fc9eab799d81b3e220dbe4ef
tree: 341dcecb0a87a6219bc51d424316dfadcf69bf65 /docs/E2E_FIX_PLAN.md
parent: 2c12c4281c47aa87a1c7bb82abe09bf9dbc788c3
feat: add tollgate_core component + market config wiring
- Add tollgate_core ESP-IDF component (skeleton: cashu, dns, firewall, session)
- Add tollgate_platform.c with SPIFFS config backend
- Wire market_enabled, market_scan_interval_s, client_auto_switch in config.c
- Add lwip_tollgate_hooks.h (updated from feature branch)
- Add E2E fix plan, tollgate_core design doc, WPA autodetect plan
- Add integration test network helpers
- Add CONSOLIDATION.md plan

Reverts the broken merge (be4788b) that gutted config.c/tollgate_main.c/tollgate_api.c and replaces it with a clean addition on top of intact master.
Diffstat (limited to 'docs/E2E_FIX_PLAN.md')
-rw-r--r-- docs/E2E_FIX_PLAN.md 177
1 files changed, 177 insertions, 0 deletions
diff --git a/docs/E2E_FIX_PLAN.md b/docs/E2E_FIX_PLAN.md
new file mode 100644
index 0000000..52f8305
--- /dev/null
+++ b/docs/E2E_FIX_PLAN.md
@@ -0,0 +1,177 @@
# E2E Test Stability Fix Plan

## Problem Statement

E2E tests on physical boards are failing due to five root causes:
1. **LWIP socket exhaustion** (RC-0): `LWIP_MAX_SOCKETS=10` was too low for two httpd servers + DNS + DoT + wifistr WebSockets
2. **Over-tuned httpd settings** (RC-1): setting `max_open_sockets=2` and `keep_alive_enable=false` caused socket leaks by interfering with ESP-IDF's internal session management
3. **Owner auto-grant** (RC-2): makes "no internet before auth" tests non-deterministic
4. **No boot-ready probe** (RC-3): tests start before HTTP servers are up
5. **Serial monitoring resets** (RC-4): Python `serial.Serial()` toggles DTR/RTS on USB-Serial/JTAG boards, causing chip resets mid-operation

### Baseline Test Results (Board A, before fixes)

| Suite | Pass | Fail | Notes |
|---|---|---|---|
| Smoke | 2/6 | 4 | Port 80 unresponsive, cascading failures |
| Network | 4/7 | 3 | DNS forward + ping after auth (timing) |
| API | 16/20 | 4 | Portal port 80 slow/crashed, captive URIs |
| DNS+Firewall | 15/16 | 1 | Ping after auth (timing) |
| Reset-Auth | 12/15 | 3 | Allotment was 0 (fixed), 2nd payment |
| Session | 14/14 | 0 | Perfect |
| Phase 2 | 12/12 | 0 | Perfect |

### Verified Test Results (Board C on `/dev/ttyACM2`, after all fixes, commit `144b48f`)

All API endpoints were verified working on the AP IP `10.192.45.1`, with 2-3 second delays between requests:
- `GET /usage`: returns session/client counts (50/50 sequential requests passed)
- `GET /portal-config`: returns `{priceSats, stepMs, mintUrl, metric, stepBytes}`
- `GET /whoami`: returns the client IP
- `GET /grant_access`: grants firewall access
- `POST /` (payment): accepts a Cashu token, returns `kind:1022`
- `GET /` (port 80 portal): returns 3829 bytes of HTML
- `GET /reset_authentication`: clears all sessions and firewall rules

Full payment flow verified: check → pay → verify → grant → portal → reset → verify clean state.

---

## Root Causes

### RC-0: LWIP socket exhaustion (FIXED)

`CONFIG_LWIP_MAX_SOCKETS` was set to 10 in sdkconfig. Socket budget at steady state:

| Component | Sockets | Notes |
|---|---|---|
| Captive portal (port 80) | 5 | 1 listen + 4 workers (default `max_open_sockets`) |
| API server (port 2121) | 5 | 1 listen + 4 workers |
| DNS server (UDP 53) | 1 | |
| DoT reject (TCP 853) | 1 | |
| wifistr WebSocket x2 | 2 | relay.damus.io + nos.lol |
| **Total** | **14** | **Exceeds LWIP_MAX_SOCKETS=10 by 4** |

**Fix** (commit `144b48f`): Set `CONFIG_LWIP_MAX_SOCKETS=20` (matching standalone tollgate). Use the default `max_open_sockets=4` on both servers. A previous fix tried `max_open_sockets=2`, which caused worse problems (see RC-1).
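
The budget arithmetic in the table above can be checked with a few lines of Python. This is only a sketch: the per-component counts are copied from the table, while the dictionary keys and the `socket_headroom` helper are illustrative names, not identifiers from the firmware.

```python
# Steady-state socket budget, using the counts from the RC-0 table.
# Key names are illustrative only; they do not appear in the firmware.
SOCKET_BUDGET = {
    "captive_portal_port_80": 5,  # 1 listen + 4 workers
    "api_server_port_2121": 5,    # 1 listen + 4 workers
    "dns_server_udp_53": 1,
    "dot_reject_tcp_853": 1,
    "wifistr_websockets": 2,      # relay.damus.io + nos.lol
}


def socket_headroom(max_sockets, budget=SOCKET_BUDGET):
    """Return spare sockets; a negative value means the budget exceeds the limit."""
    return max_sockets - sum(budget.values())


for limit in (10, 20):
    print(f"LWIP_MAX_SOCKETS={limit}: headroom {socket_headroom(limit):+d}")
```

With the table's figures, the headroom is -4 at `LWIP_MAX_SOCKETS=10` and +6 at 20.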

### RC-1: Over-tuned httpd settings (FIXED)

The initial fix reduced `max_open_sockets` to 2 and added `keep_alive_enable=false` and `linger_timeout=0`. This caused socket leaks: ESP-IDF's httpd manages its own session pool internally, and overriding these settings interfered with socket lifecycle management.

**Symptoms**: The board works for 10-20 requests, then all HTTP becomes unresponsive. Sockets accumulate in CLOSE_WAIT/TIME_WAIT and are never freed.

**Fix** (commit `144b48f`): Reverted to ESP-IDF defaults for all httpd settings except `stack_size=16384` and `max_uri_handlers`. Default `max_open_sockets=4` and `keep_alive_enable=true` (default) work correctly.

### RC-2: Owner auto-grant (FIXED)

`tollgate_core_client_connected()` granted firewall access to the first WiFi client unconditionally. The IP was passed as `0` (a bug), which made the behavior nondeterministic.

**Fix** (commit `c89ab31`): Removed the `tollgate_core_fw_grant()` call from `tollgate_core_client_connected()`. Owner tracking is kept for logging.

### RC-3: No boot-ready probe (PENDING)

Tests use fixed sleeps after flashing and do not poll for HTTP server readiness.

**Fix**: Add an `arch-wait-ready` Makefile target that polls `:2121/usage`.
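
A readiness probe of this kind could look roughly like the sketch below. It assumes the AP IP `10.192.45.1` from the verified results and treats an HTTP 200 from `/usage` as "ready". The function name `wait_ready`, the default timeout and interval, and the injectable `fetch`/`sleep`/`clock` parameters are assumptions made for illustration; the `arch-wait-ready` target itself is not shown in this plan.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_ready(url="http://10.192.45.1:2121/usage", timeout_s=60.0, interval_s=2.0,
               fetch=None, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    Returns True once the server answers, False on timeout.
    `fetch`, `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    def default_fetch(target):
        with urllib.request.urlopen(target, timeout=3) as resp:
            return resp.status

    fetch = fetch or default_fetch
    deadline = clock() + timeout_s
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # connection refused or timed out: server is not up yet
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A Makefile target could call such a script and fail the build when it returns False, so that test suites never start against a board that is still booting.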

### RC-4: Serial monitoring resets boards (DISCOVERED)

Python `serial.Serial()` on USB-Serial/JTAG ESP32-S3 boards toggles DTR/RTS during initialization, causing `rst:0x15 (USB_UART_CHIP_RESET)`. The chip is reset even if `dtr=False, rts=False` is set after construction.

**Symptoms**:
- Board boots successfully, services start, gets IP
- Python serial read causes an immediate `ESP-ROM: boot:0x0 (DOWNLOAD)` or `rst:0x15`
- Board appears "dead" after testing, but has actually reset into download mode
- Earlier sessions attributed this to "socket exhaustion" or "WiFi instability"

**Fix**: Never use Python `serial.Serial()` for monitoring. Use `idf.py monitor` (which handles DTR/RTS correctly) or read-only tools. All hardware access must go through Makefile mutex targets.

---

## Fix Steps

### Step 0: Fix LWIP socket exhaustion (DONE)
- [x] Set `CONFIG_LWIP_MAX_SOCKETS=20` via sdkconfig (commit `144b48f`)
- [x] Use default `max_open_sockets` on both HTTP servers (removed override)
- [x] Verified: 50/50 sequential API requests pass on Board C (`/dev/ttyACM2`)

**Files**: `sdkconfig`, `main/captive_portal.c`, `main/tollgate_api.c`

### Step 1: Kill owner auto-grant (DONE)
- [x] Remove `tollgate_core_fw_grant()` from `tollgate_core_client_connected()` (commit `c89ab31`)
- [x] Keep owner tracking for logging

**Files**: `components/tollgate_core/src/tollgate_core.c`

### Step 2: HTTP server robustness (DONE)
- [x] Add `Connection: close` header to port 80 responses (commit `c89ab31`)
- [x] Increase captive portal stack to 16384 (commit `c89ab31`)
- [x] Use ESP-IDF default socket management (commit `144b48f`)

**Files**: `main/captive_portal.c`, `main/tollgate_api.c`

### Step 3: Add API endpoints (DONE)
- [x] `GET /portal-config` on port 2121 returning `{priceSats, mintUrl, ...}` (commit `c89ab31`)
- [x] `GET /grant_access`: manual firewall grant (commit `c89ab31`)
- [x] `GET /reset_authentication`: clear all auth (commit `c89ab31`)
- [x] CORS header on portal-config

**Files**: `main/tollgate_api.c`

### Step 4: Remove NAPT flush from `fw_revoke_all()` (DONE)
- [x] Remove `ip_napt_enable()` toggle that caused 30s hangs (commit `c89ab31`)

**Files**: `components/tollgate_core/src/tollgate_core_firewall.c`

### Step 5: Boot-ready probe (PENDING)
- [ ] Add `arch-wait-ready` Makefile target that polls `:2121/usage`
- [ ] Update `arch-test-full` to call `arch-wait-ready` first
- [ ] Add 2-3 second delays between test requests (burst rate mitigation)

**Files**: `physical-router-test-automation/esp32/Makefile`

### Step 6: Hardware testing (BLOCKED)
- [ ] Flash to working board via Makefile mutex targets
- [ ] Run `make arch-test-full`
- [ ] Document results
- [ ] Blocker: Board A is stuck in download mode (GPIO0 strapping pin) and needs a hardware fix

---

## Burst Rate Limitation

On USB-Serial/JTAG ESP32-S3 boards, back-to-back HTTP requests with no delay can
overwhelm the WiFi AP stack. With 2-3 second delays between requests, the board
handles 50+ sequential requests reliably. Without delays, rapid bursts of 10+
requests can cause the WiFi AP to become unresponsive.

**Mitigation**: E2E tests should include a 2-3 second delay between HTTP requests.
This is a WiFi AP throughput limitation, not a firmware bug.
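
One way to apply this delay consistently is a small pacing helper that every test request passes through. The sketch below is an assumption about how a test harness might do it: the `Pacer` class and its parameters are hypothetical names, and the 2-second default is simply the lower bound of the 2-3 second range above.

```python
import time


class Pacer:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_gap_s=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap_s = min_gap_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait(), None before the first call

    def wait(self):
        """Sleep just long enough that min_gap_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_gap_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A test would create one `Pacer` per board and call `pacer.wait()` immediately before each HTTP request. Only the time still missing from the gap is slept, so slow requests do not add extra delay.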

## Board Status

| Board | Port | MAC | Status |
|-------|------|-----|--------|
| Board A | `/dev/ttyACM0` | `94:a9:90:2e:37:7c` | **BROKEN**: stuck in download mode (`boot:0x0`), GPIO0 strapping pin issue, needs hardware fix |
| Board B | `/dev/ttyACM1` | `fc:01:2c:c5:50:50` | Unknown: newly discovered, needs firmware flash |
| Board C | `/dev/ttyACM2` | `20:6e:f1:98:d7:08` | **WORKING**: all endpoints verified, payment flow tested |

## Key Architecture Decisions

- **Port 80**: Portal HTML + captive detection URIs only. No API, no state mutation.
- **Port 2121**: All API operations (discovery, payment, grant, reset, whoami, usage, wallet, portal-config).
- **Owner tracking**: Kept for logging/display, no longer grants free internet.
- **Connection: close**: Set on ALL port 80 responses as a hint to clients not to reuse the connection.
- **Default httpd settings**: ESP-IDF's built-in session management works correctly. Do not override `max_open_sockets`, `keep_alive_enable`, `linger_timeout`, or timeouts.

## Execution Order

- Steps 0-4 are DONE (commits `c89ab31`, `144b48f`).
- Step 5 (boot-ready probe) is next: code only, no hardware needed.
- Step 6 (validation) requires a working board via Makefile mutex targets.

## Hardware Access Rules

- **ALWAYS** use Makefile mutex targets (`make arch-flash-a`, etc.) for hardware access
- **NEVER** call `esptool.py` directly: it bypasses the mutex and conflicts with other sessions
- **NEVER** use Python `serial.Serial()` for monitoring: it causes DTR/RTS resets on USB-Serial/JTAG
- Multiple opencode sessions may be active, and the mutex prevents board conflicts
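
How the Makefile targets implement their mutex is not described in this plan. Purely as an illustration of the mutual-exclusion idea, the sketch below uses an advisory file lock (`fcntl.flock`, POSIX only); the `board_lock` helper and any lock file path passed to it are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def board_lock(lock_path, blocking=True):
    """Hold an exclusive advisory lock on lock_path for the duration of the block.

    With blocking=False, BlockingIOError is raised if another holder has the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A session that wraps its flash or monitor command in `with board_lock(path):` waits until any other session using the same lock file has finished, which is the behavior the rules above rely on.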