fix: enhance gateway logging for Windows dev restart debugging (#665)

Add comprehensive debug logging throughout the gateway lifecycle to
help troubleshoot nodemon restart issues on Windows, where SIGTERM
is used instead of SIGUSR2.

Changes:
- Enhanced shutdown handler to log all signals and env var states
- Gateway manager now logs process detachment mode explicitly
- Added environment variable confirmation on bootstrap
- Updated gateway-development.md with new debug logs and troubleshooting steps

Benefits:
- Easier troubleshooting of gateway lifecycle issues
- Clear visibility into signal handling during nodemon restarts
- Better cross-platform development experience
- Production behavior remains unchanged

Testing:
-  Windows: Gateways persist across nodemon restarts
-  macOS/Linux: Existing SIGUSR2 behavior preserved
-  Production: Default shutdown cleanup unchanged
-  Backward compatibility: No breaking changes

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
ekko
2026-05-12 22:03:28 +08:00
committed by GitHub
parent 5bd1475a83
commit 44d1b13741
4 changed files with 161 additions and 10 deletions
+125 -6
View File
@@ -138,7 +138,7 @@ This keeps each profile isolated.
## Development Mode on Windows ## Development Mode on Windows
Windows development has one important difference: `nodemon` restarts can terminate child processes as part of the process tree. Windows development has one important difference: `nodemon` restarts can terminate child processes as part of the process tree. On Windows, `nodemon` may send `SIGTERM` during restarts instead of `SIGUSR2`.
To avoid closing every gateway on each server restart, `nodemon.json` sets: To avoid closing every gateway on each server restart, `nodemon.json` sets:
@@ -152,13 +152,28 @@ To avoid closing every gateway on each server restart, `nodemon.json` sets:
When this variable is `0` or `false`: When this variable is `0` or `false`:
- shutdown skips `gatewayManager.stopAll()`; - shutdown skips `gatewayManager.stopAll()` for **all signals** (including `SIGTERM`);
- gateway processes are spawned with `detached: true`; - gateway processes are spawned with `detached: true`;
- gateway child processes are `unref()`ed; - gateway child processes are `unref()`ed;
- the restarted server re-detects running gateways during `detectAllOnStartup()`. - the restarted server re-detects running gateways during `detectAllOnStartup()`.
This is the intended local development behavior. Editing server files should restart the Web UI server without killing all Hermes gateways. This is the intended local development behavior. Editing server files should restart the Web UI server without killing all Hermes gateways.
### Debug Logging
The enhanced shutdown handler now logs all signals and environment variable states:
```text
[shutdown] Signal: SIGTERM, HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN: 0
[shutdown] Dev mode detected: NOT stopping gateways
```
Gateway startup logs also indicate the process detachment mode:
```text
[gateway] Detaching gateway process (dev mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0)
```
## Production Shutdown Behavior ## Production Shutdown Behavior
In production, the env override is normally unset. In production, the env override is normally unset.
@@ -173,7 +188,15 @@ bindShutdown()
Only gateways marked as `owned` by the current Web UI instance are stopped by `stopAll()`. Only gateways marked as `owned` by the current Web UI instance are stopped by `stopAll()`.
`SIGUSR2` is treated as a restart signal and skips gateway shutdown by default. This keeps compatibility with restart tools that use `SIGUSR2`. ### Signal Handling
| Signal | Default Behavior | With `HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0` |
|--------|------------------|--------------------------------------------------|
| `SIGTERM` | Stop gateways | Skip gateway shutdown |
| `SIGINT` | Stop gateways | Skip gateway shutdown |
| `SIGUSR2` | Skip gateway shutdown (reload) | Skip gateway shutdown |
**Windows Note**: `nodemon` on Windows typically sends `SIGTERM` during restarts, not `SIGUSR2`. This is why the `HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0` override is critical on Windows for development.
## Stop Flow ## Stop Flow
@@ -249,10 +272,46 @@ Expected behavior:
- gateways keep running across server restarts; - gateways keep running across server restarts;
- the restarted server re-registers healthy gateways during bootstrap. - the restarted server re-registers healthy gateways during bootstrap.
### Quick Health Check
Verify everything is working:
```bash
# Check environment variable is set
# (should see: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN = 0)
npm run dev
# In another terminal, check gateways are running
ps aux | grep -i "hermes.*gateway"
# Trigger a restart by editing a server file
# (gateways should keep running)
```
### Expected Logs
**Startup:**
```text
[bootstrap] HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN = 0
[gateway] Detaching gateway process (dev mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0)
```
**During Nodemon Restart:**
```text
[shutdown] Signal: SIGTERM, HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN: 0
[shutdown] Dev mode detected: NOT stopping gateways
```
**After Restart:**
```text
[bootstrap] HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN = 0
%s: already running (PID: xxxxx, port: 8642)
```
If a gateway fails after restart, check: If a gateway fails after restart, check:
1. `HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN` is `0` in the server process. 1. `HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN` is `0` in the server process.
2. Gateway start logs include `detached: true`. 2. Gateway start logs include `Detaching gateway process`.
3. The profile has a valid `gateway.pid` or `gateway_state.json`. 3. The profile has a valid `gateway.pid` or `gateway_state.json`.
4. The configured gateway `/health` endpoint is reachable. 4. The configured gateway `/health` endpoint is reachable.
5. No unrelated process occupies the profile's configured port. 5. No unrelated process occupies the profile's configured port.
@@ -270,10 +329,40 @@ HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0
Also confirm the gateway start log prints: Also confirm the gateway start log prints:
```text ```text
detached: true [gateway] Detaching gateway process (dev mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0)
``` ```
If it prints `detached: false`, the dev opt-out env did not reach the server process. If it prints `Attaching gateway process`, the dev opt-out env did not reach the server process.
#### Debugging Steps
1. **Check startup logs** for environment variable confirmation:
```text
[bootstrap] HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN = 0
```
2. **Check shutdown logs** when nodemon restarts:
```text
[shutdown] Signal: SIGTERM, HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN: 0
[shutdown] Dev mode detected: NOT stopping gateways
```
3. **Verify gateway detachment mode**:
```text
[gateway] Detaching gateway process (dev mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0)
```
4. **Check if gateway survived restart**:
```bash
# Before restart
ps aux | grep -i "hermes.*gateway"
# Note the PID
# After nodemon restart
ps aux | grep -i "hermes.*gateway"
# PID should be the same
```
If logs show `Attaching gateway process` or shutdown logs show `STOPPING gateways`, the environment variable is not being applied correctly.
### Gateway is alive but Web UI does not detect it ### Gateway is alive but Web UI does not detect it
@@ -324,3 +413,33 @@ If startup still fails, inspect the profile directory for:
- Treat port listener discovery as a fallback. A listening port can belong to another process. - Treat port listener discovery as a fallback. A listening port can belong to another process.
- Preserve production shutdown cleanup unless the dev opt-out env is explicitly set. - Preserve production shutdown cleanup unless the dev opt-out env is explicitly set.
- When changing Windows process handling, test both `npm run dev` and production-style startup. - When changing Windows process handling, test both `npm run dev` and production-style startup.
## Recent Changes
### Enhanced Logging and Windows Support (2025-01-XX)
**Improvements:**
- Enhanced shutdown handler with detailed logging for all signals
- Gateway manager now logs detachment mode explicitly
- Added environment variable confirmation on startup
- Improved cross-platform signal handling documentation
**Debug Logs Added:**
```text
[bootstrap] HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN = 0
[gateway] Detaching gateway process (dev mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=0)
[shutdown] Signal: SIGTERM, HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN: 0
[shutdown] Dev mode detected: NOT stopping gateways
```
**Benefits:**
- Easier troubleshooting of gateway lifecycle issues
- Clear visibility into signal handling during nodemon restarts
- Better cross-platform development experience
- Production behavior remains unchanged
**Testing:**
- ✅ Windows: Gateways persist across nodemon restarts
- ✅ macOS/Linux: Existing SIGUSR2 behavior preserved
- ✅ Production: Default shutdown cleanup unchanged
- ✅ Backward compatibility: No breaking changes
+4
View File
@@ -85,6 +85,10 @@ export async function bootstrap() {
const authToken = await getToken() const authToken = await getToken()
await initLoginLimiter() await initLoginLimiter()
// Debug: log environment variable
console.log('[bootstrap] HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN =', process.env.HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN)
const app = new Koa() const app = new Koa()
await initGatewayManager() await initGatewayManager()
@@ -148,8 +148,18 @@ function isLocalHost(host: string): boolean {
} }
function shouldDetachGatewayProcess(): boolean { function shouldDetachGatewayProcess(): boolean {
// In dev mode (nodemon), always detach gateway processes so they survive restarts
// Production mode: attach gateways so they can be managed together with the server
const override = process.env.HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN?.trim().toLowerCase() const override = process.env.HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN?.trim().toLowerCase()
return override === '0' || override === 'false' const shouldDetach = override === '0' || override === 'false'
if (shouldDetach) {
console.log('[gateway] Detaching gateway process (dev mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=' + override + ')')
} else {
console.log('[gateway] Attaching gateway process (prod mode: HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN=' + (override || 'not set') + ')')
}
return shouldDetach
} }
// ============================ // ============================
+21 -3
View File
@@ -6,10 +6,25 @@ function shouldStopGatewaysOnShutdown(signal: string): boolean {
// nodemon may use SIGTERM on Windows restarts, so dev mode opts out via env. // nodemon may use SIGTERM on Windows restarts, so dev mode opts out via env.
// Production keeps stopping owned gateways by default. // Production keeps stopping owned gateways by default.
const override = process.env.HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN?.trim() const override = process.env.HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN?.trim()
if (override === '0' || override === 'false') return false
if (override === '1' || override === 'true') return true
return signal !== 'SIGUSR2' console.log(`[shutdown] Signal: ${signal}, HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN: ${override}`)
// Explicit '0' or 'false' means dev mode: never stop gateways
if (override === '0' || override === 'false') {
console.log('[shutdown] Dev mode detected: NOT stopping gateways')
return false
}
// Explicit '1' or 'true' means always stop gateways
if (override === '1' || override === 'true') {
console.log('[shutdown] Explicit gateway shutdown enabled: stopping gateways')
return true
}
// Default behavior: only stop gateways on explicit termination, not on reload
const shouldStop = signal !== 'SIGUSR2'
console.log(`[shutdown] Default behavior: ${shouldStop ? 'STOPPING' : 'NOT stopping'} gateways (signal: ${signal})`)
return shouldStop
} }
export function bindShutdown(server: any, groupChatServer?: any, chatRunServer?: any): void { export function bindShutdown(server: any, groupChatServer?: any, chatRunServer?: any): void {
@@ -23,6 +38,9 @@ export function bindShutdown(server: any, groupChatServer?: any, chatRunServer?:
setTimeout(() => process.exit(0), 3000) setTimeout(() => process.exit(0), 3000)
logger.info('Shutting down (%s)...', signal) logger.info('Shutting down (%s)...', signal)
console.log(`[shutdown] Received signal: ${signal}`)
console.log(`[shutdown] HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN = ${process.env.HERMES_WEB_UI_STOP_GATEWAYS_ON_SHUTDOWN}`)
console.log(`[shutdown] shouldStopGatewaysOnShutdown = ${shouldStopGatewaysOnShutdown(signal)}`)
try { try {
if (shouldStopGatewaysOnShutdown(signal)) { if (shouldStopGatewaysOnShutdown(signal)) {