Deep Review: 20260326-130316-pr-10018

Date	2026-03-26 13:03
Repo	rancher-sandbox/rancher-desktop
Round	2
Author	@jandubois
PR	#10018 — Use dynamic ports for Steve to avoid conflicts with other software
Branch	`steve-dynamic-ports`
Commits	`ba16a032a` Use dynamic ports for Steve to avoid conflicts with other software `b69b292f7` Check Steve HTTP readiness instead of raw TCP connectivity `24e9a1bdd` Steve: Fix healthcheck `fa2cbca72` Steve: address AI review comments
Reviewers	Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
Verdict	Merge as-is — no critical issues; the one important gap (dashboard button after Steve failure) is already tracked as #10060 and acknowledged by the reviewer
Wall-clock time	24 min 30 s

Executive Summary ¶

This PR replaces Steve's hardcoded port 9443 with dynamically allocated ephemeral ports, resolving a long-standing conflict with other software (notably Logitech GHub). The implementation is clean: getAvailablePorts() binds all ports simultaneously to guarantee distinct values, the dashboard proxy is recreated per Steve start via wrapper functions that survive route re-registration, and a new /api/steve-port endpoint lets the dashboard discover the port at runtime. The health check was iteratively improved from raw TCP to an HTTPS probe against /v1/namespaces. All three agents agreed the code is well-structured; the only important finding (dashboard button enabled despite Steve failure) is already filed as #10060.

Structure: 0 critical, 1 important, 7 suggestions. The PR has already been approved by @mook-as.

Critical Issues ¶

None.

Important Issues ¶

1. Steve startup failure still enables the dashboard UI Codex Gemini ¶

If Steve.start() throws, the catch block at line 1288 logs the error but execution continues to mainEvents.emit('k8s-check-state', mgr) at line 1296. The renderer enables the dashboard button solely based on k8sState === K8s.State.STARTED, so the dashboard appears accessible even though Steve is down. git blame attributes the swallowing catch to ba16a032a. Already filed as #10060 and acknowledged by reviewer — not blocking merge.

background.ts

        if (enabledK8s) {
          try {
            const [stevePort] = await getAvailablePorts(1);

            console.log(`Steve HTTPS port: ${ stevePort }`);
            // Set the Steve HTTPS port for certificate checking before setting
            // up Steve itself.
            setSteveCertPort(stevePort);
            await Steve.getInstance().start(stevePort);
            DashboardServer.getInstance().setStevePort(stevePort); // recreate proxy middleware
          } catch (ex) {
            console.error('Failed to start Steve:', ex);
          }
        }
      }

      // Notify the tray and renderer after Steve is ready, so the dashboard
      // button is only enabled when Steve can accept connections.
      mainEvents.emit('k8s-check-state', mgr);
      window.send('k8s-check-state', state);

      if (state === K8s.State.STOPPING) {
        Steve.getInstance().stop();
      }

Fix: either gate k8s-check-state emission on Steve success, or add a separate Steve-ready signal. The deferred approach (let Steve self-select a port and report it) would also solve this.

Suggestions ¶

1. Race condition on fast K8s restart with port mismatch Gemini ¶

If K8s restarts quickly before the old Steve process emits its close event, isRunning is still true when start() is called with a new port. The method silently returns at lines 48-52, and background.ts proceeds to update the proxy to the new (unused) port. In practice, K8s restart cycles take many seconds while Steve exits promptly on SIGINT, making this window negligible.

pkg/rancher-desktop/backend/steve.ts

  public async start(httpsPort: number) {
    const { pid } = this.process || { };

    if (this.isRunning && pid) {
      console.debug(`Steve is already running with pid: ${ pid }`);

      return;
    }

    this.httpsPort = httpsPort;

    const osSpecificName = /^win/i.test(os.platform()) ? 'steve.exe' : 'steve';

Fix: either make Steve.stop() return a promise that resolves on process exit, or guard against mismatched ports:

 if (this.isRunning && pid) {
+  if (this.httpsPort !== httpsPort) {
+    throw new Error(`Steve is shutting down on port ${ this.httpsPort }`);
+  }
   console.debug(`Steve is already running with pid: ${ pid }`);
   return;
 }

2. Test asserts behavior outside getAvailablePorts contract Codex ¶

getAvailablePorts() explicitly documents a TOCTOU window: ports are released before returning (lines 20-22 of networks.ts). Rebinding returned ports in the test depends on no other process claiming them first, which can cause intermittent CI failures on busy hosts.

pkg/rancher-desktop/utils/__tests__/networks.spec.ts

  it('returns usable ports', async() => {
    const ports = await getAvailablePorts(2);

    // Verify both returned ports can actually be bound.
    for (const port of ports) {
      const server = net.createServer();

      await new Promise<void>((resolve, reject) => {
        server.once('error', reject);
        server.listen(port, '127.0.0.1', resolve);
      });

      await new Promise<void>((resolve) => {
        server.close(() => resolve());
      });
    }
  });

Fix: test deterministic properties only:

-  it('returns usable ports', async() => {
+  it('returns distinct positive ports', async() => {
     const ports = await getAvailablePorts(2);
-    for (const port of ports) {
-      // ... rebind attempt
-    }
+    expect(new Set(ports).size).toBe(ports.length);
+    for (const port of ports) {
+      expect(port).toBeGreaterThan(0);
+    }
   });

3. Misleading docstring on setStevePort() Claude ¶

The docstring says "Call this before each Steve start," but background.ts:1287 calls it after Steve.start() succeeds (which is correct — proxies should point to where Steve is listening).

pkg/rancher-desktop/main/dashboardServer/index.ts

  /**
   * Update the Steve HTTPS port and recreate proxies with the new
   * target.  Call this before each Steve start.
   */
  public setStevePort(stevePort: number) {
    this.stevePort = stevePort;
    this.createProxies();
  }

Fix:

-   * target.  Call this before each Steve start.
+   * target.  Call this after each successful Steve start.

4. Dashboard proxy not cleared when Steve stops Gemini ¶

When Steve stops, DashboardServer still holds proxy middleware targeting the old ephemeral port. If another process binds that port, requests would be proxied to the wrong service. In practice, when K8s is STOPPING the dashboard button is disabled, so no user-initiated requests reach the proxy. Defense-in-depth, not a real exposure.

background.ts

      if (state === K8s.State.STOPPING) {
        Steve.getInstance().stop();
      }

Fix: DashboardServer.getInstance().setStevePort(0) after Steve.stop().

5. Process error listener remains attached after spawn Gemini ¶

The .once('error') listener at lines 109-111 is meant to catch spawn failures. If the process spawns successfully, this listener remains attached and would silently swallow a later runtime error by calling reject() on an already-resolved promise. Harmless in practice (Node.js ignores duplicate promise resolutions).

pkg/rancher-desktop/backend/steve.ts

    await new Promise<void>((resolve, reject) => {
      this.process.once('spawn', () => {
        this.isRunning = true;
        console.debug(`Spawned child pid: ${ this.process.pid }`);
        resolve();
      });
      this.process.once('error', (err) => {
        reject(new Error(`Failed to spawn Steve: ${ err.message }`, { cause: err }));
      });
    });

Fix: remove the listener upon successful spawn.

6. Inaccurate timeout message in waitForReady Gemini ¶

The error message calculates 60 * 500 / 1000 = 30 seconds, but each loop iteration also awaits isPortReady() which has its own 1000ms timeout. If every attempt times out, actual elapsed time is up to 90 seconds, not 30.

pkg/rancher-desktop/backend/steve.ts

      await setTimeout(delayMs);
    }

    throw new Error(`Steve did not become ready after ${ maxAttempts * delayMs / 1000 } seconds`);

Fix:

-throw new Error(`Steve did not become ready after ${ maxAttempts * delayMs / 1000 } seconds`);
+throw new Error(`Steve did not become ready after ${ maxAttempts } attempts`);

7. Redundant setCertificateVerifyProc on each health check Claude ¶

Electron.session.fromPartition() returns the same session for a given partition name. The setCertificateVerifyProc call at line 160 overwrites the previous callback on every iteration (up to 60 times in waitForReady). The function is identical each time, so this is harmless but redundant.

pkg/rancher-desktop/backend/steve.ts

    return new Promise((resolve) => {
      const session = Electron.session.fromPartition('steve-healthcheck', { cache: false });

      session.setCertificateVerifyProc((request, callback) => {
        if (request.hostname === '127.0.0.1') {
          // We do not have any more information to narrow down the certificate;
          // given that we're doing this in a private partition, it should be
          // safe to allow all localhost certificates.  In particular, we do not
          // get access to the port number, and all the Steve certificates have
          // generic fields (e.g. subject).
          callback(0);
        } else {
          // Unexpected request; not sure how this could happen in a new session,
          // but we can at least pretend to do the right thing.
          callback(-3); // Use Chromium's default verification.
        }
      });

Fix: set the callback once in the constructor or on first call.

Design Observations ¶

Concerns ¶

1. (future) Port pre-allocation race Codex Gemini ¶

The app pre-allocates Steve's port externally and races Steve to bind it. A cleaner design is to let Steve listen on port 0, report its chosen port back, and have the caller read it. This would eliminate the TOCTOU window, the duplicate port state across background.ts/DashboardServer/main/networking, and the mismatch-on-restart race. Acknowledged by reviewer mook-as: "The correct fix is to have steve auto-select a port, and then have it report the port number. Will punt for now."

Strengths ¶

The getAvailablePorts() design of binding all ports simultaneously before releasing any elegantly prevents the "two calls get the same port" problem. The TupleOf type gives callers compile-time length guarantees. Claude
The wrapper-function pattern in DashboardServer.init() (lines 101-104) supports proxy recreation without re-registering Express routes — the falsy guard handles the window before first setStevePort() gracefully. Claude Codex
The health check evolution (TCP → HTTP → HTTPS /v1/namespaces probe) correctly addresses the real problem: Steve accepts connections before its API controllers finish schema discovery. Claude Codex
The tighter WebSocket upgrade matching (req.url === key || req.url?.startsWith(key + '/')) prevents accidental upgrades for unrelated routes. Gemini

Testing Assessment ¶

The getAvailablePorts utility has solid unit test coverage (count, positivity, usability, distinctness, dynamic count). Untested scenarios by risk:

Steve startup → proxy recreation → UI notification lifecycle — the core flow of the PR. Would require mocking Electron and K8s backend, which may not be practical in unit tests.
/api/steve-port endpoint returning the correct port (or 0 before first setStevePort call). A lightweight Express test could cover this.
Steve failure path in background.ts — the catch block that suppresses errors and allows STARTED state emission.
setSteveCertPort() updating the certificate-error allowlist across port changes.

Documentation Assessment ¶

No documentation gaps. The PR description is thorough, commit messages are descriptive, and code comments explain the "why" for non-obvious decisions (TOCTOU window, certificate handling, proxy recreation pattern).

Commit Structure ¶

Commits 1 (ba16a032a — dynamic ports) and 2 (b69b292f7 — HTTP readiness check) are clean standalone concepts. Commit 3 (24e9a1bdd — fix healthcheck) patches commit 2's approach because HTTP redirects to HTTPS, but it is by a different author addressing reviewer feedback — keeping it separate is reasonable. Commit 4 (fa2cbca72 — "address AI review comments") bundles several small fixes; the message could be more descriptive of the actual changes, but this is minor.

Acknowledged Limitations ¶

TOCTOU race in port allocation — utils/networks.ts:20-22: "The ports are released before returning, so there is a small TOCTOU race before the caller binds them." Reviewer mook-as noted the correct long-term fix is to let Steve auto-select a port. The current approach is pragmatic; the race window is realistically a few hundred milliseconds.
Dashboard button enabled after Steve failure — background.ts:1294-1297: The k8s-check-state emission is outside the Steve try/catch block, so it fires even when Steve throws. Filed as #10060.
Health check certificate verification — steve.ts:160-173: Accepts all localhost certificates in the steve-healthcheck session partition. Reviewer mook-as acknowledged this can be improved in the future.
Orphaned timer in health check — steve.ts:188-195: The 1000ms timer in isPortReady is left running even if the request succeeds early. Reviewer mook-as explained this is safe: req.abort() on a completed request is a documented no-op.

Unresolved Feedback ¶

None. All substantive reviewer comments were either addressed in subsequent commits or explicitly acknowledged as acceptable/deferred. Issue #10060 was filed during review and remains open as an intentional follow-up.

Agent Performance Retro ¶

Claude ¶

Claude Opus 4.6 produced a thorough, well-calibrated review. It correctly identified the #10060 gap as an acknowledged limitation rather than a new finding, and surfaced three unique suggestions (misleading docstring, silent stdio return, redundant cert proc). Its coverage was complete — every file reviewed with clear rationale. The verdict ("Approve with minor suggestions") accurately reflected the PR's quality. No false positives. Notably, Claude was the only agent to thoroughly document all four acknowledged limitations.

Codex ¶

Codex GPT 5.4 was the fastest agent (221s) and produced a focused, concise review. Its two findings — Steve failure enabling the dashboard (Important) and the TOCTOU test flakiness (Suggestion) — were both well-supported with blame verification and concrete fix suggestions. The design observation about letting Steve self-select its port matched the reviewer's own assessment. Coverage was complete. No false positives. The "Changes requested" verdict was arguably too aggressive given that the important finding is already tracked as #10060.

Gemini ¶

Gemini 3.1 Pro raised the most findings (4) and was the only agent to flag the fast-restart race condition and the proxy-to-released-port gap. However, it rated the restart race as Critical, which overstates the practical likelihood — K8s restart cycles are orders of magnitude slower than Steve process exit. The proxy gap was rated Important, but the dashboard button is disabled during STOPPING state, limiting exposure. Two unique suggestions (error listener, timeout message) were valid. Coverage was complete. One severity inflation (Critical → should be Suggestion).

Summary ¶

Metric	Claude Opus 4.6	Codex GPT 5.4	Gemini 3.1 Pro
Duration	10:41	3:41	4:56
Critical	0	0	1 (inflated)
Important	0	1	1
Suggestion	3	1	2
Design observations	5	2	3
False positives	0	0	0
Unique insights	3	1	3
Files reviewed	8/8	8/8	8/8
Coverage misses	0	0	0

All three agents achieved full file coverage with no false positives. Codex was fastest and most efficient (highest signal-to-noise). Claude provided the best-calibrated severity ratings and most comprehensive documentation of acknowledged limitations. Gemini contributed the most unique findings but overrated one severity. The multi-agent approach was valuable: Gemini's unique restart-race finding was a genuine insight (even if over-rated), and no single agent captured all seven consolidated suggestions.