Deep Review: 20260512-213934-pr-4967

Date2026-05-12 21:39
Repolima-vm/lima
Round1
Author@WeishiZ
PR#4967 — hostagent: gate Ready on raw-TCP SSH banner probe
Branchhostagent-ssh-banner-readiness
Commits42655076 hostagent: gate Ready on raw-TCP SSH banner probe
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — single Important (preamble-line handling per RFC 4253 §4.2) plus a handful of small robustness suggestions.
Wall-clock time11 min 43 s


Executive Summary

The PR adds a host-side raw-TCP probe that reads the SSH identification banner before the hostagent fires Ready. The placement (head of essentialRequirements), the rationale (Lima's existing script-based check is masked by the SSH client's internal retries, so external clients see a connect()write() EPIPE window), and the test scaffolding are all sound. One Important issue: the line-reader treats the first line as the SSH identification string, so an SSH server that emits the RFC 4253 §4.2-allowed preamble lines would block the gate for all 200 retries. The rest are minor robustness suggestions: probe doesn't propagate ctx cancellation into the blocking read, hardcoded 127.0.0.1 diverges from the existing requirement's use of a.instSSHAddress, the debugHint mentions a "serial log" that only exists for Darwin guests, and the requirement doc-comment claims mutual exclusion of script/check that the dispatcher does not enforce.

Structure: 0 Critical, 1 Important, 5 Suggestions, 1 design observation.


Critical Issues

None.


Important Issues

I1. Probe rejects RFC 4253 §4.2 preamble lines Codex Gemini
	}
	// RFC4253 §4.2: an SSH server sends an identification string ending in CR
	// LF (or just LF historically) before the version exchange. If the proxied
	// connection inside the guest has no live peer, the host side will close
	// and ReadString returns EOF.
	banner, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return fmt.Errorf("read SSH banner from %s: %w", addr, err)
	}
	if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {
		return fmt.Errorf("unexpected banner from %s: %q", addr, strings.TrimRight(banner, "\r\n"))
	}
	return nil
}

RFC 4253 §4.2 lets the server emit other lines of data before the version string and requires clients to skip them. The probe reads exactly one line and treats it as the identification string, so any preamble line — a TCP-wrapper notice, a proxy warning, a deployment-specific notice some operators wire into sshd — falls through to the unexpected banner branch. The next retry hits the same preamble, and the gate fails for all 200 attempts (~10 minutes) while the existing script-based ssh requirement at line 154 — which goes through Lima's SSH client — would have connected fine.

			ConfigFile:     sshConfig.ConfigFile,
			Persist:        false,
			AdditionalArgs: sshutil.DisableControlMasterOptsFromSSHArgs(sshConfig.AdditionalArgs),
		}
	}
	stdout, stderr, err := ssh.ExecuteScript(a.instSSHAddress, a.sshLocalPort, sshConfig, script, r.description)
	logrus.Debugf("stdout=%q, stderr=%q, err=%v", stdout, stderr, err)
	if err != nil {
		return fmt.Errorf("stdout=%q, stderr=%q: %w", stdout, stderr, err)
	}
	return nil

Lima's stock templates run OpenSSH without preamble, so no in-tree configuration hits this today, but the new gate is strictly less tolerant of compliant servers than the existing one, and the fix is small.

Fix: read lines in a loop, bound the count to cap unbounded input from a wrong service, and only treat an SSH- line as the banner.

-	banner, err := bufio.NewReader(conn).ReadString('\n')
-	if err != nil {
-		return fmt.Errorf("read SSH banner from %s: %w", addr, err)
-	}
-	if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {
-		return fmt.Errorf("unexpected banner from %s: %q", addr, strings.TrimRight(banner, "\r\n"))
-	}
-	return nil
+	reader := bufio.NewReader(conn)
+	for range 50 {
+		line, err := reader.ReadString('\n')
+		if err != nil {
+			return fmt.Errorf("read SSH banner from %s: %w", addr, err)
+		}
+		line = strings.TrimRight(line, "\r\n")
+		if strings.HasPrefix(line, "SSH-2.0-") || strings.HasPrefix(line, "SSH-1.99-") {
+			return nil
+		}
+		if strings.HasPrefix(line, "SSH-") {
+			return fmt.Errorf("unexpected banner from %s: %q", addr, line)
+		}
+	}
+	return fmt.Errorf("too many pre-identification lines from %s", addr)

Add a test that feeds one preamble line followed by SSH-2.0-… and asserts the probe accepts it.


Suggestions

S1. Hardcoded 127.0.0.1 diverges from the existing requirement's address Claude
	// Lima's own SSH client, which has internal retry behavior that can mask a
	// race where the hostagent accepts a TCP connection on 127.0.0.1:sshLocalPort
	// before guest sshd has bound :22. External clients (ssh-keyscan, sftp,
	// rsync, ...) that connect during that window get EPIPE on the first write.
	// This native probe proves the public-facing path is end-to-end usable.
	port := a.sshLocalPort
	req = append(req,
		requirement{
			description: "sshLocalPort serves an SSH banner",
			check: func(ctx context.Context) error {
				return probeSSHBannerOnLocalPort(ctx, port)
			},
			debugHint: `The host agent is listening on 127.0.0.1:<sshLocalPort> but the
proxied connection into the guest is failing. Most commonly this means
guest sshd has not yet bound :22. Check the serial log for sshd startup.
`,
func probeSSHBannerOnLocalPort(ctx context.Context, port int) error {
    addr := fmt.Sprintf("127.0.0.1:%d", port)

The existing script-based ssh requirement at requirements.go:154 dials (a.instSSHAddress, a.sshLocalPort). The new probe ignores a.instSSHAddress and hardcodes 127.0.0.1. Every in-tree driver — qemu, vz, krunkit, wsl2 — returns "127.0.0.1" from SSHAddress(ctx) today, so the two checks target the same address in practice. The divergence matters only for an external driver plugin (pkg/driver/external/client/methods.go:323) that returns a non-loopback host: the existing ssh requirement would honor the plugin's address while the new probe would still dial loopback and either fail forever or — worse — match an unrelated SSH server on the host.

			ConfigFile:     sshConfig.ConfigFile,
			Persist:        false,
			AdditionalArgs: sshutil.DisableControlMasterOptsFromSSHArgs(sshConfig.AdditionalArgs),
		}
	}
	stdout, stderr, err := ssh.ExecuteScript(a.instSSHAddress, a.sshLocalPort, sshConfig, script, r.description)
	logrus.Debugf("stdout=%q, stderr=%q, err=%v", stdout, stderr, err)
	if err != nil {
		return fmt.Errorf("stdout=%q, stderr=%q: %w", stdout, stderr, err)
	}
	return nil

func (d *DriverClient) InspectStatus(_ context.Context, _ *limatype.Instance) string {
	return ""
}

func (d *DriverClient) SSHAddress(ctx context.Context) (string, error) {
	d.logger.Debug("Getting SSH address for the driver instance")

	resp, err := d.DriverSvc.SSHAddress(ctx, &emptypb.Empty{})
	if err != nil {
		d.logger.WithError(err).Error("Failed to get SSH address")

Fix: thread a.instSSHAddress through alongside the port, mirroring the existing requirement.

-	port := a.sshLocalPort
 	req = append(req,
 		requirement{
 			description: "sshLocalPort serves an SSH banner",
 			check: func(ctx context.Context) error {
-				return probeSSHBannerOnLocalPort(ctx, port)
+				return probeSSHBanner(ctx, a.instSSHAddress, a.sshLocalPort)
 			},
-func probeSSHBannerOnLocalPort(ctx context.Context, port int) error {
-	addr := fmt.Sprintf("127.0.0.1:%d", port)
+func probeSSHBanner(ctx context.Context, host string, port int) error {
+	addr := net.JoinHostPort(host, strconv.Itoa(port))

Also rename the function (probeSSHBannerOnLocalPort encodes the loopback assumption) and update the doc comment at requirements.go:338-344, which currently describes "the host agent's TCP forwarder on 127.0.0.1:port."

S2. Probe read does not respect ctx cancellation Claude Gemini
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return fmt.Errorf("dial %s: %w", addr, err)
	}
	defer conn.Close()
	if err := conn.SetReadDeadline(time.Now().Add(probeSSHBannerTimeout)); err != nil {
		return fmt.Errorf("set read deadline on %s: %w", addr, err)
	}
	// RFC4253 §4.2: an SSH server sends an identification string ending in CR
	// LF (or just LF historically) before the version exchange. If the proxied
	// connection inside the guest has no live peer, the host side will close
	// and ReadString returns EOF.
	banner, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return fmt.Errorf("read SSH banner from %s: %w", addr, err)
	}
	if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {
		return fmt.Errorf("unexpected banner from %s: %q", addr, strings.TrimRight(banner, "\r\n"))
	}
	return nil

DialContext honors ctx, but the subsequent ReadString only respects the 5-second read deadline. If the user runs limactl stop while a probe attempt is mid-read, the goroutine still sits in the read for up to 5 seconds. waitForRequirements does check ctx.Done() between attempts (requirements.go:48-53), so this is bounded; it just means the shutdown deadline is "next attempt boundary plus 5 seconds" rather than "immediate."

			}
			if j == retries-1 {
				errs = append(errs, fmt.Errorf("failed to satisfy the %s requirement %d of %d %q: %s: %w", label, i+1, len(requirements), req.description, req.debugHint, err))
				break retryLoop
			}
			select {
			case <-ctx.Done():
				errs = append(errs, fmt.Errorf("failed to satisfy the %s requirement %d of %d %q: %s: %w", label, i+1, len(requirements), req.description, req.debugHint, ctx.Err()))
				return errors.Join(errs...)
			case <-time.After(sleepDuration):
			}
		}
	}
	return errors.Join(errs...)
}

Fix: wire ctx into the read with context.AfterFunc(ctx, func() { conn.Close() }).

 	defer conn.Close()
+	stop := context.AfterFunc(ctx, func() { _ = conn.Close() })
+	defer stop()
 	if err := conn.SetReadDeadline(time.Now().Add(probeSSHBannerTimeout)); err != nil {
S3. TestProbeSSHBannerOnLocalPort_HungWrite deadlocks if DialContext errors Gemini
	// Server accepts but neither writes a banner nor closes, so the probe must
	// time out on the read deadline rather than hang forever.
	var lc net.ListenConfig
	ln, err := lc.Listen(t.Context(), "tcp", "127.0.0.1:0")
	assert.NilError(t, err)
	t.Cleanup(func() { _ = ln.Close() })
	accepted := make(chan net.Conn, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		accepted <- conn
	}()
	_, portStr, err := net.SplitHostPort(ln.Addr().String())
	assert.NilError(t, err)
	port, err := strconv.Atoi(portStr)
	assert.NilError(t, err)

	ctx, cancel := context.WithTimeout(t.Context(), 2*probeSSHBannerTimeout)
	defer cancel()
	start := time.Now()
	err = probeSSHBannerOnLocalPort(ctx, port)
	elapsed := time.Since(start)
	if conn := <-accepted; conn != nil {
		_ = conn.Close()
	}
	assert.Assert(t, err != nil)
	assert.Assert(t, elapsed < 2*probeSSHBannerTimeout, "probe did not honor read deadline: elapsed=%s", elapsed)
}

If DialContext fails (rare on loopback but not impossible — local port exhaustion, sandboxed CI), the probe returns immediately. The Accept goroutine never unblocks because nothing dialed it. The main goroutine then blocks on <-accepted forever, t.Cleanup never fires (the test function has not returned), and the test stalls until the Go test framework's outer timeout. The realistic failure mode is "test hang, hard to root-cause" rather than a silent miss.

Fix: drop the channel and tie the accepted connection to the test context.

-	accepted := make(chan net.Conn, 1)
 	go func() {
-		conn, err := ln.Accept()
+		conn, err := ln.Accept()
 		if err != nil {
 			return
 		}
-		accepted <- conn
+		defer conn.Close()
+		<-t.Context().Done()
 	}()
 	// ...
 	err = probeSSHBannerOnLocalPort(ctx, port)
 	elapsed := time.Since(start)
-	if conn := <-accepted; conn != nil {
-		_ = conn.Close()
-	}
S4. debugHint points at "the serial log" which only exists on Darwin guests Claude
		requirement{
			description: "sshLocalPort serves an SSH banner",
			check: func(ctx context.Context) error {
				return probeSSHBannerOnLocalPort(ctx, port)
			},
			debugHint: `The host agent is listening on 127.0.0.1:<sshLocalPort> but the
proxied connection into the guest is failing. Most commonly this means
guest sshd has not yet bound :22. Check the serial log for sshd startup.
`,
		})
	req = append(req,
		requirement{
			description: "ssh",
			script: `#!/bin/sh

The existing finalRequirements block at requirements.go:307-311 already branches on *a.instConfig.OS == limatype.DARWIN to pick the right log file — cloud-init-output.log in the guest for Linux, serialv.log in the host for Darwin. The new hint hardcodes "serial log" and sends Linux-guest users — the overwhelming majority — looking for a file that does not exist on their setup.

	}
	return req
}

func (a *HostAgent) finalRequirements() []requirement {
	req := make([]requirement, 0)
	logLocation := "/var/log/cloud-init-output.log in the guest"
	if *a.instConfig.OS == limatype.DARWIN {
		logLocation = "serialv.log in the host"
	}
	req = append(req,
		requirement{
			description: "boot scripts must have finished",
			script: fmt.Sprintf(`#!/bin/sh
set -eux

Fix: mirror the existing pattern, building the hint based on *a.instConfig.OS.

S5. requirement doc comment overstates mutual exclusion Claude

func (a *HostAgent) bashAvailable() bool {
	return *a.instConfig.OS != limatype.FREEBSD
}

func (a *HostAgent) waitForRequirement(ctx context.Context, r requirement) error {
	if r.check != nil {
		logrus.Debugf("checking requirement %q", r.description)
		return r.check(ctx)
	}
	logrus.Debugf("executing script %q", r.description)
	script := r.script
	if a.bashAvailable() {
		var err error
		// FIXME: prefixExportParam depends on bash
func (a *HostAgent) waitForRequirement(ctx context.Context, r requirement) error {
    if r.check != nil {
        logrus.Debugf("checking requirement %q", r.description)
        return r.check(ctx)
    }
    logrus.Debugf("executing script %q", r.description)

The dispatcher silently prefers check over script if both are set; the "exactly one" claim from the doc-comment is documentation, not invariant. Either tighten the dispatcher with an explicit panic/error when both are set, or rephrase the comment as "if check is set, script is ignored."

The package-internal callers all set exactly one today, so this is a small future-proofing note, not a current bug.


Design Observations

Strengths


Testing Assessment

The six new unit tests cover the probe's main shapes without needing a VM: valid modern banner, SSH-1.99 legacy banner, accept-then-close (the race shape), wrong banner, no listener, and hung read. Coverage gaps the review identified:


Documentation Assessment


Commit Structure

Single commit with a clear, descriptive subject and a body that lists the four code shape changes. No concerns.


Acknowledged Limitations


Unresolved Feedback

No prior PR review comments.


Agent Performance Retro

Claude

Claude landed two suggestions no other agent raised (the debugHint/serial-log mismatch, the requirement doc-comment overstating mutual exclusion) and tagged the consistency concern between the new probe's hardcoded loopback and the existing ssh requirement's a.instSSHAddress. Claude pitched that consistency concern as a regression on the WSL2 driver — citing inst.SSHLocalPort = 22 and getSSHAddress returning a guest-routable IP — but LimaWslDriver.SSHAddress(ctx) returns "127.0.0.1" directly (line 325-327), and that is the function that hostagent.go:427-432 calls. Every in-tree driver returns "127.0.0.1" from SSHAddress(ctx), so the "regression" framing is a false read of the WSL2 code path. The underlying observation — hardcoding diverges from the existing pattern and is fragile for external driver plugins — survives at suggestion severity.

Codex

Codex's single finding (RFC 4253 §4.2 preamble lines) was the most consequential issue in the review and is the one I'd most want a reviewer to catch on this PR. Otherwise the review was disciplined: no padding, only the finding the author can act on. Codex also ran the test suite locally to verify the change builds and passes, which is the only agent that did.

Gemini

Gemini independently surfaced the same RFC 4253 issue as Codex (confirming the concern is not idiosyncratic to one model) and also caught the HungWrite test's deadlock-on-dial-failure shape, which neither other agent noticed. The deadlock framing was correct but the severity was inflated to "important, correctness" — the failure mode requires DialContext to fail on a freshly-opened loopback listener, which essentially never happens in practice. The realistic impact is "test hangs to outer timeout, hard to debug" rather than "wrong result." Downgraded to suggestion.

Summary

Claude Opus 4.7Codex GPT 5.5Gemini 3.1 Pro
Duration3m 30s5m 03s
Findings5S1I1I 2S
Tool calls21 (Read 10, Grep 9, Bash 2)35 (shell 33, stdin 2)
Design observations311
False positives000
Unique insights201
Files reviewed333
Coverage misses000
Totals5S1I1I 2S
Downgraded1 (I→S)01 (I→S)
Dropped000

Reconciliation: Claude's I1 (WSL2 hardcoded 127.0.0.1, "important regression") was downgraded to suggestion S1 (consistency / external-driver fragility) — the regression claim relied on a misread of LimaWslDriver.SSHAddress(ctx). Gemini's I2 (test deadlock, "important correctness") was downgraded to suggestion S3 — the failure mode requires a loopback-dial to fail, which is essentially never observed in practice. Codex's I1 and Gemini's I1 were merged as the single Important finding I1.

Codex was the highest-signal review for this PR: one finding, the right severity, and the only review that ran the tests. Gemini was a close second, especially for catching the test-side deadlock. Claude raised the most suggestions but spent its Important slot on a finding that did not survive verification.


Review Process Notes

Skill improvements

Repo context updates