Deep Review: 20260512-213934-pr-4967
| Date | 2026-05-12 21:39 |
| Repo | lima-vm/lima |
| Round | 1 |
| Author | @WeishiZ |
| PR | #4967 — hostagent: gate Ready on raw-TCP SSH banner probe |
| Branch | hostagent-ssh-banner-readiness |
| Commits | 42655076 hostagent: gate Ready on raw-TCP SSH banner probe |
| Reviewers | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| Verdict | Merge with fixes — single Important (preamble-line handling per RFC 4253 §4.2) plus a handful of small robustness suggestions. |
| Wall-clock time | 11 min 43 s |
Executive Summary ¶
The PR adds a host-side raw-TCP probe that reads the SSH identification banner before the hostagent fires Ready. The placement (head of essentialRequirements), the rationale (Lima's existing script-based check is masked by the SSH client's internal retries, so external clients see a connect()→write() EPIPE window), and the test scaffolding are all sound. One Important issue: the line-reader treats the first line as the SSH identification string, so an SSH server that emits the RFC 4253 §4.2-allowed preamble lines would block the gate for all 200 retries. The rest are minor robustness suggestions: probe doesn't propagate ctx cancellation into the blocking read, hardcoded 127.0.0.1 diverges from the existing requirement's use of a.instSSHAddress, the debugHint mentions a "serial log" that only exists for Darwin guests, and the requirement doc-comment claims mutual exclusion of script/check that the dispatcher does not enforce.
Structure: 0 Critical, 1 Important, 5 Suggestions, 1 design observation.
Critical Issues ¶
None.
Important Issues ¶
}
// RFC4253 §4.2: an SSH server sends an identification string ending in CR
// LF (or just LF historically) before the version exchange. If the proxied
// connection inside the guest has no live peer, the host side will close
// and ReadString returns EOF.
banner, err := bufio.NewReader(conn).ReadString('\n')
if err != nil {
return fmt.Errorf("read SSH banner from %s: %w", addr, err)
}
if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {
return fmt.Errorf("unexpected banner from %s: %q", addr, strings.TrimRight(banner, "\r\n"))
}
return nil
}
RFC 4253 §4.2 lets the server emit other lines of data before the version string and requires clients to skip them. The probe reads exactly one line and treats it as the identification string, so any preamble line — a TCP-wrapper notice, a proxy warning, a deployment-specific notice some operators wire into sshd — falls through to the unexpected banner branch. The next retry hits the same preamble, and the gate fails for all 200 attempts (~10 minutes) while the existing script-based ssh requirement at line 154 — which goes through Lima's SSH client — would have connected fine.
ConfigFile: sshConfig.ConfigFile,
Persist: false,
AdditionalArgs: sshutil.DisableControlMasterOptsFromSSHArgs(sshConfig.AdditionalArgs),
}
}
stdout, stderr, err := ssh.ExecuteScript(a.instSSHAddress, a.sshLocalPort, sshConfig, script, r.description)
logrus.Debugf("stdout=%q, stderr=%q, err=%v", stdout, stderr, err)
if err != nil {
return fmt.Errorf("stdout=%q, stderr=%q: %w", stdout, stderr, err)
}
return nil
Lima's stock templates run OpenSSH without preamble, so no in-tree configuration hits this today, but the new gate is strictly less tolerant of compliant servers than the existing one, and the fix is small.
Fix: read lines in a loop, bound the count to cap unbounded input from a wrong service, and only treat an SSH- line as the banner.
- banner, err := bufio.NewReader(conn).ReadString('\n')
- if err != nil {
- return fmt.Errorf("read SSH banner from %s: %w", addr, err)
- }
- if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {
- return fmt.Errorf("unexpected banner from %s: %q", addr, strings.TrimRight(banner, "\r\n"))
- }
- return nil
+ reader := bufio.NewReader(conn)
+ for range 50 {
+ line, err := reader.ReadString('\n')
+ if err != nil {
+ return fmt.Errorf("read SSH banner from %s: %w", addr, err)
+ }
+ line = strings.TrimRight(line, "\r\n")
+ if strings.HasPrefix(line, "SSH-2.0-") || strings.HasPrefix(line, "SSH-1.99-") {
+ return nil
+ }
+ if strings.HasPrefix(line, "SSH-") {
+ return fmt.Errorf("unexpected banner from %s: %q", addr, line)
+ }
+ }
+ return fmt.Errorf("too many pre-identification lines from %s", addr)
Add a test that feeds one preamble line followed by SSH-2.0-… and asserts the probe accepts it.
Suggestions ¶
// Lima's own SSH client, which has internal retry behavior that can mask a
// race where the hostagent accepts a TCP connection on 127.0.0.1:sshLocalPort
// before guest sshd has bound :22. External clients (ssh-keyscan, sftp,
// rsync, ...) that connect during that window get EPIPE on the first write.
// This native probe proves the public-facing path is end-to-end usable.
port := a.sshLocalPort
req = append(req,
requirement{
description: "sshLocalPort serves an SSH banner",
check: func(ctx context.Context) error {
return probeSSHBannerOnLocalPort(ctx, port)
},
debugHint: `The host agent is listening on 127.0.0.1:<sshLocalPort> but the
proxied connection into the guest is failing. Most commonly this means
guest sshd has not yet bound :22. Check the serial log for sshd startup.
`,
func probeSSHBannerOnLocalPort(ctx context.Context, port int) error {
addr := fmt.Sprintf("127.0.0.1:%d", port)
The existing script-based ssh requirement at requirements.go:154 dials (a.instSSHAddress, a.sshLocalPort). The new probe ignores a.instSSHAddress and hardcodes 127.0.0.1. Every in-tree driver — qemu, vz, krunkit, wsl2 — returns "127.0.0.1" from SSHAddress(ctx) today, so the two checks target the same address in practice. The divergence matters only for an external driver plugin (pkg/driver/external/client/methods.go:323) that returns a non-loopback host: the existing ssh requirement would honor the plugin's address while the new probe would still dial loopback and either fail forever or — worse — match an unrelated SSH server on the host.
ConfigFile: sshConfig.ConfigFile,
Persist: false,
AdditionalArgs: sshutil.DisableControlMasterOptsFromSSHArgs(sshConfig.AdditionalArgs),
}
}
stdout, stderr, err := ssh.ExecuteScript(a.instSSHAddress, a.sshLocalPort, sshConfig, script, r.description)
logrus.Debugf("stdout=%q, stderr=%q, err=%v", stdout, stderr, err)
if err != nil {
return fmt.Errorf("stdout=%q, stderr=%q: %w", stdout, stderr, err)
}
return nil
func (d *DriverClient) InspectStatus(_ context.Context, _ *limatype.Instance) string {
return ""
}
func (d *DriverClient) SSHAddress(ctx context.Context) (string, error) {
d.logger.Debug("Getting SSH address for the driver instance")
resp, err := d.DriverSvc.SSHAddress(ctx, &emptypb.Empty{})
if err != nil {
d.logger.WithError(err).Error("Failed to get SSH address")
Fix: thread a.instSSHAddress through alongside the port, mirroring the existing requirement.
- port := a.sshLocalPort
req = append(req,
requirement{
description: "sshLocalPort serves an SSH banner",
check: func(ctx context.Context) error {
- return probeSSHBannerOnLocalPort(ctx, port)
+ return probeSSHBanner(ctx, a.instSSHAddress, a.sshLocalPort)
},
-func probeSSHBannerOnLocalPort(ctx context.Context, port int) error {
- addr := fmt.Sprintf("127.0.0.1:%d", port)
+func probeSSHBanner(ctx context.Context, host string, port int) error {
+ addr := net.JoinHostPort(host, strconv.Itoa(port))
Also rename the function (probeSSHBannerOnLocalPort encodes the loopback assumption) and update the doc comment at requirements.go:338-344, which currently describes "the host agent's TCP forwarder on 127.0.0.1:port."
conn, err := d.DialContext(ctx, "tcp", addr)
if err != nil {
return fmt.Errorf("dial %s: %w", addr, err)
}
defer conn.Close()
if err := conn.SetReadDeadline(time.Now().Add(probeSSHBannerTimeout)); err != nil {
return fmt.Errorf("set read deadline on %s: %w", addr, err)
}
// RFC4253 §4.2: an SSH server sends an identification string ending in CR
// LF (or just LF historically) before the version exchange. If the proxied
// connection inside the guest has no live peer, the host side will close
// and ReadString returns EOF.
banner, err := bufio.NewReader(conn).ReadString('\n')
if err != nil {
return fmt.Errorf("read SSH banner from %s: %w", addr, err)
}
if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {
return fmt.Errorf("unexpected banner from %s: %q", addr, strings.TrimRight(banner, "\r\n"))
}
return nil
DialContext honors ctx, but the subsequent ReadString only respects the 5-second read deadline. If the user runs limactl stop while a probe attempt is mid-read, the goroutine still sits in the read for up to 5 seconds. waitForRequirements does check ctx.Done() between attempts (requirements.go:48-53), so this is bounded; it just means the shutdown deadline is "next attempt boundary plus 5 seconds" rather than "immediate."
}
if j == retries-1 {
errs = append(errs, fmt.Errorf("failed to satisfy the %s requirement %d of %d %q: %s: %w", label, i+1, len(requirements), req.description, req.debugHint, err))
break retryLoop
}
select {
case <-ctx.Done():
errs = append(errs, fmt.Errorf("failed to satisfy the %s requirement %d of %d %q: %s: %w", label, i+1, len(requirements), req.description, req.debugHint, ctx.Err()))
return errors.Join(errs...)
case <-time.After(sleepDuration):
}
}
}
return errors.Join(errs...)
}
Fix: wire ctx into the read with context.AfterFunc(ctx, func() { conn.Close() }).
defer conn.Close()
+ stop := context.AfterFunc(ctx, func() { _ = conn.Close() })
+ defer stop()
if err := conn.SetReadDeadline(time.Now().Add(probeSSHBannerTimeout)); err != nil {
// Server accepts but neither writes a banner nor closes, so the probe must
// time out on the read deadline rather than hang forever.
var lc net.ListenConfig
ln, err := lc.Listen(t.Context(), "tcp", "127.0.0.1:0")
assert.NilError(t, err)
t.Cleanup(func() { _ = ln.Close() })
accepted := make(chan net.Conn, 1)
go func() {
conn, err := ln.Accept()
if err != nil {
return
}
accepted <- conn
}()
_, portStr, err := net.SplitHostPort(ln.Addr().String())
assert.NilError(t, err)
port, err := strconv.Atoi(portStr)
assert.NilError(t, err)
ctx, cancel := context.WithTimeout(t.Context(), 2*probeSSHBannerTimeout)
defer cancel()
start := time.Now()
err = probeSSHBannerOnLocalPort(ctx, port)
elapsed := time.Since(start)
if conn := <-accepted; conn != nil {
_ = conn.Close()
}
assert.Assert(t, err != nil)
assert.Assert(t, elapsed < 2*probeSSHBannerTimeout, "probe did not honor read deadline: elapsed=%s", elapsed)
}
If DialContext fails (rare on loopback but not impossible — local port exhaustion, sandboxed CI), the probe returns immediately. The Accept goroutine never unblocks because nothing dialed it. The main goroutine then blocks on <-accepted forever, t.Cleanup never fires (the test function has not returned), and the test stalls until the Go test framework's outer timeout. The realistic failure mode is "test hang, hard to root-cause" rather than a silent miss.
Fix: drop the channel and tie the accepted connection to the test context.
- accepted := make(chan net.Conn, 1)
go func() {
- conn, err := ln.Accept()
+ conn, err := ln.Accept()
if err != nil {
return
}
- accepted <- conn
+ defer conn.Close()
+ <-t.Context().Done()
}()
// ...
err = probeSSHBannerOnLocalPort(ctx, port)
elapsed := time.Since(start)
- if conn := <-accepted; conn != nil {
- _ = conn.Close()
- }
requirement{
description: "sshLocalPort serves an SSH banner",
check: func(ctx context.Context) error {
return probeSSHBannerOnLocalPort(ctx, port)
},
debugHint: `The host agent is listening on 127.0.0.1:<sshLocalPort> but the
proxied connection into the guest is failing. Most commonly this means
guest sshd has not yet bound :22. Check the serial log for sshd startup.
`,
})
req = append(req,
requirement{
description: "ssh",
script: `#!/bin/sh
The existing finalRequirements block at requirements.go:307-311 already branches on *a.instConfig.OS == limatype.DARWIN to pick the right log file — cloud-init-output.log in the guest for Linux, serialv.log in the host for Darwin. The new hint hardcodes "serial log" and sends Linux-guest users — the overwhelming majority — looking for a file that does not exist on their setup.
}
return req
}
func (a *HostAgent) finalRequirements() []requirement {
req := make([]requirement, 0)
logLocation := "/var/log/cloud-init-output.log in the guest"
if *a.instConfig.OS == limatype.DARWIN {
logLocation = "serialv.log in the host"
}
req = append(req,
requirement{
description: "boot scripts must have finished",
script: fmt.Sprintf(`#!/bin/sh
set -eux
Fix: mirror the existing pattern, building the hint based on *a.instConfig.OS.
func (a *HostAgent) bashAvailable() bool {
return *a.instConfig.OS != limatype.FREEBSD
}
func (a *HostAgent) waitForRequirement(ctx context.Context, r requirement) error {
if r.check != nil {
logrus.Debugf("checking requirement %q", r.description)
return r.check(ctx)
}
logrus.Debugf("executing script %q", r.description)
script := r.script
if a.bashAvailable() {
var err error
// FIXME: prefixExportParam depends on bash
func (a *HostAgent) waitForRequirement(ctx context.Context, r requirement) error {
if r.check != nil {
logrus.Debugf("checking requirement %q", r.description)
return r.check(ctx)
}
logrus.Debugf("executing script %q", r.description)
The dispatcher silently prefers check over script if both are set; the "exactly one" claim from the doc-comment is documentation, not invariant. Either tighten the dispatcher with an explicit panic/error when both are set, or rephrase the comment as "if check is set, script is ignored."
The package-internal callers all set exactly one today, so this is a small future-proofing note, not a current bug.
Design Observations ¶
Strengths ¶
- Native probe sidesteps Lima's SSH client retries (in-scope) Codex Gemini Claude. The whole point of the change: the existing
sshrequirement masks the race because the client retries internally. A rawnet.Dial+bufio.NewReader.ReadString('\n')cannot mask the race the PR is fixing. AcceptThenClosetest directly reproduces the race shape (in-scope) Claude. The test simulates the exact failure: TCP forwarder accepts but the proxied guest peer is dead, so the host side closes immediately and the client sees EOF. That is the regression-prevention shape the PR wants.ctxpropagation intowaitForRequirements's sleep is minimal and correct (in-scope) Claude Codex. The newselectonctx.Done()versustime.After(sleepDuration)is the right shape; the three updated call sites inhostagent.goare mechanical and consistent.
Testing Assessment ¶
The six new unit tests cover the probe's main shapes without needing a VM: valid modern banner, SSH-1.99 legacy banner, accept-then-close (the race shape), wrong banner, no listener, and hung read. Coverage gaps the review identified:
- Preamble line acceptance (covers I1): once the line-reader loops, a test that sends one preamble line followed by
SSH-2.0-…locks in the new contract. - Context cancellation during read (covers S2): cancel
ctx~50 ms into a probe against a hung server and assert the probe returns promptly rather than waiting out the 5-second deadline. waitForRequirementsctxpropagation (covers the newselectatrequirements.go:48-53): a unit test with a list of always-failing requirements that cancelsctxduring the 3-second sleep would lock in the new propagation contract.- Dial-failure resilience in
HungWrite(covers S3): once the helper goroutine is fixed to wait ont.Context().Done(), the test no longer depends on the probe dialing successfully.
Documentation Assessment ¶
- The doc-comment on the
requirementstruct overstates mutual exclusion (S5). - The probe's doc-comment at
requirements.go:338-344describes "the host agent's TCP forwarder on 127.0.0.1:port"; if S1 is taken, this prose needs to reflecta.instSSHAddressrather than loopback.
Commit Structure ¶
Single commit with a clear, descriptive subject and a body that lists the four code shape changes. No concerns.
Acknowledged Limitations ¶
- The PR description discloses that the race-after-Ready scenario was not reproduced on the author's hardware (macOS aarch64 +
vz+ Ubuntu 24.04). The empirical table shows the race window is present on every cycle, just closed by later checks in time on this rig. Downstream reports (imbue-ai, pyinfra, ansible) suggest the race surfaces on slower hardware, in--plainmode (which skips the post-sshchecks), or with non-Linux guests whereessentialRequirementsreturns afterssh+ControlMasteronly. The PR's argument that this change is correct regardless of whether the race is reproducible — because it makes the wait explicit instead of relying on coincidence — is sound. - The PR explicitly punts a vsock-based guest-side readiness signal as out of scope. That is the right scope decision for this change.
Unresolved Feedback ¶
No prior PR review comments.
Agent Performance Retro ¶
Claude ¶
Claude landed two suggestions no other agent raised (the debugHint/serial-log mismatch, the requirement doc-comment overstating mutual exclusion) and tagged the consistency concern between the new probe's hardcoded loopback and the existing ssh requirement's a.instSSHAddress. Claude pitched that consistency concern as a regression on the WSL2 driver — citing inst.SSHLocalPort = 22 and getSSHAddress returning a guest-routable IP — but LimaWslDriver.SSHAddress(ctx) returns "127.0.0.1" directly (line 325-327), and that is the function that hostagent.go:427-432 calls. Every in-tree driver returns "127.0.0.1" from SSHAddress(ctx), so the "regression" framing is a false read of the WSL2 code path. The underlying observation — hardcoding diverges from the existing pattern and is fragile for external driver plugins — survives at suggestion severity.
Codex ¶
Codex's single finding (RFC 4253 §4.2 preamble lines) was the most consequential issue in the review and is the one I'd most want a reviewer to catch on this PR. Otherwise the review was disciplined: no padding, only the finding the author can act on. Codex also ran the test suite locally to verify the change builds and passes, which is the only agent that did.
Gemini ¶
Gemini independently surfaced the same RFC 4253 issue as Codex (confirming the concern is not idiosyncratic to one model) and also caught the HungWrite test's deadlock-on-dial-failure shape, which neither other agent noticed. The deadlock framing was correct but the severity was inflated to "important, correctness" — the failure mode requires DialContext to fail on a freshly-opened loopback listener, which essentially never happens in practice. The realistic impact is "test hangs to outer timeout, hard to debug" rather than "wrong result." Downgraded to suggestion.
Summary ¶
| Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro | |
|---|---|---|---|
| Duration | 3m 30s | 5m 03s | — |
| Findings | 5S | 1I | 1I 2S |
| Tool calls | 21 (Read 10, Grep 9, Bash 2) | 35 (shell 33, stdin 2) | — |
| Design observations | 3 | 1 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 | 0 | 1 |
| Files reviewed | 3 | 3 | 3 |
| Coverage misses | 0 | 0 | 0 |
| Totals | 5S | 1I | 1I 2S |
| Downgraded | 1 (I→S) | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 0 |
Reconciliation: Claude's I1 (WSL2 hardcoded 127.0.0.1, "important regression") was downgraded to suggestion S1 (consistency / external-driver fragility) — the regression claim relied on a misread of LimaWslDriver.SSHAddress(ctx). Gemini's I2 (test deadlock, "important correctness") was downgraded to suggestion S3 — the failure mode requires a loopback-dial to fail, which is essentially never observed in practice. Codex's I1 and Gemini's I1 were merged as the single Important finding I1.
Codex was the highest-signal review for this PR: one finding, the right severity, and the only review that ran the tests. Gemini was a close second, especially for catching the test-side deadlock. Claude raised the most suggestions but spent its Important slot on a finding that did not survive verification.
Review Process Notes ¶
Skill improvements ¶
- Verify dynamic-address driver claims by reading every
SSHAddressmethod, not just the dynamic-resolution helper. When an agent claims that a driver feature flag (e.g.,DynamicSSHAddress) routes through a runtime helper to produce a non-loopback address, check the actual implementation of the public method the hostagent calls, not the helper the driver uses internally for other purposes. Helpers like a WSL2getSSHAddressfunction may exist for status reporting while the hostagent-facingSSHAddress(ctx)returns loopback. The review prompt should remind agents: "Before claiming a driver returns a non-loopback address, find and read the exact method the hostagent calls."
Repo context updates ¶
- [repo] Lima driver SSHAddress convention. Every in-tree Lima driver (
qemu,vz,krunkit,wsl2) returns"127.0.0.1"fromDriver.SSHAddress(ctx context.Context) (string, error). TheDynamicSSHAddressfeature flag exists for drivers (like WSL2) whose SSH endpoint might in principle vary at runtime, but in practice the resolution path also returns loopback. Reviewers evaluating code that targets the SSH-forwarded port should treat hardcoded127.0.0.1as functionally equivalent toa.instSSHAddressfor any in-tree driver; the divergence only matters for external driver plugins (pkg/driver/external/client/methods.go).