Deep Review: 20260426-165750-pr-4889
| Date | 2026-04-26 16:57 |
| Repo | lima-vm/lima |
| Round | 1 |
| Author | @mn-ram |
| PR | #4889 — hostagent: close stale GuestAgentClient on reconnect to stop ClientConn leak |
| Commits | 3b4f6cef hostagent: close stale GuestAgentClient on reconnect to stop ClientConn leak |
| Reviewers | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| Verdict | Merge with fixes — the leak fix itself is correct and minimal, but the explicit `a.client = nil` introduces a data race against unprotected reads in `watchGuestAgentEvents`, and closing the previous client strands the dialer captured by long-lived port-forward listeners. |
| Wall-clock time | 11 min 25 s |
Executive Summary ¶
The PR retains the `*grpc.ClientConn` inside `GuestAgentClient`, exposes `Close()`, closes the previous client in `getOrCreateClient` before replacing it, and registers a shutdown cleanup. The leak fix is correct, the new locking is consistent with the existing `clientMu`, and the diff is small and well-commented.
Two important issues survive consolidation:
- The new `a.client = nil` write under lock collides with two pre-existing unprotected reads of `a.client` in `watchGuestAgentEvents`. Pre-PR those reads were a benign race (the field was never nilled between writes); post-PR the field is observable as `nil` between `Close()` and the new assignment, so the second read in `if a.client == nil || !isGuestAgentSocketAccessible(ctx, a.client)` can dereference nil and crash the hostagent.
- `pkg/portfwd/listener.go`'s listener map keys by `host/guest` and skips re-creation, so the dialer closure captured by `DialContextToGRPCTunnel(client)` outlives the client. Closing the stale client now permanently breaks dialing through any listener that was set up before the reconnect, even though pre-PR the gRPC reconnect machinery would have papered over a guest-agent restart on the same socket.
Structure: 0 critical, 2 important, 2 suggestions.
Critical Issues ¶
None.
Important Issues ¶
Closing the stale client strands dialers captured by long-lived port-forward listeners Codex ¶
```go
	}
	// The previous client (if any) is unreachable: close its underlying gRPC
	// ClientConn before replacing it, otherwise the resolver/transport
	// goroutines and the dialed net.Conn (e.g. forwarded ga.sock) leak across
	// every guest agent restart or VM reboot.
	if a.client != nil {
		if err := a.client.Close(); err != nil {
			logrus.WithError(err).Debug("failed to close stale guest agent client")
		}
		a.client = nil
	}
	var err error
	a.client, err = guestagentclient.NewGuestAgentClient(a.createConnection)
	return a.client, err
}
```
`processGuestAgentEvents` builds the dialer once per loop iteration as `dialContext := portfwd.DialContextToGRPCTunnel(client)` and hands it to `a.grpcPortForwarder.OnEvent` (pkg/hostagent/hostagent.go:898-899). `DialContextToGRPCTunnel` captures `client` in a closure (pkg/portfwd/client.go:46-67); each future TCP/UDP accept calls `client.Tunnel(...)` against that captured pointer.
```go
		}
		}
		if useSSHFwd {
			a.portForwarder.OnEvent(ctx, ev)
		} else {
			dialContext := portfwd.DialContextToGRPCTunnel(client)
			a.grpcPortForwarder.OnEvent(ctx, dialContext, ev)
		}
	}
	if err := client.Events(ctx, onEvent); err != nil {
		if status.Code(err) == codes.Canceled {
```
```go
	}()
	proxy.Run()
	logrus.Debugf("udp proxy for guestAddr: %s closed", guestAddr)
}

func DialContextToGRPCTunnel(client *guestagentclient.GuestAgentClient) func(ctx context.Context, network, addr string) (net.Conn, error) {
	// gvisor-tap-vsock's UDPProxy demultiplexes client connections internally based on their source address.
	// It calls this dialer function only when it receives a datagram from a new, unrecognized client.
	// For each new client, we must return a new net.Conn, which in our case is a new gRPC stream.
	// The atomic counter ensures that each stream has a unique ID to distinguish them on the server side.
	var connectionCounter atomic.Uint32
	return func(_ context.Context, network, addr string) (net.Conn, error) {
		// Passed context.Context is used for timeout on initiate connection, not for the lifetime of the connection.
		// We use context.Background() here to avoid unexpected cancellation.
		stream, err := client.Tunnel(context.Background())
		if err != nil {
			return nil, fmt.Errorf("could not open tunnel for addr: %s error:%w", addr, err)
		}
		// Handshake message to start tunnel
		id := fmt.Sprintf("%s-%s-%d", network, addr, connectionCounter.Add(1))
		if err := stream.Send(&api.TunnelMessage{Id: id, Protocol: network, GuestAddr: addr}); err != nil {
			return nil, fmt.Errorf("could not start tunnel for id: %s addr: %s error:%w", id, addr, err)
		}
		rw := &GrpcClientRW{stream: stream, id: id, addr: addr, protocol: network}
		return rw, nil
	}
}

type GrpcClientRW struct {
	id   string
	addr string
```
`ClosableListeners.forwardTCP` keys the listener map by `tcp/host/guest` and early-returns when the same key already exists, so the dialer is not refreshed on a follow-up event:
```go
	_, ok := p.listeners[key]
	if ok {
		p.listenersRW.Unlock()
		return
	}
```
(pkg/portfwd/listener.go:99-103)
```go
}

func (p *ClosableListeners) forwardTCP(ctx context.Context, dialContext func(ctx context.Context, network string, addr string) (net.Conn, error), hostAddress, guestAddress string) {
	key := key("tcp", hostAddress, guestAddress)
	p.listenersRW.Lock()
	_, ok := p.listeners[key]
	if ok {
		p.listenersRW.Unlock()
		return
	}
	tcpLis, err := Listen(ctx, p.listenConfig, hostAddress)
	if err != nil {
		logListenError(err, "tcp", hostAddress)
		p.listenersRW.Unlock()
```
After a guest-agent restart with an unchanged guest port set, the existing host listener keeps its old `dialContext`; new TCP accepts call `Tunnel()` on the now-`Close()`d `ClientConn` and fail immediately with `ErrClientConnClosing`. Pre-PR the same scenario worked because gRPC's automatic reconnect (via the `WithContextDialer` that re-runs `a.createConnection`) re-dialed the forwarded ga.sock on transport failure — the cost was the leak this PR fixes.
Fix: have the dialer resolve the current client on every dial, so existing listeners pick up new clients automatically:
```go
dialContext := func(ctx context.Context, network, addr string) (net.Conn, error) {
	client, err := a.getOrCreateClient(ctx)
	if err != nil {
		return nil, err
	}
	return portfwd.DialContextToGRPCTunnel(client)(ctx, network, addr)
}
a.grpcPortForwarder.OnEvent(ctx, dialContext, ev)
```
The same problem applies more weakly to `startInotify` (pkg/hostagent/inotify.go:34-38), which captures the client once and never re-fetches. That stream is bound to a single `ctx` and dies naturally on guest-agent restart, so it is less urgent, but the re-fetch pattern would close that gap too.
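A minimal sketch of that pattern for `startInotify`, assuming a retry loop around the stream; the loop shape, the one-second backoff, and the `runInotifyStream` helper are illustrative, not existing code:

```go
// Sketch: resolve the current client on every (re)connect attempt instead of
// capturing it once. runInotifyStream is a hypothetical helper wrapping the
// existing send loop; it returns when the stream dies.
for {
	client, err := a.getOrCreateClient(ctx)
	if err != nil {
		logrus.WithError(err).Error("failed to create client for inotify")
	} else if inotifyClient, err := client.Inotify(ctx); err == nil {
		a.runInotifyStream(ctx, inotifyClient, mountWatchCh)
	}
	select {
	case <-ctx.Done():
		return nil
	case <-time.After(time.Second): // retry, e.g. after a guest agent restart
	}
}
```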
Explicit nil write to `a.client` can crash hostagent with nil-pointer panic Codex Gemini ¶
```go
		return errors.Join(errs...)
	})
	go func() {
		if a.instConfig.MountInotify != nil && *a.instConfig.MountInotify {
			if a.client == nil || !isGuestAgentSocketAccessible(ctx, a.client) {
				if a.driver.ForwardGuestAgent() {
					sshAddress, sshPort := a.sshAddressPort()
					_ = forwardSSH(ctx, a.sshConfig, sshAddress, sshPort, localUnix, remoteUnix, verbForward, false)
				}
			}
```
(pkg/hostagent/hostagent.go:758)
```go
		a.client = nil
		return err
	})
	for {
		if a.client == nil || !isGuestAgentSocketAccessible(ctx, a.client) {
			if a.driver.ForwardGuestAgent() {
				sshAddress, sshPort := a.sshAddressPort()
				_ = forwardSSH(ctx, a.sshConfig, sshAddress, sshPort, localUnix, remoteUnix, verbForward, false)
			}
		}
```
Three goroutines call `getOrCreateClient` concurrently: `watchGuestAgentEvents` (hostagent.go:764), `startInotify` (pkg/hostagent/inotify.go:34), and the timesync loop (pkg/hostagent/timesync.go:49). Inside `getOrCreateClient` the new code at lines 843-847 explicitly nils `a.client` between `Close()` and the new `NewGuestAgentClient` assignment — a window of microseconds inside the locked region.
```go
	mountWatchCh := make(chan notify.EventInfo, 128)
	err := a.setupWatchers(mountWatchCh)
	if err != nil {
		return err
	}
	client, err := a.getOrCreateClient(ctx)
	if err != nil {
		logrus.WithError(err).Error("failed to create client for inotify")
	}
	inotifyClient, err := client.Inotify(ctx)
	if err != nil {
```
```go
func (a *HostAgent) syncTimeOnce(ctx context.Context) {
	client, err := a.getOrCreateClient(ctx)
	if err != nil {
		logrus.WithError(err).Debug("Time sync: failed to get client")
		return
	}
}
```
```go
	// The previous client (if any) is unreachable: close its underlying gRPC
	// ClientConn before replacing it, otherwise the resolver/transport
	// goroutines and the dialed net.Conn (e.g. forwarded ga.sock) leak across
	// every guest agent restart or VM reboot.
	if a.client != nil {
		if err := a.client.Close(); err != nil {
			logrus.WithError(err).Debug("failed to close stale guest agent client")
		}
		a.client = nil
	}
	var err error
	a.client, err = guestagentclient.NewGuestAgentClient(a.createConnection)
	return a.client, err
}
```
The watcher loop at lines 731 and 758 reads `a.client` twice in the short-circuit `||` expression without taking `clientMu`. If goroutine A enters that window while goroutine B's loop is mid-expression, the first read can observe the prior pointer (`a.client == nil` returns false) and the second read can observe the in-window nil. `isGuestAgentSocketAccessible(ctx, nil)` calls `client.Info(ctx)` → `c.cli.GetInfo(...)`, which dereferences nil and panics, taking the hostagent down.
Pre-PR this race already existed for line 731 (the inotify goroutine), but `a.client` was never assigned nil between successful constructions, so the second read could only observe a stale-but-non-nil pointer — a benign data race. The new `a.client = nil` in the Close path turns it into a crashable race.
Fix: serialize every read of `a.client` through the same mutex. A small accessor avoids touching the loop body shape:
```go
func (a *HostAgent) getClient() *guestagentclient.GuestAgentClient {
	a.clientMu.RLock()
	defer a.clientMu.RUnlock()
	return a.client
}
```
Then in `watchGuestAgentEvents`:
```diff
-		if a.client == nil || !isGuestAgentSocketAccessible(ctx, a.client) {
+		client := a.getClient()
+		if client == nil || !isGuestAgentSocketAccessible(ctx, client) {
```
Apply the same change at line 731 (inside the inotify-startup goroutine). Note: keeping `a.client = nil` in `getOrCreateClient` is fine once readers go through the accessor — the assignment at line 850 will repopulate it before the lock is released, but the explicit nil makes the failure mode clear if `NewGuestAgentClient` itself returns an error.
Severity calibration: Codex flagged this as Important; Gemini escalated to Critical citing the panic. We keep it at Important because the race window is narrow (Close + re-assign happen back-to-back inside one critical section) and the second read at line 758 happens in the same goroutine that calls `getOrCreateClient`, so only the inotify/timesync interleaving exposes it. The crash is real, but not deterministic.
Suggestions ¶
Shutdown cleanup closes the client before the watcher's context is cancelled, leaking one final ClientConn Gemini ¶
```go
	}
	}()
	// ensure close before ctx is cancelled
	a.cleanUp(a.grpcPortForwarder.Close)
	a.cleanUp(func() error {
		a.clientMu.Lock()
		defer a.clientMu.Unlock()
		if a.client == nil {
			return nil
		}
		err := a.client.Close()
		a.client = nil
		return err
	})
```
```go
	for {
		if a.client == nil || !isGuestAgentSocketAccessible(ctx, a.client) {
			if a.driver.ForwardGuestAgent() {
				sshAddress, sshPort := a.sshAddressPort()
```
`startRoutinesAndWait` (pkg/hostagent/hostagent.go:514-517) calls `a.close()` first and then `cancelHA()`. `close()` walks the cleanup stack in LIFO order, which means this hook closes the client and nils it before the watcher's parent `ctx` is cancelled. If the `watchGuestAgentEvents` for-loop is at the top of an iteration in that microsecond gap, it observes `a.client == nil`, calls `getOrCreateClient`, dials a fresh `ClientConn`, and then `<-ctx.Done()` fires on the next select — the new client never gets closed.
logrus.Infof("Driver stopped due to error: %q", driverErr)
case sig := <-a.signalCh:
logrus.Infof("Received %s, shutting down the host agent", osutil.SignalName(sig))
}
// close the host agent routines before cancelling the context
if closeErr := a.close(); closeErr != nil {
logrus.WithError(closeErr).Warn("an error during shutting down the host agent")
}
cancelHA()
return a.driver.Stop(ctx)
}
func (a *HostAgent) Info(_ context.Context) (*hostagentapi.Info, error) {
info := &hostagentapi.Info{
Practical impact: at most one leaked ClientConn at process exit, which the OS reclaims anyway. Worth a guard for hygiene rather than correctness. The least invasive fix is a `shuttingDown` flag set inside the cleanup under `clientMu` and checked at the top of `getOrCreateClient` (see the sketch below). A simpler alternative is to drop the cleanup entirely and rely on `cancelHA()` to shut everything down — the FDs are released on process exit and the leak this PR fixes is the per-reconnect growth, not the one-conn-at-shutdown case.
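A minimal sketch of that guard, building on the cleanup hook shown above; the `shuttingDown` field is illustrative, not existing code:

```go
// Sketch: mark shutdown inside the existing cleanup so that getOrCreateClient
// refuses to dial a fresh client afterwards. shuttingDown is a hypothetical
// bool field on HostAgent, guarded by clientMu like a.client itself.
a.cleanUp(func() error {
	a.clientMu.Lock()
	defer a.clientMu.Unlock()
	a.shuttingDown = true
	if a.client == nil {
		return nil
	}
	err := a.client.Close()
	a.client = nil
	return err
})

// ...and at the top of getOrCreateClient's locked region:
if a.shuttingDown {
	return nil, errors.New("host agent is shutting down")
}
```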
Severity calibration: Gemini flagged this as Important arguing the PR's clean-shutdown intent is defeated; we downgrade to Suggestion because the at-most-one-leaked-conn-at-exit blast radius does not match Important.
Close() docstring overstates the implementation's guarantee Claude ¶
```go
		cli:  client,
		conn: clientConn,
	}, nil
}

// Close releases the underlying gRPC connection. It is safe to call on a nil
// receiver or a client whose connection has already been closed.
func (c *GuestAgentClient) Close() error {
	if c == nil || c.conn == nil {
		return nil
	}
	return c.conn.Close()
}

func (c *GuestAgentClient) Info(ctx context.Context) (*api.Info, error) {
	return c.cli.GetInfo(ctx, &emptypb.Empty{})
}
```
`c.conn` is set once in the constructor and never cleared, so the "already closed" branch is unreachable; a second `Close()` call dispatches into `grpc.ClientConn.Close()` again. In current grpc-go that happens to return nil after a full shutdown, but nothing here guarantees it. The docstring sets up an invariant the code does not guarantee, and a future caller that takes the comment at its word will be surprised if grpc-go semantics change.
Both current call sites (`getOrCreateClient` and the cleanup) set `a.client = nil` after Close, so the duplicate-close path is unreachable today. The fix is one of:
```diff
 func (c *GuestAgentClient) Close() error {
 	if c == nil || c.conn == nil {
 		return nil
 	}
-	return c.conn.Close()
+	conn := c.conn
+	c.conn = nil
+	return conn.Close()
 }
```
or tighten the docstring to drop the "already-closed" promise.
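If the docstring route is taken instead, a wording that matches what the code actually guarantees could look like this (phrasing illustrative, not the author's):

```go
// Close releases the underlying gRPC connection. It is safe to call on a nil
// receiver or a zero-value client; callers must not call Close twice on the
// same connected client.
```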
Design Observations ¶
Concerns ¶
- Long-lived consumers cache the client and lose the gRPC auto-reconnect benefit (future) — `pkg/hostagent/inotify.go:34-38`, `pkg/portfwd/client.go:46-67` Claude. Same root cause as I1 but broader: any code path that captures a `*GuestAgentClient` and reuses it after a reconnect now sees `ErrClientConnClosing` instead of an auto-redialed conn. The right long-term fix is a small reconnect loop in `startInotify` and a current-client lookup in `DialContextToGRPCTunnel`. Out of scope for this PR.
- `isGuestAgentSocketAccessible` performs a unary gRPC call while holding `clientMu` (future, pre-existing) — `pkg/hostagent/hostagent.go:836` Gemini. If the guest agent is unresponsive without dropping the socket, the lock is held until the caller's context times out, blocking `timesync` and `inotify` callers of `getOrCreateClient`. Predates this PR.
Strengths ¶
- Ownership boundary is correct (in-scope) Claude Codex. `Close()` lives on `GuestAgentClient` rather than leaking `*grpc.ClientConn` out to callers, so the package that allocates the resource is the one that releases it.
- No new locks (in-scope) Claude. Both Close→nil sequences are guarded by the existing `clientMu`, which keeps the change minimal.
- Inline comment captures the *why* (in-scope) Claude. The comment at lines 839-842 explains the leak mechanism, which is the bar — future readers do not need the issue thread.
Testing Assessment ¶
No unit tests added. Goroutine/FD leak counts are notoriously fragile because they depend on grpc-go internals, so the absence is defensible; the manual `pgrep` / `/proc/<pid>/fd` reproducer in the PR description is a reasonable substitute.
Untested scenarios that map to consolidated findings:
- The stranded-dialer path (I1): a BATS or integration test that restarts `lima-guestagent` and then opens a new TCP connection to an existing forward would catch the regression. The agents confirmed a clean `go test -race ./pkg/hostagent/... ./pkg/guestagent/...` run, but the existing suite does not exercise this path.
- The shutdown-window race (I2): exercised by stress-running `getOrCreateClient` from multiple goroutines under `-race`.
- `Close()` idempotency (S2): one-line table test in `pkg/guestagent/api/client/client_test.go` — see the sketch after this list.
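A minimal sketch of that S2 test, assuming it lives in the same package so the zero value is constructible; the test name is illustrative:

```go
package client

import "testing"

// Sketch: Close must be a no-op on a nil receiver and on a client without a
// connection, and (with the S2 fix applied) a second Close must also be a
// no-op rather than dispatching into grpc.ClientConn.Close() again.
func TestGuestAgentClientCloseIdempotent(t *testing.T) {
	var nilClient *GuestAgentClient
	if err := nilClient.Close(); err != nil {
		t.Fatalf("Close on nil receiver: %v", err)
	}
	c := &GuestAgentClient{} // zero value: conn is nil
	if err := c.Close(); err != nil {
		t.Fatalf("Close with nil conn: %v", err)
	}
	if err := c.Close(); err != nil {
		t.Fatalf("second Close: %v", err)
	}
}
```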
Documentation Assessment ¶
The new `Close()` docstring overstates the implementation's guarantee (S2). The inline comment at hostagent.go:839-842 is well-written and captures the leak mechanism, which is the bar. No other documentation issues.
Commit Structure ¶
Single commit with a focused subject and a body that walks through symptom → mechanism → fix. The PR description mirrors the issue and gives a deterministic reproducer. Clean.
Agent Performance Retro ¶
Claude ¶
Caught both documentation issues (S2) and the broader long-lived-consumer problem as a design observation, but missed elevating the listener stranding to an actionable Important-severity finding — flagged it as "future" rather than "regression-now". Calibration was conservative: chose "approve with minor nits" where the consensus shifts to "merge with fixes" once the data-race and listener issues are weighed. Strong on locking analysis (correctly noted that clientMu already serialises the writers).
Codex ¶
Strongest pass overall. Caught the listener-stranding regression (I1) with a complete trace through `OnEvent` → `forwardTCP` → early-return on duplicate key, named the right fix (resolve-client-on-dial), and flagged the data race (I2) with a clean accessor pattern. Severity calibration matched consolidation — both Important findings landed at Important. Ran `go test -race` to confirm the existing suite passes, which Claude and Gemini did not.
Gemini ¶
Found the data race independently and escalated it to Critical with the panic scenario spelled out — useful framing even though we downgraded it to Important. Missed the listener-stranding regression entirely. Reported the same shutdown-window race as Claude but at Important severity, which overstates the at-most-one-leaked-conn-at-exit impact. Did not run `git blame` (consistent with the daily-quota limitation noted in the skill); regression labelling was correct anyway because the diff is small and the touched lines are all new.
Summary ¶
| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 7m 08s | 3m 38s | 5m 49s |
| Findings | 2S | 2I 1S | 1I 1S |
| Tool calls | 26 (Read 12, Grep 8, Bash 6) | 25 (shell 24, stdin 1) | 10 (grepsearch 7, readfile 2, runshellcommand 1) |
| Design observations | 5 | 1 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 1 | 1 | 0 |
| Files reviewed | 2 | 2 | 2 |
| Coverage misses | 0 | 0 | 1 |
| Totals | 2S | 2I 1S | 1I 1S |
| Downgraded | 0 | 0 | 2 (C→I, I→S) |
| Dropped | 0 | 0 | 0 |
Reconciliation ¶
- Gemini P1 data-race: critical → important (I2). The race is real and the panic is reachable, but the window is narrow (a single locked Close+reassign sequence) and only one of the two affected reads is in a different goroutine from the writer, so deterministic crashes are rare.
- Gemini P1 cleanup leak: important → suggestion (S1). Blast radius is one leaked `ClientConn` at process exit, which the OS reclaims — hygiene, not correctness.
Codex provided the most actionable findings; Gemini's independent data-race catch confirmed Codex's I2; Claude's locking analysis was correct on the writers but missed how the `||` short-circuit re-reads `a.client`, and missed elevating the listener-stranding observation.
Review Process Notes ¶
Repo context updates ¶
- Add — long-lived dialer closures in `pkg/portfwd`. Listeners registered through `ClosableListeners.forwardTCP` and `forwardUDP` key by `host/guest` and skip re-creation when the same key is seen again. Any `dialContext` closure passed in via `OnEvent` therefore outlives the client it captured. When reviewing changes that close, replace, or invalidate a `*GuestAgentClient` (or any other dial target captured into a port-forwarder closure), check whether the listener registration path skips re-creation for already-known keys — if so, the closed/replaced target permanently breaks accepts on every existing listener for the unchanged keys. Flag as a regression unless the dialer resolves its target lazily on each accept.
Why this is repo-context-worthy: the listener-stranding regression was visible only by tracing the `OnEvent` → `forwardTCP` → `p.listeners[key]` early-return chain across two packages. Two of three agents missed it; codifying the pattern catches it on the first pass next time.