rx: prevent leak of cache manager NAT ping rx_connections

Both the Unix and Windows cache managers maintain a set of persistent
client rx_connections ("conns") to the known fileservers for each active
AFS user.  These conns are periodically refreshed (destroyed and
possibly rebuilt) approximately NOTOKETIMEOUT (2h) after token
expiration for authenticated (rx_kad) conns, or every NOTOKETIMEOUT (2h)
for anonymous (rx_null) conns.

Both cache managers enable NAT ping for one rx_connection to each known
fileserver.  Thus we see this common idiom when a cache manager destroys
a fileserver connection:

  rx_SetConnSecondsUntilNatPing(conn, 0);
  rx_DestroyConnection(conn);

Doing this for all conns is harmless, even if a given conn doesn't have
NAT ping enabled.

It is important to note that rx_SetConnSecondsUntilNatPing(conn, 0) does
not actually cancel the conn's natKeepAliveEvent (if it has one); it
merely sets conn->secondsUntilNatPing to 0.  If there is a
natKeepAliveEvent, this prevents it from being rescheduled after the
next event.  This was fine in the past because rx_DestroyConnection
eventually did cancel natKeepAliveEvent and destroy the rx_connection
itself.

However, this idiom broke after commit 304d758983 "Standardize
rx_event usage" introduced a number of changes:
 - an extra rx_connection refCount for each outstanding conn event
 - a requirement (enforced by osi_Assert) that all conn events be
   canceled before rx_DestroyConnection can succeed

Therefore, if natKeepAliveEvent is still active when we enter
rx_DestroyConnection, it will return early, due to the extra conn
refCount associated with the NAT ping event.  The rx_connection is not
destroyed in this case.

Eventually, the final scheduled natKeepAliveEvent will fire; the event
will not be rescheduled, and the final conn refCount will be removed via
putConnection and not rx_DestroyConnection (and so the conn is not
destroyed). This rx_connection now has refCount 0, but can never be
destroyed - it has been leaked:
 - The cache manager has "forgotten" this rx_connection and has no means
   to invoke rx_DestroyConnection again.
 - rxi_ReapConnections will not destroy it because it is a client conn
   and is still on the rx_connHashTable, not the rx_connCleanup_list.
 - If there is still a dallying rx_call, any eventual call to
   rxi_CheckCall -> rxi_FreeCall will not destroy the conn because it
   has not been flagged RX_CONN_DESTROY_ME.

Modify rx_SetConnSecondsUntilNatPing to explicitly cancel any
natKeepAliveEvent and remove its refcount if cancelled.  With this
change in place, the cache managers will no longer periodically leak
client rx_connections.

Change-Id: I4e89ebc4bd2c95b6e61b95bd8f91867d451dd34c
Reviewed-on: https://gerrit.openafs.org/14951
Reviewed-by: Cheyenne Wills <cwills@sinenomine.net>
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Michael Meffie <mmeffie@sinenomine.net>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
This commit is contained in:
Mark Vitale 2022-04-14 00:34:29 -04:00 committed by Andrew Deason
parent 983a670b9d
commit c5993b0a4f

View File

@ -6659,7 +6659,10 @@ rx_SetConnSecondsUntilNatPing(struct rx_connection *conn, afs_int32 seconds)
{
MUTEX_ENTER(&conn->conn_data_lock);
conn->secondsUntilNatPing = seconds;
if (seconds != 0) {
if (seconds == 0) {
if (rxevent_Cancel(&conn->natKeepAliveEvent))
putConnection(conn);
} else {
if (!(conn->flags & RX_CONN_ATTACHWAIT))
rxi_ScheduleNatKeepAliveEvent(conn);
else