From db2ddfaf1b322710e1bd4edce6d7519157c3c9eb Mon Sep 17 00:00:00 2001
From: Nickolai Zeldovich
Date: Sat, 10 Nov 2001 18:14:30 +0000
Subject: [PATCH] rx-dont-ackall-a-connection-were-waiting-for-retransmits-on-20011110

"My theory of what happened is roughly as follows:

Process tries to read data from AFS (as part of a page fault); issues a
new Rx call on an Rx connection to the fileserver.

The server transmits some data back to the client, but some packet is
lost.

Something tries to garbage-collect/destroy the connection; since there
is an active call, it can't do so, but issues an rx_AckAll anyway, which
acknowledges all packets transmitted by the server as having been
received.

Server flushes its retransmit queue.

Client waits forever for the lost packet to arrive, but since the server
has already flushed the transmit queue, it cannot possibly retransmit
it.

All this is happening while the client has read-locked its address space
(since the read is part of a page fault). /proc accesses that try to
poke into that process's address space hang waiting for said lock,
causing the lossage we actually observed."
---
 src/rx/rx.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/rx/rx.c b/src/rx/rx.c
index 1ecacf6c0b..1a503e35dd 100644
--- a/src/rx/rx.c
+++ b/src/rx/rx.c
@@ -897,7 +897,12 @@ static void rxi_DestroyConnectionNoLock(conn)
                      * last reply packets */
                     rxevent_Cancel(call->delayedAckEvent, call,
                                    RX_CALL_REFCOUNT_DELAY);
-                    rxi_AckAll((struct rxevent *)0, call, 0);
+                    if (call->state == RX_STATE_PRECALL ||
+                        call->state == RX_STATE_ACTIVE) {
+                        rxi_SendDelayedAck(call->delayedAckEvent, call, 0);
+                    } else {
+                        rxi_AckAll((struct rxevent *)0, call, 0);
+                    }
                 }
                 MUTEX_EXIT(&call->lock);
             }