> I think I'm getting a clue what might be happening here.
> Could someone in the know please save my weekend and comment on this?
> If tcp_timer_rexmt() gets called during PMTU discovery, TCPT_REXMT will have
> fired, but will not be disarmed. This means that tcp_output() will never ever
> arm the retransmit timer again because it thinks it's armed already.
It seems like a bug to me.
> Now, if we have a SACK hole, retransmit the hole, lose the segment again
> (or the ACK), tcp_output will exit at just_return without sending anything.
> There's even a comment above that this is possible in SACK, so there's code
> to re-arm the retransmit timer without having sent a segment, only this will
> not happen because the timer is already considered armed (i.e.
> TCP_TIMER_ISARMED(tp, TCPT_REXMT) will be true), but it will not fire.
> So we hang.
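To make the suspected sequence concrete, here is a toy model in plain C. It is not kernel code: struct model_timer and the model_* functions are made up for illustration, and the assumption that the "armed" state survives the timer firing is exactly the hypothesis under discussion, not something the model confirms. The handler here never re-arms on its own, which stands in for the early return taken on the PMTU discovery path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one TCP timer; not the real NetBSD structure. */
struct model_timer {
    bool armed;   /* what a TCP_TIMER_ISARMED-style check would report */
    int  ticks;   /* ticks until expiry; 0 means nothing is pending */
};

/* Arm only if the timer is not already considered armed, mirroring the
 * check that the output path is described as making. */
static bool model_arm_if_idle(struct model_timer *t, int ticks)
{
    if (t->armed)
        return false;
    t->armed = true;
    t->ticks = ticks;
    return true;
}

/* Advance one tick. Returns true when the timer fires. With
 * disarm_on_fire == false the armed flag is left set after firing,
 * which is the suspected behaviour. */
static bool model_tick(struct model_timer *t, bool disarm_on_fire)
{
    if (t->ticks == 0)
        return false;          /* nothing pending */
    if (--t->ticks > 0)
        return false;          /* still counting down */
    if (disarm_on_fire)
        t->armed = false;
    return true;
}

/* Count how often the timer fires in nticks when the output path has
 * nothing to send and only tries to keep the timer running. */
static int model_fires(bool disarm_on_fire, int nticks)
{
    struct model_timer t = { false, 0 };
    int fires = 0;

    model_arm_if_idle(&t, 2);
    for (int i = 0; i < nticks; i++) {
        if (model_tick(&t, disarm_on_fire))
            fires++;
        /* The "just_return" case: no segment sent, try to re-arm. */
        model_arm_if_idle(&t, 2);
    }
    return fires;
}
```

In this model, model_fires(false, 10) sees the timer fire once and then never again, because every later re-arm attempt is refused. model_fires(true, 10) keeps firing every second tick. If the real code behaves like the first case, that would match the hang described above.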
> The problem is I can't get the NFS connection to hang right now and although
> I did print out the entire PCB with gdb, I didn't notice there was a SACK
> hole so I don't have that hole's retransmission state. So I can't be entirely
> sure this analysis is correct until I manage to make the connection hang again
> and print out the missing information.
> Second problem is I can't just go ahead and install a patched kernel on that
> file server unless I'm really confident (i.e. someone in the know concurs)
> I found the problem, because, well, that's the file server.
> I must admit I don't fully understand how the code is supposed to behave if
> it's still in SACK recovery (but hasn't timed out on the lost segment) and
> new data comes in to be sent.
> So, please, could someone comment whether this sounds plausible or whether
> I'm missing the point?