TCP_SACK patch broke TCP retransmit

BatteryKing
Posts: 1
Joined: 2019/07/13 12:43:01


Post by BatteryKing » 2019/07/13 13:09:02

Scope: As far as I can tell, this problem affects CentOS 6 and possibly CentOS 7.

For the outfit I work for, I need to transfer files over long distances, and I have an automated download set up to run every night. On the far side, round-trip latency is around 100 ms and transfer rates can exceed 10 MB/s, meaning 1 MB or more of TCP segments can be in flight at once. Everything worked fine until the CentOS 6 kernel with the TCP_SACK patch was installed on the remote server. Once that happened, connections dropped out like mad. We spent a long time thinking it was some sort of network error, but everything checked out.
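To make the "1 MB in flight" figure concrete, here is a rough sketch of the bandwidth-delay product arithmetic behind it. The 1460-byte segment size is an assumption (a typical Ethernet MSS), not something measured on this link, and the function names are just for illustration.

```python
import math

def bytes_in_flight(rate_bytes_per_s, rtt_ms):
    """Bandwidth-delay product: bytes that must be unacknowledged to sustain the rate."""
    return rate_bytes_per_s * rtt_ms / 1000

def segments_in_flight(bdp_bytes, mss_bytes=1460):
    """Number of segments needed to carry bdp_bytes (the MSS value is an assumption)."""
    return math.ceil(bdp_bytes / mss_bytes)

MB = 1_000_000
bdp = bytes_in_flight(10 * MB, 100)   # 10 MB/s at 100 ms round trip
print(f"{bdp / MB:.1f} MB in flight, about {segments_in_flight(bdp)} segments")
```

So at these rates the sender has to keep roughly 1 MB of unacknowledged data around to be able to retransmit any of it.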

I found the problem by analyzing the traffic with Wireshark on my side. The transfer would start, pick up speed, and then a packet would get lost in transit. After the loss, my side would send over 100 duplicate ACKs as the packets already in flight arrived (this is what TCP does when it receives packets out of order, since it is an in-order protocol), and then nothing.
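For anyone who wants to check their own capture for the same pattern, a minimal sketch follows. It assumes the ACK numbers sent by the receiver have already been exported from the capture into a list, in order; the sample data is made up. It also simplifies what counts as a duplicate ACK (it only compares ACK numbers, not payload length or window).

```python
from itertools import groupby

def longest_duplicate_ack_run(ack_numbers):
    """Return (ack_number, duplicate_count) for the longest run of repeated ACKs.

    A run of n identical consecutive ACK numbers counts as n - 1 duplicates.
    """
    best_ack, best_dups = None, 0
    for ack, group in groupby(ack_numbers):
        dups = sum(1 for _ in group) - 1
        if dups > best_dups:
            best_ack, best_dups = ack, dups
    return best_ack, best_dups

# Made-up example: normal progress, then the same ACK repeated after a loss.
acks = [1000, 2000, 3000] + [4000] * 101
print(longest_duplicate_ack_run(acks))
```

A long run ending with no further progress in the ACK number matches what I described: the receiver keeps asking for the missing segment and the sender never supplies it.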

How we 'fixed' the problem was to revert to the unpatched kernel. This is not a good place to be, as we are now running with a known vulnerability, but we cannot run the patched kernel because it has a serious regression.

My speculation runs along these lines. The default maximum TCP retransmit window size in Linux looks to be too small (about 200 KB) for this kind of long-distance, higher-speed transfer. People don't usually tweak kernel parameters through sysctl, least of all the ones around the maximum TCP retransmit window size. And the check for ACKs before purging the oldest entries from the retransmit buffer appears to have been effectively removed, so when a packet does get lost, there is nothing left in the buffer to retransmit. In other words, TCP becomes as unreliable as UDP, except that TCP programs are not designed for this, so the transmission just stops. I am tentatively calling this the TCP window derail regression until proven otherwise.
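A quick way to sanity-check the buffer-size part of this guess is to compare a buffer limit against the bytes in flight. The sketch below assumes the relevant limit is the maximum in /proc/sys/net/ipv4/tcp_wmem (three numbers: min, default, max); whether that is actually the parameter involved in this regression is an assumption on my part, and the helper names are made up for illustration.

```python
from pathlib import Path

def parse_tcp_mem(text):
    """Parse the 'min default max' triple used by tcp_wmem and tcp_rmem."""
    min_b, default_b, max_b = (int(x) for x in text.split())
    return min_b, default_b, max_b

def buffer_covers_bdp(max_buffer_bytes, rate_bytes_per_s, rtt_ms):
    """True if the buffer can hold everything in flight at this rate and RTT."""
    bdp = rate_bytes_per_s * rtt_ms / 1000
    return max_buffer_bytes >= bdp

# The 200 KB figure from the post against 10 MB/s at 100 ms.
print(buffer_covers_bdp(200_000, 10_000_000, 100))

if __name__ == "__main__":
    path = Path("/proc/sys/net/ipv4/tcp_wmem")  # only exists on Linux
    if path.exists():
        _, _, max_b = parse_tcp_mem(path.read_text())
        print(f"tcp_wmem max = {max_b} bytes, covers BDP:",
              buffer_covers_bdp(max_b, 10_000_000, 100))
```

If the 200 KB figure is right, the buffer is about a fifth of what this link needs, which would fit the symptoms but does not by itself prove where the regression is.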

I am particularly focused on the TCP_SACK patch because the problems line up very closely with it, and it is the only major change to the core network code that I am aware of. TCP SACK exists to optimize recovery from packet loss by letting the sender know which out-of-order packets were received, and now transmission, at least on long-distance links at high speeds, is completely broken.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: TCP_SACK patch broke TCP retransmit

Post by TrevorH » 2019/07/13 14:31:30

You need to report this on bugzilla.redhat.com. CentOS is a rebuild of RHEL and the only changes made are to remove branding and logos. If it's a bug in RHEL and not present in CentOS then it would be a bug in CentOS that it worked! We aim to be bug for bug compatible with RHEL.
