Network backup brings network term sessions to a crawl
Posted: 2019/02/14 18:14:14
Two PowerEdge servers, both running CentOS release 6.10 (Final).
Call them A (T320) and B (T620), where a large filesystem on a big virtual disk is being backed up
from B to A.
The problem is that the sending machine (B) keeps hanging, network-wise, for any other ssh connection. Sometimes it goes 10 or 20 seconds without responding. Without the renice it was worse: PuTTY sessions would hang for 30 s or longer and then drop completely. That still happens occasionally with the renice, but much less frequently. I tried renice 19 but it didn't improve the situation. There is one user job running, which top shows at priority 30, but it is 99.999% compute bound, other than writing a ~50 character status message to stderr every 5 s. That log file is written to a different virtual disk/file system. There is plenty of unused memory and CPUs. There are no related error or warning messages in /var/log/messages or at the end of dmesg. Here is a snapshot of top:
This is the sending port on B from ifconfig, which doesn't show anything (to my eye) that indicates any sort of network errors:
So what is causing these network glitches? The nice/renice should have dropped tar, nc, and the user job to a low enough CPU priority that this should not be happening. But unless those stall for some reason (and there is no reason they should), the network IO is still pedal to the metal. Perhaps some part of the network transmission takes place at an elevated priority in or near the kernel? I'm pretty sure the pauses are because nc is saturating that network port, not leaving enough residual bandwidth for other processes to communicate smoothly.
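One way to check the saturation guess rather than assume it is to sample the interface's transmit byte counter twice and compare the resulting rate with the link speed (ethtool reports the negotiated speed; as an example, a 1 Gb/s link tops out near 125,000,000 bytes/s). This is only a sketch: the function name tx_rate and its third argument are made up for illustration, and it assumes the kernel exposes the counter at /sys/class/net/IFACE/statistics/tx_bytes.

```shell
#!/bin/bash
#tx_rate IFACE [SECONDS] [SYSFS_DIR]: print average transmit rate in bytes/s.
#Hypothetical helper; SECONDS must be a whole number (integer arithmetic).
#SYSFS_DIR defaults to /sys/class/net and is only overridable for testing.
tx_rate() {
    local iface=$1 interval=${2:-1} dir=${3:-/sys/class/net}
    local counter="$dir/$iface/statistics/tx_bytes" before after
    before=$(cat "$counter") || return 1
    sleep "$interval"
    after=$(cat "$counter") || return 1
    #bytes sent during the interval, averaged per second
    echo $(( (after - before) / interval ))
}

#Possible use on B while the backup is running:
#  tx_rate em1 5
```

If the number sits close to the link's maximum for the whole run, that would support the idea that nc is leaving no headroom for the ssh sessions.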
Although it is too late for this big job, is there anything one can do with nc (or a similar program) to reduce the fraction of the total network bandwidth it uses? iperf has a --bandwidth option, but I'm not aware of anything like that for netcat.
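For what it's worth, one possible approach is to put a rate limiter in the pipe between tar and nc instead of looking for an option in nc itself. If pv is installed, its -L (--rate-limit) option does this. The sketch below is a crude stand-in that needs only bash and GNU dd; the function name throttle is made up here, it has not been tried on these machines, and it caps the stream at roughly the given number of KiB per second by copying one second's worth of data and then sleeping.

```shell
#!/bin/bash
#throttle KIB_PER_SEC: copy stdin to stdout, at most KIB_PER_SEC KiB per second.
#Hypothetical helper; assumes GNU dd (iflag=fullblock is a GNU extension).
throttle() {
    local kib_per_sec=$1 stats
    while :; do
        #dd copies up to one second's worth of data to the real stdout (fd 3);
        #its transfer statistics on stderr are captured in $stats.
        { stats=$(LC_ALL=C dd bs=1024 count="$kib_per_sec" iflag=fullblock 2>&1 1>&3); } 3>&1 \
            || return 1
        #"0+0 records in" means dd reached end of input without reading anything
        case $stats in
            "0+0 records in"*) break ;;
        esac
        sleep 1
    done
}

#Possible use in copybootroot.sh (51200 KiB/s = 50 MiB/s is an arbitrary example):
#  tar --one-file-system -cf - /boot / 2>tar_messages.log \
#  | throttle 51200 | nc "$A_FULL_NAME" 3456
```

Because the sleep is added on top of the time spent copying, the actual rate ends up somewhat below the cap, which is the safe direction for this purpose.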
Code: backup commands run on A and B
#A as root, in an ssh session
#modify /etc/sysconfig/iptables
# to open 3456 in the firewall, service iptables restart
#A as mathog, in an ssh session
cat >receive.sh <<'EOD'
#!/bin/bash
nc -d -l 3456 >b_fs.tar 2>problems.log
EOD
chmod 755 receive.sh
nohup ./receive.sh &
#nothing else going on, so leave default priorities
#B as root in ssh session
cd /tmp
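cat >copybootroot.sh <<'EOD'
#!/bin/bash
#A_FULL_NAME stands for host A's full name; set it here or export it first
NOW=$(date); echo "$NOW starting tar of /boot and /"
tar --one-file-system -cf - /boot / 2>tar_messages.log \
| nc "$A_FULL_NAME" 3456
STATUS=${PIPESTATUS[0]} #tar's exit status, saved before date overwrites $?
NOW=$(date); echo "$NOW completed tar with exit status $STATUS, bye"
EOD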
chmod 755 copybootroot.sh
nohup ./copybootroot.sh >copybootroot.log 2>&1 &
renice 9 6744 6743 6741
Code: snapshot of top on B
Tasks: 1013 total, 2 running, 1011 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 29.6%sy, 14.8%ni, 55.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529395396k total, 524698108k used, 4697288k free, 5070348k buffers
Swap: 4194300k total, 76356k used, 4117944k free, 494701888k cached
  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
35127 mathog    30  10 20.8g  18g 1612 R 2120.5  3.7  29912:00 GraphFromFasta
 6744 root      29   9 18324  944  784 S    9.9  0.0 115:13.68 nc
 6743 root      29   9  119m 5936 1068 S    6.6  0.0  90:24.13 tar
35323 mathog    20   0 15692 2012  944 R    1.0  0.0   0:00.19 top
  466 root      20   0     0    0    0 S    0.3  0.0   0:20.10 kblockd/24
 4712 postgres  20   0  211m  872  772 S    0.3  0.0  15:41.28 postmaster
    1 root      20   0 19356 1368 1152 S    0.0  0.0   0:05.95 init
    2 root      20   0     0    0    0 S    0.0  0.0   0:00.03 kthreadd
Code: ifconfig output for the sending port (em1) on B
em1 Link encap:Ethernet HWaddr F0:1F:AF:EB:68:8E
inet addr:XXX.XXX.XXX.XXX Bcast:XXX.XXX.XXX.255 Mask:255.255.255.0
inet6 addr: fe80::f21f:afff:feeb:688e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:755918139 errors:0 dropped:0 overruns:0 frame:0
TX packets:4063366970 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:89853274892 (83.6 GiB) TX bytes:6077614824147 (5.5 TiB)
Memory:ddd00000-dddfffff