
Network backup brings network term sessions to a crawl

Posted: 2019/02/14 18:14:14
by mathog
Two PowerEdge servers, both running CentOS release 6.10 (Final).
Call them A (a T320) and B (a T620); a large filesystem on a big virtual disk is being backed up
from B to A.

Code: Select all

#A as root, in an ssh session
#modify /etc/sysconfig/iptables to open port 3456 in the firewall,
# then: service iptables restart
#A as mathog, in an ssh session
cat >receive.sh <<'EOD'
#!/bin/bash
# -d: do not attempt to read from stdin; -l 3456: listen on TCP port 3456
nc -d -l 3456 >b_fs.tar 2>problems.log
EOD
chmod 755 receive.sh
nohup ./receive.sh &
#nothing else going on, so leave default priorities
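For completeness, the firewall change on A amounts to a rule along these lines in /etc/sysconfig/iptables (a reconstruction, not the exact line from A's file), and once receive.sh is running the listener can be confirmed with netstat:

Code: Select all

# rule added above the final REJECT entries (exact placement assumed):
-A INPUT -m state --state NEW -m tcp -p tcp --dport 3456 -j ACCEPT

# after 'service iptables restart', confirm nc is listening:
netstat -tlnp | grep 3456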

#B as root in ssh session
cd /tmp
cat >copybootroot.sh <<EOD
#!/bin/bash
# Backticks and dollar signs are escaped below so they expand when the
# script runs, not when this here-document is written; \$A_FULL_NAME is
# deliberately left unescaped so A's hostname (set in this shell) gets
# baked into the script.
NOW=\`date\`; echo "\$NOW starting tar of /boot and /"
tar --one-file-system -cf - /boot / 2>tar_messages.log \
  | nc $A_FULL_NAME 3456
# \$? below is the exit status of the last pipeline element (nc), not tar
NOW=\`date\`; echo "\$NOW completed tar with exit status \$?, bye"
EOD
chmod 755 copybootroot.sh
nohup ./copybootroot.sh >copybootroot.log 2>&1 &
# drop the backup pipeline's priority; from top: 6744=nc, 6743=tar,
# and 6741 is (presumably) the nohup'd shell running copybootroot.sh
renice 9 6744 6743 6741
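One way to confirm the renice took effect is to ask ps for the nice values directly (same PIDs as above):

Code: Select all

# NI should show 9 for all three after the renice
ps -o pid,ni,pri,comm -p 6741,6743,6744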
The problem is that the sending machine (B) keeps hanging up, network-wise, for any other ssh connection; sometimes it goes 10 or 20 seconds without responding. Without the renice it was worse: PuTTY sessions would hang for 30s or longer and then drop completely. That still happens occasionally even with the renice, but much less frequently. I tried renice 19 as well, but it didn't improve the situation. There is one user job running, which top shows at priority 30, but it is 99.999% compute-bound, other than writing an ~50 character status message to stderr every 5s, and that log file is written to a different virtual disk/filesystem. There is plenty of unused memory and there are idle CPUs. There are no related error or warning messages in /var/log/messages or at the end of dmesg. Here is a snapshot of top:

Code: Select all

Tasks: 1013 total,   2 running, 1011 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 29.6%sy, 14.8%ni, 55.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  529395396k total, 524698108k used,  4697288k free,  5070348k buffers
Swap:  4194300k total,    76356k used,  4117944k free, 494701888k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
35127 mathog    30  10 20.8g  18g 1612 R 2120.5  3.7  29912:00 GraphFromFasta
 6744 root      29   9 18324  944  784 S  9.9  0.0 115:13.68 nc
 6743 root      29   9  119m 5936 1068 S  6.6  0.0  90:24.13 tar
35323 mathog    20   0 15692 2012  944 R  1.0  0.0   0:00.19 top
  466 root      20   0     0    0    0 S  0.3  0.0   0:20.10 kblockd/24
 4712 postgres  20   0  211m  872  772 S  0.3  0.0  15:41.28 postmaster
    1 root      20   0 19356 1368 1152 S  0.0  0.0   0:05.95 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.03 kthreadd
This is the sending interface on B from ifconfig; nothing in it (to my eye) indicates any sort of network error:

Code: Select all

em1       Link encap:Ethernet  HWaddr F0:1F:AF:EB:68:8E
          inet addr:XXX.XXX.XXX.XXX  Bcast:XXX.XXX.XXX.255  Mask:255.255.255.0
          inet6 addr: fe80::f21f:afff:feeb:688e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:755918139 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4063366970 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:89853274892 (83.6 GiB)  TX bytes:6077614824147 (5.5 TiB)
          Memory:ddd00000-dddfffff
So what is causing these network glitches? The renice seems to have dropped tar, nc, and the user job to a low enough priority that this should not be happening. But unless those processes stall for some reason (and there is no reason they should), the network I/O is still pedal to the metal. Perhaps some part of the network transmission takes place at an elevated priority in, or near, the kernel? I'm pretty sure the pauses happen because nc is saturating that network port, not leaving enough residual bandwidth for other processes to communicate smoothly.
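One way to test the saturation theory would be to watch the interface counters while the transfer runs; on CentOS 6 the sysstat package provides sar for this (em1 per the ifconfig output above):

Code: Select all

# per-interface throughput, sampled every 5 seconds
sar -n DEV 5
# or read the raw transmit counter directly and diff it by hand:
cat /sys/class/net/em1/statistics/tx_bytes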

Although it is too late for this big job, is there anything one can do with nc (or a similar program) to limit the fraction of the total network bandwidth it uses? iperf has a --bandwidth option, but I'm not aware of anything similar for netcat.
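The closest thing I have come up with is to insert a rate limiter between tar and nc, since nc itself has no throttle: pv (not in the base install, but packaged in EPEL) takes a -L bytes-per-second limit. A sketch, with the 80m figure picked arbitrarily:

Code: Select all

# cap the pipeline at ~80 MB/s (pv -L accepts k/m/g suffixes);
# pick a value comfortably below the link's capacity
tar --one-file-system -cf - /boot / 2>tar_messages.log \
  | pv -q -L 80m \
  | nc $A_FULL_NAME 3456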

Re: Network backup brings network term sessions to a crawl

Posted: 2019/02/14 22:38:16
by tunk
Looks like both have dual NICs. If you have one free on both you could
connect a cable between them and setup both ports (e.g. 192.168.0.x+y).
Then try to run the backup through this local network.
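On CentOS 6 the spare port can be given a static address with an ifcfg file; a minimal sketch, assuming the spare port shows up as em2:

Code: Select all

# /etc/sysconfig/network-scripts/ifcfg-em2
# (use IPADDR=192.168.0.2 on the other machine)
DEVICE=em2
BOOTPROTO=static
IPADDR=192.168.0.1
NETMASK=255.255.255.0
ONBOOT=yes

After 'ifup em2' on both ends, point nc at the peer's 192.168.0.x address instead of the public hostname.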

Re: Network backup brings network term sessions to a crawl

Posted: 2019/02/14 23:03:32
by mathog
tunk wrote:
2019/02/14 22:38:16
Looks like both have dual NICs. If you have one free on both you could
connect a cable between them and setup both ports (e.g. 192.168.0.x+y).
Then try to run the backup through this local network.
They are in different buildings, so that was not an option.

Thanks anyway.

Re: Network backup brings network term sessions to a crawl

Posted: 2019/02/14 23:13:41
by tunk
If it's the NIC that's the bottleneck (and not routers/switches), you
could connect the second one, assign it an IP address (on the same subnet),
and use that for the backup.

Re: Network backup brings network term sessions to a crawl

Posted: 2019/03/11 11:11:11
by tyler2016
It might not have anything to do with the servers themselves. Do you control the network infrastructure? I would check, or ask to have checked, the network connection between the buildings. Are there any QoS rules configured on any network devices between the two? Are you connected to the problem server through the same path that it is using for the backup? If all else fails, check this out:

https://unix.stackexchange.com/question ... -a-process
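For what it's worth, one tool often suggested for that problem is trickle, which shapes an unmodified program through an LD_PRELOAD shim; a rough sketch, with the rate picked arbitrarily:

Code: Select all

# limit nc's upload to ~50000 KB/s; trickle is in EPEL and only
# works with dynamically linked programs
tar --one-file-system -cf - /boot / 2>tar_messages.log \
  | trickle -u 50000 nc $A_FULL_NAME 3456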