How to export tmpfs/ramfs over pNFS?

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/21 02:23:30

For those that might be following the saga, here's an update:

I was unable to mount tmpfs using pNFS.

Other people (here and elsewhere) suggested that I use GlusterFS, so I've deployed that and am testing it now.

I created a 64 GB RAM drive on each of my compute nodes:

Code: Select all

# mount -t tmpfs -o size=64g tmpfs /bricks/brick1
and edited my /etc/fstab likewise.
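
For reference, the corresponding /etc/fstab entry is roughly along these lines (a sketch of the format; adjust the size and mount point to your own nodes):

Code: Select all

# 64 GB tmpfs RAM drive used as the GlusterFS brick
tmpfs   /bricks/brick1   tmpfs   size=64g   0 0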

I then created the mount points for the GlusterFS volume and then created said volume:

Code: Select all

# gluster volume create gv0 transport rdma node{1..4}:/bricks/brick1/gv0
but that was a no-go when I tried to mount it, so I disabled SELinux (based on the error message that was being written to the log file), deleted the volume, and created it again with:

Code: Select all

# gluster volume create gv0 transport tcp,rdma node{1..4}:/bricks/brick1/gv0
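(The SELinux change mentioned above was basically the following; a sketch, not the exact commands I ran:)

Code: Select all

setenforce 0                                                            # switch SELinux to permissive for the running system
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config   # persist the change across reboots (or use SELINUX=disabled)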
I started the volume up and was then able to mount it with:

Code: Select all

# mount -t glusterfs -o transport=rdma,direct-io-mode=enable node1:/gv0 /mnt/gv0
Out of all of the test trials, here's the best result that I've been able to get so far. (The results are VERY sporadic and kind of all over the map; I haven't quite figured out why just yet.)

Code: Select all

[root@node1 gv0]# for i in `seq -w 1 4`; do dd if=/dev/zero of=10Gfile$i bs=1024k count=10240; done
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.47401 s, 2.0 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.64206 s, 1.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.70306 s, 1.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.56882 s, 1.9 GB/s
Interestingly enough, when I try to do the same thing on /dev/shm, I only max out at around 2.8 GB/s.

So at best right now, with GlusterFS, I'm able to get about 16 Gbps throughput on four 64 GB RAM drives (for a total of 256 GB split across four nodes).

Note that this IS with a distributed volume for the time being.

Here are the results with the dispersed volume:

Code: Select all

[root@node1 gv1]# for i in `seq -w 1 4`; do dd if=/dev/zero of=10Gfile$i bs=1024k count=10240; done
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 19.7886 s, 543 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 20.9642 s, 512 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 20.6107 s, 521 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 21.7163 s, 494 MB/s
It's quite a lot slower.
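
(For anyone who wants to repeat the comparison: creating the dispersed volume looks roughly like this; the disperse/redundancy counts and the brick path below are only an illustration, not the exact command I used:)

Code: Select all

gluster volume create gv1 disperse 4 redundancy 1 transport tcp,rdma node{1..4}:/bricks/brick1/gv1   # 3 data bricks + 1 redundancy brick
gluster volume start gv1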

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: How to export tmpfs/ramfs over pNFS?

Post by TrevorH » 2019/08/21 09:42:57

dd if=/dev/zero of=10Gfile$i bs=1024k count=10240
You need to add oflag=direct or conv=fdatasync (or some other set of parameters) to dd to tell it to bypass the system cache; otherwise all you will measure is the speed of the RAM on the machine performing the test.
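
Something along these lines, reusing the block size and count from your loop (just an illustration):

Code: Select all

dd if=/dev/zero of=10Gfile bs=1024k count=10240 oflag=direct     # O_DIRECT writes, bypasses the page cache
dd if=/dev/zero of=10Gfile bs=1024k count=10240 conv=fdatasync   # flushes the data to the target before dd reports the rate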
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are dead, do not use them.
Use the FAQ Luke

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/21 11:36:46

TrevorH wrote:
2019/08/21 09:42:57
dd if=/dev/zero of=10Gfile$i bs=1024k count=10240
You need to add oflag=direct or conv=fdatasync (or some other set of parameters) to dd to tell it to bypass the system cache; otherwise all you will measure is the speed of the RAM on the machine performing the test.
Oh yeah. That's right. I forgot about that.

Thank you!

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/21 12:13:29

For those who might be interested, here is what some of the solver output looks like with regard to time, CPU, memory, and disk usage:

Code: Select all

Communication speed from master to core     1 =  8608.82 MB/sec
Communication speed from master to core     2 =  8612.43 MB/sec
Communication speed from master to core     3 =  8665.18 MB/sec
Communication speed from master to core     4 =  8733.33 MB/sec
Communication speed from master to core     5 =  8548.84 MB/sec
Communication speed from master to core     6 =  8701.26 MB/sec
Communication speed from master to core     7 =  8620.94 MB/sec
Communication speed from master to core     8 =  5553.01 MB/sec
Communication speed from master to core     9 =  5543.40 MB/sec
Communication speed from master to core    10 =  5498.32 MB/sec
Communication speed from master to core    11 =  5508.38 MB/sec
Communication speed from master to core    12 =  5522.10 MB/sec
Communication speed from master to core    13 =  5528.50 MB/sec
Communication speed from master to core    14 =  5373.67 MB/sec
Communication speed from master to core    15 =  5595.27 MB/sec
Communication speed from master to core    16 =  9772.36 MB/sec
Communication speed from master to core    32 =  9864.38 MB/sec
Communication speed from master to core    48 =  9935.42 MB/sec

Total CPU time for main thread                    :    25165.2 seconds
Total CPU time summed for all threads             :    25686.4 seconds

Elapsed time spent pre-processing model (/PREP7)  :        5.1 seconds
Elapsed time spent solution - preprocessing       :        9.2 seconds
Elapsed time spent computing solution             :    25656.2 seconds
Elapsed time spent solution - postprocessing      :        1.0 seconds
Elapsed time spent post-processing model (/POST1) :        0.0 seconds
 
Equation solver used                              :            Sparse (symmetric)
Equation solver computational rate                :      533.2 Gflops
Equation solver effective I/O rate                :      180.6 GB/sec

Maximum total memory used                         :       86.4 GB
Maximum total memory allocated                    :      147.2 GB
Total physical memory available                   :        126 GB
Maximum total memory available (all machines)     :        504 GB

Total amount of I/O written to disk               :     2322.0 GB
Total amount of I/O read from disk                :     4233.0 GB

+------ E N D   *********************************   S T A T I S T I C S -------+


 *---------------------------------------------------------------------------*
 |                                                                           |
 |                       ***************** RUN COMPLETED                     |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | *************               **********         **********     LINUX x64   |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | Database Requested(-db)  1024 MB    Scratch Memory Requested      1024 MB |
 | Maximum Database Used     721 MB    Maximum Scratch Memory Used   1679 MB |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 |        CP Time      (sec) =      25686.412       Time  =  10:51:26        |
 |        Elapsed Time (sec) =      25761.000       Date  =  03/01/2019      |
 |                                                                           |
 *---------------------------------------------------------------------------*
(I've sanitised it a little bit.)

You can see that the transfer from master to core 16 is 9772.36 MB/s (81.976 Gbps), the transfer to core 32 is 9864.38 MB/s (82.748 Gbps), and the transfer to core 48 is 9935.42 MB/s (83.344 Gbps).

You can also see that it wrote and read about 6.5 TB of data over the course of the run, which took a total of 25761 seconds (7.156 hours) to finish.

I also have another example that I'll post later.

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: How to export tmpfs/ramfs over pNFS?

Post by hunter86_bg » 2019/08/21 19:01:43

With tmpfs, you also have to check the NUMA nodes on the system.
For example, if a system has 2 NUMA nodes and each addresses 128 GB of RAM, creating a tmpfs of size 256 GB will give you strange results. I learned this when creating a tmpfs for a large SAP HANA system. You can check here to get the idea.
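
Something like this shows the layout, and tmpfs can also be pinned to specific NUMA nodes with the mpol mount option (the size and path below are only an example):

Code: Select all

numactl --hardware                                          # lists the NUMA nodes and the memory behind each one
mount -t tmpfs -o size=128g,mpol=bind:0 tmpfs /hana/tmpfs   # tmpfs bound to NUMA node 0; path and size are illustrative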

Also, contact the gluster mailing list for optimizations. I'm pretty convinced that you are using FUSE native mounts, which is not the best option from a performance perspective. There are lots of tunables that they can recommend to you.
Is your gluster volume using RDMA (I think you mentioned it somewhere)?
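
A quick way to check both (gv0 being the volume from your earlier posts):

Code: Select all

mount | grep glusterfs    # a native client mount shows up with type fuse.glusterfs
gluster volume info gv0   # the "Transport-type:" line shows tcp, rdma or tcp,rdma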

I have seen the opinion that for HPC the best option is Lustre (and that's no surprise - it is its main goal) - have you tried that?

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/21 21:20:41

hunter86_bg wrote:
2019/08/21 19:01:43
With tmpfs, you also have to check the NUMA nodes on the system.
For example, if a system has 2 NUMA nodes and each addresses 128 GB of RAM, creating a tmpfs of size 256 GB will give you strange results. I learned this when creating a tmpfs for a large SAP HANA system. You can check here to get the idea.

Also, contact the gluster mailing list for optimizations. I'm pretty convinced that you are using FUSE native mounts, which is not the best option from a performance perspective. There are lots of tunables that they can recommend to you.
Is your gluster volume using RDMA (I think you mentioned it somewhere)?

I have seen the opinion that for HPC the best option is Lustre (and that's no surprise - it is its main goal) - have you tried that?
I haven't tried Lustre yet, only because Gluster seemed, at least on the surface, easier to install, deploy, and administer.

Checking the NUMA nodes might potentially help, but I also reckon that the physical PCIe 3.0 x16 slot each Mellanox ConnectX-4 card is installed in plays a part (with compute/blade nodes, you don't really get to pick the expansion card topology), so I'm sure there's probably something to that as well.

I'm sure that there are a lot of ways to test and tune this.

Right now, though, with these early and preliminary results, the thing I am looking at is whether having the GlusterFS volume use RDMA is actually having an adverse impact on the solve times themselves.

Comparing against a run I had done previously, which also used 64 cores but with SATA 6 Gbps SSDs instead of the RAM-drive-backed GlusterFS volume over RDMA, the latter is so far showing roughly a 7.5% performance hit with this kind of setup.

I'm running it again, but instead of using the mounted GlusterFS volume, it's running locally on the RAM drive brick/mount point itself, so I'll have those results when I get home from work later on this evening.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/22 01:05:11

Here's what the results look like when I use the local RAM drives as scratch rather than the GlusterFS volume (which consists of the same RAM drives):

Code: Select all

Communication speed from master to core     1 =  7460.92 MB/sec
Communication speed from master to core     2 =  8026.46 MB/sec
Communication speed from master to core     3 =  7943.44 MB/sec
Communication speed from master to core     4 =  8058.55 MB/sec
Communication speed from master to core     5 =  8059.35 MB/sec
Communication speed from master to core     6 =  7996.66 MB/sec
Communication speed from master to core     7 =  8013.89 MB/sec
Communication speed from master to core     8 =  5167.39 MB/sec
Communication speed from master to core     9 =  5171.92 MB/sec
Communication speed from master to core    10 =  5203.58 MB/sec
Communication speed from master to core    11 =  5160.35 MB/sec
Communication speed from master to core    12 =  5140.17 MB/sec
Communication speed from master to core    13 =  5150.58 MB/sec
Communication speed from master to core    14 =  5158.66 MB/sec
Communication speed from master to core    15 =  5191.25 MB/sec
Communication speed from master to core    16 =  4581.97 MB/sec
Communication speed from master to core    32 =  4682.02 MB/sec
Communication speed from master to core    48 =  4587.82 MB/sec

Total CPU time for main thread                    :    25138.5 seconds
Total CPU time summed for all threads             :    25762.9 seconds

Elapsed time spent pre-processing model (/PREP7)  :        5.4 seconds
Elapsed time spent solution - preprocessing       :        9.7 seconds
Elapsed time spent computing solution             :    25696.6 seconds
Elapsed time spent solution - postprocessing      :        1.4 seconds
Elapsed time spent post-processing model (/POST1) :        0.0 seconds

Equation solver used                              :            Sparse (symmetric)
Equation solver computational rate                :      604.1 Gflops
Equation solver effective I/O rate                :      179.6 GB/sec

Maximum total memory used                         :       85.8 GB
Maximum total memory allocated                    :      146.9 GB
Total physical memory available                   :        126 GB
Maximum total memory available (all machines)     :        503 GB

Total amount of I/O written to disk               :     2351.7 GB
Total amount of I/O read from disk                :     4199.9 GB

+------ E N D   *********************************   S T A T I S T I C S -------+


 *---------------------------------------------------------------------------*
 |                                                                           |
 |                       ***************** RUN COMPLETED                     |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | *************               **********         **********     LINUX x64   |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | Database Requested(-db)  1024 MB    Scratch Memory Requested      1024 MB |
 | Maximum Database Used     721 MB    Maximum Scratch Memory Used   1547 MB |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 |        CP Time      (sec) =      25762.881       Time  =  14:04:49        |
 |        Elapsed Time (sec) =      25776.000       Date  =  08/21/2019      |
 |                                                                           |
 *---------------------------------------------------------------------------*
 
This is in contrast to the same run, but with the scratch directory on the GlusterFS volume (again, consisting of the four 64 GB tmpfs RAM drives):

Code: Select all

Communication speed from master to core     1 =  6312.69 MB/sec
Communication speed from master to core     2 =  7794.31 MB/sec
Communication speed from master to core     3 =  7649.30 MB/sec
Communication speed from master to core     4 =  7718.20 MB/sec
Communication speed from master to core     5 =  7675.39 MB/sec
Communication speed from master to core     6 =  7718.44 MB/sec
Communication speed from master to core     7 =  7731.96 MB/sec
Communication speed from master to core     8 =  5039.01 MB/sec
Communication speed from master to core     9 =  5063.61 MB/sec
Communication speed from master to core    10 =  5049.48 MB/sec
Communication speed from master to core    11 =  5026.94 MB/sec
Communication speed from master to core    12 =  5001.96 MB/sec
Communication speed from master to core    13 =  5002.78 MB/sec
Communication speed from master to core    14 =  5019.93 MB/sec
Communication speed from master to core    15 =  5046.57 MB/sec
Communication speed from master to core    16 =  4665.35 MB/sec
Communication speed from master to core    32 =  4700.56 MB/sec
Communication speed from master to core    48 =  4686.67 MB/sec

Total CPU time for main thread                    :    25425.9 seconds
Total CPU time summed for all threads             :    26059.7 seconds

Elapsed time spent pre-processing model (/PREP7)  :        8.3 seconds
Elapsed time spent solution - preprocessing       :       11.4 seconds
Elapsed time spent computing solution             :    27411.6 seconds
Elapsed time spent solution - postprocessing      :       12.4 seconds
Elapsed time spent post-processing model (/POST1) :        0.0 seconds

Equation solver used                              :            Sparse (symmetric)
Equation solver computational rate                :      600.3 Gflops
Equation solver effective I/O rate                :      174.2 GB/sec

Maximum total memory used                         :       86.4 GB
Maximum total memory allocated                    :      147.1 GB
Total physical memory available                   :        126 GB
Maximum total memory available (all machines)     :        503 GB

Total amount of I/O written to disk               :     2311.4 GB
Total amount of I/O read from disk                :     4230.1 GB

+------ E N D   *********************************   S T A T I S T I C S -------+


 *---------------------------------------------------------------------------*
 |                                                                           |
 |                       ***************** RUN COMPLETED                     |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | *************               **********         **********     LINUX x64   |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | Database Requested(-db)  1024 MB    Scratch Memory Requested      1024 MB |
 | Maximum Database Used     721 MB    Maximum Scratch Memory Used   1679 MB |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 |        CP Time      (sec) =      26059.717       Time  =  06:49:26        |
 |        Elapsed Time (sec) =      27473.000       Date  =  08/21/2019      |
 |                                                                           |
 *---------------------------------------------------------------------------*
 
You can see how the total elapsed time increases (27473 seconds on the GlusterFS volume vs. 25776 seconds on the local RAM drives, roughly a 6.6% increase).

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/22 01:26:43

TrevorH wrote:
2019/08/21 09:42:57
dd if=/dev/zero of=10Gfile$i bs=1024k count=10240
You need to add oflag=direct or conv=fdatasync (or some other set of parameters) to dd to tell it to bypass the system cache; otherwise all you will measure is the speed of the RAM on the machine performing the test.
When I tried it with oflag=direct, I got this error message:

Code: Select all

[root@node1 shm]# dd if=/dev/zero of=10Gfile oflag=direct bs=1024k count=10240
dd: failed to open `10Gfile': Invalid argument
But when I tried conv=fdatasync, I got the same results with and without it.

Thanks.

*edit*

Apparently, tmpfs doesn't support direct I/O.

(Source: https://stackoverflow.com/questions/210 ... e-to-tmpfs)

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: How to export tmpfs/ramfs over pNFS?

Post by hunter86_bg » 2019/08/22 03:57:09

You need to tune your gluster volumes. Check the docs for all options containing 'performance'.
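
For example, something along these lines (the values are only there to show the syntax, not recommendations):

Code: Select all

gluster volume get gv0 all | grep performance                     # list the current performance-related options
gluster volume set gv0 performance.cache-size 1GB                 # illustrative value
gluster volume set gv0 performance.io-thread-count 32             # illustrative value
gluster volume set gv0 performance.write-behind-window-size 4MB   # illustrative value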

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/22 17:29:14

hunter86_bg wrote:
2019/08/22 03:57:09
You need to tune your gluster volumes. Check the docs for all options containing 'performance'.
Thanks.
