
Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/21 02:23:30
by alpha754293
For those that might be following the saga, here's an update:

I was unable to mount tmpfs using pNFS.

Other people (here and elsewhere) suggested that I use GlusterFS, so I've deployed that and am testing it now.

I created a 64 GB RAM drive on each of my compute nodes:

Code: Select all

# mount -t tmpfs -o size=64g tmpfs /bricks/brick1
and edited my /etc/fstab likewise.
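
For reference, the corresponding /etc/fstab entry looks something like this (a sketch matching the size and mount point above):

Code: Select all

# 64 GB tmpfs RAM drive used as the gluster brick
tmpfs   /bricks/brick1   tmpfs   size=64g   0 0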

I then created the mount points for the GlusterFS volume and created said volume:

Code: Select all

# gluster volume create gv0 transport rdma node{1..4}:/bricks/brick1/gv0
but that was a no-go when I tried to mount it, so I disabled SELinux (based on the error message that was being written to the log file), deleted the volume, and created it again with:

Code: Select all

# gluster volume create gv0 transport tcp,rdma node{1..4}:/bricks/brick1/gv0
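After creating the volume, it needs to be started before it can be mounted:

Code: Select all

# gluster volume start gv0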
Once the volume was started, I was able to mount it with:

Code: Select all

# mount -t glusterfs -o transport=rdma,direct-io-mode=enable node1:/gv0 /mnt/gv0
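To make the mount persist across reboots, an /etc/fstab entry along these lines should work (I haven't verified this exact line; _netdev just delays the mount until networking is up):

Code: Select all

# GlusterFS scratch volume (illustrative entry)
node1:/gv0   /mnt/gv0   glusterfs   transport=rdma,direct-io-mode=enable,_netdev   0 0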
Out of all of the test trials, here's the best result that I've been able to get so far. (The results are VERY sporadic and kind of all over the map; I haven't quite figured out why just yet.)

Code: Select all

[root@node1 gv0]# for i in `seq -w 1 4`; do dd if=/dev/zero of=10Gfile$i bs=1024k count=10240; done
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.47401 s, 2.0 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.64206 s, 1.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.70306 s, 1.9 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.56882 s, 1.9 GB/s
Interestingly enough, when I try to do the same thing on /dev/shm, I only max out at around 2.8 GB/s.

So at best right now, with GlusterFS, I'm able to get about 16 Gbps of throughput on four 64 GB RAM drives (for a total of 256 GB split across four nodes).

Note that this IS with a distributed volume for the time being.
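
A quick way to confirm the volume type, for anyone following along, is gluster volume info:

Code: Select all

# gluster volume info gv0 | grep -i type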

Here are the results with the dispersed volume:

Code: Select all

[root@node1 gv1]# for i in `seq -w 1 4`; do dd if=/dev/zero of=10Gfile$i bs=1024k count=10240; done
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 19.7886 s, 543 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 20.9642 s, 512 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 20.6107 s, 521 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 21.7163 s, 494 MB/s
It's quite a lot slower.
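
For reference, a dispersed volume over the same four bricks can be created along these lines (the disperse/redundancy counts and the gv1 brick path here are illustrative, not necessarily what I used):

Code: Select all

# gluster volume create gv1 disperse 4 redundancy 1 transport tcp,rdma node{1..4}:/bricks/brick1/gv1    # counts and path are illustrative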

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/21 09:42:57
by TrevorH
dd if=/dev/zero of=10Gfile$i bs=1024k count=10240
You need to add oflag=direct or conv=fdatasync or some set of parameters to dd that tell it to bypass the system cache or all you will measure is the speed of the RAM on the machine performing the test.
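
For example (the file name is just a placeholder):

Code: Select all

# dd if=/dev/zero of=testfile bs=1024k count=10240 conv=fdatasync    # flushes file data before dd reports its timing
# dd if=/dev/zero of=testfile bs=1024k count=10240 oflag=direct      # bypasses the page cache entirely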

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/21 11:36:46
by alpha754293
TrevorH wrote:
2019/08/21 09:42:57
dd if=/dev/zero of=10Gfile$i bs=1024k count=10240
You need to add oflag=direct or conv=fdatasync or some set of parameters to dd that tell it to bypass the system cache or all you will measure is the speed of the RAM on the machine performing the test.
Oh yeah. That's right. I forgot about that.

Thank you!

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/21 12:13:29
by alpha754293
For those that might be interested, here is what some of the solver outputs look like with regard to time, CPU, memory, and disk usage:

Code: Select all

Communication speed from master to core     1 =  8608.82 MB/sec
Communication speed from master to core     2 =  8612.43 MB/sec
Communication speed from master to core     3 =  8665.18 MB/sec
Communication speed from master to core     4 =  8733.33 MB/sec
Communication speed from master to core     5 =  8548.84 MB/sec
Communication speed from master to core     6 =  8701.26 MB/sec
Communication speed from master to core     7 =  8620.94 MB/sec
Communication speed from master to core     8 =  5553.01 MB/sec
Communication speed from master to core     9 =  5543.40 MB/sec
Communication speed from master to core    10 =  5498.32 MB/sec
Communication speed from master to core    11 =  5508.38 MB/sec
Communication speed from master to core    12 =  5522.10 MB/sec
Communication speed from master to core    13 =  5528.50 MB/sec
Communication speed from master to core    14 =  5373.67 MB/sec
Communication speed from master to core    15 =  5595.27 MB/sec
Communication speed from master to core    16 =  9772.36 MB/sec
Communication speed from master to core    32 =  9864.38 MB/sec
Communication speed from master to core    48 =  9935.42 MB/sec

Total CPU time for main thread                    :    25165.2 seconds
Total CPU time summed for all threads             :    25686.4 seconds

Elapsed time spent pre-processing model (/PREP7)  :        5.1 seconds
Elapsed time spent solution - preprocessing       :        9.2 seconds
Elapsed time spent computing solution             :    25656.2 seconds
Elapsed time spent solution - postprocessing      :        1.0 seconds
Elapsed time spent post-processing model (/POST1) :        0.0 seconds
 
Equation solver used                              :            Sparse (symmetric)
Equation solver computational rate                :      533.2 Gflops
Equation solver effective I/O rate                :      180.6 GB/sec

Maximum total memory used                         :       86.4 GB
Maximum total memory allocated                    :      147.2 GB
Total physical memory available                   :        126 GB
Maximum total memory available (all machines)     :        504 GB

Total amount of I/O written to disk               :     2322.0 GB
Total amount of I/O read from disk                :     4233.0 GB

+------ E N D   *********************************   S T A T I S T I C S -------+


 *---------------------------------------------------------------------------*
 |                                                                           |
 |                       ***************** RUN COMPLETED                     |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | *************               **********         **********     LINUX x64   |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | Database Requested(-db)  1024 MB    Scratch Memory Requested      1024 MB |
 | Maximum Database Used     721 MB    Maximum Scratch Memory Used   1679 MB |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 |        CP Time      (sec) =      25686.412       Time  =  10:51:26        |
 |        Elapsed Time (sec) =      25761.000       Date  =  03/01/2019      |
 |                                                                           |
 *---------------------------------------------------------------------------*
(I've sanitised it a little bit.)

You can see that the transfer from master to core 16 is 9772.36 MB/s (81.976 Gbps), the transfer to core 32 is 9864.38 MB/s (82.748 Gbps), and the transfer to core 48 is 9935.42 MB/s (83.344 Gbps).

You can also see that it wrote and read about 6.5 TB of data over the course of the run, which took a total of 25,761 seconds (7.156 hours) to finish.

I also have another example that I'll post later.

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/21 19:01:43
by hunter86_bg
With tmpfs, you also have to check the NUMA nodes on the system.
For example, if a system has 2 NUMA nodes and each addresses 128 GB of RAM, creating a 256 GB tmpfs will give you strange results. I learned this when creating a tmpfs for a large SAP HANA instance. You can check here to get the idea.
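
For example, numactl shows how many NUMA nodes the system has and how much memory each one addresses (assuming the numactl package is installed):

Code: Select all

# numactl --hardware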

Also, contact the gluster mailing list for optimizations. I'm pretty convinced that you are using FUSE native mounts, which is not the best option from a performance perspective. There are lots of tunables that they can recommend to you.
Is your Gluster using RDMA (I think you mentioned it somewhere)?

I have seen opinions that for HPC the best option is Lustre (and that's no surprise - it is its main goal) - have you tried that?

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/21 21:20:41
by alpha754293
hunter86_bg wrote:
2019/08/21 19:01:43
With tmpfs, you also have to check the NUMA nodes on the system.
For example, if a system has 2 NUMA nodes and each addresses 128 GB of RAM, creating a 256 GB tmpfs will give you strange results. I learned this when creating a tmpfs for a large SAP HANA instance. You can check here to get the idea.

Also, contact the gluster mailing list for optimizations. I'm pretty convinced that you are using FUSE native mounts, which is not the best option from a performance perspective. There are lots of tunables that they can recommend to you.
Is your Gluster using RDMA (I think you mentioned it somewhere)?

I have seen opinions that for HPC the best option is Lustre (and that's no surprise - it is its main goal) - have you tried that?
I haven't tried Lustre yet, only because Gluster seemed, at least on the surface, easier to install, deploy, and administer.

NUMA tuning might potentially help, but I also reckon that the physical PCIe 3.0 x16 slot that the Mellanox ConnectX-4 card is installed in plays a role (with compute/blade nodes, you don't really get to pick the expansion card topology), so I'm sure there's probably something to that as well.
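
One way to sanity-check the PCIe side of that is to look at the negotiated link width/speed of the card (the bus address below is just an example; the first command shows the real one):

Code: Select all

# lspci | grep -i mellanox
# lspci -vv -s 03:00.0 | grep -iE 'lnkcap|lnksta'    # substitute the bus address reported above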

I'm sure that there are a lot of ways to test and tune this.

Right now, though, with these early and preliminary results, the thing I am looking at is whether running the GlusterFS volume over RDMA is actually having an adverse impact on the solve times themselves.

Comparing against a run I had done previously, which also used 64 cores but with SATA 6 Gbps SSDs rather than the RAM drives mounted as a GlusterFS volume over RDMA, the latter is so far potentially showing about a 7.5% performance hit with this kind of setup.

I'm running it again, but instead of using the mounted GlusterFS volume, it's running locally on the RAM drive brick/mount point itself, so I'll have those results when I get home from work later on this evening.

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/22 01:05:11
by alpha754293
Here's what the results look like when I use the local RAM drives directly as scratch rather than the GlusterFS volume (consisting of the same):

Code: Select all

Communication speed from master to core     1 =  7460.92 MB/sec
Communication speed from master to core     2 =  8026.46 MB/sec
Communication speed from master to core     3 =  7943.44 MB/sec
Communication speed from master to core     4 =  8058.55 MB/sec
Communication speed from master to core     5 =  8059.35 MB/sec
Communication speed from master to core     6 =  7996.66 MB/sec
Communication speed from master to core     7 =  8013.89 MB/sec
Communication speed from master to core     8 =  5167.39 MB/sec
Communication speed from master to core     9 =  5171.92 MB/sec
Communication speed from master to core    10 =  5203.58 MB/sec
Communication speed from master to core    11 =  5160.35 MB/sec
Communication speed from master to core    12 =  5140.17 MB/sec
Communication speed from master to core    13 =  5150.58 MB/sec
Communication speed from master to core    14 =  5158.66 MB/sec
Communication speed from master to core    15 =  5191.25 MB/sec
Communication speed from master to core    16 =  4581.97 MB/sec
Communication speed from master to core    32 =  4682.02 MB/sec
Communication speed from master to core    48 =  4587.82 MB/sec

Total CPU time for main thread                    :    25138.5 seconds
Total CPU time summed for all threads             :    25762.9 seconds

Elapsed time spent pre-processing model (/PREP7)  :        5.4 seconds
Elapsed time spent solution - preprocessing       :        9.7 seconds
Elapsed time spent computing solution             :    25696.6 seconds
Elapsed time spent solution - postprocessing      :        1.4 seconds
Elapsed time spent post-processing model (/POST1) :        0.0 seconds

Equation solver used                              :            Sparse (symmetric)
Equation solver computational rate                :      604.1 Gflops
Equation solver effective I/O rate                :      179.6 GB/sec

Maximum total memory used                         :       85.8 GB
Maximum total memory allocated                    :      146.9 GB
Total physical memory available                   :        126 GB
Maximum total memory available (all machines)     :        503 GB

Total amount of I/O written to disk               :     2351.7 GB
Total amount of I/O read from disk                :     4199.9 GB

+------ E N D   *********************************   S T A T I S T I C S -------+


 *---------------------------------------------------------------------------*
 |                                                                           |
 |                       ***************** RUN COMPLETED                     |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | *************               **********         **********     LINUX x64   |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | Database Requested(-db)  1024 MB    Scratch Memory Requested      1024 MB |
 | Maximum Database Used     721 MB    Maximum Scratch Memory Used   1547 MB |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 |        CP Time      (sec) =      25762.881       Time  =  14:04:49        |
 |        Elapsed Time (sec) =      25776.000       Date  =  08/21/2019      |
 |                                                                           |
 *---------------------------------------------------------------------------*
 
This is in contrast to the same run, but where the scratch directory was on the GlusterFS volume (again, consisting of four 64 GB RAM drives (tmpfs)):

Code: Select all

Communication speed from master to core     1 =  6312.69 MB/sec
Communication speed from master to core     2 =  7794.31 MB/sec
Communication speed from master to core     3 =  7649.30 MB/sec
Communication speed from master to core     4 =  7718.20 MB/sec
Communication speed from master to core     5 =  7675.39 MB/sec
Communication speed from master to core     6 =  7718.44 MB/sec
Communication speed from master to core     7 =  7731.96 MB/sec
Communication speed from master to core     8 =  5039.01 MB/sec
Communication speed from master to core     9 =  5063.61 MB/sec
Communication speed from master to core    10 =  5049.48 MB/sec
Communication speed from master to core    11 =  5026.94 MB/sec
Communication speed from master to core    12 =  5001.96 MB/sec
Communication speed from master to core    13 =  5002.78 MB/sec
Communication speed from master to core    14 =  5019.93 MB/sec
Communication speed from master to core    15 =  5046.57 MB/sec
Communication speed from master to core    16 =  4665.35 MB/sec
Communication speed from master to core    32 =  4700.56 MB/sec
Communication speed from master to core    48 =  4686.67 MB/sec

Total CPU time for main thread                    :    25425.9 seconds
Total CPU time summed for all threads             :    26059.7 seconds

Elapsed time spent pre-processing model (/PREP7)  :        8.3 seconds
Elapsed time spent solution - preprocessing       :       11.4 seconds
Elapsed time spent computing solution             :    27411.6 seconds
Elapsed time spent solution - postprocessing      :       12.4 seconds
Elapsed time spent post-processing model (/POST1) :        0.0 seconds

Equation solver used                              :            Sparse (symmetric)
Equation solver computational rate                :      600.3 Gflops
Equation solver effective I/O rate                :      174.2 GB/sec

Maximum total memory used                         :       86.4 GB
Maximum total memory allocated                    :      147.1 GB
Total physical memory available                   :        126 GB
Maximum total memory available (all machines)     :        503 GB

Total amount of I/O written to disk               :     2311.4 GB
Total amount of I/O read from disk                :     4230.1 GB

+------ E N D   *********************************   S T A T I S T I C S -------+


 *---------------------------------------------------------------------------*
 |                                                                           |
 |                       ***************** RUN COMPLETED                     |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | *************               **********         **********     LINUX x64   |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 | Database Requested(-db)  1024 MB    Scratch Memory Requested      1024 MB |
 | Maximum Database Used     721 MB    Maximum Scratch Memory Used   1679 MB |
 |                                                                           |
 |---------------------------------------------------------------------------|
 |                                                                           |
 |        CP Time      (sec) =      26059.717       Time  =  06:49:26        |
 |        Elapsed Time (sec) =      27473.000       Date  =  08/21/2019      |
 |                                                                           |
 *---------------------------------------------------------------------------*
 
You can see how the total elapsed time increases (27,473 seconds vs. 25,776 seconds, roughly a 6.6% difference).

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/22 01:26:43
by alpha754293
TrevorH wrote:
2019/08/21 09:42:57
dd if=/dev/zero of=10Gfile$i bs=1024k count=10240
You need to add oflag=direct or conv=fdatasync or some set of parameters to dd that tell it to bypass the system cache or all you will measure is the speed of the RAM on the machine performing the test.
When I tried it with oflag=direct, I got this error message:

Code: Select all

[root@node1 shm]# dd if=/dev/zero of=10Gfile oflag=direct bs=1024k count=10240
dd: failed to open `10Gfile': Invalid argument
But when I tried it with conv=fdatasync, I got the same results with and without it.

Thanks.

*edit*

Apparently, tmpfs doesn't support direct I/O.

(Source: https://stackoverflow.com/questions/210 ... e-to-tmpfs)

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/22 03:57:09
by hunter86_bg
You need to tune your Gluster volumes. Check the docs for all options containing 'performance'.
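
For example, something along these lines will list the current values and change one of them (gv0 from earlier in the thread; the cache size shown is just an example value):

Code: Select all

# gluster volume get gv0 all | grep performance
# gluster volume set gv0 performance.cache-size 1GB    # example value only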

Re: How to export tmpfs/ramfs over pNFS?

Posted: 2019/08/22 17:29:14
by alpha754293
hunter86_bg wrote:
2019/08/22 03:57:09
You need to tune your Gluster volumes. Check the docs for all options containing 'performance'.
Thanks.