How to export tmpfs/ramfs over pNFS?

Issues related to configuring your network
alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/12 03:46:46

Sorry - yet another stupid question.

So the background to this question is I've recently worn through the write endurance of all of my SSDs as swap/scratch disks and I'm now looking for an alternative solution.

Is there a way to have multiple blade nodes, whereby the amount of RAM per blade node is maxed out, and then I can either create a ramfs or tmpfs on it (RAM drive), and then export said RAM drive using pNFS or some other distributed/parallel file system so that, to the cluster, it would appear as one giant tmpfs/ramfs RAM drive?

Thank you.

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: How to export tmpfs/ramfs over pNFS?

Post by chemal » 2019/08/13 00:45:40

RH doesn't support pNFS server-side, but GlusterFS may be an option.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/13 01:46:07

chemal wrote:
2019/08/13 00:45:40
RH doesn't support pNFS server-side, but GlusterFS may be an option.
Thank you.

Yeah, I've started doing some research into the step-by-step deployment guide and how I can set it up. Not sure it will do what I'm hoping it can do, but the research continues.

Thank you.

*edit*
Sorry - more stupid questions re: GlusterFS

I was reading here: https://wiki.centos.org/HowTos/GlusterFSonCentOS and it says that it uses XFS bricks to create the GlusterFS volumes.

Does that mean that I won't be able to create a GlusterFS volume consisting of RAM drives (tmpfs) by virtue of it being tmpfs?

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: How to export tmpfs/ramfs over pNFS?

Post by chemal » 2019/08/13 03:46:48

I guess they didn't expect the Spanish Inquisition -- I mean someone with "the amount of RAM per blade node [...] maxed out" as a substitute for SSDs. XFS is just a recommendation for ordinary setups limited by common financial resources.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/13 04:27:37

chemal wrote:
2019/08/13 03:46:48
I guess they didn't expect the Spanish Inquisition -- I mean someone with "the amount of RAM per blade node [...] maxed out" as a substitute for SSDs. XFS is just a recommendation for ordinary setups limited by common financial resources.
Yeah...again, the motivation for this is driven by this:
alpha754293 wrote: So the background to this question is I've recently worn through the write endurance of all of my SSDs as swap/scratch disks and I'm now looking for an alternative solution.
So...does that mean that I CAN create a GlusterFS volume using tmpfs mount points?

(I'm in the process of RMA-ing SSDs #2 through 5, inclusive, because I've worn out the write endurance on them.)

Drives capable of > 3 DWPD are very difficult to find, and even then it doesn't change the fact that the NAND/flash memory chips in said SSDs WILL wear out over time. So, rather than replacing the drives every 6 months to 2 years, the idea now is to load up a whole slew of blades, max out the RAM capacity, and use the volatile RAM as swap/scratch space in lieu of SSDs. Over the life of the deployment, the cost of constantly replacing SSDs as they wear out will overtake the initial capex of a blade server that exists almost solely to host RAM and, therefore, the swap/scratch space.

With volatile RAM, I don't/won't have to worry about the RAM memory chips wearing out like I do with the NAND/flash memory chips in SSDs.

And if I am to understand correctly as well, tmpfs is swap-backed - its pages can be pushed out to swap under memory pressure (the size limit just caps the filesystem) - which means that I can keep SSDs as swap and tmpfs (RAM) as primary, such that the SSDs will only be used when/if RAM runs short and the kernel starts swapping.
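
To make it concrete, this is roughly the setup I have in mind (a sketch only; the device name /dev/sdb1 and the 64G size are illustrative):

    # size-capped tmpfs for scratch; writes beyond 64G fail with ENOSPC,
    # but under memory pressure its pages can be evicted to swap
    mkdir -p /scratch
    mount -t tmpfs -o size=64G,mode=1777 tmpfs /scratch

    # keep an SSD partition as swap so the kernel has somewhere to push tmpfs pages
    mkswap /dev/sdb1                  # illustrative device name
    swapon --priority 10 /dev/sdb1

    # persistent equivalent in /etc/fstab:
    # tmpfs      /scratch  tmpfs  size=64G,mode=1777  0 0
    # /dev/sdb1  none      swap   sw,pri=10           0 0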

This will help resolve this issue that I am currently facing by using ONLY SSDs.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/18 18:54:27

re: GlusterFS

(I'm using pNFS terminology because I'm still learning about GlusterFS and its terminology.)

Can the data servers also be the clients or is it expected that the clients will be separate from the data servers and the metadata server(s)?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: How to export tmpfs/ramfs over pNFS?

Post by aks » 2019/08/20 18:55:41

Sorry - yet another stupid question.
That's okay, I have many (and I have more than you!)
worn through the write endurance of all of my SSDs as swap/scratch disks and I'm now looking for an alternative solution.
A solution to the wear-levelling problem, or is it that you're actually missing the principle? A solution: don't do that.
whereby the amount of RAM per blade node is maxed out, and then I can either create a ramfs or tmpfs on it (RAM drive),
How? There is no RAM!

If you're really talking about anonymous RAM SHARED between different physical nodes (so NUMA++++), then yeah, that's still research (AFAIK) and it's a long, long way away. There are a WHOLE lot of problems there (including, but not limited to, crosstalk and security, at the very least).

Gluster is an "object storage" system - kind of different from file-based systems (although it's still a bunch of zeros and ones stored in some persistent manner). It is NOT aimed at a "traditional" RAMFS - i.e. all in memory (RAM) and flushed to persistent storage at events (like write barriers or cache full etc.). That's not the use case.
NAND/flash memory chips in said SSDs WILL wear out over time
Correction, all will wear out over time. The trade-off is cost versus the amount of time.
With volatile RAM, I don't/won't have to worry about the RAM memory chips wearing out like I do with the NAND/flash memory chips in SSDs.
Actually it can happen, and faster than one would think. All physical things in the "real" world will die (including you and me). The question is when.
This will help resolve this issue that I am currently facing by using ONLY SSDs.
So what's the real question? All I know is you have some bizarre use case whereby you MUST use "fast" (marketing term) persistent storage as a "placeholder" (virtual memory) for something. Why are you flushing to swap so much? Maybe your real question revolves around "I don't have enough RAM for my use case, what can I do"?

Swap is not "primary" storage, I consider it backup. Get enough RAM and be done.

Or I'm missing the point.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/20 19:23:40

aks wrote:
2019/08/20 18:55:41
How? There is no RAM!
???

I'm confused by your statement because my first post talks about how I have multiple blade nodes with RAM in them.

So...again, I'm confused by your statement.
If you're really talking about anonymous RAM SHARED between different physical nodes (so NUMA++++), then yeah, that's still research (AFAIK) and it's a long, long way away. There are a WHOLE lot of problems there (including, but not limited to, crosstalk and security, at the very least).

Gluster is an "object storage" system - kind of different from file-based systems (although it's still a bunch of zeros and ones stored in some persistent manner). It is NOT aimed at a "traditional" RAMFS - i.e. all in memory (RAM) and flushed to persistent storage at events (like write barriers or cache full etc.). That's not the use case.
So...the original question is a little bit old.

I actually tried that last night using four virtual machines where each VM had 4 GB of RAM and half of it was used as a GlusterFS brick for a volume.

I wasn't able to get pNFS working, but I was able to create a tmpfs mount point, make that the brick for GlusterFS, and then create a Gluster distributed volume using those bricks across the four VMs.

Striped/dispersed volumes didn't work (which, in some ways, doesn't surprise me), and neither did distributed dispersed.

Four VMs, each contributing 2 GB of RAM to create a distributed GlusterFS volume with a total size of 8 GB, worked.
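
Roughly, the recipe that worked was along these lines (a sketch; the hostnames vm1-vm4, the 2 GB size, and the paths are just what I used in the test and are illustrative):

    # on every VM: carve out a tmpfs mount and a brick directory inside it
    mkdir -p /mnt/rambrick
    mount -t tmpfs -o size=2G tmpfs /mnt/rambrick
    mkdir -p /mnt/rambrick/brick

    # on one VM: form the trusted pool, then create and start the distributed volume
    gluster peer probe vm2
    gluster peer probe vm3
    gluster peer probe vm4
    gluster volume create ramscratch \
        vm1:/mnt/rambrick/brick vm2:/mnt/rambrick/brick \
        vm3:/mnt/rambrick/brick vm4:/mnt/rambrick/brick
    # (append 'force' to the create command only if gluster objects to the brick location)
    gluster volume start ramscratch

    # each VM can mount the volume itself, i.e. be data server and client at the same time
    mkdir -p /mnt/scratch
    mount -t glusterfs vm1:/ramscratch /mnt/scratch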

But because they were VMs, the transport was only Gigabit TCP, so I am going to try it again tonight on my actual compute nodes with 100 Gbps InfiniBand and RDMA.

The nodes won't see the GlusterFS distributed volume consisting of tmpfs bricks as RAM, but they do see it as a distributed volume of tmpfs mount points, much like an NFS/distributed file system.
Correction, all will wear out over time. The trade-off is cost versus the amount of time.
How do I figure out the write endurance of RAM modules/chips?
So what's the real question? All I know is you have some bizarre use case whereby you MUST use "fast" (marketing term) persistent storage as a "placeholder" (virtual memory) for something. Why are you flushing to swap so much? Maybe your real question revolves around "I don't have enough RAM for my use case, what can I do"?

Swap is not "primary" storage, I consider it backup. Get enough RAM and be done.

Or I'm missing the point.
I don't think that I've ever said anything about it needing to be persistent.

High performance computing (HPC)/computer aided engineering (CAE)/finite element analysis (FEA)/computational fluid dynamics (CFD) applications can have very high memory usage requirements.

Direct sparse solver solutions for FEA models of only a couple hundred thousand quad shell elements can easily produce scratch files of about 11 GB as part of the solution process (even if the solution runs "in-core"). These scratch files are written to the scratch directory during the solution, and disabling them actually makes the total solution time LONGER, NOT shorter.

At 128 GB of RAM per node (512 GB across four nodes), the solver has enough RAM to run the sparse direct matrix solution in-core (i.e. all in RAM). But that still doesn't stop it from scratching to disk along the way. (Please refer to the MSC.NASTRAN User's Guide, as an example.)

So, again, my question hasn't changed - how do I export tmpfs over pNFS such that, if I allocate half of the RAM (64 GB) from each node, I would be able to create a pNFS volume that's a total of 4*64 GB = 256 GB?

And I'm doing this because the nodes don't have PCIe 3.0 x4 M.2 slots for NVMe SSDs (and, like you said, the NAND flash memory will wear out - they all do); SATA 6 Gbps SSDs can't keep up with the 100 Gbps 4X EDR InfiniBand interconnect, and I don't have any free PCIe 3.0 x16 slots in the blades because they're already taken up by the Mellanox ConnectX-4 dual-port 4X EDR IB NIC.

So the idea was that a RAM drive was one of the options that would create a fast volume where all four nodes are the data server AND the client simultaneously.

(Most of the research and papers that I've seen/done show that the MDS is one server, the DSes are different servers, and the clients are separate systems as well. What I'm looking to do is have node1 be the MDS, DS, and client, and nodes 2 through 4, inclusive, be both DS and client, with the MDS controlling all four DSes together.)

I tried doing that, and that didn't work. Hence my question.

GlusterFS was able to do that, but I now have to try it again with the "real" blade/node hardware so that I can see whether writing a large sequential file to the 64 GB tmpfs mount point can be done as fast as the local system writes to /dev/shm, or whether it will always be slower when writing to the GlusterFS volume whose bricks are tmpfs mount points.
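
For that comparison, I'm thinking something as simple as a dd run on each path (sizes are illustrative; /mnt/scratch stands for wherever the Gluster volume is mounted, and conv=fdatasync is there so the Gluster number isn't just the client-side cache):

    # local tmpfs baseline
    dd if=/dev/zero of=/dev/shm/testfile bs=1M count=16384 conv=fdatasync

    # same write into the Gluster volume built from tmpfs bricks
    dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=16384 conv=fdatasync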

(I also further suspect, based on the testing that I did last night with the VMs, that not being able to stripe across the tmpfs mount points is hurting the maximum practical throughput, but I won't be able to tell what impact that will have until I test it tonight on my actual nodes, with the Gluster volume created using "transport rdma" instead of TCP.)
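
The RDMA attempt will look something like this (a sketch; it assumes the rdma transport is actually built into the Gluster packages on these nodes, and the node names and paths are illustrative):

    # create the volume with the rdma transport instead of the default tcp
    gluster volume create ramscratch transport rdma \
        node1:/mnt/rambrick/brick node2:/mnt/rambrick/brick \
        node3:/mnt/rambrick/brick node4:/mnt/rambrick/brick
    gluster volume start ramscratch

    # mount it over rdma on each node
    mount -t glusterfs -o transport=rdma node1:/ramscratch /mnt/scratch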

Again, to recapitulate:

The goals and objectives of this are:

1) Have a super fast mount point that's accessible by all nodes/clients on the 100 Gbps interconnect

2) Increase/improve the write endurance of the mount point (I can't seem to find a source that tells me how to calculate the write endurance of RAM memory chips/modules.)

3) Improve the scratch disk performance so that it's faster than a SATA 6 Gbps SSD, and also either mitigate or remove the write endurance problem entirely through the use of RAM drives (tmpfs mount points), either individually/locally on each node or as a parallel file system, when each node is connected to the 100 Gbps InfiniBand system interconnect/network.

4) Minimise the risk of a single point of failure (which would be the case if I had a head node with SSDs (NVMe or SATA 6 Gbps) in a RAID logical volume, whether through a RAID HBA, PCH RAID, or LVM RAID).

(But I will consider using a RAID array of SATA 6 Gbps SSDs on a separate head node and exporting the RAID volume over NFS with RDMA, if that proves to be faster and is also able to distribute the wear across the SSDs within the RAID volume.)
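
If I go that route, the export would look roughly like this (a sketch; the paths, export options, and module names are the usual defaults rather than anything I've verified on this hardware):

    # on the head node: export the RAID-backed scratch area and enable the NFS/RDMA listener
    echo '/export/scratch *(rw,no_root_squash,async)' >> /etc/exports
    exportfs -ra
    modprobe svcrdma
    echo 'rdma 20049' > /proc/fs/nfsd/portlist    # 20049 is the conventional NFS/RDMA port

    # on each compute node: mount over RDMA
    modprobe xprtrdma
    mount -t nfs -o rdma,port=20049 headnode:/export/scratch /mnt/scratch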

Thanks.

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: How to export tmpfs/ramfs over pNFS?

Post by chemal » 2019/08/20 20:43:43

What are these scratch files good for? Who writes/reads them? Do all nodes use one big scratch file or does every node have its own scratch file?

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: How to export tmpfs/ramfs over pNFS?

Post by alpha754293 » 2019/08/20 20:57:03

chemal wrote:
2019/08/20 20:43:43
What are these scratch files good for? Who writes/reads them? Do all nodes use one big scratch file or does every node have its own scratch file?
Scratch files are usually used by the solver to store temporary results (for example, the results for a current equilibrium iteration within a solution that consists of multiple equilibrium iterations as part of the solve/solution process).

The solver processes (both master and slaves) read and write to them.

Depending on the code, sometimes the solutions from multiple slave processes are aggregated into one by the master and then written out to the scratch directory, while other codes write their own scratch file per process, regardless of whether it's a master or a slave.

Looking at the SMART results from the SSDs whose write endurance limit I've worn through, there is no discernible pattern as to which node (or process) reads or writes more data than any other node.

If I run a job with all four nodes (so 64 solver processes altogether), again, depending on the software, some solvers will write one giant aggregated scratch file while others will write a scratch file per solver process.

Not all jobs are run using all four nodes, however.

There are some runs where I've performed scalability studies and found that increasing the number of solver processes does not result in a proportional reduction in the overall wall-clock solution time. Therefore, in those instances, only a fraction of the total cluster will be used to run a case.

For example, one of my explicit dynamics crash simulations will run using only 48 solver processes rather than the full 64.

In another example, because of the way MPI works, increasing the number of solver processes also dramatically increases the amount of "coordination" work that the solver has to do to "keep track" of the solution, and therefore it can actually be beneficial to purposely limit the number of solver processes used for that type of solution (e.g. smoothed particle hydrodynamics (SPH), because the algorithm needs to search for neighbouring particles in order to determine the pairs of particle-particle interactions, so that it can figure out what happens to the particle it is currently conducting the search for, and vice versa).

In other words, nCr(64,2) = 2016 pair-wise combinations is a lot more than nCr(32,2) = 496.

So, it depends.

It depends on the solver.

It depends on the nature of the problem that is being solved.

It depends on the solution.

And it depends on which solution method the solver is using for that problem, and also on the specifics of the problem itself (e.g. a problem with both non-linear materials and non-linear contacts is vastly more difficult to solve than one with only non-linear materials, only non-linear contacts, or neither (everything linear)).

The underlying statement that would be true in all cases, though, is that if there is a way to speed up the mount point where the scratch directory lies, then the total wall-clock solution time will decrease. Some proportionally so, others less so.

Hope this helps.

Thank you.
