aks wrote: ↑2019/08/20 18:55:41
How? There is no RAM!
???
I'm confused by your statement, because my first post talks about how I have multiple blade nodes, each with its own RAM.
So, again, I'm confused by what you mean.
If you're really talking about anonymous RAM SHARED between different physical nodes (so NUMA++++), then yeah, that's still research (AFAIK) and it's a long, long way away. There are a WHOLE lot of problems there (including, but not limited to, cross-talk and security, at the very least).
Gluster is an "object storage" system - kind of different from file-based systems (although it's still a bunch of zeros and ones stored in some persistent manner). It is NOT aimed at a "traditional" RAMFS - i.e. all in memory (RAM) and flushed to persistent storage on events (like write barriers or a full cache, etc.). That's not the use case.
So...the original question is a little bit old.
I actually tried that last night using four virtual machines where each VM had 4 GB of RAM and half of it was used as a GlusterFS brick for a volume.
I wasn't able to get pNFS working, but I was able to create a tmpfs mount point, make that the brick for GlusterFS, and then create a Gluster distributed volume using those bricks across the four VMs.
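For reference, this is roughly what I did on each VM (the volume name and brick paths here are just the ones I happened to pick, so adjust to taste):

    # on each VM: carve 2 GB of RAM into a tmpfs and put the brick in a subdirectory
    # (Gluster complains if the brick is the mount point itself)
    mount -t tmpfs -o size=2g tmpfs /mnt/rambrick
    mkdir /mnt/rambrick/brick

    # on one VM: create and start a plain distributed volume over TCP
    gluster volume create ramvol transport tcp \
        vm1:/mnt/rambrick/brick vm2:/mnt/rambrick/brick \
        vm3:/mnt/rambrick/brick vm4:/mnt/rambrick/brick
    gluster volume start ramvol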
Striping/dispersed didn't work (which, in some ways, doesn't surprise me), and neither did distributed dispersed.
Four VMs, with each VM contributing 2 GB of RAM, to create a distributed GlusterFS volume with a total size of 8 GB worked.
But since they were VMs, the transport was only gigabit TCP, so I am going to try it again tonight on my actual compute nodes with 100 Gbps InfiniBand and RDMA.
The nodes won't see the GlusterFS distributed volume built from tmpfs mount points/bricks as RAM, but they will (and do) see it as a distributed volume of tmpfs-backed bricks, much like NFS or any other distributed file system.
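(For anyone following along, the client side is just the normal FUSE mount, e.g.:

    mount -t glusterfs vm1:/ramvol /mnt/scratch

and every node mounts it that way, including the nodes that are serving the bricks.)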
Correction: all will wear out over time. The trade-off is cost versus the amount of time.
How do I figure out the write endurance of RAM modules/chips?
So what's the real question? All I know is you have some bizarre use case whereby you MUST use "fast" (marketing term) persistent storage as a "placeholder" (virtual memory) for something. Why are you flushing to swap so much? Maybe your real question revolves around "I don't have enough RAM for my use case, what can I do?"
Swap is not "primary" storage, I consider it backup. Get enough RAM and be done.
Or I'm missing the point.
I don't think that I've ever said anything about it needing to be persistent.
High performance computing (HPC)/computer aided engineering (CAE)/finite element analysis (FEA)/computational fluid dynamics (CFD) applications can have very high memory usage requirements.
Direct sparse solver solutions for FEA models of only a couple hundred thousand quad shell elements can, with little effort, produce scratch files of about 11 GB as part of the solution process (even when the solution runs "in-core"). These scratch files are written to the scratch directory during the solution, and disabling them actually makes the total solution time LONGER, not shorter.
At 128 GB of RAM per node (512 GB across four nodes), the solver has enough RAM to solve the sparse direct matrix solution in-core (i.e. all in RAM). But that still doesn't stop it from scratching to disk during the course of the run. (Please refer to the MSC.NASTRAN User's Guide, as an example.)
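(And just to make it concrete: the whole point is to be able to point that scratch directory at the fast mount. For MSC Nastran, that's normally done on the submission command line with something along these lines - the job file name is hypothetical and the exact keyword spelling varies by version, so treat this as a sketch and check your install's documentation:

    # sdirectory points the solver's scratch at the RAM-backed mount
    nastran myjob.dat sdirectory=/mnt/scratch

where /mnt/scratch would be the RAM-backed parallel/distributed volume.)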
So, again, my question hasn't changed - how do I export tmpfs over pNFS such that, if I allocate half of the RAM (64 GB) from each node, I would be able to create a pNFS volume with a total size of 4 × 64 GB = 256 GB?
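(The per-node allocation itself is the easy half - something like this on each 128 GB node, with the mount point name being just a placeholder:

    mount -t tmpfs -o size=64g tmpfs /mnt/ramscratch

It's the pNFS layer on top of those four tmpfs mounts that I haven't been able to get working.)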
And I'm doing this because the nodes don't have PCIe 3.0 x4 M.2 slots for NVMe SSDs (and, like you said, the NAND flash memory will wear out - they all do); SATA 6 Gbps SSDs can't keep up with the 100 Gbps 4x EDR InfiniBand interconnect; and I don't have any free PCIe 3.0 x16 slots in the blades because they're already taken up by the Mellanox ConnectX-4 dual-port 4X EDR IB NICs.
So the idea was that a RAM drive was one of the options for creating a fast volume where all four nodes would be the data servers AND the clients simultaneously.
(Most of the research and papers that I've seen have one server as the MDS, different servers as the DSes, and separate systems again as the clients. What I'm looking to do is have node1 be the MDS, a DS, and a client; have nodes 2 through 4, inclusive, each be both a DS and a client; and have the MDS control all four DSes together.)
I tried doing that, and that didn't work. Hence my question.
GlusterFS was able to do that, but now I have to try it again on the "real" blade/node hardware to see whether writing a large sequential file to the 64 GB tmpfs mount point can be done as fast as the local system writes to /dev/shm, or whether writing to a GlusterFS volume whose bricks are tmpfs mount points will always be slower.
(I also suspect, based on the testing that I did last night with the VMs, that not being able to stripe across the tmpfs mount points is hurting the maximum practical throughput, but I won't be able to tell what impact that has until I test it tonight on my actual nodes, with the Gluster volume created using "transport rdma" instead of TCP.)
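The rough plan for tonight's run, with the node names and paths as placeholders (and note that I still need to confirm the RDMA mount details against my Gluster version):

    # same tmpfs bricks as before, but create the volume over RDMA instead of TCP
    gluster volume create ramvol transport rdma \
        node1:/mnt/rambrick/brick node2:/mnt/rambrick/brick \
        node3:/mnt/rambrick/brick node4:/mnt/rambrick/brick
    gluster volume start ramvol

Then a crude sequential-write comparison on one node, local tmpfs versus the Gluster volume (conv=fdatasync so dd waits for the data to be flushed before reporting a speed):

    dd if=/dev/zero of=/dev/shm/testfile bs=1M count=8192 conv=fdatasync
    dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=8192 conv=fdatasync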
Again, to recapitulate:
The goals and objectives of this are:
1) Have a super fast mount point that's accessible by all nodes/clients on the 100 Gbps interconnect
2) Increase/improve the write endurance of the mount point (I can't seem to find a source that tells me how to calculate the write endurance of RAM chips/modules)
3) Improve the scratch disk performance so that it's faster than a SATA 6 Gbps SSD, and also either mitigate or remove the write endurance problem entirely through the use of RAM drives (tmpfs mount points), whether individually/locally on each node or as a parallel file system, when each node is connected to the 100 Gbps InfiniBand system interconnect/network
4) Minimise the risk of a single point of failure (which would be the case if I had a head node with SSDs (NVMe or SATA 6 Gbps) in a RAID logical volume, whether through a RAID HBA, PCH RAID, or LVM RAID)
(But I will consider using a RAID array on a separate head node, built from SATA 6 Gbps SSDs, and then exporting the RAID volume via NFS over RDMA, if that proves to be faster and is also able to distribute the wear across the SSDs within the RAID volume.)
Thanks.