Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] UC EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage
From: David Turner (dpturner_at_[hidden])
Date: 2011-11-04 19:43:35


Indeed, my terminology is inexact. I believe you are correct; our
diskless nodes use tmpfs, not ramdisk. Thanks for the clarification!

On 11/4/11 11:00 AM, Rushton Martin wrote:
> There appears to be some confusion about ramdisks and tmpfs. A ramdisk
> sets aside a fixed amount of memory for its exclusive use, so that a
> file being written to ramdisk goes first to the cache, then to ramdisk,
> and may exist in both for some time. tmpfs however opens up the cache
> to programs so that a file being written goes to cache and stays there.
> The "size" of a tmpfs pseudo-disk is the maximum it can grow to (which
> according to the mount man page defaults to 50% of memory). Hence only
> enough memory to hold the data is actually used which ties up with David
> Turner's figures.
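The distinction above shows up in how a tmpfs is mounted: the size option is a ceiling, not an up-front reservation. A minimal sketch (paths and sizes are illustrative, and mounting requires root):

```shell
# Mount a tmpfs at /tmp with a 12 GB ceiling. Memory is consumed only as
# files are actually written; the 12 GB is a cap, not a reservation.
mount -t tmpfs -o size=12g tmpfs /tmp

# Equivalent /etc/fstab entry:
#   tmpfs  /tmp  tmpfs  size=12g  0  0
```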
>
> You can easily tell which method is in use from df. A traditional
> ramdisk will appear as /dev/ramN (N = 0, 1 ...) whereas a tmpfs device
> will be a simple name, often tmpfs. I would guess that the single "-"
> in David's df command is precisely this. On our diskless nodes root
> shows as device compute_x86_64, whilst /tmp, /dev/shm and /var/tmp show
> as "none".
>
> HTH,
>
> Martin Rushton
> HPC System Manager, Weapons Technologies
> Tel: 01959 514777, Mobile: 07939 219057
> email: jmrushton_at_[hidden]
> www.QinetiQ.com
> QinetiQ - Delivering customer-focused solutions
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Blosch, Edwin L
> Sent: 04 November 2011 16:19
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
> /tmp for OpenMPI usage
>
> OK, I wouldn't have guessed that the space for /tmp isn't actually in
> RAM until it's needed. That's the key piece of knowledge I was missing;
> I really appreciate it. So you can allow /tmp to be reasonably sized,
> but if you aren't actually using it, then it doesn't take up 11 GB of
> RAM. And you prevent users from crashing the node by setting mem limit
> to 4 GB less than the available memory. Got it.
>
> I agree with your earlier comment: these are fairly common systems now.
> We have program- and owner-specific disks where I work, and after the
> program ends, the disks are archived or destroyed. Before the stateless
> configuration option, the entire computer, nodes and switches as well as
> disks, were archived or destroyed after each program. Not too
> cost-effective.
>
> Is this a reasonable final summary? OpenMPI uses temporary files in
> such a way that it is performance-critical that these so-called
> session files, used for shared-memory communications, be "local". For
> state-less clusters, this means the node image must include a /tmp or
> /wrk partition, intelligently sized so as not to enable an application
> to exhaust the physical memory of the node, and care must be taken not
> to mask this in-memory /tmp with an NFS mounted filesystem. It is not
> uncommon for cluster enablers to exclude /tmp from a typical base Linux
> filesystem image or mount it over NFS, as a means of providing users
> with a larger-sized /tmp that is not limited to a fraction of the node's
> physical memory, or to avoid garbage accumulation in /tmp taking up the
> physical RAM. But not having /tmp or mounting it over NFS is not a
> viable stateless-node configuration option if you intend to run OpenMPI.
> Instead you could have a /bigtmp which is NFS-mounted and a /tmp which
> is local, for example. Starting in OpenMPI 1.7.x, shared-memory
> communication will no longer go through memory-mapped files, and
> vendors/users will no longer need to be vigilant concerning this OpenMPI
> performance requirement on stateless node configuration.
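If a node does have local scratch space under a different name, Open MPI can also be pointed away from /tmp at run time. A hedged example using the 1.x-series MCA parameter orte_tmpdir_base (verify the parameter name with ompi_info on your installation; the path is illustrative):

```shell
# Place Open MPI's session directories under a node-local path instead
# of the default /tmp.
mpirun --mca orte_tmpdir_base /local/scratch -np 16 ./my_mpi_app

# The same parameter can be set through the environment:
#   export OMPI_MCA_orte_tmpdir_base=/local/scratch
```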
>
>
> Is that a reasonable summary?
>
> If so, would it be helpful to include this as an FAQ entry under General
> category? Or the "shared memory" category? Or the "troubleshooting"
> category?
>
>
> Thanks
>
>
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of David Turner
> Sent: Friday, November 04, 2011 1:38 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
> /tmp for OpenMPI usage
>
> % df /tmp
> Filesystem 1K-blocks Used Available Use% Mounted on
> - 12330084 822848 11507236 7% /
> % df /
> Filesystem 1K-blocks Used Available Use% Mounted on
> - 12330084 822848 11507236 7% /
>
> That works out to 11GB. But...
>
> The compute nodes have 24GB. Freshly booted, about 3.2GB is consumed by
> the kernel, various services, and the root file system.
> At this time, usage of /tmp is essentially nil.
>
> We set user memory limits to 20GB.
>
> I would imagine that the size of the session directories depends on a
> number of factors; perhaps the developers can comment on that. I have
> only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.
>
> As long as they're removed after each job, they don't really compete
> with the application for available memory.
>
> On 11/3/11 8:40 PM, Ed Blosch wrote:
>> Thanks very much, exactly what I wanted to hear. How big is /tmp?
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On Behalf Of David Turner
>> Sent: Thursday, November 03, 2011 6:36 PM
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
>> /tmp for OpenMPI usage
>>
>> I'm not a systems guy, but I'll pitch in anyway. On our cluster, all
>> the compute nodes are completely diskless. The root file system,
>> including /tmp, resides in memory (ramdisk). OpenMPI puts these
>> session directories therein. All our jobs run through a batch system
>> (torque). At the conclusion of each batch job, an epilogue process
>> runs that removes all files belonging to the owner of the current
>> batch job from /tmp (and also looks for and kills orphan processes
>> belonging to the user). This epilogue had to be written by our
>> systems staff.
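The cleanup step of such an epilogue can be sketched roughly as follows (hypothetical; Torque passes the job owner's username to the epilogue as an argument, and each site should adapt paths and policies):

```shell
#!/bin/sh
# Sketch of a batch-epilogue cleanup step (hypothetical): remove every
# top-level entry in a node-local scratch directory owned by a given user.
clean_user_tmp() {
    dir="$1"
    user="$2"
    # -mindepth 1 spares the directory itself; -user matches ownership.
    find "$dir" -mindepth 1 -maxdepth 1 -user "$user" -exec rm -rf {} +
}

# A real epilogue would also kill orphan processes, e.g.:
#   pkill -9 -u "$user"
# and would be invoked with the job owner's name, e.g.:
#   clean_user_tmp /tmp "$jobowner"
```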
>>
>> I believe this is a fairly common configuration for diskless clusters.
>>
>> On 11/3/11 4:09 PM, Blosch, Edwin L wrote:
>>> Thanks for the help. A couple of follow-up questions; maybe this
>>> starts to go outside OpenMPI:
>>>
>>> What's wrong with using /dev/shm? I think you said earlier in this
>>> thread that this was not a safe place.
>>>
>>> If the NFS-mount point is moved from /tmp to /work, would a /tmp
>>> magically appear in the filesystem for a stateless node? How big
>>> would it be, given that there is no local disk? That may be
>>> something I have to ask the vendor, which I've tried, but they
>>> don't quite seem to get the question.
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>> On Behalf Of Ralph Castain
>>> Sent: Thursday, November 03, 2011 5:22 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
>>> /tmp for OpenMPI usage
>>>
>>>
>>> On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:
>>>
>>>> I might be missing something here. Is there a side-effect or
>>>> performance loss if you don't use the sm btl? Why would it exist
>>>> if there is a wholly equivalent alternative? What happens to
>>>> traffic that is intended for another process on the same node?
>>>
>>> There is a definite performance impact, and we wouldn't recommend
>>> doing what Eugene suggested if you care about performance.
>>>
>>> The correct solution here is to get your sys admin to make /tmp
>>> local. Making /tmp NFS-mounted across multiple nodes is a major
>>> "faux pas" in the Linux world - it should never be done, for the
>>> reasons stated by Jeff.
>>>
>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>> On Behalf Of Eugene Loh
>>>> Sent: Thursday, November 03, 2011 1:23 PM
>>>> To: users_at_[hidden]
>>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less
>>>> node /tmp for OpenMPI usage
>>>>
>>>> Right. Actually "--mca btl ^sm". (Was missing "btl".)
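Spelled out as a full command line (a sketch; the caret excludes the listed component from selection):

```shell
# Exclude the shared-memory BTL; on-node traffic then falls back to
# another transport such as TCP loopback, typically at a performance cost.
mpirun --mca btl ^sm -np 8 ./my_mpi_app
```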
>>>>
>>>> On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
>>>>> I don't tell OpenMPI what BTLs to use. The default uses sm and
>>>>> puts a session file on /tmp, which is NFS-mounted and thus not a
>>>>> good choice.
>>>>>
>>>>> Are you suggesting something like --mca ^sm?
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: users-bounces_at_[hidden]
>>>>> [mailto:users-bounces_at_[hidden]] On Behalf Of Eugene Loh
>>>>> Sent: Thursday, November 03, 2011 12:54 PM
>>>>> To: users_at_[hidden]
>>>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less
>>>>> node /tmp for OpenMPI usage
>>>>>
>>>>> I've not been following closely. Why must one use shared-memory
>>>>> communications? How about using other BTLs in a "loopback" fashion?
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>>
>
>
> --
> Best regards,
>
> David Turner
> User Services Group email: dpturner_at_[hidden]
> NERSC Division phone: (510) 486-4027
> Lawrence Berkeley Lab fax: (510) 486-4316
>

-- 
Best regards,
David Turner
User Services Group        email: dpturner_at_[hidden]
NERSC Division             phone: (510) 486-4027
Lawrence Berkeley Lab        fax: (510) 486-4316