Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] shm unlinking
From: Rushton Martin (JMRUSHTON_at_[hidden])
Date: 2011-04-14 11:49:30


QLogic IBA 7220

That is interesting in itself: the IB hasn't worked properly since the
cluster was delivered.

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton_at_[hidden]
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Jeff Squyres
Sent: 14 April 2011 16:41
To: Open MPI Users
Subject: Re: [OMPI users] shm unlinking

They could be from OMPI -- are you using QLogic IB NICs? That's the
only thing named "PSM" in Open MPI.

On Apr 14, 2011, at 9:46 AM, Rushton Martin wrote:

> A typical file is called
> /dev/shm/psm_shm.41e04667-f3ba-e503-8464-db6c209b3430
>
> I had assumed that these were from OMPI, but clearly I could be wrong.
> They vary in size but are typically 42MiB. That is only 0.2% of our
> small diskless nodes' memory, but put a dozen in there and they start
> to be noticed. lsof shows that all the processes in a particular job
> have the same one open; the other files are associated chronologically
> with failed jobs.
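
[Editorial sketch] The lsof observation above suggests a per-file cleanup rule rather than a per-user one: a psm_shm file that no live process has open or mapped is a candidate for removal. What follows is a minimal sketch of that idea, not anything supplied by QLogic or Open MPI. The function names `open_shm_paths` and `stale_shm_files` are hypothetical, the `psm_shm.*` pattern is taken from the file name quoted above, and the code assumes a Linux `/proc` layout. It needs to run as root, because entries it cannot read are skipped and their files would then look unused.

```python
import glob
import os


def open_shm_paths(proc_root="/proc", shm_dir="/dev/shm"):
    """Return the set of paths under shm_dir that some process has open or mapped."""
    in_use = set()
    prefix = shm_dir.rstrip("/") + "/"
    for pid_dir in glob.glob(os.path.join(proc_root, "[0-9]*")):
        # Open file descriptors: each entry is a symlink to the target file.
        for fd in glob.glob(os.path.join(pid_dir, "fd", "*")):
            try:
                target = os.readlink(fd)
            except OSError:
                continue  # process exited, or we lack permission
            if target.startswith(prefix):
                in_use.add(target)
        # Memory mappings: the path, if present, is the sixth field of each line.
        try:
            with open(os.path.join(pid_dir, "maps")) as maps:
                for line in maps:
                    fields = line.split(None, 5)
                    if len(fields) == 6:
                        path = fields[5].strip()
                        if path.startswith(prefix):
                            in_use.add(path)
        except OSError:
            continue  # process exited, or we lack permission
    return in_use


def stale_shm_files(proc_root="/proc", shm_dir="/dev/shm", pattern="psm_shm.*"):
    """List files matching pattern in shm_dir that no process has open or mapped."""
    in_use = open_shm_paths(proc_root, shm_dir)
    candidates = glob.glob(os.path.join(shm_dir, pattern))
    return sorted(p for p in candidates if p not in in_use)
```

Calling `stale_shm_files()` only lists candidates; it deletes nothing. There is an unavoidable gap between the check and any later `os.remove`, so a job that starts in between could lose its file. Running the check from a scheduler epilogue, once the finished job's processes are gone, is one way to narrow that window.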
>
> HTH
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Jeff Squyres
> Sent: 14 April 2011 14:33
> To: Open MPI Users
> Subject: Re: [OMPI users] shm unlinking
>
> On Apr 14, 2011, at 9:22 AM, Rushton Martin wrote:
>
>> For your information: we were supplied with a script when we bought
>> the cluster, but the original script made the assumption that all
>> processes and shm files belonging to a specific user ought to be
>> deleted. This is a problem if users submit jobs which only half fill
>> a node and the second job starts on the same node as the first one.
>> The first job to finish causes the continuing job to stop dead. We
>> therefore had to disable any cleanup to allow jobs to run. Now we are
>> finding a slow fill up with the shm files and I need to do something;
>> at least now I have a way forward.
>
> Note that Open MPI v1.4.x is likely using mmap files by default --
> these should be under /tmp/ somewhere. If they get left around, they
> can cause shared memory to fill up, but they should be unrelated to
> the /dev/shm kinds of things. If you're seeing /dev/shm fill up, that
> might be due to something else.
>
> Also, I'm a little confused by your reference to psm_shm... are you
> talking about the QLogic PSM device? If that does some tomfoolery
> with /dev/shm somewhere, I'm unaware of it (i.e., I don't know
> much/anything about what that device does internally).
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
