Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] shm unlinking
From: Rushton Martin (JMRUSHTON_at_[hidden])
Date: 2011-04-14 11:49:30


QLogic IBA 7220

Which is interesting in itself: the IB hasn't worked properly since the
cluster was delivered.

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton_at_[hidden]
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Jeff Squyres
Sent: 14 April 2011 16:41
To: Open MPI Users
Subject: Re: [OMPI users] shm unlinking

They could be from OMPI -- are you using QLogic IB NICs? That's the
only thing named "PSM" in Open MPI.

On Apr 14, 2011, at 9:46 AM, Rushton Martin wrote:

> A typical file is called
> /dev/shm/psm_shm.41e04667-f3ba-e503-8464-db6c209b3430
>
> I had assumed that these were from OMPI, but clearly I could be wrong.
> They vary in size but are typically 42 MiB each. That is only 0.2% of
> our small diskless nodes' memory, but put a dozen in there and they
> start to be noticed. lsof shows that all the processes in a particular
> job have the same file open; the other files correspond
> chronologically to failed jobs.
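
To put numbers on the paragraph above: a dozen files at roughly 0.2% of memory each comes to about 2.4% of a node's memory. The following is a minimal Python sketch, not part of the original thread, for tallying how many such files exist and how much space they take. The function names (tally_segments, percent_of) are hypothetical, and the "psm_shm.*" pattern is taken from the example filename quoted above.

```python
import fnmatch
import os


def tally_segments(directory="/dev/shm", pattern="psm_shm.*"):
    """Return (count, total_bytes) for regular files in `directory`
    whose names match `pattern`. Symlinks are not followed."""
    count = 0
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if fnmatch.fnmatch(entry.name, pattern):
                count += 1
                total += entry.stat(follow_symlinks=False).st_size
    return count, total


def percent_of(total_bytes, memory_bytes):
    """Express total_bytes as a percentage of memory_bytes."""
    return 100.0 * total_bytes / memory_bytes


if __name__ == "__main__":
    # Only report when the directory exists on this machine.
    if os.path.isdir("/dev/shm"):
        n, size = tally_segments()
        print("%d matching files, %.1f MiB in total" % (n, size / 2**20))
```

Running this on each node (for example from a scheduler epilogue) would show how quickly the leftovers accumulate; the node's memory size has to be supplied separately to percent_of.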
>
> HTH
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Jeff Squyres
> Sent: 14 April 2011 14:33
> To: Open MPI Users
> Subject: Re: [OMPI users] shm unlinking
>
> On Apr 14, 2011, at 9:22 AM, Rushton Martin wrote:
>
>> For your information: we were supplied with a script when we bought
>> the cluster, but the original script made the assumption that all
>> processes and shm files belonging to a specific user ought to be
>> deleted. This is a problem if users submit jobs which only half fill
>> a node and the second job starts on the same node as the first one.
>> The first job to finish causes the continuing job to stop dead. We
>> therefore had to disable any cleanup to allow jobs to run. Now we
>> are finding a slow fill up with the shm files and I need to do
>> something; at least now I have a way forward.
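
One way around the problem described above, where cleaning up by user kills a second job sharing the node, is to remove only the files that no live process still holds. The sketch below is an illustration under stated assumptions, not the vendor's script and not anything Open MPI or PSM provides: it assumes a Linux /proc layout, assumes it is run as root (otherwise other users' processes cannot be inspected and their files would wrongly look orphaned), and uses hypothetical function names. It only lists candidates; it deletes nothing.

```python
import os


def paths_in_use(proc_root="/proc"):
    """Collect paths that some live process has open or memory-mapped."""
    in_use = set()
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        base = os.path.join(proc_root, pid)
        # Open descriptors appear as symlinks under /proc/<pid>/fd.
        try:
            for fd in os.listdir(os.path.join(base, "fd")):
                try:
                    in_use.add(os.readlink(os.path.join(base, "fd", fd)))
                except OSError:
                    pass  # descriptor closed while we were looking
        except OSError:
            pass  # process exited, or permission denied
        # Mapped files are the optional sixth field of /proc/<pid>/maps.
        # A file can stay mapped after its descriptor has been closed.
        try:
            with open(os.path.join(base, "maps"), errors="replace") as maps:
                for line in maps:
                    fields = line.split(None, 5)
                    if len(fields) == 6:
                        in_use.add(fields[5].strip())
        except OSError:
            pass
    return in_use


def orphaned_segments(shm_dir="/dev/shm", prefix="psm_shm.", proc_root="/proc"):
    """Return sorted paths in shm_dir with the given prefix that no
    live process has open or mapped."""
    in_use = paths_in_use(proc_root)
    found = []
    for name in sorted(os.listdir(shm_dir)):
        path = os.path.join(shm_dir, name)
        if name.startswith(prefix) and path not in in_use:
            found.append(path)
    return found


if __name__ == "__main__":
    # Dry run: print candidates only; removal is left to the admin.
    if os.path.isdir("/dev/shm") and os.path.isdir("/proc"):
        for path in orphaned_segments():
            print("candidate for removal:", path)
```

Because a job that is still running keeps its file open or mapped, it never shows up as a candidate, regardless of which user owns it. A production version would probably also want a minimum file age, to avoid racing with a job that is just starting.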
>
> Note that Open MPI v1.4.x is likely using mmap files by default --
> these should be under /tmp/ somewhere. If they get left around, they
> can cause shared memory to be filled up, but they should also be
> unrelated to /dev/shm kinds of things. If you're seeing /dev/shm fill
> up, that might be due to something else.
>
> Also, I'm a little confused by your reference to psm_shm... are you
> talking about the QLogic PSM device? If that does some tomfoolery
> with /dev/shm somewhere, I'm unaware of it (i.e., I don't know
> much/anything about what that device does internally).
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> This email and any attachments to it may be confidential and are
> intended solely for the use of the individual to whom it is addressed.
> If you are not the intended recipient of this email, you must neither
> take any action based upon its contents, nor copy or show it to
> anyone. Please contact the sender if you believe you have received
> this email in error. QinetiQ may monitor email traffic data and also
> the content of email for the purposes of security. QinetiQ Limited
> (Registered in England & Wales: Company Number: 3796233) Registered
> office: Cody Technology Park, Ively Road, Farnborough, Hampshire, GU14
> 0LX http://www.qinetiq.com.

The QinetiQ e-mail privacy policy and company information is detailed elsewhere in the body of this email.