
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] locked memory problem
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-16 18:12:53

Can you check what the locked memory limits are *inside of a
job*? They can be different from what you see when you log in to the
node independently, outside of an LSF job.

For example, write a quickie script that runs "ulimit -a" and submit
that through LSF and see what results you get. Better yet, use
something like this (typed off the top of my head -- not tested for
correctness/typos at all):


#!/bin/csh -f
# runme.csh: print this host's locked-memory limit.
# Note: csh's builtin is "limit" (not "limit -l"); "limit memorylocked"
# prints just the memlock limit.
set l=`limit memorylocked`
echo `hostname`: $l
exit 0


#!/bin/csh -f
# submitme.csh: submit this to LSF; it mpiruns runme.csh on each node.
mpirun runme.csh

That is, submit the submitme.csh script to LSF and have it mpirun the
runme.csh script so that you can see the limits on all the nodes that
you requested.
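As a variation on the same idea, here is a plain-sh sketch (the script name `check_memlock.sh` and the wording of its messages are illustrative, not from the original post) that flags any host whose memlock limit is not "unlimited", which makes a mismatched node stand out in the aggregated mpirun output:

```shell
#!/bin/sh
# check_memlock.sh -- hypothetical helper: print this host's locked-memory
# limit, and flag hosts where it is not "unlimited".
host=`hostname`
lim=`ulimit -l`
if [ "$lim" = "unlimited" ]; then
    echo "$host: memlock $lim (ok)"
else
    echo "$host: memlock $lim (possibly too low for openib)"
fi
```

Running it under mpirun the same way (`mpirun check_memlock.sh` from an LSF job script) shows each node's limit as seen by the actual job environment, which is what the openib BTL cares about.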

On Jun 11, 2008, at 5:59 PM, twurgl_at_[hidden] wrote:

> I get the locked memory error as follows:
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> [node10:10395] [0,0,0]-[0,1,6] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> The OpenIB BTL failed to initialize while trying to allocate some
> locked memory. This typically can indicate that the memlock limits
> are set too low. For most HPC installations, the memlock limits
> should be set to "unlimited". The failure occurred here:
> Host: node10
> OMPI source: btl_openib.c:830
> Function: ibv_create_cq()
> Device: mlx4_0
> Memlock limit: 32768
> You may need to consult with your system administrator to get this
> problem fixed. This FAQ entry on the Open MPI web site may also be
> helpful:
> --------------------------------------------------------------------------
> I've read the above FAQ and still have problems. Here is the
> scenario. All cluster nodes are (supposedly) the same.
> I can run just fine on all except a few nodes. For testing, I have
> closed all the nodes, and when I submit the job, LSF puts the job in
> PENDING state.
> Now if I use
> brun -m "node1 node10" jobid
> to release the job, it runs fine.
> But if I use
> brun -m "node10 node1" jobid
> it fails with the above OPENMPI error.
> I've checked the ulimit -a on all nodes, it is set to unlimited.
> I've added a .bashrc file and set the ulimit in there, as well as in
> my .cshrc file
> (I start in a csh shell and the jobs run in sh).
> I've compared environment settings and everything else I can think
> of. 3 nodes have the (bad) behaviour if they happen to be the lead
> node, and run fine if they are not; the rest of the nodes run fine
> in either position.
> Anyone have any ideas about this?
> thanks!
> tom
> _______________________________________________
> users mailing list
> users_at_[hidden]

Jeff Squyres
Cisco Systems