
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-06-07 08:09:44


On Jun 6, 2007, at 5:44 PM, Michael Edwards wrote:

> I am running open-mpi 1.1.1-1 compiled from OFED1.1 which I downloaded
> from their website.

You might want to upgrade your Open MPI installation; the current
stable version is 1.2.2 (1.2.3 is due shortly and fixes a few minor
regressions that crept into 1.2.2). You can upgrade OMPI
independently of OFED. Use the "--with-openib=/usr/local/ofed" option
to OMPI's configure to pick up the OFED 1.1 installation (or, if you
used a different OFED prefix, use that as the value for the
--with-openib flag).

> I am using SGE installed via OSCAR 5.0 and when running under SGE I
> get the "mca_mpool_openib_register: ibv_reg_mr(0x590000,528384) failed
> with error: Cannot allocate memory" error discussed at length in your
> FAQ.
>
> When I run from the command line using mpirun, I don't get the errors.
> Of course, I don't know how to tell if the code is actually using the
> IB interface instead of the GigE network...

You can tell in two ways:

1. You can force the IB network to be used:

        mpirun --mca btl openib,self ...

Alternatively, you can force the use of the gigE network:

        mpirun --mca btl tcp,self ...

2. Run any benchmark application and look at the bandwidth/latency:
the numbers over IB should be obviously far better than over the gigE
network. For example, with NetPIPE (http://www.scl.ameslab.gov/netpipe/):

        mpirun -np 2 NPmpi
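
Combining the two approaches above, a minimal sketch of a side-by-side
comparison could look like the following. The DRY_RUN variable and the
netpipe_cmd helper are hypothetical names introduced only for this
illustration; the mpirun options are the ones shown above.

```shell
#!/bin/bash
# Build (and optionally run) the same NetPIPE job over a forced BTL.
# DRY_RUN=1 (the default here) only prints each command, so the script
# is safe to try on a machine without Open MPI or NetPIPE installed.
DRY_RUN="${DRY_RUN:-1}"

netpipe_cmd() {
    # $1 is the transport to force: "openib" (IB) or "tcp" (gigE).
    echo "mpirun -np 2 --mca btl $1,self NPmpi"
}

for btl in openib tcp; do
    cmd=$(netpipe_cmd "$btl")
    echo "== $btl: $cmd"
    if [ "$DRY_RUN" != "1" ]; then
        $cmd
    fi
done
```

Set DRY_RUN=0 to actually launch the two runs. For the numbers to
reflect the network, the two processes need to land on different nodes.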

> I tried the suggestions in the FAQ regarding setting the memlock
> parameter in /etc/security/limits.conf, and all the nodes return
> "unlimited" in response to "ulimit -l" after rebooting the nodes. The
> problem persists under SGE and still does not appear when simply using
> mpirun.

The problem is that the SGE daemons are not starting with these
memory limits. Therefore, processes that start under SGE inherit the
low memory limits, and things go badly from there.
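
One way to confirm this is to print the limit from inside a job and
compare it with what an interactive shell reports. The following is a
minimal sketch; the report_memlock function name is made up for this
example.

```shell
#!/bin/bash
# Print the locked-memory limit that this process actually inherited.
report_memlock() {
    echo "memlock limit on $(uname -n): $(ulimit -l)"
}

report_memlock
```

Run it once from a login shell and once as an SGE job (for example,
saved as a hypothetical check_memlock.sh and submitted with something
like "qsub -cwd -j y check_memlock.sh"). If the login shell prints
"unlimited" and the SGE job prints a small number, the limit is being
lost somewhere on the SGE side.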

I'm afraid I'm not familiar enough with SGE to know how to fix this.
One Big Thing to check is that when the SGE daemons are started at
init.d/boot time, they have the proper "unlimited" memory locked
limits. Then processes that start under SGE should inherit the
"unlimited" value and be ok. That being said, SGE may also
specifically override the memory locked limits (some resource
managers can do this based on site-wide policies). Check to see if
SGE is doing this.
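
As a rough sketch of the daemon check described above: on Linux kernels
that expose /proc/<pid>/limits (an assumption; older kernels do not
have this file), the limits of a running process can be read directly.
The sketch also assumes the SGE execution daemon is named sge_execd,
which may differ on your installation; memlock_of_pid is a hypothetical
helper name.

```shell
#!/bin/bash
# Show the locked-memory limit of an already-running process.
# Relies on /proc/<pid>/limits, which only newer Linux kernels provide.
memlock_of_pid() {
    grep "Max locked memory" "/proc/$1/limits"
}

# Assumption: the SGE execution daemon process is named sge_execd.
# If no such process is found, the loop simply prints nothing.
for pid in $(pgrep -x sge_execd); do
    echo "sge_execd pid $pid:"
    memlock_of_pid "$pid"
done
```

Run this on a compute node. If the daemon shows a small soft or hard
limit rather than "unlimited", jobs started by it will inherit that
value regardless of what limits.conf gives to login shells.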

> I assumed it would work since openmpi 1.1.1 was included as working
> with SGE in OSCAR 5.0, but I don't know how different that version and
> the one included with OFED are.
>
> Any suggestions would be appreciated.

-- 
Jeff Squyres
Cisco Systems