Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-06-21 10:19:59


Two things:

1. You might want to update your version of Open MPI if possible; the
v1.1.1 version is quite old. We have added many new bug fixes and
features since v1.1.1 (including tight SGE integration). There is
nothing special about the Open MPI that is included in the OFED
distribution; you can download a new version from the Open MPI web
site (the current stable version is v1.2.3), configure, compile, and
install it with your current OFED installation. You should be able
to configure Open MPI with:

        ./configure --with-openib=/usr/local/ofed ...

(assuming you chose the default location to install OFED). You'll
probably also want to specify a --prefix to install Open MPI to a
specific location, etc.
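
As a rough sketch of that rebuild (not a tested recipe): the OFED path is the default location mentioned above, while the install prefix, the DRY_RUN switch, and the run/build_openmpi helper names are assumptions made up for illustration. It is meant to be run from the top of an unpacked Open MPI source tree, and by default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: rebuild Open MPI against an existing OFED installation.
# Assumptions (not from the original message): the install prefix,
# the DRY_RUN switch, and the helper function names.

OFED_DIR="${OFED_DIR:-/usr/local/ofed}"      # default OFED install location
PREFIX="${PREFIX:-$HOME/openmpi-install}"    # hypothetical install prefix
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

# Configure with OpenFabrics support, then build and install.
build_openmpi() {
    run ./configure --with-openib="$OFED_DIR" --prefix="$PREFIX" &&
    run make all install
}

build_openmpi
```

Setting DRY_RUN=0 would actually execute the steps. Remember to point PATH and LD_LIBRARY_PATH at the new prefix afterwards so that jobs pick up the new build rather than the OFED-bundled one.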

2. I know little/nothing about SGE, but I'm assuming that you need to
have SGE pass the proper memory lock limits to new processes. In an
interactive login, you showed that the max limit is "8162952" -- you
might just want to make it unlimited, unless you have a reason for
limiting it. See
http://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
for details.
Additionally, I *assume* that jobs running under SGE get different
locked-memory limits than interactive logins do (most resource
managers set their own). You need to find out how to set the
locked-memory limit for jobs running under SGE; I'd suggest making
the value "unlimited".
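
One way to test that assumption is to have the job itself report the limit it inherited and compare that with the interactive value. The following is only a sketch: the check_memlock helper name is made up for illustration, and it assumes a Bourne-style shell (bash here) rather than the csh named in your job file.

```shell
#!/bin/bash
# Sketch: report the locked-memory limit that this process inherited.
# Put these lines near the top of a job script, before mpirun, and
# compare the job's output with what an interactive login shows.

# Classify a value as printed by `ulimit -l` (kbytes or "unlimited").
# check_memlock is a hypothetical helper name, not an SGE or Open MPI tool.
check_memlock() {
    limit="$1"
    if [ "$limit" = "unlimited" ]; then
        echo "ok: max locked memory is unlimited"
    else
        echo "warning: max locked memory is limited to ${limit} kbytes"
    fi
}

echo "host: $(uname -n)"
check_memlock "$(ulimit -l)"
```

If the value printed inside the job is much smaller than the one from the interactive login, the limit is being set by whatever starts the job's processes, and that is the place to raise it.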

On Jun 21, 2007, at 9:26 AM, sadfub_at_[hidden] wrote:

> Hi,
>
> I'm having some really strange error causing me some serious
> headaches.
> I want to integrate OpenMPI version 1.1.1 from the OFED package
> version
> 1.1 with SGE version 6.0. For mvapich all works, but for OpenMPI
> not ;(.
> Here is my jobfile and error message:
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/
> lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile
> $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/
> IMB-MPI1
>
> ERRORMESSAGE:
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating
> low priority cq for mthca0 errno says Cannot allocate memory
>
> ----------------------------------------------------------------------
> ----
> It looks like MPI_INIT failed for some reason; your parallel
> process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> ----------------------------------------------------------------------
> ----
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> MPI_Job.e111975 (END)
>
>
> If I run the OMPI job just with out SGE => everything works e.g. the
> following command:
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> If I do this with static machinefiles, it works too:
> $ cat /tmp/machines
> node04
> node04
> node04
> node04
>
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile
> /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> And if I run this in a jobscript it works even with a static
> machinefile
> (not shown below):
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/
> lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> There are correct ulimits for all nodes in the cluster e.g. node04:
> -sh-3.00$ ssh node04 ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) 8162952
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) 10240
> cpu time (seconds, -t) unlimited
> max user processes (-u) 139264
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> And the infiniband seems to have no troubles at all:
> -sh-3.00$ ibstat
> CA 'mthca0'
> CA type: MT25204
> Number of ports: 1
> Firmware version: 1.0.800
> Hardware version: a0
> Node GUID: 0x0002c90200220ac8
> System image GUID: 0x0002c90200220acb
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 10
> Base lid: 18
> LMC: 0
> SM lid: 1
> Capability mask: 0x02510a68
> Port GUID: 0x0002c90200220ac9
>
> Thanks for any suggestions..
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems