Open MPI Development Mailing List Archives

From: sadfub_at_[hidden]
Date: 2007-06-21 09:26:37


Hi,

I'm running into a really strange error that is causing me some serious headaches.
I want to integrate Open MPI version 1.1.1 from the OFED 1.1 package with SGE
version 6.0. Everything works with MVAPICH, but not with Open MPI ;(.
Here are my job file and the error message:
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines \
    /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

ERROR MESSAGE:
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
(end of error file MPI_Job.e111975)
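
The failures all come from the InfiniBand memory-registration call (ibv_reg_mr), so as a
sanity check I could rerun the same job with InfiniBand switched off. This is just a
sketch, assuming the tcp and self BTLs are built into this OFED build of Open MPI:

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -mca btl tcp,self -np $NSLOTS \
    -hostfile $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

If that variant runs fine under SGE, the problem would be confined to registering memory
with the HCA rather than to the SGE integration as such.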

If I run the Open MPI job without SGE, everything works, e.g. with the following command:
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 \
    /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

If I do this with a static machinefile, it works too:
$ cat /tmp/machines
node04
node04
node04
node04

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile /tmp/machines \
    /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

And if I run it from a job script, it works as well, even with a static machinefile
(not shown below):
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 \
    /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
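
Since the only real difference between the failing run and the working ones seems to be
the SGE-generated machinefile, I would add something like the following (just a sketch)
right before the mpirun line in the failing job script, to see what $TMPDIR/machines
actually contains:

echo "SGE machinefile ($TMPDIR/machines) contains:"
cat $TMPDIR/machines
echo "NSLOTS = $NSLOTS"

If that output matches the static /tmp/machines above, the hostfile itself can at least
be ruled out.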

The ulimits are set correctly on all nodes in the cluster, e.g. on node04:
-sh-3.00$ ssh node04 ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 8162952
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 139264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
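
Those are the limits of an ssh login shell, though; I'm not sure whether the shell that
sge_execd starts for the job inherits the same ones (in particular "max locked memory",
which is what ibv_reg_mr needs in order to pin buffers). A quick check, just a sketch,
added at the top of the job script:

echo "limits as seen inside the SGE job on `hostname`:"
sh -c 'ulimit -a'

If "max locked memory" inside the job comes out much smaller than the 8162952 kB shown
above, that would at least match the "Cannot allocate memory" from ibv_reg_mr.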

And InfiniBand seems to have no trouble at all:
-sh-3.00$ ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200220ac8
        System image GUID: 0x0002c90200220acb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 18
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200220ac9
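
If it helps, I can also post the device attributes, e.g. with ibv_devinfo from the same
OFED install (just a sketch of what I would look for):

ibv_devinfo -v | grep -i -e max_mr -e max_qp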

Thanks for any suggestions.