Open MPI Development Mailing List Archives

From: Markus Daene (markus.daene_at_[hidden])
Date: 2007-06-22 04:20:36


Hi,

I think it is not necessary to specify the hosts via a hostfile when using SGE
with Open MPI; even $NSLOTS is not needed. Just run

mpirun executable

and this works very well.
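
For example, a minimal job script sketch (assuming an Open MPI build with
gridengine support; the installation prefix and program name are only
placeholders). With the SGE integration, mpirun takes the slot count and host
list from the parallel environment, so neither -np $NSLOTS nor -hostfile is
needed:

#!/bin/sh
#$ -N MPI_Job
#$ -pe mpi 4
# hypothetical installation prefix; adjust to your site
PREFIX=/opt/openmpi
export PATH=$PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH
# no -np and no -hostfile: mpirun reads the allocation from SGE
mpirun ./my_mpi_program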

Regarding your memory problem:
I had similar problems when I requested the h_vmem resource in SGE. Without
SGE everything worked, but starting the job through SGE produced exactly these
memory errors. You can easily check the resource configuration with 'qconf -sc'.
If you have requested this option, try without it. In my case the problem was
that Open MPI sometimes allocates a lot of memory, so the job gets killed
immediately by SGE and one gets error messages like these; see my posting from
a few days ago. I am not sure whether this helps in your case, but it could be
an explanation.
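
To check this, something like the following should work (the job id, limit
value, and script name are only placeholders):

# is h_vmem defined as a resource, and how?
qconf -sc | grep h_vmem
# does the job actually request it?
qstat -j <job_id> | grep h_vmem
# if the job script contains e.g. '#$ -l h_vmem=1G', try removing that
# line first, or resubmit with a much larger value:
qsub -pe mpi 4 -l h_vmem=4G jobscript.sh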

Markus

On Thursday, 21 June 2007 15:26, sadfub_at_[hidden] wrote:
> Hi,
>
> I'm hitting a really strange error that is causing me serious headaches.
> I want to integrate Open MPI version 1.1.1 from the OFED package version
> 1.1 with SGE version 6.0. Everything works with MVAPICH, but not with Open MPI ;(.
> Here is my jobfile and error message:
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile
> $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> ERRORMESSAGE:
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating
> low priority cq for mthca0 errno says Cannot allocate memory
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> MPI_Job.e111975 (END)
>
>
> If I run the OMPI job just without SGE, everything works, e.g. with the
> following command:
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> If I do this with static machinefiles, it works too:
> $ cat /tmp/machines
> node04
> node04
> node04
> node04
>
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile
> /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> And if I run this in a jobscript it works even with a static machinefile
> (not shown below):
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> There are correct ulimits for all nodes in the cluster e.g. node04:
> -sh-3.00$ ssh node04 ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) 8162952
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) 10240
> cpu time (seconds, -t) unlimited
> max user processes (-u) 139264
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> And the InfiniBand fabric seems to have no trouble at all:
> -sh-3.00$ ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.0.800
>         Hardware version: a0
>         Node GUID: 0x0002c90200220ac8
>         System image GUID: 0x0002c90200220acb
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 18
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02510a68
>                 Port GUID: 0x0002c90200220ac9
>
> Thanks for any suggestions..
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel