Open MPI Development Mailing List Archives


From: Markus Daene (markus.daene_at_[hidden])
Date: 2007-06-22 04:20:36


Hi,

I think it is not necessary to specify the hosts via a hostfile when using SGE
with Open MPI; even $NSLOTS is not needed. Just run

mpirun <executable>

and this works very well.
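
For illustration, a minimal job script along those lines might look like the
sketch below (the parallel environment name 'mpi' and the installation paths
are taken from your script, and it assumes an Open MPI build with SGE support
so that mpirun picks up the slot allocation on its own):

#!/bin/sh
#$ -N MPI_Job
#$ -pe mpi 4
#$ -cwd
# make the Open MPI binaries and libraries visible to the job
PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin; export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64; export LD_LIBRARY_PATH
# no -np and no -hostfile: the slots granted by SGE are used directly
mpirun /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1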

Regarding your memory problem:
I had similar problems when I requested the h_vmem resource in SGE. Without
SGE everything worked, but starting the job under SGE gave exactly such memory
errors. You can easily check whether this resource is configured with
'qconf -sc'. If you have been using it, try submitting without it. In my case
the problem was that Open MPI sometimes allocates a lot of memory, the job
immediately hits the limit and is killed by SGE, and one gets error messages
like these; see my posting from a few days ago. I am not sure whether this
applies to your case, but it could be an explanation.
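
For example, you could check and compare like this (the job file name and the
h_vmem value here are only placeholders; use whatever your submission actually
passes):

# check whether h_vmem is defined in the complex configuration
qconf -sc | grep h_vmem

# a submission that requests a hard virtual memory limit, e.g.
#   qsub -l h_vmem=1G MPI_Job.sh
# compared with a test submission without that request:
qsub MPI_Job.sh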

Markus

On Thursday, 21 June 2007 15:26, sadfub_at_[hidden] wrote:
> Hi,
>
> I'm running into a really strange error that is causing me some serious
> headaches. I want to integrate Open MPI version 1.1.1 from the OFED package
> version 1.1 with SGE version 6.0. With MVAPICH everything works, but with
> Open MPI it does not ;(. Here is my job file and the error message:
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile
> $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> ERRORMESSAGE:
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating
> low priority cq for mthca0 errno says Cannot allocate memory
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> MPI_Job.e111975 (END)
>
>
> If I run the Open MPI job without SGE, everything works, e.g. with the
> following command:
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> If I do this with a static machine file, it works too:
> $ cat /tmp/machines
> node04
> node04
> node04
> node04
>
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile
> /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> And if I run this in a job script, it works too, even with a static machine
> file (not shown below):
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> The ulimits are set correctly on all nodes in the cluster, e.g. on node04:
> -sh-3.00$ ssh node04 ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) 8162952
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) 10240
> cpu time (seconds, -t) unlimited
> max user processes (-u) 139264
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> And the InfiniBand itself seems to have no trouble at all:
> -sh-3.00$ ibstat
> CA 'mthca0'
> CA type: MT25204
> Number of ports: 1
> Firmware version: 1.0.800
> Hardware version: a0
> Node GUID: 0x0002c90200220ac8
> System image GUID: 0x0002c90200220acb
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 10
> Base lid: 18
> LMC: 0
> SM lid: 1
> Capability mask: 0x02510a68
> Port GUID: 0x0002c90200220ac9
>
> Thanks for any suggestions..
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel