Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI SGE and IB don't work together
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-10-28 10:18:20


A few thoughts occur:

1. 1.4.3 is awfully old - I would recommend you update to at least the 1.6 series if you can. We don't actively support 1.4 any more, and I don't know what the issues with PSM might have been that long ago.

2. I see that you built LSF support for some reason, or there is a stale LSF support library from a prior build. You might want to clean that out just to avoid any future problems.

3. Just looking at your output, I see something a little weird where you appear to load both gcc and icc modules, then load an icc build of OMPI. Any chance you are getting conflicting libc's as a result?

4. The error message seems to indicate an issue with initializing the PSM driver. Is it possible that you need to load a module or something to prep PSM - something you do in your environment that ssh would activate (say in a .bashrc), but SGE isn't doing automatically for you?
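One quick way to check that last point is to run the same diagnostics over ssh and again inside an SGE job script, then compare the output. This is only a sketch: /dev/ipath is the usual QLogic PSM device node, and the exact names may differ on your install.

```shell
#!/bin/sh
# Run this both over ssh and from inside an SGE job script, then diff the two.
# /dev/ipath is the QLogic PSM device node (the "err=23" above came from opening it).
for dev in /dev/ipath /dev/ipath0; do
    if [ -c "$dev" ]; then
        echo "$dev: present"
    else
        echo "$dev: missing or not a character device"
    fi
done
# PSM wants a high locked-memory limit; batch shells sometimes reset it.
echo "locked-memory limit: $(ulimit -l 2>/dev/null || echo unknown)"
# The environment modules must actually be loaded in the batch shell too:
env | grep -i -e ipath -e psm || echo "no PSM-related variables set"
```

If the batch copy shows the device missing, a tiny locked-memory limit, or an emptier environment, that is your culprit (qsub -V, or sourcing the module setup in the job script, usually closes the gap).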

Ralph

On Oct 28, 2013, at 6:58 AM, Luigi Cavallo <Luigi.Cavallo_at_[hidden]> wrote:

>
> Hi,
>
> we are facing problems with Open MPI under SGE on a cluster equipped with QLogic IB HCAs. Outside of SGE, Open MPI works perfectly: we can dispatch jobs as we want, with no warning/error messages at all. If we do the same under SGE, even a hello-world program crashes. The main issue is PSM related, as you can see from the error message attached at the end of this email. We solved it by switching off PSM, using one of two strategies: either adding --mca mtl ^psm to the mpirun command, or setting the env variable OMPI_MCA_pml to ob1. This way jobs under SGE run properly. Any preference for one of the two options we found to switch off PSM?
>
> However, we would really like to understand why we get this PSM error when we run under SGE, since IB performance without PSM is of course degraded. We asked the SGE users list, but got nothing useful from them. Wondering if this list can help.
>
> Thanks,
> Luigi
>
>
> --------- BEGINNING OF error file from sge ------------
> Loading module gcc version 4.6.0
> Initial gcc version: 4.4.6
> Current gcc version: 4.6.0
> Loading module icc version 11.1.075
> Current icc version: none
> Current icc version: 11.1
> Loading module ifort version 11.1.075
> Current ifort version: none
> Current ifort version: 11.1
> Loading module for compilers-extra
> Extra compiler modules now loaded
> Loading module mpi-openmpi version 1.4.3-icc-11.1
> Current mpi-openmpi version: 1.4.3
> [c1bay2:31113] mca: base: component_find: unable to open /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ess_lsf: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
> [c1bay2:31113] mca: base: component_find: unable to open /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_plm_lsf: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
> [c1bay2:31113] mca: base: component_find: unable to open /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ras_lsf: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
> c1bay2.31114Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31116Driver initialization failure on /dev/ipath (err=23)
> c1bay2.31117Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> PSM was unable to open an endpoint. Please make sure that the network link is
> active on the node and the hardware is functioning.
>
> Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> c1bay2.31115Driver initialization failure on /dev/ipath (err=23)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [c1bay2:31114] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
>
> --------- END OF error file from sge ------------
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users