Hi,

We are facing problems with Open MPI under SGE on a cluster equipped with QLogic IB HCAs.  Outside SGE, Open MPI works perfectly: we can dispatch jobs as we want, with no warning or error messages at all.  If we do the same under SGE, even the hello-world program crashes.  The main issue is PSM related, as you can see from the error file attached at the end of this email.  We worked around it by switching off PSM, using one of two strategies: either adding --mca mtl ^psm to the mpirun command, or setting the environment variable OMPI_MCA_pml to ob1.  With either change, jobs under SGE run properly.  Is there any reason to prefer one of these two options for switching off PSM?
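
For reference, the two workarounds look roughly like this in a job script.  This is only a sketch: the process count and ./hello_world are placeholders rather than our actual job, and the mpirun lines are left commented out so the fragment can be read without a working Open MPI install.

```shell
# Workaround 1: exclude the PSM MTL on the mpirun command line.
# ("-np 4 ./hello_world" is a placeholder for the real job; the
# quotes keep the shell from interpreting the leading "^".)
#   mpirun --mca mtl '^psm' -np 4 ./hello_world

# Workaround 2: select the ob1 PML through the environment,
# then launch mpirun without any extra --mca arguments.
export OMPI_MCA_pml=ob1
#   mpirun -np 4 ./hello_world
```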

However, we would really like to understand why we get this PSM error when we run under SGE, since IB performance is of course worse without PSM.  We asked on the SGE users list but got no useful answer, so we are wondering whether this list can help.

Thanks,
Luigi


--------- BEGINNING OF error file from sge ------------
Loading module gcc version 4.6.0
Initial gcc version: 4.4.6
Current gcc version: 4.6.0
Loading module icc version 11.1.075
Current icc version: none
Current icc version: 11.1
Loading module ifort version 11.1.075
Current ifort version: none
Current ifort version: 11.1
Loading module for compilers-extra
Extra compiler modules now loaded
Loading module mpi-openmpi version 1.4.3-icc-11.1
Current mpi-openmpi version: 1.4.3
[c1bay2:31113] mca: base: component_find: unable to open /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ess_lsf: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[c1bay2:31113] mca: base: component_find: unable to open /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_plm_lsf: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[c1bay2:31113] mca: base: component_find: unable to open /opt/share/mpi-openmpi/1.4.3-icc-11.1/el6/x86_64/lib/openmpi/mca_ras_lsf: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
c1bay2.31114Driver initialization failure on /dev/ipath (err=23)
c1bay2.31116Driver initialization failure on /dev/ipath (err=23)
c1bay2.31117Driver initialization failure on /dev/ipath (err=23)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

 Error: Failure in initializing endpoint
--------------------------------------------------------------------------
c1bay2.31115Driver initialization failure on /dev/ipath (err=23)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[c1bay2:31114] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.

--------- END OF error file from sge ------------



