Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
From: Rémi Palancher (remi_at_[hidden])
Date: 2012-04-27 10:41:23


 Hi there,

 First, thank you for all your helpful answers!

 On Mon, 2 Apr 2012 20:30:37 -0700, Ralph Castain <rhc_at_[hidden]>
 wrote:
> I'm afraid the 1.5 series doesn't offer any help in this regard. The
> required changes only exist in the developers trunk, which will be
> released as the 1.7 series in the not-too-distant future.

 I've tested the same use case with 1.5.5 and I get the exact same
 result as with 1.4.5. I can confirm that this version doesn't offer any
 help on this.

 I've also tested the latest available snapshot of the trunk,
 1.7a1r26338, but it seems to have two regressions:

   - when PSM is enabled, there is an undefined symbol error in
 mca_mtl_psm.so:

 $ mpirun -n 1 get-allowed-cpu-ompi
 [cn0286:23252] mca: base: component_find: unable to open
 /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm:
 /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so: undefined
 symbol: ompi_mtl_psm_imrecv (ignored)
 --------------------------------------------------------------------------
 A requested component was not found, or was unable to be opened. This
 means that this component is either not installed or is unable to be
 used on your system (e.g., sometimes this means that shared libraries
 that the component requires are unable to be found/loaded). Note that
 Open MPI stopped checking at the first component that it did not find.

 Host: cn0286
 Framework: mtl
 Component: psm
 --------------------------------------------------------------------------
 [cn0286:23252] mca: base: components_open: component pml / cm open
 function failed
 --------------------------------------------------------------------------
 No available pml components were found!

 This means that there are no components of this type installed on your
 system or all the components reported that they could not be used.

 This is a fatal error; your MPI process is likely to abort. Check the
 output of the "ompi_info" command and ensure that components of this
 type are available on your system. You may also wish to check the
 value of the "component_path" MCA parameter and ensure that it has at
 least one directory that contains valid MCA components.
 --------------------------------------------------------------------------
 [cn0286:23252] PML cm cannot be selected
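
 In case it helps with the diagnosis of this first point, the dynamic
 symbol table of the plugin can be checked with nm, e.g.:

 $ nm -D /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so | grep ompi_mtl_psm_imrecv

 If ompi_mtl_psm_imrecv only shows up there with a "U" (undefined)
 marker, the component references a function it never defines, which
 would suggest the problem is in the snapshot itself rather than in my
 build.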

   - when PSM support is disabled (to avoid the previous error), binding
 to the cores allocated by Slurm fails (a rough sketch of the
 get-allowed-cpu.c test follows the session below):

 $ salloc --qos=debug -N 2 -n 20
 $ srun hostname | sort | uniq -c
      12 cn0564
       8 cn0565
 $ module load openmpi_1.7a1r26338
 $ unset OMPI_MCA_mtl OMPI_MCA_pml
 $ mpicc -o get-allowed-cpu-ompi get-allowed-cpu.c
 $ mpirun get-allowed-cpu-ompi
 Launch (null) Task 12 of 20 (cn0565): 0-23
 Launch (null) Task 13 of 20 (cn0565): 0-23
 Launch (null) Task 14 of 20 (cn0565): 0-23
 Launch (null) Task 15 of 20 (cn0565): 0-23
 Launch (null) Task 16 of 20 (cn0565): 0-23
 Launch (null) Task 17 of 20 (cn0565): 0-23
 Launch (null) Task 18 of 20 (cn0565): 0-23
 Launch (null) Task 19 of 20 (cn0565): 0-23
 Launch (null) Task 07 of 20 (cn0564): 0-23
 Launch (null) Task 08 of 20 (cn0564): 0-23
 Launch (null) Task 09 of 20 (cn0564): 0-23
 Launch (null) Task 10 of 20 (cn0564): 0-23
 Launch (null) Task 11 of 20 (cn0564): 0-23
 Launch (null) Task 00 of 20 (cn0564): 0-23
 Launch (null) Task 01 of 20 (cn0564): 0-23
 Launch (null) Task 02 of 20 (cn0564): 0-23
 Launch (null) Task 03 of 20 (cn0564): 0-23
 Launch (null) Task 04 of 20 (cn0564): 0-23
 Launch (null) Task 05 of 20 (cn0564): 0-23
 Launch (null) Task 06 of 20 (cn0564): 0-23
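
 For reference, get-allowed-cpu.c is just a small test that prints, for
 each task, the set of CPUs the process is allowed to run on. A minimal
 sketch of such a test (assuming the affinity mask is read with
 sched_getaffinity(2); the "Launch" label and the SLURM_JOB_NAME lookup
 are only guesses to match the "(null)" in the output) would be:

 #define _GNU_SOURCE            /* for CPU_* macros and sched_getaffinity() */
 #include <mpi.h>
 #include <sched.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 int main(int argc, char **argv)
 {
     int rank, size, cpu, end, pos = 0;
     cpu_set_t mask;
     char cpus[4096] = "";
     char host[256];
     const char *label;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);

     /* Read the set of CPUs this process is allowed to run on. */
     if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
         perror("sched_getaffinity");
         MPI_Abort(MPI_COMM_WORLD, 1);
     }

     /* Format the mask as a compact list of ranges, e.g. "0-23". */
     for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
         if (!CPU_ISSET(cpu, &mask))
             continue;
         end = cpu;
         while (end + 1 < CPU_SETSIZE && CPU_ISSET(end + 1, &mask))
             end++;
         pos += snprintf(cpus + pos, sizeof(cpus) - pos,
                         pos ? ",%d" : "%d", cpu);
         if (end > cpu)
             pos += snprintf(cpus + pos, sizeof(cpus) - pos, "-%d", end);
         cpu = end;
     }

     gethostname(host, sizeof(host));

     /* "(null)" in the output above suggests an unset environment variable
        is printed here; SLURM_JOB_NAME is only a guess. */
     label = getenv("SLURM_JOB_NAME");
     printf("Launch %s Task %02d of %d (%s): %s\n",
            label ? label : "(null)", rank, size, host, cpus);

     MPI_Finalize();
     return 0;
 }

 With correct binding I would expect each task to be confined to the
 cores Slurm allocated on its node, instead of all 24 cores (0-23) as
 above.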

 FYI, I am using Slurm 2.3.3.

 Did I miss something with this trunk version?

 Would you like me to send the corresponding generated config.log and
 the "ompi_info" and "mpirun ompi full" outputs?

 Regards,

-- 
 Rémi Palancher
 http://rezib.org