Subject: Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-04-27 10:56:15


Couple of things:

1. please do send the output from ompi_info

2. please send the slurm envars from your allocation - i.e., after you do your salloc.

Are you sure that Slurm is actually "binding" us during this launch? If you just srun your get-allowed-cpu program, what does it show? I'm wondering if the binding just gets reflected in the allocation envars without Slurm actually binding the orteds.
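
For example (just a sketch; adapt it to your allocation), something like this would show both the allocation envars and what Slurm itself actually binds each task to -- Cpus_allowed_list comes from the kernel, so it reflects the real binding rather than just the envars:

$ salloc --qos=debug -N 2 -n 20
$ env | grep ^SLURM_ | sort
$ srun grep Cpus_allowed_list /proc/self/status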

On Apr 27, 2012, at 8:41 AM, Rémi Palancher wrote:

> Hi there,
>
> First, thank you for all your helpful answers!
>
> On Mon, 2 Apr 2012 20:30:37 -0700, Ralph Castain <rhc_at_[hidden]> wrote:
>> I'm afraid the 1.5 series doesn't offer any help in this regard. The
>> required changes only exist in the developers trunk, which will be
>> released as the 1.7 series in the not-too-distant future.
>
> I've tested the same use case with 1.5.5 and I get exactly the same result as with 1.4.5. I can confirm this version doesn't help here.
>
> I've also tested the latest available trunk snapshot, 1.7a1r26338, but it seems to have 2 regressions:
>
> - when PSM is enabled, an undefined symbol error occurs within mca_mtl_psm.so:
>
> $ mpirun -n 1 get-allowed-cpu-ompi
> [cn0286:23252] mca: base: component_find: unable to open /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm: /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so: undefined symbol: ompi_mtl_psm_imrecv (ignored)
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: cn0286
> Framework: mtl
> Component: psm
> --------------------------------------------------------------------------
> [cn0286:23252] mca: base: components_open: component pml / cm open function failed
> --------------------------------------------------------------------------
> No available pml components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort. Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system. You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --------------------------------------------------------------------------
> [cn0286:23252] PML cm cannot be selected
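>
> (If it helps, one way to check that symbol directly on the installed component would be nm, e.g.:
>
> $ nm -D /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so | grep ompi_mtl_psm_imrecv
>
> where a "U" entry means the symbol is referenced but never defined in that library.)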
>
> - when PSM support is disabled (to avoid the previous error), binding to the cores allocated by Slurm fails:
>
> $ salloc --qos=debug -N 2 -n 20
> $ srun hostname | sort | uniq -c
> 12 cn0564
> 8 cn0565
> $ module load openmpi_1.7a1r26338
> $ unset OMPI_MCA_mtl OMPI_MCA_pml
> $ mpicc -o get-allowed-cpu-ompi get-allowed-cpu.c
> $ mpirun get-allowed-cpu-ompi
> Launch (null) Task 12 of 20 (cn0565): 0-23
> Launch (null) Task 13 of 20 (cn0565): 0-23
> Launch (null) Task 14 of 20 (cn0565): 0-23
> Launch (null) Task 15 of 20 (cn0565): 0-23
> Launch (null) Task 16 of 20 (cn0565): 0-23
> Launch (null) Task 17 of 20 (cn0565): 0-23
> Launch (null) Task 18 of 20 (cn0565): 0-23
> Launch (null) Task 19 of 20 (cn0565): 0-23
> Launch (null) Task 07 of 20 (cn0564): 0-23
> Launch (null) Task 08 of 20 (cn0564): 0-23
> Launch (null) Task 09 of 20 (cn0564): 0-23
> Launch (null) Task 10 of 20 (cn0564): 0-23
> Launch (null) Task 11 of 20 (cn0564): 0-23
> Launch (null) Task 00 of 20 (cn0564): 0-23
> Launch (null) Task 01 of 20 (cn0564): 0-23
> Launch (null) Task 02 of 20 (cn0564): 0-23
> Launch (null) Task 03 of 20 (cn0564): 0-23
> Launch (null) Task 04 of 20 (cn0564): 0-23
> Launch (null) Task 05 of 20 (cn0564): 0-23
> Launch (null) Task 06 of 20 (cn0564): 0-23
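>
> (For reference, here is a minimal sketch of a program that produces this kind of line; it assumes the test simply reads Cpus_allowed_list from /proc/self/status -- the real get-allowed-cpu.c may differ, and its leading "Launch ..." field comes from an environment variable that happens to be unset in these runs:)
>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>     char host[256], line[256], cpus[256] = "?";
>     FILE *f;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     gethostname(host, sizeof(host));
>
>     /* The kernel reports which CPUs this task is allowed to run on;
>      * a task that is not bound shows the full range (0-23 here). */
>     f = fopen("/proc/self/status", "r");
>     if (f) {
>         while (fgets(line, sizeof(line), f)) {
>             if (strncmp(line, "Cpus_allowed_list:", 18) == 0) {
>                 sscanf(line + 18, "%255s", cpus);
>                 break;
>             }
>         }
>         fclose(f);
>     }
>
>     printf("Task %02d of %d (%s): %s\n", rank, size, host, cpus);
>
>     MPI_Finalize();
>     return 0;
> }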
>
> FYI, I am using Slurm 2.3.3.
>
> Did I miss something with this trunk version?
>
> Do you want me to send the corresponding generated config.log, "ompi_info" and "mpirun ompi full"?
>
> Regards,
> --
> Rémi Palancher
> http://rezib.org