Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-05-02 11:24:01


Okay, your tests confirmed my suspicions. Slurm isn't doing any binding at all - that's why your srun of get-allowed-cpu-ompi showed no bindings. I don't see anything in your commands that would tell Slurm to bind us to anything. All your salloc did was tell Slurm what to allocate - that doesn't imply binding.
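If you want Slurm itself to do the binding, you have to ask it to - for example, something along these lines (just a sketch; the exact option spelling depends on your Slurm version, and it assumes the affinity task plugin is enabled on your cluster):

$ srun --cpu_bind=cores get-allowed-cpu-ompi 1

That should report per-core masks instead of the full 0-23 range.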

You can get the trunk to bind by adding "--bind-to core" to your command line. That should yield the pattern you showed in your 1.4.3 test.
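For example, using your test program:

$ mpirun --bind-to core get-allowed-cpu-ompi 1

Each task should then report a single core rather than 0-23.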

Of more interest is why the 1.4.3 installation is binding at all. I suspect you have an MCA param set somewhere that tells us to bind-to-core - perhaps in the default MCA param file, or in your environment. It certainly wouldn't be doing that by default.
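A quick way to check (the file locations below are the usual defaults - substitute the actual prefix of your 1.4.3 install):

$ env | grep ^OMPI_MCA
$ grep -i -E 'paffinity|bind' <1.4.3-prefix>/etc/openmpi-mca-params.conf
$ grep -i -E 'paffinity|bind' ~/.openmpi/mca-params.conf

In the 1.4 series, something like mpi_paffinity_alone=1 in any of those places would produce the per-core binding you saw.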

On May 2, 2012, at 8:49 AM, Rémi Palancher wrote:

> On Fri, 27 Apr 2012 08:56:15 -0600, Ralph Castain <rhc_at_[hidden]> wrote:
>> Couple of things:
>>
>> 1. please do send the output from ompi_info
>
> You can find them attached to this email.
>
>> 2. please send the slurm envars from your allocation - i.e., after
>> you do your salloc.
>
> Here is an example:
> $ salloc -N 2 -n 20 --qos=debug
> salloc: Granted job allocation 1917048
> $ srun hostname | sort | uniq -c
> 12 cn0331
> 8 cn0333
> $ env | grep ^SLURM
> SLURM_NODELIST=cn[0331,0333]
> SLURM_NNODES=2
> SLURM_JOBID=1917048
> SLURM_NTASKS=20
> SLURM_TASKS_PER_NODE=12,8
> SLURM_JOB_ID=1917048
> SLURM_SUBMIT_DIR=/gpfs/home/H76170
> SLURM_NPROCS=20
> SLURM_JOB_NODELIST=cn[0331,0333]
> SLURM_JOB_CPUS_PER_NODE=12,8
> SLURM_JOB_NUM_NODES=2
>
>> Are you sure that slurm is actually "binding" us during this launch?
>> If you just srun your get-allowed-cpu, what does it show? I'm
>> wondering if it just gets reflected in the allocation envar and not
>> actually binding the orteds.
>
> Core binding with Slurm 2.3.3 + OpenMPI 1.4.3 works well:
>
> $ mpirun -V
> mpirun (Open MPI) 1.4.3
>
> Report bugs to http://www.open-mpi.org/community/help/
> $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 01 of 20 (cn0331): 1
> Launch 1 Task 03 of 20 (cn0331): 3
> Launch 1 Task 04 of 20 (cn0331): 5
> Launch 1 Task 02 of 20 (cn0331): 2
> Launch 1 Task 09 of 20 (cn0331): 11
> Launch 1 Task 11 of 20 (cn0331): 10
> Launch 1 Task 12 of 20 (cn0333): 0
> Launch 1 Task 13 of 20 (cn0333): 1
> Launch 1 Task 14 of 20 (cn0333): 2
> Launch 1 Task 15 of 20 (cn0333): 3
> Launch 1 Task 16 of 20 (cn0333): 6
> Launch 1 Task 17 of 20 (cn0333): 4
> Launch 1 Task 18 of 20 (cn0333): 5
> Launch 1 Task 19 of 20 (cn0333): 7
> Launch 1 Task 00 of 20 (cn0331): 0
> Launch 1 Task 05 of 20 (cn0331): 7
> Launch 1 Task 06 of 20 (cn0331): 6
> Launch 1 Task 07 of 20 (cn0331): 4
> Launch 1 Task 08 of 20 (cn0331): 8
> Launch 1 Task 10 of 20 (cn0331): 9
>
> But it fails as soon as I switch to OpenMPI 1.7a1r26338:
>
> $ module load openmpi_1.7a1r26338
> $ mpirun -V
> mpirun (Open MPI) 1.7a1r26338
>
> Report bugs to http://www.open-mpi.org/community/help/
> $ unset OMPI_MCA_mtl OMPI_MCA_pml
> $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 12 of 20 (cn0333): 0-23
> Launch 1 Task 13 of 20 (cn0333): 0-23
> Launch 1 Task 14 of 20 (cn0333): 0-23
> Launch 1 Task 15 of 20 (cn0333): 0-23
> Launch 1 Task 16 of 20 (cn0333): 0-23
> Launch 1 Task 17 of 20 (cn0333): 0-23
> Launch 1 Task 18 of 20 (cn0333): 0-23
> Launch 1 Task 19 of 20 (cn0333): 0-23
> Launch 1 Task 07 of 20 (cn0331): 0-23
> Launch 1 Task 08 of 20 (cn0331): 0-23
> Launch 1 Task 09 of 20 (cn0331): 0-23
> Launch 1 Task 10 of 20 (cn0331): 0-23
> Launch 1 Task 11 of 20 (cn0331): 0-23
> Launch 1 Task 00 of 20 (cn0331): 0-23
> Launch 1 Task 01 of 20 (cn0331): 0-23
> Launch 1 Task 02 of 20 (cn0331): 0-23
> Launch 1 Task 03 of 20 (cn0331): 0-23
> Launch 1 Task 04 of 20 (cn0331): 0-23
> Launch 1 Task 05 of 20 (cn0331): 0-23
> Launch 1 Task 06 of 20 (cn0331): 0-23
>
> Using srun in the OpenMPI 1.4.3 environment fails with the following error:
>
> Error obtaining unique transport key from ORTE (orte_precondition_transports not present in
> the environment).
> [...]
>
> In OpenMPI 1.7a1r26338, the result of srun is the same as with mpirun:
>
> $ module load openmpi_1.7a1r26338
> $ srun get-allowed-cpu-ompi 1
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
>
> Regards,
> --
> Rémi Palancher
> http://rezib.org
>
> (attachment: ompi_info_1.7a1r26338_psm_undefined_symbol.txt.gz)