Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
From: Rémi Palancher (remi_at_[hidden])
Date: 2012-05-02 10:49:35


 On Fri, 27 Apr 2012 08:56:15 -0600, Ralph Castain <rhc_at_[hidden]>
 wrote:
> Couple of things:
>
> 1. please do send the output from ompi_info

 You can find it attached to this email.

> 2. please send the slurm envars from your allocation - i.e., after
> you do your salloc.

 Here is an example:
 $ salloc -N 2 -n 20 --qos=debug
 salloc: Granted job allocation 1917048
 $ srun hostname | sort | uniq -c
      12 cn0331
       8 cn0333
 $ env | grep ^SLURM
 SLURM_NODELIST=cn[0331,0333]
 SLURM_NNODES=2
 SLURM_JOBID=1917048
 SLURM_NTASKS=20
 SLURM_TASKS_PER_NODE=12,8
 SLURM_JOB_ID=1917048
 SLURM_SUBMIT_DIR=/gpfs/home/H76170
 SLURM_NPROCS=20
 SLURM_JOB_NODELIST=cn[0331,0333]
 SLURM_JOB_CPUS_PER_NODE=12,8
 SLURM_JOB_NUM_NODES=2
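
 As a quick sanity check on these variables, the per-node task counts can be
 summed and compared with SLURM_NTASKS. The following is only a minimal
 sketch: the function name sum_tasks_per_node is made up for illustration, and
 it assumes the usual Slurm notation where a count may carry a repeat suffix
 such as "2(x3)".

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or Open MPI): sum a
# SLURM_TASKS_PER_NODE-style value such as "12,8" or "2(x3),1".
sum_tasks_per_node() {
    # One entry per line, then split each entry on "(", "x" and ")":
    # field 1 is the task count, field 3 (if present) is the repeat count.
    echo "$1" | tr ',' '\n' | awk -F'[(x)]' '
        { reps = ($3 != "" ? $3 : 1); total += $1 * reps }
        END { print total + 0 }'
}

# Inside an allocation, compare the sum with SLURM_NTASKS.
if [ -n "${SLURM_TASKS_PER_NODE:-}" ]; then
    echo "sum=$(sum_tasks_per_node "$SLURM_TASKS_PER_NODE") ntasks=${SLURM_NTASKS:-unset}"
fi
```

 With the allocation above (SLURM_TASKS_PER_NODE=12,8) the sum is 20, which
 matches SLURM_NTASKS=20.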

> Are you sure that slurm is actually "binding" us during this launch?
> If you just srun your get-allowed-cpu, what does it show? I'm
> wondering if it just gets reflected in the allocation envar and not
> actually binding the orteds.

 Core binding with Slurm 2.3.3 + OpenMPI 1.4.3 works well:

 $ mpirun -V
 mpirun (Open MPI) 1.4.3

 Report bugs to http://www.open-mpi.org/community/help/
 $ mpirun get-allowed-cpu-ompi 1
 Launch 1 Task 01 of 20 (cn0331): 1
 Launch 1 Task 03 of 20 (cn0331): 3
 Launch 1 Task 04 of 20 (cn0331): 5
 Launch 1 Task 02 of 20 (cn0331): 2
 Launch 1 Task 09 of 20 (cn0331): 11
 Launch 1 Task 11 of 20 (cn0331): 10
 Launch 1 Task 12 of 20 (cn0333): 0
 Launch 1 Task 13 of 20 (cn0333): 1
 Launch 1 Task 14 of 20 (cn0333): 2
 Launch 1 Task 15 of 20 (cn0333): 3
 Launch 1 Task 16 of 20 (cn0333): 6
 Launch 1 Task 17 of 20 (cn0333): 4
 Launch 1 Task 18 of 20 (cn0333): 5
 Launch 1 Task 19 of 20 (cn0333): 7
 Launch 1 Task 00 of 20 (cn0331): 0
 Launch 1 Task 05 of 20 (cn0331): 7
 Launch 1 Task 06 of 20 (cn0331): 6
 Launch 1 Task 07 of 20 (cn0331): 4
 Launch 1 Task 08 of 20 (cn0331): 8
 Launch 1 Task 10 of 20 (cn0331): 9

 But binding is lost as soon as I switch to OpenMPI 1.7a1r26338 (every task
 reports all cores, 0-23):

 $ module load openmpi_1.7a1r26338
 $ mpirun -V
 mpirun (Open MPI) 1.7a1r26338

 Report bugs to http://www.open-mpi.org/community/help/
 $ unset OMPI_MCA_mtl OMPI_MCA_pml
 $ mpirun get-allowed-cpu-ompi 1
 Launch 1 Task 12 of 20 (cn0333): 0-23
 Launch 1 Task 13 of 20 (cn0333): 0-23
 Launch 1 Task 14 of 20 (cn0333): 0-23
 Launch 1 Task 15 of 20 (cn0333): 0-23
 Launch 1 Task 16 of 20 (cn0333): 0-23
 Launch 1 Task 17 of 20 (cn0333): 0-23
 Launch 1 Task 18 of 20 (cn0333): 0-23
 Launch 1 Task 19 of 20 (cn0333): 0-23
 Launch 1 Task 07 of 20 (cn0331): 0-23
 Launch 1 Task 08 of 20 (cn0331): 0-23
 Launch 1 Task 09 of 20 (cn0331): 0-23
 Launch 1 Task 10 of 20 (cn0331): 0-23
 Launch 1 Task 11 of 20 (cn0331): 0-23
 Launch 1 Task 00 of 20 (cn0331): 0-23
 Launch 1 Task 01 of 20 (cn0331): 0-23
 Launch 1 Task 02 of 20 (cn0331): 0-23
 Launch 1 Task 03 of 20 (cn0331): 0-23
 Launch 1 Task 04 of 20 (cn0331): 0-23
 Launch 1 Task 05 of 20 (cn0331): 0-23
 Launch 1 Task 06 of 20 (cn0331): 0-23

 Using srun in the OpenMPI 1.4.3 environment fails with the following error:

 Error obtaining unique transport key from ORTE
 (orte_precondition_transports not present in
 the environment).
 [...]

 In OpenMPI 1.7a1r26338, the binding result with srun is the same as with
 mpirun (0-23 for every task), although each task reports itself as task 00
 of 01:

 $ module load openmpi_1.7a1r26338
 $ srun get-allowed-cpu-ompi 1
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0333): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
 Launch 1 Task 00 of 01 (cn0331): 0-23
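
 The source of get-allowed-cpu-ompi is not included in this message. For
 anyone who wants to run a comparable check, here is a rough shell stand-in,
 given only as a sketch. It assumes Linux (it reads Cpus_allowed_list from
 /proc/self/status), it does not call MPI_Init, and it takes rank and size
 from the OMPI_COMM_WORLD_RANK / OMPI_COMM_WORLD_SIZE variables set by mpirun,
 falling back to SLURM_PROCID / SLURM_NTASKS under srun. The function names
 are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for get-allowed-cpu-ompi; NOT the original tool.

# Print the allowed-CPU list from a /proc status file (Linux only).
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}

# Format one report line in the style of the output shown above.
# $1 = launch label, $2 = rank, $3 = size, $4 = host, $5 = CPU list
report_line() {
    printf 'Launch %s Task %02d of %02d (%s): %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Rank and size come from the launcher's environment, not from MPI itself.
rank="${OMPI_COMM_WORLD_RANK:-${SLURM_PROCID:-0}}"
size="${OMPI_COMM_WORLD_SIZE:-${SLURM_NTASKS:-1}}"

report_line "${1:-1}" "$rank" "$size" "$(uname -n)" "$(allowed_cpus)"
```

 Because this sketch never initializes MPI, its rank and size fields will not
 necessarily match what the real helper prints (for example the "00 of 01"
 seen with srun above); only the CPU list is directly comparable.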

 Regards,

-- 
 Rémi Palancher
 http://rezib.org