On Fri, 27 Apr 2012 08:56:15 -0600, Ralph Castain <rhc_at_[hidden]>
wrote:
> Couple of things:
>
> 1. please do send the output from ompi_info
You can find them attached to this email.
> 2. please send the slurm envars from your allocation - i.e., after
> you do your salloc.
Here is an example:
$ salloc -N 2 -n 20 --qos=debug
salloc: Granted job allocation 1917048
$ srun hostname | sort | uniq -c
12 cn0331
8 cn0333
$ env | grep ^SLURM
SLURM_NODELIST=cn[0331,0333]
SLURM_NNODES=2
SLURM_JOBID=1917048
SLURM_NTASKS=20
SLURM_TASKS_PER_NODE=12,8
SLURM_JOB_ID=1917048
SLURM_SUBMIT_DIR=/gpfs/home/H76170
SLURM_NPROCS=20
SLURM_JOB_NODELIST=cn[0331,0333]
SLURM_JOB_CPUS_PER_NODE=12,8
SLURM_JOB_NUM_NODES=2
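The per-node counts in SLURM_TASKS_PER_NODE (12,8) match the srun hostname output above. For anyone who wants to check this programmatically, here is a minimal Python sketch that expands that variable. It assumes Slurm's compact notation, where a repeated count is written as N(xM); the function name is hypothetical and not part of Slurm or Open MPI.

```python
import re

def expand_tasks_per_node(spec):
    """Expand a SLURM_TASKS_PER_NODE value such as '12,8' or '2(x3),1'
    into one task count per node (hypothetical helper, illustration only)."""
    counts = []
    for item in spec.split(","):
        # Each entry is either 'N' or 'N(xM)', meaning N tasks on M consecutive nodes.
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if m is None:
            raise ValueError(f"unrecognized entry: {item!r}")
        counts.extend([int(m.group(1))] * int(m.group(2) or 1))
    return counts

if __name__ == "__main__":
    # Values taken from the allocation shown above.
    counts = expand_tasks_per_node("12,8")
    print(counts, sum(counts))
```

With the values from this allocation, the counts sum to SLURM_NTASKS (20).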
> Are you sure that slurm is actually "binding" us during this launch?
> If you just srun your get-allowed-cpu, what does it show? I'm
> wondering if it just gets reflected in the allocation envar and not
> actually binding the orteds.
Core binding with Slurm 2.3.3 + OpenMPI 1.4.3 works well:
$ mpirun -V
mpirun (Open MPI) 1.4.3
Report bugs to http://www.open-mpi.org/community/help/
$ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 01 of 20 (cn0331): 1
Launch 1 Task 03 of 20 (cn0331): 3
Launch 1 Task 04 of 20 (cn0331): 5
Launch 1 Task 02 of 20 (cn0331): 2
Launch 1 Task 09 of 20 (cn0331): 11
Launch 1 Task 11 of 20 (cn0331): 10
Launch 1 Task 12 of 20 (cn0333): 0
Launch 1 Task 13 of 20 (cn0333): 1
Launch 1 Task 14 of 20 (cn0333): 2
Launch 1 Task 15 of 20 (cn0333): 3
Launch 1 Task 16 of 20 (cn0333): 6
Launch 1 Task 17 of 20 (cn0333): 4
Launch 1 Task 18 of 20 (cn0333): 5
Launch 1 Task 19 of 20 (cn0333): 7
Launch 1 Task 00 of 20 (cn0331): 0
Launch 1 Task 05 of 20 (cn0331): 7
Launch 1 Task 06 of 20 (cn0331): 6
Launch 1 Task 07 of 20 (cn0331): 4
Launch 1 Task 08 of 20 (cn0331): 8
Launch 1 Task 10 of 20 (cn0331): 9
But it fails as soon as I switch to OpenMPI 1.7a1r26338:
$ module load openmpi_1.7a1r26338
$ mpirun -V
mpirun (Open MPI) 1.7a1r26338
Report bugs to http://www.open-mpi.org/community/help/
$ unset OMPI_MCA_mtl OMPI_MCA_pml
$ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 12 of 20 (cn0333): 0-23
Launch 1 Task 13 of 20 (cn0333): 0-23
Launch 1 Task 14 of 20 (cn0333): 0-23
Launch 1 Task 15 of 20 (cn0333): 0-23
Launch 1 Task 16 of 20 (cn0333): 0-23
Launch 1 Task 17 of 20 (cn0333): 0-23
Launch 1 Task 18 of 20 (cn0333): 0-23
Launch 1 Task 19 of 20 (cn0333): 0-23
Launch 1 Task 07 of 20 (cn0331): 0-23
Launch 1 Task 08 of 20 (cn0331): 0-23
Launch 1 Task 09 of 20 (cn0331): 0-23
Launch 1 Task 10 of 20 (cn0331): 0-23
Launch 1 Task 11 of 20 (cn0331): 0-23
Launch 1 Task 00 of 20 (cn0331): 0-23
Launch 1 Task 01 of 20 (cn0331): 0-23
Launch 1 Task 02 of 20 (cn0331): 0-23
Launch 1 Task 03 of 20 (cn0331): 0-23
Launch 1 Task 04 of 20 (cn0331): 0-23
Launch 1 Task 05 of 20 (cn0331): 0-23
Launch 1 Task 06 of 20 (cn0331): 0-23
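For reference, the allowed-CPU list printed on each line can be reproduced without MPI. The following is a minimal sketch, assuming Linux and Python 3; it is not the source of get-allowed-cpu-ompi, and the helper names are hypothetical. It reads the calling process's affinity mask and prints it as a compact range list such as 0-23.

```python
import os

def format_cpu_list(cpus):
    """Render CPU ids as a compact range list, e.g. {0, 1, 2, 5} -> '0-2,5'."""
    parts = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = cpu
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)

def allowed_cpus():
    """Return the set of CPUs the current process may run on (Linux only)."""
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    print(format_cpu_list(allowed_cpus()))
```

Run under srun or mpirun, a process bound to one core would print a single id, while an unbound process on these 24-core nodes would print 0-23.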
Using srun in the OpenMPI 1.4.3 environment fails with the following error:
Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in
the environment).
[...]
With OpenMPI 1.7a1r26338, srun gives the same binding result as mpirun (every task is allowed on CPUs 0-23), although each task reports itself as task 00 of 01:
$ module load openmpi_1.7a1r26338
$ srun get-allowed-cpu-ompi 1
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
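To compare runs like the ones above quickly, the output can be summarized per host. This is a minimal sketch, assuming the "Launch N Task X of Y (host): cpus" line format shown in this message; the function names are hypothetical. A task is counted as unbound when its allowed list contains more than one CPU.

```python
import re
from collections import Counter

# Matches lines such as: "Launch 1 Task 12 of 20 (cn0333): 0-23"
LINE_RE = re.compile(r"Launch (\d+) Task (\d+) of (\d+) \((\S+)\): (\S+)")

def count_cpus(cpu_list):
    """Count CPUs in a range list such as '0-23' or '0-2,5'."""
    total = 0
    for part in cpu_list.split(","):
        lo, _, hi = part.partition("-")
        total += int(hi or lo) - int(lo) + 1
    return total

def unbound_tasks_per_host(output):
    """Return {host: number of tasks allowed on more than one CPU}."""
    unbound = Counter()
    for line in output.splitlines():
        m = LINE_RE.search(line)
        if m and count_cpus(m.group(5)) > 1:
            unbound[m.group(4)] += 1
    return dict(unbound)

if __name__ == "__main__":
    import sys
    # Example: mpirun get-allowed-cpu-ompi 1 | python3 this_script.py
    print(unbound_tasks_per_host(sys.stdin.read()))
```

An empty result means every task was restricted to a single CPU, as in the 1.4.3 run.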
Regards,
--
Rémi Palancher
http://rezib.org