Open MPI User's Mailing List Archives

Subject: [OMPI users] map-by node with openmpi-1.7.5a1
From: tmishima_at_[hidden]
Date: 2014-02-18 01:42:17


Hi Ralph,

I ran an overall verification of rr_mapper and found another problem,
this time with "map-by node". As far as I checked, "map-by obj" works
fine for every object other than node. I do not use "map-by node"
myself, but I'd like to report it to help improve the reliability of
1.7.5. It seems too difficult for me to fix, so I hope you can take a
look.

The problem occurs when I mix two kinds of nodes, even though I add
"-hetero-nodes" to the command line:

[mishima_at_manage work]$ cat pbs_hosts
node04 slots=8
node05 slots=2
node06 slots=2

[mishima_at_manage work]$ mpirun -np 12 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
[manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file rmaps_rr.c at line 241
[manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file base/rmaps_base_map_job.c at line 285
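
For comparison, here is a minimal Python sketch of the distribution one might expect from a by-node round-robin that respects the slot counts in the hostfile above. It is only an illustration under that assumption, not the logic in rmaps_rr.c; the function name map_by_node is hypothetical.

```python
def map_by_node(num_procs, slots):
    """Place ranks on nodes round-robin, skipping nodes whose slots are full.

    Illustrative model only: `slots` maps node name -> slot count, in
    hostfile order. This is NOT Open MPI's actual mapper code.
    """
    if num_procs > sum(slots.values()):
        raise ValueError("more procs requested than slots available")

    placement = {node: [] for node in slots}
    rank = 0
    while rank < num_procs:
        # One pass over the nodes hands out at most one rank per node.
        for node, capacity in slots.items():
            if rank == num_procs:
                break
            if len(placement[node]) < capacity:
                placement[node].append(rank)
                rank += 1
    return placement


# Slot counts copied from pbs_hosts.
slots = {"node04": 8, "node05": 2, "node06": 2}

for node, ranks in map_by_node(12, slots).items():
    print(node, ranks)
# node04 [0, 3, 6, 7, 8, 9, 10, 11]
# node05 [1, 4]
# node06 [2, 5]
```

Under this model, 12 procs fit exactly into the 8 + 2 + 2 slots, so the fatal error for "-np 12" would not be expected from the slot counts alone.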

With "-np 11" it runs, but rank 10 is bound to the wrong core (socket 0,
core 0 of node04, which is already used by rank 0). I guess something is
wrong with the handling of differing topologies when "map-by node" is
specified. In addition, the calculation that assigns procs to each node
has some problems: in the output below, node05 (slots=2) receives four
procs while node04 (slots=8) receives only five:

[mishima_at_manage work]$ mpirun -np 11 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
[node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
[node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
[node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
Hello world from process 4 of 11
Hello world from process 7 of 11
Hello world from process 6 of 11
Hello world from process 3 of 11
Hello world from process 0 of 11
Hello world from process 8 of 11
Hello world from process 2 of 11
Hello world from process 5 of 11
Hello world from process 9 of 11
Hello world from process 1 of 11
Hello world from process 10 of 11
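
The duplicate binding described above (ranks 0 and 10 on the same core of node04) can also be spotted mechanically. Below is a minimal Python sketch that scans "-report-bindings" lines and lists any core claimed by more than one rank. It assumes the single-core line format shown in this message; the helper name find_shared_cores and the regular expression are illustrative, not part of Open MPI.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]: ...
# Assumption: each rank is bound to exactly one core, as in this report.
BINDING_RE = re.compile(
    r"\[(?P<host>[^:\]]+):\d+\] MCW rank (?P<rank>\d+) bound to "
    r"socket (?P<socket>\d+)\[core (?P<core>\d+)\["
)


def find_shared_cores(report_text):
    """Return {(host, socket, core): [ranks]} for cores bound by 2+ ranks."""
    owners = defaultdict(list)
    for match in BINDING_RE.finditer(report_text):
        key = (match["host"], int(match["socket"]), int(match["core"]))
        owners[key].append(int(match["rank"]))
    return {key: sorted(ranks) for key, ranks in owners.items() if len(ranks) > 1}


# A few lines taken from the output above (binding masks omitted for brevity).
sample = """\
[node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]:
[node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]:
[node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
"""

print(find_shared_cores(sample))
# {('node04.cluster', 0, 0): [0, 10]}
```

Cores are keyed by host as well as socket and core number, so rank 1 on core 0 of node05 is not reported as a conflict with rank 0 on node04.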

Regards,
Tetsuya Mishima