Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] map-by node with openmpi-1.7.5a1
From: tmishima_at_[hidden]
Date: 2014-02-19 08:00:05


Hi Ralph, I've found the fix. Please check the attached
patch file.

At the moment, the nodes in the hostfile have to be listed in
ascending order of slot count when we use "map-by node" or
"map-by obj:span".

The problem is that the hostfile created by Torque in our
cluster always lists allocated nodes in descending order...
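
In the meantime, a possible workaround (just a rough sketch, assuming
hostfile lines of the form "nodeXX slots=N" as in the example quoted
below) is to re-sort the Torque-generated hostfile into ascending slot
order before passing it to mpirun:

  # sort numerically on the number after "slots=" (fewest slots first)
  sort -t'=' -k2,2n pbs_hosts > pbs_hosts_sorted
  mpirun -np 12 -machinefile pbs_hosts_sorted -map-by node \
      -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog

This only hides the ordering problem, of course; the real fix is in the
attached patch.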

Regards,
Tetsuya Mishima

(See attached file: patch.rr)

> Hi Ralph,
>
> I did an overall verification of rr_mapper and found another problem
> with "map-by node". As far as I checked, "map-by obj" with objects other
> than node worked fine. I do not use "map-by node" myself, but I'd like
> to report it to improve the reliability of 1.7.5. It seems too difficult
> for me to resolve, so I hope you can take a look.
>
> The problem occurs when I mix two kinds of nodes, even though I add
> "-hetero-nodes" to the command line:
>
> [mishima_at_manage work]$ cat pbs_hosts
> node04 slots=8
> node05 slots=2
> node06 slots=2
>
> [mishima_at_manage work]$ mpirun -np 12 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file
> rmaps_rr.c at line 241
> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file
> base/rmaps_base_map_job.c at line 285
>
> With "-np 11", it works. But rank 10 is bound to the wrong core (which is
> already used by rank 0). I guess something is wrong with the handling of
> different topology when "map-by node" is specified. In addition, the
> calculation of assigning procs to each node has some problems:
>
> [mishima_at_manage work]$ mpirun -np 11 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
> [node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
> [node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
> [node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
> [node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
> [node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
> [node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
> [node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
> Hello world from process 4 of 11
> Hello world from process 7 of 11
> Hello world from process 6 of 11
> Hello world from process 3 of 11
> Hello world from process 0 of 11
> Hello world from process 8 of 11
> Hello world from process 2 of 11
> Hello world from process 5 of 11
> Hello world from process 9 of 11
> Hello world from process 1 of 11
> Hello world from process 10 of 11
>
> Regards,
> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users


  • application/octet-stream attachment: patch.rr