
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: tmishima_at_[hidden]
Date: 2013-11-25 19:22:24


Hi Ralph,

Thank you very much for your quick response.

I'm afraid I found one more issue.

It's not serious, so please check it when you have time.

The problem is -cpus-per-proc combined with the -map-by option under the
Torque manager. It doesn't work, as shown below. I guess you would see
the same behaviour under the Slurm manager.

Of course, if I remove the -map-by option, it works fine.

[mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
qsub: waiting for job 8116.manage.cluster to start
qsub: job 8116.manage.cluster ready

[mishima_at_node03 ~]$ cd ~/Ducom/testbed2
[mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node: node03
   #processes: 2
   #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
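For reference, the override the error message suggests would look something like the sketch below. This is only an assumption based on the message's "overload-allowed" hint (appended as a qualifier to the -bind-to directive); it has not been verified on this cluster, and it papers over the symptom rather than fixing the mapping bug:

```shell
# Sketch only: allow binding more processes than cpus on the target.
# The core:overload-allowed qualifier is assumed from the error text.
mpirun -np 8 -report-bindings -cpus-per-proc 4 \
       -map-by socket -bind-to core:overload-allowed mPre
```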

[mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
[node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
[./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
[./././././././.][././././B/B/B/B][./././././././.][./././././././.]
[node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
[./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
[./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
[./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
[./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
[B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]

Regards,
Tetsuya Mishima

> Fixed and scheduled to move to 1.7.4. Thanks again!
>
>
> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> Thanks! That's precisely where I was going to look when I had time :-)
>
> I'll update tomorrow.
> Ralph
>
>
>
>
> On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]>wrote:
>
>
> Hi Ralph,
>
> This is the continuous story of "Segmentation fault in oob_tcp.c of
> openmpi-1.7.4a1r29646".
>
> I found the cause.
>
> Firstly, I noticed that your hostfile works and mine does not.
>
> Your host file:
> cat hosts
> bend001 slots=12
>
> My host file:
> cat hosts
> node08
> node08
> ...(total 8 lines)
>
> I modified my script file to add "slots=1" to each line of my hostfile
> just before launching mpirun. Then it worked.
>
> My host file(modified):
> cat hosts
> node08 slots=1
> node08 slots=1
> ...(total 8 lines)
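The edit described above can be scripted just before launching mpirun. A minimal sketch with sed (the file name "hosts" and the node name are taken from the thread; the two-line file here just stands in for the real eight-line one):

```shell
# Append " slots=1" to every line of the hostfile before mpirun runs.
printf 'node08\nnode08\n' > hosts
sed 's/$/ slots=1/' hosts > hosts.with_slots
cat hosts.with_slots
```

Passing hosts.with_slots to mpirun's -hostfile option then reproduces the working case.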
>
> Secondly, I confirmed that there's a slight difference between
> orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646.
>
> $ diff
> hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> 394,401c394,399
> <     if (got_count) {
> <         node->slots_given = true;
> <     } else if (got_max) {
> <         node->slots = node->slots_max;
> <         node->slots_given = true;
> <     } else {
> <         /* should be set by obj_new, but just to be clear */
> <         node->slots_given = false;
> ---
> >     if (!got_count) {
> >         if (got_max) {
> >             node->slots = node->slots_max;
> >         } else {
> >             ++node->slots;
> >         }
> ....
>
> Finally, as a tentative trial, I added line 402 below.
> Then it worked.
>
> cat -n orte/util/hostfile/hostfile.c:
>    ...
>    394      if (got_count) {
>    395          node->slots_given = true;
>    396      } else if (got_max) {
>    397          node->slots = node->slots_max;
>    398          node->slots_given = true;
>    399      } else {
>    400          /* should be set by obj_new, but just to be clear */
>    401          node->slots_given = false;
>    402          ++node->slots; /* added by tmishima */
>    403      }
>    ...
>
> Please fix the problem properly, because my change is just based on a
> guess. It's related to the treatment of hostfiles where slots
> information is not given.
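For what it's worth, the fallback semantics the restored `++node->slots` line provides (each hostfile line without a slots= field contributes one slot to that node) can be illustrated outside Open MPI with a small awk sketch; the file contents piped in here are hypothetical:

```shell
# Count slots the way the 1.7.3 fallback does: one slot per
# occurrence of a node name when no slots= field is given.
printf 'node08\nnode08\nnode08\n' |
awk '{ slots[$1]++ } END { for (n in slots) print n, "slots=" slots[n] }'
```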
>
> Regards,
> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users