Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: tmishima_at_[hidden]
Date: 2013-11-26 05:24:55


Hi,

I used interactive mode just because it was easy to report the behavior.
I'm sure that submitting a batch job gives the same result.
Therefore, I think the environment variables are also set in that session.

Anyway, I'm away from the cluster now, so regarding "$ env | grep PBS",
I'll send the output later.
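
In the meantime, below is a small standalone check I plan to run inside
such a session. It only prints a few standard Torque/PBS variables
(PBS_ENVIRONMENT, PBS_JOBID, PBS_NODEFILE, PBS_O_WORKDIR); whether Open
MPI's tm components consult exactly these variables is only my assumption,
so please take it as a rough sketch.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Standard Torque/PBS environment variables; which of these Open MPI
     * actually looks at is an assumption on my side. */
    const char *vars[] = { "PBS_ENVIRONMENT", "PBS_JOBID",
                           "PBS_NODEFILE", "PBS_O_WORKDIR" };
    for (int i = 0; i < 4; i++) {
        const char *value = getenv(vars[i]);
        printf("%s=%s\n", vars[i], value ? value : "(not set)");
    }
    return 0;
}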

Regards,
Tetsuya Mishima

> Hi,
>
> On 26.11.2013, at 01:22, tmishima_at_[hidden] wrote:
>
> > Thank you very much for your quick response.
> >
> > I'm afraid to say that I found one more issue...
> >
> > It's not so serious. Please check it when you have some spare time.
> >
> > The problem is -cpus-per-proc combined with the -map-by option under the
> > Torque manager. It doesn't work, as shown below. I guess you would see
> > the same behaviour under the Slurm manager.
> >
> > Of course, if I remove the -map-by option, it works quite well.
> >
> > [mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
> > qsub: waiting for job 8116.manage.cluster to start
> > qsub: job 8116.manage.cluster ready
>
> Are the environment variables of Torque also set in an interactive
> session? What is the output of:
>
> $ env | grep PBS
>
> inside such a session.
>
> -- Reuti
>
>
> > [mishima_at_node03 ~]$ cd ~/Ducom/testbed2
> > [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> >
> > --------------------------------------------------------------------------
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> > Bind to: CORE
> > Node: node03
> > #processes: 2
> > #cpus: 1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> >
> > --------------------------------------------------------------------------
> >
> >
> > [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> > [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Fixed and scheduled to move to 1.7.4. Thanks again!
> >>
> >>
> >> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>
> >> Thanks! That's precisely where I was going to look when I had time :-)
> >>
> >> I'll update tomorrow.
> >> Ralph
> >>
> >>
> >>
> >>
> >> On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]> wrote:
> >>
> >>
> >> Hi Ralph,
> >>
> >> This is a continuation of "Segmentation fault in oob_tcp.c of
> >> openmpi-1.7.4a1r29646".
> >>
> >> I found the cause.
> >>
> >> First, I noticed that your hostfile works and mine does not.
> >>
> >> Your host file:
> >> cat hosts
> >> bend001 slots=12
> >>
> >> My host file:
> >> cat hosts
> >> node08
> >> node08
> >> ...(total 8 lines)
> >>
> >> I modified my script file to add "slots=1" to each line of my hostfile
> >> just before launching mpirun. Then it worked.
> >>
> >> My host file(modified):
> >> cat hosts
> >> node08 slots=1
> >> node08 slots=1
> >> ...(total 8 lines)
> >>
> >> Secondly, I confirmed that there is a slight difference between
> >> orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646.
> >>
> >> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >> 394,401c394,399
> >> <     if (got_count) {
> >> <         node->slots_given = true;
> >> <     } else if (got_max) {
> >> <         node->slots = node->slots_max;
> >> <         node->slots_given = true;
> >> <     } else {
> >> <         /* should be set by obj_new, but just to be clear */
> >> <         node->slots_given = false;
> >> ---
> >> >     if (!got_count) {
> >> >         if (got_max) {
> >> >             node->slots = node->slots_max;
> >> >         } else {
> >> >             ++node->slots;
> >> >         }
> >> ....
> >>
> >> Finally, as a tentative trial, I added line 402 below. Then it worked.
> >>
> >> cat -n orte/util/hostfile/hostfile.c:
> >> ...
> >>    394      if (got_count) {
> >>    395          node->slots_given = true;
> >>    396      } else if (got_max) {
> >>    397          node->slots = node->slots_max;
> >>    398          node->slots_given = true;
> >>    399      } else {
> >>    400          /* should be set by obj_new, but just to be clear */
> >>    401          node->slots_given = false;
> >>    402          ++node->slots;   /* added by tmishima */
> >>    403      }
> >> ...
> >>
> >> Please fix the problem properly, because my change is just based on a
> >> rough guess. It's related to the treatment of a hostfile in which no
> >> slots information is given.
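> >>
> >> To spell out the behaviour I expect (a minimal sketch only, not the
> >> actual orte code path; the struct below is just a stand-in whose field
> >> names mirror the snippet above):
> >>
> >> #include <stdbool.h>
> >> #include <stdio.h>
> >>
> >> /* Minimal stand-in for the node object handled in hostfile.c. */
> >> struct node_entry {
> >>     int  slots;        /* usable slots counted so far      */
> >>     int  slots_max;    /* value from "max-slots=N", if any */
> >>     bool slots_given;  /* true when a count was explicit   */
> >> };
> >>
> >> /* What I expect one hostfile line to contribute to its node entry. */
> >> static void finish_entry(struct node_entry *node,
> >>                          bool got_count, bool got_max)
> >> {
> >>     if (got_count) {
> >>         node->slots_given = true;        /* explicit "slots=N"        */
> >>     } else if (got_max) {
> >>         node->slots = node->slots_max;   /* only "max-slots=N" given  */
> >>         node->slots_given = true;
> >>     } else {
> >>         node->slots_given = false;       /* no count in the hostfile  */
> >>         ++node->slots;                   /* bare hostname adds a slot */
> >>     }
> >> }
> >>
> >> int main(void)
> >> {
> >>     /* My hostfile lists "node08" eight times with no slots count,
> >>      * so I expect its entry to end up with 8 slots. */
> >>     struct node_entry node08 = { 0, 0, false };
> >>     for (int i = 0; i < 8; i++) {
> >>         finish_entry(&node08, false, false);
> >>     }
> >>     printf("node08 slots = %d\n", node08.slots);   /* expect 8 */
> >>     return 0;
> >> }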
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users