Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-11-17 21:11:59


Thanks! That's precisely where I was going to look when I had time :-)

I'll update tomorrow.
Ralph

On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]> wrote:

>
>
> Hi Ralph,
>
> This is a continuation of the thread "Segmentation fault in oob_tcp.c of
> openmpi-1.7.4a1r29646".
>
> I found the cause.
>
> First, I noticed that your hostfile works and mine does not.
>
> Your host file:
> cat hosts
> bend001 slots=12
>
> My host file:
> cat hosts
> node08
> node08
> ...(total 8 lines)
>
> I modified my script file to add "slots=1" to each line of my hostfile
> just before launching mpirun. Then it worked.
>
> My host file (modified):
> cat hosts
> node08 slots=1
> node08 slots=1
> ...(total 8 lines)
>
> Second, I confirmed that there is a slight difference between
> orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646.
>
> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> 394,401c394,399
> < if (got_count) {
> <     node->slots_given = true;
> < } else if (got_max) {
> <     node->slots = node->slots_max;
> <     node->slots_given = true;
> < } else {
> <     /* should be set by obj_new, but just to be clear */
> <     node->slots_given = false;
> ---
> > if (!got_count) {
> >     if (got_max) {
> >         node->slots = node->slots_max;
> >     } else {
> >         ++node->slots;
> >     }
> ....
>
> Finally, I added line 402 below as a tentative fix, and it worked.
>
> cat -n orte/util/hostfile/hostfile.c:
> ...
> 394 if (got_count) {
> 395     node->slots_given = true;
> 396 } else if (got_max) {
> 397     node->slots = node->slots_max;
> 398     node->slots_given = true;
> 399 } else {
> 400     /* should be set by obj_new, but just to be clear */
> 401     node->slots_given = false;
> 402     ++node->slots; /* added by tmishima */
> 403 }
> ...
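
The effect of the two branches quoted above can be illustrated with a small standalone sketch. It is not the Open MPI source: `sketch_node_t`, `update_173`, `update_174` and `slots_after_lines` are hypothetical names, the node is assumed to start zero-initialized, and only the fields visible in the diff are modeled. It applies each branch once per hostfile line that names the same node without any slots information, as in the eight-line `node08` hostfile.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the node object.
 * Only the fields that appear in the quoted diff are modeled. */
typedef struct {
    int slots;
    int slots_max;
    bool slots_given;
} sketch_node_t;

/* Branch as shown for 1.7.3 in the diff. */
static void update_173(sketch_node_t *node, bool got_count, bool got_max)
{
    if (!got_count) {
        if (got_max) {
            node->slots = node->slots_max;
        } else {
            ++node->slots;   /* each bare hostfile line adds one slot */
        }
    }
}

/* Branch as shown for 1.7.4a1r29646.
 * `patched` enables the tentative line 402 from the message. */
static void update_174(sketch_node_t *node, bool got_count, bool got_max,
                       bool patched)
{
    if (got_count) {
        node->slots_given = true;
    } else if (got_max) {
        node->slots = node->slots_max;
        node->slots_given = true;
    } else {
        node->slots_given = false;
        if (patched) {
            ++node->slots;   /* the added line 402 */
        }
    }
}

/* Apply one branch `nlines` times, modeling `nlines` hostfile lines that
 * all name the same node and carry neither a slot count nor a maximum.
 * `version` is 173 or 174 (an illustrative selector, not a real API). */
static int slots_after_lines(int nlines, int version, bool patched)
{
    sketch_node_t node = {0, 0, false};   /* assumed initial state */
    for (int i = 0; i < nlines; ++i) {
        if (version == 173) {
            update_173(&node, false, false);
        } else {
            update_174(&node, false, false, patched);
        }
    }
    return node.slots;
}
```

Under these assumptions, `slots_after_lines(8, 173, false)` and `slots_after_lines(8, 174, true)` both return 8, while `slots_after_lines(8, 174, false)` returns 0, because the unpatched final branch never touches the slot count. Whether that is the actual trigger of the segmentation fault is not established by this sketch.
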
>
> Please fix the problem properly, since my change is only a guess.
> The problem is related to how a hostfile is handled when no slots
> information is given.
>
> Regards,
> Tetsuya Mishima
>