Open MPI User's Mailing List Archives

Subject: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: tmishima_at_[hidden]
Date: 2013-11-17 21:01:23


Hi Ralph,

This is a continuation of the thread "Segmentation fault in oob_tcp.c of
openmpi-1.7.4a1r29646".

I found the cause.

First, I noticed that your hostfile works and mine does not.

Your host file:
cat hosts
bend001 slots=12

My host file:
cat hosts
node08
node08
...(total 8 lines)

I modified my script file to add "slots=1" to each line of my hostfile
just before launching mpirun. Then it worked.

My host file (modified):
cat hosts
node08 slots=1
node08 slots=1
...(total 8 lines)
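
For illustration, the rewrite applied to each hostfile line can be sketched in C as below. This is only a minimal sketch of the workaround described above, not the actual script; the function name demo_add_slots and its return convention are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy one hostfile line into `out`, appending
 * " slots=1" when the line carries no "slots=" field. The substring test
 * also matches "max_slots=", so such lines are left unchanged as well.
 * Returns 0 on success, -1 if `out` is too small. */
int demo_add_slots(const char *line, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL);

    /* Length of the line without its trailing newline, if any. */
    size_t len = strcspn(line, "\r\n");

    /* Leave empty lines and lines that already mention slots untouched. */
    const char *suffix = " slots=1";
    if (len == 0 || strstr(line, "slots=") != NULL) {
        suffix = "";
    }

    int n = snprintf(out, outsz, "%.*s%s", (int)len, line, suffix);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Called on "node08" this produces "node08 slots=1", while a line such as "bend001 slots=12" is copied through unchanged.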

Second, I confirmed that there is a slight difference between
orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646.

$ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
394,401c394,399
<     if (got_count) {
<         node->slots_given = true;
<     } else if (got_max) {
<         node->slots = node->slots_max;
<         node->slots_given = true;
<     } else {
<         /* should be set by obj_new, but just to be clear */
<         node->slots_given = false;
---
>     if (!got_count) {
>         if (got_max) {
>             node->slots = node->slots_max;
>         } else {
>             ++node->slots;
>         }
....

Finally, as a tentative trial, I added line 402 below. Then it worked.

cat -n orte/util/hostfile/hostfile.c:
   ...
   394      if (got_count) {
   395          node->slots_given = true;
   396      } else if (got_max) {
   397          node->slots = node->slots_max;
   398          node->slots_given = true;
   399      } else {
   400          /* should be set by obj_new, but just to be clear */
   401          node->slots_given = false;
   402          ++node->slots; /* added by tmishima */
   403      }
   ...
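
To make the effect of line 402 concrete, here is a minimal, self-contained C sketch of the branch above. It is not the real ORTE code: demo_node_t, demo_finish_line and demo_count_bare_lines are hypothetical names, the with_increment flag exists only to compare the two variants, and the sketch assumes a new node starts with slots == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the node fields used in the listing above;
 * this is not the real ORTE node type. */
typedef struct {
    int  slots;        /* slots assigned to the node so far */
    int  slots_max;    /* upper limit, if the hostfile gave one */
    bool slots_given;  /* true when the hostfile stated a slot count */
} demo_node_t;

/* Mirrors lines 394-403 of the listing. with_increment selects whether
 * the trial line 402 (++node->slots) is applied. */
void demo_finish_line(demo_node_t *node, bool got_count, bool got_max,
                      bool with_increment)
{
    assert(node != NULL);
    if (got_count) {
        node->slots_given = true;
    } else if (got_max) {
        node->slots = node->slots_max;
        node->slots_given = true;
    } else {
        node->slots_given = false;
        if (with_increment) {
            ++node->slots;  /* the trial change at line 402 */
        }
    }
}

/* Process nlines hostfile lines that name the same node without any
 * slots information and return the resulting slot count.
 * Assumption: the node starts with zero slots. */
int demo_count_bare_lines(int nlines, bool with_increment)
{
    demo_node_t node = { 0, 0, false };
    for (int i = 0; i < nlines; ++i) {
        demo_finish_line(&node, false, false, with_increment);
    }
    return node.slots;
}
```

Under these assumptions, eight bare "node08" lines leave the node with 0 slots when the increment is absent and with 8 slots when it is present, which matches the behavior of the unmodified 1.7.3 code for lines without a slot count.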

Please fix the problem properly; my change is only a guess. The problem
seems to be in how hostfile entries without slots information are handled.

Regards,
Tetsuya Mishima