Fixed and scheduled to move to 1.7.4. Thanks again!


On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc@open-mpi.org> wrote:

Thanks! That's precisely where I was going to look when I had time :-)

I'll update tomorrow.
Ralph




On Sun, Nov 17, 2013 at 7:01 PM, <tmishima@jcity.maeda.co.jp> wrote:


Hi Ralph,

This is a continuation of "Segmentation fault in oob_tcp.c of
openmpi-1.7.4a1r29646".

I found the cause.

First, I noticed that your hostfile works and mine does not.

Your host file:
cat hosts
bend001 slots=12

My host file:
cat hosts
node08
node08
...(total 8 lines)

I modified my script file to add "slots=1" to each line of my hostfile
just before launching mpirun. Then it worked.

My host file (modified):
cat hosts
node08 slots=1
node08 slots=1
...(total 8 lines)

Second, I confirmed that there is a slight difference between
orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646.

$ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
394,401c394,399
<     if (got_count) {
<         node->slots_given = true;
<     } else if (got_max) {
<         node->slots = node->slots_max;
<         node->slots_given = true;
<     } else {
<         /* should be set by obj_new, but just to be clear */
<         node->slots_given = false;
---
>     if (!got_count) {
>         if (got_max) {
>             node->slots = node->slots_max;
>         } else {
>             ++node->slots;
>         }
....

Finally, I added line 402 below as a tentative fix, and then it worked.

cat -n orte/util/hostfile/hostfile.c:
   ...
   394      if (got_count) {
   395          node->slots_given = true;
   396      } else if (got_max) {
   397          node->slots = node->slots_max;
   398          node->slots_given = true;
   399      } else {
   400          /* should be set by obj_new, but just to be clear */
   401          node->slots_given = false;
   402          ++node->slots; /* added by tmishima */
   403      }
   ...

Please fix the problem properly, because my change is only a guess.
The problem is related to how a hostfile is handled when no slots
information is given.
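
To illustrate the suspected behaviour, here is a minimal standalone
sketch of the branch at lines 394-403 with the tentative increment
applied. It is not the actual ORTE code: sketch_node_t,
sketch_apply_line and sketch_slots_for_bare_lines are hypothetical
names, and the way an explicit "slots=N" value is accumulated is an
assumption made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the node object; not the real ORTE type. */
typedef struct {
    int  slots;        /* slots accumulated so far */
    int  slots_max;    /* upper bound, if one was given */
    bool slots_given;  /* true when the hostfile stated a count */
} sketch_node_t;

/* Apply one hostfile line to a node, mirroring lines 394-403.
 * Assumption: an explicit count is simply added to node->slots. */
static void sketch_apply_line(sketch_node_t *node,
                              bool got_count, int count,
                              bool got_max, int max)
{
    assert(node != NULL);
    if (got_max) {
        node->slots_max = max;
    }
    if (got_count) {
        node->slots += count;
        node->slots_given = true;
    } else if (got_max) {
        node->slots = node->slots_max;
        node->slots_given = true;
    } else {
        node->slots_given = false;
        ++node->slots;  /* the tentative line 402: one slot per bare line */
    }
}

/* Count the slots a node ends up with when it is listed nlines times
 * with no slots information at all. */
static int sketch_slots_for_bare_lines(int nlines)
{
    sketch_node_t node = {0, 0, false};
    for (int i = 0; i < nlines; ++i) {
        sketch_apply_line(&node, false, 0, false, 0);
    }
    return node.slots;
}
```

With this sketch, sketch_slots_for_bare_lines(8) models my hostfile
(node08 listed 8 times) and yields one slot per line. Without the
increment in the last branch, the same input would leave the node
with zero slots.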

Regards,
Tetsuya Mishima
