Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] now 1.9 [was: I have still a problem with rankfiles in openmpi-1.6.4rc3]
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-02-07 02:54:21


Hi

> > thank you very much for your answer. I have compiled your program
> > and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.
> >
> > I get the following output for openmpi-1.9 (different outputs !!!).
> >
> > sunpc1 rankfiles 104 mpirun --report-bindings --rankfile myrankfile
> > ./a.out
> > [sunpc1:26554] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> > socket 0[core 1[hwt 0]]: [B/B][./.]
> > unbound
> >
> > sunpc1 rankfiles 105 mpirun --report-bindings --rankfile myrankfile_0
> > ./a.out
> > [sunpc1:26557] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
> > bind to 0
>
> I think what's happening is that although you specified "0:0" or "0:1"
> in the rankfile, the string "0,0" or "0,1" is getting passed
> in (at least in the runs I looked at). That colon became a comma.
> So, it's just by accident that myrankfile_0 is working out all
> right.

It works for 0:0 and 1:1, but not for 0:1 and 1:0: in those two
cases the process is bound to both cores of the socket and my
program reports "unbound". The machine is a Sun Ultra 40, by the way.

sunpc1 rankfiles 104 ompi_info | grep "MPI:"
                Open MPI: 1.9a1r28035
sunpc1 rankfiles 105 cat myrankfile_*
rank 0=sunpc1 slot=0:0
rank 0=sunpc1 slot=0:1
rank 0=sunpc1 slot=1:0
rank 0=sunpc1 slot=1:1
sunpc1 rankfiles 106 cc check.c
sunpc1 rankfiles 107 mpirun --report-bindings \
  --rankfile myrankfile_0 ./a.out
bind to 0
[sunpc1:26988] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
  [B/.][./.]

sunpc1 rankfiles 108 mpirun --report-bindings \
  --rankfile myrankfile_1 ./a.out
[sunpc1:26991] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
unbound

sunpc1 rankfiles 109 mpirun --report-bindings \
  --rankfile myrankfile_2 ./a.out
[sunpc1:26994] MCW rank 0 bound to socket 1[core 2[hwt 0]],
  socket 1[core 3[hwt 0]]: [./.][B/B]
unbound

sunpc1 rankfiles 110 mpirun --report-bindings \
  --rankfile myrankfile_3 ./a.out
[sunpc1:26997] MCW rank 0 bound to socket 1[core 3[hwt 0]]:
  [./.][./B]
bind to 3
sunpc1 rankfiles 111
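
To compare the masks printed by --report-bindings above, here is a
minimal sketch in C that counts the bound cores in such a mask. The
function name count_bound_cores is hypothetical (it is not part of
Open MPI or of check.c), and the mask layout is an assumption read off
the output above: one [...] group per socket, cores separated by '/',
one character per hardware thread, 'B' for bound and '.' for not bound.

```c
/* Hypothetical helper, not part of Open MPI or check.c.
 * Counts the cores marked as bound in a --report-bindings mask such
 * as "[B/B][./.]". Assumed layout: one [...] group per socket, cores
 * separated by '/', one character per hardware thread, 'B' = bound. */
int count_bound_cores(const char *mask)
{
    int bound = 0;       /* cores with at least one bound hardware thread */
    int core_has_b = 0;  /* has the current core shown a 'B' so far? */

    for (const char *p = mask; *p != '\0'; ++p) {
        switch (*p) {
        case 'B':
            core_has_b = 1;
            break;
        case '/':        /* end of a core inside a socket */
        case ']':        /* end of a socket, and of its last core */
            bound += core_has_b;
            core_has_b = 0;
            break;
        default:         /* '[' or '.': nothing to count */
            break;
        }
    }
    return bound;
}
```

With the masks from the four runs above, myrankfile_0 and myrankfile_3
give one bound core each, while myrankfile_1 and myrankfile_2 give two,
although each rankfile names a single core.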

> Could someone who knows the code better than I do help me narrow this
> down? E.g., where is the rankfile parsed? For what it's worth, by the
> time mpirun reaches orte_odls_base_default_get_add_procs_data(),
> orte_job_data already contains the corrupted cpu_bitmap string.

Thank you very much for any help in advance.

Kind regards

Siegmar