Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-02-06 07:29:36


Hi

Thank you very much for your answer. I compiled your program
and get different behaviour with openmpi-1.6.4rc3 and openmpi-1.9.

> On 02/05/13 00:30, Siegmar Gross wrote:
> >
> > now I can use all our machines once more. I have a problem on
> > Solaris 10 x86_64, because the mapping of processes doesn't
> > correspond to the rankfile. I removed the output from "hostfile"
> > and wrapped around long lines.
> >
> > tyr rankfiles 114 cat rf_ex_sunpc
> > # mpiexec -report-bindings -rf rf_ex_sunpc hostname
> >
> > rank 0=sunpc0 slot=0:0-1,1:0-1
> > rank 1=sunpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=sunpc1 slot=1:1
> >
> >
> > tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> > [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
> > [B B][. .] (slot list 0:0-1)
> > [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> > [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>
> A few comments.
>
> First of all, the heterogeneous environment had nothing to do
> with this (as you have just confirmed). You can reproduce the
> problem like this:
>
> % cat myrankfile
> rank 0=mynode slot=0:1
> % mpirun --report-bindings --rankfile myrankfile hostname
> [mynode:5150] MCW rank 0 bound to socket 0[core 0-3]:
> [B B B B] (slot list 0:1)
>
> Anyhow, that's water under the bridge at this point.
>
> Next, and you might already know this, you can't bind arbitrarily
> on Solaris. You have to bind to a locality group (lgroup) or an
> individual core. Sorry if that's repeating something you already
> knew. Anyhow, your problem cases are when binding to a single
> core. So, you're all right (and OMPI isn't).
>
> Finally, you can check the actual binding like this:
>
> % cat check.c
> #include <sys/types.h>
> #include <sys/processor.h>
> #include <sys/procset.h>
> #include <stdio.h>
>
> int main(int argc, char **argv) {
>   processorid_t obind;
>   /* PBIND_QUERY reports the current binding without changing it. */
>   if ( processor_bind(P_PID, P_MYID, PBIND_QUERY, &obind) != 0 ) {
>     printf("ERROR\n");
>   } else {
>     if ( obind == PBIND_NONE ) printf("unbound\n");
>     else printf("bind to %d\n", obind);
>   }
>   return 0;
> }
> % cc check.c
> % mpirun --report-bindings --rankfile myrankfile ./a.out
>
> I can reproduce your problem on my Solaris 11 machine (rankfile
> specifies a particular core but --report-bindings shows binding to
> entire node), but the test program shows binding to the core I
> specified.
>
> So, the problem is in --report-bindings? I'll poke around some.

sunpc1 rankfiles 103 cat myrankfile
rank 0=sunpc1 slot=0:1
sunpc1 rankfiles 104 cat myrankfile_0
rank 0=sunpc1 slot=0:0

I get the following output for openmpi-1.6.4rc3 (more or less
the same for both rankfiles).

sunpc1 rankfiles 105 ompi_info | grep "MPI:"
                Open MPI: 1.6.4rc3r27923
sunpc1 rankfiles 106 mpirun --report-bindings \
  --rankfile myrankfile ./a.out
bind to 1
[sunpc1:26472] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:1)

sunpc1 rankfiles 107 mpirun --report-bindings \
  --rankfile myrankfile_0 ./a.out
[sunpc1:26484] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0)
bind to 0
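
The mismatch in the 1.6.4rc3 output can also be checked mechanically: the
slot list names one core, while the reported mask marks four. Below is a
minimal sketch in C, assuming only the two slot forms that appear in this
thread ("socket:core" and "socket:first-last"). The function names
count_bound_marks and count_slot_cores are made up for illustration and are
not part of Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not an Open MPI function): count the 'B' marks
 * in a --report-bindings mask such as "[B B][. .]" or "[B/.][./.]". */
int count_bound_marks(const char *mask)
{
    int n = 0;
    assert(mask != NULL);
    for (const char *p = mask; *p != '\0'; ++p) {
        if (*p == 'B') {
            ++n;
        }
    }
    return n;
}

/* Hypothetical helper: count the cores named by a slot list made of
 * "socket:core" or "socket:first-last" entries separated by commas.
 * Any other form is outside this sketch and returns -1. */
int count_slot_cores(const char *slots)
{
    int total = 0;
    const char *p = slots;
    assert(slots != NULL);
    if (*p == '\0') {
        return -1;
    }
    while (*p != '\0') {
        int socket = 0, first = 0, last = 0, used = 0;
        if (sscanf(p, "%d:%d-%d%n", &socket, &first, &last, &used) == 3
            && used > 0) {
            /* Range entry, e.g. "0:0-1". */
            if (socket < 0 || first < 0 || last < first) {
                return -1;
            }
            total += last - first + 1;
        } else if (sscanf(p, "%d:%d%n", &socket, &first, &used) == 2
                   && used > 0) {
            /* Single-core entry, e.g. "0:1". */
            if (socket < 0 || first < 0) {
                return -1;
            }
            total += 1;
        } else {
            return -1;
        }
        p += used;
        if (*p == ',') {
            ++p;
            if (*p == '\0') {
                return -1;   /* trailing comma */
            }
        } else if (*p != '\0') {
            return -1;       /* unexpected character after an entry */
        }
    }
    return total;
}
```

For the first run above, count_slot_cores("0:1") gives 1 and
count_bound_marks("[B B][B B]") gives 4, so the report and the rankfile
disagree even though the check program prints "bind to 1".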

I get the following output for openmpi-1.9 (this time the two
rankfiles give different results).

sunpc1 rankfiles 103 ompi_info | grep "MPI:"
                Open MPI: 1.9a1r28035
sunpc1 rankfiles 104 mpirun --report-bindings \
  --rankfile myrankfile ./a.out
[sunpc1:26554] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
unbound

sunpc1 rankfiles 105 mpirun --report-bindings \
  --rankfile myrankfile_0 ./a.out
[sunpc1:26557] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
  [B/.][./.]
bind to 0

sunpc1 rankfiles 107 cd /usr/local/hwloc-1.6.1/bin/
sunpc1 bin 108 ./lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
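
To relate a rankfile slot to the processor id printed by the check program,
the topology above can be written down as a small lookup table. This is only
a sketch: it assumes that the rankfile's socket:core indices follow the
logical numbering that lstopo prints, with the core counted within its
socket, and the name expected_processor is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Physical PU numbers (P#) per socket and core, copied from the lstopo
 * output of sunpc1: 2 sockets with 2 cores each. */
static const int sunpc1_pu[2][2] = {
    { 0, 1 },  /* Socket L#0: Core L#0 -> P#0, Core L#1 -> P#1 */
    { 2, 3 }   /* Socket L#1: Core L#2 -> P#2, Core L#3 -> P#3 */
};

/* Hypothetical helper: processor id expected for a rankfile slot of the
 * form "socket:core" on sunpc1, or -1 if the slot is malformed or out of
 * range. Assumption: core is counted within its socket. */
int expected_processor(const char *slot)
{
    int socket = 0, core = 0, used = 0;
    assert(slot != NULL);
    if (sscanf(slot, "%d:%d%n", &socket, &core, &used) != 2
        || slot[used] != '\0') {
        return -1;
    }
    if (socket < 0 || socket > 1 || core < 0 || core > 1) {
        return -1;
    }
    return sunpc1_pu[socket][core];
}
```

Under these assumptions expected_processor("0:1") is 1 and
expected_processor("0:0") is 0, which matches the "bind to 1" and
"bind to 0" lines printed with openmpi-1.6.4rc3.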

Thank you very much for any help in advance.

Kind regards

Siegmar