
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2013-02-05 16:20:09


On 02/05/13 00:30, Siegmar Gross wrote:
>
> now I can use all our machines once more. I have a problem on
> Solaris 10 x86_64, because the mapping of processes doesn't
> correspond to the rankfile. I removed the output from "hostfile"
> and wrapped around long lines.
>
> tyr rankfiles 114 cat rf_ex_sunpc
> # mpiexec -report-bindings -rf rf_ex_sunpc hostname
>
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
>
> tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1] : [B B][. .] (slot list 0:0-1)
> [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)

A few comments.

First of all, the heterogeneous environment had nothing to do with this (as you have just confirmed). You can reproduce the problem like so:

% cat myrankfile
rank 0=mynode slot=0:1
% mpirun --report-bindings --rankfile myrankfile hostname
[mynode:5150] MCW rank 0 bound to socket 0[core 0-3]: [B B B B] (slot list 0:1)

Anyhow, that's water under the bridge at this point.

Next, and you might already know this, you can't bind arbitrarily on Solaris. You have to bind either to a locality group (lgroup) or to an individual core. Sorry if that's repeating something you already knew. Anyhow, your problem cases arise when binding to a single core. So, you're all right (and OMPI isn't).
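For what it's worth, here is a minimal sketch of what "binding to an individual core" means at the Solaris API level. This is not what OMPI does internally, just an illustration; the core id 1 is an arbitrary assumption for your machine, and lgroup binding would go through the separate lgrp_affinity_set(3LGRP) interface instead:

```c
/* Sketch: bind the calling process to a single core on Solaris.
 * processor_bind(2) takes exactly one processor id, which is why a
 * slot specification must ultimately resolve to one core (or to an
 * lgroup via a different interface). Solaris-only; will not compile
 * elsewhere. */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int main(void) {
  processorid_t target = 1;  /* assumed core id; adjust for your system */
  /* P_PID + P_MYID: apply the binding to this process itself.
     Passing NULL for the last argument means we don't ask for the
     previous binding back. */
  if (processor_bind(P_PID, P_MYID, target, NULL) != 0) {
    perror("processor_bind");
    return 1;
  }
  printf("bound to core %d\n", target);
  return 0;
}
```

Compile with `cc` on the Solaris box; psrinfo(1M) lists the valid processor ids to choose from.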

Finally, you can check the actual binding like so:

% cat check.c
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int main(int argc, char **argv) {
   processorid_t obind;
   /* PBIND_QUERY asks for the current binding rather than setting one. */
   if ( processor_bind(P_PID, P_MYID, PBIND_QUERY, &obind) != 0 ) {
     printf("ERROR\n");
   } else {
     if ( obind == PBIND_NONE ) printf("unbound\n");
     else printf("bound to %d\n", obind);
   }
   return 0;
}
% cc check.c
% mpirun --report-bindings --rankfile myrankfile ./a.out

I can reproduce your problem on my Solaris 11 machine (the rankfile specifies a particular core, but --report-bindings shows binding to the entire node), yet the test program shows binding to the core I specified.

So, the problem is in --report-bindings? I'll poke around some.