
Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-02-08 03:57:23


Hi

Today I tried a different rankfile and ran into a problem once more. :-((

> > thank you very much for your patch. I have applied the patch to
> > openmpi-1.6.4rc4.
> >
> > Open MPI: 1.6.4rc4r28022
> > : [B .][. .] (slot list 0:0)
> > : [. B][. .] (slot list 0:1)
> > : [B B][. .] (slot list 0:0-1)
> > : [. .][B .] (slot list 1:0)
> > : [. .][. B] (slot list 1:1)
> > : [. .][B B] (slot list 1:0-1)
> > : [B B][B B] (slot list 0:0-1,1:0-1)
>
> That looks great. I'll file a CMR to get this patch into 1.6.
> Unless you indicate otherwise, I'll assume this issue is understood
> for 1.6.

Rankfile rf_6 is the same as last time. In rf_7 I have added one more
line (rank 1 on sunpc0), and in rf_8 I have swapped the order of the
hosts. Everything is still fine with rf_6, but with rf_7 I get no
output for rank 1, and with rf_8 I get an error. Both machines use
the same hardware.

sunpc1 rankfiles 106 cat rf_6
# mpiexec -report-bindings -rf rf_6 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1

sunpc1 rankfiles 107 cat rf_7
# mpiexec -report-bindings -rf rf_7 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1
rank 1=sunpc0 slot=0:0-1

sunpc1 rankfiles 108 cat rf_8
# mpiexec -report-bindings -rf rf_8 hostname
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
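
For reference, this is how I read the slot lists in these rankfiles
(my annotations only; the binding that is reported for rf_6 below
matches this reading):

# rank <N>=<host> slot=<socket>:<cores>[,<socket>:<cores>,...]
rank 0=sunpc1 slot=0:0-1,1:0-1   # socket 0 cores 0-1 and socket 1 cores 0-1
rank 1=sunpc0 slot=0:0-1         # socket 0 cores 0-1 only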

sunpc1 rankfiles 109 mpiexec -report-bindings -rf rf_6 hostname
[sunpc1:09779] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 110 mpiexec -report-bindings -rf rf_7 hostname
[sunpc1:09782] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 111 mpiexec -report-bindings -rf rf_8 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc0
--------------------------------------------------------------------------

I get the following output if I use sunpc0 as the local host.

sunpc0 rankfiles 102 mpiexec -report-bindings -rf rf_6 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

sunpc0 rankfiles 103 mpiexec -report-bindings -rf rf_7 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1
--------------------------------------------------------------------------

sunpc0 rankfiles 104 mpiexec -report-bindings -rf rf_8 hostname
[sunpc0:19027] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

I get the following output if I use tyr as the local host.

tyr rankfiles 218 mpiexec -report-bindings -rf rf_6 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 219 mpiexec -report-bindings -rf rf_7 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 220 mpiexec -report-bindings -rf rf_8 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
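
The rankfile error messages for rf_8 on sunpc1 and rf_7 on sunpc0
mention that some systems may require full hostnames. In case that is
relevant, these are the checks I would run on each machine to see how
the names resolve (a sketch only, I have not included the output here;
getent assumes the usual name service lookup on these machines):

  hostname               # name this node reports for itself
  getent hosts sunpc0    # how this node resolves sunpc0
  getent hosts sunpc1    # how this node resolves sunpc1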

Do you have any idea why this happens? Thank you very much in advance
for any help.

Kind regards

Siegmar