Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problems with rankfile in openmpi-1.9a1r29097
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-09-02 11:03:38


Okay, I have a fix for not specifying the number of procs when using a rankfile.

As for the binding pattern, the problem is a syntax error in your rankfile. You need a semi-colon instead of a comma to separate the sockets for rank 0:

> rank 0=bend001 slot=0:0-1,1:0-1 => rank 0=bend001 slot=0:0-1;1:0-1

This is required because you use commas to list specific cores - e.g., slot=0:0,1,4,6
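For example, taking your rf_ex_linpc as a sketch (only the rank 0 line changes; adjust the hostnames as needed), the corrected file would look like this:

rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

Ranks 1-3 stay as they are, since each of them only references a single socket.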

HTH
Ralph

On Sep 2, 2013, at 7:52 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> It seems to run for me on CentOS, though I note that rank 0 isn't bound to both sockets 0 and 1 as specified, and I had to tell it how many procs to run:
>
> [rhc_at_bend001 svn-trunk]$ mpirun --report-bindings -rf rf -n 4 hostname
> [bend001:13297] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../..][../../../../../..]
> bend001
> [bend002:25899] MCW rank 3 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
> bend002
> [bend002:25899] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../..][../../../../../..]
> bend002
> [bend002:25899] MCW rank 2 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
> bend002
>
> [rhc_at_bend001 svn-trunk]$ cat rf
> rank 0=bend001 slot=0:0-1,1:0-1
> rank 1=bend002 slot=0:0-1
> rank 2=bend002 slot=1:0
> rank 3=bend002 slot=1:1
>
> I'll work on those issues, but don't know why you are getting this "not allocated" error.
>
>
> On Sep 2, 2013, at 7:10 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
>> Hi,
>>
>> I installed openmpi-1.9a1r29097 on "openSuSE Linux 12.1", "Solaris 10
>> x86_64", and "Solaris 10 sparc" with "Sun C 5.12" in 64-bit mode.
>> Unfortunately I still have problems with rankfiles, which I already
>> reported in May. I show the problems on Linux, although I see the
>> same problems on all Solaris machines as well.
>>
>> linpc1 rankfiles 99 cat rf_linpc1
>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>> rank 0=linpc1 slot=0:0-1,1:0-1
>>
>> linpc1 rankfiles 100 mpiexec -report-bindings -rf rf_linpc1 hostname
>> [linpc1:23413] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B][./.]
>> linpc1
>>
>>
>> linpc1 rankfiles 101 cat rf_ex_linpc
>> # mpiexec -report-bindings -rf rf_ex_linpc hostname
>> rank 0=linpc0 slot=0:0-1,1:0-1
>> rank 1=linpc1 slot=0:0-1
>> rank 2=linpc1 slot=1:0
>> rank 3=linpc1 slot=1:1
>>
>> linpc1 rankfiles 102 mpiexec -report-bindings -rf rf_ex_linpc hostname
>> --------------------------------------------------------------------------
>> The rankfile that was used claimed that a host was either not
>> allocated or oversubscribed its slots. Please review your rank-slot
>> assignments and your host allocation to ensure a proper match. Also,
>> some systems may require using full hostnames, such as
>> "host1.example.com" (instead of just plain "host1").
>>
>> Host: linpc0
>> --------------------------------------------------------------------------
>> linpc1 rankfiles 103
>>
>>
>>
>> I don't have these problems with openmpi-1.6.5a1r28554.
>>
>> linpc1 rankfiles 95 ompi_info | grep "Open MPI:"
>> Open MPI: 1.6.5a1r28554
>>
>> linpc1 rankfiles 95 mpiexec -report-bindings -rf rf_linpc1 hostname
>> [linpc1:23583] MCW rank 0 bound to socket 0[core 0-1]
>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> linpc1
>>
>>
>> linpc1 rankfiles 96 mpiexec -report-bindings -rf rf_ex_linpc hostname
>> [linpc1:23585] MCW rank 1 bound to socket 0[core 0-1]:
>> [B B][. .] (slot list 0:0-1)
>> [linpc1:23585] MCW rank 2 bound to socket 1[core 0]:
>> [. .][B .] (slot list 1:0)
>> [linpc1:23585] MCW rank 3 bound to socket 1[core 1]:
>> [. .][. B] (slot list 1:1)
>> linpc1
>> linpc1
>> linpc1
>> [linpc0:10422] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>> [B B][B B] (slot list 0:0-1,1:0-1)
>> linpc0
>>
>>
>> I would be grateful if somebody could fix the problem. Thank you
>> very much in advance for any help.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>