
Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-02-05 11:14:09


Siegmar --

We've been talking about this offline. Can you send us an lstopo output from your Solaris machine? Send us the text output and the xml output, e.g.:

lstopo > solaris.txt      # text rendering of the topology
lstopo solaris.xml        # lstopo infers the XML format from the .xml file extension
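
If it's not too much trouble, the native Solaris view of the socket/core
layout would be a useful cross-check as well. This is only a suggestion; it
assumes the stock Solaris psrinfo utility is on the PATH, and the output
filename is just an example:

psrinfo -pv > solaris-psrinfo.txt   # physical processors and their virtual CPUs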

Thanks!

On Feb 5, 2013, at 12:30 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi
>
> now I can use all our machines once more. I have a problem on
> Solaris 10 x86_64, because the mapping of processes doesn't
> correspond to the rankfile. I removed the output of "hostname"
> and wrapped long lines.
>
> tyr rankfiles 114 cat rf_ex_sunpc
> # mpiexec -report-bindings -rf rf_ex_sunpc hostname
>
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
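>
> My reading of the slot syntax (rank N=host slot=socket:core-range), so
> that the intended binding is explicit:
>
> #   rank 0 -> sunpc0: socket 0 cores 0-1 and socket 1 cores 0-1
> #   rank 1 -> sunpc1: socket 0 cores 0-1
> #   rank 2 -> sunpc1: socket 1, core 0 only
> #   rank 3 -> sunpc1: socket 1, core 1 only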
>
>
> tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
> [B B][. .] (slot list 0:0-1)
> [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
> socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
> socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>
>
> Can I provide any information to help solve this problem? My
> rankfile works as expected if I use only Linux machines.
>
>
> Kind regards
>
> Siegmar
>
>
>
>>> Hmmm....well, it certainly works for me:
>>>
>>> [rhc_at_odin ~/v1.6]$ cat rf
>>> rank 0=odin093 slot=0:0-1,1:0-1
>>> rank 1=odin094 slot=0:0-1
>>> rank 2=odin094 slot=1:0
>>> rank 3=odin094 slot=1:1
>>>
>>>
>>> [rhc_at_odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>>> -mca opal_paffinity_alone 0 hostname
>>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>>> socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>> odin093.cs.indiana.edu
>>> odin094.cs.indiana.edu
>>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>>> socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>> odin094.cs.indiana.edu
>>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>>> socket 1[core 0]: [. .][B .] (slot list 1:0)
>>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>>> socket 1[core 1]: [. .][. B] (slot list 1:1)
>>> odin094.cs.indiana.edu
>>
>> Interesting that it works on your machines.
>>
>>
>>> I see one thing of concern to me in your output - your second node
>>> appears to be a Sun computer. Is it the same physical architecture?
>>> Is it also running Linux? Are you sure it is using the same version
>>> of OMPI, built for that environment and hardware?
>>
>> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
>> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
>> runs Solaris 10 x86_64. All machines use the same version of Open MPI,
>> built for that environment. At the moment I can only use sunpc1 and
>> linpc1 ("my" developer machines). Next week I will have access to all
>> machines, so that I can test whether I get a different behaviour when
>> I use two machines with the same operating system (although mixed
>> operating systems weren't a problem in the past, only machines with
>> different endianness). I will let you know my results.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>
>>
>>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>>
>>>> Hi
>>>>
>>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>>> it works for my previous rankfile.
>>>>
>>>>
>>>>> #3493: Handle the case where rankfile provides the allocation
>>>>> -----------------------------------+----------------------------
>>>>> Reporter: rhc | Owner: jsquyres
>>>>> Type: changeset move request | Status: new
>>>>> Priority: critical | Milestone: Open MPI 1.6.4
>>>>> Version: trunk | Keywords:
>>>>> -----------------------------------+----------------------------
>>>>> Please apply the attached patch that corrects the rmaps function for
>>>>> obtaining the available nodes when rankfile is providing the allocation.
>>>>
>>>>
>>>> tyr rankfiles 129 more rf_linpc1
>>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>>
>>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>>
>>>>
>>>>
>>>> Unfortunately I don't get the expected result for the following
>>>> rankfile.
>>>>
>>>> tyr rankfiles 114 more rf_bsp
>>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>> rank 1=sunpc1 slot=0:0-1
>>>> rank 2=sunpc1 slot=1:0
>>>> rank 3=sunpc1 slot=1:1
>>>>
>>>> I would expect that rank 0 gets all four cores of linpc1, rank 1
>>>> gets both cores of socket 0 on sunpc1, rank 2 gets core 0 of
>>>> socket 1, and rank 3 gets core 1 of socket 1 on sunpc1. Everything
>>>> is fine for the processes with ranks 0 and 1, but it's wrong for
>>>> ranks 2 and 3, because they both get all four cores of sunpc1. Is
>>>> something wrong with my rankfile or with your mapping of processes
>>>> to cores? I have removed the output of "hostname" and wrapped long
>>>> lines.
>>>>
>>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>>> [B B][. .] (slot list 0:0-1)
>>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:0)
>>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:1)
>>>>
>>>>
>>>> I get the following output if I add the options that you mentioned
>>>> in a previous email.
>>>>
>>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>>> -display-allocation -mca ras_base_verbose 5 hostname
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>>> Querying component [cm]
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>>> Skipping component [cm]. Query failed to return a module
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>>> No component selected!
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> nothing found in module - proceeding to hostfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> parsing default hostfile
>>>> /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> nothing found in hostfiles or dash-host - checking for rankfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>>> ras:base:node_insert inserting 2 nodes
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>>> ras:base:node_insert node linpc1
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>>> ras:base:node_insert node sunpc1
>>>>
>>>> ====================== ALLOCATED NODES ======================
>>>>
>>>> Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
>>>> Data for node: linpc1 Num slots: 1 Max slots: 0
>>>> Data for node: sunpc1 Num slots: 3 Max slots: 0
>>>>
>>>> =================================================================
>>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
>>>> [B B][. .] (slot list 0:0-1)
>>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:0)
>>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:1)
>>>>
>>>>
>>>> Thank you very much in advance for any suggestions and help.
>>>>
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>
>>>
>>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/