Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-31 16:02:55


On Jan 31, 2013, at 12:39 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi
>
>> Hmmm....well, it certainly works for me:
>>
>> [rhc_at_odin ~/v1.6]$ cat rf
>> rank 0=odin093 slot=0:0-1,1:0-1
>> rank 1=odin094 slot=0:0-1
>> rank 2=odin094 slot=1:0
>> rank 3=odin094 slot=1:1
>>
>>
>> [rhc_at_odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>> -mca opal_paffinity_alone 0 hostname
>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>> socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> odin093.cs.indiana.edu
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>> socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>> socket 1[core 0]: [. .][B .] (slot list 1:0)
>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>> socket 1[core 1]: [. .][. B] (slot list 1:1)
>> odin094.cs.indiana.edu
>
> Interesting that it works on your machines.
>
>
>> I see one thing of concern to me in your output - your second node
>> appears to be a Sun computer. Is it the same physical architecture?
>> Is it also running Linux? Are you sure it is using the same version
>> of OMPI, built for that environment and hardware?
>
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines, so I can test whether I get different behaviour when I
> use two machines with the same operating system (although mixed
> operating systems weren't a problem in the past, only machines with
> different endianness). I will let you know my results.

I suspect the problem is Solaris being on the remote machine. I don't know how far our Solaris support may have rotted by now.
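
To check each reported binding against its slot list mechanically rather than by eye, something like the following could help. It is only a minimal sketch: the function names are hypothetical, it handles only the "socket:core" and "socket:first-last" slot forms that appear in this thread, and it expects every --report-bindings line to be joined back into a single line (the output quoted in this thread is wrapped). The socket and core counts are taken from the reported mask itself, not from the hardware.

```python
import re

def parse_slot_list(spec):
    """Turn a slot list such as '0:0-1,1:0' into a set of (socket, core) pairs."""
    bound = set()
    for entry in spec.split(","):
        socket, cores = entry.strip().split(":")
        first, _, last = cores.partition("-")
        # A single core ("1:0") has no range end, so reuse the start.
        for core in range(int(first), int(last or first) + 1):
            bound.add((int(socket), core))
    return bound

def expected_mask(spec, sockets, cores_per_socket):
    """Render the mask that --report-bindings should print for this slot list."""
    bound = parse_slot_list(spec)
    return "".join(
        "[" + " ".join("B" if (s, c) in bound else "."
                       for c in range(cores_per_socket)) + "]"
        for s in range(sockets)
    )

# Matches e.g. "... MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)"
REPORT = re.compile(
    r"MCW rank (\d+) bound to .*?:\s*((?:\[[B. ]+\])+)\s*\(slot list ([^)]+)\)"
)

def check_line(line):
    """Return (rank, reported mask, expected mask, ok), or None if no match."""
    m = REPORT.search(line)
    if m is None:
        return None
    rank, mask, spec = int(m.group(1)), m.group(2), m.group(3)
    groups = re.findall(r"\[([B. ]+)\]", mask)
    want = expected_mask(spec, len(groups), len(groups[0].split()))
    return rank, mask, want, mask == want

if __name__ == "__main__":
    line = ("[sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] "
            "socket 1[core 0-1]: [B B][B B] (slot list 1:0)")
    print(check_line(line))
```

Fed the rank 2 line from sunpc1, this flags a mismatch between the reported mask and the one implied by slot list 1:0; the corresponding lines from odin094 pass.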

>
>
> Kind regards
>
> Siegmar
>
>
>
>
>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>
>>> Hi
>>>
>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>> it works for my previous rankfile.
>>>
>>>
>>>> #3493: Handle the case where rankfile provides the allocation
>>>> -----------------------------------+----------------------------
>>>> Reporter: rhc | Owner: jsquyres
>>>> Type: changeset move request | Status: new
>>>> Priority: critical | Milestone: Open MPI 1.6.4
>>>> Version: trunk | Keywords:
>>>> -----------------------------------+----------------------------
>>>> Please apply the attached patch that corrects the rmaps function for
>>>> obtaining the available nodes when rankfile is providing the allocation.
>>>
>>>
>>> tyr rankfiles 129 more rf_linpc1
>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>
>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>
>>>
>>>
>>> Unfortunately I don't get the expected result for the following
>>> rankfile.
>>>
>>> tyr rankfiles 114 more rf_bsp
>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>>
>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
>>> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
>>> processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
>>> because they both get all four cores of sunpc1. Is something wrong
>>> with my rankfile or with your mapping of processes to cores? I have
>>> removed the output from "hostname" and wrapped long lines.
>>>
>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>>
>>>
>>> I get the following output, if I add the options which you mentioned
>>> in a previous email.
>>>
>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>> -display-allocation -mca ras_base_verbose 5 hostname
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Querying component [cm]
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Skipping component [cm]. Query failed to return a module
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> No component selected!
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in module - proceeding to hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> parsing default hostfile
>>> /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in hostfiles or dash-host - checking for rankfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert inserting 2 nodes
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node linpc1
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node sunpc1
>>>
>>> ====================== ALLOCATED NODES ======================
>>>
>>> Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
>>> Data for node: linpc1 Num slots: 1 Max slots: 0
>>> Data for node: sunpc1 Num slots: 3 Max slots: 0
>>>
>>> =================================================================
>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>>
>>>
>>> Thank you very much for any suggestions and any help in advance.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>