Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-31 16:02:55


On Jan 31, 2013, at 12:39 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi
>
>> Hmmm....well, it certainly works for me:
>>
>> [rhc_at_odin ~/v1.6]$ cat rf
>> rank 0=odin093 slot=0:0-1,1:0-1
>> rank 1=odin094 slot=0:0-1
>> rank 2=odin094 slot=1:0
>> rank 3=odin094 slot=1:1
>>
>>
>> [rhc_at_odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>> -mca opal_paffinity_alone 0 hostname
>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>> socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> odin093.cs.indiana.edu
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>> socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>> socket 1[core 0]: [. .][B .] (slot list 1:0)
>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>> socket 1[core 1]: [. .][. B] (slot list 1:1)
>> odin094.cs.indiana.edu
>
> Interesting that it works on your machines.
>
>
>> I see one thing of concern to me in your output - your second node
>> appears to be a Sun computer. Is it the same physical architecture?
>> Is it also running Linux? Are you sure it is using the same version
>> of OMPI, built for that environment and hardware?
>
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
> Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines, so I can test whether I get a different behaviour when I
> use two machines with the same operating system (although mixed
> operating systems weren't a problem in the past, only machines with
> different endianness were). I will let you know my results.

I suspect the problem is Solaris being on the remote machine. I don't know how far our Solaris support may have rotted by now.
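
One quick sanity check, if you want to compare the two builds, is to run ompi_info on each node and look at the processor-affinity (paffinity) components that were built; component names vary by platform and release, so treat this only as a rough indicator:

  ompi_info | grep paffinity

If the Solaris build reports a different or missing paffinity component compared to the Linux build, that would fit the suspicion above.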

>
>
> Kind regards
>
> Siegmar
>
>
>
>
>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross
> <Siegmar.Gross_at_[hidden]> wrote:
>>
>>> Hi
>>>
>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>> it works for my previous rankfile.
>>>
>>>
>>>> #3493: Handle the case where rankfile provides the allocation
>>>> -----------------------------------+----------------------------
>>>> Reporter: rhc | Owner: jsquyres
>>>> Type: changeset move request | Status: new
>>>> Priority: critical | Milestone: Open MPI 1.6.4
>>>> Version: trunk | Keywords:
>>>> -----------------------------------+----------------------------
>>>> Please apply the attached patch that corrects the rmaps function for
>>>> obtaining the available nodes when rankfile is providing the allocation.
>>>
>>>
>>> tyr rankfiles 129 more rf_linpc1
>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>
>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>
>>>
>>>
>>> Unfortunately I don't get the expected result for the following
>>> rankfile.
>>>
>>> tyr rankfiles 114 more rf_bsp
>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>>
>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
>>> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
>>> processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
>>> because they both get all four cores of sunpc1. Is something wrong
>>> with my rankfile or with your mapping of processes to cores? I have
>>> removed the output from "hostname" and wrapped long lines.
>>>
>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>>
>>>
>>> I get the following output if I add the options that you mentioned
>>> in a previous email.
>>>
>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>> -display-allocation -mca ras_base_verbose 5 hostname
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Querying component [cm]
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Skipping component [cm]. Query failed to return a module
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> No component selected!
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in module - proceeding to hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> parsing default hostfile
>>> /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in hostfiles or dash-host - checking for rankfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert inserting 2 nodes
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node linpc1
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node sunpc1
>>>
>>> ====================== ALLOCATED NODES ======================
>>>
>>> Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
>>> Data for node: linpc1 Num slots: 1 Max slots: 0
>>> Data for node: sunpc1 Num slots: 3 Max slots: 0
>>>
>>> =================================================================
>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>>
>>>
>>> Thank you very much in advance for any suggestions and help.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
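
For comparison, if the rankfile were being honored, the report-bindings lines for ranks 2 and 3 on sunpc1 should look roughly like the odin output above (process IDs here are placeholders):

[sunpc1:xxxxx] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[sunpc1:xxxxx] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)

rather than the [B B][B B] bindings reported for both ranks.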