Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-30 13:20:57


Hmmm... well, it certainly works for me:

[rhc_at_odin ~/v1.6]$ cat rf
rank 0=odin093 slot=0:0-1,1:0-1
rank 1=odin094 slot=0:0-1
rank 2=odin094 slot=1:0
rank 3=odin094 slot=1:1

[rhc_at_odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings -mca opal_paffinity_alone 0 hostname
[odin093.cs.indiana.edu:04617] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
odin093.cs.indiana.edu
odin094.cs.indiana.edu
[odin094.cs.indiana.edu:04426] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
odin094.cs.indiana.edu
[odin094.cs.indiana.edu:04426] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[odin094.cs.indiana.edu:04426] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
odin094.cs.indiana.edu
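
As a cross-check on how the slot lists in that rankfile translate into the masks printed by --report-bindings, here is a minimal Python sketch. It assumes only the socket:core-range syntax used in this thread (other rankfile slot forms are not handled) and a node with two sockets of two cores each. The function names are illustrative and are not part of Open MPI.

```python
import re

def expand_slot_list(slot_list):
    """Expand entries such as '0:0-1,1:0-1' into a set of (socket, core) pairs.

    Assumption: every comma-separated entry has the form 'socket:core' or
    'socket:first-last', as in the rankfiles shown in this thread.
    """
    pairs = set()
    for entry in slot_list.split(","):
        socket, cores = entry.split(":")
        first, _, last = cores.partition("-")
        for core in range(int(first), int(last or first) + 1):
            pairs.add((int(socket), core))
    return pairs

def binding_mask(slot_list, sockets=2, cores_per_socket=2):
    """Render the expected mask, e.g. '[. .][B .]' for slot list '1:0'.

    The default topology (2 sockets x 2 cores) is an assumption taken from
    the output in this thread.
    """
    bound = expand_slot_list(slot_list)
    return "".join(
        "[" + " ".join("B" if (s, c) in bound else "."
                       for c in range(cores_per_socket)) + "]"
        for s in range(sockets)
    )

def parse_rankfile(text):
    """Return {rank: (host, slot_list)}, skipping blank and '#' lines."""
    pattern = re.compile(r"rank\s+(\d+)=(\S+)\s+slot=(\S+)")
    ranks = {}
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            ranks[int(match.group(1))] = (match.group(2), match.group(3))
    return ranks

rankfile = """\
rank 0=odin093 slot=0:0-1,1:0-1
rank 1=odin094 slot=0:0-1
rank 2=odin094 slot=1:0
rank 3=odin094 slot=1:1
"""

for rank, (host, slots) in sorted(parse_rankfile(rankfile).items()):
    print(rank, host, binding_mask(slots))
```

For the rankfile above this gives the same four masks that mpirun reported.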

One thing in your output concerns me: your second node appears to be a Sun machine. Does it have the same physical architecture as the first node? Is it also running Linux? Are you sure it is using the same version of OMPI, built for that environment and hardware?
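
To see mechanically which ranks are affected on each node, one option is to compare the mask in every --report-bindings line with its own slot list. The following is a rough Python sketch, not an Open MPI tool. It assumes the 1.6-series line format shown in this thread, unwrapped lines (one binding report per line), and slot lists in socket:core-range form. The sample lines are the reported bindings from the quoted message below, rejoined onto single lines.

```python
import re

# Matches e.g. "... MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)"
REPORT = re.compile(
    r"MCW rank (\d+) bound to .*?:\s*((?:\[[B. ]+\])+)\s*\(slot list ([^)]+)\)"
)

def cores_from_slot_list(slot_list):
    """Expand '0:0-1,1:0-1' into a set of (socket, core) pairs."""
    pairs = set()
    for entry in slot_list.split(","):
        socket, cores = entry.split(":")
        first, _, last = cores.partition("-")
        for core in range(int(first), int(last or first) + 1):
            pairs.add((int(socket), core))
    return pairs

def cores_from_mask(mask):
    """Turn '[. .][B .]' into the set of (socket, core) pairs marked 'B'."""
    pairs = set()
    for socket, group in enumerate(re.findall(r"\[([B. ]+)\]", mask)):
        for core, cell in enumerate(group.split()):
            if cell == "B":
                pairs.add((socket, core))
    return pairs

def mismatched_ranks(output):
    """Return the ranks whose reported mask differs from their slot list."""
    bad = []
    for line in output.splitlines():
        match = REPORT.search(line)
        if match is None:
            continue  # not a binding report, e.g. hostname output
        rank, mask, slots = match.groups()
        if cores_from_mask(mask) != cores_from_slot_list(slots):
            bad.append(int(rank))
    return sorted(bad)

sample = """\
[linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
[sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
[sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
"""

print(mismatched_ranks(sample))
```

On the sample lines this flags ranks 2 and 3, both on sunpc1, which matches the problem described in the quoted message.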

On Jan 30, 2013, at 2:08 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi
>
> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> it works for my previous rankfile.
>
>
>> #3493: Handle the case where rankfile provides the allocation
>> -----------------------------------+----------------------------
>> Reporter: rhc | Owner: jsquyres
>> Type: changeset move request | Status: new
>> Priority: critical | Milestone: Open MPI 1.6.4
>> Version: trunk | Keywords:
>> -----------------------------------+----------------------------
>> Please apply the attached patch that corrects the rmaps function for
>> obtaining the available nodes when rankfile is providing the allocation.
>
>
> tyr rankfiles 129 more rf_linpc1
> # mpiexec -report-bindings -rf rf_linpc1 hostname
> rank 0=linpc1 slot=0:0-1,1:0-1
>
> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>
>
>
> Unfortunately I don't get the expected result for the following
> rankfile.
>
> tyr rankfiles 114 more rf_bsp
> # mpiexec -report-bindings -rf rf_bsp hostname
> rank 0=linpc1 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
> I would expect that rank 0 gets all four cores from linpc1, rank 1
> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
> processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
> because they both get all four cores of sunpc1. Is something wrong
> with my rankfile or with your mapping of processes to cores? I have
> removed the output from "hostname" and wrapped long lines.
>
> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> [B B][B B] (slot list 0:0-1,1:0-1)
> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> [B B][. .] (slot list 0:0-1)
> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> [B B][B B] (slot list 1:0)
> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> [B B][B B] (slot list 1:1)
>
>
> I get the following output, if I add the options which you mentioned
> in a previous email.
>
> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> -display-allocation -mca ras_base_verbose 5 hostname
> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> Querying component [cm]
> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> Skipping component [cm]. Query failed to return a module
> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> No component selected!
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> nothing found in module - proceeding to hostfile
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> parsing default hostfile
> /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> nothing found in hostfiles or dash-host - checking for rankfile
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> ras:base:node_insert inserting 2 nodes
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> ras:base:node_insert node linpc1
> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> ras:base:node_insert node sunpc1
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
> Data for node: linpc1 Num slots: 1 Max slots: 0
> Data for node: sunpc1 Num slots: 3 Max slots: 0
>
> =================================================================
> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> [B B][B B] (slot list 0:0-1,1:0-1)
> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> [B B][. .] (slot list 0:0-1)
> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> [B B][B B] (slot list 1:0)
> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> [B B][B B] (slot list 1:1)
>
>
> Thank you very much for any suggestions and any help in advance.
>
>
> Kind regards
>
> Siegmar