Hi
I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
it works for my previous rankfile.
> #3493: Handle the case where rankfile provides the allocation
> -----------------------------------+----------------------------
> Reporter: rhc | Owner: jsquyres
> Type: changeset move request | Status: new
> Priority: critical | Milestone: Open MPI 1.6.4
> Version: trunk | Keywords:
> -----------------------------------+----------------------------
> Please apply the attached patch that corrects the rmaps function for
> obtaining the available nodes when rankfile is providing the allocation.
tyr rankfiles 129 more rf_linpc1
# mpiexec -report-bindings -rf rf_linpc1 hostname
rank 0=linpc1 slot=0:0-1,1:0-1
tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
[linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
Unfortunately, I don't get the expected result for the following
rankfile.
tyr rankfiles 114 more rf_bsp
# mpiexec -report-bindings -rf rf_bsp hostname
rank 0=linpc1 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1
I would expect rank 0 to get all four cores of linpc1, rank 1 both
cores of socket 0 on sunpc1, rank 2 core 0 of socket 1 on sunpc1, and
rank 3 core 1 of socket 1 on sunpc1. Ranks 0 and 1 are bound as
expected, but ranks 2 and 3 are wrong: both are bound to all four
cores of sunpc1. Is something wrong with my rankfile or with your
mapping of processes to cores? I have removed the output of
"hostname" and wrapped long lines.
tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
[linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
[B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
[B B][. .] (slot list 0:0-1)
[sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
[B B][B B] (slot list 1:0)
[sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
[B B][B B] (slot list 1:1)
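To make the expectation concrete, here is a minimal sketch in Python that
turns a slot list into the mask I would expect -report-bindings to print.
It is not Open MPI code. The function name expected_mask is hypothetical,
the topology of 2 sockets with 2 cores each is assumed from the output
above, and only the "socket:core" and "socket:first-last" forms used in
my rankfile are handled.

```python
def expected_mask(slot_list, sockets=2, cores_per_socket=2):
    """Return a -report-bindings style mask such as "[B B][. .]".

    Assumes a 2x2 topology by default and only the slot-list forms
    "socket:core" and "socket:first-last", separated by commas.
    """
    bound = set()
    for item in slot_list.split(","):
        socket, cores = item.split(":")
        first, _, last = cores.partition("-")
        last = last or first  # a single core is a range of length one
        for core in range(int(first), int(last) + 1):
            bound.add((int(socket), core))
    return "".join(
        "["
        + " ".join("B" if (s, c) in bound else "." for c in range(cores_per_socket))
        + "]"
        for s in range(sockets)
    )


# Slot lists of ranks 0..3 from rf_bsp
for rank, slots in enumerate(["0:0-1,1:0-1", "0:0-1", "1:0", "1:1"]):
    print(rank, expected_mask(slots))
```

Under these assumptions the masks for ranks 2 and 3 come out as
[. .][B .] and [. .][. B], not the [B B][B B] reported above.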
I get the following output if I add the options that you mentioned
in a previous email.
tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
-display-allocation -mca ras_base_verbose 5 hostname
[tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
Querying component [cm]
[tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
Skipping component [cm]. Query failed to return a module
[tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
No component selected!
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
nothing found in module - proceeding to hostfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
parsing default hostfile
/usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
nothing found in hostfiles or dash-host - checking for rankfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
ras:base:node_insert inserting 2 nodes
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
ras:base:node_insert node linpc1
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
ras:base:node_insert node sunpc1
====================== ALLOCATED NODES ======================
Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
Data for node: linpc1 Num slots: 1 Max slots: 0
Data for node: sunpc1 Num slots: 3 Max slots: 0
=================================================================
[linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
[B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
[B B][. .] (slot list 0:0-1)
[sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
[B B][B B] (slot list 1:0)
[sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
[B B][B B] (slot list 1:1)
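As far as I can tell, the slot counts in the allocation above are
consistent with the rankfile, so only the binding of ranks 2 and 3 looks
wrong. The following small Python sketch counts the rank lines per host
in rf_bsp for comparison with the "Num slots" values. The helper name
ranks_per_host is hypothetical, and the parsing assumes the simple
"rank N=host slot=..." form used in my rankfile.

```python
from collections import Counter

# Contents of rf_bsp as shown earlier in this mail
RANKFILE = """\
# mpiexec -report-bindings -rf rf_bsp hostname
rank 0=linpc1 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1
"""


def ranks_per_host(text):
    """Count how many rank lines name each host."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # "rank 1=sunpc1 slot=0:0-1": the host sits between "=" and the next space
        host = line.split("=", 1)[1].split()[0]
        counts[host] += 1
    return dict(counts)


print(ranks_per_host(RANKFILE))
```

This gives 1 rank for linpc1 and 3 ranks for sunpc1, the same numbers
that -display-allocation reports as "Num slots".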
Thank you very much in advance for any suggestions and help.
Kind regards
Siegmar