Open MPI User's Mailing List Archives

Subject: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-01-30 05:08:12


Hi

I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
it works for my previous rankfile.

> #3493: Handle the case where rankfile provides the allocation
> -----------------------------------+----------------------------
> Reporter: rhc | Owner: jsquyres
> Type: changeset move request | Status: new
> Priority: critical | Milestone: Open MPI 1.6.4
> Version: trunk | Keywords:
> -----------------------------------+----------------------------
> Please apply the attached patch that corrects the rmaps function for
> obtaining the available nodes when rankfile is providing the allocation.

tyr rankfiles 129 more rf_linpc1
# mpiexec -report-bindings -rf rf_linpc1 hostname
rank 0=linpc1 slot=0:0-1,1:0-1

tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
[linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

Unfortunately I don't get the expected result for the following
rankfile.

tyr rankfiles 114 more rf_bsp
# mpiexec -report-bindings -rf rf_bsp hostname
rank 0=linpc1 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

I would expect rank 0 to get all four cores of linpc1, rank 1 both
cores of socket 0 on sunpc1, rank 2 core 0 of socket 1 on sunpc1, and
rank 3 core 1 of socket 1 on sunpc1. The bindings for ranks 0 and 1
are correct, but ranks 2 and 3 are wrong: both are bound to all four
cores of sunpc1. Is something wrong with my rankfile or with your
mapping of processes to cores? I have removed the output of
"hostname" and wrapped long lines in the listings below.
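
To make the expectation concrete, here is a minimal Python sketch (not
part of Open MPI) that turns a slot list into the mask notation printed
by -report-bindings. It assumes two sockets with two cores each, as the
"[B B][B B]" output suggests, and it handles only the socket:core and
socket:first-last forms used in this rankfile. The function name
expected_mask is made up for illustration.

```python
def expected_mask(slot_list, sockets=2, cores_per_socket=2):
    """Build a -report-bindings style mask such as '[. .][B .]'.

    Assumption: every entry has the form 'socket:core' or
    'socket:first-last'; other rankfile slot syntaxes are not handled.
    """
    bound = set()
    for item in slot_list.split(","):
        socket_str, core_str = item.split(":")
        first, _, last = core_str.partition("-")
        # A single core such as '1:0' has no '-', so last is empty.
        for core in range(int(first), int(last or first) + 1):
            bound.add((int(socket_str), core))

    groups = []
    for socket in range(sockets):
        cells = ["B" if (socket, core) in bound else "."
                 for core in range(cores_per_socket)]
        groups.append("[" + " ".join(cells) + "]")
    return "".join(groups)


if __name__ == "__main__":
    # Slot lists taken from rf_bsp.
    for slots in ("0:0-1,1:0-1", "0:0-1", "1:0", "1:1"):
        print(slots, "->", expected_mask(slots))
```

Under these assumptions the slot lists 1:0 and 1:1 correspond to the
masks [. .][B .] and [. .][. B], which is what I would expect to see for
ranks 2 and 3.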

tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
[linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:0)
[sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:1)
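
The mismatch can also be checked mechanically. The following Python
sketch (again not part of Open MPI, and all names are hypothetical)
compares the mask in each -report-bindings line with the slot list
printed on the same line. It assumes that wrapped lines have been
joined back into one line per rank and that slot lists use only the
socket:core and socket:first-last forms.

```python
import re

# Matches one unwrapped -report-bindings line, capturing rank, mask, slot list.
LINE_RE = re.compile(
    r"MCW rank (\d+) bound to .*?:\s+((?:\[[B. ]+\])+)\s+\(slot list ([^)]+)\)"
)


def cores_from_mask(mask):
    """Return the set of (socket, core) pairs marked 'B' in the mask."""
    bound = set()
    for socket, group in enumerate(re.findall(r"\[([B. ]+)\]", mask)):
        for core, cell in enumerate(group.split()):
            if cell == "B":
                bound.add((socket, core))
    return bound


def cores_from_slot_list(slot_list):
    """Return the set of (socket, core) pairs requested by the slot list."""
    wanted = set()
    for item in slot_list.split(","):
        socket, cores = item.split(":")
        first, _, last = cores.partition("-")
        for core in range(int(first), int(last or first) + 1):
            wanted.add((int(socket), core))
    return wanted


def mismatched_ranks(report_lines):
    """List the ranks whose reported binding differs from their slot list."""
    bad = []
    for line in report_lines:
        m = LINE_RE.search(line)
        if m and cores_from_mask(m.group(2)) != cores_from_slot_list(m.group(3)):
            bad.append(int(m.group(1)))
    return bad
```

Applied to the four lines above, mismatched_ranks reports ranks 2 and 3.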

I get the following output if I add the options that you mentioned
in a previous email.

tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
  -display-allocation -mca ras_base_verbose 5 hostname
[tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
  Querying component [cm]
[tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
  Skipping component [cm]. Query failed to return a module
[tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
  No component selected!
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
  nothing found in module - proceeding to hostfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
  parsing default hostfile
   /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
  nothing found in hostfiles or dash-host - checking for rankfile
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
  ras:base:node_insert inserting 2 nodes
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
  ras:base:node_insert node linpc1
[tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
  ras:base:node_insert node sunpc1

====================== ALLOCATED NODES ======================

 Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
 Data for node: linpc1 Num slots: 1 Max slots: 0
 Data for node: sunpc1 Num slots: 3 Max slots: 0

=================================================================
[linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:0)
[sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:1)

Thank you very much in advance for any suggestions and help.

Kind regards

Siegmar