Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-02-05 03:30:34


Hi

now I can use all our machines once more. I have a problem on
Solaris 10 x86_64, because the mapping of processes doesn't
correspond to the rankfile. I removed the output from "hostname"
and wrapped long lines.

tyr rankfiles 114 cat rf_ex_sunpc
# mpiexec -report-bindings -rf rf_ex_sunpc hostname

rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1
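
Spelled out, the bindings I expect from this rankfile (the
annotations are mine, following the same reading as for rf_bsp
further below):

rank 0=sunpc0 slot=0:0-1,1:0-1   # sunpc0: socket 0 cores 0-1 and socket 1 cores 0-1
rank 1=sunpc1 slot=0:0-1         # sunpc1: socket 0, cores 0 and 1
rank 2=sunpc1 slot=1:0           # sunpc1: socket 1, core 0 only
rank 3=sunpc1 slot=1:1           # sunpc1: socket 1, core 1 only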

tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
[sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 1:0)
[sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 1:1)
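
For comparison, the correct Linux run quoted below gives ranks 2 and 3
single-core bindings, so on Solaris I would expect lines like these
instead (PIDs omitted):

[sunpc1:...] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[sunpc1:...] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)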

Can I provide any information to help solve this problem? My
rankfile works as expected if I use only Linux machines.
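
If it is useful, I could also cross-check the binding inside each
process with hwloc (the library that Open MPI embeds). Below is a
minimal sketch, assuming a standalone hwloc installation is available
to link against; "bindcheck.c" is my own name, not part of Open MPI.

/* bindcheck.c: print each rank's actual CPU binding via hwloc,
 * independent of mpiexec -report-bindings.
 * Build (example): mpicc bindcheck.c -o bindcheck -lhwloc
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    hwloc_topology_t topo;
    hwloc_bitmap_t set;
    char *str;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Load the machine topology and query this process's binding. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    set = hwloc_bitmap_alloc();
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_asprintf(&str, set);   /* cpuset as hex string */
        printf("rank %d on %s: cpuset %s\n", rank, host, str);
        free(str);
    } else {
        printf("rank %d on %s: could not query binding\n", rank, host);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}

Running it with the same rankfile (mpiexec -rf rf_ex_sunpc
./bindcheck) should show whether ranks 2 and 3 are really bound to all
four cores on Solaris or only reported that way.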

Kind regards

Siegmar

> > Hmmm....well, it certainly works for me:
> >
> > [rhc_at_odin ~/v1.6]$ cat rf
> > rank 0=odin093 slot=0:0-1,1:0-1
> > rank 1=odin094 slot=0:0-1
> > rank 2=odin094 slot=1:0
> > rank 3=odin094 slot=1:1
> >
> >
> > [rhc_at_odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
> > -mca opal_paffinity_alone 0 hostname
> > [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
> > socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > odin093.cs.indiana.edu
> > odin094.cs.indiana.edu
> > [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
> > socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> > odin094.cs.indiana.edu
> > [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
> > socket 1[core 0]: [. .][B .] (slot list 1:0)
> > [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
> > socket 1[core 1]: [. .][. B] (slot list 1:1)
> > odin094.cs.indiana.edu
>
> Interesting that it works on your machines.
>
>
> > I see one thing of concern to me in your output - your second node
> > appears to be a Sun computer. Is it the same physical architecture?
> > Is it also running Linux? Are you sure it is using the same version
> > of OMPI, built for that environment and hardware?
>
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines so that I can test whether I get a different behaviour when
> I use two machines with the same operating system (mixed operating
> systems weren't a problem in the past, only machines with different
> endianness). I will let you know my results.
>
>
> Kind regards
>
> Siegmar
>
>
>
>
> > On Jan 30, 2013, at 2:08 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
> >
> > > Hi
> > >
> > > I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> > > it works for my previous rankfile.
> > >
> > >
> > >> #3493: Handle the case where rankfile provides the allocation
> > >> -----------------------------------+----------------------------
> > >> Reporter: rhc | Owner: jsquyres
> > >> Type: changeset move request | Status: new
> > >> Priority: critical | Milestone: Open MPI 1.6.4
> > >> Version: trunk | Keywords:
> > >> -----------------------------------+----------------------------
> > >> Please apply the attached patch that corrects the rmaps function for
> > >> obtaining the available nodes when rankfile is providing the allocation.
> > >
> > >
> > > tyr rankfiles 129 more rf_linpc1
> > > # mpiexec -report-bindings -rf rf_linpc1 hostname
> > > rank 0=linpc1 slot=0:0-1,1:0-1
> > >
> > > tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> > > [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> > > socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > >
> > >
> > >
> > > Unfortunately I don't get the expected result for the following
> > > rankfile.
> > >
> > > tyr rankfiles 114 more rf_bsp
> > > # mpiexec -report-bindings -rf rf_bsp hostname
> > > rank 0=linpc1 slot=0:0-1,1:0-1
> > > rank 1=sunpc1 slot=0:0-1
> > > rank 2=sunpc1 slot=1:0
> > > rank 3=sunpc1 slot=1:1
> > >
> > > I would expect that rank 0 gets all four cores from linpc1, rank 1
> > > both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
> > > rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
> > > processes with ranks 0 and 1, but it's wrong for ranks 2 and 3,
> > > because they both get all four cores of sunpc1. Is something wrong
> > > with my rankfile or with your mapping of processes to cores? I have
> > > removed the output from "hostname" and wrapped long lines.
> > >
> > > tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> > > [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > > [B B][B B] (slot list 0:0-1,1:0-1)
> > > [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> > > [B B][. .] (slot list 0:0-1)
> > > [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > > [B B][B B] (slot list 1:0)
> > > [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > > [B B][B B] (slot list 1:1)
> > >
> > >
> > > I get the following output if I add the options that you mentioned
> > > in a previous email.
> > >
> > > tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> > > -display-allocation -mca ras_base_verbose 5 hostname
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> > > Querying component [cm]
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> > > Skipping component [cm]. Query failed to return a module
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> > > No component selected!
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > > nothing found in module - proceeding to hostfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > > parsing default hostfile
> > > /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > > nothing found in hostfiles or dash-host - checking for rankfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > > ras:base:node_insert inserting 2 nodes
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > > ras:base:node_insert node linpc1
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > > ras:base:node_insert node sunpc1
> > >
> > > ====================== ALLOCATED NODES ======================
> > >
> > > Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
> > > Data for node: linpc1 Num slots: 1 Max slots: 0
> > > Data for node: sunpc1 Num slots: 3 Max slots: 0
> > >
> > > =================================================================
> > > [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > > [B B][B B] (slot list 0:0-1,1:0-1)
> > > [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> > > [B B][. .] (slot list 0:0-1)
> > > [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > > [B B][B B] (slot list 1:0)
> > > [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > > [B B][B B] (slot list 1:1)
> > >
> > >
> > > Thank you very much for any suggestions and any help in advance.
> > >
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>