Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-09-05 08:50:16


Hi,

I'm new to rankfiles, so I experimented a little with different
options. I thought that the following entry would be similar to an
entry in an appfile, and that MPI could then place the process with
rank 0 on any core of any processor.

rank 0=tyr.informatik.hs-fulda.de

Unfortunately this is not allowed, and I got an error whose help text
is missing. Could somebody add the missing topic to the help file?

tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
    no-slot-list
from the file:
    help-rmaps_rank_file.txt
But I couldn't find that topic in the file. Sorry!
--------------------------------------------------------------------------
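
The topic name "no-slot-list" suggests that the entry was rejected because it has no "slot=" part. The following is a minimal sketch of a checker that flags such entries before mpiexec is run. It is not Open MPI's parser: the grammar is only inferred from the man page example quoted further down, and the names parse_rankfile and expand are made up for illustration.

```python
import re

# Assumed grammar, inferred from the man page example quoted in this thread:
#   rank <N>=<host> slot=<socket>:<cores>   or   rank <N>=<host> slot=<cores>
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)(?:\s+slot=(\S+))?$")


def expand(spec):
    """Expand a core spec such as '0-2' or '0,1' into a list of ints."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def parse_rankfile(text):
    """Return (entries, problems) for the rankfile contents in `text`.

    Each entry is (rank, host, socket_or_None, [cores]).
    """
    entries, problems = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        m = RANK_LINE.match(line)
        if m is None:
            problems.append(f"line {lineno}: not a rank entry: {line!r}")
            continue
        rank, host, slot = int(m.group(1)), m.group(2), m.group(3)
        if slot is None:
            # The case from this post: a host without any slot list.
            problems.append(f"line {lineno}: rank {rank} has no slot list")
            continue
        socket, sep, cores = slot.partition(":")
        if sep:
            entries.append((rank, host, int(socket), expand(cores)))
        else:
            entries.append((rank, host, None, expand(socket)))
    return entries, problems
```

Calling parse_rankfile on the contents of my_rankfile returns the parsed entries plus a list of problems. The sketch only covers the slot forms that appear in the man page example, so any other form would need extra handling.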

As you can see below, I could use a rankfile on my old local machine
(Sun Ultra 45), but not on our "new" one (Sun Server M4000). Today I
logged into that machine via ssh and ran the same command there as a
local user, again without success. The error is more or less the same
as before, when I tried to bind the process to a remote machine.

rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
[rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork
  binding child [[19734,1],0] to slot_list 0:0
--------------------------------------------------------------------------
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: 0:0
Error: Cross-device link

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: No such file or directory
Node: rs0.informatik.hs-fulda.de

when attempting to start process rank 0.
--------------------------------------------------------------------------
rs0 small_prog 119

The application is available.

rs0 small_prog 119 which rank_size
/home/fd1026/SunOS/sparc/bin/rank_size

Is this a problem in the Open MPI implementation or in my rankfile?
How can I query which sockets and how many cores per socket are
available, so that I can use correct values in my rankfile?
In LAM/MPI I had the command "lamnodes", which I could use to get
such information. Thank you very much for any help in advance.
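
As a stopgap, the "psrinfo -v" output quoted below can at least be counted. The following is a minimal sketch, assuming the header line format shown in that output; the function name virtual_processor_ids is made up for illustration. It only lists virtual processor ids and does not recover the socket/core layout that a rankfile needs.

```python
import re

# Header line that `psrinfo -v` prints for each virtual processor,
# in the format seen in the output quoted in this thread.
HEADER = re.compile(r"^Status of virtual processor (\d+) as of:")


def virtual_processor_ids(psrinfo_output):
    """Return the virtual processor ids found in `psrinfo -v` output."""
    ids = []
    for line in psrinfo_output.splitlines():
        m = HEADER.match(line.strip())
        if m:
            ids.append(int(m.group(1)))
    return ids
```

Feeding it the saved output of "psrinfo -v" from tyr would give two ids, matching the dual-processor machine described below.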

Kind regards

Siegmar

> > Are *all* the machines Sparc? Or just the 3rd one (rs0)?
>
> Yes, both machines are Sparc. I tried first in a homogeneous
> environment.
>
> tyr fd1026 106 psrinfo -v
> Status of virtual processor 0 as of: 09/04/2012 07:32:14
> on-line since 08/31/2012 15:44:42.
> The sparcv9 processor operates at 1600 MHz,
> and has a sparcv9 floating point processor.
> Status of virtual processor 1 as of: 09/04/2012 07:32:14
> on-line since 08/31/2012 15:44:39.
> The sparcv9 processor operates at 1600 MHz,
> and has a sparcv9 floating point processor.
> tyr fd1026 107
>
> My local machine (tyr) is a dual processor machine and the
> other one is equipped with two quad-core processors each
> capable of running two hardware threads.
>
>
> Kind regards
>
> Siegmar
>
>
> > On Sep 3, 2012, at 12:43 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
> >
> > > Hi,
> > >
> > > the man page for "mpiexec" shows the following:
> > >
> > > cat myrankfile
> > > rank 0=aa slot=1:0-2
> > > rank 1=bb slot=0:0,1
> > > rank 2=cc slot=1-2
> > > mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
> > >
> > > So that
> > >
> > > Rank 0 runs on node aa, bound to socket 1, cores 0-2.
> > > Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
> > > Rank 2 runs on node cc, bound to cores 1 and 2.
> > >
> > > Does it mean that the process with rank 0 should be bound to
> > > core 0, 1, or 2 of socket 1?
> > >
> > > I tried to use a rankfile and have a problem. My rankfile contains
> > > the following lines.
> > >
> > > rank 0=tyr.informatik.hs-fulda.de slot=0:0
> > > rank 1=tyr.informatik.hs-fulda.de slot=1:0
> > > #rank 2=rs0.informatik.hs-fulda.de slot=0:0
> > >
> > >
> > > Everything is fine if I use the file with just my local machine
> > > (the first two lines).
> > >
> > > tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
> > > [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> > > odls:default:fork binding child [[9849,1],0] to slot_list 0:0
> > > [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> > > odls:default:fork binding child [[9849,1],1] to slot_list 1:0
> > > I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> > > MPI standard 2.1 is supported.
> > > I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> > > MPI standard 2.1 is supported.
> > > tyr small_prog 116
> > >
> > >
> > > I can also change the socket number and the processes will be attached
> > > to the correct cores. Unfortunately it doesn't work if I add one
> > > other machine (third line).
> > >
> > >
> > > tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
> > > --------------------------------------------------------------------------
> > > We were unable to successfully process/set the requested processor
> > > affinity settings:
> > >
> > > Specified slot list: 0:0
> > > Error: Cross-device link
> > >
> > > This could mean that a non-existent processor was specified, or
> > > that the specification had improper syntax.
> > > --------------------------------------------------------------------------
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > > odls:default:fork binding child [[10212,1],0] to slot_list 0:0
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > > odls:default:fork binding child [[10212,1],1] to slot_list 1:0
> > > [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
> > > odls:default:fork binding child [[10212,1],2] to slot_list 0:0
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > > ORTE_ERROR_LOG: A message is attempting to be sent to a process
> > > whose contact information is unknown in file
> > > ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
> > > to [[10212,1],0]: tag 20
> > > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
> > > A message is attempting to be sent to a process whose contact
> > > information is unknown in file
> > > ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> > > at line 2501
> > > --------------------------------------------------------------------------
> > > mpiexec was unable to start the specified application as it
> > > encountered an error:
> > >
> > > Error name: Error 0
> > > Node: rs0.informatik.hs-fulda.de
> > >
> > > when attempting to start process rank 2.
> > > --------------------------------------------------------------------------
> > > tyr small_prog 113
> > >
> > >
> > >
> > > The other machine has two 8 core processors.
> > >
> > > tyr small_prog 121 ssh rs0 psrinfo -v
> > > Status of virtual processor 0 as of: 09/03/2012 19:51:15
> > > on-line since 07/26/2012 15:03:14.
> > > The sparcv9 processor operates at 2400 MHz,
> > > and has a sparcv9 floating point processor.
> > > Status of virtual processor 1 as of: 09/03/2012 19:51:15
> > > ...
> > > Status of virtual processor 15 as of: 09/03/2012 19:51:15
> > > on-line since 07/26/2012 15:03:16.
> > > The sparcv9 processor operates at 2400 MHz,
> > > and has a sparcv9 floating point processor.
> > > tyr small_prog 122
> > >
> > >
> > >
> > > Is it necessary to specify another option on the command line or
> > > is my rankfile faulty? Thank you very much for any suggestions in
> > > advance.
> > >
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>