Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-09-04 01:37:14


Hi,

> Are *all* the machines Sparc? Or just the 3rd one (rs0)?

Yes, both machines are SPARC. I first tried a homogeneous
environment.

tyr fd1026 106 psrinfo -v
Status of virtual processor 0 as of: 09/04/2012 07:32:14
  on-line since 08/31/2012 15:44:42.
  The sparcv9 processor operates at 1600 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 09/04/2012 07:32:14
  on-line since 08/31/2012 15:44:39.
  The sparcv9 processor operates at 1600 MHz,
        and has a sparcv9 floating point processor.
tyr fd1026 107

My local machine (tyr) is a dual-processor machine, and the
other one (rs0) is equipped with two quad-core processors, each
capable of running two hardware threads.
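
To make the slot syntax from the man page excerpt quoted below concrete,
here is a minimal Python sketch of how the three example lines can be read.
It is only an illustration: parse_rankfile_line and expand are hypothetical
helpers, not part of Open MPI, and the sketch assumes just the three forms
shown in the man page (socket:range, socket:list, and a bare range), not
everything the real rankfile parser may accept.

```python
import re

# Hypothetical helpers for illustration only; not part of Open MPI.
# Matches lines such as "rank 0=aa slot=1:0-2".
LINE_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def expand(spec):
    """Expand a spec like "0-2" or "0,1" into a sorted list of integers."""
    ids = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.update(range(int(lo), int(hi) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


def parse_rankfile_line(line):
    """Return (rank, host, socket, cores), or None for blank/comment lines.

    socket is None when the slot has no "socket:" prefix (e.g. "slot=1-2").
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized rankfile line: {line!r}")
    rank, host, slot = int(m.group(1)), m.group(2), m.group(3)
    if ":" in slot:
        socket, cores = slot.split(":", 1)
        return rank, host, int(socket), expand(cores)
    return rank, host, None, expand(slot)


if __name__ == "__main__":
    for text in ("rank 0=aa slot=1:0-2",
                 "rank 1=bb slot=0:0,1",
                 "rank 2=cc slot=1-2"):
        print(parse_rankfile_line(text))
```

Applied to the three man page lines, this gives (0, 'aa', 1, [0, 1, 2]),
(1, 'bb', 0, [0, 1]) and (2, 'cc', None, [1, 2]). It says nothing about
whether a rank is bound to all listed cores at once or to only one of them.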

Kind regards

Siegmar

> On Sep 3, 2012, at 12:43 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
> > Hi,
> >
> > the man page for "mpiexec" shows the following:
> >
> > cat myrankfile
> > rank 0=aa slot=1:0-2
> > rank 1=bb slot=0:0,1
> > rank 2=cc slot=1-2
> > mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
> >
> > So that
> >
> > Rank 0 runs on node aa, bound to socket 1, cores 0-2.
> > Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
> > Rank 2 runs on node cc, bound to cores 1 and 2.
> >
> > Does it mean that the process with rank 0 should be bound to
> > core 0, 1, or 2 of socket 1?
> >
> > I tried to use a rankfile and have a problem. My rankfile contains
> > the following lines.
> >
> > rank 0=tyr.informatik.hs-fulda.de slot=0:0
> > rank 1=tyr.informatik.hs-fulda.de slot=1:0
> > #rank 2=rs0.informatik.hs-fulda.de slot=0:0
> >
> >
> > Everything is fine if I use the file with just my local machine
> > (the first two lines).
> >
> > tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
> > [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> > odls:default:fork binding child [[9849,1],0] to slot_list 0:0
> > [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> > odls:default:fork binding child [[9849,1],1] to slot_list 1:0
> > I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> > MPI standard 2.1 is supported.
> > I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> > MPI standard 2.1 is supported.
> > tyr small_prog 116
> >
> >
> > I can also change the socket number and the processes will be attached
> > to the correct cores. Unfortunately, it doesn't work if I add another
> > machine (the third line).
> >
> >
> > tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
> > --------------------------------------------------------------------------
> > We were unable to successfully process/set the requested processor
> > affinity settings:
> >
> > Specified slot list: 0:0
> > Error: Cross-device link
> >
> > This could mean that a non-existent processor was specified, or
> > that the specification had improper syntax.
> > --------------------------------------------------------------------------
> > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > odls:default:fork binding child [[10212,1],0] to slot_list 0:0
> > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > odls:default:fork binding child [[10212,1],1] to slot_list 1:0
> > [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
> > odls:default:fork binding child [[10212,1],2] to slot_list 0:0
> > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> > ORTE_ERROR_LOG: A message is attempting to be sent to a process
> > whose contact information is unknown in file
> > ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
> > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
> > to [[10212,1],0]: tag 20
> > [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
> > A message is attempting to be sent to a process whose contact
> > information is unknown in file
> > ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> > at line 2501
> > --------------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> > encountered an error:
> >
> > Error name: Error 0
> > Node: rs0.informatik.hs-fulda.de
> >
> > when attempting to start process rank 2.
> > --------------------------------------------------------------------------
> > tyr small_prog 113
> >
> >
> >
> > The other machine has two 8 core processors.
> >
> > tyr small_prog 121 ssh rs0 psrinfo -v
> > Status of virtual processor 0 as of: 09/03/2012 19:51:15
> > on-line since 07/26/2012 15:03:14.
> > The sparcv9 processor operates at 2400 MHz,
> > and has a sparcv9 floating point processor.
> > Status of virtual processor 1 as of: 09/03/2012 19:51:15
> > ...
> > Status of virtual processor 15 as of: 09/03/2012 19:51:15
> > on-line since 07/26/2012 15:03:16.
> > The sparcv9 processor operates at 2400 MHz,
> > and has a sparcv9 floating point processor.
> > tyr small_prog 122
> >
> >
> >
> > Is it necessary to specify another option on the command line, or
> > is my rankfile faulty? Thank you very much in advance for any
> > suggestions.
> >
> >
> > Kind regards
> >
> > Siegmar