Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-09-05 09:16:48


I couldn't really say for certain - I don't see anything obviously wrong with your syntax, and the code appears to be working or else it would fail on the other nodes as well. The fact that it fails solely on that machine seems suspect.

Set aside the rankfile for the moment and try to just bind to cores on that machine, something like:

mpiexec --report-bindings -bind-to-core -host rs0.informatik.hs-fulda.de -n 2 rank_size

If that doesn't work, then the problem isn't with the rankfile.
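
To rule out the rankfile syntax independently of mpiexec, a small checker along these lines may help. This is only a minimal Python sketch, not part of Open MPI: the names `parse_rankfile` and `expand_cores` are hypothetical, and it assumes that only the forms shown in the man page excerpt quoted below are used (`slot=socket:cores` or `slot=cores`, with core ranges or comma lists, and `#` for commented-out lines). It does not check whether the named sockets and cores actually exist on the target machine.

```python
import re

# Matches only the forms in the quoted man page example, e.g.
# "rank 0=aa slot=1:0-2", "rank 1=bb slot=0:0,1", "rank 2=cc slot=1-2".
LINE_RE = re.compile(
    r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(?:(\d+):)?([\d,\-]+)$"
)


def expand_cores(spec):
    """Expand "0-2" or "0,1" into a sorted list of core numbers."""
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def parse_rankfile(text):
    """Return (rank, host, socket, cores) tuples; socket is None if not given."""
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out entry
        match = LINE_RE.match(line)
        if match is None:
            # Includes entries without a slot list, which this sketch rejects.
            raise ValueError(f"line {lineno}: unrecognized entry: {line!r}")
        rank, host, socket, cores = match.groups()
        entries.append(
            (int(rank), host,
             None if socket is None else int(socket),
             expand_cores(cores))
        )
    return entries


# The rankfile from the original report, with the rs0 line commented out.
sample = """\
rank 0=tyr.informatik.hs-fulda.de slot=0:0
rank 1=tyr.informatik.hs-fulda.de slot=1:0
#rank 2=rs0.informatik.hs-fulda.de slot=0:0
"""

for entry in parse_rankfile(sample):
    print(entry)
```

If the file parses cleanly here and the plain -bind-to-core run above still fails on rs0, that would point at the binding support on that machine rather than at the rankfile.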

On Sep 5, 2012, at 5:50 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> I'm new to rankfiles, so I experimented a little with different
> options. I thought that the following entry would be similar to an
> entry in an appfile and that MPI could place the process with rank 0
> on any core of any processor.
>
> rank 0=tyr.informatik.hs-fulda.de
>
> Unfortunately it's not allowed and I got an error. Can somebody add
> the missing help to the file?
>
>
> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> no-slot-list
> from the file:
> help-rmaps_rank_file.txt
> But I couldn't find that topic in the file. Sorry!
> --------------------------------------------------------------------------
>
>
> As you can see below I could use a rankfile on my old local machine
> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I
> logged into the machine via ssh and tried the same command once more
> as a local user without success. It's more or less the same error as
> before when I tried to bind the process to a remote machine.
>
> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork
> binding child [[19734,1],0] to slot_list 0:0
> --------------------------------------------------------------------------
> We were unable to successfully process/set the requested processor
> affinity settings:
>
> Specified slot list: 0:0
> Error: Cross-device link
>
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an error:
>
> Error name: No such file or directory
> Node: rs0.informatik.hs-fulda.de
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> rs0 small_prog 119
>
>
> The application is available.
>
> rs0 small_prog 119 which rank_size
> /home/fd1026/SunOS/sparc/bin/rank_size
>
>
> Is this a problem in the Open MPI implementation or in my rankfile?
> How can I find out which sockets and cores per socket are
> available so that I can use correct values in my rankfile?
> In LAM/MPI I had a command "lamnodes" which I could use to get
> such information. Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar
>
>
>
>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)?
>>
>> Yes, both machines are Sparc. I tried first in a homogeneous
>> environment.
>>
>> tyr fd1026 106 psrinfo -v
>> Status of virtual processor 0 as of: 09/04/2012 07:32:14
>> on-line since 08/31/2012 15:44:42.
>> The sparcv9 processor operates at 1600 MHz,
>> and has a sparcv9 floating point processor.
>> Status of virtual processor 1 as of: 09/04/2012 07:32:14
>> on-line since 08/31/2012 15:44:39.
>> The sparcv9 processor operates at 1600 MHz,
>> and has a sparcv9 floating point processor.
>> tyr fd1026 107
>>
>> My local machine (tyr) is a dual processor machine and the
>> other one is equipped with two quad-core processors each
>> capable of running two hardware threads.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>>
>>>> Hi,
>>>>
>>>> the man page for "mpiexec" shows the following:
>>>>
>>>> cat myrankfile
>>>> rank 0=aa slot=1:0-2
>>>> rank 1=bb slot=0:0,1
>>>> rank 2=cc slot=1-2
>>>> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
>>>>
>>>> So that
>>>>
>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2.
>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
>>>> Rank 2 runs on node cc, bound to cores 1 and 2.
>>>>
>>>> Does it mean that the process with rank 0 should be bound to
>>>> core 0, 1, or 2 of socket 1?
>>>>
>>>> I tried to use a rankfile and have a problem. My rankfile contains
>>>> the following lines.
>>>>
>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0
>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0
>>>>
>>>>
>>>> Everything is fine if I use the file with just my local machine
>>>> (the first two lines).
>>>>
>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0
>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0
>>>> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
>>>> MPI standard 2.1 is supported.
>>>> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
>>>> MPI standard 2.1 is supported.
>>>> tyr small_prog 116
>>>>
>>>>
>>>> I can also change the socket number and the processes will be attached
>>>> to the correct cores. Unfortunately it doesn't work if I add one
>>>> other machine (third line).
>>>>
>>>>
>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> --------------------------------------------------------------------------
>>>> We were unable to successfully process/set the requested processor
>>>> affinity settings:
>>>>
>>>> Specified slot list: 0:0
>>>> Error: Cross-device link
>>>>
>>>> This could mean that a non-existent processor was specified, or
>>>> that the specification had improper syntax.
>>>> --------------------------------------------------------------------------
>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0
>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0
>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0
>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
>>>> whose contact information is unknown in file
>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
>>>> to [[10212,1],0]: tag 20
>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
>>>> A message is attempting to be sent to a process whose contact
>>>> information is unknown in file
>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>> at line 2501
>>>> --------------------------------------------------------------------------
>>>> mpiexec was unable to start the specified application as it
>>>> encountered an error:
>>>>
>>>> Error name: Error 0
>>>> Node: rs0.informatik.hs-fulda.de
>>>>
>>>> when attempting to start process rank 2.
>>>> --------------------------------------------------------------------------
>>>> tyr small_prog 113
>>>>
>>>>
>>>>
>>>> The other machine has two 8 core processors.
>>>>
>>>> tyr small_prog 121 ssh rs0 psrinfo -v
>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15
>>>> on-line since 07/26/2012 15:03:14.
>>>> The sparcv9 processor operates at 2400 MHz,
>>>> and has a sparcv9 floating point processor.
>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15
>>>> ...
>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15
>>>> on-line since 07/26/2012 15:03:16.
>>>> The sparcv9 processor operates at 2400 MHz,
>>>> and has a sparcv9 floating point processor.
>>>> tyr small_prog 122
>>>>
>>>>
>>>>
>>>> Is it necessary to specify another option on the command line or
>>>> is my rankfile faulty? Thank you very much for any suggestions in
>>>> advance.
>>>>
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users