Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-05-11 19:45:50


This is fixed as of r21208.

Thanks for reporting it!
Ralph

On May 11, 2009, at 12:51 PM, Anton Starikov wrote:

> Although removing this check solves the problem of having more slots in
> the rankfile than necessary, there is another problem.
>
> If I set rmaps_base_no_oversubscribe=1, then with, for example:
>
>
> hostfile:
>
> node01
> node01
> node02
> node02
>
> rankfile:
>
> rank 0=node01 slot=1
> rank 1=node01 slot=0
> rank 2=node02 slot=1
> rank 3=node02 slot=0
>
> mpirun -np 4 ./something
>
> complains with:
>
> "There are not enough slots available in the system to satisfy the 4
> slots
> that were requested by the application"
>
> but "mpirun -np 3 ./something" will work though. It works, when you
> ask for 1 CPU less. And the same behavior in any case (shared nodes,
> non-shared nodes, multi-node)
>
> If you switch off rmaps_base_no_oversubscribe, then it works and all
> affinities are set as requested in the rankfile; there is no
> oversubscription.
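>
> For reference, the failing case corresponds to an invocation along these
> lines (assuming the files above are saved as "hostfile" and "rankfile",
> and the MCA parameter is passed on the command line):
>
> mpirun --mca rmaps_base_no_oversubscribe 1 --hostfile hostfile \
>        -rf rankfile -np 4 ./something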
>
>
> Anton.
>
> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>
>> Ah - thx for catching that, I'll remove that check. It is no longer
>> required.
>>
>> Thx!
>>
>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]> wrote:
>> According to the code, it does care.
>>
>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>
>> ival = orte_rmaps_rank_file_value.ival;
>> if ( ival > (np-1) ) {
>>     orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true,
>>                    ival, rankfile);
>>     rc = ORTE_ERR_BAD_PARAM;
>>     goto unlock;
>> }
>>
>> If I remember correctly, I used an array to map ranks, and since the
>> length of the array is NP, the maximum index must be less than NP; so if
>> you have a rank number greater than NP-1, there is no place to put it
>> inside the array.
>>
>> "Likewise, if you have more procs than the rankfile specifies, we
>> map the additional procs either byslot (default) or bynode (if you
>> specify that option). So the rankfile doesn't need to contain an
>> entry for every proc." - Correct point.
>>
>>
>> Lenny.
>>
>>
>> On 5/5/09, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> Sorry Lenny, but
>> that isn't correct. The rankfile mapper doesn't care if the
>> rankfile contains additional info - it only maps up to the number
>> of processes, and ignores anything beyond that number. So there is
>> no need to remove the additional info.
>>
>> Likewise, if you have more procs than the rankfile specifies, we
>> map the additional procs either byslot (default) or bynode (if you
>> specify that option). So the rankfile doesn't need to contain an
>> entry for every proc.
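>>
>> For example (the hostnames and file names here are just placeholders), a
>> rankfile that only pins the first two ranks:
>>
>> rank 0=nodeA slot=0
>> rank 1=nodeB slot=0
>>
>> run with something like "mpirun -rf myrankfile -np 4 ./a.out", would
>> place ranks 0 and 1 as specified and map ranks 2 and 3 byslot (or bynode
>> if you add --bynode).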
>>
>> Just don't want to confuse folks.
>> Ralph
>>
>>
>>
>>
>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]> wrote:
>> Hi,
>> the maximum rank number must be less than np.
>> if np=1 then there is only rank 0 in the system, so rank 1 is
>> invalid.
>> please remove "rank 1=node2 slot=*" from the rankfile
>> Best regards,
>> Lenny.
>>
>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot
>> <geopignot_at_[hidden]> wrote:
>> Hi ,
>>
>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
>> command doesn't work
>>
>> cat rankf:
>> rank 0=node1 slot=*
>> rank 1=node2 slot=*
>>
>> cat hostf:
>> node1 slots=2
>> node2 slots=2
>>
>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
>>
>> Error, invalid rank (1) in the rankfile (rankf)
>>
>> --------------------------------------------------------------------------
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file rmaps_rank_file.c at line 403
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file base/rmaps_base_map_job.c at line 86
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file base/plm_base_launch_support.c at line 86
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file plm_rsh_module.c at line 1016
>>
>>
>> Ralph, could you tell me if my command syntax is correct or not?
>> If not, could you give me the expected one?
>>
>> Regards
>>
>> Geoffroy
>>
>>
>>
>>
>> 2009/4/30 Geoffroy Pignot <geopignot_at_[hidden]>
>>
>> Immediately Sir !!! :)
>>
>> Thanks again Ralph
>>
>> Geoffroy
>>
>>
>>
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>> From: Ralph Castain <rhc_at_[hidden]>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <users_at_[hidden]>
>>
>> I believe this is fixed now in our development trunk - you can
>> download any
>> tarball starting from last night and give it a try, if you like. Any
>> feedback would be appreciated.
>>
>> Ralph
>>
>>
>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>
>> Ah now, I didn't say it -worked-, did I? :-)
>>
>> Clearly a bug exists in the program. I'll try to take a look at it
>> (if Lenny
>> doesn't get to it first), but it won't be until later in the week.
>>
>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>
>> I agree with you Ralph, and that's what I expect from openmpi, but my
>> second example shows that it's not working:
>>
>> cat hostfile.0
>> r011n002 slots=4
>> r011n003 slots=4
>>
>> cat rankfile.0
>> rank 0=r011n002 slot=0
>> rank 1=r011n003 slot=1
>>
>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> ### CRASHED
>>
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > rmaps_rank_file.c at line 404
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > base/rmaps_base_map_job.c at line 87
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > base/plm_base_launch_support.c at line 77
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > plm_rsh_module.c at line 985
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>> > attempting to
>> > > launch so we are aborting.
>> > >
>> > > There may be more information reported by the environment (see
>> > above).
>> > >
>> > > This may be because the daemon was unable to find all the needed
>> > shared
>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>> > have the
>> > > location of the shared libraries on the remote nodes and this
>> will
>> > > automatically be forwarded to the remote nodes.
>> > >
>> >
>> --------------------------------------------------------------------------
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > orterun noticed that the job aborted, but has no info as to the
>> > process
>> > > that caused that situation.
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > orterun: clean termination accomplished
>>
>>
>>
>> Message: 4
>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>> From: Ralph Castain <rhc_at_[hidden]>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <users_at_[hidden]>
>>
>> The rankfile cuts across the entire job - it isn't applied on an
>> app_context basis. So the ranks in your rankfile must correspond to
>> the eventual rank of each process in the cmd line.
>>
>> Unfortunately, that means you have to count ranks. In your case, you
>> only have four, so that makes life easier. Your rankfile would look
>> something like this:
>>
>> rank 0=r001n001 slot=0
>> rank 1=r001n002 slot=1
>> rank 2=r001n001 slot=1
>> rank 3=r001n002 slot=2
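>>
>> Paired with your four-app-context command line, that gives something
>> like this (the rankfile name is arbitrary):
>>
>> mpirun -rf myrankfile -n 1 -host r001n001 master.x options1 : \
>>        -n 1 -host r001n002 master.x options2 : \
>>        -n 1 -host r001n001 slave.x options3 : \
>>        -n 1 -host r001n002 slave.x options4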
>>
>> HTH
>> Ralph
>>
>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>
>> > Hi,
>> >
>> > I agree that my examples are not very clear. What I want to do is to
>> > launch a multi-executable application (masters/slaves) and benefit
>> > from processor affinity. Could you show me how to convert this
>> > command using the -rf option (whatever the affinity is)?
>> >
>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
>> > -host r001n002 slave.x options4
>> >
>> > Thanks for your help
>> >
>> > Geoffroy
>> >
>> >
>> >
>> >
>> >
>> > Message: 2
>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>> > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > To: Open MPI Users <users_at_[hidden]>
>> >
>> > Hi,
>> >
>> > The first "crash" is OK, since your rankfile has ranks 0 and 1
>> > defined,
>> > while n=1, which means only rank 0 is present and can be allocated.
>> >
>> > NP must be >= the largest rank in rankfile.
>> >
>> > What exactly are you trying to do ?
>> >
>> > I tried to recreate your segv, but all I got was:
>> >
>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>> > hostfile.0
>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> > [witch19:30798] mca: base: component_find: paffinity
>> > "mca_paffinity_linux"
>> > uses an MCA interface that is not recognized (component MCA
>> v1.0.0 !=
>> > supported MCA v2.0.0) -- ignored
>> >
>> --------------------------------------------------------------------------
>> > It looks like opal_init failed for some reason; your parallel
>> > process is
>> > likely to abort. There are many reasons that a parallel process can
>> > fail during opal_init; some of which are due to configuration or
>> > environment problems. This failure appears to be an internal
>> failure;
>> > here's some additional information (which may only be relevant to
>> an
>> > Open MPI developer):
>> >
>> > opal_carto_base_select failed
>> > --> Returned value -13 instead of OPAL_SUCCESS
>> >
>> --------------------------------------------------------------------------
>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>> file
>> > ../../orte/runtime/orte_init.c at line 78
>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>> file
>> > ../../orte/orted/orted_main.c at line 344
>> >
>> --------------------------------------------------------------------------
>> > A daemon (pid 11629) died unexpectedly with status 243 while
>> > attempting
>> > to launch so we are aborting.
>> >
>> > There may be more information reported by the environment (see
>> above).
>> >
>> > This may be because the daemon was unable to find all the needed
>> > shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>> > have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> >
>> --------------------------------------------------------------------------
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the
>> process
>> > that caused that situation.
>> >
>> --------------------------------------------------------------------------
>> > mpirun: clean termination accomplished
>> >
>> >
>> > Lenny.
>> >
>> >
>> > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>> > >
>> > > Hi ,
>> > >
>> > > I am currently testing the process affinity capabilities of openmpi
>> > > and I would like to know if the rankfile behaviour I will describe
>> > > below is normal or not?
>> > >
>> > > cat hostfile.0
>> > > r011n002 slots=4
>> > > r011n003 slots=4
>> > >
>> > > cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > > rank 1=r011n003 slot=1
>> > >
>> > >
>> > >
>> >
>> ##################################################################################
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
>> > > r011n002
>> > > r011n003
>> > >
>> > >
>> > >
>> >
>> ##################################################################################
>> > > but
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> > > ### CRASHED
>> > > *
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > rmaps_rank_file.c at line 404
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > base/rmaps_base_map_job.c at line 87
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > base/plm_base_launch_support.c at line 77
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> file
>> > > plm_rsh_module.c at line 985
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>> > attempting to
>> > > launch so we are aborting.
>> > >
>> > > There may be more information reported by the environment (see
>> > above).
>> > >
>> > > This may be because the daemon was unable to find all the needed
>> > shared
>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>> > have the
>> > > location of the shared libraries on the remote nodes and this
>> will
>> > > automatically be forwarded to the remote nodes.
>> > >
>> >
>> --------------------------------------------------------------------------
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > orterun noticed that the job aborted, but has no info as to the
>> > process
>> > > that caused that situation.
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > orterun: clean termination accomplished
>> > > *
>> > > It seems that the rankfile option is not propagated to the second
>> > > command line; there is no global understanding of the ranking inside
>> > > an mpirun command.
>> > >
>> > >
>> > >
>> >
>> ##################################################################################
>> > >
>> > > Assuming that, I tried to provide a rankfile to each command line:
>> > >
>> > > cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > >
>> > > cat rankfile.1
>> > > rank 0=r011n003 slot=1
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> > > ### CRASHED
>> > > *[r011n002:28778] *** Process received signal ***
>> > > [r011n002:28778] Signal: Segmentation fault (11)
>> > > [r011n002:28778] Signal code: Address not mapped (1)
>> > > [r011n002:28778] Failing at address: 0x34
>> > > [r011n002:28778] [ 0] [0xffffe600]
>> > > [r011n002:28778] [ 1]
>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>> > 0(orte_odls_base_default_get_add_procs_data+0x55d)
>> > > [0x5557decd]
>> > > [r011n002:28778] [ 2]
>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>> > 0(orte_plm_base_launch_apps+0x117)
>> > > [0x555842a7]
>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/
>> > mca_plm_rsh.so
>> > > [0x556098c0]
>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> > [0x804aa27]
>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> > [0x804a022]
>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc)
>> > [0x9f1dec]
>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> > [0x8049f71]
>> > > [r011n002:28778] *** End of error message ***
>> > > Segmentation fault (core dumped)*
>> > >
>> > >
>> > >
>> > > I hope that I've found a bug, because it would be very important for
>> > > me to have this kind of capability: to launch a multi-executable
>> > > mpirun command line and be able to bind my executables and sockets
>> > > together.
>> > >
>> > > Thanks in advance for your help
>> > >
>> > > Geoffroy
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users