Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] rank file error: Rankfile claimed...
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-17 07:37:00


> Is there an explanation for this?

I believe the word is "bug". :-)

The rank_file mapper has been substantially revised lately - we are
discussing now how much of that revision to bring to 1.3.4 versus the
next major release.

Ralph

On Aug 17, 2009, at 4:45 AM, jody wrote:

> Hi Lenny
>
>> I think it has something to do with your environment: /etc/hosts, IT setup,
>> the return value of the hostname function, etc.
>> I am not sure it has anything to do with Open MPI at all.
>
> OK. I just thought this was Open MPI related because I was able to use the
> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in the
> host file...
>
> However, I encountered a new problem:
> if the rankfile lists all the entries which occur in the host file
> there is an error message.
> In the following example, the hostfile is
> [jody_at_plankton neander]$ cat th_02
> nano_00.uzh.ch slots=2 max-slots=2
> nano_02.uzh.ch slots=2 max-slots=2
>
> and the rankfile is:
> [jody_at_plankton neander]$ cat rf_02
> rank 0=nano_00.uzh.ch slot=0
> rank 2=nano_00.uzh.ch slot=1
> rank 1=nano_02.uzh.ch slot=0
> rank 3=nano_02.uzh.ch slot=1
>
> Here is the error:
> [jody_at_plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
> ./HelloMPI
>
> Either request fewer slots for your application, or make more slots available
> for use.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> If I use a hostfile with one more entry:
> [jody_at_aim-plankton neander]$ cat th_021
> aim-nano_00.uzh.ch slots=2 max-slots=2
> aim-nano_02.uzh.ch slots=2 max-slots=2
> aim-nano_01.uzh.ch slots=1 max-slots=1
>
> Then this works fine:
> [jody_at_aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 ./HelloMPI
>
> Is there an explanation for this?
>
> Thank You
> Jody
>
>> Lenny.
>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody.xha_at_[hidden]> wrote:
>>>
>>> Hi Lenny
>>>
>>> Thanks - using the full names makes it work!
>>> Is there a reason why the rankfile option treats
>>> host names differently than the hostfile option?
>>>
>>> Thanks
>>> Jody
>>>
>>>
>>>
>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny Verkhovsky<lenny.verkhovsky_at_[hidden]> wrote:
>>>> Hi
>>>> This message means that you are trying to use host "plankton", which was
>>>> not allocated via hostfile or hostlist.
>>>> But according to the files and command line, everything seems fine.
>>>> Can you try using the "plankton.uzh.ch" hostname instead of "plankton"?
>>>> Thanks
>>>> Lenny.
>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody.xha_at_[hidden]> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> When I use a rankfile, I get an error message which I don't understand:
>>>>>
>>>>> [jody_at_plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts ./HelloMPI
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Rankfile claimed host plankton that was not allocated or oversubscribed it's slots:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 108
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 990
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>> launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>>
>>>>>
>>>>>
>>>>> Without the '-rf rankfile' option everything works as expected.
>>>>>
>>>>> My hostfile:
>>>>> [jody_at_plankton tests]$ cat testhosts
>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>> # want to disallow over-subscribing it:
>>>>> plankton slots=3 max-slots=3
>>>>> # The following nodes are dual-processor machines:
>>>>> nano_00 slots=2 max-slots=2
>>>>> nano_01 slots=2 max-slots=2
>>>>> nano_02 slots=2 max-slots=2
>>>>> nano_03 slots=2 max-slots=2
>>>>> nano_04 slots=2 max-slots=2
>>>>> nano_05 slots=2 max-slots=2
>>>>> nano_06 slots=2 max-slots=2
>>>>>
>>>>> My rank file:
>>>>> [jody_at_plankton neander]$ cat rankfile
>>>>> rank 0=nano_00 slot=1
>>>>> rank 1=plankton slot=0
>>>>> rank 2=nano_01 slot=1
>>>>>
>>>>> My Open MPI version: 1.3.2
>>>>>
>>>>> I get the same error if I use a rankfile which has the single line
>>>>> rank 0=plankton slot=0
>>>>> (plankton is my local machine) and call mpirun with -np 1.
>>>>>
>>>>> What does the "Rankfile claimed..." message mean?
>>>>> Did I make an error in my rankfile?
>>>>> If yes, what would be the correct way to write it?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
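
The hostname mismatch discussed in this thread can be checked before
calling mpirun. The following is a minimal Python sketch, not part of
Open MPI: the function names (parse_hostfile, parse_rankfile,
check_rankfile) are hypothetical, and it only understands the simple
"host slots=N" and "rank N=host slot=S" lines shown in the messages
above. Because Lenny's workaround was to use full names, it also flags
rankfile hosts that have no domain part; that rule is an assumption
based on this thread, not documented behaviour. An empty result does not
guarantee that mpirun accepts the files, as the second example in the
thread shows.

```python
import re

# Matches lines such as "rank 0=nano_00.uzh.ch slot=1".
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def parse_hostfile(text):
    """Return {hostname: slots} from hostfile text; '#' starts a comment."""
    hosts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        slots = 1  # assumption: one slot when no slots= field is given
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                slots = int(value)
        hosts[fields[0]] = slots
    return hosts


def parse_rankfile(text):
    """Return a list of (rank, hostname, slot) tuples from rankfile text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        entries.append((int(match.group(1)), match.group(2), match.group(3)))
    return entries


def check_rankfile(hostfile_text, rankfile_text):
    """Return a list of human-readable problems (empty if none were found)."""
    hosts = parse_hostfile(hostfile_text)
    problems = []
    ranks_per_host = {}
    for rank, host, _slot in parse_rankfile(rankfile_text):
        if "." not in host:
            # Assumption taken from the thread: short aliases confused
            # the rankfile mapper in 1.3.2, full names worked.
            problems.append(f"rank {rank}: host {host!r} is not fully qualified")
        if host not in hosts:
            problems.append(f"rank {rank}: host {host!r} is not listed in the hostfile")
            continue
        ranks_per_host[host] = ranks_per_host.get(host, 0) + 1
    for host, count in ranks_per_host.items():
        if count > hosts[host]:
            problems.append(f"host {host!r}: {count} ranks but only {hosts[host]} slots")
    return problems


if __name__ == "__main__":
    hostfile = "plankton slots=3 max-slots=3\nnano_00 slots=2 max-slots=2\n"
    rankfile = "rank 0=nano_00 slot=1\nrank 1=plankton slot=0\n"
    for problem in check_rankfile(hostfile, rankfile):
        print(problem)
```

The checker compares names literally and does no DNS lookup, so it
cannot tell whether "plankton" and "plankton.uzh.ch" are the same
machine; it only points out names that differ between the two files or
that lack a domain part.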