
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] rank file error: Rankfile claimed...
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-08-17 09:29:45


Can you try not specifying "max-slots" in the hostfile?
If you are the only user of the nodes, there will be no oversubscribing of
the processors.
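For example, a hostfile without the hard limit might look like this (a sketch
based on the hostfile from jody's mail below; the slot counts are his):

nano_00.uzh.ch slots=2
nano_02.uzh.ch slots=2

Without "max-slots", the slot count acts as a soft limit, so the mapper
should not abort just because the rankfile fills a node up to its limit.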
This one definitely looks like a bug,
but as Ralph said, there is an ongoing discussion about this component
and work on it in progress.
Lenny.

On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain <rhc_at_[hidden]> wrote:

>> Is there an explanation for this?
>>
>
> I believe the word is "bug". :-)
>
> The rank_file mapper has been substantially revised lately - we are
> discussing now how much of that revision to bring to 1.3.4 versus the next
> major release.
>
> Ralph
>
> On Aug 17, 2009, at 4:45 AM, jody wrote:
>
>> Hi Lenny
>>
>>> I think it has something to do with your environment, /etc/hosts, IT
>>> setup, the hostname function's return value, etc.
>>> I am not sure if it has anything to do with Open MPI at all.
>>>
>>
>> OK. I just thought this was Open MPI related because I was able to use the
>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>> the hostfile...
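>>
>> Such an alias usually comes from name resolution; a hypothetical
>> /etc/hosts line that maps both names to one address (the address is
>> made up for illustration) would be:
>>
>> 192.168.1.10 plankton.uzh.ch plankton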
>>
>> However, I encountered a new problem:
>> if the rankfile lists all the hosts which occur in the hostfile,
>> I get an error message.
>> In the following example, the hostfile is
>> [jody_at_plankton neander]$ cat th_02
>> nano_00.uzh.ch slots=2 max-slots=2
>> nano_02.uzh.ch slots=2 max-slots=2
>>
>> and the rankfile is:
>> [jody_at_plankton neander]$ cat rf_02
>> rank 0=nano_00.uzh.ch slot=0
>> rank 2=nano_00.uzh.ch slot=1
>> rank 1=nano_02.uzh.ch slot=0
>> rank 3=nano_02.uzh.ch slot=1
>>
>> Here is the error:
>> [jody_at_plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02
>> ./HelloMPI
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>> ./HelloMPI
>>
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>>
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> If I use a hostfile with one more entry
>> [jody_at_aim-plankton neander]$ cat th_021
>> aim-nano_00.uzh.ch slots=2 max-slots=2
>> aim-nano_02.uzh.ch slots=2 max-slots=2
>> aim-nano_01.uzh.ch slots=1 max-slots=1
>>
>> Then this works fine:
>> [jody_at_aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02
>> ./HelloMPI
>>
>> Is there an explanation for this?
>>
>> Thank You
>> Jody
>>
>>> Lenny.
>>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody.xha_at_[hidden]> wrote:
>>>
>>>>
>>>> Hi Lenny
>>>>
>>>> Thanks - using the full names makes it work!
>>>> Is there a reason why the rankfile option treats
>>>> host names differently than the hostfile option?
>>>>
>>>> Thanks
>>>> Jody
>>>>
>>>>
>>>>
>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>>>> Verkhovsky<lenny.verkhovsky_at_[hidden]> wrote:
>>>>
>>>>> Hi
>>>>> This message means
>>>>> that you are trying to use host "plankton", which was not allocated via
>>>>> the hostfile or host list.
>>>>> But according to your files and command line, everything seems fine.
>>>>> Can you try using the hostname "plankton.uzh.ch" instead of "plankton"?
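>>>>> For instance, your rankfile rewritten with fully qualified names (a
>>>>> sketch; the ranks and slots are the same as in your rankfile below):
>>>>>
>>>>> rank 0=nano_00.uzh.ch slot=1
>>>>> rank 1=plankton.uzh.ch slot=0
>>>>> rank 2=nano_01.uzh.ch slot=1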
>>>>> Thanks,
>>>>> Lenny.
>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody.xha_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> When I use a rankfile, I get an error message which I don't
>>>>>> understand:
>>>>>>
>>>>>> [jody_at_plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
>>>>>> ./HelloMPI
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Rankfile claimed host plankton that was not allocated or
>>>>>> oversubscribed it's slots:
>>>>>>
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file rmaps_rank_file.c at line 108
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file base/rmaps_base_map_job.c at line 87
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file base/plm_base_launch_support.c at line 77
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file plm_rsh_module.c at line 990
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting
>>>>>> to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed
>>>>>> shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>> the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun: clean termination accomplished
>>>>>>
>>>>>>
>>>>>>
>>>>>> Without the '-rf rankfile' option, everything works as expected.
>>>>>>
>>>>>> My hostfile :
>>>>>> [jody_at_plankton tests]$ cat testhosts
>>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>>> # want to disallow over-subscribing it:
>>>>>> plankton slots=3 max-slots=3
>>>>>> # The following nodes are dual-processor machines:
>>>>>> nano_00 slots=2 max-slots=2
>>>>>> nano_01 slots=2 max-slots=2
>>>>>> nano_02 slots=2 max-slots=2
>>>>>> nano_03 slots=2 max-slots=2
>>>>>> nano_04 slots=2 max-slots=2
>>>>>> nano_05 slots=2 max-slots=2
>>>>>> nano_06 slots=2 max-slots=2
>>>>>>
>>>>>> my rank file:
>>>>>> [jody_at_plankton neander]$ cat rankfile
>>>>>> rank 0=nano_00 slot=1
>>>>>> rank 1=plankton slot=0
>>>>>> rank 2=nano_01 slot=1
>>>>>>
>>>>>> my Open MPI version: 1.3.2
>>>>>>
>>>>>> I get the same error if I use a rankfile which has the single line
>>>>>> rank 0=plankton slot=0
>>>>>> (plankton is my local machine) and call mpirun with -np 1.
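>>>>>> i.e. (a sketch of that invocation; "rankfile1" is a made-up name for
>>>>>> the one-line rankfile):
>>>>>> [jody_at_plankton tests]$ mpirun -np 1 -rf rankfile1 -hostfile testhosts ./HelloMPI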
>>>>>>
>>>>>> What does the "Rankfile claimed..." message mean?
>>>>>> Did I make an error in my rankfile?
>>>>>> If yes, what would be the correct way to write it?
>>>>>>
>>>>>> Thank You
>>>>>> Jody