Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] rank file error: Rankfile claimed...
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-17 07:37:00


> Is there an explanation for this?

I believe the word is "bug". :-)

The rank_file mapper has been substantially revised lately; we are now
discussing how much of that revision to bring into 1.3.4 versus the
next major release.
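
In the meantime, the two workarounds reported below seem to hold: use
fully qualified hostnames in the rankfile, and leave at least one slot
in the hostfile that the rankfile does not claim. A sketch assembled
from jody's files (hostnames are from his setup; adjust them to match
what `hostname` returns on each node):

[jody_at_plankton neander]$ cat th_021   # hostfile with one spare slot
nano_00.uzh.ch slots=2 max-slots=2
nano_02.uzh.ch slots=2 max-slots=2
nano_01.uzh.ch slots=1 max-slots=1
[jody_at_plankton neander]$ cat rf_02    # rankfile with fully qualified names
rank 0=nano_00.uzh.ch slot=0
rank 2=nano_00.uzh.ch slot=1
rank 1=nano_02.uzh.ch slot=0
rank 3=nano_02.uzh.ch slot=1
[jody_at_plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 ./HelloMPI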

Ralph

On Aug 17, 2009, at 4:45 AM, jody wrote:

> Hi Lenny
>
>> I think it has something to do with your environment: /etc/hosts, IT
>> setup, the hostname function's return value, etc.
>> I am not sure it has anything to do with Open MPI at all.
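>> A quick sanity check (assuming standard tools): compare what `hostname`
>> returns on each node against the names used in the hostfile and the
>> rankfile, e.g.
>>
>> [jody_at_plankton ~]$ hostname
>> plankton.uzh.ch
>>
>> If it returns the fully qualified name, that appears to be the form
>> the rankfile needs.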
>
> OK. I just thought this was Open MPI-related because I was able to use
> the aliases of the hosts (e.g. plankton instead of plankton.uzh.ch) in
> the host file...
>
> However, I encountered a new problem:
> if the rankfile claims every slot that the host file provides,
> I get an error message.
> In the following example, the hostfile is
> [jody_at_plankton neander]$ cat th_02
> nano_00.uzh.ch slots=2 max-slots=2
> nano_02.uzh.ch slots=2 max-slots=2
>
> and the rankfile is:
> [jody_at_plankton neander]$ cat rf_02
> rank 0=nano_00.uzh.ch slot=0
> rank 2=nano_00.uzh.ch slot=1
> rank 1=nano_02.uzh.ch slot=0
> rank 3=nano_02.uzh.ch slot=1
>
> Here is the error:
> [jody_at_plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
> ./HelloMPI
>
> Either request fewer slots for your application, or make more slots
> available for use.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> If I use a hostfile with one more entry:
> [jody_at_aim-plankton neander]$ cat th_021
> aim-nano_00.uzh.ch slots=2 max-slots=2
> aim-nano_02.uzh.ch slots=2 max-slots=2
> aim-nano_01.uzh.ch slots=1 max-slots=1
>
> Then this works fine:
> [jody_at_aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 ./HelloMPI
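>
> One way I can double-check where the ranks land is mpirun's
> --display-map option (assuming this 1.3.x build supports it; it should
> print the process map just before launch):
>
> [jody_at_aim-plankton neander]$ mpirun -np 4 --display-map -hostfile th_021 -rf rf_02 ./HelloMPI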
>
> Is there an explanation for this?
>
> Thank You
> Jody
>
>> Lenny.
>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody.xha_at_[hidden]> wrote:
>>>
>>> Hi Lenny
>>>
>>> Thanks - using the full names makes it work!
>>> Is there a reason why the rankfile option treats host names
>>> differently from the hostfile option?
>>>
>>> Thanks
>>> Jody
>>>
>>>
>>>
>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]> wrote:
>>>> Hi
>>>> This message means that you are trying to use the host "plankton",
>>>> which was not allocated via the hostfile or host list.
>>>> But according to the files and the command line, everything seems fine.
>>>> Can you try using the hostname "plankton.uzh.ch" instead of "plankton"?
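>>>> For example, a minimal sketch of your rankfile with fully qualified
>>>> names (assuming all of your nodes resolve under .uzh.ch):
>>>> rank 0=nano_00.uzh.ch slot=1
>>>> rank 1=plankton.uzh.ch slot=0
>>>> rank 2=nano_01.uzh.ch slot=1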
>>>> Thanks,
>>>> Lenny.
>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody.xha_at_[hidden]> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> When I use a rankfile, I get an error message that I don't
>>>>> understand:
>>>>>
>>>>> [jody_at_plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts ./HelloMPI
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Rankfile claimed host plankton that was not allocated or
>>>>> oversubscribed it's slots:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>> file rmaps_rank_file.c at line 108
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>> file base/rmaps_base_map_job.c at line 87
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>> file base/plm_base_launch_support.c at line 77
>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>> file plm_rsh_module.c at line 990
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>> launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>> the location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>>
>>>>>
>>>>>
>>>>> Without the '-rf rankfile' option everything works as expected.
>>>>>
>>>>> My hostfile:
>>>>> [jody_at_plankton tests]$ cat testhosts
>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>> # want to disallow over-subscribing it:
>>>>> plankton slots=3 max-slots=3
>>>>> # The following nodes are dual-processor machines:
>>>>> nano_00 slots=2 max-slots=2
>>>>> nano_01 slots=2 max-slots=2
>>>>> nano_02 slots=2 max-slots=2
>>>>> nano_03 slots=2 max-slots=2
>>>>> nano_04 slots=2 max-slots=2
>>>>> nano_05 slots=2 max-slots=2
>>>>> nano_06 slots=2 max-slots=2
>>>>>
>>>>> My rank file:
>>>>> [jody_at_plankton neander]$ cat rankfile
>>>>> rank 0=nano_00 slot=1
>>>>> rank 1=plankton slot=0
>>>>> rank 2=nano_01 slot=1
>>>>>
>>>>> My Open MPI version: 1.3.2
>>>>>
>>>>> I get the same error if I use a rankfile which has the single line
>>>>> rank 0=plankton slot=0
>>>>> (plankton is my local machine) and call mpirun with -np 1.
>>>>>
>>>>> What does the "Rankfile claimed..." message mean?
>>>>> Did I make an error in my rankfile?
>>>>> If so, what would be the correct way to write it?
>>>>>
>>>>> Thank You
>>>>> Jody
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users