Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] rank file error: Rankfile claimed...
From: jody (jody.xha_at_[hidden])
Date: 2009-08-17 10:05:16


Hi Lenny
After removing the max-slots entries from the hostfile,
I could run
  mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
without any errors.
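
For what it's worth, the rankfile never asks for more ranks per host than the hostfile offers slots, which is why the earlier failure looked wrong. Below is a minimal Python sketch of that accounting, assuming only the simple line formats quoted further down in this thread. The function names (parse_hostfile, parse_rankfile, check) are made up for illustration, the default of one slot for a host without a slots= entry is an assumption, and this is not how Open MPI's rank_file mapper itself works.

```python
import re
from collections import Counter


def parse_hostfile(text):
    """Return {host: slots} from lines like 'nano_00.uzh.ch slots=2 max-slots=2'."""
    slots = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        count = 1  # assumption: a host without a slots= entry counts as one slot
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                count = int(value)
        slots[host] = count
    return slots


def parse_rankfile(text):
    """Return {rank: host} from lines like 'rank  0=nano_00.uzh.ch  slot=0'."""
    pattern = re.compile(r"rank\s+(\d+)\s*=\s*(\S+)\s+slot=\S+")
    ranks = {}
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            ranks[int(match.group(1))] = match.group(2)
    return ranks


def check(hostfile_text, rankfile_text):
    """List hosts that the rankfile uses but the hostfile cannot satisfy."""
    slots = parse_hostfile(hostfile_text)
    per_host = Counter(parse_rankfile(rankfile_text).values())
    problems = []
    for host, used in sorted(per_host.items()):
        if host not in slots:
            problems.append(f"{host}: not listed in hostfile")
        elif used > slots[host]:
            problems.append(f"{host}: {used} ranks but only {slots[host]} slots")
    return problems


# Contents of th_02 and rf_02 as quoted later in this thread.
TH_02 = """\
nano_00.uzh.ch  slots=2 max-slots=2
nano_02.uzh.ch  slots=2 max-slots=2
"""
RF_02 = """\
rank  0=nano_00.uzh.ch  slot=0
rank  2=nano_00.uzh.ch  slot=1
rank  1=nano_02.uzh.ch  slot=0
rank  3=nano_02.uzh.ch  slot=1
"""

if __name__ == "__main__":
    print(check(TH_02, RF_02) or "rankfile fits within the hostfile slots")
```

For th_02 and rf_02 the check finds no problems, since each host gets two ranks and offers two slots.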

But can you explain what the max-slots entry means?
I checked the FAQ entries
  http://www.open-mpi.org/faq/?category=running#simple-spmd-run
  http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
but I couldn't find an explanation. (Furthermore, the FAQ says "max-slots"
in one place but "max_slots" in the other.)

Thank You
  Jody

On Mon, Aug 17, 2009 at 3:29 PM, Lenny
Verkhovsky<lenny.verkhovsky_at_[hidden]> wrote:
> Can you try not specifying "max-slots" in the hostfile?
> If you are the only user of the nodes, there will be no oversubscribing of
> the processors.
> This one definitely looks like a bug,
> but as Ralph said, there is ongoing discussion and work on this
> component.
> Lenny.
> On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>> Is there an explanation for this?
>>
>> I believe the word is "bug". :-)
>>
>> The rank_file mapper has been substantially revised lately - we are
>> discussing now how much of that revision to bring to 1.3.4 versus the next
>> major release.
>>
>> Ralph
>>
>> On Aug 17, 2009, at 4:45 AM, jody wrote:
>>
>>> Hi Lenny
>>>
>>>> I think it has something to do with your environment: /etc/hosts, IT
>>>> setup, the return value of the hostname function, etc.
>>>> I am not sure it has anything to do with Open MPI at all.
>>>
>>> OK. I just thought this was Open MPI related because I was able to use
>>> the aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>>> the host file...
>>>
>>> However, I encountered a new problem:
>>> if the rankfile uses every host listed in the hostfile,
>>> mpirun fails with an error.
>>> In the following example, the hostfile is
>>> [jody_at_plankton neander]$ cat th_02
>>> nano_00.uzh.ch  slots=2 max-slots=2
>>> nano_02.uzh.ch  slots=2 max-slots=2
>>>
>>> and the rankfile is:
>>> [jody_at_plankton neander]$ cat rf_02
>>> rank  0=nano_00.uzh.ch  slot=0
>>> rank  2=nano_00.uzh.ch  slot=1
>>> rank  1=nano_02.uzh.ch  slot=0
>>> rank  3=nano_02.uzh.ch  slot=1
>>>
>>> Here is the error:
>>> [jody_at_plankton neander]$ mpirun -np 4 -hostfile th_02  -rf rf_02
>>> ./HelloMPI
>>>
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 4 slots
>>> that were requested by the application:
>>>   ./HelloMPI
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>>
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> If I use a hostfile with one more entry:
>>> [jody_at_aim-plankton neander]$ cat th_021
>>> aim-nano_00.uzh.ch  slots=2 max-slots=2
>>> aim-nano_02.uzh.ch  slots=2 max-slots=2
>>> aim-nano_01.uzh.ch  slots=1 max-slots=1
>>>
>>> Then this works fine:
>>> [jody_at_aim-plankton neander]$ mpirun -np 4 -hostfile th_021  -rf rf_02
>>> ./HelloMPI
>>>
>>> Is there an explanation for this?
>>>
>>> Thank You
>>>  Jody
>>>
>>>> Lenny.
>>>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody.xha_at_[hidden]> wrote:
>>>>>
>>>>> Hi Lenny
>>>>>
>>>>> Thanks - using the full names makes it work!
>>>>> Is there a reason why the rankfile option treats
>>>>> host names differently than the hostfile option?
>>>>>
>>>>> Thanks
>>>>>  Jody
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>>>>> Verkhovsky<lenny.verkhovsky_at_[hidden]> wrote:
>>>>>>
>>>>>> Hi
>>>>>> This message means that you are trying to use the host "plankton", which
>>>>>> was not allocated via the hostfile or hostlist.
>>>>>> But according to the files and command line, everything seems fine.
>>>>>> Can you try using the hostname "plankton.uzh.ch" instead of "plankton"?
>>>>>> Thanks
>>>>>> Lenny.
>>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody.xha_at_[hidden]> wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> When I use a rankfile, I get an error message which I don't
>>>>>>> understand:
>>>>>>>
>>>>>>> [jody_at_plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
>>>>>>> ./HelloMPI
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Rankfile claimed host plankton that was not allocated or
>>>>>>> oversubscribed it's slots:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 108
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 990
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>>>>>> launch so we are aborting.
>>>>>>>
>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>
>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>> that caused that situation.
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun: clean termination accomplished
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Without the '-rf rankfile' option everything works as expected.
>>>>>>>
>>>>>>> My hostfile:
>>>>>>> [jody_at_plankton tests]$ cat testhosts
>>>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>>>> # want to disallow over-subscribing it:
>>>>>>> plankton slots=3  max-slots=3
>>>>>>> # The following nodes are dual-processor machines:
>>>>>>> nano_00  slots=2 max-slots=2
>>>>>>> nano_01  slots=2 max-slots=2
>>>>>>> nano_02  slots=2 max-slots=2
>>>>>>> nano_03  slots=2 max-slots=2
>>>>>>> nano_04  slots=2 max-slots=2
>>>>>>> nano_05  slots=2 max-slots=2
>>>>>>> nano_06  slots=2 max-slots=2
>>>>>>>
>>>>>>> My rankfile:
>>>>>>> [jody_at_plankton neander]$ cat rankfile
>>>>>>> rank  0=nano_00  slot=1
>>>>>>> rank  1=plankton slot=0
>>>>>>> rank  2=nano_01  slot=1
>>>>>>>
>>>>>>> My Open MPI version: 1.3.2
>>>>>>>
>>>>>>> I get the same error if I use a rankfile which has the single line
>>>>>>>  rank  0=plankton  slot=0
>>>>>>> (plankton is my local machine) and call mpirun with -np 1.
>>>>>>>
>>>>>>> What does the "Rankfile claimed..." message mean?
>>>>>>> Did I make an error in my rankfile?
>>>>>>> If so, what would be the correct way to write it?
>>>>>>>
>>>>>>> Thank You
>>>>>>>  Jody
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users