Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] why does --rankfile need hostlist?
From: Mike Dubman (mike.ompi_at_[hidden])
Date: 2009-06-23 08:46:08


Just an idea: maybe it is worth providing a brand-new command-line option to
mpirun. This option would accept a filename and support a combined syntax for
machinefile/hostfile (to define allocations) and rankfile (to define
placement).

YAML syntax could be used to describe the file primitives
(http://www.yaml.org/start.html).

for example:

$ mpirun -clusterfile /path/to/clusterfile
$ cat clusterfile
hostX:
       slots : int
       maxslots : int
       ranks : rankid[@socket:core]

example of clusterfile
===============

hostX:
       slots : 4
       maxslots : 4
       ranks : 1,16,22

hostY:
      slots : 8
      maxslots : 8
      ranks : 1@0:*, 3@2-3, 4@0:1, 5

By doing so, we keep backwards compatibility.
After reading the clusterfile, the code would process the *hostfile* and
*rankfile* parts as it does today.
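
A minimal sketch of how such a file could be split back into its hostfile and
rankfile parts, assuming the simple indented layout shown above rather than
full YAML. The function names (parse_clusterfile, hostfile_lines,
rank_placements) are hypothetical and not part of Open MPI, and -clusterfile
itself is only a proposal. How a rank with no @socket:core placement should be
written in a rankfile is left open here; it is reported with a slot of None.

```python
# Hypothetical sketch: split a proposed "clusterfile" into hostfile lines and
# rank placements. Not an existing Open MPI feature or API.

SAMPLE = """\
hostX:
       slots : 4
       maxslots : 4
       ranks : 0,16,22
hostY:
      slots : 8
      maxslots : 8
      ranks : 1@0:*, 3@2-3, 4@0:1, 5
"""


def parse_clusterfile(text):
    """Return {host: {"slots": int, "maxslots": int, "ranks": [(rank, slot)]}}."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        # Split at the first colon only, so "0:*" inside a rank spec survives.
        key, _, value = raw.partition(":")
        key, value = key.strip(), value.strip()
        if not raw[0].isspace():
            # An unindented "hostname:" line starts a new host entry.
            current = hosts.setdefault(
                key, {"slots": None, "maxslots": None, "ranks": []}
            )
        elif current is None:
            raise ValueError(f"field before any host: {raw!r}")
        elif key in ("slots", "maxslots"):
            current[key] = int(value)
        elif key == "ranks":
            for item in value.split(","):
                rank, _, slot = item.strip().partition("@")
                current["ranks"].append((int(rank), slot or None))
        else:
            raise ValueError(f"unknown field: {key!r}")
    return hosts


def hostfile_lines(hosts):
    """Format the allocation part in hostfile style (slots=, max_slots=)."""
    lines = []
    for name, info in hosts.items():
        line = name
        if info["slots"] is not None:
            line += f" slots={info['slots']}"
        if info["maxslots"] is not None:
            line += f" max_slots={info['maxslots']}"
        lines.append(line)
    return lines


def rank_placements(hosts):
    """Return (rank, host, slot_or_None) tuples ordered by rank."""
    placements = [
        (rank, name, slot)
        for name, info in hosts.items()
        for rank, slot in info["ranks"]
    ]
    return sorted(placements, key=lambda p: p[0])


if __name__ == "__main__":
    parsed = parse_clusterfile(SAMPLE)
    print("\n".join(hostfile_lines(parsed)))
    for rank, host, slot in rank_placements(parsed):
        if slot is not None:
            # Same "rank N=host slot=S" form used by rankfiles in this thread.
            print(f"rank {rank}={host} slot={slot}")
```

The sample data assigns each rank to exactly one host; a real implementation
would also need to reject a rank that is listed under two hosts.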

what do you think?
Mike

On Mon, Jun 22, 2009 at 1:30 PM, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:

> Let us think about this some more. We'll try and reply later today.
>
> --td
>
> Ralph Castain wrote:
>
>> Had a chance to think about how this might be done, and looked at it for
>> awhile after getting home. I -think- I found a way to do it, but there are a
>> couple of caveats:
>>
>> 1. Len's point about oversubscribing without warning would definitely hold
>> true - this would positively be a "user beware" option
>>
>> 2. there could be no RM-provided allocation, hostfile, or -host options
>> specified. Basically, I would be adding the "read rankfile" option to the
>> end of the current allocation determination procedure
>>
>> I would still allow more procs than shown in the rankfile (mapping the
>> rest bynode on the nodes specified in the rankfile - can't do byslot because
>> I don't know how many slots are on each node), which means the only change
>> in behavior would be the forced bynode mapping of unspecified procs.
>>
>> So use of this option will entail some risks and a slight difference in
>> behavior, but would relieve you from the burden of having to provide a
>> hostfile. I'm not personally convinced it is worth the risk and probable
>> user complaints of "it didn't work", but since we don't use this option, I
>> don't have a strong opinion on the matter.
>>
>> Let's just avoid going back-and-forth over wanting it, or how it should be
>> implemented - let's get it all ironed out, and then implement it once, like
>> we finally did at the end with the whole hostfile thing.
>>
>> Let me know if you want me to do this - it obviously isn't at the top of
>> my priority list, but still could be done in the next few weeks.
>>
>> Ralph
>>
>>
>> On Jun 21, 2009, at 9:00 AM, Lenny Verkhovsky wrote:
>>
>>> Sorry for the delay in response. I totally agree with Ralph that it's not
>>> as easy as it seems:
>>> 1. The rankfile mapper uses already-allocated machines (by scheduler or
>>> hostfile). By using the rankfile as a hostfile, we can run into a
>>> problem where we try to use unallocated nodes, which can hang the run.
>>> 2. We can't define the number of slots on each machine in the rankfile,
>>> which means oversubscribing can take place without any warning.
>>> 3. I personally don't see any problem using a hostfile, even if it has
>>> redundant info. The hostfile and rankfile belong to different layers in
>>> the system and solve different problems. The original hostfile (if I
>>> recall correctly) could bind a rank to a node, but the syntax wasn't very
>>> flexible or clear.
>>> Lenny.
>>>
>>> On Sun, Jun 21, 2009 at 5:15 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>> Let me suggest a two-step process, then:
>>>
>>> 1. let's change the error message as this is easily done and thus
>>> can be done now
>>>
>>> 2. I can look at how to eat the rankfile as a hostfile. This may
>>> not even be possible - the problem is that the entire system is
>>> predicated on certain ordering due to our framework architecture.
>>> So we get an allocation, and then do a mapping against that
>>> allocation, filtering the allocation through hostfiles, -host,
>>> and other options.
>>>
>>> By the time we reach the rankfile mapper, we have already
>>> determined that we don't have an allocation and have to abort. It
>>> is the rankfile mapper itself that looks for the -rankfile
>>> option, so the system can have no knowledge that someone has
>>> specified that option before that point - and thus, even if I
>>> could parse the rankfile, I don't know it was given!
>>>
>>> What will take time is to figure out a way to either:
>>>
>>> (a) allow us to run the mapper even though we don't have any
>>> nodes we know about, and allow the mapper to insert the nodes
>>> itself - without causing non-rankfile uses to break (which could
>>> be a major feat); or
>>>
>>> (b) have the overall system check for the rankfile option and
>>> pass it as a hostfile as well, assuming that a hostfile wasn't
>>> also given, no RM-based allocation exists, etc. - which breaks
>>> our abstraction rules and also opens a possible can of worms.
>>>
>>> Either way, I also then have to teach the hostfile parser how to
>>> realize it is a rankfile format and convert the info in it into
>>> what we expected to receive from a hostfile - another non-trivial
>>> problem.
>>>
>>> I'm willing to give it a try - just trying to make clear why my
>>> response was negative. It isn't as simple as it sounds...which is
>>> why Len and I didn't pursue it when this was originally developed.
>>>
>>> Ralph
>>>
>>>
>>> On Sun, Jun 21, 2009 at 5:28 AM, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:
>>>
>>> Having been a part of these discussions, I can understand your
>>> reticence to reopen this discussion. However, I think this
>>> is a major usability issue with a feature that is fairly
>>> important for getting things to run performantly, which IMO
>>> matters.
>>>
>>> That being said, I think one of two things could be done to
>>> mitigate the issue:
>>>
>>> 1. Eliminate the element of surprise by changing mpirun
>>> to eat the rankfile without the hostfile.
>>> 2. Change the error message to something understandable
>>> by users, so that they know they might be missing the
>>> hostfile option.
>>>
>>> Again, I understand this topic is frustrating and there are
>>> some boundaries in the design that make these two options
>>> orthogonal to each other, but I really believe we need to make
>>> the rankfile option something that is easily usable by our users.
>>>
>>>
>>> --td
>>>
>>> Ralph Castain wrote:
>>>
>>> Having gone around in circles on hostfile-related issues
>>> for over five years now, I honestly have little
>>> motivation to re-open the entire discussion again. It
>>> doesn't seem to be that daunting a requirement for those
>>> who are using it, so I'm inclined to just leave well
>>> enough alone.
>>>
>>> :-)
>>>
>>>
>>> On Fri, Jun 19, 2009 at 2:21 PM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
>>>
>>> Ralph Castain wrote:
>>>
>>> The two files have a slightly different format
>>>
>>> Agreed.
>>>
>>> and completely different meaning.
>>>
>>> Somewhat agreed. They're both related to mapping processes onto a
>>> cluster.
>>>
>>> The hostfile specifies how many slots are on a node. The rankfile
>>> specifies a rank and what node/slot it is to be mapped onto.
>>>
>>> Agreed.
>>>
>>> Rankfiles can use relative node indexing and refer to nodes
>>> received from a resource manager - i.e., without any hostfile.
>>>
>>> This is the main part I'm concerned about. E.g.,
>>>
>>> % cat rankfile
>>> rank 0=node0 slot=0
>>> rank 1=node1 slot=0
>>> % mpirun -np 2 -rf rankfile ./a.out
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host node1 that was not allocated or
>>> oversubscribed it's slots:
>>>
>>> --------------------------------------------------------------------------
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 107
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
>>> [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
>>> % mpirun -np 2 -host node0,node1 -rf rankfile ./a.out
>>> 0 on node0
>>> 1 on node1
>>> done
>>>
>>> It seems to me that the rankfile has sufficient information to
>>> express what I want it to do. But mpirun won't accept this. To
>>> fix this, I have to, e.g., supply/maintain/specify redundant
>>> information in a hostfile or host list.
>>>
>>> So the files are intentionally quite different. Trying to combine
>>> them would be rather ugly.
>>>
>>> Right. And my issue is that I'm forced to use both when I only
>>> want rankfile functionality.
>>>
>>> On Thu, Jun 18, 2009 at 1:52 PM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
>>>
>>> In order to use "mpirun --rankfile", I also need to specify
>>> hosts/hostlist. But that information is redundant with what
>>> I provide in the rankfile. So, from a user's point of view,
>>> this strikes me as broken. Yes? Should I file a ticket, or
>>> am I missing something here about this functionality?
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden] <mailto:devel_at_[hidden]>
>>> <mailto:devel_at_[hidden] <mailto:devel_at_[hidden]>>
>>>
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel