Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to replace --cpus-per-proc by --map-by
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-15 16:10:43


I'm not sure of the issue, but so far as I'm aware the cpus-per-proc functionality continued to work through all those releases and into today. Yes, the syntax changed during the 1.7 series to reflect a broader desire to consolidate options into something that could be contained in a minimum number of MCA parameters - but the original option was only deprecated and will still work (though we will emit a deprecation warning). Regardless, the 1.8.1 release should certainly understand the "pe=3" modifier and do the right thing.

The "processing element (pe)" terminology is one the general community is migrating towards as the use of hyperthreads grows. The old "slot" terminology simply wasn't accurate enough any more as a processing "slot" could contain multiple hardware threads (or even cores), especially if someone is allocating "containers". So we adopted the "pe" term as indicating the number of processors to be assigned to the process, with "processor" equating to either "core" or "hwthread" depending on whether or not you set the "use-hwthreads-as-cpus" flag.
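To make the "pe" arithmetic concrete for the nodefile quoted below, here is a rough sketch in Python. It only counts how often each host appears and divides by the requested pe value; it is not Open MPI's mapper, and the function name `ranks_per_node` is made up for illustration. It assumes each nodefile line stands for one processor on that host.

```python
from collections import Counter


def ranks_per_node(nodefile_lines, pe):
    """Count entries per host and how many pe-sized ranks each host could hold.

    Illustrative arithmetic only; this is not how Open MPI maps processes.
    """
    # One nodefile line is assumed to represent one processor on that host.
    slots = Counter(line.strip() for line in nodefile_lines if line.strip())
    return {host: count // pe for host, count in slots.items()}


# The nodefile from the quoted message: three entries for each of two hosts.
nodefile = ["clu357", "clu357", "clu357", "clu354", "clu354", "clu354"]
layout = ranks_per_node(nodefile, pe=3)
print(layout)  # {'clu357': 1, 'clu354': 1}
```

Under that assumption, pe=3 leaves room for one rank on each host, which matches the 2-rank, 3-thread-per-rank job being asked about.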

The comments regarding the meaning of the term "rank" certainly aren't intended to be "snide" - they only reflect the fact that the "rank" of a process is only defined in terms of a given communicator. Thus, one process can have multiple "ranks" depending on (a) how many communicators have been created, and (b) what position it occupies within each of those. In general, we had been using the term only in relation to the initial comm_world communicator, but we unfortunately then started using the term in discussions over comm_spawn and other communicator creation functions - and generating confusion as to the process we were discussing.
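A toy model of that point, without any MPI calls: treat a communicator as an ordered list of processes and a rank as the position within it. The process labels and the helper `rank_in` are hypothetical; real code would ask the MPI library for the rank in each communicator.

```python
def rank_in(communicator, process):
    """Return the position of `process` in an ordered communicator group."""
    return communicator.index(process)


# Hypothetical process labels; each communicator is modelled as an ordered list.
world = ["p0", "p1", "p2", "p3"]  # stands in for the initial comm_world
sub = ["p3", "p1"]                # stands in for a communicator created later

ranks_of_p3 = {"world": rank_in(world, "p3"), "sub": rank_in(sub, "p3")}
print(ranks_of_p3)  # {'world': 3, 'sub': 0}
```

The same process is rank 3 in one communicator and rank 0 in the other, which is why "rank" alone does not identify a process once more than one communicator is in play.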

We don't support cgroup directly, so if you are using cgroups, it is possible that we aren't picking up resource limits that cgroup might be setting. We *should* be seeing the core limits on the backend nodes, but I can't swear to it as we haven't (to my knowledge) tested against cgroups.
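One way to check what a process actually sees on a backend node is to ask the kernel for its allowed CPU set. The sketch below uses Python's `os.sched_getaffinity`, which is available on Linux. It assumes the cgroup limits cores through a cpuset (something this thread does not confirm), and it says nothing about what Open MPI or hwloc themselves detect.

```python
import os


def allowed_cpus():
    """Return the set of logical CPU ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: reflects the affinity mask, including cpuset restrictions.
        return set(os.sched_getaffinity(0))
    # Other platforms: no affinity query here, so assume every CPU is usable.
    return set(range(os.cpu_count() or 1))


cpus = allowed_cpus()
print(f"{len(cpus)} of {os.cpu_count()} logical CPUs usable: {sorted(cpus)}")
```

Running this inside the job on each node, and comparing the usable count with the total, would show whether the job is confined to a subset of the cores.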

On May 15, 2014, at 11:16 AM, Mark Hahn <hahn_at_[hidden]> wrote:

>> We're open to suggestion, really - just need some help identifying the best
>> way to get this info out there.
>
> well, OpenMPI information is fragmented and sprayed all over.
> In some places, there is mention of a wiki to be updated with an explanation; for other things, a consumer needs to wander around loosely-related blogs, mail archives, FAQs, usage statements, etc.
>
> For instance, I've been trying to figure out how to do a simple thing,
> launch a hybrid job. Assume I have a scheduled, heterogeneous cluster
> where mpirun simply receives a normal nodefile like this:
>
> clu357
> clu357
> clu357
> clu354
> clu354
> clu354
>
> and I want to launch a 2-rank, 3-thread-per-rank job. forget about frills like hwloc or binding.
>
> back when --cpus-per-proc was around, this was obvious and worked flawlessly. I honestly can't figure out how it works now, though - for any definition of "now" since:
>
> http://www.open-mpi.org/community/lists/devel/2011/12/10060.php
>
> 2011! then there's a dribble more info in 2014 (!) that hints that "--map-by node:pe=3" might do the trick here:
>
> http://comments.gmane.org/gmane.comp.clustering.open-mpi.user/21193
>
> where did "pe" come from? is it the same as slot, hwthread, core?
> why does the documentation make snide comments about how the conventional
> understanding of "rank" (~ equivalent to process) might not be true?
>
> most of all, when was the break introduced? at this point, I tell people
> that 1.4.3 worked, and that everything after that is broken.
>
> recent releases (I tried 1.7.3, 1.7.5 and 1.8.1) choke on this. I wonder whether it's having trouble with the fact that a job gets an arbitrary set of cores via cgroup, and perhaps hwloc doesn't understand that it can only work within this set...
>
>
>>>> So please see the URL below (especially the first half,
>>>> pages 1 to 20):
>>>> http://www.slideshare.net/jsquyres/open-mpi-explorations-in-process-affinity-eurompi13-presentation
>>>>
>>>> Although these slides by Jeff are the explanation for LAMA,
>>>> which is another mapping system installed in the openmpi-1.7
>>>> series, I guess you can easily understand what is mapping and
>>>> binding in general terms.
>
> AFAICT, the lama slide deck seemed to be only concerned with affinity settings, which are irrelevant here.
>
> confused,
> Mark Hahn.