Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-11-29 08:48:48


There are a few issues involved here:

- Brian was pointing out that AMD systems are NUMA (and Intel may well
go NUMA someday: barring something quite unexpected in computer
architecture, UMA designs simply do not scale to hundreds of cores).
So cores are *not* created equal, mainly in terms of locality to
resources. If MPI allocates resources local to core X and you end up
pinning yourself to core Y, what happens if X and Y are not local to
each other? You've just killed your performance because of the
latency hit to get to MPI- (or other) allocated resources.

- If you're going to use the Linux sched_setaffinity(), beware that
this function has changed signatures multiple times over the history
of Linux (there are at least 3 versions that I'm aware of).
Shameless plug: try the Portable Linux Processor Affinity (PLPA)
micro-library, which provides a simple, consistent interface to Linux
processor affinity regardless of your Linux kernel and glibc versions
(http://www.open-mpi.org/software/plpa/). The library has nothing to
do with MPI and can be used in any application that wants to use
processor affinity.

- There's also the issue that some clusters -- particularly those
set up for high-core-count hosts -- may well be set up to allow
multiple MPI jobs to land on the same host. In that case, how does
the MPI app know which core to bind itself to? If every MPI job
starts binding itself to core 0 and counting upwards, the case where
multiple MPI jobs land on the same host becomes a disaster.

- There's also the issue that the BIOS determines core/socket order
mapping to Linux virtual processor IDs. Linux virtual processor 0 is
always socket 0, core 0. But what is Linux virtual processor 1? Is
it socket 0, core 1, or socket 1, core 0? This stuff is quite
complicated to figure out, and can have large implications
(particularly in NUMA environments).
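
To make that last point concrete, here is a minimal sketch of checking
the mapping yourself instead of guessing. It is in Python rather than C
only to keep it short, and it is not how Open MPI or PLPA do it. It
assumes a kernel that exposes the physical_package_id and core_id files
under /sys/devices/system/cpu/cpuN/topology/; the function names
read_topology and one_cpu_per_socket are made up for this example.
os.sched_setaffinity is Python's (Linux-only) wrapper around the system
call mentioned above.

```python
import os
from pathlib import Path

# Assumption: the kernel exposes per-CPU topology files under this directory.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def read_topology(root=SYSFS_CPU_ROOT):
    """Return {linux_cpu_id: (socket_id, core_id)} as reported by sysfs."""
    topology = {}
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        topo = cpu_dir / "topology"
        try:
            cpu_id = int(cpu_dir.name[3:])  # "cpu12" -> 12
            socket_id = int((topo / "physical_package_id").read_text())
            core_id = int((topo / "core_id").read_text())
        except (OSError, ValueError):
            continue  # offline CPU, or no topology files on this kernel
        topology[cpu_id] = (socket_id, core_id)
    return topology


def one_cpu_per_socket(topology):
    """Pick the lowest-numbered Linux CPU ID found on each socket."""
    chosen = {}
    for cpu_id in sorted(topology):
        socket_id, _core_id = topology[cpu_id]
        chosen.setdefault(socket_id, cpu_id)
    return set(chosen.values())


if __name__ == "__main__":
    topo = read_topology()
    for cpu_id in sorted(topo):
        socket_id, core_id = topo[cpu_id]
        print(f"cpu {cpu_id} -> socket {socket_id}, core {core_id}")

    cpus = one_cpu_per_socket(topo)
    if cpus and hasattr(os, "sched_setaffinity"):
        # Only ask for CPUs we are already allowed to run on.
        usable = cpus & os.sched_getaffinity(0)
        if usable:
            os.sched_setaffinity(0, usable)  # pid 0 = the calling process
        print("allowed CPUs now:", sorted(os.sched_getaffinity(0)))
```

This only answers "which Linux ID is which core" on one host. It does
nothing about the other points above: resources that were already
allocated somewhere else, or a second job on the same host picking the
same cores.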

On Nov 29, 2006, at 1:08 AM, Durga Choudhury wrote:

> Brian
>
> But does it matter which core the process gets bound to? They are
> all identical, and as long as the task is parallelized in equal
> chunks (that's the key part), it should not matter. The last time I
> had to do this, the problem had to do with real-time processing of
> a very large radar image. My approach was to spawn *ONE* MPI
> process per blade and 12 threads (to utilize the 12 processors).
> Inside the task entry point of each pthread, I called
> sched_setaffinity(). Then I set the scheduling algorithm to real
> time with a very high task priority to avoid preemption. It turns
> out that the last two steps did not buy me much because ours was a
> lean, embedded architecture anyway, designed to run real-time
> applications, but I definitely got a speed up from the task
> distribution.
>
> It sure would be very nice for Open MPI to have this feature; no
> questions about that. All I am saying is: if a user wants it today,
> a reasonable workaround is available so he/she does not need to wait.
>
> This is my $0.01's worth, since I am probably a lot less experienced.
>
> Durga
>
>
> On 11/29/06, Brian W. Barrett <bbarrett_at_[hidden]> wrote:
>
> It would be difficult to do well without some MPI help, in my
> opinion. You certainly could use the Linux processor affinity API
> directly in the MPI application. But how would the process know
> which core to bind to? It could wait until after MPI_INIT and call
> MPI_COMM_RANK, but MPI implementations allocate many of their
> resources during MPI_INIT, so there's a high potential of the
> resources (i.e., memory) ending up associated with a different
> processor than the one the process gets pinned to. That isn't a big
> deal on Intel machines, but is a major issue for AMD processors.
>
> Just my $0.02, anyway.
>
> Brian
>
> On Nov 28, 2006, at 6:09 PM, Durga Choudhury wrote:
>
> > Jeff (and everybody else)
> >
> > First of all, pardon me if this is a stupid comment; I am learning
> > the nuts-and-bolts of parallel programming; but my comment is as
> > follows:
> >
> > Why can't this be done *outside* Open MPI, by calling Linux's
> > processor affinity APIs directly? I work with a blade server kind
> > of architecture, where each blade has 12 CPUs. I use pthreads within
> > each blade and MPI to talk across blades. I use the Linux system
> > calls to attach a thread to a specific CPU and it seems to work
> > fine. The only drawback is that it makes the code unportable to a
> > different OS. But even if you implemented paffinity within Open MPI,
> > the code would still be unportable to a different implementation of
> > MPI, which, as is, it is not.
> >
> > Hope this helps the original poster.
> >
> > Durga
> >
> >
> > On 11/28/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >
> > There is not, right now. However, this is mainly because back when I
> > implemented the processor affinity stuff in OMPI (well over a year
> > ago), no one had any opinions on exactly what interface to expose to
> > the user. :-)
> >
> > So right now there's only this lame control:
> >
> > http://www.open-mpi.org/faq/?category=tuning#using-paffinity
> >
> > I am not opposed to implementing more flexible processor affinity
> > controls, but the Big Discussion over the past few months is exactly
> > how to expose it to the end user. There have been several formats
> > proposed (e.g., mpirun command line parameters, magic MPI attributes,
> > MCA parameters, etc.), but nothing that has been "good" and "right".
> > So here's the time to chime in -- anyone have any opinions on this?
> >
> >
> >
> > On Nov 25, 2006, at 9:31 AM, shaposh_at_[hidden] wrote:
> >
> > > Hello,
> > > I can't figure out whether there is a way with Open MPI to bind all
> > > threads on a given node to a specified subset of CPUs.
> > > For example, on a multi-socket multi-core machine, I want to use
> > > only a single core on each CPU.
> > > Thank you.
> > >
> > > Best Regards,
> > > Alexander Shaposhnikov
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > Server Virtualization Business Unit
> > Cisco Systems
> >
> >
> >
> >
> > --
> > Devil wanted omnipresence;
> > He therefore created communists.
>
> --
> Brian Barrett
> Open MPI Team, CCS-1
> Los Alamos National Laboratory
>
>
>
>
>
> --
> Devil wanted omnipresence;
> He therefore created communists.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems