Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-02-14 09:53:50


On Feb 14, 2011, at 9:35 AM, Siew Yin Chan wrote:

> 1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep. Open MPI 1.5.1 does provide the --bycore and --bind-to-core options, but they seem to bind processes to cores on my machine according to the *physical* indexes:

FWIW, you might want to try one of the OMPI 1.5.2 nightly tarballs -- we switched the process affinity stuff to hwloc in 1.5.2 (the 1.5.1 stuff uses a different mechanism).
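
To answer the verification part of your subject line: after launching, you can check what each process actually got bound to with the hwloc tools. This is only a rough sketch (the PID below is made up, and option names can vary between hwloc versions -- check the hwloc-ps and hwloc-bind man pages for your install):

$ mpirun --bycore --bind-to-core -np 4 ./test1 &
$ hwloc-ps                       # one line per bound process, with its binding
$ hwloc-bind --get --pid 12345   # binding of one specific PID (12345 is illustrative)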

> FYI, my testing environment and application impose these requirements for optimum performance:
>
> i. Different binaries optimized for heterogeneous machines. This necessitates MIMD, and can be done in OMPI using the -app option (providing an application context file).
> ii. The application is communication-sensitive. Thus, fine-grained process mapping on *machines* and on *cores* is required to minimize inter-machine and inter-socket communication costs occurring on the network and on the system bus. Specifically, processes should be mapped onto successive cores of one socket before the next socket is considered, i.e., socket.0:core0-3, then socket.1:core0-3. In this case, the communication among the neighboring ranks 0-3 will be confined to socket 0 without going through the system bus. The same applies to ranks 4-7 on socket 1. As such, the order of the cores should follow the *logical* indexes.

I think that OMPI 1.5.2 should do this for you -- rather than following any logical/physical ordering, it does what you describe: it traverses successive cores on a socket before going to the next socket (which happens to correspond to hwloc's logical ordering, but that was not the intent).

FWIW, we also have a huge revamp of OMPI's affinity support on the mpirun command line in the works that will offer much more flexible binding choices.
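
In the meantime, if you need that exact placement out of 1.5.1 today, one workaround is to do the binding yourself with hwloc-bind inside a small wrapper script, keyed off OMPI's per-node local rank. This is just a sketch under a few assumptions (4 cores per socket, the OMPI_COMM_WORLD_LOCAL_RANK environment variable being set by mpirun, and hwloc-bind's default logical indexes), not something I've tested:

$ cat bind.sh
#!/bin/sh
# local ranks 0-3 -> socket 0, cores 0-3; local ranks 4-7 -> socket 1, cores 0-3
lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec hwloc-bind socket:$((lrank / 4)).core:$((lrank % 4)) -- "$@"
$ mpirun -np 8 ./bind.sh ./test1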

> Initially, I tried combining the features of rankfile and appfile, e.g.,
>
> $ cat rankfile8np4
> rank 0=compute-0-8 slot=0:0
> rank 1=compute-0-8 slot=0:1
> rank 2=compute-0-8 slot=0:2
> rank 3=compute-0-8 slot=0:3
> $ cat rankfile9np4
> rank 0=compute-0-9 slot=0:0
> rank 1=compute-0-9 slot=0:1
> rank 2=compute-0-9 slot=0:2
> rank 3=compute-0-9 slot=0:3
> $ cat my_appfile_rankfile
> --host compute-0-8 -rf rankfile8np4 -np 4 ./test1
> --host compute-0-9 -rf rankfile9np4 -np 4 ./test2
> $ mpirun -app my_appfile_rankfile
>
> but found out that only the rankfile stated on the first line took effect; the second was ignored completely. After some googling and trial and error, I decided to try an external binder, and that direction led me to hwloc-bind.
>
> Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.

Yes.

I'd have to look at it more closely, but it's possible that we only allow one rankfile per job -- i.e., the rankfile is expected to specify all the procs in the job, rather than being given on a per-host basis. But perhaps we don't warn/error if multiple rankfiles are used; I would consider that a bug.
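
If that's the case, the workaround would be a single rankfile covering all the procs on both hosts, given once, with the appfile only splitting the binaries. Untested sketch, reusing your host names (and I'm not certain -rf combines cleanly with -app, so this may need some fiddling):

$ cat rankfile_all
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
rank 4=compute-0-9 slot=0:0
rank 5=compute-0-9 slot=0:1
rank 6=compute-0-9 slot=0:2
rank 7=compute-0-9 slot=0:3
$ cat my_appfile
--host compute-0-8 -np 4 ./test1
--host compute-0-9 -np 4 ./test2
$ mpirun -rf rankfile_all -app my_appfile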

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/