Nadia Derbey wrote:
On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
Just to check: is this with the latest trunk? Brad and Terry have been making changes to this section of code, including modifying the PROCESS_IS_BOUND test...


Well, it was on the v1.5. But I just checked: looks like
  1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
  2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way

But, I'll give it a try with the latest trunk.


The changes I've made do not touch OPAL_PAFFINITY_PROCESS_IS_BOUND at all.  Also, I am only touching code related to the "bind-to-core" option, so I really doubt that my changes are causing issues here.


On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:


I am facing a problem with a test that runs fine on some nodes, and
fails on others.

I have a heterogeneous cluster, with 3 types of nodes:
1) Single socket, 4 cores
2) 2 sockets, 4 cores per socket
3) 2 sockets, 6 cores per socket

I am using:
. salloc to allocate the nodes,
. mpirun binding/mapping options "-bind-to-socket -bysocket"

# salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900

This command fails if the allocated node is of type #1 (single socket/4
BTW, in that case orte_show_help is referencing a tag
("could-not-bind-to-socket") that does not exist in

It succeeds, however, when run on nodes of type #2 or #3.
I think a "bind to socket" should not return an error on a single socket
machine, but rather be a noop.

The problem comes from the test
called in odls_default_fork_local_proc() after the binding to the
processors socket has been done:
   for (n=0; n < orte_default_num_cores_per_socket; n++) {
       OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
   /* if we did not bind it anywhere, then that is an error */
   if (!bound) {
                      "odls-default:could-not-bind-to-socket", true);
OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in
the mask *AND* the number of bits set is less than the number of cpus
on the machine. Thus on a single-socket, 4-core machine the mask covers
every cpu, the macro returns false, and the test fails, while on the
other kinds of machines it succeeds.
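
To make the failure mode concrete, here is a minimal standalone sketch of the rule described above. It is not the Open MPI macro: cpu_mask_t, count_bits(), process_is_bound() and socket_mask() are hypothetical names, the mask is a plain 64-bit integer, and cores are assumed to be numbered contiguously per socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an affinity mask: one bit per logical cpu. */
typedef uint64_t cpu_mask_t;

/* Count the bits set in the mask (one cleared per iteration). */
static int count_bits(cpu_mask_t mask)
{
    int n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Illustrates the rule from the text: "bound" means at least one bit
 * is set AND fewer bits are set than there are cpus on the machine. */
static bool process_is_bound(cpu_mask_t mask, int num_cpus)
{
    assert(num_cpus > 0 && num_cpus <= 64);
    int set = count_bits(mask);
    return set > 0 && set < num_cpus;
}

/* Build a mask covering every core of one socket, assuming cores are
 * numbered contiguously socket by socket (an assumption of this sketch). */
static cpu_mask_t socket_mask(int socket, int cores_per_socket)
{
    cpu_mask_t mask = 0;
    for (int n = 0; n < cores_per_socket; n++) {
        mask |= (cpu_mask_t)1 << (socket * cores_per_socket + n);
    }
    return mask;
}
```

With these helpers, process_is_bound(socket_mask(0, 4), 4) is false on the single-socket, 4-core node, because all 4 cpus are in the mask, whereas process_is_bound(socket_mask(0, 4), 8) is true on the 2-socket, 4-cores-per-socket node.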

Again, I think the problem could be solved by changing the algorithm,
and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =

Another solution could be to call the test
OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
bound (orte_odls_globals.bound). Actually that is the only case where I
see a justification for this test (see attached patch).

And maybe both solutions could be combined.
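
As a rough illustration of how the two ideas could fit together, here is a small decision helper. It is only a sketch and not the attached patch: bind_result_t, check_socket_binding() and its parameters are hypothetical, globals_bound merely stands in for orte_odls_globals.bound, and is_bound_after stands in for the result of the OPAL_PAFFINITY_PROCESS_IS_BOUND() test after binding.

```c
#include <stdbool.h>

/* Hypothetical outcome of the post-binding check. */
typedef enum {
    BIND_OK,     /* binding accepted */
    BIND_NOOP,   /* nothing to do: the node has a single socket */
    BIND_ERROR   /* would trigger the could-not-bind-to-socket help message */
} bind_result_t;

/* num_sockets    : number of sockets on the node
 * globals_bound  : stand-in for orte_odls_globals.bound
 * is_bound_after : stand-in for the PROCESS_IS_BOUND result after binding */
static bind_result_t check_socket_binding(int num_sockets,
                                          bool globals_bound,
                                          bool is_bound_after)
{
    /* First idea: bind-to-socket on a single-socket machine is a no-op,
     * not an error. */
    if (num_sockets == 1) {
        return BIND_NOOP;
    }
    /* Second idea: only run the "is bound" test when we are bound. */
    if (!globals_bound) {
        return BIND_OK;
    }
    return is_bound_after ? BIND_OK : BIND_ERROR;
}
```

Under these assumptions the type #1 node always takes the no-op path, and the error is raised only on a multi-socket node where the flag is set and the test fails.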


