Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem when binding to socket on a single socket node
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-04-09 10:41:11


Just to check: is this with the latest trunk? Brad and Terry have been making changes to this section of code, including modifying the PROCESS_IS_BOUND test...

On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:

> Hi,
>
> I am facing a problem with a test that runs fine on some nodes, and
> fails on others.
>
> I have a heterogeneous cluster, with 3 types of nodes:
> 1) Single socket, 4 cores
> 2) 2 sockets, 4 cores per socket
> 3) 2 sockets, 6 cores per socket
>
> I am using:
> . salloc to allocate the nodes,
> . mpirun binding/mapping options "-bind-to-socket -bysocket"
>
> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>
> This command fails if the allocated node is of type #1 (single socket/4
> cpus).
> BTW, in that case orte_show_help is referencing a tag
> ("could-not-bind-to-socket") that does not exist in
> help-odls-default.txt.
>
> It succeeds when run on nodes of type #2 or #3.
> I think a "bind to socket" should not return an error on a single socket
> machine, but rather be a noop.
>
> The problem comes from the test
> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> called in odls_default_fork_local_proc() after the binding to the
> processor's socket has been done:
> ========
> <snip>
>     OPAL_PAFFINITY_CPU_ZERO(mask);
>     for (n=0; n < orte_default_num_cores_per_socket; n++) {
>         <snip>
>         OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>     }
>     /* if we did not bind it anywhere, then that is an error */
>     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>     if (!bound) {
>         orte_show_help("help-odls-default.txt",
>                        "odls-default:could-not-bind-to-socket", true);
>         ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>     }
> ========
> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in
> the mask *AND* the number of bits set is less than the number of cpus
> on the machine. Thus on a single socket, 4 core machine, where the socket
> mask covers all 4 cpus, the test will fail, while on the other kinds of
> machines it will succeed.
>
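
To make the condition described above concrete, here is a minimal sketch in C. It is not the OPAL macro itself: mask_counts_as_bound() is a hypothetical stand-in, and it assumes the mask fits in an unsigned long with one bit per cpu. It only mirrors the rule as stated in the message (some bits set, but fewer than the number of cpus).

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the check described in the message;
 * this is NOT the real OPAL_PAFFINITY_PROCESS_IS_BOUND macro.
 * Assumes one bit per cpu and that num_cpus fits in an unsigned long. */
static bool mask_counts_as_bound(unsigned long mask, int num_cpus)
{
    int max_bits = (int)(sizeof(mask) * CHAR_BIT);
    int set = 0;

    /* Count the cpus that are set in the mask. */
    for (int i = 0; i < num_cpus && i < max_bits; i++) {
        if (mask & (1UL << i)) {
            set++;
        }
    }

    /* "Bound" means: at least one cpu, but not every cpu on the machine. */
    return set > 0 && set < num_cpus;
}
```

With a socket mask of 0x0f, this returns false for a 4 cpu machine (one socket covers everything) and true for an 8 cpu machine (two sockets of 4 cores), which matches the behaviour reported for node types #1 and #2.
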
> Again, I think the problem could be solved by changing the algorithm,
> and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
> noop.
>
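
A rough sketch of the "single socket = noop" idea, under stated assumptions: the enum values, check_socket_binding(), and the num_sockets parameter are all hypothetical names made up for illustration, not ORTE code and not the attached patch. The mask is again assumed to fit in an unsigned long with one bit per cpu.

```c
#include <limits.h>

/* Hypothetical outcome codes for this sketch; not ORTE constants. */
enum bind_outcome { BIND_OK, BIND_NOOP, BIND_ERROR };

/* Count the cpus set in the mask (one bit per cpu, assumed). */
static int count_bits(unsigned long mask, int num_cpus)
{
    int max_bits = (int)(sizeof(mask) * CHAR_BIT);
    int set = 0;
    for (int i = 0; i < num_cpus && i < max_bits; i++) {
        if (mask & (1UL << i)) {
            set++;
        }
    }
    return set;
}

/* Classify the result of a bind-to-socket attempt. */
static enum bind_outcome check_socket_binding(unsigned long mask,
                                              int num_cpus,
                                              int num_sockets)
{
    int set = count_bits(mask, num_cpus);

    if (set == 0) {
        return BIND_ERROR;   /* nothing was bound at all */
    }
    if (num_sockets == 1) {
        return BIND_NOOP;    /* the socket is the whole machine: not an error */
    }
    if (set == num_cpus) {
        return BIND_ERROR;   /* multi-socket node, yet the mask covers every cpu */
    }
    return BIND_OK;
}
```

Under these assumptions a 4 cpu, single socket node with mask 0x0f yields BIND_NOOP instead of a fatal error, while an empty mask is still reported as an error.
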
> Another solution could be to call the test
> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
> bound (orte_odls_globals.bound). Actually that is the only case where I
> see a justification for this test (see attached patch).
>
> And maybe both solutions could be mixed.
>
> Regards,
> Nadia
>
>
> --
> Nadia Derbey <Nadia.Derbey_at_[hidden]>
> <001_fix_process_binding_test.patch>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel