Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] problem when binding to socket on a single socket node
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-04-09 10:41:11


Just to check: is this with the latest trunk? Brad and Terry have been making changes to this section of code, including modifying the PROCESS_IS_BOUND test...

On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:

> Hi,
>
> I am facing a problem with a test that runs fine on some nodes, and
> fails on others.
>
> I have a heterogeneous cluster, with 3 types of nodes:
> 1) 1 socket, 4 cores
> 2) 2 sockets, 4 cores per socket
> 3) 2 sockets, 6 cores per socket
>
> I am using:
> . salloc to allocate the nodes,
> . the mpirun binding/mapping options "-bind-to-socket -bysocket"
>
> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>
> This command fails if the allocated node is of type #1 (single
> socket, 4 cpus).
> BTW, in that case orte_show_help references a tag
> ("could-not-bind-to-socket") that does not exist in
> help-odls-default.txt.
>
> It succeeds when run on nodes of type #2 or #3.
> I think a "bind to socket" should not return an error on a single-socket
> machine, but rather be a no-op.
>
> The problem comes from the test
> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> called in odls_default_fork_local_proc() after the process has been
> bound to its socket:
> ========
> <snip>
>     OPAL_PAFFINITY_CPU_ZERO(mask);
>     for (n=0; n < orte_default_num_cores_per_socket; n++) {
>         <snip>
>         OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>     }
>     /* if we did not bind it anywhere, then that is an error */
>     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>     if (!bound) {
>         orte_show_help("help-odls-default.txt",
>                        "odls-default:could-not-bind-to-socket", true);
>         ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>     }
> ========
> OPAL_PAFFINITY_PROCESS_IS_BOUND() returns true if there are bits set
> in the mask *AND* the number of bits set is less than the number of
> cpus on the machine. Thus on a single-socket, 4-core machine the test
> will fail, while on the other kinds of machines it will succeed.
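>
> For illustration, here is a self-contained toy model of that logic
> (hypothetical names, not the actual OPAL macro) that reproduces the
> failure:
> ========
> #include <stdbool.h>
> #include <stdio.h>
>
> /* Simplified model of the test, not the real OPAL macro: "bound"
>  * means at least one cpu is set in the mask AND fewer cpus are set
>  * than the machine has. */
> static bool is_bound(unsigned long mask, int num_cpus)
> {
>     int i, nbits = 0;
>
>     for (i = 0; i < num_cpus; i++) {
>         if (mask & (1UL << i)) {
>             nbits++;
>         }
>     }
>     return (0 < nbits) && (nbits < num_cpus);
> }
>
> int main(void)
> {
>     /* type #1: 1 socket, 4 cores -- binding to the only socket sets
>      * all 4 bits, so the process is reported as "not bound" */
>     printf("%d\n", is_bound(0x0f, 4));   /* prints 0 -> error path */
>     /* type #2: 2 sockets, 4 cores each -- one socket is 4 of 8 bits */
>     printf("%d\n", is_bound(0x0f, 8));   /* prints 1 -> considered bound */
>     return 0;
> }
> ========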
>
> Again, I think the problem could be solved by changing the algorithm
> so that ORTE_BIND_TO_SOCKET is treated as a no-op on a single-socket
> machine. A sketch of what that could look like follows.
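>
> (Illustrative only; num_sockets is not an existing variable in this
> function, it stands for whatever the topology query returns:)
> ========
> /* Sketch: bind-to-socket on a single-socket node is a no-op -- the
>  * mask necessarily covers every cpu, so "all bits set" is not an
>  * error. num_sockets is an assumed name for the node's socket count. */
> if (1 == num_sockets) {
>     bound = true;
> } else {
>     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> }
> ========
> The existing error path below the test would then stay unchanged.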
>
> Another solution could be to call the test
> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
> bound (orte_odls_globals.bound). Actually that is the only case where I
> see a justification for this test (see attached patch). Roughly:
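>
> ========
> /* Sketch of that alternative, not the attached patch itself: run the
>  * sanity check only when we are ourselves bound -- the one case where
>  * the check seems justified. */
> if (orte_odls_globals.bound) {
>     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>     if (!bound) {
>         orte_show_help("help-odls-default.txt",
>                        "odls-default:could-not-bind-to-socket", true);
>         ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>     }
> }
> ========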
>
> And maybe both solutions could be combined.
>
> Regards,
> Nadia
>
>
> --
> Nadia Derbey <Nadia.Derbey_at_[hidden]>
> <001_fix_process_binding_test.patch>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel