Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-02-22 18:12:42


On 02/22/12 14:54, Ralph Castain wrote:
> That doesn't really address the issue, though. What I want to know is:
> what happens when you try to bind processes? What about
> -bind-to-socket, and -persocket options? Etc. Reason I'm concerned:
> I'm not sure what happens if the socket layer isn't present. The logic
> in 1.5 is pretty old, but I believe it relies heavily on sockets being
> present.
Okay. So,

*) "out of the box", basically nothing works. For example, "mpirun
hostname" segfaults.

*) With "--mca orte_num_sockets 1", stuff appears to work.

*) With "--mca orte_num_sockets 1" and adding either "--bysocket
--bind-to-socket" or "--npersocket <n>", I get:

--------------------------------------------------------------------------
Unable to bind to socket -13 on node burl-ct-v20z-10.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered
an error:

Error name: Fatal
Node: burl-ct-v20z-10

when attempting to start process rank 0.
--------------------------------------------------------------------------
2 total processes failed to start

So, I hear Brice's comment that this is an old kernel. And, I hear what
you're saying about a "real" fix being expensive. Nevertheless, to my
taste, automatically setting num_sockets==1 when num_sockets==0 is
detected makes a lot of sense. It makes things "basically" work,
turning a situation where everything including "mpirun hostname"
segfaults into a situation where default usage works just fine. What
remains broken is binding, which generates an error message that gives
the user a hope of making progress (turning off binding). That's in
contrast to expecting users to go from

% mpirun hostname
Segmentation fault

to knowing that they should set orte_num_sockets==1.
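
To make the proposed fallback concrete, here is a minimal sketch of the idea in C. It is not the actual ORTE code: the function name normalize_num_sockets and the way the detected count is passed in are assumptions made for illustration only. It shows nothing more than "if zero sockets were detected, pretend there is one".

```c
/* Hypothetical helper, NOT an actual Open MPI/ORTE function.
 * Assumption: "detected" is the socket count reported by topology
 * discovery, which may be 0 on a kernel that does not expose the
 * socket layer. */
static int normalize_num_sockets(int detected)
{
    if (detected == 0) {
        /* No socket information available: fall back to one socket
         * so that default (unbound) launches keep working. */
        return 1;
    }
    /* Any other value is passed through unchanged. */
    return detected;
}
```

With a fallback like this, normalize_num_sockets(0) yields 1 and any other detected count is left alone. Binding requests could still fail on such a node, but they would fail with an error message rather than a segfault.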