Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-22 18:16:30


That's what we needed to know - i.e., that setting num_sockets=1 generates an error instead of segfaulting down the road. I can submit a CMR to do so.

thx!

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if the socket layer isn't present. The logic in 1.5 is pretty old, but I believe it relies heavily on sockets being present.
> Okay. So,
>
> *) "out of the box", basically nothing works. For example, "mpirun hostname" segfaults.
>
> *) With "--mca orte_num_sockets 1", stuff appears to work.
>
> *) With "--mca orte_num_sockets 1" and adding either "--bysocket --bind-to-socket" or "--npersocket <n>", I get:
>
> --------------------------------------------------------------------------
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an error:
>
> Error name: Fatal
> Node: burl-ct-v20z-10
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 2 total processes failed to start
>
> So, I hear Brice's comment that this is an old kernel. And, I hear what you're saying about a "real" fix being expensive. Nevertheless, to my taste, automatically setting num_sockets==1 when num_sockets==0 is detected makes a lot of sense. It makes things "basically" work, turning a situation where everything, including "mpirun hostname", segfaults into one where default usage works just fine. What remains broken is binding, which generates an error message that gives the user some hope of making progress (by turning off binding). That's in contrast to expecting users to go from
>
> % mpirun hostname
> Segmentation fault
>
> to knowing that they should set orte_num_sockets==1.
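
The fallback proposed above (treat a detected socket count of 0 as 1) can be sketched in C as follows. This is only an illustration under stated assumptions: effective_num_sockets is a hypothetical helper, not a function from the Open MPI source, and the real change would live wherever the v1.5 topology detection stores its result.

```c
/* Hypothetical helper, not part of Open MPI: given the socket count that
 * topology detection reported, return the count the launcher should use.
 * A detected count of 0 (e.g. an old kernel that exposes no socket layer)
 * is replaced by 1 so that default usage such as "mpirun hostname" has a
 * valid value to work with. Any other value is passed through unchanged. */
static int effective_num_sockets(int detected)
{
    if (detected == 0) {
        return 1;   /* assumed fallback: pretend the node has one socket */
    }
    return detected;
}
```

With this kind of fallback, a node reporting 0 sockets would behave as if "--mca orte_num_sockets 1" had been given. As the thread notes, socket binding on such a node still fails, but with an error message rather than a segfault.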