Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-22 18:16:30


That's what we needed to know - i.e., that setting num_sockets=1 generates an error instead of segfaulting down the road. I can submit a CMR to do so.

thx!

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if the socket layer isn't present. The logic in 1.5 is pretty old, but I believe it relies heavily on sockets being present.
> Okay. So,
>
> *) "out of the box", basically nothing works. For example, "mpirun hostname" segfaults.
>
> *) With "--mca orte_num_sockets 1", stuff appears to work.
>
> *) With "--mca orte_num_sockets 1" and adding either "--bysocket --bind-to-socket" or "--npersocket <n>", I get:
>
> --------------------------------------------------------------------------
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an error:
>
> Error name: Fatal
> Node: burl-ct-v20z-10
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 2 total processes failed to start
>
> So, I hear Brice's comment that this is an old kernel. And, I hear what you're saying about a "real" fix being expensive. Nevertheless, to my taste, automatically setting num_sockets==1 when num_sockets==0 is detected makes a lot of sense. It makes things "basically" work, turning a situation where everything including "mpirun hostname" segfaults into a situation where default usage works just fine. What remains broken is binding, which generates an error message that gives the user a hope of making progress (by turning off binding). That's in contrast to expecting users to go from
>
> % mpirun hostname
> Segmentation fault
>
> to knowing that they should set orte_num_sockets==1.
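
The fallback Eugene describes above (treating a detected socket count of 0 as 1 so that launches without binding can proceed) might look roughly like the following C sketch. The function name effective_num_sockets and the warning text are hypothetical and are not taken from the Open MPI source; treating negative counts the same way as 0 is also an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, NOT actual Open MPI code.
 * If topology detection reports no sockets (0, or a negative value,
 * which is an assumption here), fall back to a single socket and warn,
 * so that default usage keeps working instead of crashing later. */
int effective_num_sockets(int detected)
{
    int result = detected;

    if (detected < 1) {
        fprintf(stderr,
                "warning: detected %d sockets; assuming 1 "
                "(socket binding options may still fail)\n",
                detected);
        result = 1;
    }

    /* Postcondition: callers can always rely on at least one socket. */
    assert(result >= 1);
    return result;
}
```

With a fallback of this kind, a valid detected count passes through unchanged, and only the "no sockets" case is rewritten. Binding requests would still need their own error path, as the output quoted above shows.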