Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-22 14:54:09


On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote:

> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay. So, Brice's other e-mail indicates that the first two are "not really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't solve much - you'll get past that point in the code, but attempts to bind are likely to fail down the road. Fixing it will require some significant effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it is a widespread problem.
> I assume we don't see the problem as widespread because it was only introduced into v1.5 in r25914. In my mind, the real question is how common it is for hwloc to decide numsockets==0. On that one, Brice asserts it "isn't really uncommon."
>>>> For now, let's just use the mca param and see what happens.
>>> I am probably missing something but: Why would setting num_sockets to 1
>>> work fine as a mca param, while artificially setting it as said above
>>> wouldn't ?
>> Because the param means that it isn't hardwired into the code base. I want to first verify that artificially forcing num_sockets to 1 doesn't break the code down the road, so the less change to find out, the better.
> That sounds a lot different to me than the earlier statement. Thanks for asking that question, Brice. Anyhow, I tried using "--mca orte_num_sockets 1" and that seems to allow basic programs to run.

That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about the -bind-to-socket and -persocket options? Etc.

Reason I'm concerned: I'm not sure what happens if the socket layer isn't present. The logic in 1.5 is pretty old, but I believe it relies heavily on sockets being present.
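
For illustration only, the fallback Eugene suggested earlier in the thread (treat num_sockets==0 as 1) might look roughly like the sketch below. The helper names effective_num_sockets and cpus_per_socket are hypothetical and are not the actual orte_odls_base_open() code. This only avoids the divide by zero; it does not tell us whether binding works afterwards, which is the open question.

```c
#include <assert.h>

/* Hypothetical helper (not from the OMPI code base): fall back to a
 * single socket when the topology reports no socket level, i.e. a
 * detected count of 0 or less. */
static int effective_num_sockets(int detected_sockets)
{
    return detected_sockets > 0 ? detected_sockets : 1;
}

/* Hypothetical helper: spread CPUs across sockets without risking a
 * divide by zero when no sockets were detected. */
static int cpus_per_socket(int num_cpus, int detected_sockets)
{
    int sockets = effective_num_sockets(detected_sockets);
    assert(sockets > 0);  /* the divisor is never zero past this point */
    return num_cpus / sockets;
}
```

With this guard, a machine that reports 8 CPUs and no socket level is treated as one socket holding all 8 CPUs, which is the same assumption the "--mca orte_num_sockets 1" workaround makes.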
