Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-02-22 15:23:46


On 22/02/2012 20:24, Eugene Loh wrote:
> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and
>>>>>> OMPI pukes on divide by zero. OS info was listed in the original
>>>>>> message (below). Might we want to do something else? E.g.,
>>>>>> assume num_sockets==1 when num_sockets==0 (if you know what I
>>>>>> mean)? So, which one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay. So, Brice's other e-mail indicates that the first two are
>>>>> "not really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0
>>>>> case rather than just dividing by num_sockets. This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't
>>>> solve much - you'll get past that point in the code, but attempts
>>>> to bind are likely to fail down the road. Fixing it will require
>>>> some significant effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it
>>>> is a widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into v1.5 in r25914. In my mind, the real question is how
> common it is for hwloc to decide numsockets==0. On that one, Brice
> asserts it "isn't really uncommon."

On Linux, it's uncommon: it only happens on some platforms with very old
kernels (2.6.10 or so).
Solaris, Darwin and Windows should get sockets in some/most cases.
FreeBSD should get x86 sockets correctly because we use cpuid directly
there.

Unless I am missing something, the remaining backends have nothing
related to sockets: AIX, HP-UX, OSF.

Brice
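
For reference, the guard discussed in this thread could look roughly like
the sketch below. It is not the actual orte_odls_base_open() code: the
helper name cores_per_socket and its sockets_known flag are hypothetical,
and the socket count is assumed to come from hwloc (in hwloc 1.x, typically
hwloc_get_nbobjs_by_type() with HWLOC_OBJ_SOCKET). It only avoids the
division by zero; as Ralph notes, binding to sockets would still need
separate handling when no socket level exists.

```c
/* Hypothetical helper, not Open MPI code: compute cores per socket
 * while tolerating a topology in which hwloc found no socket level
 * (num_sockets == 0). */
static int cores_per_socket(int num_cores, int num_sockets, int *sockets_known)
{
    /* Tell the caller whether a real socket level was reported. */
    *sockets_known = (num_sockets > 0);

    if (!*sockets_known) {
        /* No socket level: treat the whole machine as a single socket
         * for counting purposes instead of dividing by zero. */
        return num_cores;
    }
    return num_cores / num_sockets;
}
```

A caller would check sockets_known before attempting any socket-level
binding, and fall back to core-level binding or no binding when it is 0.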