Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-02-22 13:59:13

On 22/02/2012 17:48, Ralph Castain wrote:
> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which one (or more) of the following should be fixed?
>>> *) on this platform, hwloc finds no socket level
>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>> *) OMPI divides by 0 and barfs on basically everything
>> Okay. So, Brice's other e-mail indicates that the first two are "not really uncommon":
>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>> reports nothing interesting (only one machine object with multiple PU
>>> children). So numsockets==0 isn't really uncommon.
>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since r25914.
> Unfortunately, just artificially setting the num_sockets to 1 won't solve much - you'll get past that point in the code, but attempts to bind are likely to fail down the road. Fixing it will require some significant effort.
> Given we haven't heard reports of this before, I'm not convinced it is a widespread problem. For now, let's just use the mca param and see what happens.

I am probably missing something, but why would setting num_sockets to 1
through an MCA param work fine, while artificially setting it as described
above would not?
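
For illustration only, the fallback Eugene suggests in the quoted text (treat num_sockets==0 as num_sockets==1 before dividing) could be sketched in C as below. This is not the actual orte_odls_base_open() code: the helper names effective_num_sockets() and cores_per_socket() are hypothetical, and the sketch only avoids the divide by zero. As Ralph points out, it does not make later binding attempts succeed.

```c
#include <assert.h>

/* Hypothetical helper (not an OMPI or hwloc function): if topology
 * discovery reported no socket level (0 sockets), fall back to treating
 * the whole machine as a single socket. */
static int effective_num_sockets(int detected_sockets)
{
    return detected_sockets > 0 ? detected_sockets : 1;
}

/* Hypothetical helper: cores per socket, computed without ever dividing
 * by zero when no socket objects were found. */
static int cores_per_socket(int num_cores, int detected_sockets)
{
    assert(num_cores >= 0);
    return num_cores / effective_num_sockets(detected_sockets);
}
```

With these assumptions, cores_per_socket(8, 0) yields 8 (one implied socket) and cores_per_socket(8, 2) yields 4.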