Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-02-22 11:39:27


On 2/21/2012 10:31 PM, Eugene Loh wrote:
> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
> pukes on divide by zero. OS info was listed in the original message
> (below). Might we want to do something else? E.g., assume
> num_sockets==1 when num_sockets==0 (if you know what I mean)? So,
> which one (or more) of the following should be fixed?
>
> *) on this platform, hwloc finds no socket level
> *) therefore hwloc returns num_sockets==0 to OMPI
> *) OMPI divides by 0 and barfs on basically everything
Okay. So, Brice's other e-mail indicates that the first two are "not
really uncommon":

On 2/22/2012 7:55 AM, Brice Goglin wrote:
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon.
So, it seems to me that OMPI needs to handle the num_sockets==0 case
rather than just dividing by num_sockets. The division is in v1.5's
orte_odls_base_open() as of r25914.
>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>
>>> 222     /* get the number of local sockets unless we were given a number */
>>> 223     if (0 == orte_default_num_sockets_per_board) {
>>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>> 225     }
>>> 226     /* get the number of local processors */
>>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>> 228     /* compute the base number of cores/socket, if not given */
>>> 229     if (0 == orte_default_num_cores_per_socket) {
>>> 230         orte_odls_globals.num_cores_per_socket =
>>>                 orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>> 231     }
>>>
>>> Well, we execute the branch at line 224, but num_sockets remains 0.
>>> This leads to the divide-by-0 at line 230.
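
For illustration only, here is a minimal sketch of the "assume num_sockets==1
when num_sockets==0" idea floated above. The helper name cores_per_socket()
and its plain int parameters are hypothetical; this is not the actual OMPI
code or a proposed patch, just the shape of the guard.

```c
#include <stdio.h>

/* Hypothetical helper (not part of OMPI): compute cores per socket,
 * treating an unknown socket count (0 or negative) as a single socket
 * so that the division can never be by zero. */
static int cores_per_socket(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* Assumption: the topology reported no socket level. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With this guard, a machine that reports 8 processors and no socket level
would be treated as one 8-core socket instead of crashing. Whether that is
the right default, as opposed to fixing what hwloc reports, is the open
question in this thread.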