Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-22 11:48:19

On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:

> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which one (or more) of the following should be fixed?
>> *) on this platform, hwloc finds no socket level
>> *) therefore hwloc returns num_sockets==0 to OMPI
>> *) OMPI divides by 0 and barfs on basically everything
> Okay. So, Brice's other e-mail indicates that the first two are "not really uncommon":
> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>> reports nothing interesting (only one machine object with multiple PU
>> children). So numsockets==0 isn't really uncommon.
> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since r25914.

Unfortunately, just artificially setting the num_sockets to 1 won't solve much - you'll get past that point in the code, but attempts to bind are likely to fail down the road. Fixing it will require some significant effort.

Given we haven't heard reports of this before, I'm not convinced it is a widespread problem. For now, let's just use the mca param and see what happens.

>>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>> 222     /* get the number of local sockets unless we were given a number */
>>>> 223     if (0 == orte_default_num_sockets_per_board) {
>>>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>>> 225     }
>>>> 226     /* get the number of local processors */
>>>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>>> 228     /* compute the base number of cores/socket, if not given */
>>>> 229     if (0 == orte_default_num_cores_per_socket) {
>>>> 230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>>> 231     }
>>>> Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230.
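
For illustration only, here is a minimal sketch of the kind of guard discussed in this thread for the division at line 230. It is not the actual ORTE code or a proposed patch: the struct `topo_counts_t` and the function `compute_cores_per_socket` are hypothetical stand-ins for the `orte_odls_globals` fields quoted above. It assumes that when no socket level is detected (num_sockets==0), the whole machine is treated as a single socket so the arithmetic stays defined, and it returns an error code so the caller knows the socket count was not real. As noted in the reply above, this only gets past the divide; attempts to bind may still fail later.

```c
/* Hypothetical stand-in for the orte_odls_globals fields used in the
 * quoted snippet; this is not the real ORTE structure. */
typedef struct {
    int num_sockets;
    int num_processors;
    int num_cores_per_socket;
} topo_counts_t;

/* Compute cores per socket without dividing by zero.
 * Returns 0 on success, or -1 if no socket level was detected, in which
 * case the caller should treat socket-level binding as unavailable. */
static int compute_cores_per_socket(topo_counts_t *t)
{
    if (t->num_sockets <= 0) {
        /* Assumption: fall back to "one socket holding every processor"
         * so later arithmetic is defined. Binding may still fail. */
        t->num_cores_per_socket = t->num_processors;
        return -1;
    }
    t->num_cores_per_socket = t->num_processors / t->num_sockets;
    return 0;
}
```

Returning a status instead of silently substituting 1 lets the caller decide whether to disable binding or ask the user to supply the socket count explicitly.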