Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-02-22 01:31:33


On 02/21/12 19:29, Jeffrey Squyres wrote:
> What's the output of running lstopo from hwloc 1.3.2? (this is the version that's in the OMPI trunk and v1.5 branches)
>
> http://www.open-mpi.org/software/hwloc/v1.3/
>
> Is there any difference from v1.4 hwloc?
>
> http://www.open-mpi.org/software/hwloc/v1.4/

Machine (8192MB)
   NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
   NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4. No information about sockets.

As Paul says, doesn't look like a compiler thing. (I get the same with
Intel and gcc.)

The hwloc README has a sample program (the "third example") that contains

  depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
  if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
      printf("*** The number of sockets is unknown\n");
  } else {
      ...
  }

Here, it reports that the number of sockets is unknown. So hwloc has no
socket level on this machine, it returns 0 for num_sockets, and OMPI
pukes on the divide by zero. OS info was listed in the original message
(below). Might we want to do something else? E.g., assume
num_sockets==1 when hwloc reports num_sockets==0? So, which one (or
more) of the following should be fixed?

*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
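
For concreteness, the "assume num_sockets==1" option might look roughly like the sketch below on the OMPI side. This is only an illustration: cores_per_socket is a made-up helper, not an existing function in the tree, and treating a missing socket level as a single socket is an assumption rather than an agreed fix.

```c
/* Hypothetical helper (not an existing OMPI function): compute the
 * number of cores per socket without ever dividing by zero. */
int cores_per_socket(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* Assumption: no socket level was detected, so treat the
         * whole machine as one socket. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With the two-PU machine above (num_processors of 2, num_sockets of 0) this gives 2 cores per socket instead of a divide-by-zero. Machines where hwloc does report sockets take the same path as before.
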
> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>> We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is
>>
>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.
>>
>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>
>> 222     /* get the number of local sockets unless we were given a number */
>> 223     if (0 == orte_default_num_sockets_per_board) {
>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>> 225     }
>> 226     /* get the number of local processors */
>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>> 228     /* compute the base number of cores/socket, if not given */
>> 229     if (0 == orte_default_num_cores_per_socket) {
>> 230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>> 231     }
>>
>> Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
>>
>> static int module_get_socket_info(int *num_sockets) {
>>     hwloc_topology_t *t = &opal_hwloc_topology;
>>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>     return OPAL_SUCCESS;
>> }
>>
>> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
>>
>> I can poke around more, but does someone want to advise?