Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-02-22 01:31:33


On 02/21/12 19:29, Jeffrey Squyres wrote:
> What's the output of running lstopo from hwloc 1.3.2? (this is the version that's in the OMPI trunk and v1.5 branches)
>
> http://www.open-mpi.org/software/hwloc/v1.3/
>
> Is there any difference from v1.4 hwloc?
>
> http://www.open-mpi.org/software/hwloc/v1.4/
Machine (8192MB)
   NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
   NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4. No information about sockets.

As Paul says, doesn't look like a compiler thing. (I get the same with
Intel and gcc.)

The hwloc README has a sample program whose "third example" contains

  depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
  if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
      printf("*** The number of sockets is unknown\n");
  } else {
      ...
  }

that reports that the number of sockets is unknown on this machine. So
the socket level is unknown to hwloc, hwloc returns 0 for num_sockets,
and OMPI pukes on the divide by zero. OS info was listed in the original
message (below). Might we want to do something else? E.g., assume
num_sockets==1 when hwloc reports num_sockets==0? So, which one (or
more) of the following should be fixed?

*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
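
A minimal sketch of the "assume num_sockets==1 when num_sockets==0" idea, in plain C with no OMPI or hwloc dependencies. The names effective_num_sockets and cores_per_socket are hypothetical helpers made up for illustration; they are not existing OMPI functions, and whether one socket is the right default on such a platform is an assumption, not something settled in this thread.

```c
/* Hypothetical helper (not part of OMPI): if the topology layer reports
 * no sockets (0, or a negative error value), fall back to one socket so
 * that later divisions are well defined. */
int effective_num_sockets(int reported_sockets)
{
    return (reported_sockets > 0) ? reported_sockets : 1;
}

/* Hypothetical helper: the cores-per-socket computation from
 * odls_base_open.c, guarded against a zero socket count. */
int cores_per_socket(int num_processors, int reported_sockets)
{
    return num_processors / effective_num_sockets(reported_sockets);
}
```

With a guard like this, the machine above (2 PUs, no socket level reported) would be treated as one socket holding 2 processors instead of dividing by zero.
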
> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>> We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is
>>
>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.
>>
>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>
>> 222     /* get the number of local sockets unless we were given a number */
>> 223     if (0 == orte_default_num_sockets_per_board) {
>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>> 225     }
>> 226     /* get the number of local processors */
>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>> 228     /* compute the base number of cores/socket, if not given */
>> 229     if (0 == orte_default_num_cores_per_socket) {
>> 230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>> 231     }
>>
>> Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
>>
>> static int module_get_socket_info(int *num_sockets) {
>>     hwloc_topology_t *t = &opal_hwloc_topology;
>>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>     return OPAL_SUCCESS;
>> }
>>
>> Anyhow, SOCKET is somehow an unknown level in this topology, so num_sockets comes back as 0.
>>
>> I can poke around more, but does someone want to advise?