Open MPI Development Mailing List Archives

Subject: [OMPI devel] v1.5 r25914 DOA
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-02-21 19:20:49


We have some amount of MTT testing going on every night, and on ONE of
our systems v1.5 has been dead since r25914. The system is:

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256)
compilers. I haven't poked around enough yet to figure out what the
problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

     222 /* get the number of local sockets unless we were given a number */
     223 if (0 == orte_default_num_sockets_per_board) {
     224     opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
     225 }
     226 /* get the number of local processors */
     227 opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
     228 /* compute the base number of cores/socket, if not given */
     229 if (0 == orte_default_num_cores_per_socket) {
     230     orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
     231 }

Well, we execute the branch at line 224, but num_sockets remains 0.
This leads to the divide-by-0 at line 230. Digging deeper, the call at
line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c
(lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
     hwloc_topology_t *t = &opal_hwloc_topology;
     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
     return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer in the hwloc topology on this
system, so num_sockets comes back as 0.
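
For illustration only, here is a minimal sketch of the kind of guard that
would avoid the divide-by-0 at line 230. The helper name and the
single-socket fallback are my assumptions, not anything in r25914 or in the
tree; the right policy when no socket level is found is exactly what I am
asking about.

```c
/* Hypothetical helper, not part of Open MPI: compute cores per socket
 * while tolerating a socket count that the topology query left at 0
 * (or a negative count, which the hwloc query can return when objects
 * of the requested type exist at several depths). */
int cores_per_socket_or_fallback(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* Assumption: with no usable socket level, treat the node as a
         * single socket rather than dividing by zero. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With something like this, 8 processors and 0 reported sockets would give 8
cores per socket instead of a crash, while 8 processors on 2 sockets would
still give 4.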

I can poke around more, but does someone want to advise?