Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Jeffrey Squyres (jsquyres_at_[hidden])
Date: 2012-02-21 19:29:45


What's the output of running lstopo from hwloc 1.3.2? (This is the version that's in the OMPI trunk and v1.5 branches.)

    http://www.open-mpi.org/software/hwloc/v1.3/

Is the lstopo output any different with hwloc v1.4?

    http://www.open-mpi.org/software/hwloc/v1.4/

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

> We run MTT tests every night, and on ONE of our systems v1.5 has been dead since r25914. The system is:
>
> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.
>
> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>
> 222     /* get the number of local sockets unless we were given a number */
> 223     if (0 == orte_default_num_sockets_per_board) {
> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
> 225     }
> 226     /* get the number of local processors */
> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
> 228     /* compute the base number of cores/socket, if not given */
> 229     if (0 == orte_default_num_cores_per_socket) {
> 230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
> 231     }
>
> Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
>
> static int module_get_socket_info(int *num_sockets) {
>     hwloc_topology_t *t = &opal_hwloc_topology;
>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>     return OPAL_SUCCESS;
> }
>
> Anyhow, SOCKET is somehow an unknown layer in the topology on this system, so num_sockets comes back as 0.
>
> I can poke around more, but does someone want to advise?

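For illustration only, here is a minimal sketch of how the division described above could be guarded when no sockets are detected. The helper name safe_cores_per_socket is hypothetical and not part of Open MPI, and falling back to a single socket is an assumption, not necessarily the right fix for this system.

```c
#include <assert.h>

/* Hypothetical helper, not part of Open MPI: computes cores per socket
 * while tolerating a topology that reports zero sockets. */
int safe_cores_per_socket(int num_processors, int num_sockets)
{
    assert(num_processors >= 0);

    /* Assumption: if no socket objects were detected, treat the node
     * as a single socket instead of dividing by zero. */
    if (num_sockets <= 0) {
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

Something along these lines would only keep the division at line 230 from faulting. It would not explain why the topology has no SOCKET level in the first place, which is what the lstopo output should show.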
-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/