Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2012-02-21 19:37:39


I have been testing v1.5 with the slightly older Intel
"composerxe-2011.5.220" compilers.
I see a "make check" failure in opal_datatype_test that does not occur
with any other compiler (such as gcc on the same node).
Most recently I saw it with the 1.5.5rc2r25990 tarball generated
earlier today.
With "make check -k" I can confirm that opal_datatype_test is the ONLY
failure I see with this compiler.
So I had simply assumed this was a buggy compiler and thought nothing
more of it.

I also have the same "composer_xe_2011_sp1.7.256" compiler that you are
using, plus a more recent "composer_xe_2011_sp1.8.273", though I have
not yet tested either. I will test both ASAP and report back with my
findings.

-Paul

On 2/21/2012 4:20 PM, Eugene Loh wrote:
> We have some amount of MTT testing going on every night and on ONE of
> our systems v1.5 has been dead since r25914. The system is
>
> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST
> 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and I'm encountering the problem with Intel
> (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough
> yet to figure out what the problematic characteristic of this
> configuration is.
>
> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>
> 222     /* get the number of local sockets unless we were given a number */
> 223     if (0 == orte_default_num_sockets_per_board) {
> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
> 225     }
> 226     /* get the number of local processors */
> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
> 228     /* compute the base number of cores/socket, if not given */
> 229     if (0 == orte_default_num_cores_per_socket) {
> 230         orte_odls_globals.num_cores_per_socket =
>                 orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
> 231     }
>
> Well, we execute the branch at line 224, but num_sockets remains 0.
> This leads to the divide-by-0 at line 230. Digging deeper, the call
> at line 224 led us to
> opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left
> out):
>
> static int module_get_socket_info(int *num_sockets) {
>     hwloc_topology_t *t = &opal_hwloc_topology;
>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>     return OPAL_SUCCESS;
> }
>
> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is
> set to 0.
>
> I can poke around more, but does someone want to advise?
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
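
The divide-by-0 described in the quoted message comes from using
num_sockets as a divisor without checking it. The following is only a
minimal sketch of one defensive approach, not the actual Open MPI code
or fix. The helper name safe_cores_per_socket is hypothetical, and the
fallback of treating a machine with no visible socket level as a single
socket is an assumption made for illustration.

```c
/* Hypothetical helper, not part of Open MPI: derive cores per socket
 * without dividing by zero when the topology reports no sockets. */
int safe_cores_per_socket(int num_processors, int num_sockets)
{
    if (num_processors <= 0) {
        /* No processors detected; report 0 and let the caller decide. */
        return 0;
    }
    if (num_sockets <= 0) {
        /* Assumption: with no socket level visible, treat the node
         * as one socket holding every processor. */
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With such a guard, a node that reports 4 processors and 0 sockets would
get 4 cores per socket instead of a crash. Whether that fallback is the
right policy for odls is a separate question.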

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900