
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2012-02-21 20:40:35


Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as
problematic, so I tried just running the ring test with various
-bind-to-* options. I am quite willing to run additional test cases.
All runs are w/ OMPI_MCA_btl=sm,self.

+ 2011.5.220
   FAIL: "make check" fails opal_datatype_test
   OK: mpirun -np 2 ./ring_c
   OK: mpirun -np 2 -bind-to-none ./ring_c
   OK: mpirun -np 2 -bind-to-core ./ring_c
   OK: mpirun -np 2 -bind-to-socket ./ring_c

+ 2011_sp1.7.256
   OK: "make check"
   OK: mpirun -np 2 -bind-to-none ./ring_c
   OK: mpirun -np 2 -bind-to-core ./ring_c
   OK: mpirun -np 2 -bind-to-socket ./ring_c

So, I don't think the "2011_sp1.7.256" compilers are broken (and are
"better" than the ones I've been using).
I have a build with "2011_sp1.8.273" churning away right now (est.
45 minutes to complete; I should have disabled the Fortran bindings).

If there is something other than the -bind-to-* flags I should be using
to reach the problematic code, let me know.
But based on what I've seen so far, I think we can probably rule out the
compiler as the problem.
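
For reference, the failure Eugene describes below comes down to an integer division by a socket count of zero. Here is a minimal sketch of a guard against that case. The helper name cores_per_socket and the single-socket fallback are assumptions made for illustration; this is not the actual Open MPI code or fix.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of Open MPI): compute cores per socket
 * while tolerating a topology query that reports zero sockets.
 */
int cores_per_socket(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        /* Assumption: with no socket layer detected, treat the node
         * as a single socket instead of dividing by zero. */
        fprintf(stderr, "no sockets detected; assuming 1\n");
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With this guard, a node reporting 4 processors and 0 sockets yields 4 cores per socket rather than a crash. Whether a single-socket fallback is the right policy for odls is a separate question.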

-Paul

On 2/21/2012 4:37 PM, Paul H. Hargrove wrote:
> I have been testing v1.5 with slightly older Intel
> "composerxe-2011.5.220" compilers.
> I see a "make check" failure in opal_datatype_test which is not
> present with any other compiler (such as gcc on the same node).
> This has been seen most recently on the 1.5.5rc2r25990 tarball
> generated earlier today.
> With "make check -k" I can confirm that opal_datatype_test is the ONLY
> failure I see with this compiler.
> So, I have just assumed this was a buggy compiler and thought nothing
> more of it.
>
> I have not yet tested them, but also have the same
> "composer_xe_2011_sp1.7.256" compiler and a more recent
> "composer_xe_2011_sp1.8.273". I will test both ASAP and report back
> with my findings.
>
> -Paul
>
>
> On 2/21/2012 4:20 PM, Eugene Loh wrote:
>> We have some amount of MTT testing going on every night and on ONE of
>> our systems v1.5 has been dead since r25914. The system is
>>
>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST
>> 2007 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and I'm encountering the problem with Intel
>> (composer_xe_2011_sp1.7.256) compilers. I haven't poked around
>> enough yet to figure out what the problematic characteristic of this
>> configuration is.
>>
>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>
>> 222     /* get the number of local sockets unless we were given a number */
>> 223     if (0 == orte_default_num_sockets_per_board) {
>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>> 225     }
>> 226     /* get the number of local processors */
>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>> 228     /* compute the base number of cores/socket, if not given */
>> 229     if (0 == orte_default_num_cores_per_socket) {
>> 230         orte_odls_globals.num_cores_per_socket =
>>             orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>> 231     }
>>
>> Well, we execute the branch at line 224, but num_sockets remains 0.
>> This leads to the divide-by-0 at line 230. Digging deeper, the call
>> at line 224 led us to
>> opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left
>> out):
>>
>> static int module_get_socket_info(int *num_sockets) {
>>     hwloc_topology_t *t = &opal_hwloc_topology;
>>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>     return OPAL_SUCCESS;
>> }
>>
>> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is
>> returning 0.
>>
>> I can poke around more, but does someone want to advise?
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900