Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.5 r25914 DOA
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2012-02-21 21:24:20


My build with the "2011_sp1.8.273" Intel compilers passes the same tests
as I detailed below for "2011_sp1.7.256".
I don't suspect any longer that the compiler is at fault, but am willing
to try additional/alternate tests to help confirm.

-Paul

On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
> Here are the first of the results of the testing I promised.
> I am not 100% sure how to reach the code that Eugene reported as
> problematic, so I tried just running the ring test with various
> -bind-to-* options. I am quite willing to run additional test
> cases. All runs are w/ OMPI_MCA_btl=sm,self.
>
> + 2011.5.220
> FAIL: "make check" fails opal_datatype_test
> OK: mpirun -np 2 ./ring_c
> OK: mpirun -np 2 -bind-to-none ./ring_c
> OK: mpirun -np 2 -bind-to-core ./ring_c
> OK: mpirun -np 2 -bind-to-socket ./ring_c
>
> + 2011_sp1.7.256
> OK: "make check"
> OK: mpirun -np 2 -bind-to-none ./ring_c
> OK: mpirun -np 2 -bind-to-core ./ring_c
> OK: mpirun -np 2 -bind-to-socket ./ring_c
>
> So, I don't think the "2011_sp1.7.256" compilers are broken (and are
> "better" than the ones I've been using).
> I have a build with "2011_sp1.8.273" churning away right now (est.
> 45 minutes to complete; I should have disabled the Fortran bindings).
>
> If there is something other than the -bind-to-* flags I should be
> using to reach the problematic code, let me know.
> But based on what I've seen so far, I think we can probably rule out
> the compiler as the problem.
>
> -Paul
>
>
> On 2/21/2012 4:37 PM, Paul H. Hargrove wrote:
>> I have been testing v1.5 with slightly older Intel
>> "composerxe-2011.5.220" compilers.
>> I see a "make check" failure in opal_datatype_test which is not
>> present with any other compiler (such as gcc on the same node).
>> This has been seen most recently on the 1.5.5rc2r25990 tarball
>> generated earlier today.
>> With "make check -k" I can confirm that opal_datatype_test is the
>> ONLY failure I see with this compiler.
>> So, I have just assumed this was a buggy compiler and thought nothing
>> more of it.
>>
>> I have not yet tested them, but I also have the same
>> "composer_xe_2011_sp1.7.256" compiler and a more recent
>> "composer_xe_2011_sp1.8.273". I will test both ASAP and report back
>> with my findings.
>>
>> -Paul
>>
>>
>> On 2/21/2012 4:20 PM, Eugene Loh wrote:
>>> We have some amount of MTT testing going on every night and on ONE
>>> of our systems v1.5 has been dead since r25914. The system is
>>>
>>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST
>>> 2007 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> and I'm encountering the problem with Intel
>>> (composer_xe_2011_sp1.7.256) compilers. I haven't poked around
>>> enough yet to figure out what the problematic characteristic of this
>>> configuration is.
>>>
>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>
>>> 222     /* get the number of local sockets unless we were given a number */
>>> 223     if (0 == orte_default_num_sockets_per_board) {
>>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>> 225     }
>>> 226     /* get the number of local processors */
>>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>> 228     /* compute the base number of cores/socket, if not given */
>>> 229     if (0 == orte_default_num_cores_per_socket) {
>>> 230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>> 231     }
>>>
>>> Well, we execute the branch at line 224, but num_sockets remains 0.
>>> This leads to the divide-by-0 at line 230. Digging deeper, the call
>>> at line 224 led us to
>>> opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff
>>> left out):
>>>
>>> static int module_get_socket_info(int *num_sockets) {
>>>     hwloc_topology_t *t = &opal_hwloc_topology;
>>>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>>     return OPAL_SUCCESS;
>>> }
>>>
>>> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is
>>> returning 0.
>>>
>>> I can poke around more, but does someone want to advise?
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
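
For anyone following the thread: the divide-by-0 that Eugene describes at
line 230 happens because num_sockets can still be 0 when the division
runs. One way to guard against that is sketched below. This is only an
illustration, not the fix that went into Open MPI. The helper name
safe_cores_per_socket and the choice to treat an undetected socket layer
as a single socket are both assumptions made for this example.

```c
/*
 * Hypothetical helper (not part of Open MPI): compute cores per socket
 * without dividing by zero when the socket count could not be detected.
 *
 * Assumption: if no sockets are reported, treat the node as one socket.
 */
int safe_cores_per_socket(int num_processors, int num_sockets)
{
    /* No usable processor count: nothing sensible to divide. */
    if (num_processors <= 0) {
        return 0;
    }

    /* Socket layer unknown (reported as 0): fall back to one socket. */
    if (num_sockets <= 0) {
        num_sockets = 1;
    }

    return num_processors / num_sockets;
}
```

With this guard, a node reporting 4 processors and 0 sockets would yield
4 cores per socket rather than a crash. Whether a fallback of one socket
is the right policy for odls is a separate question; the point is only
that the 0 case needs to be handled before the division.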

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900