Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs
From: Alain Miniussi (alain.miniussi_at_[hidden])
Date: 2014-05-27 18:32:35


Unfortunately, the debug library works like a charm (which makes the
uninitialized-variable hypothesis more likely).

Still, the stack trace points to mca_btl_openib_add_procs in
ompi/mca/btl/openib/btl_openib.c, and there is only one division in that
function (an integer one, not floating point), at the end:

     openib_btl->local_procs += local_procs;
     openib_btl->device->mem_reg_max =
         calculate_max_reg () / openib_btl->local_procs;

Now, I'm not sure how much I would trust the local_procs initialization:

for (i = 0, local_procs = 0 ; i < (int) nprocs; i++) {

I suspect that a compiler could (wrongly) decide to skip the
initialization of local_procs if nprocs is 0, or in a few other corner
cases.
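
For reference, here is a minimal sketch of the same loop shape outside of
Open MPI. The helper names (count_local, count_local_defensive) and the
is_local flag array are hypothetical and only serve the illustration. Per
the C standard, the init clause of a for statement is evaluated exactly
once even when the body never runs, so a conforming compiler must leave
local_procs at 0 when nprocs is 0. Initializing at the declaration is a
defensive alternative that does not rely on the loop header at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the loop shape quoted above:
 * local_procs is initialized in the for-init clause. */
static int count_local(const int *is_local, size_t nprocs)
{
    int i, local_procs;

    /* The array may only be NULL when there is nothing to scan. */
    assert(is_local != NULL || nprocs == 0);

    for (i = 0, local_procs = 0; i < (int) nprocs; i++) {
        if (is_local[i]) {
            local_procs++;
        }
    }
    return local_procs;
}

/* Defensive variant: the counter is initialized at its declaration,
 * so its value never depends on how the loop header is compiled. */
static int count_local_defensive(const int *is_local, size_t nprocs)
{
    int local_procs = 0;
    size_t i;

    assert(is_local != NULL || nprocs == 0);

    for (i = 0; i < nprocs; i++) {
        if (is_local[i]) {
            local_procs++;
        }
    }
    return local_procs;
}
```

Both variants should return the same count for any input, including an
empty one; a difference between them on a given build would point at the
compiler rather than at the source.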

Anyway, applying the attached patch to btl_openib.c seems to fix the
issue on my small test case (but I have no exhaustive test suite to run).
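
The patch itself is not shown here. Purely as an illustration of the kind
of guard that avoids an integer divide-by-zero in a computation like the
one above, and not as the content of the patch, here is a small sketch.
The function per_proc_share and its fallback policy (keep the whole
budget when no local process has been counted) are assumptions made for
the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a "total / local_procs" computation.
 * Assumption: when no local process has been counted, the whole
 * budget is kept instead of dividing by zero. */
static uint64_t per_proc_share(uint64_t total, int local_procs)
{
    /* A negative count would indicate a bookkeeping bug. */
    assert(local_procs >= 0);

    if (local_procs == 0) {
        return total;
    }
    return total / (uint64_t) local_procs;
}
```

Whether falling back to the full value is the right policy depends on
what the caller expects; the point is only that the divisor is checked
before the division.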

If there is a more formal patch process to follow (based on the dev
version?), please let me know.

Alain

On 27/05/2014 17:30, Ralph Castain wrote:
> Ah, good. On the setup that fails, could you use gdb to find the line number where it is dividing by zero? It could be an uninitialized variable that gcc inits one way and icc inits another.
>
>
> On May 27, 2014, at 4:49 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>
>> So it's working with a gcc compiled openmpi:
>>
>> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc --showme
>> gcc -I/softs/openmpi-1.8.1-gnu447/include -pthread -Wl,-rpath -Wl,/softs/openmpi-1.8.1-gnu447/lib -Wl,--enable-new-dtags -L/softs/openmpi-1.8.1-gnu447/lib -lmpi
>> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc ./test.c
>> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpiexec -n 2 ./a.out
>> [alainm_at_gurney mpi]$ ldd ./a.out
>> linux-vdso.so.1 => (0x00007fffb47ff000)
>> libmpi.so.1 => /softs/openmpi-1.8.1-gnu447/lib/libmpi.so.1 (0x00002aaee80c1000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bd9e00000)
>> libc.so.6 => /lib64/libc.so.6 (0x0000003bd9200000)
>> libopen-rte.so.7 => /softs/openmpi-1.8.1-gnu447/lib/libopen-rte.so.7 (0x00002aaee83b8000)
>> libopen-pal.so.6 => /softs/openmpi-1.8.1-gnu447/lib/libopen-pal.so.6 (0x00002aaee8630000)
>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003bd9600000)
>> libdl.so.2 => /lib64/libdl.so.2 (0x00002aaee8904000)
>> librt.so.1 => /lib64/librt.so.1 (0x0000003bda600000)
>> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003beb000000)
>> libutil.so.1 => /lib64/libutil.so.1 (0x0000003bea000000)
>> libm.so.6 => /lib64/libm.so.6 (0x0000003bd9a00000)
>> /lib64/ld-linux-x86-64.so.2 (0x0000003bd8e00000)
>> [alainm_at_gurney mpi]$ ./a.out
>> [alainm_at_gurney mpi]$
>>
>> So it seems to be specific to Intel's compiler.
>>
>>
>> On 26/05/2014 17:35, Ralph Castain wrote:
>>> If you wouldn't mind, yes - let's see if it is a problem with icc. We know some versions have bugs, though this may not be the issue here
>>>
>>> On May 26, 2014, at 7:39 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>
>>>> Hi,
>>>>
>>>> Did that too, with the same result:
>>>>
>>>> [alainm_at_tagir mpi]$ mpirun -n 1 ./a.out
>>>> [tagir:05123] *** Process received signal ***
>>>> [tagir:05123] Signal: Floating point exception (8)
>>>> [tagir:05123] Signal code: Integer divide-by-zero (1)
>>>> [tagir:05123] Failing at address: 0x2adb507b3d9f
>>>> [tagir:05123] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>>> [tagir:05123] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2adb507b3d9f]
>>>> [tagir:05123] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2adb505a7481]
>>>> [tagir:05123] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2adb51af02f8]
>>>> [tagir:05123] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2adb4b78b236]
>>>> [tagir:05123] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2adb4b7ad74f]
>>>> [tagir:05123] [ 6] ./a.out[0x400dd1]
>>>> [tagir:05123] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>>> [tagir:05123] [ 8] ./a.out[0x400cc9]
>>>> [tagir:05123] *** End of error message ***
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 5123 on node tagir exited on signal 13 (Broken pipe).
>>>> --------------------------------------------------------------------------
>>>> [alainm_at_tagir mpi]$
>>>>
>>>>
>>>> Do you want me to try a gcc build?
>>>>
>>>> Alain
>>>>
>>>> On 26/05/2014 16:09, Ralph Castain wrote:
>>>>> Strange - I note that you are running these as singletons. Can you try running it under mpirun?
>>>>>
>>>>> mpirun -n 1 ./a.out
>>>>>
>>>>> just to see if it is the singleton that is causing the problem, or something in the openib btl itself.
>>>>>
>>>>>
>>>>> On May 26, 2014, at 6:59 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a failure with the following minimalistic testcase:
>>>>>> $: more ./test.c
>>>>>> #include "mpi.h"
>>>>>>
>>>>>> int main(int argc, char* argv[]) {
>>>>>> MPI_Init(&argc,&argv);
>>>>>> MPI_Finalize();
>>>>>> return 0;
>>>>>> }
>>>>>> $: mpicc -v
>>>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>>>> $: mpicc ./test.c
>>>>>> $: ./a.out
>>>>>> [tagir:02855] *** Process received signal ***
>>>>>> [tagir:02855] Signal: Floating point exception (8)
>>>>>> [tagir:02855] Signal code: Integer divide-by-zero (1)
>>>>>> [tagir:02855] Failing at address: 0x2aef6e5b2d9f
>>>>>> [tagir:02855] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>>>>> [tagir:02855] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2aef6e5b2d9f]
>>>>>> [tagir:02855] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2aef6e3a6481]
>>>>>> [tagir:02855] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2aef6f8ef2f8]
>>>>>> [tagir:02855] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2aef69572236]
>>>>>> [tagir:02855] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2aef6959474f]
>>>>>> [tagir:02855] [ 6] ./a.out[0x400dd1]
>>>>>> [tagir:02855] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>>>>> [tagir:02855] [ 8] ./a.out[0x400cc9]
>>>>>> [tagir:02855] *** End of error message ***
>>>>>> $:
>>>>>>
>>>>>> Versions info:
>>>>>> $: mpicc -v
>>>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>>>> $: ldd ./a.out
>>>>>> linux-vdso.so.1 => (0x00007fffbb197000)
>>>>>> libmpi.so.1 => /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1 (0x00002b20262ee000)
>>>>>> libm.so.6 => /lib64/libm.so.6 (0x00000030f8e00000)
>>>>>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000030ff200000)
>>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00000030f9200000)
>>>>>> libc.so.6 => /lib64/libc.so.6 (0x00000030f8a00000)
>>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00000030f9600000)
>>>>>> libopen-rte.so.7 => /softs/openmpi-1.8.1-intel13/lib/libopen-rte.so.7 (0x00002b202660d000)
>>>>>> libopen-pal.so.6 => /softs/openmpi-1.8.1-intel13/lib/libopen-pal.so.6 (0x00002b20268a1000)
>>>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b2026ba6000)
>>>>>> librt.so.1 => /lib64/librt.so.1 (0x00000030f9e00000)
>>>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003109800000)
>>>>>> libutil.so.1 => /lib64/libutil.so.1 (0x000000310aa00000)
>>>>>> libimf.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00002b2026db0000)
>>>>>> libsvml.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00002b202726d000)
>>>>>> libirng.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00002b2027c37000)
>>>>>> libintlc.so.5 => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00002b2027e3e000)
>>>>>> /lib64/ld-linux-x86-64.so.2 (0x00000030f8600000)
>>>>>> $:
>>>>>>
>>>>>> I tried to google the issue and saw something regarding an old vectorization bug with the Intel compiler, but that was a long time ago and seemed to be fixed for 1.6.x.
>>>>>> Also, "make check" went fine ???
>>>>>>
>>>>>> Any idea ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Alain
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> --
>>>> ---
>>>> Alain
>>>>
>>
>> --
>> ---
>> Alain
>>

-- 
---
Alain