Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-06-03 08:20:10


Yeah, I think we've concluded that this is just a bug in the compiler and not something wrong in OMPI itself. Sadly, compilers (just like all software) also have bugs.

I'd just use the upgraded version as they apparently fixed the problem.

On Jun 3, 2014, at 4:43 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:

> Please note that I had the problem with 13.1.0 but not with 13.1.1.
>
>
> On 28/05/2014 00:47, Ralph Castain wrote:
>> On May 27, 2014, at 3:32 PM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>
>>> Unfortunately, the debug library works like a charm (which makes the uninitialized-variable issue more likely).
>> Indeed - sounds like there is some optimization occurring that triggers the problem.
>>
>>> Still, the stack trace points to mca_btl_openib_add_procs in ompi/mca/btl/openib/btl_openib.c, and there is only one division in that function (although not a floating-point one), at the end:
>>>
>>> openib_btl->local_procs += local_procs;
>>> openib_btl->device->mem_reg_max = calculate_max_reg () / openib_btl->local_procs;
>>>
>>> Now, I'm not sure how much I would trust the local_procs initialization:
>>>
>>> for (i = 0, local_procs = 0 ; i < (int) nprocs; i++) {
>>>
>>> I suspect that a compiler could (wrongly) decide to skip the initialization of local_procs if nprocs == 0, or in a few other corner cases.
>> Yeah, that could be a source of optimization, I suppose - somewhat troubling wrt the expected behavior, but you could sorta see someone doing that.
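
[Editor's sketch] The attached btl_openib-1.8.1.diff is not reproduced in this archive, so the following is not the actual patch. It is only a minimal illustration of the kind of defensive change discussed here: initialize the counter at its declaration rather than in the for header, and guard the division. The names count_local_procs, per_proc_limit and is_local are hypothetical and do not come from the Open MPI source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the loop in mca_btl_openib_add_procs:
 * is_local[i] is nonzero when peer i is on the local node. */
int count_local_procs(const int *is_local, size_t nprocs)
{
    /* Initialized at the declaration, so the value is well defined
     * even if the loop body never runs (nprocs == 0). */
    int local_procs = 0;

    for (size_t i = 0; i < nprocs; i++) {
        if (is_local[i]) {
            local_procs++;
        }
    }
    return local_procs;
}

/* Hypothetical guard around the division: if no local process was
 * counted, return the whole budget instead of dividing by zero. */
uint64_t per_proc_limit(uint64_t total, int local_procs)
{
    if (local_procs <= 0) {
        return total;
    }
    return total / (uint64_t)local_procs;
}
```

With a correct compiler the original for-header initialization is equivalent to the first function, so a change like this would only work around the miscompilation; it would not fix a bug in the C source.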
>>
>>> Anyway, applying the attached patch to btl_openib.c seems to fix the issue on my small case (but I have no exhaustive test suite to run).
>>>
>>> If there is a more serious patch process to follow (based on the dev version?) please let me know.
>> The fact that it resolves the issue would lend credence to the optimizer indeed skipping that step for some odd reason. I'll bring it to the attention of the folks who maintain that component and see if they can grok the problem.
>>
>> Thanks!
>> Ralph
>>
>>> Alain
>>>
>>> On 27/05/2014 17:30, Ralph Castain wrote:
>>>> Ah, good. On the setup that fails, could you use gdb to find the line number where it is dividing by zero? It could be an uninitialized variable that gcc inits one way and icc inits another.
>>>>
>>>>
>>>> On May 27, 2014, at 4:49 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>>
>>>>> So it's working with a gcc compiled openmpi:
>>>>>
>>>>> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc --showme
>>>>> gcc -I/softs/openmpi-1.8.1-gnu447/include -pthread -Wl,-rpath -Wl,/softs/openmpi-1.8.1-gnu447/lib -Wl,--enable-new-dtags -L/softs/openmpi-1.8.1-gnu447/lib -lmpi
>>>>> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc ./test.c
>>>>> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpiexec -n 2 ./a.out
>>>>> [alainm_at_gurney mpi]$ ldd ./a.out
>>>>> linux-vdso.so.1 => (0x00007fffb47ff000)
>>>>> libmpi.so.1 => /softs/openmpi-1.8.1-gnu447/lib/libmpi.so.1 (0x00002aaee80c1000)
>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bd9e00000)
>>>>> libc.so.6 => /lib64/libc.so.6 (0x0000003bd9200000)
>>>>> libopen-rte.so.7 => /softs/openmpi-1.8.1-gnu447/lib/libopen-rte.so.7 (0x00002aaee83b8000)
>>>>> libopen-pal.so.6 => /softs/openmpi-1.8.1-gnu447/lib/libopen-pal.so.6 (0x00002aaee8630000)
>>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003bd9600000)
>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00002aaee8904000)
>>>>> librt.so.1 => /lib64/librt.so.1 (0x0000003bda600000)
>>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003beb000000)
>>>>> libutil.so.1 => /lib64/libutil.so.1 (0x0000003bea000000)
>>>>> libm.so.6 => /lib64/libm.so.6 (0x0000003bd9a00000)
>>>>> /lib64/ld-linux-x86-64.so.2 (0x0000003bd8e00000)
>>>>> [alainm_at_gurney mpi]$ ./a.out
>>>>> [alainm_at_gurney mpi]$
>>>>>
>>>>> So it seems to be specific to Intel's compiler.
>>>>>
>>>>>
>>>>> On 26/05/2014 17:35, Ralph Castain wrote:
>>>>>> If you wouldn't mind, yes - let's see if it is a problem with icc. We know some versions have bugs, though this may not be the issue here
>>>>>>
>>>>>> On May 26, 2014, at 7:39 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Did that too, with the same result:
>>>>>>>
>>>>>>> [alainm_at_tagir mpi]$ mpirun -n 1 ./a.out
>>>>>>> [tagir:05123] *** Process received signal ***
>>>>>>> [tagir:05123] Signal: Floating point exception (8)
>>>>>>> [tagir:05123] Signal code: Integer divide-by-zero (1)
>>>>>>> [tagir:05123] Failing at address: 0x2adb507b3d9f
>>>>>>> [tagir:05123] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>>>>>> [tagir:05123] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2adb507b3d9f]
>>>>>>> [tagir:05123] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2adb505a7481]
>>>>>>> [tagir:05123] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2adb51af02f8]
>>>>>>> [tagir:05123] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2adb4b78b236]
>>>>>>> [tagir:05123] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2adb4b7ad74f]
>>>>>>> [tagir:05123] [ 6] ./a.out[0x400dd1]
>>>>>>> [tagir:05123] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>>>>>> [tagir:05123] [ 8] ./a.out[0x400cc9]
>>>>>>> [tagir:05123] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 0 with PID 5123 on node tagir exited on signal 13 (Broken pipe).
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [alainm_at_tagir mpi]$
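
[Editor's sketch] The log above reports "Floating point exception (8)" together with "Integer divide-by-zero (1)". On POSIX systems SIGFPE covers arithmetic faults in general, and the si_code field says which one occurred. The following minimal sketch shows how a process can observe that code. It assumes a platform where integer division by zero traps (x86 Linux does; some other architectures do not), and the names try_int_divide and on_sigfpe are hypothetical. Integer division by zero is undefined behavior in C, so this is for diagnosis only.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf fpe_env;
static volatile sig_atomic_t fpe_code;

/* SA_SIGINFO handler: record why SIGFPE was raised, then jump back. */
static void on_sigfpe(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    fpe_code = info->si_code;   /* FPE_INTDIV for integer divide-by-zero */
    siglongjmp(fpe_env, 1);
}

/* Returns 0 and stores num / den in *quotient on success, the SIGFPE
 * si_code if the division trapped, or -1 if the handler could not be
 * installed. */
int try_int_divide(int num, int den, int *quotient)
{
    struct sigaction sa, old;
    volatile int n = num, d = den;  /* keep the compiler from folding n / d */
    int status = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) != 0) {
        return -1;
    }

    if (sigsetjmp(fpe_env, 1) == 0) {
        *quotient = n / d;          /* may raise SIGFPE when d == 0 */
    } else {
        status = (int)fpe_code;     /* reached via siglongjmp */
    }

    sigaction(SIGFPE, &old, NULL);  /* restore the previous handler */
    return status;
}
```

On a trapping platform, try_int_divide(1, 0, &q) would be expected to return FPE_INTDIV, which corresponds to the "Signal code" line in the log.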
>>>>>>>
>>>>>>>
>>>>>>> Do you want me to try a gcc build?
>>>>>>>
>>>>>>> Alain
>>>>>>>
>>>>>>> On 26/05/2014 16:09, Ralph Castain wrote:
>>>>>>>> Strange - I note that you are running these as singletons. Can you try running it under mpirun?
>>>>>>>>
>>>>>>>> mpirun -n 1 ./a.out
>>>>>>>>
>>>>>>>> just to see if it is the singleton that is causing the problem, or something in the openib btl itself.
>>>>>>>>
>>>>>>>>
>>>>>>>> On May 26, 2014, at 6:59 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a failure with the following minimalistic testcase:
>>>>>>>>> $: more ./test.c
>>>>>>>>> #include "mpi.h"
>>>>>>>>>
>>>>>>>>> int main(int argc, char* argv[]) {
>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>     MPI_Finalize();
>>>>>>>>>     return 0;
>>>>>>>>> }
>>>>>>>>> $: mpicc -v
>>>>>>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>>>>>>> $: mpicc ./test.c
>>>>>>>>> $: ./a.out
>>>>>>>>> [tagir:02855] *** Process received signal ***
>>>>>>>>> [tagir:02855] Signal: Floating point exception (8)
>>>>>>>>> [tagir:02855] Signal code: Integer divide-by-zero (1)
>>>>>>>>> [tagir:02855] Failing at address: 0x2aef6e5b2d9f
>>>>>>>>> [tagir:02855] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>>>>>>>> [tagir:02855] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2aef6e5b2d9f]
>>>>>>>>> [tagir:02855] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2aef6e3a6481]
>>>>>>>>> [tagir:02855] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2aef6f8ef2f8]
>>>>>>>>> [tagir:02855] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2aef69572236]
>>>>>>>>> [tagir:02855] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2aef6959474f]
>>>>>>>>> [tagir:02855] [ 6] ./a.out[0x400dd1]
>>>>>>>>> [tagir:02855] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>>>>>>>> [tagir:02855] [ 8] ./a.out[0x400cc9]
>>>>>>>>> [tagir:02855] *** End of error message ***
>>>>>>>>> $:
>>>>>>>>>
>>>>>>>>> Versions info:
>>>>>>>>> $: mpicc -v
>>>>>>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>>>>>>> $: ldd ./a.out
>>>>>>>>> linux-vdso.so.1 => (0x00007fffbb197000)
>>>>>>>>> libmpi.so.1 => /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1 (0x00002b20262ee000)
>>>>>>>>> libm.so.6 => /lib64/libm.so.6 (0x00000030f8e00000)
>>>>>>>>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000030ff200000)
>>>>>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00000030f9200000)
>>>>>>>>> libc.so.6 => /lib64/libc.so.6 (0x00000030f8a00000)
>>>>>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00000030f9600000)
>>>>>>>>> libopen-rte.so.7 => /softs/openmpi-1.8.1-intel13/lib/libopen-rte.so.7 (0x00002b202660d000)
>>>>>>>>> libopen-pal.so.6 => /softs/openmpi-1.8.1-intel13/lib/libopen-pal.so.6 (0x00002b20268a1000)
>>>>>>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b2026ba6000)
>>>>>>>>> librt.so.1 => /lib64/librt.so.1 (0x00000030f9e00000)
>>>>>>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003109800000)
>>>>>>>>> libutil.so.1 => /lib64/libutil.so.1 (0x000000310aa00000)
>>>>>>>>> libimf.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00002b2026db0000)
>>>>>>>>> libsvml.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00002b202726d000)
>>>>>>>>> libirng.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00002b2027c37000)
>>>>>>>>> libintlc.so.5 => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00002b2027e3e000)
>>>>>>>>> /lib64/ld-linux-x86-64.so.2 (0x00000030f8600000)
>>>>>>>>> $:
>>>>>>>>>
>>>>>>>>> I tried to google the issue and saw something about an old vectorization bug with the Intel compiler, but that was a long time ago and seemed to be fixed for 1.6.x.
>>>>>>>>> Also, "make check" went fine?
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ---
>>>>>>>>> Alain
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> <btl_openib-1.8.1.diff>