Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-27 11:30:46


Ah, good. On the setup that fails, could you use gdb to find the line number where it is dividing by zero? It could be an uninitialized variable that gcc inits one way and icc inits another.

On May 27, 2014, at 4:49 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:

> So it's working with a gcc-compiled Open MPI:
>
> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc --showme
> gcc -I/softs/openmpi-1.8.1-gnu447/include -pthread -Wl,-rpath -Wl,/softs/openmpi-1.8.1-gnu447/lib -Wl,--enable-new-dtags -L/softs/openmpi-1.8.1-gnu447/lib -lmpi
> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc ./test.c
> [alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpiexec -n 2 ./a.out
> [alainm_at_gurney mpi]$ ldd ./a.out
> linux-vdso.so.1 => (0x00007fffb47ff000)
> libmpi.so.1 => /softs/openmpi-1.8.1-gnu447/lib/libmpi.so.1 (0x00002aaee80c1000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bd9e00000)
> libc.so.6 => /lib64/libc.so.6 (0x0000003bd9200000)
> libopen-rte.so.7 => /softs/openmpi-1.8.1-gnu447/lib/libopen-rte.so.7 (0x00002aaee83b8000)
> libopen-pal.so.6 => /softs/openmpi-1.8.1-gnu447/lib/libopen-pal.so.6 (0x00002aaee8630000)
> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003bd9600000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00002aaee8904000)
> librt.so.1 => /lib64/librt.so.1 (0x0000003bda600000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003beb000000)
> libutil.so.1 => /lib64/libutil.so.1 (0x0000003bea000000)
> libm.so.6 => /lib64/libm.so.6 (0x0000003bd9a00000)
> /lib64/ld-linux-x86-64.so.2 (0x0000003bd8e00000)
> [alainm_at_gurney mpi]$ ./a.out
> [alainm_at_gurney mpi]$
>
> So it seems to be specific to Intel's compiler.
>
>
> On 26/05/2014 17:35, Ralph Castain wrote:
>> If you wouldn't mind, yes - let's see if it is a problem with icc. We know some versions have bugs, though this may not be the issue here
>>
>> On May 26, 2014, at 7:39 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>
>>> Hi,
>>>
>>> Did that too, with the same result:
>>>
>>> [alainm_at_tagir mpi]$ mpirun -n 1 ./a.out
>>> [tagir:05123] *** Process received signal ***
>>> [tagir:05123] Signal: Floating point exception (8)
>>> [tagir:05123] Signal code: Integer divide-by-zero (1)
>>> [tagir:05123] Failing at address: 0x2adb507b3d9f
>>> [tagir:05123] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>> [tagir:05123] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2adb507b3d9f]
>>> [tagir:05123] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2adb505a7481]
>>> [tagir:05123] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2adb51af02f8]
>>> [tagir:05123] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2adb4b78b236]
>>> [tagir:05123] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2adb4b7ad74f]
>>> [tagir:05123] [ 6] ./a.out[0x400dd1]
>>> [tagir:05123] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>> [tagir:05123] [ 8] ./a.out[0x400cc9]
>>> [tagir:05123] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 5123 on node tagir exited on signal 13 (Broken pipe).
>>> --------------------------------------------------------------------------
>>> [alainm_at_tagir mpi]$
>>>
>>>
>>> Do you want me to try a gcc build?
>>>
>>> Alain
>>>
>>> On 26/05/2014 16:09, Ralph Castain wrote:
>>>> Strange - I note that you are running these as singletons. Can you try running it under mpirun?
>>>>
>>>> mpirun -n 1 ./a.out
>>>>
>>>> just to see if it is the singleton that is causing the problem, or something in the openib btl itself.
>>>>
>>>>
>>>> On May 26, 2014, at 6:59 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a failure with the following minimalistic testcase:
>>>>> $: more ./test.c
>>>>> #include "mpi.h"
>>>>>
>>>>> int main(int argc, char* argv[]) {
>>>>> MPI_Init(&argc,&argv);
>>>>> MPI_Finalize();
>>>>> return 0;
>>>>> }
>>>>> $: mpicc -v
>>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>>> $: mpicc ./test.c
>>>>> $: ./a.out
>>>>> [tagir:02855] *** Process received signal ***
>>>>> [tagir:02855] Signal: Floating point exception (8)
>>>>> [tagir:02855] Signal code: Integer divide-by-zero (1)
>>>>> [tagir:02855] Failing at address: 0x2aef6e5b2d9f
>>>>> [tagir:02855] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>>>> [tagir:02855] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2aef6e5b2d9f]
>>>>> [tagir:02855] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2aef6e3a6481]
>>>>> [tagir:02855] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2aef6f8ef2f8]
>>>>> [tagir:02855] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2aef69572236]
>>>>> [tagir:02855] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2aef6959474f]
>>>>> [tagir:02855] [ 6] ./a.out[0x400dd1]
>>>>> [tagir:02855] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>>>> [tagir:02855] [ 8] ./a.out[0x400cc9]
>>>>> [tagir:02855] *** End of error message ***
>>>>> $:
>>>>>
>>>>> Versions info:
>>>>> $: mpicc -v
>>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>>> $: ldd ./a.out
>>>>> linux-vdso.so.1 => (0x00007fffbb197000)
>>>>> libmpi.so.1 => /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1 (0x00002b20262ee000)
>>>>> libm.so.6 => /lib64/libm.so.6 (0x00000030f8e00000)
>>>>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000030ff200000)
>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00000030f9200000)
>>>>> libc.so.6 => /lib64/libc.so.6 (0x00000030f8a00000)
>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00000030f9600000)
>>>>> libopen-rte.so.7 => /softs/openmpi-1.8.1-intel13/lib/libopen-rte.so.7 (0x00002b202660d000)
>>>>> libopen-pal.so.6 => /softs/openmpi-1.8.1-intel13/lib/libopen-pal.so.6 (0x00002b20268a1000)
>>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b2026ba6000)
>>>>> librt.so.1 => /lib64/librt.so.1 (0x00000030f9e00000)
>>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003109800000)
>>>>> libutil.so.1 => /lib64/libutil.so.1 (0x000000310aa00000)
>>>>> libimf.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00002b2026db0000)
>>>>> libsvml.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00002b202726d000)
>>>>> libirng.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00002b2027c37000)
>>>>> libintlc.so.5 => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00002b2027e3e000)
>>>>> /lib64/ld-linux-x86-64.so.2 (0x00000030f8600000)
>>>>> $:
>>>>>
>>>>> I tried to google the issue and saw something regarding an old vectorization bug with the Intel compiler, but that was a long time ago and seemed to be fixed for 1.6.x.
>>>>> Also, "make check" went fine, which is surprising.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Cheers
>>>>>
>>>>> --
>>>>> ---
>>>>> Alain
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> --
>>> ---
>>> Alain
>>>
>
>
> --
> ---
> Alain
>