
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs
From: Alain Miniussi (alain.miniussi_at_[hidden])
Date: 2014-05-27 07:49:11


So it works with a gcc-compiled Open MPI:

[alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc --showme
gcc -I/softs/openmpi-1.8.1-gnu447/include -pthread -Wl,-rpath
-Wl,/softs/openmpi-1.8.1-gnu447/lib -Wl,--enable-new-dtags
-L/softs/openmpi-1.8.1-gnu447/lib -lmpi
[alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc ./test.c
[alainm_at_gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpiexec -n 2 ./a.out
[alainm_at_gurney mpi]$ ldd ./a.out
     linux-vdso.so.1 => (0x00007fffb47ff000)
     libmpi.so.1 => /softs/openmpi-1.8.1-gnu447/lib/libmpi.so.1 (0x00002aaee80c1000)
     libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bd9e00000)
     libc.so.6 => /lib64/libc.so.6 (0x0000003bd9200000)
     libopen-rte.so.7 => /softs/openmpi-1.8.1-gnu447/lib/libopen-rte.so.7 (0x00002aaee83b8000)
     libopen-pal.so.6 => /softs/openmpi-1.8.1-gnu447/lib/libopen-pal.so.6 (0x00002aaee8630000)
     libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003bd9600000)
     libdl.so.2 => /lib64/libdl.so.2 (0x00002aaee8904000)
     librt.so.1 => /lib64/librt.so.1 (0x0000003bda600000)
     libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003beb000000)
     libutil.so.1 => /lib64/libutil.so.1 (0x0000003bea000000)
     libm.so.6 => /lib64/libm.so.6 (0x0000003bd9a00000)
     /lib64/ld-linux-x86-64.so.2 (0x0000003bd8e00000)
[alainm_at_gurney mpi]$ ./a.out
[alainm_at_gurney mpi]$

So it seems to be specific to Intel's compiler.
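
A side note on reading the report quoted below: "Floating point exception (8)" is simply the description of SIGFPE, which on Linux also covers integer arithmetic faults, and signal code 1 is FPE_INTDIV. So the crash in mca_btl_openib_add_procs is an integer division (or modulo) by zero, not a floating-point problem. Here is a minimal sketch of that behaviour, assuming Linux on x86-64; the helper names fpe_code_name and signal_from_int_div are made up for illustration and are not part of Open MPI.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: readable name for a few SIGFPE si_code values. */
const char *fpe_code_name(int si_code) {
    switch (si_code) {
    case FPE_INTDIV: return "Integer divide-by-zero";
    case FPE_INTOVF: return "Integer overflow";
    case FPE_FLTDIV: return "Floating-point divide-by-zero";
    default:         return "Other arithmetic error";
    }
}

/* Hypothetical helper: fork a child that computes num / den with int
 * arithmetic. Returns the number of the signal that killed the child,
 * 0 if the child exited normally, or -1 if fork/waitpid failed.
 * Dividing by zero is undefined behaviour in C; on x86-64 Linux the
 * hardware trap is delivered to the process as SIGFPE. */
int signal_from_int_div(int num, int den) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* volatile keeps the compiler from folding the division away */
        volatile int n = num, d = den;
        volatile int q = n / d;
        (void)q;
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling signal_from_int_div(1, 0) from a small main on such a machine should report SIGFPE, the same signal as in the trace; on architectures where integer division does not trap, the child simply exits normally.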

On 26/05/2014 17:35, Ralph Castain wrote:
> If you wouldn't mind, yes - let's see if it is a problem with icc. We know some versions have bugs, though this may not be the issue here
>
> On May 26, 2014, at 7:39 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>
>> Hi,
>>
>> Did that too, with the same result:
>>
>> [alainm_at_tagir mpi]$ mpirun -n 1 ./a.out
>> [tagir:05123] *** Process received signal ***
>> [tagir:05123] Signal: Floating point exception (8)
>> [tagir:05123] Signal code: Integer divide-by-zero (1)
>> [tagir:05123] Failing at address: 0x2adb507b3d9f
>> [tagir:05123] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>> [tagir:05123] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2adb507b3d9f]
>> [tagir:05123] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2adb505a7481]
>> [tagir:05123] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2adb51af02f8]
>> [tagir:05123] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2adb4b78b236]
>> [tagir:05123] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2adb4b7ad74f]
>> [tagir:05123] [ 6] ./a.out[0x400dd1]
>> [tagir:05123] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>> [tagir:05123] [ 8] ./a.out[0x400cc9]
>> [tagir:05123] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 5123 on node tagir exited on signal 13 (Broken pipe).
>> --------------------------------------------------------------------------
>> [alainm_at_tagir mpi]$
>>
>>
>> Do you want me to try a gcc build?
>>
>> Alain
>>
>> On 26/05/2014 16:09, Ralph Castain wrote:
>>> Strange - I note that you are running these as singletons. Can you try running it under mpirun?
>>>
>>> mpirun -n 1 ./a.out
>>>
>>> just to see if it is the singleton that is causing the problem, or something in the openib btl itself.
>>>
>>>
>>> On May 26, 2014, at 6:59 AM, Alain Miniussi <alain.miniussi_at_oca.eu> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a failure with the following minimalistic testcase:
>>>> $: more ./test.c
>>>> #include "mpi.h"
>>>>
>>>> int main(int argc, char* argv[]) {
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>> $: mpicc -v
>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>> $: mpicc ./test.c
>>>> $: ./a.out
>>>> [tagir:02855] *** Process received signal ***
>>>> [tagir:02855] Signal: Floating point exception (8)
>>>> [tagir:02855] Signal code: Integer divide-by-zero (1)
>>>> [tagir:02855] Failing at address: 0x2aef6e5b2d9f
>>>> [tagir:02855] [ 0] /lib64/libpthread.so.0[0x30f920f710]
>>>> [tagir:02855] [ 1] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0xe9f)[0x2aef6e5b2d9f]
>>>> [tagir:02855] [ 2] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_bml_r2.so(+0x1481)[0x2aef6e3a6481]
>>>> [tagir:02855] [ 3] /softs/openmpi-1.8.1-intel13/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xa8)[0x2aef6f8ef2f8]
>>>> [tagir:02855] [ 4] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(ompi_mpi_init+0x9f6)[0x2aef69572236]
>>>> [tagir:02855] [ 5] /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1(MPI_Init+0xef)[0x2aef6959474f]
>>>> [tagir:02855] [ 6] ./a.out[0x400dd1]
>>>> [tagir:02855] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30f8a1ed1d]
>>>> [tagir:02855] [ 8] ./a.out[0x400cc9]
>>>> [tagir:02855] *** End of error message ***
>>>> $:
>>>>
>>>> Versions info:
>>>> $: mpicc -v
>>>> icc version 13.1.1 (gcc version 4.4.7 compatibility)
>>>> $: ldd ./a.out
>>>> linux-vdso.so.1 => (0x00007fffbb197000)
>>>> libmpi.so.1 => /softs/openmpi-1.8.1-intel13/lib/libmpi.so.1 (0x00002b20262ee000)
>>>> libm.so.6 => /lib64/libm.so.6 (0x00000030f8e00000)
>>>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000030ff200000)
>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00000030f9200000)
>>>> libc.so.6 => /lib64/libc.so.6 (0x00000030f8a00000)
>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00000030f9600000)
>>>> libopen-rte.so.7 => /softs/openmpi-1.8.1-intel13/lib/libopen-rte.so.7 (0x00002b202660d000)
>>>> libopen-pal.so.6 => /softs/openmpi-1.8.1-intel13/lib/libopen-pal.so.6 (0x00002b20268a1000)
>>>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002b2026ba6000)
>>>> librt.so.1 => /lib64/librt.so.1 (0x00000030f9e00000)
>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003109800000)
>>>> libutil.so.1 => /lib64/libutil.so.1 (0x000000310aa00000)
>>>> libimf.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00002b2026db0000)
>>>> libsvml.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00002b202726d000)
>>>> libirng.so => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00002b2027c37000)
>>>> libintlc.so.5 => /softs/intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00002b2027e3e000)
>>>> /lib64/ld-linux-x86-64.so.2 (0x00000030f8600000)
>>>> $:
>>>>
>>>> I tried to google the issue and saw something about an old vectorization bug with the Intel compiler, but that was a long time ago and it seemed to be fixed for 1.6.x.
>>>> Also, "make check" went fine, which is puzzling.
>>>>
>>>> Any ideas?
>>>>
>>>> Cheers
>>>>
>>>> --
>>>> ---
>>>> Alain
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> ---
>> Alain
>>

-- 
---
Alain