Open MPI Development Mailing List Archives

Subject: [OMPI devel] Segmentation fault on x86_64 on heterogeneous environment
From: Timur Magomedov (timur.magomedov_at_[hidden])
Date: 2010-04-22 09:08:24


Hello, list.

I am seeing a strange segmentation fault on an x86_64 machine when it
runs together with an x86 machine.
I am running the attached program, which sends some bytes from process 0
to process 1. My configuration is:
Machine #1: (process 0)
  arch: x86
  hostname: magomedov-desktop
  linux distro: Ubuntu 9.10
  Open MPI: v1.4 configured with --enable-heterogeneous --enable-debug
Machine #2: (process 1)
  arch: x86_64
  hostname: linuxtche
  linux distro: Fedora 12
  Open MPI: v1.4 configured with --enable-heterogeneous
    --prefix=/home/magomedov/openmpi/ --enable-debug

They are connected by Ethernet.
My user environment on the second (x86_64) machine is set up to use
Open MPI from /home/magomedov/openmpi/.

Then I compile the attached program on both machines (at the same path)
and run it. Process 0 on the x86 machine should send data to process 1
on the x86_64 machine.

First, let's send 65530 bytes:

mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
magomedov@linuxtche's password:
*** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875 ***
*** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
Received 65530 bytes

That works. Now let's send 65537 bytes:

magomedov@magomedov-desktop:~/workspace/mpi-test$ mpirun -host timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65537
magomedov@linuxtche's password:
*** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205 ***
*** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
[linuxtche:28858] *** Process received signal ***
[linuxtche:28858] Signal: Segmentation fault (11)
[linuxtche:28858] Signal code: Address not mapped (1)
[linuxtche:28858] Failing at address: 0x201143bf8
[linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
[linuxtche:28858] [ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27) [0x7f5e94076c27]
[linuxtche:28858] [ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac) [0x7f5e935c3dac]
[linuxtche:28858] [ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611) [0x7f5e96575611]
[linuxtche:28858] [ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57) [0x7f5e96575c57]
[linuxtche:28858] [ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f5e96575848]
[linuxtche:28858] [ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f5e965648dd]
[linuxtche:28858] [ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f) [0x7f5e9406e62f]
[linuxtche:28858] [ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d) [0x7f5e9406e77d]
[linuxtche:28858] [ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246) [0x7f5e9406f246]
[linuxtche:28858] [10] /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv+0x2d2) [0x7f5e96af832c]
[linuxtche:28858] [11] /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4) [0x400ee8]
[linuxtche:28858] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x360001eb1d]
[linuxtche:28858] [13] /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
[linuxtche:28858] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 28858 on node linuxtche
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Whenever I try to send >= 65537 bytes from the x86 machine, I get a
segfault on the x86_64 machine.

I investigated and found that the "bad" pointer always has a valid
pointer in its lower 32-bit word and "2" or "1" in its upper word. The
program segfaults in pml_ob1_recvfrag.c, in function
mca_pml_ob1_recv_frag_callback_fin(), where the rdma pointer is broken.
I inserted the line
rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xFFFFFFFF);
which I believe truncates the 64-bit pointer to 32 bits, and the
segfaults disappeared. However, this is not a real solution.

After some investigation with gdb, it seems to me that this pointer was
sent to the x86 machine and came back from it broken, but I don't
understand what is going on well enough to fix it...

Can anyone reproduce it?
I got the same results on openmpi-1.4.2rc1 too.

It looks like the same problem was described on the ompi-users list here:
http://www.open-mpi.org/community/lists/users/2010/02/12182.php

-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/