Thanks, I will apply the patch to all branches later today. Thanks for your help!
"All the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value.". -- Carl Sagan
On Apr 23, 2010, at 3:43, Timur Magomedov <timur.magomedov_at_[hidden]> wrote:
> Thank you, George!
> I checked out trunk version 1.7a1r23028 and got the same errors as on
> 1.4.*. Then I applied your patch and fixed one more file:
> Index: pml_ob1_recvreq.c
> --- pml_ob1_recvreq.c (revision 23028)
> +++ pml_ob1_recvreq.c (working copy)
> @@ -331,7 +331,7 @@
> - frag->rdma_hdr.hdr_rget.hdr_des.pval,
> + frag->rdma_hdr.hdr_rget.hdr_des,
> des->order, 0);
> /* is receive request complete */
> and the problem disappeared.
> On Fri, 23/04/2010 at 01:38 -0400, George Bosilca wrote:
>> Thanks for the very detailed analysis of the problem. Based on your observations, I was able to track down the issue pretty quickly. In a few words: the 64-bit machine sent a pointer to the 32-bit one and expected to get it back untouched. Unfortunately, on the 32-bit machine this pointer was translated into a void* and the upper 32 bits were lost.
>> I don't have a heterogeneous environment available right now to test my patch, so I would really appreciate it if you could test it and let us know whether it solves the problem.
>> PS: To apply it, go into the ompi/mca/pml/ob1 directory and run "patch -p0" from there.
>> On Apr 22, 2010, at 09:08 , Timur Magomedov wrote:
>>> Hello, list.
>>> I am seeing a strange segmentation fault on an x86_64 machine running
>>> together with an x86 one.
>>> I am running the attached program, which sends some bytes from process 0
>>> to process 1. My configuration is:
>>> Machine #1: (process 0)
>>> arch: x86
>>> hostname: magomedov-desktop
>>> linux distro: Ubuntu 9.10
>>> Open MPI: v1.4 configured with --enable-heterogeneous --enable-debug
>>> Machine #2: (process 1)
>>> arch: x86_64
>>> hostname: linuxtche
>>> linux distro: Fedora 12
>>> Open MPI: v1.4 configured with --enable-heterogeneous
>>> --prefix=/home/magomedov/openmpi/ --enable-debug
>>> They are connected by ethernet.
>>> My user environment on second (x86_64) machine is set up to use Open MPI
>>> from /home/magomedov/openmpi/.
>>> Then I compile attached program on both machines (at the same path) and
>>> run it. Process 0 from x86 machine should send data to process 1 on
>>> x86_64 machine.
>>> First, let's send 65530 bytes:
>>> mpirun -host timur,linuxtche -np
>>> 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
>>> magomedov_at_linuxtche's password:
>>> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875
>>> *** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
>>> Received 65530 bytes
>>> It's OK. Then let's send 65537 bytes:
>>> magomedov_at_magomedov-desktop:~/workspace/mpi-test$ mpirun -host
>>> timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test 65537
>>> magomedov_at_linuxtche's password:
>>> *** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205
>>> *** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
>>> [linuxtche:28858] *** Process received signal ***
>>> [linuxtche:28858] Signal: Segmentation fault (11)
>>> [linuxtche:28858] Signal code: Address not mapped (1)
>>> [linuxtche:28858] Failing at address: 0x201143bf8
>>> [linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
>>> [ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27)
>>> [ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac)
>>> [ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611)
>>> [ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57)
>>> [ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f)
>>> [ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89)
>>> [ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f)
>>> [ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d)
>>> [ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246)
>>> [linuxtche:28858]  /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv
>>> +0x2d2) [0x7f5e96af832c]
>>>  /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4)
>>> [linuxtche:28858]  /lib64/libc.so.6(__libc_start_main+0xfd)
>>>  /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
>>> [linuxtche:28858] *** End of error message ***
>>> mpirun noticed that process rank 1 with PID 28858 on node linuxtche
>>> exited on signal 11 (Segmentation fault).
>>> If I try to send >= 65537 bytes from the x86 machine, I always get a
>>> segfault on the x86_64 one.
>>> I did some investigation and found that the "bad" pointer always has a
>>> valid pointer in its lower 32-bit word and "2" or "1" in its
>>> upper word. The program segfaults in pml_ob1_recvfrag.c, in
>>> mca_pml_ob1_recv_frag_callback_fin(), where the rdma pointer is broken.
>>> I inserted the line
>>> rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xFFFFFFFF);
>>> which truncates the 64-bit pointer to 32 bits, and the segfaults
>>> disappeared. However, this is not a real solution.
>>> After some investigation with gdb, it looks as if this pointer is sent
>>> to the x86 machine and comes back broken, but I don't understand what is
>>> going on well enough to fix it...
>>> Can anyone reproduce it?
>>> I got the same results on openmpi-1.4.2rc1 too.
>>> It looks like the same problem was described here
>>> http://www.open-mpi.org/community/lists/users/2010/02/12182.php in
>>> ompi-users list.
>>> Kind regards,
>>> Timur Magomedov
>>> Senior C++ Developer
>>> DevelopOnBox LLC / Zodiac Interactive
>>> devel mailing list
> Kind regards,
> Timur Magomedov
> Senior C++ Developer
> DevelopOnBox LLC / Zodiac Interactive