Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
From: TRINH Minh Hieu (mhtrinh_at_[hidden])
Date: 2010-03-04 09:02:19


Hi,

I have made a new discovery about this problem:

It seems that the largest array that can be sent from a 32-bit machine
to a 64-bit machine is proportional to the "btl_tcp_eager_limit"
parameter. When I set it to 200 000 000 (2e08 bytes, about 190 MB), I
can send an array of up to 2e07 doubles (about 152 MB).
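
For reference, the sizes quoted above can be checked with a few lines of C. This is only a sketch: the helper names are made up for illustration, the 65536-byte default for btl_tcp_eager_limit is an assumption that should be checked against the output of ompi_info on your own build, and any protocol header overhead is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: default value of btl_tcp_eager_limit in bytes.
 * Verify it with ompi_info on your own installation. */
#define ASSUMED_DEFAULT_EAGER_LIMIT ((size_t)65536)

/* Payload size in bytes of `count` doubles (message headers are ignored). */
size_t payload_bytes(size_t count)
{
    assert(count <= SIZE_MAX / sizeof(double)); /* guard against overflow */
    return count * sizeof(double);
}

/* Returns 1 if the payload alone is no larger than the eager limit, else 0. */
int fits_in_eager_limit(size_t count, size_t eager_limit)
{
    return payload_bytes(count) <= eager_limit;
}

/* Converts a byte count to MiB (1 MiB = 1048576 bytes). */
double bytes_to_mib(size_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

With 8-byte doubles, payload_bytes(20000000) is 160 000 000 bytes (roughly 152.6 MiB), which is below the 200 000 000-byte limit. Under the assumed default, 1000 doubles (8000 bytes) fit and 10000 doubles (80000 bytes) do not, which would be consistent with the n=1000 / n=10000 behaviour reported earlier in this thread, although it does not by itself prove the cause.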

I didn't find much information about btl_tcp_eager_limit other than
in the output of the "ompi_info --all" command. If I leave it at 2e08,
will it impact the performance of Open MPI?

It may also be worth noting that if the master (rank 0) is a 32-bit
machine, I don't get a segfault: I can send a big array with a small
"btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.

Should I move this thread to the devel mailing list?

Regards,

   TMHieu

On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtrinh_at_[hidden]> wrote:
> Hello,
>
> Yes, I compiled OpenMPI with --enable-heterogeneous. More precisely I
> compiled with :
> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous --enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default
> $ make all install
>
> I attach the output of ompi_info of my 2 machines.
>
>    TMHieu
>
> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> Did you configure Open MPI with --enable-heterogeneous?
>>
>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>
>>> Hello,
>>>
>>> I have some problems running MPI on my heterogeneous cluster. More
>>> precisely, I get a segmentation fault when sending a large array (about
>>> 10000 elements) of doubles from an i686 machine to an x86_64 machine. It
>>> does not happen with small arrays. Here is the send/recv source code (the
>>> complete source is in the attached file):
>>> ========code ================
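>>>     if (me == 0) {
>>>         /* Rank 0 receives one array of n doubles from every other rank. */
>>>         for (int pe = 1; pe < nprocs; pe++) {
>>>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>             d = (double *)malloc(sizeof(double) * n);
>>>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>             printf("OK\n"); fflush(stdout);
>>>             free(d);  /* release the buffer before the next iteration */
>>>         }
>>>         printf("All done.\n");
>>>     }
>>>     else {
>>>         /* Every other rank sends n doubles to rank 0 (contents are not initialised). */
>>>         d = (double *)malloc(sizeof(double) * n);
>>>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>>         free(d);
>>>     }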
>>> ======== code ================
>>>
>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>> I have 2 machines :
>>> sbtn155 : Intel Xeon,         x86_64
>>> sbtn211 : Intel Pentium 4, i686
>>>
>>> The code is compiled on both the x86_64 and the i686 machine, using
>>> OpenMPI 1.4.1, installed in /tmp/openmpi :
>>> [mhtrinh_at_sbtn211 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>> hetero.i686.o -o hetero.i686 -lm
>>>
>>> [mhtrinh_at_sbtn155 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>> hetero.x86_64.o -o hetero.x86_64 -lm
>>>
>>> I ran the code using an appfile and got these errors:
>>> $ cat appfile
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn211 -np 1 hetero.i686
>>>
>>> $ mpirun -hetero --app appfile
>>> Input array length :
>>> 10000
>>> Receiving from proc 1 : OK
>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>> [sbtn155:26386] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Am I missing an option needed to run on a heterogeneous cluster?
>>> Do MPI_Send/MPI_Recv have a limit on array size when used on a
>>> heterogeneous cluster?
>>> Thanks for your help. Regards
>>>
>>> --
>>> ============================================
>>>    M. TRINH Minh Hieu
>>>    CEA, IBEB, SBTN/LIRM,
>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>> ============================================
>>>
>>> <hetero.c.bz2>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/