
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2010-03-05 02:24:58


Hi,

Setting the eager limit to such a drastically high value will cause enormous memory consumption for unexpected messages. Any message you send that does not have a matching receive already posted will allocate 150 MB of temporary storage on the receiver, and will then be copied from that internal buffer into the receive buffer once the receive is posted. You should expect very poor bandwidth, and probably a crash or abort caused by memory exhaustion on the receivers.
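
To put rough numbers on this, here is a minimal sketch in plain C (no MPI calls) of the arithmetic involved. The function names are hypothetical, and the model is deliberately simplified: it assumes a message is sent eagerly whenever its payload is at or below the limit, ignores protocol headers, and assumes every unexpected eager message is buffered whole on the receiver.

```c
#include <stdint.h>

/* Payload size in bytes of an array of n doubles
 * (a double is 8 bytes on both i686 and x86_64). */
uint64_t payload_bytes(uint64_t n_doubles) {
    return n_doubles * (uint64_t)sizeof(double);
}

/* Simplified model (assumption): a message goes out eagerly when its
 * payload does not exceed the eager limit. Header overhead is ignored. */
int is_eager(uint64_t bytes, uint64_t eager_limit) {
    return bytes <= eager_limit;
}

/* Temporary storage the receiver needs if `unexpected` eager messages of
 * n doubles each arrive before their receives are posted (same model). */
uint64_t unexpected_buffer_bytes(uint64_t n_doubles, uint64_t unexpected) {
    return payload_bytes(n_doubles) * unexpected;
}
```

With the values from this thread, payload_bytes(20000000) is 160,000,000 bytes (about 152 MiB), which is_eager reports as eager under a limit of 200,000,000. Under this model, two such messages arriving before their receives are posted would need 320,000,000 bytes of temporary storage on the receiver.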

Aurelien

--
Dr. Aurelien Bouteiller
Innovative Computing Laboratory
University of Tennessee
Knoxville, TN, USA
 
On March 4, 2010, at 09:02, TRINH Minh Hieu wrote:
> Hi,
> 
> I have a new finding about this problem:
> 
> It seems that the array size that can be sent from a 32-bit to a 64-bit
> machine is proportional to the parameter "btl_tcp_eager_limit".
> When I set it to 200,000,000 (2e8 bytes, about 190 MB), I can send an
> array of up to 2e7 doubles (about 152 MB).
> 
> I didn't find much information about btl_tcp_eager_limit other than
> in the output of the "ompi_info --all" command. If I leave it at 2e8,
> will it impact the performance of Open MPI?
> 
> It may also be worth noting that if the master (rank 0) is a 32-bit
> machine, I don't get a segfault. I can send a big array with a small
> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
> 
> Should I move this thread to the devel mailing list?
> 
> Regards,
> 
>   TMHieu
> 
> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtrinh_at_[hidden]> wrote:
>> Hello,
>> 
>> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely, I
>> compiled with:
>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous \
>>     --enable-cxx-exceptions --enable-shared \
>>     --enable-orterun-prefix-by-default
>> $ make all install
>> 
>> I have attached the ompi_info output from my two machines.
>> 
>>    TMHieu
>> 
>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>> Did you configure Open MPI with --enable-heterogeneous?
>>> 
>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I have some problems running MPI on my heterogeneous cluster. More
>>>> precisely, I get a segmentation fault when sending a large array (about
>>>> 10000) of doubles from an i686 machine to an x86_64 machine. It does not
>>>> happen with small arrays. Here is the send/recv source code (the complete
>>>> source is in the attached file):
>>>> ======== code ================
>>>>     if (me == 0) {
>>>>         /* Rank 0 receives one array of n doubles from every other rank. */
>>>>         for (int pe = 1; pe < nprocs; pe++) {
>>>>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>>             d = (double *)malloc(sizeof(double) * n);
>>>>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>>             printf("OK\n"); fflush(stdout);
>>>>         }
>>>>         printf("All done.\n");
>>>>     }
>>>>     else {
>>>>         /* Every other rank sends n doubles to rank 0
>>>>            (the buffer contents are not initialized here). */
>>>>         d = (double *)malloc(sizeof(double) * n);
>>>>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>>>     }
>>>> ======== code ================
>>>> 
>>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>>> I have 2 machines:
>>>> sbtn155 : Intel Xeon,         x86_64
>>>> sbtn211 : Intel Pentium 4, i686
>>>> 
>>>> The code is compiled on both the x86_64 and i686 machines, using
>>>> Open MPI 1.4.1 installed in /tmp/openmpi:
>>>> [mhtrinh_at_sbtn211 heterogenous]$ make hetero
>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>> hetero.i686.o -o hetero.i686 -lm
>>>> 
>>>> [mhtrinh_at_sbtn155 heterogenous]$ make hetero
>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>> hetero.x86_64.o -o hetero.x86_64 -lm
>>>> 
>>>> I ran the code using an appfile and got these errors:
>>>> $ cat appfile
>>>> --host sbtn155 -np 1 hetero.x86_64
>>>> --host sbtn155 -np 1 hetero.x86_64
>>>> --host sbtn211 -np 1 hetero.i686
>>>> 
>>>> $ mpirun -hetero --app appfile
>>>> Input array length :
>>>> 10000
>>>> Receiving from proc 1 : OK
>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>>> [sbtn155:26386] [ 4]
>>>> /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b)
>>>> [0x2aaaaab30f9b]
>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>>> [sbtn155:26386] *** End of error message ***
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>>> exited on signal 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>> 
>>>> Am I missing an option needed to run on a heterogeneous cluster?
>>>> Do MPI_Send/MPI_Recv have an array size limit on a heterogeneous cluster?
>>>> Thanks for your help. Regards
>>>> 
>>>> --
>>>> ============================================
>>>>    M. TRINH Minh Hieu
>>>>    CEA, IBEB, SBTN/LIRM,
>>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>>> ============================================
>>>> 
>>>> <hetero.c.bz2>_______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>> 
> 