Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-03-08 07:27:02


We (Oracle) have not done much testing of size limits when going
between 32-bit and 64-bit applications. Most of the testing we've done
has been around endianness (SPARC vs. x86_64).

That said, the report below is interesting. It sounds like the eager limit
isn't being normalized on the 64-bit machines. The fact that a 32-bit rank 0
avoids the problem is also interesting; I wonder whether that has more to do
with which rank is sending and which is receiving.
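
A quick way to check that hypothesis would be to compare the value each
build reports (a sketch; run the ompi_info that belongs to each
installation, e.g. /tmp/openmpi/bin/ompi_info on these machines):

    ompi_info --param btl tcp | grep eager_limit

If the 32-bit and 64-bit builds print different values, that would point
at the normalization issue.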

--td

>
> Message: 3
> Date: Sun, 7 Mar 2010 05:34:21 -0600
> From: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] Segmentation fault when Send/Recv
> on heterogeneous cluster (32/64 bit machines)
> To: <users_at_[hidden]>
> Message-ID:
> <58D723FE08DC6A4398E6596E38F3FA1705670F_at_[hidden]>
> Content-Type: text/plain; charset="utf-8"
>
> IBM and Sun (Oracle) have probably done the most heterogeneous testing, but it's probably not as stable as our homogeneous code paths.
>
> Terry/Brad - do you have any insight here?
>
> Yes, setting the eager limit high can impact performance. It's the amount of data that Open MPI will send eagerly, without waiting for an ack from the receiver. There are several secondary performance effects that can occur if you are using sockets for transport and/or your program is only loosely synchronized. If your program is tightly synchronous, it may not have a huge overall performance impact.
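>
> For what it's worth, the value can be overridden for a single run with the usual MCA syntax (a sketch; the number and program name are only illustrative):
>
>     mpirun --mca btl_tcp_eager_limit 65536 -np 2 ./a.out
>
> It can also be exported as OMPI_MCA_btl_tcp_eager_limit in the environment before launching.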
>
> -jms
> Sent from my PDA. No type good.
>
> ----- Original Message -----
> From: users-bounces_at_[hidden] <users-bounces_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Sent: Thu Mar 04 09:02:19 2010
> Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
>
> Hi,
>
> I have a new discovery about this problem:
>
> It seems that the array size that can be sent from a 32-bit to a 64-bit machine
> is proportional to the parameter "btl_tcp_eager_limit".
> When I set it to 200 000 000 (2e08 bytes, about 190 MB), I can send an
> array of up to 2e07 doubles (152 MB).
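>
> (As a sanity check on those numbers: 2e07 doubles x 8 bytes per double =
> 1.6e08 bytes, about 152 MB, which fits just under the 2e08-byte limit of
> roughly 190 MB.)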
>
> I didn't find much information about btl_tcp_eager_limit other than
> in the output of "ompi_info --all". If I leave it at 2e08, will it impact
> the performance of Open MPI?
>
> It may also be worth noting that if the master (rank 0) is a 32-bit
> machine, I don't get a segfault. I can send a big array with a small
> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
>
> Should I move this thread to the devel mailing list?
>
> Regards,
>
> TMHieu
>
> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtrinh_at_[hidden]> wrote:
>
>> Hello,
>>
>> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely, I
>> compiled with:
>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
>> --enable-cxx-exceptions --enable-shared
>> --enable-orterun-prefix-by-default
>> $ make all install
>>
>> I attach the output of ompi_info of my 2 machines.
>>
>>    TMHieu
>>
>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>
>>> Did you configure Open MPI with --enable-heterogeneous?
>>>
>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>>
>>>
>>>> Hello,
>>>>
>>>> I have some problems running MPI on my heterogeneous cluster. More
>>>> precisely, I get a segmentation fault when sending a large array (about
>>>> 10000 doubles) from an i686 machine to an x86_64 machine. It does not
>>>> happen with small arrays. Here is the send/recv code (the complete
>>>> source is in the attached file):
>>>> ======== code ================
>>>>     if (me == 0) {
>>>>         for (int pe = 1; pe < nprocs; pe++)
>>>>         {
>>>>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>>             d = (double *)malloc(sizeof(double) * n);
>>>>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>>             printf("OK\n"); fflush(stdout);
>>>>         }
>>>>         printf("All done.\n");
>>>>     }
>>>>     else {
>>>>         d = (double *)malloc(sizeof(double) * n);
>>>>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>>>     }
>>>> ======== code ================
>>>>
>>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>>> I have 2 machines:
>>>> sbtn155 : Intel Xeon,      x86_64
>>>> sbtn211 : Intel Pentium 4, i686
>>>>
>>>> The code is compiled on both the x86_64 and the i686 machine, using Open MPI 1.4.1
>>>> installed in /tmp/openmpi:
>>>> [mhtrinh_at_sbtn211 heterogenous]$ make hetero
>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>> hetero.i686.o -o hetero.i686 -lm
>>>>
>>>> [mhtrinh_at_sbtn155 heterogenous]$ make hetero
>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>> hetero.x86_64.o -o hetero.x86_64 -lm
>>>>
>>>> I run the code using an appfile and get these errors:
>>>> $ cat appfile
>>>> --host sbtn155 -np 1 hetero.x86_64
>>>> --host sbtn155 -np 1 hetero.x86_64
>>>> --host sbtn211 -np 1 hetero.i686
>>>>
>>>> $ mpirun -hetero --app appfile
>>>> Input array length :
>>>> 10000
>>>> Receiving from proc 1 : OK
>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>>> [sbtn155:26386] [ 4]
>>>> /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b)
>>>> [0x2aaaaab30f9b]
>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>>> [sbtn155:26386] *** End of error message ***
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>>> exited on signal 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>>
>>>> Am I missing an option needed to run on a heterogeneous cluster?
>>>> Do MPI_Send/Recv have a limit on array size when used on a heterogeneous cluster?
>>>> Thanks for your help. Regards
>>>>
>>>> --
>>>> ============================================
>>>>    M. TRINH Minh Hieu
>>>>    CEA, IBEB, SBTN/LIRM,
>>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>>> ============================================
>>>>
>>>> (Attachment: hetero.c.bz2)
>>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>