Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2010-03-07 06:34:21


IBM and Sun (Oracle) have probably done the most heterogeneous testing, but the heterogeneous code path is probably not as stable as our homogeneous code paths.

Terry/Brad - do you have any insight here?

Yes, setting the eager limit high can impact performance. It's the amount of data that Open MPI will send eagerly without waiting for an ack from the receiver. There are several secondary performance effects that can occur if you are using sockets for transport and/or your program is only loosely synchronized. If your program is tightly synchronized, it may not have too large an overall performance impact.
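
To make the eager limit concrete, here is a minimal sketch of the size check described above. It is a simplified model under stated assumptions, not Open MPI's actual code: the names `send_mode` and `choose_send_mode` are hypothetical, protocol headers are ignored, and the real TCP path handles more cases than this two-way choice.

```c
#include <stddef.h>

/* Hypothetical model of the choice a sender makes for each message. */
enum send_mode {
    SEND_EAGER,      /* push the whole payload right away, no ack awaited */
    SEND_RENDEZVOUS  /* announce the message first, wait for the receiver */
};

/* Returns SEND_EAGER when the payload fits within the eager limit.
 * eager_limit stands in for the btl_tcp_eager_limit parameter, in bytes.
 * Assumption: headers and fragmentation are left out of this model. */
enum send_mode choose_send_mode(size_t payload_bytes, size_t eager_limit)
{
    return (payload_bytes <= eager_limit) ? SEND_EAGER : SEND_RENDEZVOUS;
}
```

With an illustrative limit of 65536 bytes (check "ompi_info --all" for the value on your install), 1000 doubles (8000 bytes) would be sent eagerly in this model, while 10000 doubles (80000 bytes) would not.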

-jms
Sent from my PDA. No type good.

----- Original Message -----
From: users-bounces_at_[hidden] <users-bounces_at_[hidden]>
To: Open MPI Users <users_at_[hidden]>
Sent: Thu Mar 04 09:02:19 2010
Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)

Hi,

I have made a new discovery about this problem:

It seems that the array size that can be sent from a 32-bit to a 64-bit machine
is proportional to the parameter "btl_tcp_eager_limit".
When I set it to 200 000 000 (2e08 bytes, about 190MB), I can send an
array of up to 2e07 doubles (152MB).
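
As a quick check of the figures above, here is a small sketch of the arithmetic, assuming 8-byte doubles and 1 MB taken as 1024*1024 bytes. The helper names `doubles_to_bytes` and `bytes_to_mib` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Size in bytes of an array of n doubles. sizeof(double) is 8 on both
 * i686 and x86_64; 64-bit math avoids overflow on the 32-bit host. */
uint64_t doubles_to_bytes(uint64_t n)
{
    return n * (uint64_t)sizeof(double);
}

/* Whole megabytes (1024 * 1024 bytes each), rounded down. */
uint64_t bytes_to_mib(uint64_t bytes)
{
    return bytes / (1024u * 1024u);
}
```

For 2e07 doubles this gives 160000000 bytes, which is 152 whole MB and stays below the 200000000-byte limit (190 whole MB).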

I didn't find much information about btl_tcp_eager_limit other than
in the output of the "ompi_info --all" command. If I leave it at 2e08, will it impact
the performance of Open MPI?

It may also be noteworthy that if the master (rank 0) is a 32-bit
machine, I don't get a segfault. I can send a big array with a small
"btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.

Do I have to move this thread to the devel mailing list?

Regards,

   TMHieu

On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtrinh_at_[hidden]> wrote:
> Hello,
>
> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely, I
> compiled with:
> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
> --enable-cxx-exceptions --enable-shared
> --enable-orterun-prefix-by-default
> $ make all install
>
> I have attached the ompi_info output from my 2 machines.
>
>    TMHieu
>
> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> Did you configure Open MPI with --enable-heterogeneous?
>>
>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>
>>> Hello,
>>>
>>> I have some problems running MPI on my heterogeneous cluster. More
>>> precisely, I get a segmentation fault when sending a large array (about
>>> 10000) of doubles from an i686 machine to an x86_64 machine. It does not
>>> happen with a small array. Here is the send/recv source code (the complete
>>> source is in the attached file):
>>> ======== code ================
>>>     if (me == 0) {
>>>         /* Rank 0 receives one array of n doubles from every other rank. */
>>>         for (int pe = 1; pe < nprocs; pe++) {
>>>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>             d = (double *)malloc(sizeof(double) * n);
>>>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>             printf("OK\n"); fflush(stdout);
>>>         }
>>>         printf("All done.\n");
>>>     }
>>>     else {
>>>         /* Every other rank sends n doubles (contents uninitialized) to rank 0. */
>>>         d = (double *)malloc(sizeof(double) * n);
>>>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>>     }
>>> ======== code ================
>>>
>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>> I have 2 machines:
>>> sbtn155 : Intel Xeon,         x86_64
>>> sbtn211 : Intel Pentium 4, i686
>>>
>>> The code is compiled on both the x86_64 and i686 machines, using Open MPI 1.4.1,
>>> installed in /tmp/openmpi:
>>> [mhtrinh_at_sbtn211 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>> hetero.i686.o -o hetero.i686 -lm
>>>
>>> [mhtrinh_at_sbtn155 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>> hetero.x86_64.o -o hetero.x86_64 -lm
>>>
>>> I run the code using an appfile and get this error:
>>> $ cat appfile
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn211 -np 1 hetero.i686
>>>
>>> $ mpirun -hetero --app appfile
>>> Input array length :
>>> 10000
>>> Receiving from proc 1 : OK
>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>> [sbtn155:26386] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Am I missing an option needed to run on a heterogeneous cluster?
>>> Do MPI_Send/MPI_Recv have an array size limit on a heterogeneous cluster?
>>> Thanks for your help. Regards
>>>
>>> --
>>> ============================================
>>>    M. TRINH Minh Hieu
>>>    CEA, IBEB, SBTN/LIRM,
>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>> ============================================
>>>
>>> <hetero.c.bz2>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>
