Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
From: Timur Magomedov (timur.magomedov_at_[hidden])
Date: 2010-04-23 12:15:58


Hello,
It seems that this really was a bug. It was recently fixed in the repository:
https://svn.open-mpi.org/trac/ompi/changeset/23030
and the fix will likely be included in the next 1.4 release.

Here is the corresponding thread on ompi-devel:
http://www.open-mpi.org/community/lists/devel/2010/04/7787.php

On Fri, 05/03/2010 at 10:51 +0100, TRINH Minh Hieu wrote:
> Hi,
>
> Thank you for this information.
> So far I have not encountered those problems, maybe because my
> program doesn't use much memory (100MB) and the master machine has
> plenty of RAM (8GB).
> So in the meantime, the workaround seems to be the parameter
> "btl_tcp_eager_limit", but a cleaner solution would be very welcome :-)
>
> TMHieu
>
> 2010/3/5 Aurélien Bouteiller <bouteill_at_[hidden]>:
> > Hi,
> >
> > Setting the eager limit to such a drastically high value will cause
> > gigantic memory consumption for unexpected messages. Any message you
> > send that does not have a preposted, ready recv will allocate 150MB
> > of temporary storage and will be memcopied from that internal buffer
> > to the recv buffer when the recv is posted. You should expect very
> > poor bandwidth and probably a crash/abort due to memory exhaustion on
> > the receivers.
> >
> > Aurelien
> > --
> > Dr. Aurelien Bouteiller
> > Innovative Computing Laboratory
> > University of Tennessee
> > Knoxville, TN, USA
> >
> >
> > On 4 March 2010 at 09:02, TRINH Minh Hieu wrote:
> >
> >> Hi,
> >>
> >> I have a new discovery about this problem:
> >>
> >> It seems that the array size that can be sent from a 32-bit to a
> >> 64-bit machine is proportional to the parameter "btl_tcp_eager_limit".
> >> When I set it to 200 000 000 (2e08 bytes, about 190MB), I can send an
> >> array of up to 2e07 doubles (152MB).
> >>
> >> I didn't find much information about btl_tcp_eager_limit other than
> >> in the output of the "ompi_info --all" command. If I leave it at 2e08,
> >> will it impact the performance of OpenMPI?
> >>
> >> It may also be worth noting that if the master (rank 0) is a 32-bit
> >> machine, I don't get a segfault. I can send a big array with a small
> >> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
> >>
> >> Do I have to move this thread to the devel mailing list?
> >>
> >> Regards,
> >>
> >> TMHieu
> >>
> >> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtrinh_at_[hidden]> wrote:
> >>> Hello,
> >>>
> >>> Yes, I compiled OpenMPI with --enable-heterogeneous. More precisely,
> >>> I compiled with:
> >>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous \
> >>>     --enable-cxx-exceptions --enable-shared \
> >>>     --enable-orterun-prefix-by-default
> >>> $ make all install
> >>>
> >>> I have attached the ompi_info output from my 2 machines.
> >>>
> >>> TMHieu
> >>>
> >>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >>>> Did you configure Open MPI with --enable-heterogeneous?
> >>>>
> >>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I have some problems running MPI on my heterogeneous cluster. More
> >>>>> precisely, I get a segmentation fault when sending a large array
> >>>>> (about 10000) of doubles from an i686 machine to an x86_64 machine.
> >>>>> It does not happen with a small array. Here is the send/recv source
> >>>>> code (the complete source is in the attached file):
> >>>>> ======== code ================
> >>>>> if (me == 0) {
> >>>>>     /* Rank 0 receives one array of n doubles from every other rank. */
> >>>>>     for (int pe = 1; pe < nprocs; pe++)
> >>>>>     {
> >>>>>         printf("Receiving from proc %d : ", pe); fflush(stdout);
> >>>>>         d = (double *)malloc(sizeof(double) * n);
> >>>>>         MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
> >>>>>         printf("OK\n"); fflush(stdout);
> >>>>>     }
> >>>>>     printf("All done.\n");
> >>>>> }
> >>>>> else {
> >>>>>     /* Every other rank sends n doubles to rank 0 with tag 999. */
> >>>>>     d = (double *)malloc(sizeof(double) * n);
> >>>>>     MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
> >>>>> }
> >>>>> ======== code ================
> >>>>>
> >>>>> I get a segmentation fault with n=10000 but no error with n=1000.
> >>>>> I have 2 machines:
> >>>>> sbtn155 : Intel Xeon, x86_64
> >>>>> sbtn211 : Intel Pentium 4, i686
> >>>>>
> >>>>> The code is compiled on both the x86_64 and the i686 machine, using
> >>>>> OpenMPI 1.4.1, installed in /tmp/openmpi:
> >>>>> [mhtrinh_at_sbtn211 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
> >>>>>
> >>>>> [mhtrinh_at_sbtn155 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
> >>>>>
> >>>>> I ran the code using an appfile and got these errors:
> >>>>> $ cat appfile
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn211 -np 1 hetero.i686
> >>>>>
> >>>>> $ mpirun -hetero --app appfile
> >>>>> Input array length :
> >>>>> 10000
> >>>>> Receiving from proc 1 : OK
> >>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> >>>>> [sbtn155:26386] Signal: Segmentation fault (11)
> >>>>> [sbtn155:26386] Signal code: Address not mapped (1)
> >>>>> [sbtn155:26386] Failing at address: 0x200627bd8
> >>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> >>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
> >>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
> >>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
> >>>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
> >>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
> >>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
> >>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> >>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> >>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> >>>>> [sbtn155:26386] *** End of error message ***
> >>>>> --------------------------------------------------------------------------
> >>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> >>>>> exited on signal 11 (Segmentation fault).
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> Am I missing an option needed to run on a heterogeneous cluster?
> >>>>> Do MPI_Send/Recv have an array size limit when using a heterogeneous
> >>>>> cluster?
> >>>>> Thanks for your help. Regards
> >>>>>
> >>>>> --
> >>>>> ============================================
> >>>>> M. TRINH Minh Hieu
> >>>>> CEA, IBEB, SBTN/LIRM,
> >>>>> F-30207 Bagnols-sur-Cèze, FRANCE
> >>>>> ============================================
> >>>>>
> >>>>> <hetero.c.bz2>_______________________________________________
> >>>>> users mailing list
> >>>>> users_at_[hidden]
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquyres_at_[hidden]
> >>>> For corporate legal information go to:
> >>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>
> >>>>
> >>>
> >>
> >
> >
> >
>
>
>
> --
> ============================================
> M. TRINH Minh Hieu
> CEA, IBEB, SBTN/LIRM,
> F-30207 Bagnols-sur-Cèze, FRANCE
> ============================================
>

-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/