
Subject: Re: [OMPI users] Buffer size limit and memory consumption problem on heterogeneous (32 bit / 64 bit) machines
From: Olivier Riff (oliriff_at_[hidden])
Date: 2010-05-20 07:26:03


Hello Terry,

Thanks for your answer.

2010/5/20 Terry Dontje <terry.dontje_at_[hidden]>

> Olivier Riff wrote:
>
> Hello,
>
> I assume this question has already been discussed many times, but I cannot
> find a solution to my problem on the Internet.
> It is about the buffer size limit of MPI_Send and MPI_Recv on a heterogeneous
> system (32-bit laptop / 64-bit cluster).
> My configuration is:
> Open MPI 1.4, configured with: --without-openib --enable-heterogeneous
> --enable-mpi-threads
> The program is launched from a laptop (32-bit Mandriva 2008) which
> distributes tasks to a cluster of 70 processors (64-bit Red Hat Enterprise
> distribution).
> I have to send buffers of various sizes, from a few bytes up to 30 MB.
>
> You really want to get your program running without the tcp_eager_limit
> setting if you want better memory usage. I believe the crash has something
> to do with the rendezvous protocol in OMPI. Have you narrowed this failure
> down to a simple MPI program? Also I noticed that you're configuring with
> --enable-mpi-threads, have you tried configuring without that option?
>
>
-> No, unfortunately I have not narrowed this behaviour down to a simple MPI
program. I think I will have to do it if I do not find a solution in the
next few days.
I will also run the test without the --enable-mpi-threads configure option.
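
For reference, a minimal reproducer along the lines Terry suggests could look
like the sketch below. This is hypothetical code, not the actual application:
two ranks ping-pong buffers whose sizes straddle the default 64 kB TCP eager
limit, which should show whether the failure comes from the rendezvous path
rather than from the application itself.

/* minimal_sendrecv.c -- hypothetical reproducer sketch, not the real program */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* buffer sizes straddling the default 64 kB eager limit, up to 30 MB */
    size_t sizes[] = { 1024, 65536, 65537, 1 << 20, 30 * (1 << 20) };
    char *buf = malloc(30 * (1 << 20));

    for (int i = 0; i < 5; i++) {
        if (rank == 0) {
            MPI_Send(buf, (int)sizes[i], MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)sizes[i], MPI_BYTE, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("size %zu OK\n", sizes[i]);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)sizes[i], MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, (int)sizes[i], MPI_BYTE, 0, 1, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Running it with one rank on the 32-bit laptop and one on a 64-bit node (e.g.
mpirun -np 2 -machinefile machinefile.txt ./minimal_sendrecv) should reproduce
the >65536-byte failure if it really lives in the transport layer.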

> I tested the following commands:
> 1) mpirun -v -machinefile machinefile.txt MyMPIProgram
> -> crashes on the client side (64-bit Red Hat Enterprise) when the sent
> buffer size is > 65536 bytes.
> 2) mpirun --mca btl_tcp_eager_limit 30000000 -v -machinefile
> machinefile.txt MyMPIProgram
> -> works, but generates gigantic memory consumption on the 32-bit machine
> side after MPI_Recv. Memory consumption goes from 800 MB to 2.1 GB after
> receiving about 20 kB from each of the 70 clients (a total of about 1.4 MB).
> This makes my program crash later because I have no more memory left to
> allocate new structures. I read in an Open MPI forum thread that setting
> btl_tcp_eager_limit to a huge value explains this huge memory consumption
> when a sent message does not have a pre-posted matching receive. Also, after
> all messages have been received and there is no more traffic activity, the
> consumed memory remains at 2.1 GB... and I do not understand why.
>
> Are the 70 clients all on different nodes? I am curious if the 2.1GB is
> due to the SM BTL or possibly a leak in the TCP BTL.
>

No, the 70 clients are spread over only 9 nodes. In fact there are 72
clients: nine 8-processor machines.
The 2.1 GB memory consumption appears when I sequentially try to read the
result from each of the 72 clients (a for loop from 1 to 72 calling
MPI_Recv). I assume that many clients have already sent their result while
the server has not yet called MPI_Recv for the corresponding rank.
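
One way to avoid that build-up is to pre-post all the receives with MPI_Irecv
before the results start arriving, so the data is delivered directly into the
user buffers instead of accumulating in Open MPI's unexpected-message buffers
while the sequential loop catches up. Below is a minimal sketch of that idea,
with illustrative names and sizes only (it assumes one fixed-size result per
client, which may not match the real application):

/* pre-posted receives on the server side (rank 0); hypothetical sketch */
#include <mpi.h>
#include <stdlib.h>

#define MAX_RESULT (20 * 1024)   /* illustrative per-client result size */
#define RESULT_TAG 42

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int nclients = size - 1;                       /* e.g. 72 */
        char *results = malloc((size_t)nclients * MAX_RESULT);
        MPI_Request *reqs = malloc(nclients * sizeof(MPI_Request));

        /* post every receive before (or while) the clients compute */
        for (int i = 0; i < nclients; i++)
            MPI_Irecv(results + (size_t)i * MAX_RESULT, MAX_RESULT, MPI_BYTE,
                      i + 1, RESULT_TAG, MPI_COMM_WORLD, &reqs[i]);

        MPI_Waitall(nclients, reqs, MPI_STATUSES_IGNORE);

        free(reqs);
        free(results);
    } else {
        /* each client sends one result back to rank 0 */
        char *result = calloc(MAX_RESULT, 1);
        MPI_Send(result, MAX_RESULT, MPI_BYTE, 0, RESULT_TAG, MPI_COMM_WORLD);
        free(result);
    }

    MPI_Finalize();
    return 0;
}

With the receives posted up front, the memory growth from unexpected messages
should go away; it does not by itself address the >64 kB crash seen with the
default settings, which still looks like a separate rendezvous-path issue.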

>
> What is the best way to get a working program that also has small memory
> consumption (lower speed is acceptable)?
> I tried to play with the MCA parameters btl_tcp_sndbuf and btl_tcp_rcvbuf,
> but without success.
>
> Thanks in advance for your help.
>
> Best regards,
>
> Olivier
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.650.633.7054
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>