Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Asynchronous behaviour of MPI Collectives
From: Gabriele Fatigati (g.fatigati_at_[hidden])
Date: 2009-01-27 05:10:14


Wow! Great and useful explanation.
Thanks, Jeff.

2009/1/23 Jeff Squyres <jsquyres_at_[hidden]>:
> FWIW, OMPI v1.3 is much better about registered memory usage than the 1.2
> series. We introduced some new things, including being able to specify
> exactly which receive queues you want. See:
>
> ...gaaah! It's not on our FAQ yet. :-(
>
> The main idea is that there is a new MCA parameter for the openib BTL:
> btl_openib_receive_queues. It takes a colon-delimited string listing one or
> more receive queues of specific sizes and characteristics. For now, all
> processes in the job *must* use the same string. You can specify three
> kinds of receive queues:
>
> - P: per-peer queues
> - S: shared receive queues
> - X: XRC queues (with OFED 1.4 and later with specific Mellanox hardware)
>
> Here's a copy-n-paste of our help file describing the format of each:
>
> Per-peer receive queues require between 1 and 5 parameters:
>
> 1. Buffer size in bytes (mandatory)
> 2. Number of buffers (optional; defaults to 8)
> 3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
> 4. Credit window size (optional; defaults to (low_watermark / 2))
> 5. Number of buffers reserved for credit messages (optional;
> defaults to (num_buffers*2-1)/credit_window)
>
> Example: P,128,256,128,16
> - 128 byte buffers
> - 256 buffers to receive incoming MPI messages
> - When the number of available buffers reaches 128, re-post 128 more
> buffers to reach a total of 256
> - If the number of available credits reaches 16, send an explicit
> credit message to the sender
> - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
> reserved for explicit credit messages
>
> Shared receive queues can take between 1 and 4 parameters:
>
> 1. Buffer size in bytes (mandatory)
> 2. Number of buffers (optional; defaults to 16)
> 3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
> 4. Maximum number of outstanding sends a sender can have (optional;
> defaults to (low_watermark / 4))
>
> Example: S,1024,256,128,32
> - 1024 byte buffers
> - 256 buffers to receive incoming MPI messages
> - When the number of available buffers reaches 128, re-post 128 more
> buffers to reach a total of 256
> - A sender will not send to a peer unless it has less than 32
> outstanding sends to that peer.
>
> IIRC, "X" takes the same parameters as "S"...? Note that if you use *any*
> XRC queues, then *all* of your queues must be XRC.
>
> OMPI defaults to a btl_openib_receive_queues value that may be specific to
> your hardware. For example, ConnectX defaults to the following value:
>
> shell$ ompi_info --param btl openib --parsable | grep receive_queues
> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma
> delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
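>
> If you want to experiment with this from inside the application rather than
> on the mpirun command line, here's a minimal, untested C sketch. It assumes
> the usual OMPI_MCA_<param> environment-variable convention and that the
> value gets picked up when the openib BTL initializes during MPI_Init; the
> queue string itself is just an illustration, not a recommended setting:
>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     /* One per-peer queue plus one shared receive queue, using the syntax
>        described above.  Roughly equivalent to:
>        mpirun --mca btl_openib_receive_queues P,128,256,128,16:S,65536,256,128,32 ... */
>     setenv("OMPI_MCA_btl_openib_receive_queues",
>            "P,128,256,128,16:S,65536,256,128,32", 1);
>
>     MPI_Init(&argc, &argv);
>     /* ... application ... */
>     MPI_Finalize();
>     return 0;
> }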
>
> Hope that helps!
>
>
>
>
> On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:
>
>> Hi Gabriele,
>> it might be that your message size is too large for available memory per
>> node.
>> I had a problem with IMB when I was not able to run Alltoall to completion
>> at N=128, ppn=8 on our cluster with 16 GB per node. You'd think 16 GB is
>> quite a lot, but when you do the maths:
>> 2 * 4 MB * 128 procs * 8 procs/node = 8 GB/node, and you need to double
>> that because of buffering. I was told by Mellanox (our cards are ConnectX
>> cards) that they introduced XRC in OFED 1.3, in addition to Shared Receive
>> Queues, which should reduce the memory footprint, but I have not tested
>> this yet.
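>>
>> (Just redoing that arithmetic as a trivial sketch, with the numbers above
>> plugged in; the 4 MB message size and 8 procs/node are from our runs:)
>>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>>     /* 2 buffers (send + recv) * 4 MB message * 128 procs, for each of the
>>        8 procs on a node; doubled again below for internal buffering. */
>>     double gb = 2.0 * 4 * 128 * 8 / 1024.0;   /* = 8 GB per node */
>>     printf("%.0f GB/node, ~%.0f GB/node with buffering\n", gb, 2.0 * gb);
>>     return 0;
>> }
>>
>> which lands right at the 16 GB our nodes have.
>>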
>> HTH,
>> Igor
>> 2009/1/23 Gabriele Fatigati <g.fatigati_at_[hidden]>
>> Hi Igor,
>> My message size is 4096 KB and I have 4 procs per core.
>> There isn't any difference using different algorithms.
>>
>> 2009/1/23 Igor Kozin <i.n.kozin_at_[hidden]>:
>> > what is your message size and the number of cores per node?
>> > is there any difference using different algorithms?
>> >
>> > 2009/1/23 Gabriele Fatigati <g.fatigati_at_[hidden]>
>> >>
>> >> Hi Jeff,
>> >> I would like to understand why, when I run on 512 procs or more, my
>> >> code hangs in an MPI collective, even with a small send buffer. All
>> >> processes are stuck in the call, doing nothing. But if I add an
>> >> MPI_Barrier after the MPI collective, it works! I run over an
>> >> InfiniBand network.
>> >>
>> >> I know many people with this strange problem; I think there is a
>> >> strange interaction between InfiniBand and Open MPI that causes it.
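>> >>
>> >> As a stripped-down sketch of what I mean (the collective here is just
>> >> MPI_Bcast as a stand-in; the real code uses other collectives too):
>> >>
>> >> #include <mpi.h>
>> >>
>> >> void exchange(double *buf, int count, MPI_Comm comm)
>> >> {
>> >>     /* At >= 512 procs this call appears to hang ... */
>> >>     MPI_Bcast(buf, count, MPI_DOUBLE, 0, comm);
>> >>     /* ... but with an explicit barrier right after it, it works. */
>> >>     MPI_Barrier(comm);
>> >> }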
>> >>
>> >>
>> >>
>> >> 2009/1/23 Jeff Squyres <jsquyres_at_[hidden]>:
>> >> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >> >
>> >> >> I've noticed that Open MPI has asynchronous behaviour in the
>> >> >> collective calls. The processes don't wait for the other procs to
>> >> >> arrive in the call.
>> >> >
>> >> > That is correct.
>> >> >
>> >> >> This behaviour can sometimes cause problems with a lot of
>> >> >> processes in the job.
>> >> >
>> >> > Can you describe what exactly you mean? The MPI spec specifically
>> >> > allows this behavior; OMPI made specific design choices and
>> >> > optimizations to support this behavior. FWIW, I'd be pretty surprised
>> >> > if any optimized MPI implementation defaults to fully synchronous
>> >> > collective operations.
>> >> >
>> >> >> Is there an Open MPI parameter to keep all processes in the collective
>> >> >> call until it is finished? Otherwise I have to insert many MPI_Barrier
>> >> >> calls in my code, and it is very tedious and strange.
>> >> >
>> >> > As you have noted, MPI_Barrier is the *only* collective operation that
>> >> > MPI guarantees to have any synchronization properties (and it's a
>> >> > fairly weak guarantee at that; no process will exit the barrier until
>> >> > every process has entered the barrier -- but there's no guarantee that
>> >> > all processes leave the barrier at the same time).
>> >> >
>> >> > Why do you need your processes to exit collective operations at the
>> >> > same
>> >> > time?
>> >> >
>> >> > --
>> >> > Jeff Squyres
>> >> > Cisco Systems
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Ing. Gabriele Fatigati
>> >>
>> >> Parallel programmer
>> >>
>> >> CINECA Systems & Tecnologies Department
>> >>
>> >> Supercomputing Group
>> >>
>> >> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> >>
>> >> www.cineca.it Tel: +39 051 6171722
>> >>
>> >> g.fatigati [AT] cineca.it
>> >
>> >
>> >
>>
>>
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.it Tel: +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it                    Tel:   +39 051 6171722
g.fatigati [AT] cineca.it