
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Asynchronous behaviour of MPI Collectives
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-26 13:55:11


Actually, I found out that the help message I pasted lies a little:
the "number of buffers" parameter for both PP and SRQ types is
mandatory, not optional.
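
So each queue specification needs at least two values: the buffer size
and the number of buffers. For example, to set the value explicitly on
the mpirun command line (this just restates the ConnectX default quoted
below; "./your_mpi_app" is a placeholder):

  shell$ mpirun --mca btl_openib_receive_queues \
             P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 \
             ./your_mpi_app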

On Jan 23, 2009, at 2:59 PM, Jeff Squyres wrote:
> Here's a copy-n-paste of our help file describing the format of each:
>
> Per-peer receive queues require between 1 and 5 parameters:
>
> 1. Buffer size in bytes (mandatory)
> 2. Number of buffers (optional; defaults to 8)
> 3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
> 4. Credit window size (optional; defaults to (low_watermark / 2))
> 5. Number of buffers reserved for credit messages (optional;
>    defaults to ((num_buffers * 2) - 1) / credit_window)
>
> Example: P,128,256,128,16
> - 128 byte buffers
> - 256 buffers to receive incoming MPI messages
> - When the number of available buffers reaches 128, re-post 128 more
> buffers to reach a total of 256
> - If the number of available credits reaches 16, send an explicit
> credit message to the sender
> - The fifth parameter is not given, so it defaults to
>   ((256 * 2) - 1) / 16 = 31; this many buffers are reserved for
>   explicit credit messages
>
> Shared receive queues can take between 1 and 4 parameters:
>
> 1. Buffer size in bytes (mandatory)
> 2. Number of buffers (optional; defaults to 16)
> 3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
> 4. Maximum number of outstanding sends a sender can have (optional;
>    defaults to (low_watermark / 4))
>
> Example: S,1024,256,128,32
> - 1024 byte buffers
> - 256 buffers to receive incoming MPI messages
> - When the number of available buffers reaches 128, re-post 128 more
> buffers to reach a total of 256
> - A sender will not send to a peer unless it has fewer than 32
>   outstanding sends to that peer.
>
> IIRC, "X" takes the same parameters as "S"...? Note that if you use
> *any* XRC queues, then *all* of your queues must be XRC.
>
> OMPI defaults to a btl_openib_receive_queues value that may be specific
> to your hardware. For example, ConnectX defaults to the following value:
>
> shell$ ompi_info --param btl openib --parsable | grep receive_queues
> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>
> Hope that helps!
>
>
>
>
> On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:
>
>> Hi Gabriele,
>> it might be that your message size is too large for the available
>> memory per node. I had a problem with IMB where I was not able to run
>> Alltoall to completion with N=128, ppn=8 on our cluster with 16 GB per
>> node. You'd think 16 GB is quite a lot, but do the maths:
>> 2 * 4 MB * 128 procs * 8 procs/node = 8 GB/node, and you need to
>> double that because of buffering. I was told by Mellanox (our cards
>> are ConnectX cards) that they introduced XRC in OFED 1.3 in addition
>> to Shared Receive Queues, which should reduce the memory footprint,
>> but I have not tested this yet.
>> HTH,
>> Igor
>> 2009/1/23 Gabriele Fatigati <g.fatigati_at_[hidden]>
>> Hi Igor,
>> My message size is 4096 KB and I have 4 procs per core.
>> There isn't any difference using different algorithms.
>>
>> 2009/1/23 Igor Kozin <i.n.kozin_at_[hidden]>:
>> > What is your message size and the number of cores per node?
>> > Is there any difference using different algorithms?
>> >
>> > 2009/1/23 Gabriele Fatigati <g.fatigati_at_[hidden]>
>> >>
>> >> Hi Jeff,
>> >> I would like to understand why, if I run on 512 procs or more, my
>> >> code hangs in an MPI collective, even with a small send buffer. All
>> >> processes are stuck inside the call, doing nothing. But if I add an
>> >> MPI_Barrier after the MPI collective, it works! I run over an
>> >> InfiniBand network.
>> >>
>> >> I know many people with this strange problem; I think there is a
>> >> strange interaction between InfiniBand and Open MPI that causes it.
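>> >>
>> >> Just to illustrate the workaround (a minimal sketch, not my actual
>> >> code; MPI_Bcast here stands in for whichever collective hangs):
>> >>
>> >> #include <mpi.h>
>> >> #include <stdlib.h>
>> >> #include <string.h>
>> >>
>> >> int main(int argc, char **argv)
>> >> {
>> >>     /* roughly the message size discussed here: 4 MB of doubles */
>> >>     size_t n = 4096 * 1024 / sizeof(double);
>> >>     double *buf;
>> >>     int rank;
>> >>
>> >>     MPI_Init(&argc, &argv);
>> >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >>
>> >>     buf = malloc(n * sizeof(double));
>> >>     if (rank == 0)
>> >>         memset(buf, 0, n * sizeof(double)); /* root prepares the data */
>> >>
>> >>     /* The collective itself is not synchronizing: some ranks may
>> >>        enter and leave it long before the others arrive. */
>> >>     MPI_Bcast(buf, (int)n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>> >>
>> >>     /* Workaround: force every rank to wait here until all of them
>> >>        have finished the collective. */
>> >>     MPI_Barrier(MPI_COMM_WORLD);
>> >>
>> >>     free(buf);
>> >>     MPI_Finalize();
>> >>     return 0;
>> >> }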
>> >>
>> >>
>> >>
>> >> 2009/1/23 Jeff Squyres <jsquyres_at_[hidden]>:
>> >> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >> >
>> >> >> I've noted that Open MPI has asynchronous behaviour in the
>> >> >> collective calls. The processes don't wait for the other procs
>> >> >> to arrive in the call.
>> >> >
>> >> > That is correct.
>> >> >
>> >> >> This behaviour can sometimes cause problems when there are a lot
>> >> >> of processes in the job.
>> >> >
>> >> > Can you describe what exactly you mean? The MPI spec specifically
>> >> > allows this behavior; OMPI made specific design choices and
>> >> > optimizations to support this behavior. FWIW, I'd be pretty
>> >> > surprised if any optimized MPI implementation defaults to fully
>> >> > synchronous collective operations.
>> >> >
>> >> >> Is there an Open MPI parameter to lock all processes in the
>> >> >> collective call until it is finished? Otherwise I have to insert
>> >> >> many MPI_Barrier calls in my code, which is very tedious and
>> >> >> strange.
>> >> >
>> >> > As you have noted, MPI_Barrier is the *only* collective operation
>> >> > that MPI guarantees to have any synchronization properties (and
>> >> > it's a fairly weak guarantee at that; no process will exit the
>> >> > barrier until every process has entered the barrier -- but there's
>> >> > no guarantee that all processes leave the barrier at the same
>> >> > time).
>> >> >
>> >> > Why do you need your processes to exit collective operations at
>> >> > the same time?
>> >> >
>> >> > --
>> >> > Jeff Squyres
>> >> > Cisco Systems
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Ing. Gabriele Fatigati
>> >>
>> >> Parallel programmer
>> >>
>> >> CINECA Systems & Tecnologies Department
>> >>
>> >> Supercomputing Group
>> >>
>> >> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> >>
>> >> www.cineca.it Tel: +39 051 6171722
>> >>
>> >> g.fatigati [AT] cineca.it
>> >
>> >
>>
>>
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.it Tel: +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>
>
> --
> Jeff Squyres
> Cisco Systems
>

-- 
Jeff Squyres
Cisco Systems