Open MPI Development Mailing List Archives

Subject: [OMPI devel] RFC: meaning of "btl_XXX_eager_limit"
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-23 09:59:22


WHAT: a) Clarify the actual max MPI payload size for eager messages
       (i.e., the exact meaning of btl_XXX_eager_limit), and b) allow
       network administrators to shape network traffic by publishing
       actual BTL max wire fragment sizes (i.e., MPI max payload size +
       max PML header size + max BTL header size).

WHY: Currently BTL eager_limit values actually have the PML header
      subtracted from them, meaning that the eager_limit is not
      actually the largest MPI message payload size. Terry and Jeff,
      at least, find this misleading. :-) Additionally, BTLs may add
      their own (variable-sized) headers beyond the eager_limit size,
      so it's not possible for a network administrator to shape network
      traffic because they don't (can't) know what a BTL's max wire
      fragment size is.

WHERE: ompi/pml/{ob1,csum,dr}, and likely all BTLs

TIMEOUT: COB, Friday, 31 July 2009

DESCRIPTION:

In trying to fix the checks for eager_limit in the OB1 PML (per
discussion on the OMPI teleconf this past Tuesday), I've come across a
couple of gaps. This RFC is to get others' opinions (mainly Brian
Barrett's and George Bosilca's) on exactly what should be done for
issue #1, and the ok to implement issue #2.

1. The btl_XXX_eager_limit values are the upper limit on each BTL
    payload, but that payload must include the PML header. Hence, the
    max MPI data payload size is (btl_XXX_eager_limit - PML header
    size), and even this depends on which flavor of PML send you are
    using. Terry and Jeff find this misleading. Specifically, if a
    user sets the eager_limit to 1024 bytes and expects their 256
    MPI_INT's (1024 bytes, assuming 4-byte ints) to fit in an eager
    message, they're wrong. Additionally, network
    administrators who try to adjust the eager_limit to fit the max MTU
    size of their networks are unpleasantly surprised because the BTL
    may actually send (btl_XXX_eager_limit + btl_XXX_header_size) bytes
    at a time. Even worse, the value of btl_XXX_header_size is not
    published anywhere, so a network administrator cannot know if
    they're actually going over the MTU size or not.

    --> Note that we only looked at eager_limit -- similar issues
        likely also exist with btl_XXX_max_send_size, and possibly
        btl_XXX_rdma_pipeline_send_length...?
        btl_XXX_rdma_pipeline_frag_size (i.e., the RDMA size) should be
        ok -- I *think* it's an absolute payload size already. If you
        don't remember what these names mean, look at the pretty
        picture here:

  http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3

    There are two solutions I can think of. Which should we do?

    a. Pass the (max?) PML header size down into the BTL during
       initialization such that the btl_XXX_eager_limit can
       represent the max MPI data payload size (i.e., the BTL can size
       its buffers to accommodate its desired max eager payload size,
       its header size, and the PML header size). Thus, the
       eager_limit can truly be the MPI data payload size -- and easy
       to explain to users.

    b. Stay with the current btl_XXX_eager_limit implementation (which
       OMPI has had for a long, long time) and add the code to check
       for btl_eager_limit less than the PML header size (per this past
       Tuesday's discussion). This is the minimal-distance change.

2. OMPI currently does not publish enough information for a user to
    set eager_limit to be able to do BTL traffic shaping. That is, one
    really needs to know the (max) BTL header length and the (max) PML
    header length values to be able to calculate the correct
    eager_limit to force a specific (max) BTL wire fragment size. Our
    proposed solution is to have ompi_info print out the (max) PML and
    BTL header sizes. Regardless of whether 1a) or 1b) is chosen, with
    these two pieces of information, a determined network administrator
    could calculate the max wire fragment size used by OMPI, and
    therefore be able to do at least some traffic shaping.
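
To make the arithmetic in issues 1 and 2 concrete, here is a rough
sketch in Python (not OMPI code). It assumes the current semantics
described above: the eager_limit includes the PML header, and the BTL
adds its own header on top. The two header sizes are made-up
placeholders, since the real values are exactly what is not published
today; the function names are likewise hypothetical.

```python
# Illustrative only: these header sizes are made-up placeholders,
# not values taken from any real PML or BTL.
PML_HDR_MAX = 16   # hypothetical max PML header size, in bytes
BTL_HDR_MAX = 32   # hypothetical max BTL header size, in bytes


def max_mpi_payload(eager_limit, pml_hdr=PML_HDR_MAX):
    """Largest MPI payload sent eagerly when eager_limit includes the PML header."""
    return max(eager_limit - pml_hdr, 0)


def max_wire_fragment(eager_limit, btl_hdr=BTL_HDR_MAX):
    """Largest fragment on the wire: eager_limit plus the BTL's own header."""
    return eager_limit + btl_hdr


def eager_limit_for_wire_size(wire_size, btl_hdr=BTL_HDR_MAX):
    """Eager limit that keeps wire fragments at or below wire_size bytes."""
    return wire_size - btl_hdr


# 256 MPI_INTs, assuming 4-byte ints, is 1024 bytes of MPI payload.
payload = 256 * 4
print(payload <= max_mpi_payload(1024))   # False: does not fit eagerly
print(max_wire_fragment(1024))            # 1056: larger than the limit set
print(eager_limit_for_wire_size(1500))    # 1468: limit for a 1500-byte cap
```

With these placeholder sizes, a 1024-byte eager_limit carries at most
1008 bytes of MPI data yet can put 1056 bytes on the wire, which is
the mismatch both issues describe. If ompi_info printed the real
(max) header sizes, an administrator could plug them in the same way.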

-- 
Jeff Squyres
jsquyres_at_[hidden]