
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Inherent limit on #communicators?
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-30 15:11:41

On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett <brbarret_at_[hidden]> wrote:

> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>> Brian W. Barrett wrote:
>>> When we added the CM PML, we added a pml_max_contextid field to the PML
>>> structure, which is the max size cid the PML can handle (because the
>>> matching interfaces don't allow 32 bits to be used for the cid). At the same
>>> time, the max cid for OB1 was shrunk significantly, so that the header on a
>>> short message would be packed tightly with no alignment padding.
>>> At the time, we believed 32k simultaneous communicators was plenty, and
>>> that CIDs were reused (we checked, I'm pretty sure). It sounds like someone
>>> removed the CID reuse code, which seems rather bad to me.
>> yes, we added the block algorithm. Not reusing a CID actually doesn't strike
>> me as that dramatic, and I am still not convinced that it is:-) We
>> do not have an empty array or something like that, it's just a number.
>> The reason for the block algorithm was that the performance of our
>> communicator creation code sucked, and the cid allocation was one portion of
>> that. We used to require *at least* 4 collective operations per communicator
>> creation at that time. We are now potentially down to 0, thanks in part
>> to the block algorithm.
>> However, let me think about reusing entire blocks; it's probably doable,
>> it just requires a little more bookkeeping...
>>> There have to be unused CIDs in Ralph's example - is there a way to
>>> fall back out of the block algorithm when it can't find a new CID and find
>>> one it can reuse? Other than setting the multi-threaded case back on, that
>>> is?
>> remember that it's not the communicator id allocation that is failing at
>> this point, so the question is do we have to 'validate' a cid with the pml
>> before we declare it to be ok?
> well, that's only because the code's doing something it shouldn't. Have a
> look at comm_cid.c:185 - there's the check we added to the multi-threaded
> case (which was the only case when we added it). The cid generation should
> never generate a number larger than mca_pml.pml_max_contextid. I'm actually
> somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got
> a valid cid in add_comm, which it should probably do.

Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.
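
For anyone following along, here is a minimal sketch of the kind of bound
check Brian describes: the allocator refuses to hand out a cid above the
PML's limit, and the PML side rejects one as well, so the caller sees an
error instead of a hang. Every name below (ex_pml_t, ex_cid_next,
ex_add_comm, the EX_* codes) is hypothetical and made up for illustration;
this is not the actual comm_cid.c or OB1 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes, for illustration only. */
enum { EX_SUCCESS = 0, EX_ERR_OUT_OF_RESOURCE = -1 };

/* Hypothetical stand-in for the limit a PML advertises. */
typedef struct {
    uint32_t max_contextid;   /* largest cid the PML can match on */
} ex_pml_t;

/* Monotonic cid counter with no reuse (64-bit so it cannot wrap
 * around and silently hand out a duplicate cid). */
typedef struct {
    uint64_t next;
} ex_cid_counter_t;

/* Hand out the next cid, or report exhaustion once the PML limit
 * is passed. On failure *cid_out is left untouched. */
static int ex_cid_next(ex_cid_counter_t *ctr, const ex_pml_t *pml,
                       uint32_t *cid_out)
{
    assert(ctr != NULL && pml != NULL && cid_out != NULL);
    if (ctr->next > pml->max_contextid) {
        return EX_ERR_OUT_OF_RESOURCE;
    }
    *cid_out = (uint32_t)ctr->next++;
    return EX_SUCCESS;
}

/* Defensive check on the PML side: reject a cid it cannot represent. */
static int ex_add_comm(const ex_pml_t *pml, uint32_t cid)
{
    assert(pml != NULL);
    return (cid <= pml->max_contextid) ? EX_SUCCESS
                                       : EX_ERR_OUT_OF_RESOURCE;
}
```

With both checks in place, running out of cids would at least surface as an
error return from communicator creation.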

> Looking at the differences between v1.2 and v1.3, the max_contextid code
> was already in v1.2 and OB1 was setting it to 32k. So the cid blocking code
> removed a rather critical feature and probably should be fixed or removed
> for v1.3. On Portals, I only get 8k cids, so not having reuse is a rather
> large problem.
> Brian
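
To illustrate why reuse matters with a limit as small as 8k, here is a rough
sketch of a cid pool that hands out the lowest free cid and takes cids back
when a communicator is freed. It is purely local bookkeeping: a real
allocator also has to get every process in the communicator to agree on the
same cid, which this ignores. The names (ex_cid_pool_t, ex_pool_acquire,
ex_pool_release) and the 8191 limit are assumptions for the example, not
Open MPI code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed limit for the example: 8k cids, numbered 0..8191. */
#define EX_MAX_CONTEXTID 8191u

/* One bit per cid; a set bit means the cid is in use. */
typedef struct {
    uint8_t used[(EX_MAX_CONTEXTID + 1u) / 8u];
} ex_cid_pool_t;

static void ex_pool_init(ex_cid_pool_t *p)
{
    assert(p != NULL);
    memset(p->used, 0, sizeof p->used);
}

/* Claim the lowest free cid. Returns 0 on success, -1 if every cid
 * up to EX_MAX_CONTEXTID is currently in use. */
static int ex_pool_acquire(ex_cid_pool_t *p, uint32_t *cid_out)
{
    assert(p != NULL && cid_out != NULL);
    for (uint32_t cid = 0; cid <= EX_MAX_CONTEXTID; ++cid) {
        uint8_t mask = (uint8_t)(1u << (cid % 8u));
        if ((p->used[cid / 8u] & mask) == 0) {
            p->used[cid / 8u] |= mask;
            *cid_out = cid;
            return 0;
        }
    }
    return -1;
}

/* Return a cid to the pool so a later communicator can reuse it.
 * Out-of-range values are ignored. */
static void ex_pool_release(ex_cid_pool_t *p, uint32_t cid)
{
    assert(p != NULL);
    if (cid <= EX_MAX_CONTEXTID) {
        p->used[cid / 8u] &= (uint8_t)~(1u << (cid % 8u));
    }
}
```

With reuse, a loop that creates and frees one communicator at a time never
gets past cid 0; without it, the same loop would run out after 8192
iterations under this limit.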