On Thu, 30 Apr 2009, Edgar Gabriel wrote:
> Brian W. Barrett wrote:
>> When we added the CM PML, we added a pml_max_contextid field to the PML
>> structure, which is the max size cid the PML can handle (because the
>> matching interfaces don't allow 32 bits to be used for the cid). At the
>> same time, the max cid for OB1 was shrunk significantly, so that the header
>> on a short message would be packed tightly with no alignment padding.
>> At the time, we believed 32k simultaneous communicators was plenty, and
>> that CIDs were reused (we checked, I'm pretty sure). It sounds like
>> someone removed the CID reuse code, which seems rather bad to me.
> yes, we added the block algorithm. Not reusing a CID actually doesn't strike
> me as that dramatic, and I am still not convinced that it is:-) We do
> not have an empty array or something like that, it's just a number.
> The reason for the block algorithm was that the performance of our
> communicator creation code sucked, and the cid allocation was one portion of
> that. We used to require *at least* 4 collective operations per communicator
> creation at that time. We are now potentially down to 0, among others thanks
> to the block algorithm.
> However, let me think about reusing entire blocks, it's probably doable, it
> just requires a little more bookkeeping...
>> There have to be unused CIDs in Ralph's example - is there a way to
>> fallback out of the block algorithm when it can't find a new CID and find
>> one it can reuse? Other than setting the multi-threaded case back on, that
> remember that it's not the communicator id allocation that is failing at this
> point, so the question is do we have to 'validate' a cid with the pml before
> we declare it to be ok?
well, that's only because the code's doing something it shouldn't. Have a
look at comm_cid.c:185 - there's the check we added to the multi-threaded
case (which was the only case when we added it). The cid generation
should never generate a number larger than mca_pml.pml_max_contextid.
I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear
to check it got a valid cid in add_comm, which it should probably do.
Looking at the differences between v1.2 and v1.3, the max_contextid code
was already in v1.2 and OB1 was setting it to 32k. So the cid blocking
code removed a rather critical feature and probably should be fixed or
removed for v1.3. On Portals, I only get 8k cids, so not having reuse is
a rather large problem.