On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett <firstname.lastname@example.org> wrote:
On Thu, 30 Apr 2009, Edgar Gabriel wrote:
well, that's only because the code's doing something it shouldn't. Have a look at comm_cid.c:185 - there's the check we added to the multi-threaded case (which was the only case when we added it). The cid generation should never generate a number larger than mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got a valid cid in add_comm, which it should probably do.
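A minimal sketch of the kind of check being suggested for add_comm, under stated assumptions: the struct, function, and return codes below are hypothetical stand-ins rather than the actual OB1 code, and only the pml_max_contextid field comes from this thread. The 32767 limit in the usage note is likewise just an illustrative value for "32k".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the PML module; only the field discussed
 * in this thread (pml_max_contextid) is modeled. */
typedef struct {
    uint32_t pml_max_contextid;   /* largest cid this PML can encode */
} sketch_pml_t;

/* Hypothetical return codes for the sketch. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_OUT_OF_RESOURCE = -1 };

/* Refuse a communicator whose cid the PML cannot represent, instead of
 * accepting it silently and misbehaving later. */
static int sketch_add_comm(const sketch_pml_t *pml, uint32_t cid)
{
    assert(pml != NULL);
    if (cid > pml->pml_max_contextid) {
        fprintf(stderr, "cid %u exceeds PML limit %u\n",
                (unsigned)cid, (unsigned)pml->pml_max_contextid);
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    return SKETCH_SUCCESS;
}
```

With a limit of 32767, sketch_add_comm accepts cid 32767 and rejects cid 32768 with an error return, so the caller sees a failure rather than a hang.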
Brian W. Barrett wrote:
When we added the CM PML, we added a pml_max_contextid field to the PML structure, which is the max size cid the PML can handle (because the matching interfaces don't allow 32 bits to be used for the cid). At the same time, the max cid for OB1 was shrunk significantly, so that the header on a short message would be packed tightly with no alignment padding.
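To illustrate why a narrower cid lets a header pack without padding, here is a rough sketch. The two structs are invented for this example and are not the real OB1 match header; the field names and widths are assumptions, and the sizes noted in the comments hold on common ABIs where a 32-bit integer is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header with a 16-bit context id: the two 1-byte fields
 * and the 2-byte ctx fill one 4-byte slot, so no padding is needed. */
typedef struct {
    uint8_t  type;
    uint8_t  flags;
    uint16_t ctx;
    int32_t  src;
    int32_t  tag;
} sketch_hdr_narrow_t;

/* Same header with a 32-bit context id: on common ABIs the compiler
 * inserts 2 bytes after flags so that ctx is 4-byte aligned. */
typedef struct {
    uint8_t  type;
    uint8_t  flags;
    uint32_t ctx;
    int32_t  src;
    int32_t  tag;
} sketch_hdr_wide_t;

/* Sum of the field sizes, ignoring any padding. */
#define SKETCH_NARROW_FIELD_BYTES (1u + 1u + 2u + 4u + 4u)
#define SKETCH_WIDE_FIELD_BYTES   (1u + 1u + 4u + 4u + 4u)

/* Bytes of padding the compiler added to a struct. */
static size_t sketch_padding_bytes(size_t struct_size, size_t field_bytes)
{
    assert(struct_size >= field_bytes);
    return struct_size - field_bytes;
}
```

Comparing sketch_padding_bytes(sizeof(sketch_hdr_narrow_t), SKETCH_NARROW_FIELD_BYTES) with the same call for the wide struct shows the difference on a given platform.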
At the time, we believed 32k simultaneous communicators was plenty, and that CIDs were reused (we checked, I'm pretty sure). It sounds like someone removed the CID reuse code, which seems rather bad to me.
yes, we added the block algorithm. Not reusing a CID doesn't strike me as that dramatic, though, and I am still not convinced that it is a problem :-) We do not have an empty array or anything like that, it's just a number.
The reason for the block algorithm was that the performance of our communicator creation code sucked, and the cid allocation was one portion of that. We used to require *at least* 4 collective operations per communicator creation at that time. We are now potentially down to 0, among others thanks to the block algorithm.
However, let me think about reusing entire blocks. It's probably doable, it just requires a little more bookkeeping...
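One way the extra bookkeeping for reusing entire blocks might look, as a purely local sketch: released block bases go onto a small stack and are handed out again before any fresh block is taken. All names, the block size, and the stack capacity are hypothetical, and this does not model how the processes of a communicator agree on a block, which is the hard part of the real algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 8u          /* assumed cids per block */
#define SKETCH_MAX_FREE   64u         /* capacity of the free-block stack */
#define SKETCH_NO_BLOCK   UINT32_MAX  /* sentinel: no block available */

typedef struct {
    uint32_t next_base;                    /* first cid never handed out */
    uint32_t max_cid;                      /* largest cid the PML accepts */
    uint32_t free_bases[SKETCH_MAX_FREE];  /* bases of released blocks */
    uint32_t nfree;
} sketch_cid_alloc_t;

static void sketch_alloc_init(sketch_cid_alloc_t *a, uint32_t max_cid)
{
    assert(a != NULL);
    /* Keep max_cid well below the sentinel so next_base cannot wrap. */
    assert(max_cid < UINT32_MAX - SKETCH_BLOCK_SIZE);
    a->next_base = 0;
    a->max_cid = max_cid;
    a->nfree = 0;
}

/* Return the base cid of a block, preferring a released block. */
static uint32_t sketch_block_acquire(sketch_cid_alloc_t *a)
{
    if (a->nfree > 0) {
        return a->free_bases[--a->nfree];
    }
    /* A fresh block must fit entirely at or below max_cid. */
    if (a->max_cid < SKETCH_BLOCK_SIZE - 1u ||
        a->next_base > a->max_cid - (SKETCH_BLOCK_SIZE - 1u)) {
        return SKETCH_NO_BLOCK;
    }
    uint32_t base = a->next_base;
    a->next_base += SKETCH_BLOCK_SIZE;
    return base;
}

/* Give a whole block back; returns -1 if the free stack is full. */
static int sketch_block_release(sketch_cid_alloc_t *a, uint32_t base)
{
    if (a->nfree == SKETCH_MAX_FREE) {
        return -1;
    }
    a->free_bases[a->nfree++] = base;
    return 0;
}
```

With max_cid set to 15 there is room for exactly two blocks; a third acquire fails until one of the first two is released, after which the released base is handed out again.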
There have to be unused CIDs in Ralph's example - is there a way to fall back out of the block algorithm when it can't find a new CID and find one it can reuse? Other than setting the multi-threaded case back on, that is?
remember that it's not the communicator id allocation that is failing at this point, so the question is: do we have to 'validate' a cid with the pml before we declare it to be ok?
Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.
Looking at the differences between v1.2 and v1.3, the max_contextid code was already in v1.2 and OB1 was setting it to 32k. So the cid blocking code removed a rather critical feature and probably should be fixed or removed for v1.3. On Portals, I only get 8k cids, so not having reuse is a rather large problem.