I'll file a ticket against it....oh joy!!! You all know how much I *love*
On Thu, Apr 30, 2009 at 1:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett <brbarret_at_[hidden]> wrote:
>> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>>> Brian W. Barrett wrote:
>>>> When we added the CM PML, we added a pml_max_contextid field to the PML
>>>> structure, which is the max size cid the PML can handle (because the
>>>> matching interfaces don't allow 32 bits to be used for the cid). At the same
>>>> time, the max cid for OB1 was shrunk significantly, so that the header on a
>>>> short message would be packed tightly with no alignment padding.
>>>> At the time, we believed 32k simultaneous communicators was plenty, and
>>>> that CIDs were reused (we checked, I'm pretty sure). It sounds like someone
>>>> removed the CID reuse code, which seems rather bad to me.
>>> yes, we added the block algorithm. Not reusing a CID actually doesn't
>>> strike me as that dramatic, and I am still not convinced about
>>> that :-) We do not have an empty array or something like that, it's just a
>>> The reason for the block algorithm was that the performance of our
>>> communicator creation code sucked, and the cid allocation was one portion of
>>> that. We used to require *at least* 4 collective operations per communicator
>>> creation at that time. We are now potentially down to 0, among others thanks
>>> to the block algorithm.
>>> However, let me think about reusing entire blocks; it's probably doable,
>>> it just requires a little more bookkeeping...
>>>> There have to be unused CIDs in Ralph's example - is there a way to
>>>> fallback out of the block algorithm when it can't find a new CID and find
>>>> one it can reuse? Other than setting the multi-threaded case back on, that
>>> remember that it's not the communicator id allocation that is failing at
>>> this point, so the question is: do we have to 'validate' a cid with the PML
>>> before we declare it to be ok?
>> well, that's only because the code's doing something it shouldn't. Have a
>> look at comm_cid.c:185 - there's the check we added to the multi-threaded
>> case (which was the only case when we added it). The cid generation should
>> never generate a number larger than mca_pml.pml_max_contextid. I'm actually
>> somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got
>> a valid cid in add_comm, which it should probably do.
> Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.
>> Looking at the differences between v1.2 and v1.3, the max_contextid code
>> was already in v1.2 and OB1 was setting it to 32k. So the cid blocking code
>> removed a rather critical feature and probably should be fixed or removed
>> for v1.3. On Portals, I only get 8k cids, so not having reuse is a rather
>> large problem.
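
For illustration only, here is a minimal single-process sketch of the two properties discussed above: the allocator never hands out a cid larger than the PML's limit, and released cids become available again. Every name in it (cid_pool_t, cid_pool_init, cid_alloc, cid_release, CID_POOL_CAPACITY) is hypothetical and is not taken from comm_cid.c or OB1. It also ignores the collective agreement between processes that real communicator creation needs, so it shows the bookkeeping, not the block algorithm itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT the Open MPI implementation. */
#define CID_POOL_CAPACITY 32768u  /* most cids this sketch can track */

typedef struct {
    uint32_t max_contextid;              /* largest cid the PML accepts */
    bool     in_use[CID_POOL_CAPACITY];  /* in_use[c] is true while cid c is held */
} cid_pool_t;

/* Initialise the pool. Returns false if the limit exceeds what we can track. */
static bool cid_pool_init(cid_pool_t *pool, uint32_t max_contextid)
{
    assert(pool != NULL);
    if (max_contextid >= CID_POOL_CAPACITY) {
        return false;
    }
    pool->max_contextid = max_contextid;
    memset(pool->in_use, 0, sizeof(pool->in_use));
    return true;
}

/* Return the lowest free cid that is <= max_contextid, or -1 if none is left.
 * The loop bound is the 'validation' step: a cid above the PML limit can
 * never be returned. */
static int32_t cid_alloc(cid_pool_t *pool)
{
    assert(pool != NULL);
    for (uint32_t cid = 0; cid <= pool->max_contextid; ++cid) {
        if (!pool->in_use[cid]) {
            pool->in_use[cid] = true;
            return (int32_t)cid;
        }
    }
    return -1;  /* exhausted: report an error instead of hanging */
}

/* Mark a cid as free so that a later cid_alloc() can reuse it. */
static void cid_release(cid_pool_t *pool, int32_t cid)
{
    assert(pool != NULL);
    if (cid >= 0 && (uint32_t)cid <= pool->max_contextid) {
        pool->in_use[cid] = false;
    }
}
```

With a small limit, such as the 8k cids mentioned for Portals, a pool like this runs dry quickly unless freed communicators call something like cid_release(). Returning -1 at exhaustion also gives the caller a clean failure to report.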