so I agree that we need to fix that, and we'll get a fix for that as
soon as possible. It still strikes me as wrong, however, that we have
fundamentally different types on two layers for the same 'item'.
I still think that going back to the original algorithm would be bad -
especially for an application that creates such a large number of
communicators potentially executed on a large number (1000s) of
processors. I'll look into how to reuse an entire block of communicator
cids, and how to take the max_contextid into account.
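To make the block-reuse idea a bit more concrete, here is a minimal sketch, not the actual comm_cid.c code. All names in it (cid_pool_t, cid_block_acquire, cid_block_release, CID_BLOCK_SIZE, CID_MAX_BLOCKS) are hypothetical, and the max_contextid field merely stands in for mca_pml.pml_max_contextid. It is purely local bookkeeping and assumes the processes involved would still have to agree on which block is released and reused.

```c
#include <stdbool.h>
#include <stdint.h>

#define CID_BLOCK_SIZE 8   /* hypothetical number of cids per block */
#define CID_MAX_BLOCKS 64  /* hypothetical capacity of the free-block stack */

typedef struct {
    uint32_t max_contextid;               /* stand-in for mca_pml.pml_max_contextid */
    uint64_t next_unused;                 /* first cid never handed out (64 bit, so it cannot wrap) */
    uint32_t free_blocks[CID_MAX_BLOCKS]; /* start cids of released blocks */
    int      num_free;                    /* number of entries in free_blocks */
} cid_pool_t;

static void cid_pool_init(cid_pool_t *pool, uint32_t max_contextid)
{
    pool->max_contextid = max_contextid;
    pool->next_unused = 0;
    pool->num_free = 0;
}

/* Hand out the start cid of a block. Released blocks are reused first;
 * a fresh block is only carved out if its last cid still fits below
 * max_contextid. Returns false when no block is available. */
static bool cid_block_acquire(cid_pool_t *pool, uint32_t *start)
{
    if (pool->num_free > 0) {
        *start = pool->free_blocks[--pool->num_free];
        return true;
    }
    if (pool->next_unused + CID_BLOCK_SIZE - 1 > pool->max_contextid) {
        return false; /* would exceed what the PML can handle */
    }
    *start = (uint32_t) pool->next_unused;
    pool->next_unused += CID_BLOCK_SIZE;
    return true;
}

/* Return a whole block for later reuse. Returns false if the
 * free-block stack is full (the block is then simply lost). */
static bool cid_block_release(cid_pool_t *pool, uint32_t start)
{
    if (pool->num_free >= CID_MAX_BLOCKS) {
        return false;
    }
    pool->free_blocks[pool->num_free++] = start;
    return true;
}
```

With max_contextid set to 15 and a block size of 8, two blocks (starting at 0 and 8) can be acquired, a third request fails, and releasing the first block makes the next request succeed again with start cid 0. The extra bookkeeping is just the small stack of released block starts.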
Brian W. Barrett wrote:
> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>> Brian W. Barrett wrote:
>>> When we added the CM PML, we added a pml_max_contextid field to the
>>> PML structure, which is the max size cid the PML can handle (because
>>> the matching interfaces don't allow 32 bits to be used for the cid).
>>> At the same time, the max cid for OB1 was shrunk significantly, so
>>> that the header on a short message would be packed tightly with no
>>> alignment padding.
>>> At the time, we believed 32k simultaneous communicators was plenty,
>>> and that CIDs were reused (we checked, I'm pretty sure). It sounds
>>> like someone removed the CID reuse code, which seems rather bad to me.
>> yes, we added the block algorithm. Not reusing a CID actually doesn't
>> strike me as that dramatic, and I am still not sure and convinced about
>> that:-) We do not have an empty array or something like that, it's just
>> a number.
>> The reason for the block algorithm was that the performance of our
>> communicator creation code sucked, and the cid allocation was one
>> portion of that. We used to require *at least* 4 collective operations
>> per communicator creation at that time. We are now potentially down to
>> 0, thanks among other things to the block algorithm.
>> However, let me think about reusing entire blocks, it's probably doable,
>> it just requires a little more bookkeeping...
>>> There have to be unused CIDs in Ralph's example - is there a way to
>>> fallback out of the block algorithm when it can't find a new CID and
>>> find one it can reuse? Other than setting the multi-threaded case
>>> back on, that is?
>> remember that it's not the communicator id allocation that is failing
>> at this point, so the question is do we have to 'validate' a cid with
>> the pml before we declare it to be ok?
> well, that's only because the code's doing something it shouldn't. Have
> a look at comm_cid.c:185 - there's the check we added to the
> multi-threaded case (which was the only case when we added it). The cid
> generation should never generate a number larger than
> mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails
> gracefully, as OB1 doesn't appear to check it got a valid cid in
> add_comm, which it should probably do.
> Looking at the differences between v1.2 and v1.3, the max_contextid code
> was already in v1.2 and OB1 was setting it to 32k. So the cid blocking
> code removed a rather critical feature and probably should be fixed or
> removed for v1.3. On Portals, I only get 8k cids, so not having reuse
> is a rather large problem.
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335