Well, since I'm the "guy who wrote the code", I'll offer my $0.0001
(my dollars went the way of the market...).
Jeff's memory about why we went to 16 bits isn't quite accurate. The
fact is that we always had 32-bit jobids, and still do. Up to about a
year ago, all of that space was available for comm_spawn. What changed
at that time was a decision to make every mpirun independently create
a unique identifier so that two mpiruns could connect/accept without
requiring a persistent orted to coordinate their name at launch. This
was the subject of a lengthy discussion involving multiple
institutions that spanned several months last year.
As a result of that discussion, we claimed 16 bits of the 32 bits for
the mpirun identifier. We investigated using only 8 bits (thus leaving
24 bits for comm_spawn'd jobs), but the probability of duplicate
identifiers was too high.
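To make the split concrete, here is a rough sketch in C. It is an illustration only, not the actual ORTE code: the names make_jobid, jobid_mpirun, jobid_local and duplicate_probability are made up, and putting the mpirun identifier in the upper half is an assumption. The probability helper further assumes the identifiers behave like uniformly random values, which this thread does not say.

```c
#include <stdint.h>

/* Hypothetical layout: upper 16 bits = mpirun identifier,
 * lower 16 bits = job counter local to that mpirun
 * (each comm_spawn'd job consumes one value). */
uint32_t make_jobid(uint16_t mpirun_id, uint16_t local_job)
{
    return ((uint32_t)mpirun_id << 16) | (uint32_t)local_job;
}

uint16_t jobid_mpirun(uint32_t jobid) { return (uint16_t)(jobid >> 16); }
uint16_t jobid_local(uint32_t jobid)  { return (uint16_t)(jobid & 0xFFFFu); }

/* Birthday-problem estimate: chance that at least two of n mpiruns
 * pick the same identifier when each draws uniformly at random from
 * 2^bits values (bits <= 32). Uniform randomness is an assumption. */
double duplicate_probability(unsigned n, unsigned bits)
{
    double space = (double)(1ULL << bits);
    double p_all_unique = 1.0;

    for (unsigned i = 1; i < n; i++) {
        p_all_unique *= 1.0 - (double)i / space;
    }
    return 1.0 - p_all_unique;
}
```

Under those assumptions, two mpiruns with 8-bit identifiers collide with probability 1/256, and the chance grows quickly as more mpiruns try to connect; with 16 bits the same number of mpiruns is far less likely to collide.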
Likewise, we looked at increasing the total size of the jobid to 64
bits, but that seemed ridiculously high - and (due to the way memory
gets allocated for structures) meant that we had to also increase the
vpid size to 64 bits. Thus, the move to 64-bit ids would have
increased the size of the name struct from 64 bits to 128 bits - and
now you do start to see a non-zero impact on memory footprint for
extreme scale clusters involving several hundred thousand processes.
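A small sketch of the size argument, assuming a name is nothing more than a jobid plus a vpid (the real struct may differ; all type and function names here are hypothetical). On typical 64-bit ABIs, a 64-bit jobid next to a 32-bit vpid gets padded out to 16 bytes anyway, which is why widening one field effectively means widening both.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name structs, for illustration only. */
typedef struct { uint32_t jobid; uint32_t vpid; } name32_t;     /* 8 bytes  */
typedef struct { uint64_t jobid; uint64_t vpid; } name64_t;     /* 16 bytes */
typedef struct { uint64_t jobid; uint32_t vpid; } name_mixed_t; /* usually
                                   padded to 16 bytes on 64-bit ABIs */

/* Extra memory per process if it stores one name per peer.
 * "One name per peer" is an assumption made for this estimate. */
size_t extra_bytes_per_process(size_t npeers)
{
    return npeers * (sizeof(name64_t) - sizeof(name32_t));
}
```

With this model, every 100,000 peers cost an extra 800,000 bytes in each process that keeps a name for all of them, before counting any other per-peer state.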
So we accepted the 16-bit limit on comm_spawn and moved on... until
someone now wants to do 100k comm_spawns.
I don't believe Jeff's proposed solution will solve that user's
request, as he was dynamically constructing a very large server farm
(so the procs are not short lived). However, IMHO, this was a poorly
designed application - it didn't need to be done the way he was
doing it, and could easily (and more efficiently) be built to fit
within the 64k constraint.
So, my suggestion is to stick with the 64k limit, perhaps add this
reuse proposal, and just document that constraint.
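One way the reuse idea could look, as a sketch only: a single bounded table kept by mpirun that hands out 16-bit local job ids and takes them back when a spawned job ends. This is one reading of the proposal, not its actual design, and every name below (jobid_table_t, table_init, table_alloc, table_release) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_LOCAL_JOBS 65536u   /* the 64k limit from a 16-bit field */

typedef struct {
    uint8_t  in_use[MAX_LOCAL_JOBS / 8]; /* one bit per local job id  */
    uint32_t next_hint;                  /* where the next search starts */
} jobid_table_t;

void table_init(jobid_table_t *t)
{
    memset(t, 0, sizeof(*t));
}

/* Returns a free local job id in [0, 65535], or -1 if every id is
 * currently held by a live job. */
int32_t table_alloc(jobid_table_t *t)
{
    for (uint32_t n = 0; n < MAX_LOCAL_JOBS; n++) {
        uint32_t id = (t->next_hint + n) % MAX_LOCAL_JOBS;
        uint8_t mask = (uint8_t)(1u << (id % 8));

        if (!(t->in_use[id / 8] & mask)) {
            t->in_use[id / 8] |= mask;
            t->next_hint = (id + 1) % MAX_LOCAL_JOBS;
            return (int32_t)id;
        }
    }
    return -1;
}

/* Called when a spawned job terminates so its id can be reused. */
void table_release(jobid_table_t *t, uint16_t id)
{
    t->in_use[id / 8] &= (uint8_t)~(1u << (id % 8));
}
```

Note that this only helps when spawned jobs finish and release their ids: 64k simultaneously live jobs still exhaust the table, which is the server-farm case above.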
On Oct 27, 2008, at 4:14 PM, Jeff Squyres wrote:
> On Oct 27, 2008, at 5:52 PM, Andreas Schäfer wrote:
>> I don't know any implementation details, but is making a 16-bit
>> counter a 32-bit counter really so much harder than this fancy
>> (overengineered? ;-) ) table construction? The way I see it, this
>> table might become a real mess if multiple MPI_Comm_spawn calls are
>> issued simultaneously in different communicators. (Would that be
>> legal MPI?)
> FWIW, all the spawns are proxied back to the HNP (i.e., mpirun), so
> there would only be a need for 1 table. I don't think that a simple
> table lookup is overengineered. :-) It's a simple solution to the
> "need a global ID" issue. By limiting the size of the table, you
> can avoid scalability issues as MPI jobs are being run on more and
> more cores (e.g., growing without bound, particularly for 99% of the
> apps out there that never call comm_spawn).
> We actually went down to 16 bits recently (it used to be 32) as one
> item toward reducing the memory footprint of MPI processes (and
> mpirun and the orted's), particularly when running very large scale
> jobs. So while increasing this one value back to 32 bits may not be
> tragic, it would be nice to keep it down as 16 bits (IMHO).
> Regardless of how big the value is (8, 16, 32, 64...) you still need
> a unique value for comm_spawn. Therefore, some kind of duplicate
> detection mechanism is needed. If you increase the size of the
> container, you decrease the probability of collision, but it can
> still happen. And since machines are growing in size and # of
> cores, it could just delay the probability of collision until
> someone runs on a big enough machine. Regardless, I'd prefer to fix
> it the Right way rather than rely on probability to prevent a
> problem. In my experience, "that could *never* happen!" is just an
> invitation for a disaster, even if it's 1-5 years in the future.
> (didn't someone say that we'd never need more than 640k of RAM? :-) )
> Just my IMHO, of course... (and I'm not the guy writing the
> code!) :-)
> Jeff Squyres
> Cisco Systems
> devel mailing list