Open MPI Development Mailing List Archives

This web mail archive is frozen.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Inherent limit on #communicators?
From: David Gunter (dog_at_[hidden])
Date: 2009-04-30 14:43:05

Just to throw out more info on this, the test code runs fine on
previous versions of OMPI. It only hangs on the 1.3 line when the cid
reaches 65536.


David Gunter
HPC-3: Parallel Tools Team
Los Alamos National Laboratory
On Apr 30, 2009, at 12:28 PM, Edgar Gabriel wrote:
> cids are in fact not recycled in the block algorithm. The problem
> is that comm_free is not collective, so you cannot make any
> assumptions about whether other procs have also released that
> communicator.
> But nevertheless, a cid in the communicator structure is a uint32_t,
> so it should not hit the 16-bit limit there yet. This is not new, so
> if there is a discrepancy between what the comm structure assumes a
> cid is and what the PML assumes, then this has been in the code since
> the very first days of Open MPI...
> Thanks
> Edgar
> Brian W. Barrett wrote:
>> On Thu, 30 Apr 2009, Ralph Castain wrote:
>>> We seem to have hit a problem here - it looks like we are seeing a
>>> built-in limit on the number of communicators one can create in a
>>> program. The program basically does a loop, calling MPI_Comm_split
>>> each time through the loop to create a sub-communicator, does a
>>> reduce operation on the members of the sub-communicator, and then
>>> calls MPI_Comm_free to release it (this is a minimized reproducer
>>> for the real code). After 64k times through the loop, the program
>>> fails.
>>> This looks remarkably like a 16-bit index that hits a max value and
>>> then blocks.
>>> I have looked at the communicator code, but I don't immediately see
>>> such a field. Is anyone aware of some other place where we would
>>> have a limit that would cause this problem?
>> There's a maximum of 32768 communicator ids when using OB1 (each  
>> PML can set the max contextid, although the communicator code is  
>> the part that actually assigns a cid).  Assuming that comm_free is  
>> actually properly called, there should be plenty of cids available  
>> for that pattern. However, I'm not sure I understand the block  
>> algorithm someone added to cid allocation - I'd have to guess that  
>> there's something funny with that routine and cids aren't being  
>> recycled properly.
>> Brian
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
> -- 
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab
> Department of Computer Science          University of Houston
> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
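
For readers following the thread, here is a toy model in plain C of the behaviour being discussed: an allocator that hands out context ids without ever reusing them, checked against a 16-bit field. It is only a sketch and is not taken from the Open MPI sources. The names `cid_allocator`, `cid_alloc`, `iterations_until_exhausted`, and `pack_cid` are hypothetical, and the 16-bit header width is an assumption inferred from the symptom (failure when the cid reaches 65536), not a confirmed detail of any PML.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed width of the context-id field carried outside the communicator
 * structure. 16 bits is a guess based on the symptom in the thread. */
typedef uint16_t header_cid_t;

/* Toy allocator: the communicator-side id is 32 bits, as stated in the
 * thread, and ids are handed out in increasing order. */
typedef struct {
    uint32_t next_cid;
} cid_allocator;

/* Store the next id in *cid_out and return true, or return false once the
 * id would no longer fit in the assumed 16-bit field. */
static bool cid_alloc(cid_allocator *a, uint32_t *cid_out)
{
    if (a->next_cid > UINT16_MAX) {
        return false;
    }
    *cid_out = a->next_cid++;
    return true;
}

/* Model the split / reduce / free loop and count how many iterations
 * succeed. With recycle == false, "free" never returns the id, which is
 * the behaviour described for the block algorithm. With recycle == true,
 * the freed id is handed out again on the next iteration (a
 * simplification that only suits this strictly alternating pattern). */
static uint32_t iterations_until_exhausted(uint32_t max_iterations,
                                           bool recycle)
{
    cid_allocator a = { 0 };
    uint32_t done = 0;

    for (; done < max_iterations; done++) {
        uint32_t cid;
        if (!cid_alloc(&a, &cid)) {
            break;              /* no id fits in the 16-bit field */
        }
        if (recycle) {
            a.next_cid = cid;   /* give the id back to the allocator */
        }
    }
    return done;
}

/* What a narrowing store does to a 32-bit id: values past 65535 wrap and
 * alias a smaller id. */
static header_cid_t pack_cid(uint32_t cid)
{
    return (header_cid_t)cid;
}
```

Under these assumptions, `iterations_until_exhausted(100000, false)` stops after 65536 iterations, matching the "64k times through the loop" report, while the recycling variant completes all 100000. `pack_cid(65536)` yields 0, which shows how a 32-bit cid could collide with an existing id if some layer stored it in 16 bits.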