Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Inherent limit on #communicators?
From: David Gunter (dog_at_[hidden])
Date: 2009-04-30 14:43:05


Just to throw out more info on this, the test code runs fine on
previous versions of OMPI. It only hangs on the 1.3 line when the cid
reaches 65536.
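
For what it's worth, here is a minimal sketch of the kind of mismatch Ralph and Edgar describe in the quoted messages below. It assumes (this is only an assumption, not something checked against the source) that cids are handed out in increasing order without recycling, and that some message header carries the cid in a 16-bit field while the communicator holds it as a uint32_t. Every name in it (fake_cid_alloc_t, fake_cid_next, fake_hdr_ctx, first_mismatched_cid) is hypothetical and exists only for illustration; none of it is Open MPI code.

```c
#include <stdint.h>

/* Hypothetical allocator state: hands out cids in increasing order and
 * never recycles them (assumption made for illustration only). */
typedef struct {
    uint32_t next;
} fake_cid_alloc_t;

/* Return the next cid; the communicator side holds it as a uint32_t. */
static uint32_t fake_cid_next(fake_cid_alloc_t *a)
{
    return a->next++;
}

/* Hypothetical 16-bit header field: storing a cid there keeps only the
 * low 16 bits, so larger values wrap around modulo 65536. */
static uint16_t fake_hdr_ctx(uint32_t cid)
{
    return (uint16_t)cid;
}

/* Simulate max_iters create/free cycles. Return the first cid whose
 * 16-bit header copy differs from the 32-bit value, or UINT32_MAX if
 * every cid handed out still fits. */
static uint32_t first_mismatched_cid(uint32_t max_iters)
{
    fake_cid_alloc_t alloc = { 0 };

    for (uint32_t i = 0; i < max_iters; i++) {
        uint32_t cid = fake_cid_next(&alloc);
        if ((uint32_t)fake_hdr_ctx(cid) != cid) {
            return cid;
        }
    }
    return UINT32_MAX;
}
```

Under those assumptions, first_mismatched_cid(100000) returns 65536, the same cid at which the test code hangs, while first_mismatched_cid(65536) returns UINT32_MAX because cids 0 through 65535 all still fit in 16 bits. This only illustrates the arithmetic; it does not show where such a field lives in the 1.3 code.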

-david

--
David Gunter
HPC-3: Parallel Tools Team
Los Alamos National Laboratory

On Apr 30, 2009, at 12:28 PM, Edgar Gabriel wrote:
> cids are in fact not recycled in the block algorithm. The problem
> is that comm_free is not collective, so you cannot make any
> assumptions about whether other procs have also released that
> communicator.
>
> Nevertheless, a cid in the communicator structure is a uint32_t,
> so it should not hit the 16-bit limit there yet. This is not new, so
> if there is a discrepancy between what the comm structure assumes
> a cid is and what the pml assumes, then this has been in the code
> since the very first days of Open MPI...
>
> Thanks
> Edgar
>
> Brian W. Barrett wrote:
>> On Thu, 30 Apr 2009, Ralph Castain wrote:
>>> We seem to have hit a problem here - it looks like we are seeing a
>>> built-in limit on the number of communicators one can create in a
>>> program. The program basically does a loop, calling MPI_Comm_split
>>> each time through the loop to create a sub-communicator, does a
>>> reduce operation on the members of the sub-communicator, and then
>>> calls MPI_Comm_free to release it (this is a minimized reproducer
>>> for the real code). After 64k times through the loop, the program
>>> fails.
>>>
>>> This looks remarkably like a 16-bit index that hits a max value and
>>> then blocks.
>>>
>>> I have looked at the communicator code, but I don't immediately see
>>> such a field. Is anyone aware of some other place where we would
>>> have a limit that would cause this problem?
>> There's a maximum of 32768 communicator ids when using OB1 (each  
>> PML can set the max contextid, although the communicator code is  
>> the part that actually assigns a cid).  Assuming that comm_free is  
>> actually properly called, there should be plenty of cids available  
>> for that pattern. However, I'm not sure I understand the block  
>> algorithm someone added to cid allocation - I'd have to guess that  
>> there's something funny with that routine and cids aren't being  
>> recycled properly.
>> Brian
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> -- 
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
> Department of Computer Science          University of Houston
> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335