Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Inherent limit on #communicators?
From: David Gunter (dog_at_[hidden])
Date: 2009-04-30 14:56:11


Here is the test code reproducer:

       program test2
       implicit none
       include 'mpif.h'
       integer ierr, myid, numprocs,i1,i2,n,local_comm,
     $ icolor,ikey,rank,root

c
c... MPI set-up
       ierr = 0
       call MPI_INIT(IERR)
       ierr = 1
       CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
       print *, ierr

       ierr = -1

       CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

       ierr = -5
       i1 = ierr
       if (myid.eq.0) i1 = 1
       call mpi_allreduce(i1, i2, 1,MPI_integer,MPI_MIN,
     $ MPI_COMM_WORLD,ierr)

       ikey = myid
       if (mod(myid,2).eq.0) then
          icolor = 0
       else
          icolor = MPI_UNDEFINED
       end if

       root = 0
       do n = 1, 100000

          call MPI_COMM_SPLIT(MPI_COMM_WORLD, icolor,
     $ ikey, local_comm, ierr)

          if (mod(myid,2).eq.0) then
             CALL MPI_COMM_RANK(local_comm, rank, ierr)
             i2 = i1
             call mpi_reduce(i1, i2, 1,MPI_integer,MPI_MIN,
     $ root, local_comm,ierr)

             if (myid.eq.0.and.mod(n,10).eq.0)
     $ print *, n, i1, i2,icolor,ikey

             call mpi_comm_free(local_comm, ierr)
          end if

       end do
c if (icolor.eq.0) call mpi_comm_free(local_comm, ierr)

       call MPI_barrier(MPi_COMM_WORLD,ierr)

       call MPI_FINALIZE(IERR)

       print *, myid, ierr

       end

-david

--
David Gunter
HPC-3: Parallel Tools Team
Los Alamos National Laboratory

On Apr 30, 2009, at 12:43 PM, David Gunter wrote:
> Just to throw out more info on this, the test code runs fine on  
> previous versions of OMPI.  It only hangs on the 1.3 line when the  
> cid reaches 65536.
>
> -david
> --
> David Gunter
> HPC-3: Parallel Tools Team
> Los Alamos National Laboratory
>
>
>
> On Apr 30, 2009, at 12:28 PM, Edgar Gabriel wrote:
>
>> cids are in fact not recycled in the block algorithm. The problem
>> is that comm_free need not synchronize, so you cannot make any
>> assumptions about whether other procs have also released that
>> communicator.
>>
>>
>> Nevertheless, a cid in the communicator structure is a
>> uint32_t, so it should not hit the 16-bit limit there yet. This is
>> not new, so if there is a discrepancy between what the comm structure
>> assumes a cid is and what the pml assumes, then it has been in
>> the code since the very first days of Open MPI...
>>
>> Thanks
>> Edgar
>>
>> Brian W. Barrett wrote:
>>> On Thu, 30 Apr 2009, Ralph Castain wrote:
>>>> We seem to have hit a problem here - it looks like we are seeing a
>>>> built-in limit on the number of communicators one can create in a
>>>> program. The program basically does a loop, calling MPI_Comm_split
>>>> each time through the loop to create a sub-communicator, does a
>>>> reduce operation on the members of the sub-communicator, and then
>>>> calls MPI_Comm_free to release it (this is a minimized reproducer
>>>> for the real code). After 64k times through the loop, the program
>>>> fails.
>>>>
>>>> This looks remarkably like a 16-bit index that hits a max value and
>>>> then blocks.
>>>>
>>>> I have looked at the communicator code, but I don't immediately see
>>>> such a field. Is anyone aware of some other place where we would
>>>> have a limit that would cause this problem?
>>> There's a maximum of 32768 communicator ids when using OB1 (each  
>>> PML can set the max contextid, although the communicator code is  
>>> the part that actually assigns a cid).  Assuming that comm_free is  
>>> actually properly called, there should be plenty of cids available  
>>> for that pattern. However, I'm not sure I understand the block  
>>> algorithm someone added to cid allocation - I'd have to guess that  
>>> there's something funny with that routine and cids aren't being  
>>> recycled properly.
>>> Brian
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> -- 
>> Edgar Gabriel
>> Assistant Professor
>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>> Department of Computer Science          University of Houston
>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335