Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_COMM_split hanging
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-12-12 09:45:31


For MPI_Comm_split, all processes in the input communicator (oldcomm
or MPI_COMM_WORLD in your case) must call the operation since it is
collective over the input communicator. In your program rank 0 is not
calling the operation, so MPI_Comm_split is waiting for it to
participate.

If you want rank 0 to be excluded from all of the group communicators,
you can give it a special color that is distinct from all other ranks.
Upon return from MPI_Comm_split, rank 0 will be given a new
communicator containing just one process, itself. If you do not
intend to use that communicator, you can free it immediately
afterwards.
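
As a rough sketch of the special-color approach, assuming the variable
names from your snippet: the helper split_args, the SUPERVISOR_COLOR
constant, the WITH_MPI_MAIN build guard, and the value of groups_count
are all made up for illustration and are not part of MPI or of your
application. The point is only that every rank, rank 0 included,
reaches the MPI_Comm_split call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical color reserved for the supervisor (world rank 0).
 * Worker groups use colors 1, 2, ... so they never collide with it. */
#define SUPERVISOR_COLOR 0

/* Compute the color and key one rank passes to MPI_Comm_split.
 * Same arithmetic as the original snippet, except that rank 0 is
 * given its own color instead of skipping the call. */
void split_args(int my_rank, int total_proc_count, int groups_count,
                int *color, int *key)
{
    assert(groups_count > 0 && total_proc_count > 0);

    int cores_per_group = total_proc_count / groups_count;
    if (cores_per_group < 1)
        cores_per_group = 1;        /* more groups than processes */

    if (my_rank == 0) {             /* supervisor: alone in its color */
        *color = SUPERVISOR_COLOR;
        *key = 0;
        return;
    }

    int my_group = my_rank / cores_per_group;        /* 0, 1, ... */
    *color = my_group + 1;                           /* never 0 */
    *key = my_rank - my_group * cores_per_group;     /* order in group */
}

#ifdef WITH_MPI_MAIN  /* hypothetical guard: mpicc -DWITH_MPI_MAIN ... */
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, total_proc_count, color, key;
    int groups_count = 4;           /* assumed value for illustration */
    MPI_Comm my_communicator;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_proc_count);

    split_args(my_rank, total_proc_count, groups_count, &color, &key);

    /* Collective over MPI_COMM_WORLD: no rank may skip this call. */
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &my_communicator);

    if (my_rank != 0) {
        int group_rank;
        MPI_Comm_rank(my_communicator, &group_rank);
        printf("world rank %d: color %d, group rank %d\n",
               my_rank, color, group_rank);
    }

    /* Rank 0 frees its unused one-process communicator right away;
     * the workers would normally do their work before freeing theirs. */
    MPI_Comm_free(&my_communicator);

    MPI_Finalize();
    return 0;
}
#endif
```

Another option is for rank 0 to pass MPI_UNDEFINED as its color. It
still has to make the call, but it gets MPI_COMM_NULL back, so there is
nothing to free.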

Hope that helps,
Josh

On Fri, Dec 9, 2011 at 6:52 PM, Gary Gorbet <gegorbet_at_[hidden]> wrote:
> I am attempting to split my application into multiple master+workers
> groups using MPI_Comm_split. My MPI revision is shown as:
>
> mpirun --tag-output ompi_info -v ompi full --parsable
> [1,0]<stdout>:package:Open MPI root_at_build-x86-64 Distribution
> [1,0]<stdout>:ompi:version:full:1.4.3
> [1,0]<stdout>:ompi:version:svn:r23834
> [1,0]<stdout>:ompi:version:release_date:Oct 05, 2010
> [1,0]<stdout>:orte:version:full:1.4.3
> [1,0]<stdout>:orte:version:svn:r23834
> [1,0]<stdout>:orte:version:release_date:Oct 05, 2010
> [1,0]<stdout>:opal:version:full:1.4.3
> [1,0]<stdout>:opal:version:svn:r23834
> [1,0]<stdout>:opal:version:release_date:Oct 05, 2010
> [1,0]<stdout>:ident:1.4.3
>
> The basic problem I am having is that none of the process instances ever
> returns from the MPI_Comm_split call. I am pretty new to MPI and it is
> likely I am not doing things quite correctly. I'd appreciate some guidance.
>
> I am working with an application that has functioned nicely for a while
> now. It only uses a single MPI_COMM_WORLD communicator. It is standard
> stuff:  a master that hands out tasks to many workers, receives output
> and keeps track of workers that are ready to receive another task. The
> tasks are quite compute-intensive. When running a variation of the
> process that uses Monte Carlo iterations, jobs can exceed the 30 hours
> they are limited to. The MC iterations are independent of each other -
> adding random noise to an input - so I would like to run multiple
> iterations simultaneously so that 4 times the cores runs in a fourth of
> the time. This would entail a supervisor interacting with multiple
> master+workers groups.
>
> I had thought that I would just have to declare a communicator for each
> group so that broadcasts and syncs would work within a single group.
>
>   MPI_Comm_size( MPI_COMM_WORLD, &total_proc_count );
>   MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
>   ...
>   cores_per_group = total_proc_count / groups_count;
>   my_group = my_rank / cores_per_group;     // e.g., 0, 1, ...
>   group_rank = my_rank - my_group * cores_per_group;  // rank within a group
>   if ( my_rank == 0 )    continue;    // Do not create group for supervisor
>   MPI_Comm oldcomm = MPI_COMM_WORLD;
>   MPI_Comm my_communicator;    // Actually declared as a class variable
>   int sstat = MPI_Comm_split( oldcomm, my_group, group_rank,
>         &my_communicator );
>
> There is never a return from the above _split() call. Do I need to do
> something else to set this up? I would have expected perhaps a non-zero
> status return, but not that I would get no return at all. I would
> appreciate any comments or guidance.
>
> - Gary

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey