Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI_COMM_split hanging
From: Gary Gorbet (gegorbet_at_[hidden])
Date: 2011-12-09 18:52:20

I am attempting to split my application into multiple master+workers
groups using MPI_Comm_split. My Open MPI installation reports:

mpirun --tag-output ompi_info -v ompi full --parsable
[1,0]<stdout>:package:Open MPI root_at_build-x86-64 Distribution
[1,0]<stdout>:ompi:version:release_date:Oct 05, 2010
[1,0]<stdout>:orte:version:release_date:Oct 05, 2010
[1,0]<stdout>:opal:version:release_date:Oct 05, 2010

The basic problem I am having is that none of the processes ever
returns from the MPI_Comm_split call. I am pretty new to MPI, and it is
likely that I am not doing things quite correctly. I'd appreciate some
guidance.

I am working with an application that has functioned nicely for a while
now. It uses only the single MPI_COMM_WORLD communicator. It is standard
stuff: a master hands out tasks to many workers, receives their output,
and keeps track of which workers are ready to receive another task. The
tasks are quite compute-intensive. When running a variation of the
process that uses Monte Carlo (MC) iterations, jobs can exceed the
30-hour limit imposed on them. The MC iterations are independent of each
other (each adds random noise to an input), so I would like to run
multiple iterations simultaneously, so that, for example, 4 times the
cores finish in a fourth of the time. This would entail a supervisor
interacting with multiple master+workers groups.

I had thought that I would just have to declare a communicator for each
group so that broadcasts and syncs would work within a single group.

    MPI_Comm_size( MPI_COMM_WORLD, &total_proc_count );
    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
    cores_per_group = total_proc_count / groups_count;
    my_group = my_rank / cores_per_group; // e.g., 0, 1, ...
    group_rank = my_rank - my_group * cores_per_group; // rank within a group
    if ( my_rank == 0 ) continue; // Do not create group for supervisor
    MPI_Comm oldcomm = MPI_COMM_WORLD;
    MPI_Comm my_communicator; // Actually declared as a class variable
    int sstat = MPI_Comm_split( oldcomm, my_group, group_rank,
          &my_communicator );

The MPI_Comm_split() call above never returns. Do I need to do something
else to set this up? I might have expected a non-zero status return, but
not that the call would fail to return at all. I would appreciate any
comments or guidance.

- Gary