Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_Barrier (Inter-communicator)
From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2012-03-26 11:34:09


Yes and no. First, here is a quick fix for you: if you start the
server using

mpirun -np 2 -mca coll ^inter ./server

your test code finishes (with one minor modification to your code:
the process that is excluded on the client side also needs a
condition to leave the while loop).

That being said, here is what the problem seems to be when using the
inter communicator module. The inter-comm barrier is handled initially
by the basic module, and is implemented by calling an allreduce
operation. The inter-communicator allreduce by default uses the
implementation in the inter module, which is a sequence of an intra-reduce
on the local communicator, a point-to-point exchange of the results of the
two local groups by the local root processes (rank zero in the local groups
of the intercomm), and a broadcast of the results on the local group.
It is in this very last step that we are hanging.
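
To make the three phases easier to follow, here is a minimal sketch that
models them in plain C++ without any MPI calls. It is not the code of the
inter module; the names simulate_inter_allreduce and InterAllreduceResult
are hypothetical, and a sum is assumed as the reduction operation. Each
vector element stands for the buffer of one rank in a local group.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical result type: one output value per rank in each local group.
struct InterAllreduceResult {
    std::vector<int> group_a;  // results seen by the ranks of group A
    std::vector<int> group_b;  // results seen by the ranks of group B
};

// Models a sum-allreduce over an inter-communicator in three phases.
// Plain C++ simulation (assumption: sum reduction); no MPI is involved.
InterAllreduceResult simulate_inter_allreduce(const std::vector<int>& a_in,
                                              const std::vector<int>& b_in) {
    assert(!a_in.empty() && !b_in.empty());

    // Phase 1: intra-reduce inside each local group; the partial result
    // ends up on the local root (rank zero of the local group).
    const int a_local = std::accumulate(a_in.begin(), a_in.end(), 0);
    const int b_local = std::accumulate(b_in.begin(), b_in.end(), 0);

    // Phase 2: the two local roots exchange their partial results
    // point-to-point, so each root now holds the other group's result.
    const int a_root_recv = b_local;
    const int b_root_recv = a_local;

    // Phase 3: each root broadcasts the received value within its own
    // local group. This is the step that hangs for a local group of size 1.
    InterAllreduceResult out;
    out.group_a.assign(a_in.size(), a_root_recv);
    out.group_b.assign(b_in.size(), b_root_recv);
    return out;
}
```

In the reported scenario both local groups have a single process, e.g.
simulate_inter_allreduce({5}, {7}), so the phase 3 broadcast runs on a
communicator of size 1 on each side.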

So bottom line, the intra-communicator broadcast for a communicator size
of 1 is hanging, as far as I can see independent of whether we use tuned
or basic.

I do not recall what the agreement was on how to treat the size=1
scenarios in coll. Looking at the routine in tuned (e.g.
ompi_coll_tuned_bcast_intra_generic), there is a statement which clearly
indicates that it should not be used for 1 proc:

assert(size>1)

but I do not recall in which module that case was supposed to be handled,
or what the agreement was on how to treat it correctly. I am also not sure
why the bcast on 1 proc works on the server side but does not on the client
side. That's where I stand right now in the analysis.
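
One conceivable way to treat the size=1 case, shown purely as an
illustration and not as what Open MPI does or what was agreed upon, is to
return early before reaching a routine that asserts size>1. The sketch
below is plain C++ without MPI; bcast_generic and bcast_guarded are
hypothetical names, and each vector element models the buffer of one rank.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a generic broadcast that, like the routine quoted above,
// assumes more than one process. Not the Open MPI implementation.
void bcast_generic(std::vector<int>& bufs, std::size_t root) {
    assert(bufs.size() > 1);
    assert(root < bufs.size());
    // Copy the root's value into every rank's buffer.
    for (std::size_t r = 0; r < bufs.size(); ++r) {
        bufs[r] = bufs[root];
    }
}

// Hypothetical wrapper: with a single process the root already holds the
// data, so there is nothing to communicate and the call returns at once.
void bcast_guarded(std::vector<int>& bufs, std::size_t root) {
    if (bufs.size() <= 1) {
        return;
    }
    bcast_generic(bufs, root);
}
```

With this guard, a call on a one-element group leaves the buffer untouched
and never reaches the assertion, while larger groups take the generic path.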

Thanks
Edgar

On 3/26/2012 8:39 AM, Rodrigo Oliveira wrote:
> Hi Edgar,
>
> Did you take a look at my code? Any idea about what is happening? I did
> a lot of tests and it does not work.
>
> Thanks
>
> On Tue, Mar 20, 2012 at 3:43 PM, Rodrigo Oliveira
> <rsilva.oliveira_at_[hidden]> wrote:
>
> The command I use to compile and run is:
>
> mpic++ server.cc -o server && mpic++ client.cc -o client && mpirun -np 1 ./server
>
> Rodrigo
>
>
> On Tue, Mar 20, 2012 at 3:40 PM, Rodrigo Oliveira
> <rsilva.oliveira_at_[hidden]> wrote:
>
> Hi Edgar.
>
> Thanks for the response. The simplified code is attached:
> server, client and a .h containing some constants. I put some
> "prints" to show the behavior.
>
> Regards
>
> Rodrigo
>
>
> On Tue, Mar 20, 2012 at 11:47 AM, Edgar Gabriel
> <gabriel_at_[hidden]> wrote:
>
> Do you have by any chance the actual code or a small reproducer? It might be
> much easier to hunt the problem down...
>
> Thanks
> Edgar
>
> On 3/19/2012 8:12 PM, Rodrigo Oliveira wrote:
> > Hi there.
> >
> > I am facing a very strange problem when using MPI_Barrier over an
> > inter-communicator after some operations I describe below:
> >
> > 1) I start a server calling mpirun.
> > 2) The server spawns 2 copies of a client using MPI_Comm_spawn, creating
> > an inter-communicator between the two groups: the server group with 1
> > process (let's call it A) and the client group with 2 processes (group B).
> > 3) After that, I need to detach one of the processes (rank 0) in group B
> > from the inter-communicator AB. To do that I do the following steps:
> >
> > Server side:
> > .....
> > tmp_inter_comm = client_comm.Create ( client_comm.Get_group ( ) );
> > client_comm.Free ( );
> > client_comm = tmp_inter_comm;
> > .....
> > client_comm.Barrier();
> > .....
> >
> > Client side:
> > ....
> > rank = 0;
> > tmp_inter_comm = server_comm.Create ( server_comm.Get_group ( ).Excl ( 1, &rank ) );
> > server_comm.Free ( );
> > server_comm = tmp_inter_comm;
> > .....
> > if (server_comm != MPI::COMM_NULL)
> >     server_comm.Barrier();
> >
> >
> > The problem: everything works fine until the call to Barrier. At that
> > point, the server exits the barrier, but the client in group B does
> > not. Observe that we have only one process inside B, because I used Excl
> > to remove one process from the original group.
> >
> > P.S.: This occurs in version 1.5.4 with the C++ API.
> >
> > I am very concerned about this problem because this solution plays a
> > very important role in my master's thesis.
> >
> > Is this an ompi problem or am I doing something wrong?
> >
> > Thanks in advance
> >
> > Rodrigo Oliveira
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335