Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only when multiple hosts used.
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-03 09:39:20


Hmmm...just testing on my little cluster here on two nodes, it works just fine with 1.8.2:

[rhc_at_bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
 In rank 0 and host= bend001 Do Barrier call 1.
 In rank 0 and host= bend001 Do Barrier call 2.
 In rank 0 and host= bend001 Do Barrier call 3.
 In rank 1 and host= bend002 Do Barrier call 1.
 In rank 1 and host= bend002 Do Barrier call 2.
 In rank 1 and host= bend002 Do Barrier call 3.
[rhc_at_bend001 v1.8]$

How are you configuring OMPI?
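
For reference, one quick way to confirm that both hosts are running the same build is to compare the ompi_info output from each installation. The exact field names vary a bit between versions, so the grep pattern below is only a guess at what to look for:

$ ompi_info | head -n 3
$ ompi_info --all | grep -i configure
$ which mpirun mpicc

If the two machines report different versions or different install prefixes, that alone can produce hangs like the one described below.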

On May 2, 2014, at 2:24 PM, Clay Kirkland <clay.kirkland_at_[hidden]> wrote:

> I have been using MPI for many years, so I have very well debugged MPI tests. With either
> openmpi-1.4.5 or openmpi-1.6.5, though, I am having trouble getting MPI_Barrier calls to work.
> Everything is fine when I run all processes on one machine, but when I run with two or more hosts
> the second call to MPI_Barrier always hangs: never the first one, always the second. I looked at
> the FAQs and such but found nothing except a comment that MPI_Barrier problems are often firewall
> problems; not having the same version of MPI on both machines was also mentioned as a cause. I
> turned the firewalls off and removed and reinstalled the same version on both hosts, but I still
> see the same thing. I then installed LAM/MPI on two of my machines and that works fine. Running
> on either of the two machines by itself, I can call MPI_Barrier many times with no hangs; it only
> hangs if two or more hosts are involved. These runs are all being done on CentOS release 6.4.
> Here is the test program I used:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     char hoster[256];
>     int myrank, np;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>
>     gethostname(hoster, sizeof(hoster));
>
>     /* Three back-to-back barriers; the hang appears on the second one. */
>     printf(" In rank %d and host= %s Do Barrier call 1.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 2.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s Do Barrier call 3.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     MPI_Finalize();
>     return 0;
> }
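>
> A test like this is normally built with the Open MPI wrapper compiler, for example as below
> (barrier_test.c is just a placeholder name; with no -o option the build produces the a.out
> used in the runs that follow):
>
> $ /usr/local/bin/mpicc barrier_test.c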
>
> Here are three runs of the test program: first with two processes on one host, then with two
> processes on the other host, and finally with one process on each of the two hosts. The first
> two runs are fine, but the last run hangs on the second MPI_Barrier.
>
> [root_at_centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
> In rank 0 and host= centos Do Barrier call 1.
> In rank 1 and host= centos Do Barrier call 1.
> In rank 1 and host= centos Do Barrier call 2.
> In rank 1 and host= centos Do Barrier call 3.
> In rank 0 and host= centos Do Barrier call 2.
> In rank 0 and host= centos Do Barrier call 3.
> [root_at_centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
> In rank 0 and host= RAID Do Barrier call 1.
> In rank 0 and host= RAID Do Barrier call 2.
> In rank 0 and host= RAID Do Barrier call 3.
> In rank 1 and host= RAID Do Barrier call 1.
> In rank 1 and host= RAID Do Barrier call 2.
> In rank 1 and host= RAID Do Barrier call 3.
> [root_at_centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
> In rank 0 and host= centos Do Barrier call 1.
> In rank 0 and host= centos Do Barrier call 2.
> In rank 1 and host= RAID Do Barrier call 1.
> In rank 1 and host= RAID Do Barrier call 2.
>
> Since this is such a simple test of such a widely used MPI function, it must obviously be an
> installation or configuration problem. A pstack of each of the hung MPI_Barrier processes on
> the two machines shows this:
>
> [root_at_centos ~]# pstack 31666
> #0 0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1 0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2 0x00007f5de061475a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3 0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4 0x00007f5de0586f75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5 0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6 0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7 0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8 0x0000000000400a43 in main ()
>
> [root_at_RAID openmpi-1.6.5]# pstack 22167
> #0 0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1 0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2 0x00007f7ee468a75a in opal_event_base_loop () from /usr/local/lib/libmpi.so.1
> #3 0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4 0x00007f7ee45fcf75 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.1
> #5 0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #6 0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from /usr/local/lib/openmpi/mca_coll_tuned.so
> #7 0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8 0x0000000000400a43 in main ()
>
> The stacks look exactly the same on each machine. Any thoughts or ideas would be greatly
> appreciated, as I am stuck.
>
> Clay Kirkland
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users