Open MPI User's Mailing List Archives

Subject: [OMPI users] Program hangs on send when run with nodes on remote machine
From: Keith Manville (kmanville_at_[hidden])
Date: 2011-08-04 13:59:05


I am having trouble running my MPI program on multiple nodes. I can
run multiple processes on a single node, and I can spawn processes on
remote nodes, but when I call Send from a remote node, the call
never returns, even though there is a matching Recv waiting. I'm
pretty sure this is an issue with my configuration, not my code: I've
tried some other sample programs I found and had the same problem of
hanging on a send from one host to another.

Here's an in-depth description:

I wrote a quick test program where each process with rank > 0 sends an
int to the master (rank 0), and the master receives until it gets
something from every other process.

My test program works fine when I run multiple processes on a single
machine, either the local node:

$ ./mpirun -n 4 ./mpi-test
Hi I'm localhost:2
Hi I'm localhost:1
localhost:1 sending 11...
localhost:2 sending 12...
localhost:2 sent 12
localhost:1 sent 11
Hi I'm localhost:0
localhost:0 received 11 from 1
localhost:0 received 12 from 2
Hi I'm localhost:3
localhost:3 sending 13...
localhost:3 sent 13
localhost:0 received 13 from 3
all workers checked in!

or a remote one:

$ ./mpirun -np 2 -host remotehost ./mpi-test
Hi I'm remotehost:0
remotehost:0 received 11 from 1
all workers checked in!
Hi I'm remotehost:1
remotehost:1 sending 11...
remotehost:1 sent 11

But when I try to run the master locally and the worker(s) remotely
(this is the way I am actually interested in running it), Send never
returns and it hangs indefinitely.

$ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
Hi I'm localhost:0
Hi I'm remotehost:1
remotehost:1 sending 11...

Just to see if it would work, I tried spawning the master on the
remotehost and the worker on the localhost.

$ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
Hi I'm localhost:1
localhost:1 sending 11...
localhost:1 sent 11
Hi I'm remotehost:0
remotehost:0 received 0 from 1
all workers checked in!

It doesn't hang on Send, but the wrong value is received (0 instead of 11).

Any idea what's going on? I've attached my code, my config.log,
ifconfig output, and ompi_info output.

Thanks,
Keith



  • application/x-gzip attachment: mpi.tgz