Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Program hangs on send when run with nodes on remote machine
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-08-04 20:46:16


I notice that in the worker, you have:

eth2 Link encap:Ethernet HWaddr 00:1b:21:77:c5:d4
          inet addr:192.168.1.155 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:9225846 errors:0 dropped:75175 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1336628768 (1.3 GB) TX bytes:552 (552.0 B)

eth3 Link encap:Ethernet HWaddr 00:1b:21:77:c5:d5
          inet addr:192.168.1.156 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe77:c5d5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:26481809 errors:0 dropped:75059 overruns:0 frame:0
          TX packets:18030236 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:70061260271 (70.0 GB) TX bytes:11844181778 (11.8 GB)

Two different NICs are on the same subnet, which doesn't seem like a good idea. I think this topic has come up on the users list before and, IIRC, the general consensus was "don't do that", because it's not clear which NIC Linux will actually use for outgoing traffic bound for the 192.168.1.x subnet.
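
As a rough way to spot this situation on other hosts, here is a minimal Python sketch that groups interfaces by the network derived from their address and netmask. The addresses are copied from the ifconfig output above; the `INTERFACES` table and the `shared_subnets` helper are illustrative names, not part of Open MPI or any system tool.

```python
import ipaddress
from collections import defaultdict

# Interface addresses and netmasks copied by hand from the ifconfig output above.
# (Hypothetical table; a real check would read these from the system.)
INTERFACES = {
    "eth2": ("192.168.1.155", "255.255.255.0"),
    "eth3": ("192.168.1.156", "255.255.255.0"),
}


def shared_subnets(interfaces):
    """Return {network: [interface names]} for every subnet used by more than one NIC."""
    by_network = defaultdict(list)
    for name, (addr, mask) in interfaces.items():
        # ip_interface() derives the containing network from address/netmask.
        network = ipaddress.ip_interface(f"{addr}/{mask}").network
        by_network[network].append(name)
    return {net: sorted(names) for net, names in by_network.items() if len(names) > 1}


if __name__ == "__main__":
    for net, names in shared_subnets(INTERFACES).items():
        print(f"{net}: {', '.join(names)}")
```

With the two addresses above, both interfaces resolve to 192.168.1.0/24, so the helper reports eth2 and eth3 together. An empty result would mean that every interface sits on its own subnet.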

On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:

> I am having trouble running my MPI program on multiple nodes. I can
> run multiple processes on a single node, and I can spawn processes on
> remote nodes, but when I call Send from a remote node, the call
> never returns, even though there is an appropriate Recv waiting. I'm
> pretty sure this is an issue with my configuration, not my code. I've
> tried some other sample programs I found and had the same problem of
> hanging on a send from one host to another.
>
> Here's an in-depth description:
>
> I wrote a quick test program where each process with rank >= 1 sends an
> int to the master (rank 0), and the master receives until it gets
> something from every other process.
>
> My test program works fine when I run multiple processes on a single
> machine, either on the local node:
>
> $ ./mpirun -n 4 ./mpi-test
> Hi I'm localhost:2
> Hi I'm localhost:1
> localhost:1 sending 11...
> localhost:2 sending 12...
> localhost:2 sent 12
> localhost:1 sent 11
> Hi I'm localhost:0
> localhost:0 received 11 from 1
> localhost:0 received 12 from 2
> Hi I'm localhost:3
> localhost:3 sending 13...
> localhost:3 sent 13
> localhost:0 received 13 from 3
> all workers checked in!
>
> or a remote one:
>
> $ ./mpirun -np 2 -host remotehost ./mpi-test
> Hi I'm remotehost:0
> remotehost:0 received 11 from 1
> all workers checked in!
> Hi I'm remotehost:1
> remotehost:1 sending 11...
> remotehost:1 sent 11
>
> But when I try to run the master locally and the worker(s) remotely
> (this is the way I am actually interested in running it), Send never
> returns and it hangs indefinitely.
>
> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
> Hi I'm localhost:0
> Hi I'm remotehost:1
> remotehost:1 sending 11...
>
> Just to see if it would work, I tried spawning the master on the
> remotehost and the worker on the localhost.
>
> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
> Hi I'm localhost:1
> localhost:1 sending 11...
> localhost:1 sent 11
> Hi I'm remotehost:0
> remotehost:0 received 0 from 1
> all workers checked in!
>
> It doesn't hang on Send, but the wrong value is received.
>
> Any idea what's going on? I've attached my code, my config.log,
> ifconfig output, and ompi_info output.
>
> Thanks,
> Keith
> <mpi.tgz>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/