
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI over tcp
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2012-05-04 08:26:18


>-----Original Message-----
>From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>On Behalf Of Don Armstrong
>Sent: Thursday, May 03, 2012 5:43 PM
>To: users_at_[hidden]
>Subject: Re: [OMPI users] MPI over tcp
>
>On Thu, 03 May 2012, Rolf vandeVaart wrote:
>> I tried your program on a single node and it worked fine.
>
>It works fine on a single node, but deadlocks when it communicates in
>between nodes. Single node communication doesn't use tcp by default.
>
>> Yes, TCP message passing in Open MPI has been working well for some
>> time.
>
>Ok. Which version(s) of openmpi are you using successfully? [I'm assuming
>that this is in an environment which doesn't use IB.]

I was using a trunk version from a month or so ago. However, TCP has not changed too much over the years, so I would expect all versions to work just fine.

>
>> 1. Can you run something like hostname successfully (mpirun -np 10
>> -hostfile yourhostfile hostname)
>
>Yes, but this only shows that processes start and output is returned, which
>doesn't utilize the in-band message passing at all.

Yes, I agree. But it at least shows that TCP connections can work between the machines. We typically first make sure that something like hostname works.
Then we try something like the connectivity_c.c program in the examples directory to test out MPI communication.
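
Between the hostname check and a full MPI test, a raw TCP probe can confirm that one node can open a connection to another at all. The following is a minimal sketch, not part of Open MPI: `can_connect` is a hypothetical helper, and it only succeeds if something is actually listening on the target port when the probe runs.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; it is not an Open MPI tool.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks,
        # and name-resolution failures.
        return False
```

Calling `can_connect(address, port)` from one node with the address and port of a listener on the other node returns False when the connection is refused, times out, or the host is unreachable. Ports opened by an MPI job exist only while that job is running, so a simple test listener on the remote node is the easier target.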

>
>> 2. If that works, then you can also run with a debug switch to see
>> what connections are being made by MPI.
>
>You can see the connections being made in the attached log:
>
>[archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address
>138.23.141.162 on port 2001

Yes, I missed that. So, can we simplify the problem? Can you run with np=2 and one process on each node?
Also, maybe you can send the ifconfig output from each node. We sometimes see this type of hanging when
a node has two different interfaces on the same subnet.
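
To check a node for that situation, the addresses and netmasks from its ifconfig output can be grouped by network. This is a rough sketch using Python's standard `ipaddress` module; the function name, interface names, and addresses are invented for illustration and are not taken from the nodes in this thread.

```python
import ipaddress
from collections import defaultdict


def interfaces_sharing_subnet(interfaces):
    """Return {network: [interface names]} for networks with more than one interface.

    `interfaces` maps an interface name to an (address, netmask) pair of
    strings, copied by hand from ifconfig output.
    """
    by_network = defaultdict(list)
    for name, (addr, netmask) in interfaces.items():
        # ip_interface accepts "address/netmask" and derives the network.
        network = ipaddress.ip_interface(f"{addr}/{netmask}").network
        by_network[network].append(name)
    return {
        str(net): sorted(names)
        for net, names in by_network.items()
        if len(names) > 1
    }


# Hypothetical node: eth0 and eth1 sit on the same /24, eth2 does not.
example = {
    "eth0": ("192.168.1.10", "255.255.255.0"),
    "eth1": ("192.168.1.11", "255.255.255.0"),
    "eth2": ("10.0.0.5", "255.255.0.0"),
}
print(interfaces_sharing_subnet(example))
# prints {'192.168.1.0/24': ['eth0', 'eth1']}
```

An empty result means no two listed interfaces share a network, which would rule out this particular cause.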

Assuming there are multiple interfaces, can you experiment with the runtime flags outlined here?
http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Maybe by restricting to specific interfaces you can figure out which network is the problem.
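
One way to run that experiment is to generate one mpirun command per interface and try them in turn. The sketch below assumes the `btl` and `btl_tcp_if_include` MCA parameters described in the FAQ entry above. The helper name, the interface names `eth0` and `eth1`, and the program name `./a.out` are placeholders, and the hostfile still has to place one process on each node.

```python
import shlex


def mpirun_per_interface(interfaces, hostfile, program, np=2):
    """Return one mpirun argv list per interface.

    Each command limits Open MPI to the TCP and self BTLs and restricts
    TCP traffic to a single interface via btl_tcp_if_include.
    """
    commands = []
    for ifname in interfaces:
        commands.append([
            "mpirun", "-np", str(np), "-hostfile", hostfile,
            "--mca", "btl", "tcp,self",
            "--mca", "btl_tcp_if_include", ifname,
            program,
        ])
    return commands


# Placeholder interface and program names; substitute the real ones.
for argv in mpirun_per_interface(["eth0", "eth1"], "yourhostfile", "./a.out"):
    print(shlex.join(argv))
```

If the job completes over one interface but hangs over another, that narrows the problem to the second network.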

>
>> I would suggest reading through here for some ideas and for the debug
>> switch.
>
>Thanks. I checked the FAQ, and didn't see anything that shed any light,
>unfortunately.
>
>
>Don Armstrong
>
>--
>Fate and Temperament are two words for one and the same concept.
> -- Novalis [Hermann Hesse _Demian_]
>
>http://www.donarmstrong.com http://rzlab.ucr.edu