Open MPI User's Mailing List Archives

From: Adrian Knoth (adi_at_[hidden])
Date: 2007-09-13 13:25:26


On Thu, Sep 13, 2007 at 11:15:47AM -0500, Tim Campbell wrote:

> workstations. When mpirun tries to start the processes on certain
> nodes I get the following error output.
>
> [sr70][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=111
> [sr71][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=111
>
> Using perl -e 'die$!=111' I see that the error message is "Connection
> refused". I am able to connect to both nodes in question via ssh and/

This sounds very much like an IP setup issue. Perhaps some nodes have
more than one interface, e.g. internal and external networks,
IP-over-FireWire, PPP devices or something else. If a node advertises
an address on an interface that the other nodes cannot reach, they
will be unable to connect to it.

If so, use btl_tcp_if_exclude (or btl_tcp_if_include) to specify the
right interface.
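
As a rough sketch of how that could look (the interface names "eth0"
and "lo", the process count and the program name ./my_app are
placeholders I made up, not something taken from your setup), the
following Python snippet lists the local interfaces and builds a
matching mpirun command line. Keep in mind that the two parameters are
mutually exclusive, and that when you override btl_tcp_if_exclude you
should keep the loopback interface in the list:

```python
import socket


def local_interfaces():
    """Return the names of the network interfaces on this host."""
    return [name for _index, name in socket.if_nameindex()]


def mpirun_command(program, nprocs, include=None, exclude=None):
    """Build an mpirun argument list restricting the TCP BTL interfaces.

    Exactly one of include/exclude must be given.
    """
    if (include is None) == (exclude is None):
        raise ValueError("give exactly one of include or exclude")
    if include is not None:
        param, names = "btl_tcp_if_include", list(include)
    else:
        param, names = "btl_tcp_if_exclude", list(exclude)
        # Assumption: loopback is called "lo" (as on Linux); keep it
        # excluded when overriding the exclude list.
        if "lo" not in names:
            names.insert(0, "lo")
    return ["mpirun", "--mca", param, ",".join(names),
            "-np", str(nprocs), program]


if __name__ == "__main__":
    print("interfaces on this host:", local_interfaces())
    # "eth0" and "./my_app" are placeholders.
    print(" ".join(mpirun_command("./my_app", 4, include=["eth0"])))
```

Run it on each node first to see which interface names actually exist
there, then pick the one that is on the network shared by all nodes.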

Second possibility: local firewalls. Although ssh connections might be
allowed, the sysadmin could be blocking almost every other destination
port. If the firewall rejects such packets with icmp-port-unreachable,
connect() fails with the same "Connection refused" error.
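
To check this independently of Open MPI, a small connect test from one
node to another can help. The following is only a sketch: the port
number 5000 is an arbitrary placeholder, so pick a port on which you
have started a test listener on the remote node. A result of
ECONNREFUSED with a listener running points to a rejecting firewall (or
the wrong address), while a timeout usually means the packets are
silently dropped:

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect and return a short description of the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        if exc.errno is None:
            return "error: %s" % exc
        # Map the numeric errno (e.g. 111 on Linux) to its symbolic name.
        return errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    # "sr70" is one of the failing nodes from the report; the port is
    # a placeholder for wherever your test listener is bound.
    print("sr70:", probe_tcp("sr70", 5000))
```

Run it from the node that prints the error towards sr70 and sr71, once
for each address those nodes have.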

> What are some possible issues that might be causing this? What can I
> do to get more information?

I agree that you surely need more information. Can you recompile with
--enable-debug and change

#define WANT_PEER_DUMP 0

in file ompi/mca/btl/tcp/btl_tcp_endpoint.c from "0" to "1" before
recompiling?

This should give you detailed information.

HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de