Open MPI User's Mailing List Archives

Subject: [OMPI users] random error : btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect connection refused (111)
From: Sudarshan Wadkar (wadkar_at_[hidden])
Date: 2010-10-30 02:26:01

Hello Open MPI list!
I am trying to run GROMACS with Open MPI 1.5, compiled from source
with the Intel compilers, under the Torque/Maui scheduler.
I am getting the following error. It appears to come from Open MPI,
so I am posting my query here.

connect() to failed: Connection refused (111)

The job then hangs (no output for a long time). The strange thing is
that the error appears only intermittently. Sometimes the job finishes
without any error messages, sometimes the error shows up in the middle
of GROMACS's stderr stream, and sometimes I get only the following:

NNODES=4, MYRANK=0, HOSTNAME=compute-0-4.local
NODEID=0 argc=12
connect() to failed: Connection refused (111)
NNODES=4, MYRANK=1, HOSTNAME=compute-0-4.local
NNODES=4, MYRANK=2, HOSTNAME=compute-0-130.local
NODEID=2 argc=12
NNODES=4, MYRANK=3, HOSTNAME=compute-0-130.local
NODEID=1 argc=12
NODEID=3 argc=12

I can attach full logs of successful jobs, but they don't contain any
Open MPI-related messages.

When I searched for
btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect, I found
a link
which says "This is probably due to a weakness of the system when the
job is assigned to nodes with and without infiniband at the same time".
However, our system doesn't have any InfiniBand fabric. We do have two
GigE networks, eth0 and eth1, both of which are working fine.
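
For reference, my understanding is that 111 is ECONNREFUSED on Linux:
the target host answered, but nothing accepted the connection on that
port. The same errno can be reproduced without MPI using a plain
socket. This is only a minimal sketch; the helper names probe and
free_port and the loopback address are illustrative and not taken from
our job logs.

```python
import errno
import socket


def probe(host, port, timeout=2.0):
    """Attempt a TCP connect; return 0 on success, else the errno value."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        # connect_ex returns an error code instead of raising for
        # C-level connect() failures such as ECONNREFUSED.
        return s.connect_ex((host, port))
    finally:
        s.close()


def free_port():
    """Return a loopback port that (very likely) has no listener."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))   # port 0: let the OS choose a free port
    port = s.getsockname()[1]
    s.close()                  # closed again, so nothing is listening
    return port


if __name__ == "__main__":
    code = probe("127.0.0.1", free_port())
    # errno.errorcode maps the number back to its symbolic name.
    print(code, errno.errorcode.get(code, "unknown"))
```

Running probe from one compute node against the other, on both the
eth0 and the eth1 address, might show whether one of the two networks
refuses connections.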

Please help.

Thank you

Sudarshan Wadkar
System Administrator

"Success is getting what you want. Happiness is wanting what you get."
- Dale Carnegie
"It's always our decision who we are"
- Robert Solomon in Waking Life