Open MPI User's Mailing List Archives

Subject: [OMPI users] Connection timed out on TCP and notify question
From: Vince Grimes (tom.grimes_at_[hidden])
Date: 2014-04-24 11:41:13


Dear all:

        I am still investigating why a particular in-house program does not
run in parallel across multiple nodes under Open MPI. When running with
"--mca btl self,sm,tcp", I keep hitting the following error:

[compute-6-15.local][[8185,1],0]
[btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect] connect()
to 10.7.36.247 failed: Connection timed out (110)

I thought at first it was due to running out of file handles (sockets
are considered files), but I have amended limits.d to allow 102400 files
(up from the default of 1024), which should be more than enough.
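One way to confirm that the raised limit actually reaches the MPI processes is to print it from inside a process started the same way as the job. This is only a sketch: it assumes Python is available on the compute nodes, and the helper names are made up for illustration. Settings under limits.d are normally applied by PAM at login, so a process started by a long-running daemon may still see the old value.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is the one enforced for this process.
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

Launching this script through mpirun on each node, rather than from an interactive shell, would show the limit the ranks really inherit.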

        What is going on? Connection attempts to 4 of the 20 nodes gave the
error above.
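To separate a network problem from an Open MPI problem, a plain TCP connect to the failing addresses can be tried outside of MPI. The sketch below is an assumption-laden illustration, not part of Open MPI: the probe() helper is hypothetical, and the port must be one known to be listening on the target interface (22 for sshd is only a guess). A timeout, as opposed to "connection refused", usually suggests that packets are being dropped, for example by a firewall or because the address is on an interface that is not routable from the sending node.

```python
import errno
import socket
import sys


def probe(host, port, timeout=5.0):
    """Try one TCP connect; return 'ok' or a short failure description."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timed out"
    except OSError as exc:
        # Symbolic errno name such as ECONNREFUSED or EHOSTUNREACH.
        return errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Usage (port 22 is an assumption): python probe.py 10.7.36.247 ...
    for host in sys.argv[1:]:
        print(f"{host}:22 -> {probe(host, 22)}")
```

Running this from compute-6-15 against 10.7.36.247 and the other failing addresses would show whether the timeout reproduces without MPI involved.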

        My second question involves the notify system for btl openib. What does
the syslog notifier require in order to work? I want to see if the
errors running the same program with openib are due to dropped IB
connections.

-- 
T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061
(806) 834-0813 (voice);     (806) 742-1289 (fax)