In the ongoing investigation into why a particular in-house program does
not run in parallel across multiple nodes under OpenMPI, I have been running
with "--mca btl self,sm,tcp" and keep hitting the following error:
to 10.7.36.247 failed: Connection timed out (110)
I thought at first it was due to running out of file handles (sockets
are considered files), but I have amended limits.d to allow 102400 files
(up from the default of 1024), which should be more than enough.
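
One way to confirm that the raised limit is actually in effect is to query it
from inside a process started the same way as the MPI ranks. A minimal sketch
in Python, assuming the limit of interest is RLIMIT_NOFILE; the function name
nofile_limits is only illustrative. Settings in limits.d are normally applied
at login through PAM, so a process launched by a daemon that was started
before the change may still see the old value.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is the one that is actually enforced on the process.
    print(f"open files: soft={soft} hard={hard}")
```

Running this through the same launcher as the real job (rather than from an
interactive shell) shows the value the ranks would actually inherit.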
What is going on? The error above appeared when trying to connect to 4 of the 20 nodes.
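
To separate a plain network problem (firewall, routing, wrong interface) from
an MPI problem, a bare TCP connection attempt to each node can be timed out
and reported independently. The following is only a sketch under stated
assumptions: the helper name probe is hypothetical, and port 22 is an
assumption chosen only because sshd is likely to be listening there; the
ports the TCP BTL itself uses are typically assigned dynamically, so a
successful probe does not prove those ports are reachable.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection.

    Returns None on success, otherwise a short string describing the failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except socket.timeout:
        # No answer within `timeout` seconds (packets silently dropped?).
        return "timed out"
    except OSError as exc:
        # Map errno to its symbolic name, e.g. ECONNREFUSED or EHOSTUNREACH.
        return errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Address taken from the error message; the port is an assumption.
    for host in ["10.7.36.247"]:
        result = probe(host, 22)
        print(host, "ok" if result is None else result)
```

A node that times out here while its neighbours answer immediately would
point toward filtering or routing on that node rather than file-handle limits.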
My second question involves the notify system for btl openib. What does
the syslog notifier require in order to work? I want to see if the
errors running the same program with openib are due to dropped IB
T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Lubbock, TX 79409-1061
(806) 834-0813 (voice); (806) 742-1289 (fax)