Open-MPI Users,
I’ve been using OpenMPI for a while now and am very
pleased with it. I use the OpenMPI system across eight Red Hat Linux nodes (8
cores each) on 1 Gbps Ethernet behind a dedicated switch. After working out
kinks in the beginning, we’ve been using it periodically on anywhere from 8
to 64 cores. We run a finite element program named LS-DYNA. We do not have
the source code for this program; it is compiled to work with OpenMPI 1.4.1 (I
use 1.4.2), and we cannot make changes or request the code to see how it
performs certain functions.
From time to time, while I am simulating a particular “job”
in LS-DYNA, it will quit for some reason, with OpenMPI issuing an MPI_ABORT
and the message “connect to address xx.xxx.xxx.xxx port xxx:
Connection refused; trying normal rsh (/usr/bin/rsh).” This error comes
after the job has run for hours, which means that connections to the node it
cites have already been made successfully. The node it names is
random and changes from simulation to simulation. We use SSH to communicate, and
all ports are open for node-to-node communication.
Has anyone seen this error, where a connection is established and used
for several hours, but after a seemingly random period of time the
program dies, stating that it can’t make a connection?
Thanks,
Robert Walters