I’ve been using OpenMPI for a while now and am very pleased with it. I run it across eight Red Hat Linux nodes (8 cores each) on 1 Gbps Ethernet behind a dedicated switch. After working out some kinks in the beginning, we’ve been using it periodically on anywhere from 8 to 64 cores. We use a finite element package named LS-DYNA. We do not have the source code for this program; it is compiled to work with OpenMPI 1.4.1 (I use 1.4.2), and we cannot make changes or inspect the code to see how it performs certain functions.
From time to time, while I am simulating a particular “job” in LS-DYNA, the run quits with an MPI_ABORT and the message “connect to address xx.xxx.xxx.xxx port xxx: Connection refused; trying normal rsh (/usr/bin/rsh).” This error comes after the job has been running for hours, which means that connections to the node it cites had already been made successfully. The node it names is random and changes from simulation to simulation. We use SSH to communicate, and all ports are open for node-to-node communication.
Has anyone seen this error, where a connection is established and used for several hours, but after a seemingly random period of time the program dies stating that it can’t make a connection?
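In case it helps narrow this down, here is a minimal sketch of a probe I could run alongside a job to record when a node stops accepting TCP connections. It is only an illustration: the function names `probe` and `monitor`, the host names, and the choice of port 22 (the usual SSH port) are placeholders of mine, not our actual configuration and nothing taken from LS-DYNA or OpenMPI.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Try one TCP connection; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers "Connection refused", timeouts, and name lookup failures.
        return False, str(exc)


def monitor(hosts, port=22, interval=60.0, rounds=None, log=print):
    """Probe every host each round and log only the failures.

    rounds=None means run until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            ok, detail = probe(host, port)
            if not ok:
                stamp = datetime.now().isoformat(timespec="seconds")
                log(f"{stamp} {host}:{port} FAILED: {detail}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Hypothetical usage (host names are made up):
# monitor(["node01", "node02", "node03"], port=22, interval=60.0)
```

Comparing the timestamps of any logged failures with the time of the MPI_ABORT might at least show whether the node was refusing connections in general at that moment or only for the job.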