I am curious what makes you think the connections to the node it's citing have been made. Are you sure the connection between the two processes has actually been made?
I’ve been using OpenMPI for a while now and am very pleased with it. I use the OpenMPI system across eight Red Hat Linux nodes (8 cores each) on 1 Gbps Ethernet behind a dedicated switch. After working out kinks in the beginning, we’ve been using it periodically on anywhere from 8 to 64 cores. We use a finite element program named LS-DYNA. We do not have the source code for this program; it is compiled to work with OpenMPI 1.4.1 (I use 1.4.2), and we cannot make changes or request code to see how it performs certain functions.
From time to time, I will be simulating a particular “job” in LS-DYNA and, for some reason, it will quit, with OpenMPI issuing an MPI_ABORT and reporting “connect to address xx.xxx.xxx.xxx port xxx: Connection refused; trying normal rsh (/usr/bin/rsh).” This error comes after the job has been running for hours, which I take to mean that connections to the node it’s citing have already been made. The particular node it names is random and changes from simulation to simulation. We use SSH to communicate, and we have the ports open for node-to-node communication on any port.
Have you tried running the code with the "-mca mpi_preconnect_mpi 1" option passed to mpirun? This will try to establish all connections at the start of the job (it isn't exhaustive, but it comes close).
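
For illustration, here is a minimal sketch of how that option might be placed on the mpirun command line. The core count matches the 64-core case mentioned in the question, but the hostfile name, the executable name and its argument are hypothetical placeholders, since the thread does not say how LS-DYNA is actually launched. The script only builds and prints the command; it does not run it.

```python
import shlex


def build_mpirun_command(num_procs, hostfile, program, program_args=()):
    """Return an mpirun argument list that asks Open MPI to preconnect."""
    return [
        "mpirun",
        # MCA parameter suggested in this thread: try to open connections
        # between processes at startup instead of lazily on first use.
        "-mca", "mpi_preconnect_mpi", "1",
        "-np", str(num_procs),
        "--hostfile", hostfile,
        program,
        *program_args,
    ]


# Hypothetical values: replace with the real hostfile and LS-DYNA binary.
cmd = build_mpirun_command(64, "hosts.txt", "./lsdyna_mpp", ["i=job.k"])

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

If the job still aborts hours in with preconnection enabled, that would suggest the failure is not simply a first-time connection attempt being refused.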
Does any user have experience with this error where a connection is established, and used for several hours, but after a seemingly random period of time the program dies stating it can’t make a connection?