On 05/02/2011 01:27 PM, Robert Walters wrote:
> Open-MPI Users,
> I've been using OpenMPI for a while now and am very pleased with it. I
> use the OpenMPI system across eight Red Hat Linux nodes (8 cores each)
> on 1 Gbps Ethernet behind a dedicated switch. After working out kinks
> in the beginning, we've been using it periodically anywhere from 8
> cores to 64 cores. We use a finite element software named LS-DYNA. We
> do not have source code for this program, it is compiled to work with
> OpenMPI 1.4.1 (I use 1.4.2) and we cannot make changes or request code
> to see how it performs certain functions.
> From time to time, I will be simulating a particular "job" in LS-DYNA
> and for some reason, it will quit OpenMPI issuing a MPI_ABORT command
> stating that "connect to address xx.xxx.xxx.xxx port xxx: Connection
> refused; trying normal rsh (/usr/bin/rsh)." This error comes after
> running for hours, which means that connections to the node it's
> citing have already been made previously. The particular node it names
> is random and changes from simulation to simulation. We use SSH to
> communicate and we have the ports open for node-to-node communications
> on any port.
I am curious what makes you think the connections to the node it's citing
have been made? Are you sure the connection between two processes have
> Does any user have experience with this error where a connection is
> established, and used for several hours, but after a seemingly random
> period of time the program dies stating it can't make a connection?
Have you tried running the code giving mpirun the "-mca
mpi_preconnect_mpi 1" option? This will try (it isn't complete but
close) to establish all connections at the start of the job.
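
As a rough sketch of what that invocation could look like: the process
count, hostfile name, and executable name below are placeholders I made up,
not taken from your setup, so substitute whatever you normally pass to
mpirun. The only part that comes from the suggestion above is the
"-mca mpi_preconnect_mpi 1" pair.

```shell
# Placeholder values (assumptions, not from the original post).
NP=64                  # total number of MPI processes
HOSTFILE=hosts.txt     # hypothetical hostfile listing the eight nodes
APP=./lsdyna_mpp       # hypothetical name for the LS-DYNA MPI executable

# Assemble the command line; the MCA parameter asks Open MPI to try to
# set up connections during startup instead of on first use.
CMD="mpirun -np $NP --hostfile $HOSTFILE -mca mpi_preconnect_mpi 1 $APP"

# Print the command for review; run it by hand once it looks right.
echo "$CMD"
```

If a connection to one of the nodes cannot be made, the failure should then
tend to show up at startup rather than hours into the run, which would help
narrow down where the problem is.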
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]