Terry,
I was under the impression that all connections had already been made, given
the nature of the program that OpenMPI is invoking. LS-DYNA is a finite element
solver, and for any given simulation I run, the cores on each node must
constantly communicate with one another to check for various occurrences
(contact between parts, updated nodal coordinates, etc.).
I’ve run the program using --mca mpi_preconnect_mpi 1, and the simulation
started up successfully. I think that means the preconnect step passed, since
all of the child processes have started on each individual node. Thanks for the
suggestion, though; it’s a good place to start.
I’ve been worried (though I have no basis for it) that messages may be getting
queued up and hitting some kind of ceiling or timeout. As a finite element
code, LS-DYNA communicates on a large scale, I think: lots of very small
packets going back and forth quickly. A few studies have been done by the High
Performance Computing Advisory Council
(http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf), and they
suggest that LS-DYNA communicates at very high rates (I’m not sure, but p. 15
of that document seems to show hundreds of millions of messages in only a few
hours). Does OpenMPI build up any kind of buffer or queue if messages are
created too quickly? Does it dispatch them immediately, or does it attempt to
apply some kind of flow control?
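
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The message count and run time are illustrative assumptions of mine (the report only suggests "hundreds of millions" over "a few hours"), and I am also assuming the count is an aggregate over all ranks:

```python
# Rough message-rate estimate. All inputs are assumed, illustrative values,
# not figures taken from the HPC Advisory Council report.

def messages_per_second(total_messages: int, hours: float) -> float:
    """Average message rate over the whole run."""
    return total_messages / (hours * 3600.0)

total_messages = 300_000_000  # assumption: "hundreds of millions"
run_hours = 3.0               # assumption: "a few hours"
ranks = 64                    # 8 nodes x 8 cores, as on our cluster

rate = messages_per_second(total_messages, run_hours)
print(f"aggregate: {rate:,.0f} msgs/s")
print(f"per rank:  {rate / ranks:,.0f} msgs/s")
```

With these assumed inputs the aggregate rate comes out in the tens of thousands of messages per second, which is why I wonder about queuing.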
Regards,
Robert Walters
From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Terry Dontje
Sent: Monday, May 02, 2011 1:45 PM
To: users@open-mpi.org
Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused
On 05/02/2011 01:27 PM, Robert Walters wrote:
Open-MPI Users,
I’ve been using OpenMPI for a while now and am very pleased with it. I run
OpenMPI across eight Red Hat Linux nodes (8 cores each) on 1 Gbps Ethernet
behind a dedicated switch. After working out some kinks in the beginning, we’ve
been using it periodically on anywhere from 8 to 64 cores. We use a finite
element package named LS-DYNA. We do not have source code for this program; it
is compiled to work with OpenMPI 1.4.1 (I use 1.4.2), and we cannot make
changes or request code to see how it performs certain functions.
From time to time, while I am simulating a particular “job” in LS-DYNA, it will
for some reason quit OpenMPI, issuing an MPI_ABORT and stating “connect to
address xx.xxx.xxx.xxx port xxx: Connection refused; trying normal rsh
(/usr/bin/rsh).” This error comes after the job has been running for hours,
which means that connections to the node it’s citing have already been made.
The particular node it names is random and changes from simulation to
simulation. We use SSH to communicate, and we have all ports open for
node-to-node communication.
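
One way to check the "ports are open" assumption from a given node is a small TCP probe. This is only a sketch: `probe` is a hypothetical helper name, and it tells "connection refused" (nothing listening, or an active reject) apart from a timeout (traffic silently dropped). It says nothing about how OpenMPI itself opens its connections.

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and classify the outcome."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # The remote end answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (e.g. packets dropped).
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable network, and so on.
        return f"error: {exc.strerror or exc}"
    finally:
        s.close()
```

For example, `probe("node03", 22)` would test the SSH port on a node; the host name is a placeholder, and 22 is just the default SSH port. Running it in a loop from each node while a job is active might show whether refusals are intermittent.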
I am curious what makes you think the connections to the node it's citing have
been made. Are you sure the connection between the two processes has been
made?
Does any user have experience with this error where a
connection is established, and used for several hours, but after a seemingly
random period of time the program dies stating it can’t make a
connection?
Have you tried running the code with the "-mca mpi_preconnect_mpi 1" option
given to mpirun? This will try to establish all connections at the start of the
job (it isn't complete, but it is close).
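
A minimal sketch of what that launch line might look like, assembled in Python so it is easy to drop into a job script. The solver binary name, its input argument, and the host file name are placeholders I am assuming; only the `--mca mpi_preconnect_mpi 1` pair comes from the suggestion itself.

```python
import shlex

def build_mpirun_cmd(np, hostfile, solver_cmd, preconnect=True):
    """Return an mpirun argument list, optionally enabling preconnect."""
    cmd = ["mpirun", "-np", str(np), "--hostfile", hostfile]
    if preconnect:
        # Ask Open MPI to set up connections at startup instead of lazily.
        cmd += ["--mca", "mpi_preconnect_mpi", "1"]
    return cmd + list(solver_cmd)

# Hypothetical values: 64 ranks, a host file, and a placeholder solver command.
cmd = build_mpirun_cmd(64, "hosts.txt", ["./mpp-dyna", "i=input.k"])
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)`; keeping it as a list avoids shell quoting problems.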
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering |
+1.781.442.2631
Oracle - Performance Technologies
Email terry.dontje@oracle.com