Terry,

 

I was under the impression that all connections are made because of the nature of the program that OpenMPI is invoking. LS-DYNA is a finite element solver, and for any given simulation I run, the cores on each node must constantly communicate with one another to check for various occurrences (contact between pieces/parts, updated nodal coordinates, etc.).

 

I’ve run the program using --mca mpi_preconnect_mpi 1, and the simulation started up successfully. I think that means the preconnect step passed, since all of the child processes have started on each individual node. Thanks for the suggestion, though; it’s a good place to start.

 

I’ve been worried (though I have no basis for it) that messages may be getting queued up and hitting some kind of ceiling or timeout. Since LS-DYNA is a finite element code, I think the communication occurs on a large scale: lots of very small packets going back and forth quickly. A few studies have been done by the High Performance Computing Advisory Council (http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf), and they suggest that LS-DYNA communicates at very high rates. I’m not sure I’m reading it correctly, but pg. 15 of that document seems to suggest hundreds of millions of messages in only a few hours. Does OpenMPI build up any kind of buffer or queue if messages are created too quickly? Does it dispatch them immediately, or does it attempt to apply some kind of traffic flow control?
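
To put a rough number on that, here is a back-of-the-envelope sketch in Python. The inputs are placeholders I picked for illustration (300 million messages over 3 hours, and 64 ranks for our 8 nodes with 8 cores each). They are not figures read off the report, and I don’t know whether its counts are per-process or aggregate.

```python
def message_rate(total_messages: int, hours: float) -> float:
    """Average number of messages per second over a run of the given length."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_messages / (hours * 3600.0)


# Hypothetical inputs for illustration only; not taken from the report.
ASSUMED_MESSAGES = 300_000_000
ASSUMED_HOURS = 3.0
ASSUMED_RANKS = 64  # 8 nodes x 8 cores

rate = message_rate(ASSUMED_MESSAGES, ASSUMED_HOURS)
per_rank = rate / ASSUMED_RANKS

print(f"{rate:,.0f} messages/s across the job")
print(f"{per_rank:,.0f} messages/s per rank")
```

With those assumed inputs, the average works out to tens of thousands of small messages per second across the whole job, which is why I’m wondering about queuing.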

 

Regards,

Robert Walters

 


From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Terry Dontje
Sent: Monday, May 02, 2011 1:45 PM
To: users@open-mpi.org
Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused

 

On 05/02/2011 01:27 PM, Robert Walters wrote:

Open-MPI Users,

 

I’ve been using OpenMPI for a while now and am very pleased with it. I use the OpenMPI system across eight Red Hat Linux nodes (8 cores each) on 1 Gbps Ethernet behind a dedicated switch. After working out kinks in the beginning, we’ve been using it periodically on anywhere from 8 to 64 cores. We use a finite element package named LS-DYNA. We do not have the source code for this program; it is compiled to work with OpenMPI 1.4.1 (I use 1.4.2), and we cannot make changes or request code to see how it performs certain functions.

 

From time to time, while I am simulating a particular “job” in LS-DYNA, the run will quit for some reason with an MPI_ABORT and the message “connect to address xx.xxx.xxx.xxx port xxx: Connection refused; trying normal rsh (/usr/bin/rsh).” This error comes after running for hours, which means that connections to the node it’s citing have already been made previously. The particular node it names is random and changes from simulation to simulation. We use SSH to communicate, and we have the ports open for node-to-node communications on any port.

I am curious what makes you think the connections to the node it's citing have been made.  Are you sure the connection between the two processes involved has been made?

 

Does anyone have experience with this error, where a connection is established and used for several hours, but after a seemingly random period of time the program dies, stating it can’t make a connection?

Have you tried running the code with the "-mca mpi_preconnect_mpi 1" option passed to mpirun?  This tries to establish all connections at the start of the job (its coverage isn't complete, but it is close).
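
For a sense of how many connections that involves, here is a small Python sketch. It is not how Open MPI implements the preconnect; it only counts rank pairs. It assumes 64 ranks placed in blocks of 8 per node (rank r on node r // 8) and that pairs on the same node do not need a TCP connection. The placement and the function names are illustrative assumptions.

```python
from itertools import combinations


def all_pairs(num_ranks):
    """Every unordered pair of ranks in the job."""
    return list(combinations(range(num_ranks), 2))


def inter_node_pairs(num_ranks, cores_per_node):
    """Pairs whose ranks sit on different nodes, assuming block placement
    (rank r lives on node r // cores_per_node)."""
    return [
        (a, b)
        for a, b in all_pairs(num_ranks)
        if a // cores_per_node != b // cores_per_node
    ]


# Assumed job shape: 8 nodes x 8 cores.
NUM_RANKS = 64
CORES_PER_NODE = 8

total = len(all_pairs(NUM_RANKS))
across_nodes = len(inter_node_pairs(NUM_RANKS, CORES_PER_NODE))

print(f"rank pairs in total:       {total}")
print(f"rank pairs across nodes:   {across_nodes}")
```

Under those assumptions a 64-rank job has 2016 rank pairs, 1792 of which span two nodes. If the job starts cleanly with the option set, most of those paths were usable at startup.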

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com