I’ve attached the typical error message I’ve been getting. This is from a run
I initiated this morning. The first few lines are from the LS-DYNA program and
are there just to show that it ran successfully for about an hour and a half.

What’s interesting is that this doesn’t happen on every job I run, but it does
recur for the same simulation. For instance, Simulation A will run for 40
hours and complete successfully. Simulation B will run for 6 hours and die
from an error. Any further attempts to run Simulation B will always end in an
error. This makes me think there is some kind of bad calculation happening
that OpenMPI doesn’t know how to handle, or that LS-DYNA doesn’t know how to
pass to OpenMPI. On the other hand, this particular simulation is one of those
“benchmarks” that everyone runs, so I should not be getting errors from the FE
code itself. Odd…

I think I’ll try this as an SMP job, as well as an MPP job on a single node,
and see if the issue continues. That way I can figure out whether it’s OpenMPI
related or FE code related. As I mentioned, though, I don’t think it is FE
code related, since others have successfully run this particular benchmark
simulation.

Error Message:

Parallel execution with 56 MPP proc
NLQ used/max 152/ 152
Start time 05/02/2011 10:02:20
End time 05/02/2011 11:24:46
Elapsed time 4946 seconds( 1 hours 22 min. 26 sec.) for 9293 cycles
E r r o r   t e r m i n a t i o n
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1525207032.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
connect to address xx.xxx.xx.xxx port 544: Connection refused
connect to address xx.xxx.xx.xxx port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 24488 on
node allision exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
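
For comparing a failing run against a successful one, the numbers in this output can be pulled out mechanically. The following is a minimal Python sketch, not anything LS-DYNA or Open MPI ships: the function name `summarize` and the field names are made up for illustration, and the regular expressions assume the log wording shown above.

```python
import re

# Excerpt of the output quoted above. The regexes treat any run of
# whitespace (including line breaks) as a separator, so wrapped lines work.
SAMPLE_LOG = """
Elapsed time 4946 seconds( 1 hours 22 min. 26 sec.) for 9293 cycles
connect to address xx.xxx.xx.xxx port 544: Connection refused
connect to address xx.xxx.xx.xxx port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
"""


def summarize(log):
    """Return elapsed time, cycle count and refused-connection counts.

    Hypothetical helper: the keys below are illustrative names only.
    """
    summary = {
        "elapsed_s": None,
        "cycles": None,
        "seconds_per_cycle": None,
        "refused_ports": {},
    }

    # "Elapsed time <N> seconds(...) for <M> cycles"
    match = re.search(
        r"Elapsed time\s+(\d+)\s+seconds.*?for\s+(\d+)\s+cycles",
        log,
        re.DOTALL,
    )
    if match:
        elapsed = int(match.group(1))
        cycles = int(match.group(2))
        summary["elapsed_s"] = elapsed
        summary["cycles"] = cycles
        if cycles > 0:
            summary["seconds_per_cycle"] = elapsed / cycles

    # Count "port <P>: Connection refused" lines per port number.
    for port in re.findall(r"port\s+(\d+):\s+Connection refused", log):
        key = int(port)
        summary["refused_ports"][key] = summary["refused_ports"].get(key, 0) + 1

    return summary
```

Calling `summarize(SAMPLE_LOG)` on the excerpt gives roughly 0.53 seconds per cycle and two refused connections on port 544, which makes it easy to see whether other failing runs die at a similar point.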
Regards,
Robert Walters
From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Terry Dontje
Sent: Monday, May 02, 2011 2:50 PM
To: users@open-mpi.org
Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused
On 05/02/2011 02:04 PM, Robert Walters wrote:
I was under the impression that all connections
are made because of the nature of the program that OpenMPI is invoking. LS-DYNA
is a finite element solver and for any given simulation I run, the cores on
each node must constantly communicate with one another to check for various
occurrences (contact with various pieces/parts, updating nodal coordinates,
etc…).
You might be right; the connections might have been established, but the error
message you cite (connection refused) seems out of place if the connection was
already established.
Were there any more error messages from OMPI other than "connection refused"?
If so, could you provide that output to us? It may give us a hint as to where
in the library things are going wrong.
Yeah, it possibly could be telling if things do work with this setting.
I’ve been worried (though I have no basis for it) that messages may be getting
queued up and hitting some kind of ceiling or timeout. Since LS-DYNA is a
finite element code, I think the communication occurs on a large scale: lots
of very small packets going back and forth quickly. A few studies have been
done by the High Performance Computing Advisory Council (http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf)
and they suggest that LS-DYNA communicates at very, very high rates (I’m not
sure, but page 15 of that document suggests hundreds of millions of messages
in only a few hours). Is there any kind of buffer or queue that OpenMPI builds
up if messages are created too quickly? Does it dispatch them immediately, or
does it attempt to apply some kind of traffic flow control?
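
To put "hundreds of millions of messages in only a few hours" into per-second terms, here is a small Python sketch of the arithmetic. The figures of 300 million messages and 3 hours are illustrative assumptions chosen only to fall within that description; they are not numbers taken from the cited study.

```python
def messages_per_second(total_messages, hours):
    """Average message rate over a run of the given length."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_messages / (hours * 3600.0)


# Illustrative assumptions, NOT values from the HPC Advisory Council document.
ASSUMED_TOTAL_MESSAGES = 300_000_000
ASSUMED_HOURS = 3

rate = messages_per_second(ASSUMED_TOTAL_MESSAGES, ASSUMED_HOURS)
print(f"{rate:,.0f} messages per second on average")
```

Under those assumptions the average works out to somewhere around 28,000 messages per second across the whole job, which gives a sense of scale for the queuing question.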
The queuing really depends on what type of calls the application is making.
If it is doing blocking sends, then I wouldn't expect much queuing to happen
with the tcp btl. As far as traffic flow control is concerned, I believe
the tcp btl doesn't do any for the most part and lets TCP handle that.
Maybe someone else on the list could chime in if I am wrong here.

In the past I have seen cases where lots of traffic on the network and to a
particular node has caused some connections not to be established. But I
don't know of any outstanding issue of that kind right now.
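
One way to tell a port that is consistently refused from one that fails only intermittently under load is to probe it directly, outside of the MPI job. The Python sketch below uses only the standard `socket` module; the helper names `probe` and `probe_repeatedly` are hypothetical. As a side note, port 544 is the registered kshell (Kerberized rsh) port, and the "trying normal rsh" line suggests the refusals may come from an rsh fallback rather than from MPI traffic; that reading is an assumption and has not been confirmed in this thread.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout (filtered port or overloaded host).
        return "timeout"
    except OSError as exc:
        # Anything else: name resolution failure, unreachable network, etc.
        return "error: " + str(exc)


def probe_repeatedly(host, port, attempts=5, timeout=3.0):
    """Probe several times and count how often each outcome occurs."""
    counts = {}
    for _ in range(attempts):
        result = probe(host, port, timeout)
        counts[result] = counts.get(result, 0) + 1
    return counts
```

For example, `probe_repeatedly("allision", 544)` run from another node would show whether the refusal seen in the log is constant. If every attempt comes back "refused", the port is simply closed on that node rather than dropping connections under load.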
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
Email terry.dontje@oracle.com