Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2011-05-03 09:38:15

Looking at your output more, the "Connect to address" message below doesn't
match any message I see in the source code. Also, "trying normal
/usr/bin/rsh" looks odd to me.

You may want to set the MCA parameter mpi_abort_delay, attach a
debugger to the aborting process, and dump out a stack trace. That
should give a better idea of where the failure is being triggered. You can
look at question 4 for more info on the parameter.
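A minimal sketch of that workflow (the delay value and launch line are placeholders, not taken from Robert's run; the PID shown is just the one that appears in his log):

```shell
# Sketch: hold an aborting rank alive so a debugger can attach.
# mpi_abort_delay is an Open MPI MCA parameter: a positive value delays
# the abort that many seconds; a negative value delays indefinitely.
mpirun --mca mpi_abort_delay -1 -np 56 ./your_lsdyna_launch_line   # placeholder command

# Then, on the node hosting the stuck rank, grab a backtrace
# (24488 here is only an example PID, from the mpirun message below):
gdb -p 24488 -ex bt -ex detach -ex quit
```

These commands have to run on the cluster itself, so treat them as a template rather than something to paste verbatim.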


On 05/02/2011 03:40 PM, Robert Walters wrote:
> I've attached the typical error message I've been getting. This is
> from a run I initiated this morning. The first few lines or so are
> related to the LS-DYNA program and are just there to let you know it's
> been running successfully for an hour and a half.
> What's interesting is that this doesn't happen on every job I run, but
> it does recur for the same simulation. For instance, Simulation A will
> run for 40 hours and complete successfully. Simulation B will run for
> 6 hours and die from an error, and any further attempts to run
> Simulation B will always end in an error. This makes me think there is
> some kind of bad calculation happening that OpenMPI doesn't know how
> to handle, or LS-DYNA doesn't know how to pass to OpenMPI. On the
> other hand, this particular simulation is one of those "benchmarks"
> that everyone runs, so I should not be getting errors from the FE code
> itself.
> Odd... I think I'll try this as an SMP job as well as an MPP job over
> a single node and see if the issue continues. That way I can figure
> out whether it's OpenMPI-related or FE-code-related, but as I
> mentioned, I don't think it is FE-code-related, since others have
> successfully run this particular benchmarking simulation.
> *_Error Message:_*
> Parallel execution with 56 MPP proc
> NLQ used/max 152/ 152
> Start time 05/02/2011 10:02:20
> End time 05/02/2011 11:24:46
> Elapsed time 4946 seconds( 1 hours 22 min. 26 sec.) for 9293
> cycles
> E r r o r t e r m i n a t i o n
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1525207032.
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> connect to address port 544: Connection refused
> connect to address port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 24488 on
> node allision exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> Regards,
> Robert Walters
> ------------------------------------------------------------------------
> *From:*users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> *On Behalf Of *Terry Dontje
> *Sent:* Monday, May 02, 2011 2:50 PM
> *To:* users_at_[hidden]
> *Subject:* Re: [OMPI users] OpenMPI LS-DYNA Connection refused
> On 05/02/2011 02:04 PM, Robert Walters wrote:
> Terry,
> I was under the impression that all connections are made because of
> the nature of the program that OpenMPI is invoking. LS-DYNA is a
> finite element solver and for any given simulation I run, the cores on
> each node must constantly communicate with one another to check for
> various occurrences (contact with various pieces/parts, updating nodal
> coordinates, etc...).
> You might be right, the connections might have been established but
> the error message you state (connection refused) seems out of place if
> the connection was already established.
> Were there any more error messages from OMPI other than "connection
> refused"? If so, could you possibly provide that output to us? Maybe
> it will give us a hint where in the library things are messing up.
> I've run the program using --mca mpi_preconnect_mpi 1 and the
> simulation has started itself up successfully which I think means that
> the mpi_preconnect passed since all of the child processes have
> started up on each individual node. Thanks for the suggestion though,
> it's a good place to start.
> Yeah, it possibly could be telling if things do work with this setting.
> I've been worried (though I have no basis for it) that messages may be
> getting queued up and hitting some kind of ceiling or timeout. As a
> finite element code, I think the communication occurs on a large
> scale. Lots of very small packets going back and forth quickly. A few
> studies have been done by the High Performance Computing Advisory
> Council, and
> they've suggested that LS-DYNA communicates at very, very high rates
> (Not sure but from pg.15 of that document they're suggesting hundreds
> of millions of messages in only a few hours). Is there any kind of
> buffer or queue that OpenMPI develops if messages are created too
> quickly? Does it dispatch them immediately or does it attempt to apply
> some kind of traffic flow control?
> The queuing really depends on what type of calls the application is
> making. If it is doing blocking sends, then I wouldn't expect much
> queuing to happen with the tcp btl. As far as traffic flow control
> is concerned, I believe the tcp btl doesn't do any for the most part
> and lets TCP handle that. Maybe someone else on the list could chime
> in if I am wrong here.
> In the past I have seen heavy traffic on the network, and to a
> particular node, cause some connections not to be established. But
> I don't know of any outstanding issues like that right now.
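On the flow-control question above, here is a tiny self-contained sketch (plain Python sockets, nothing Open MPI-specific; an assumption-level illustration, not how the tcp btl is implemented) of the TCP-level backpressure Terry refers to when the sender outruns the receiver:

```python
import socket

# Illustration only: when the tcp btl "lets TCP handle" flow control,
# what that means is the kernel socket buffers and the TCP window fill
# up if the receiver isn't draining data, and a non-blocking send
# eventually fails with EWOULDBLOCK -- the backpressure the sender feels.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()          # receiver exists but never calls recv()

cli.setblocking(False)
sent = 0
try:
    while True:                 # push data until TCP pushes back
        sent += cli.send(b"x" * 65536)
except BlockingIOError:
    pass                        # buffers full: TCP flow control kicked in

print(sent > 0)                 # some data was accepted before backpressure
cli.close(); conn.close(); srv.close()
```

The point of the sketch is that the "queue" Robert asks about lives largely in the kernel, not in the MPI library, when blocking or eager sends ride directly on TCP.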

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>