
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2011-05-03 09:38:15


Looking at your output more closely, the "connect to address" message below
doesn't match any messages I see in the source code. Also, "trying normal
rsh (/usr/bin/rsh)" looks odd to me.

You may want to set the MCA parameter mpi_abort_delay, attach a
debugger to the aborting process, and dump out a stack trace. That
should give a better idea of where the failure is being triggered. You can
look at http://www.open-mpi.org/faq/?category=debugging, question 4, for
more info on the parameter.
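
For example, something along these lines (just a sketch; the LS-DYNA launch
line and the PID are placeholders for whatever you normally run, and
mpi_abort_print_stack is an optional extra that isn't strictly needed --
if I recall correctly, a negative delay means wait indefinitely):

  # Make the aborting rank print an identifying message and sleep instead of
  # exiting when MPI_ABORT is hit, and also print a stack trace on abort.
  mpirun --mca mpi_abort_delay -1 --mca mpi_abort_print_stack 1 <your usual LS-DYNA command line>

  # Then, on the node named in the abort message, attach a debugger to the
  # stuck rank and dump a backtrace:
  gdb -p <PID of the aborting process>
  (gdb) bt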

--td

On 05/02/2011 03:40 PM, Robert Walters wrote:
>
> I've attached the typical error message I've been getting. This is
> from a run I initiated this morning. The first few lines or so are
> related to the LS-DYNA program and are just there to let you know it
> was running successfully for an hour and a half.
>
> What's interesting is that this doesn't happen on every job I run, but it
> will recur for the same simulation. For instance, Simulation A will run for
> 40 hours and complete successfully. Simulation B will run for 6
> hours and die from an error. Any further attempts to run Simulation B
> will always end with an error. This makes me think there is some kind
> of bad calculation happening that OpenMPI doesn't know how to handle,
> or that LS-DYNA doesn't know how to pass to OpenMPI. On the other hand,
> this particular simulation is one of those "benchmarks" that everyone
> runs. I should not be getting errors from the FE code itself.
> Odd... I think I'll try this as an SMP job as well as an MPP job over
> a single node and see if the issue continues. That way I can figure
> out whether it's OpenMPI-related or FE-code-related, but as I mentioned, I
> don't think it is FE-code-related since others have successfully run
> this particular benchmarking simulation.
>
> Error Message:
>
> Parallel execution with 56 MPP proc
> NLQ used/max 152/ 152
> Start time 05/02/2011 10:02:20
> End time 05/02/2011 11:24:46
> Elapsed time 4946 seconds( 1 hours 22 min. 26 sec.) for 9293 cycles
>
> E r r o r t e r m i n a t i o n
>
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1525207032.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> connect to address xx.xxx.xx.xxx port 544: Connection refused
> connect to address xx.xxx.xx.xxx port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 24488 on
> node allision exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> Regards,
>
> Robert Walters
>
> ------------------------------------------------------------------------
>
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Terry Dontje
> Sent: Monday, May 02, 2011 2:50 PM
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OpenMPI LS-DYNA Connection refused
>
> On 05/02/2011 02:04 PM, Robert Walters wrote:
>
>> Terry,
>>
>> I was under the impression that all connections are made because of
>> the nature of the program that OpenMPI is invoking. LS-DYNA is a
>> finite element solver, and for any given simulation I run, the cores on
>> each node must constantly communicate with one another to check for
>> various occurrences (contact with various pieces/parts, updating nodal
>> coordinates, etc.).
>
> You might be right; the connections might have been established, but
> the error message you state (connection refused) seems out of place if
> the connection was already established.
>
> Were there more error messages from OMPI other than "connection
> refused"? If so, could you possibly provide that output to us? Maybe
> it will give us a hint about where in the library things are messing up.
>
>> I've run the program using --mca mpi_preconnect_mpi 1 and the
>> simulation has started itself up successfully, which I think means that
>> the mpi_preconnect passed since all of the child processes have
>> started up on each individual node. Thanks for the suggestion though,
>> it's a good place to start.
>
> Yeah, it possibly could be telling if things do work with this setting.
>
>> I've been worried (though I have no basis for it) that messages may be
>> getting queued up and hitting some kind of ceiling or timeout. As a
>> finite element code, I think the communication occurs on a large
>> scale: lots of very small packets going back and forth quickly. A few
>> studies have been done by the High Performance Computing Advisory
>> Council
>> (http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf) and
>> they've suggested that LS-DYNA communicates at very, very high rates
>> (not sure, but from pg. 15 of that document they're suggesting hundreds
>> of millions of messages in only a few hours). Is there any kind of
>> buffer or queue that OpenMPI builds up if messages are created too
>> quickly? Does it dispatch them immediately, or does it attempt to apply
>> some kind of traffic flow control?
>
> The queuing really depends on what type of calls the application is
> making. If it is doing blocking sends then I wouldn't expect too much
> queuing happening using the tcp btl. As far as traffic flow control
> is concerned I believe the tcp btl doesn't do any for the most part
> and lets tcp handle that. Maybe someone else on the list could chime
> in if I am wrong here.
>
> In the past I have seen lots of traffic on the network, and to a
> particular node, cause some connections not to be established. But
> I don't know of any outstanding issues of that sort right now.
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
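
For reference, the preconnect run Robert mentions above is just the normal
launch with one extra MCA flag, roughly like this (a sketch only; the
hostfile name and the LS-DYNA command line are placeholders for whatever you
normally run, with -np 56 matching the 56 MPP processes shown in the log):

  # Force all MPI connections to be established during MPI_Init rather than
  # lazily on first use, so connection problems surface at startup.
  mpirun --mca mpi_preconnect_mpi 1 -np 56 -hostfile <your hostfile> <your usual LS-DYNA command line>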

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


