You might be right that the connections had already been established, but the error message you quote (connection refused) seems out of place if that were the case.
I was under the impression that all connections get made because of the nature of the program that OpenMPI is invoking. LS-DYNA is a finite element solver, and for any given simulation I run, the cores on each node must constantly communicate with one another to check for various events (contact between parts, updated nodal coordinates, etc.).
Yeah, it could be telling if things do work with this setting.
I’ve run the program with --mca mpi_preconnect_mpi 1 and the simulation started up successfully, which I think means the preconnect step passed, since all of the child processes have started on each individual node. Thanks for the suggestion, though; it’s a good place to start.
The queuing really depends on what type of calls the application is making. If it is doing blocking sends, then I wouldn't expect much queuing to happen with the TCP BTL. As far as traffic flow control is concerned, I believe the TCP BTL doesn't do any for the most part and lets TCP handle that. Maybe someone else on the list could chime in if I am wrong here.
I’ve been worried (though I have no basis for it) that messages may be getting queued up and hitting some kind of ceiling or timeout. As a finite element code, I think LS-DYNA communicates on a large scale, with lots of very small packets going back and forth quickly. A few studies have been done by the High Performance Computing Advisory Council (http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf), and they suggest that LS-DYNA communicates at very, very high rates (I'm not sure, but pg. 15 of that document seems to suggest hundreds of millions of messages in only a few hours). Is there any kind of buffer or queue that OpenMPI builds up if messages are created too quickly? Does it dispatch them immediately, or does it attempt to apply some kind of traffic flow control?
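For a rough sense of scale, the average message rate implied by "hundreds of millions of messages in only a few hours" can be worked out as below. This is only a back-of-the-envelope sketch: the 200 million messages and 3 hours are illustrative placeholders, not figures taken from the report, and the function name is made up for this example.

```python
def average_message_rate(total_messages: float, hours: float) -> float:
    """Return the average number of messages per second over a run."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_messages / (hours * 3600.0)


# Illustrative placeholders only (assumed, not taken from the report).
ASSUMED_TOTAL_MESSAGES = 200e6
ASSUMED_RUN_HOURS = 3.0

rate = average_message_rate(ASSUMED_TOTAL_MESSAGES, ASSUMED_RUN_HOURS)
print(f"{rate:,.0f} messages per second on average")
```

Under those assumed inputs the average comes out to roughly 18,500 messages per second across the whole job. An average like this says nothing about bursts, so it cannot show whether a queue is actually filling up; it only gives a sense of the scale involved.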