On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
> Actually, I understood you correctly. I'm just saying that I find no evidence in the code that we try three times before giving up. What I see is a single attempt to bind the port - if it fails, then we abort. There is no parameter to control that behavior.
> So if the OS hasn't released the port by the time a new job starts on that node, then it will indeed abort if the job was unfortunately given the same port reservation.
FWIW, the OS may be trying multiple times under the covers, but as far as OMPI is concerned, we're just trying once.
OMPI asks for whatever port the OS has open (i.e., we pass in 0 as the port number instead of a specific one, and the OS fills in a free port for us). If it gives us back a port that isn't actually available, that would be really surprising.
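For illustration only, here is a minimal Python sketch of the general "bind to port 0" idiom. This is not OMPI's code (OMPI is written in C), and `bind_ephemeral` is a hypothetical helper name made up for this example. It makes one bind attempt, with no retry loop, and then asks the OS which port it picked:

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, assigned_port).

    Hypothetical helper for illustration; not part of OMPI.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS to choose any free port.
        # Single attempt: if bind() fails, the error propagates to the caller.
        sock.bind((host, 0))
    except OSError:
        sock.close()
        raise
    # getsockname() reports the port number the OS actually assigned.
    port = sock.getsockname()[1]
    return sock, port


if __name__ == "__main__":
    sock, port = bind_ephemeral()
    print("OS assigned port", port)
    sock.close()
```

Since the OS picks the port at bind time, a successful bind means the port was free at that moment, which is why getting back an unavailable port would be surprising.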
If you have a bajillion short jobs running, I wonder if there's some kind of race condition occurring in which some MPI processes are getting messages from the wrong mpirun. And then things go downhill from there.
I can't immediately imagine how that would happen, but maybe there's some kind of weird race condition in there somewhere...? We pass specific IP addresses and ports around on the command line, though, so I don't quite see how that would happen...