On Apr 27, 2011, at 1:27 PM, Jeff Squyres wrote:
> On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
>> Actually, I understood you correctly. I'm just saying that I find no evidence in the code that we try three times before giving up. What I see is a single attempt to bind the port - if it fails, then we abort. There is no parameter to control that behavior.
>> So if the OS hasn't released the port by the time a new job starts on that node, then it will indeed abort if the job was unfortunately given the same port reservation.
> FWIW, the OS may be trying multiple times under the covers, but as far as OMPI is concerned, we're just trying once.
> OMPI asks for whatever port the OS has open (i.e., we pass in 0 instead of a specific port number, and the OS fills it in for us). If it gives us back a port that isn't actually available, that would be really surprising.
Nope, nope, nope... in this mode of operation, we are using -static- ports.
The problem here is that srun is incorrectly handing out the same port reservation to the next job, causing the port binding to fail because the last job's binding hasn't yet timed out.
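To make the difference concrete, here is a rough sketch in Python (not OMPI code; the names bind_once and demo are made up for illustration). It makes a single bind attempt with no retry, first with port 0 so the OS picks a free port, then with that same number as a "static" port while the first socket still holds it. Holding the first socket open is only a stand-in for the previous job's binding not yet having timed out; the sketch assumes a loopback interface is available.

```python
import errno
import socket


def bind_once(port, host="127.0.0.1"):
    """Make one bind attempt, no retry; return the socket or raise OSError."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        sock.close()
        raise
    return sock


def demo():
    # Dynamic mode: port 0 asks the OS to fill in any free port.
    first = bind_once(0)
    first.listen(1)
    port = first.getsockname()[1]

    # Static mode: a second "job" is handed the same port number
    # while the first binding is still in place.
    try:
        second = bind_once(port)
    except OSError as exc:
        err = exc.errno
    else:
        second.close()
        err = None
    finally:
        first.close()
    return port, err


if __name__ == "__main__":
    port, err = demo()
    print("OS-assigned port:", port)
    print("second bind failed with EADDRINUSE:", err == errno.EADDRINUSE)
```

With port 0 the bind should not collide, since the OS picks the port. With a fixed port number, the one attempt fails as soon as something else still holds that port, which is the situation a repeated reservation creates.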
> If you have a bajillion short jobs running, I wonder if there's some kind of race condition in which some MPI processes get messages from the wrong mpirun. And then things go downhill from there.
> I can't immediately imagine how that would happen, but maybe there's some kind of weird race condition in there somewhere...? We pass specific IP addresses and ports around on the command line, though, so I don't quite see how that would happen...
> Jeff Squyres