Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] srun and openmpi
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-04-27 14:46:28

On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>> Was this ever committed to the OMPI src as something not having to be
>>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>>> does?
>> Not that I know of - I don't think the PSM developers ever looked at it.
>>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>>> setup of this key. Namely, with Slurm you cannot export variables
>>> from the --prolog of an srun, only from a --task-prolog.
>>> Unfortunately, if you use a task-prolog, each rank gets a different
>>> key, which doesn't work.
>>> I'm also guessing that each unique mpirun needs its own PSM key, not
>>> one for the whole system, so I can't just make it a permanent
>>> parameter somewhere else.
>>> Also, I recall reading somewhere that the --resv-ports parameter that
>>> OMPI uses from Slurm to choose a list of ports to use for TCP comms
>>> tries to lock a port from the pool three times before giving up.
>> Had to look back at the code - I think you misread this. I can find no evidence in the code that we try to bind that port more than once.
> Perhaps I misstated. I don't believe you're trying to bind to the same
> port twice during the same session. I believe the code re-uses
> similar ports from session to session. What I believe happens (but I
> could be totally wrong) is that the previous session releases the port,
> but Linux isn't quite done with it when the new session tries to bind
> to the port, in which case it tries three times and then fails the job.

Actually, I understood you correctly. I'm just saying that I find no evidence in the code that we try three times before giving up. What I see is a single attempt to bind the port - if it fails, then we abort. There is no parameter to control that behavior.

So if the OS hasn't released the port by the time a new job starts on that node, the new job will indeed abort if it is unlucky enough to be given the same port reservation.
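
The single-attempt behavior described above can be illustrated with a small Python sketch. This is not Open MPI's code: the helper name bind_once and the use of a loopback listening socket as a stand-in for the previous job are assumptions made for illustration only. It shows one bind attempt that reports failure immediately when the port is still held, with no retry loop.

```python
import errno
import socket


def bind_once(port, host="127.0.0.1"):
    """Try exactly once to bind a TCP socket to (host, port).

    Returns the bound socket on success, or None if the address is
    still in use. Hypothetical helper, not part of Open MPI or Slurm.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # no retry: the caller is expected to abort
        raise
    return sock


# Stand-in for a previous job that has not yet let go of its port.
previous = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
previous.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
previous.listen(1)
reserved_port = previous.getsockname()[1]

# A new "job" handed the same port fails on its one and only attempt.
second = bind_once(reserved_port)

# Once the port has really been released, a single attempt succeeds.
previous.close()
retry = bind_once(reserved_port)
released_ok = retry is not None
if retry is not None:
    retry.close()
```

In this sketch the first call returns None because the port is still held, which corresponds to the abort case; whether a real job hits that case depends on how quickly the OS releases the port and on which reservation the job receives.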
