Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-04-25 19:52:47


On 4/25/08 5:38 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:

> To follow up on George's last remark: currently, when a job dies without
> unsubscribing a port via Unpublish (due to poor user programming,
> failure, or abort), ompi-server keeps the reference forever, and a new
> application therefore cannot publish under the same name again. So I
> think this would be a good point at which to correctly clean up all
> published/opened ports when the application ends (for whatever reason).

That's a good point - in my other note, all I had addressed was closing my
local port. We should ensure that the pubsub framework unpublishes
anything we put out there. I'll have to create a command to do that, since
pubsub doesn't actually track what it was asked to publish - we'll need
something that tells both the local and global data servers to "unpublish
anything that came from me".
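
For illustration, here is a minimal sketch of the failure mode (the
service name "my-service" is made up); a job that dies between the
publish and the unpublish leaves a stale entry on the server:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char port[MPI_MAX_PORT_NAME];

        MPI_Init(&argc, &argv);
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my-service", MPI_INFO_NULL, port);

        /* ... if the job aborts here, the server still holds
         * "my-service", and a later publish of the same name
         * from another job will fail ... */

        /* the cleanup path we need to guarantee */
        MPI_Unpublish_name("my-service", MPI_INFO_NULL, port);
        MPI_Close_port(port);
        MPI_Finalize();
        return 0;
    }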

>
> Another cool feature would be to have mpirun behave as an ompi-server
> and publish a suitable URI if requested to do so (if the urifile does
> not exist yet?). I know from the source code that mpirun already
> includes everything needed to offer this feature, except the ability to
> provide a suitable URI.

Just to be sure I understand, since I think this is doable: mpirun already
serves as your "ompi-server" for any job it spawns - that is the purpose
of the MPI_Info flag "local" (instead of "global") when you publish
information. You can always publish/lookup against your own mpirun.
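
A minimal sketch of the two scopes, for reference - the info key name
"ompi_global_scope" is from memory and should be double-checked:

    char port[MPI_MAX_PORT_NAME];
    MPI_Info info;

    MPI_Open_port(MPI_INFO_NULL, port);

    MPI_Info_create(&info);
    /* "false" (the default) publishes to the local data server hosted
     * by this job's own mpirun; "true" targets the system-level
     * ompi-server (key name assumed - verify before relying on it) */
    MPI_Info_set(info, "ompi_global_scope", "true");
    MPI_Publish_name("my-service", info, port);
    MPI_Info_free(&info);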

What you are suggesting here is that we have each mpirun put its local data
server port info somewhere that another job can find it, either in the
already existing contact_info file, or perhaps in a separate "data server
uri" file?

The only reason for concern here is the obvious race condition. Since mpirun
only exists while a job is running, you could look up its contact info and
attempt to publish/lookup against that mpirun, only to find it doesn't
respond because it is either already dead or on its way out. Hence the
notion of restricting inter-job operations to the system-level ompi-server.

If we can think of a way to deal with the race condition, I'm certainly
willing to publish the contact info. I'm just concerned that you may find
yourself "hung" if that mpirun goes away unexpectedly - say right in the
middle of a publish/lookup operation.
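
If we did publish the contact info, the safest usage pattern I can see is
a sketch like the following - switch to returned errors and treat a failed
lookup as "that mpirun is gone" (whether every failure mode surfaces as a
clean error return is exactly the open question):

    char port[MPI_MAX_PORT_NAME];
    int rc;

    /* return error codes instead of aborting the job */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Lookup_name("my-service", MPI_INFO_NULL, port);
    if (MPI_SUCCESS != rc) {
        /* the publishing job's mpirun may already be gone -
         * retry against the system-level ompi-server or give up */
    }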

Ralph

>
> Aurelien
>
> On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:
>
>> Ralph,
>>
>> Thanks for your concern regarding our implementation's level of
>> compliance with the MPI standard. I don't know who the MPI gurus you
>> talked with about this issue were, but I can tell you that, for once,
>> the MPI standard is pretty clear about this.
>>
>> As Aurelien stated in his last email, the use of the plural in several
>> sentences strongly suggests that the status of the port should not be
>> implicitly modified by MPI_Comm_accept or MPI_Comm_connect.
>> Moreover, at the beginning of the chapter, the MPI standard specifies
>> that connect/accept work exactly as in TCP. In other words, once the
>> port is opened, it stays open until the user explicitly closes it.
>>
>> However, not all corner cases are addressed by the MPI standard.
>> What happens on MPI_Finalize ... it's a good question. Personally, I
>> think we should stick with the TCP similarities. The port should be
>> not only closed but unpublished. This would solve all the issues with
>> people trying to look up a port once the originator is gone.
>>
>> george.
>>
>> On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:
>>
>>> As I said, it makes no difference to me. I just want to ensure that
>>> everyone agrees on the interpretation of the MPI standard. We have
>>> had these discussions in the past, with differing views. My guess
>>> here is that the port was left open mostly because the person who
>>> wrote the C binding forgot to close it. ;-)
>>>
>>> So, you MPI folks: do we allow multiple connections against a single
>>> port, and leave the port open until explicitly closed? If so, then do
>>> we generate an error if someone calls MPI_Finalize without first
>>> closing the port? Or do we automatically close any open ports when
>>> finalize is called?
>>>
>>> Or do we automatically close the port after the connect/accept is
>>> completed?
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>>
>>> On 4/25/08 3:13 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>> wrote:
>>>
>>>> Actually, the port was still left open forever before the change.
>>>> The bug damaged the port string, making it unusable not only in
>>>> subsequent Comm_accept calls but also in Close_port and
>>>> Unpublish_name.
>>>>
>>>> To answer your open-port concern more specifically: if the user no
>>>> longer wants the port open, he should explicitly call MPI_Close_port
>>>> and not rely on MPI_Comm_accept to close it. Actually, the standard
>>>> suggests the exact opposite: section 5.4.2 states "it must call
>>>> MPI_Open_port to establish a port [...] it must call MPI_Comm_accept
>>>> to accept connections from clients". Because there are multiple
>>>> clients AND multiple connections in that sentence, I assume the port
>>>> can be used in multiple accepts.
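>>>>
>>>> To illustrate that reading, a minimal server sketch that reuses one
>>>> port across several accepts and closes it only at the end
>>>> (NUM_CLIENTS is a placeholder):
>>>>
>>>>     char port[MPI_MAX_PORT_NAME];
>>>>     MPI_Comm client;
>>>>     int i;
>>>>
>>>>     MPI_Open_port(MPI_INFO_NULL, port);
>>>>     MPI_Publish_name("my-service", MPI_INFO_NULL, port);
>>>>     for (i = 0; i < NUM_CLIENTS; i++) {
>>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>                         &client);
>>>>         /* ... service this client ... */
>>>>         MPI_Comm_disconnect(&client);
>>>>     }
>>>>     MPI_Unpublish_name("my-service", MPI_INFO_NULL, port);
>>>>     MPI_Close_port(port);   /* the only explicit close */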
>>>>
>>>> Aurelien
>>>>
>>>> On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:
>>>>
>>>>> Hmmm... just to clarify, this wasn't a "bug". It was my
>>>>> understanding, per the MPI folks, that a separate, unique port had
>>>>> to be created for every invocation of Comm_accept. They didn't want
>>>>> a port hanging around open, and their plan was to close the port
>>>>> immediately after the connection was established.
>>>>>
>>>>> So dpm_orte was written to that specification. When I reorganized
>>>>> the code, I left the logic as it had been written - which was
>>>>> actually done by the MPI side of the house, not me.
>>>>>
>>>>> I have no problem with making the change. However, since the
>>>>> specification was created on the MPI side, I just want to make sure
>>>>> that the MPI folks all realize this has now been changed. Obviously,
>>>>> if this change in spec is adopted, someone needs to make sure that
>>>>> the C and Fortran bindings do *not* close that port anymore!
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 4/25/08 2:41 PM, "bouteill_at_[hidden]" <bouteill_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Author: bouteill
>>>>>> Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
>>>>>> New Revision: 18303
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/18303
>>>>>>
>>>>>> Log:
>>>>>> Fix a bug that prevented using the same port (as returned by
>>>>>> Open_port) for several Comm_accept calls.
>>>>>>
>>>>>>
>>>>>> Text files modified:
>>>>>> trunk/ompi/mca/dpm/orte/dpm_orte.c | 19 ++++++++++---------
>>>>>> 1 files changed, 10 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
>>>>>> ==================================================================
>>>>>> --- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
>>>>>> +++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
>>>>>> @@ -848,8 +848,14 @@
>>>>>> {
>>>>>> char *tmp_string, *ptr;
>>>>>>
>>>>>> + /* copy the RML uri so we can return a malloc'd value
>>>>>> + * that can later be free'd
>>>>>> + */
>>>>>> + tmp_string = strdup(port_name);
>>>>>> +
>>>>>> /* find the ':' demarking the RML tag we added to the end */
>>>>>> - if (NULL == (ptr = strrchr(port_name, ':'))) {
>>>>>> + if (NULL == (ptr = strrchr(tmp_string, ':'))) {
>>>>>> + free(tmp_string);
>>>>>> return NULL;
>>>>>> }
>>>>>>
>>>>>> @@ -863,15 +869,10 @@
>>>>>> /* see if the length of the RML uri is too long - if so,
>>>>>> * truncate it
>>>>>> */
>>>>>> - if (strlen(port_name) > MPI_MAX_PORT_NAME) {
>>>>>> - port_name[MPI_MAX_PORT_NAME] = '\0';
>>>>>> + if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
>>>>>> + tmp_string[MPI_MAX_PORT_NAME] = '\0';
>>>>>> }
>>>>>> -
>>>>>> - /* copy the RML uri so we can return a malloc'd value
>>>>>> - * that can later be free'd
>>>>>> - */
>>>>>> - tmp_string = strdup(port_name);
>>>>>> -
>>>>>> +
>>>>>> return tmp_string;
>>>>>> }
>>>>>>