Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-04-25 21:46:26


True - and I'm all for simple!

Unless someone objects, let's just leave it that way for now.

I'll put it on my list to look at this later - maybe count how many
publishes we do vs. unpublishes, and if there is a residual at finalize,
then send the "unpublish all" message. That still leaves a race condition,
though, so the fail-on-timeout is always going to have to be there anyway.
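
A minimal sketch of that bookkeeping, using hypothetical helper names (none
of these exist in the pubsub framework today):

    /* All names here are hypothetical placeholders, not existing OMPI code. */
    static int num_published   = 0;   /* bumped on every successful publish   */
    static int num_unpublished = 0;   /* bumped on every successful unpublish */

    static void pubsub_send_unpublish_all(void)
    {
        /* placeholder: would tell both the local and the global data server
         * to drop anything published by this job */
    }

    void pubsub_note_publish(void)   { num_published++; }
    void pubsub_note_unpublish(void) { num_unpublished++; }

    /* called from the pubsub finalize path */
    void pubsub_finalize_cleanup(void)
    {
        if (num_published > num_unpublished) {
            pubsub_send_unpublish_all();
        }
        /* a lookup can still race with our exit, so clients keep the
         * fail-on-timeout path regardless */
    }
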

Will ponder and revisit later...

Thanks!
Ralph

On 4/25/08 7:26 PM, "George Bosilca" <bosilca_at_[hidden]> wrote:

> We always have the possibility to fail the MPI_Comm_connect. There is
> a specific error for this: MPI_ERR_PORT. We can detect that the port is
> not available anymore (whatever the reason is) by simply using the
> TCP timeout on the connection. It's the best we can do, and this will
> give us a simplified way of handling things ...
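
Purely as an illustration of what that failure path could look like from the
client side (a sketch, not existing code; the helper name is made up, and it
assumes the error handler is set to MPI_ERRORS_RETURN so the error code is
returned rather than aborting):

    #include <stdio.h>
    #include <mpi.h>

    /* try to connect to a published port; returns the MPI error code */
    int connect_to_port(char *port_name, MPI_Comm *newcomm)
    {
        int rc;

        MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
        rc = MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                              newcomm);
        if (MPI_ERR_PORT == rc) {
            /* the port is gone (server dead, already closed, never
             * published); with the scheme above this surfaces only after
             * the TCP timeout expires */
            fprintf(stderr, "port '%s' is no longer available\n", port_name);
        }
        return rc;
    }
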
>
> george.
>
> On Apr 25, 2008, at 7:52 PM, Ralph Castain wrote:
>
>> On 4/25/08 5:38 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>> wrote:
>>
>>> To bounce on George's last remark: currently, when a job dies without
>>> unsubscribing a port with Unpublish (due to poor user programming,
>>> failure, or abort), ompi-server keeps the reference forever, and a new
>>> application can therefore not publish under the same name again. So I
>>> guess it would be good to correctly clean up all published/opened
>>> ports when the application ends (for whatever reason).
>>
>> That's a good point - in my other note, all I had addressed was closing
>> my local port. We should ensure that the pubsub framework does an
>> unpublish of anything we put out there. I'll have to create a command to
>> do that since pubsub doesn't actually track what it was asked to publish
>> - we'll need something that tells both local and global data servers to
>> "unpublish anything that came from me".
>>
>>>
>>> Another cool feature could be to have mpirun behave as an ompi-server,
>>> and publish a suitable URI if requested to do so (if the URI file does
>>> not exist yet?). I know from the source code that mpirun already
>>> includes everything needed to offer this feature, except the ability
>>> to provide a suitable URI.
>>
>> Just to be sure I understand, since I think this is doable: mpirun
>> already does serve as your "ompi-server" for any job it spawns - that
>> is the purpose of the MPI_Info flag "local" instead of "global" when
>> you publish information. You can always publish/lookup against your
>> own mpirun.
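
A sketch of what that choice looks like at the MPI level; the
"ompi_global_scope" info key here is taken from the Open MPI man page for
MPI_Publish_name, so treat the exact key name as an assumption for this
revision of the code:

    #include <mpi.h>

    /* publish a port name either against our own mpirun (local scope) or
     * against the system-level ompi-server (global scope) */
    void publish_port(char *service_name, char *port_name, int global)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        /* assumed key: "true" -> global data server (ompi-server),
         * "false" -> this job's mpirun only */
        MPI_Info_set(info, "ompi_global_scope", global ? "true" : "false");

        MPI_Publish_name(service_name, info, port_name);

        MPI_Info_free(&info);
    }
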
>>
>> What you are suggesting here is that we have each mpirun put its local
>> data server port info somewhere that another job can find it, either in
>> the already existing contact_info file, or perhaps in a separate "data
>> server uri" file?
>>
>> The only reason for concern here is the obvious race condition. Since
>> mpirun only exists during the time a job is running, you could look up
>> its contact info and attempt to publish/lookup to that mpirun, only to
>> find it doesn't respond because it either is already dead or on its way
>> out. Hence the notion of restricting inter-job operations to the
>> system-level ompi-server.
>>
>> If we can think of a way to deal with the race condition, I'm certainly
>> willing to publish the contact info. I'm just concerned that you may
>> find yourself "hung" if that mpirun goes away unexpectedly - say right
>> in the middle of a publish/lookup operation.
>>
>> Ralph
>>
>>>
>>> Aurelien
>>>
>>> Le 25 avr. 08 à 19:19, George Bosilca a écrit :
>>>
>>>> Ralph,
>>>>
>>>> Thanks for your concern regarding the level of compliance of our
>>>> implementation of the MPI standard. I don't know who the MPI gurus
>>>> you talked with about this issue were, but I can tell you that for
>>>> once the MPI standard is pretty clear about this.
>>>>
>>>> As Aurelien stated in his last email, the use of the plural in
>>>> several sentences strongly suggests that the status of the port
>>>> should not be implicitly modified by MPI_Comm_accept or
>>>> MPI_Comm_connect. Moreover, at the beginning of the chapter, the MPI
>>>> standard specifies that connect/accept work exactly as in TCP. In
>>>> other words, once the port is opened it stays open until the user
>>>> explicitly closes it.
>>>>
>>>> However, not all corner cases are addressed by the MPI standard.
>>>> What happens on MPI_Finalize ... that's a good question. Personally,
>>>> I think we should stick with the TCP similarities. The port should be
>>>> not only closed but unpublished. This will solve all issues with
>>>> people trying to look up a port once the originator is gone.
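
At the user level, the well-behaved sequence is the short sketch below; the
question in this thread is what the library should do when an application
skips these calls (the function and variable names are illustrative only):

    #include <mpi.h>

    void shutdown_service(char *service_name, char *port_name)
    {
        /* undo the publish first, so nobody can look up a dead port ... */
        MPI_Unpublish_name(service_name, MPI_INFO_NULL, port_name);
        /* ... then close the port itself */
        MPI_Close_port(port_name);
    }
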
>>>>
>>>> george.
>>>>
>>>> On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:
>>>>
>>>>> As I said, it makes no difference to me. I just want to ensure that
>>>>> everyone agrees on the interpretation of the MPI standard. We have
>>>>> had these discussions in the past, with differing views. My guess
>>>>> here is that the port was left open mostly because the person who
>>>>> wrote the C-binding forgot to close it. ;-)
>>>>>
>>>>> So, you MPI folks: do we allow multiple connections against a single
>>>>> port, and leave the port open until explicitly closed? If so, then do
>>>>> we generate an error if someone calls MPI_Finalize without first
>>>>> closing the port? Or do we automatically close any open ports when
>>>>> finalize is called?
>>>>>
>>>>> Or do we automatically close the port after the connect/accept is
>>>>> completed?
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 4/25/08 3:13 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Actually, the port was still left open forever before the change.
>>>>>> The bug damaged the port string, and it was not usable anymore, not
>>>>>> only in subsequent Comm_accept, but also in Close_port or
>>>>>> Unpublish_name.
>>>>>>
>>>>>> To answer your open-port concern more specifically: if the user
>>>>>> does not want to have an open port anymore, he should explicitly
>>>>>> call MPI_Close_port and not rely on MPI_Comm_accept to close it.
>>>>>> Actually the standard suggests the exact opposite: section 5.4.2
>>>>>> states "it must call MPI_Open_port to establish a port [...] it
>>>>>> must call MPI_Comm_accept to accept connections from clients".
>>>>>> Because there are multiple clients AND multiple connections in that
>>>>>> sentence, I assume the port can be used in multiple accepts.
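
Under that reading, a server would look roughly like the following sketch
(illustrative only; the client count is arbitrary):

    #include <mpi.h>

    #define NUM_CLIENTS 3   /* arbitrary for the example */

    void serve_clients(void)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm clients[NUM_CLIENTS];
        int i;

        MPI_Open_port(MPI_INFO_NULL, port);

        /* the same port string is reused for every accept */
        for (i = 0; i < NUM_CLIENTS; i++) {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &clients[i]);
        }

        /* the port stays open until it is explicitly closed */
        MPI_Close_port(port);

        for (i = 0; i < NUM_CLIENTS; i++) {
            MPI_Comm_disconnect(&clients[i]);
        }
    }
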
>>>>>>
>>>>>> Aurelien
>>>>>>
>>>>>> Le 25 avr. 08 à 16:53, Ralph Castain a écrit :
>>>>>>
>>>>>>> Hmmm...just to clarify, this wasn't a "bug". It was my
>>>>>>> understanding per the MPI folks that a separate, unique port had to
>>>>>>> be created for every invocation of Comm_accept. They didn't want a
>>>>>>> port hanging around open, and their plan was to close the port
>>>>>>> immediately after the connection was established.
>>>>>>>
>>>>>>> So dpm_orte was written to that specification. When I reorganized
>>>>>>> the code, I left the logic as it had been written - which was
>>>>>>> actually done by the MPI side of the house, not me.
>>>>>>>
>>>>>>> I have no problem with making the change. However, since the
>>>>>>> specification was created on the MPI side, I just want to make sure
>>>>>>> that the MPI folks all realize this has now been changed. Obviously,
>>>>>>> if this change in spec is adopted, someone needs to make sure that
>>>>>>> the C and Fortran bindings do *not* close that port any more!
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 4/25/08 2:41 PM, "bouteill_at_[hidden]" <bouteill_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Author: bouteill
>>>>>>>> Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
>>>>>>>> New Revision: 18303
>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/18303
>>>>>>>>
>>>>>>>> Log:
>>>>>>>> Fix a bug that prevented using the same port (as returned by
>>>>>>>> Open_port) for several Comm_accept calls.
>>>>>>>>
>>>>>>>>
>>>>>>>> Text files modified:
>>>>>>>> trunk/ompi/mca/dpm/orte/dpm_orte.c | 19 ++++++++++---------
>>>>>>>> 1 files changed, 10 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
>>>>>>>> ==============================================================================
>>>>>>>> --- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
>>>>>>>> +++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT
>>>>>>>> (Fri, 25 Apr
>>>>>>>> 2008)
>>>>>>>> @@ -848,8 +848,14 @@
>>>>>>>> {
>>>>>>>> char *tmp_string, *ptr;
>>>>>>>>
>>>>>>>> + /* copy the RML uri so we can return a malloc'd value
>>>>>>>> + * that can later be free'd
>>>>>>>> + */
>>>>>>>> + tmp_string = strdup(port_name);
>>>>>>>> +
>>>>>>>> /* find the ':' demarking the RML tag we added to the end */
>>>>>>>> - if (NULL == (ptr = strrchr(port_name, ':'))) {
>>>>>>>> + if (NULL == (ptr = strrchr(tmp_string, ':'))) {
>>>>>>>> + free(tmp_string);
>>>>>>>> return NULL;
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -863,15 +869,10 @@
>>>>>>>> /* see if the length of the RML uri is too long - if so,
>>>>>>>> * truncate it
>>>>>>>> */
>>>>>>>> - if (strlen(port_name) > MPI_MAX_PORT_NAME) {
>>>>>>>> - port_name[MPI_MAX_PORT_NAME] = '\0';
>>>>>>>> + if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
>>>>>>>> + tmp_string[MPI_MAX_PORT_NAME] = '\0';
>>>>>>>> }
>>>>>>>> -
>>>>>>>> - /* copy the RML uri so we can return a malloc'd value
>>>>>>>> - * that can later be free'd
>>>>>>>> - */
>>>>>>>> - tmp_string = strdup(port_name);
>>>>>>>> -
>>>>>>>> +
>>>>>>>> return tmp_string;
>>>>>>>> }
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> svn mailing list
>>>>>>>> svn_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel