Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI based HLA/RTI ?
From: John Chludzinski (john.chludzinski_at_[hidden])
Date: 2013-04-22 11:36:51


Mainly responding to Ralph's comments.

In HLA a federate (MPI process) can join and leave a federation (MPI
collective) independently from other federates. And rejoin later.
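
To make that requirement concrete, here is a minimal sketch in plain Python of the membership behavior an RTI needs. It makes no MPI calls, and the `Federation` class and the federate names are made up purely for illustration; they are not part of any HLA or MPI API. In MPI terms, each join would presumably map to a connect/accept and each clean leave to a disconnect.

```python
class Federation:
    """Toy stand-in for an HLA federation; not an MPI or RTI API."""

    def __init__(self, name):
        self.name = name
        self.members = set()   # federates currently joined
        self.history = []      # (event, federate) pairs, in order

    def join(self, federate):
        # Joining is independent of every other federate; rejoining after
        # a leave is just another join.
        if federate in self.members:
            raise ValueError(f"{federate} is already joined")
        self.members.add(federate)
        self.history.append(("join", federate))

    def leave(self, federate):
        # A clean, voluntary departure; the remaining federates keep running.
        if federate not in self.members:
            raise ValueError(f"{federate} is not joined")
        self.members.remove(federate)
        self.history.append(("leave", federate))

    def broadcast(self, sender, payload):
        # Deliver to whoever is joined right now, excluding the sender.
        if sender not in self.members:
            raise ValueError(f"{sender} is not joined")
        return {peer: payload for peer in sorted(self.members) if peer != sender}


fed = Federation("demo")
fed.join("radar")
fed.join("aircraft")
fed.leave("aircraft")
first = fed.broadcast("radar", "tick 1")    # nobody else is joined
fed.join("aircraft")                        # rejoin later
second = fed.broadcast("radar", "tick 2")   # the rejoined federate hears it
print(first, second)
```

The point of the sketch is that membership changes one federate at a time, with no step at which everyone has to start or finish together.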

---John

On Mon, Apr 22, 2013 at 11:20 AM, George Bosilca <bosilca_at_[hidden]> wrote:

> On Apr 19, 2013, at 17:00 , John Chludzinski <john.chludzinski_at_[hidden]>
> wrote:
>
> So the apparent conclusion to this thread is that an (Open)MPI based RTI
> is very doable - if we allow for the future development of dynamic joining
> and leaving of the MPI collective?
>
>
> John,
>
> What do you mean by dynamically joining and leaving of the MPI collective?
>
> There are quite a few functions in MPI to dynamically join and disconnect
> processes (MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept, MPI_Comm_join).
> So if your processes __always__ leave cleanly (using the defined MPI pattern
> of MPI_Comm_disconnect + MPI_Comm_free), you might be lucky enough to have
> this working today. If you want to support processes leaving for reasons
> outside of your control (such as a crash), you do not have an option in MPI
> today; you need to use some extension (such as ULFM).
>
> George.
>
>
>
>
> ---John
>
>
> On Wed, Apr 17, 2013 at 12:45 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Thanks for the clarification - very interesting indeed! I'll look at it
>> more closely.
>>
>>
>> On Apr 17, 2013, at 9:20 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>
>> On Apr 16, 2013, at 15:51 , Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> Just curious: I thought ULFM dealt with recovering an MPI job where one
>> or more processes fail. Is this correct?
>>
>>
>> It depends on which definition of "recovering" you take. ULFM is about
>> leaving the processes that remain (after a fault or a disconnect) in a
>> state that allows them to continue to make progress. It is not about
>> recovering processes, or user data, but it does provide the minimal
>> set of functionality to allow an application to do this, if needed (revoke,
>> agreement and shrink).
>>
>> HLA/RTI consists of processes that start at random times, run to
>> completion, and then exit normally. While a failure could occur, most
>> process terminations are normal and there is no need/intent to revive them.
>>
>>
>> As I said above, there is no revival of processes in ULFM, and it was
>> never our intent to have such a feature. The dynamic world is to be
>> constructed using MPI-2 constructs (MPI_Comm_spawn or
>> MPI_Comm_connect/MPI_Comm_accept or even MPI_Comm_join).
>>
>> So it's mostly a case of massively exercising MPI's dynamic
>> connect/accept/disconnect functions.
>>
>> Do ULFM's structures have some utility for that purpose?
>>
>>
>> Absolutely. If the process that leaves calls exit() instead of
>> MPI_Finalize, this will be interpreted by the ULFM version of the runtime
>> as an event triggering a report. All the ensuing mechanisms are then
>> activated and the application can react to this event with the most
>> meaningful approach it can envision.
>>
>> George.
>>
>>
>>
>> On Apr 16, 2013, at 3:20 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>
>> There is an ongoing effort, called ULFM, to address the potential
>> volatility of processes in MPI. There is a working version available at
>> http://fault-tolerance.org. It supports TCP, sm and IB (mostly). You
>> will find some examples there, along with the document explaining the
>> additional constructs needed in MPI to achieve this.
>>
>> George.
>>
>> On Apr 15, 2013, at 17:29 , John Chludzinski <john.chludzinski_at_[hidden]>
>> wrote:
>>
>> That would seem to preclude its use for an RTI. Unless you have a card
>> up your sleeve?
>>
>> ---John
>>
>>
>> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> It isn't the fact that there are multiple programs being used - we
>>> support that just fine. The problem with HLA/RTI is that it allows programs
>>> to come/go at will - i.e., not every program has to start at the same time,
>>> nor complete at the same time. MPI requires that all programs be executing
>>> at the beginning, and that all call finalize prior to anyone exiting.
>>>
>>>
>>> On Apr 15, 2013, at 8:14 AM, John Chludzinski <
>>> john.chludzinski_at_[hidden]> wrote:
>>>
>>> I just received an e-mail notifying me that MPI-2 supports MPMD. This
>>> would seem to be just what the doctor ordered?
>>>
>>> ---John
>>>
>>>
>>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> FWIW: some of us are working on a variant of MPI that would indeed
>>>> support what you describe - it would support send/recv (i.e., MPI-1), but
>>>> not collectives, and so would allow communication between arbitrary
>>>> programs.
>>>>
>>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that
>>>> conformed to that standard could be created.
>>>>
>>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski <
>>>> john.chludzinski_at_[hidden]> wrote:
>>>>
>>>> > This would be a departure from the SPMD paradigm that seems central to
>>>> > MPI's design. Each process would be a completely different program
>>>> > (piece of code) and I'm not sure how well that would work using
>>>> > MPI?
>>>> >
>>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>>> > communication between LPs (federates in HLA). But these LPs are
>>>> > usually the same program.
>>>> >
>>>> > ---John
>>>> >
>>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>>> > <john.chludzinski_at_[hidden]> wrote:
>>>> >> Is anyone aware of an MPI based HLA/RTI (DoD High Level Architecture
>>>> >> (HLA) / Runtime Infrastructure)?
>>>> >>
>>>> >> ---John
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > users_at_[hidden]
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users