Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI based HLA/RTI ?
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-04-17 12:20:48


On Apr 16, 2013, at 15:51 , Ralph Castain <rhc_at_[hidden]> wrote:

> Just curious: I thought ULFM dealt with recovering an MPI job where one or more processes fail. Is this correct?

It depends on which definition of "recovering" you take. ULFM is about leaving the processes that remain (after a fault or a disconnect) in a state that allows them to continue making progress. It is not about recovering processes or user data, but it does provide the minimal set of functionality an application needs to do that itself, if required (revoke, agreement and shrink).

> HLA/RTI consists of processes that start at random times, run to completion, and then exit normally. While a failure could occur, most process terminations are normal and there is no need/intent to revive them.

As I said above, there is no revival of processes in ULFM, and it was never our intent to provide such a feature. The dynamic world is to be constructed using MPI-2 constructs (MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept, or even MPI_Comm_join).

> So it's mostly a case of massively exercising MPI's dynamic connect/accept/disconnect functions.
>
> Do ULFM's structures have some utility for that purpose?

Absolutely. If the process that leaves calls exit() instead of MPI_Finalize, the ULFM version of the runtime interprets this as an event that triggers a report. All the ensuing mechanisms are then activated, and the application can react to the event in whatever way it finds most meaningful.

  George.

>
>
> On Apr 16, 2013, at 3:20 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>
>> There is an ongoing effort, called ULFM, to address the potential volatility of processes in MPI. A working version is available at http://fault-tolerance.org. It supports TCP, sm and IB (mostly). You will find some examples there, along with the document explaining the additional constructs needed in MPI to achieve this.
>>
>> George.
>>
>> On Apr 15, 2013, at 17:29 , John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>
>>> That would seem to preclude its use for an RTI. Unless you have a card up your sleeve?
>>>
>>> ---John
>>>
>>>
>>> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> It isn't the fact that there are multiple programs being used - we support that just fine. The problem with HLA/RTI is that it allows programs to come/go at will - i.e., not every program has to start at the same time, nor complete at the same time. MPI requires that all programs be executing at the beginning, and that all call finalize prior to anyone exiting.
>>>
>>>
>>> On Apr 15, 2013, at 8:14 AM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>>
>>>> I just received an e-mail notifying me that MPI-2 supports MPMD. This would seem to be just what the doctor ordered?
>>>>
>>>> ---John
>>>>
>>>>
>>>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> FWIW: some of us are working on a variant of MPI that would indeed support what you describe - it would support send/recv (i.e., MPI-1), but not collectives, and so would allow communication between arbitrary programs.
>>>>
>>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that conformed to that standard could be created.
>>>>
>>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>>>
>>>> > This would be a departure from the SPMD paradigm that seems central to
>>>> > MPI's design. Each process would be a completely different program
>>>> > (piece of code) and I'm not sure how well that would work using
>>>> > MPI?
>>>> >
>>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>>> > communication between LPs (federates in HLA). But these LPs are
>>>> > usually the same program.
>>>> >
>>>> > ---John
>>>> >
>>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>>> > <john.chludzinski_at_[hidden]> wrote:
>>>> >> Is anyone aware of an MPI based HLA/RTI (DoD High Level Architecture
>>>> >> (HLA) / Runtime Infrastructure)?
>>>> >>
>>>> >> ---John
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > users_at_[hidden]
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users