Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fake Modex
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-06-03 10:03:09


Hello Ralph.

> Are you talking about an MPI communication? If so, then you need to update
> every proc's modex info for the proc that moved - this is something stored
> in each MPI proc's memory, so it isn't something that you can just get from
> the daemon on-demand. You'll have to provide the update to every single proc
> directly so that it has the info if/when it should decide to send an MPI
> message to the proc that moved.

Yes, I'm talking about MPI communication.

> See the modex database interface in
> orte/mca/grpcomm/base/grpcomm_base_modex.c. You'll have to create new code
> to send/recv an update message, but the code to update the database entry
> exists.

By a send/recv update message, I think you mean something similar to
packing/unpacking the info, perhaps also using an allgather the way it's
done in grpcomm_base_modex.c.
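
To make sure I understand, here is a rough sketch of what I would try for
the sender side. This is only a sketch: ORTE_RML_TAG_MODEX_UPDATE is a tag
I would have to add myself, and the exact packing is guesswork on my part.

    #include "opal/dss/dss.h"
    #include "orte/mca/rml/rml.h"

    /* Sketch: send a modex update for the proc that moved to a single
     * peer, instead of doing a collective. ORTE_RML_TAG_MODEX_UPDATE
     * is a hypothetical new tag that would have to be defined. */
    static int send_modex_update(orte_process_name_t *moved_proc,
                                 orte_process_name_t *peer)
    {
        opal_buffer_t buf;
        int rc;

        OBJ_CONSTRUCT(&buf, opal_buffer_t);

        /* pack the name of the proc that moved */
        if (ORTE_SUCCESS != (rc = opal_dss.pack(&buf, moved_proc, 1,
                                                ORTE_NAME))) {
            OBJ_DESTRUCT(&buf);
            return rc;
        }

        /* ...pack the proc's new modex entries here, the same way the
         * allgather payload is packed in grpcomm_base_modex.c... */

        rc = orte_rml.send_buffer(peer, &buf, ORTE_RML_TAG_MODEX_UPDATE, 0);
        OBJ_DESTRUCT(&buf);
        return rc;
    }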

I took a look at the code and found
the orte_grpcomm_base_update_modex_entries(&proc_name, &rbuf) function. I
printed the attr_name values and I get btl.tcp.1.7 and other attributes,
but I'm not finding any information about the URI, address, or anything
else that would allow me to communicate with another peer.
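
If I'm reading the TCP BTL correctly, that btl.tcp.1.7 entry actually is
the addressing information: my understanding of
ompi/mca/btl/tcp/btl_tcp_component.c is that the component publishes a
packed array of mca_btl_tcp_addr_t structs (an IP address and port per
exported interface), so the data is binary rather than a URI string. The
publish side seems to look roughly like this (my reading, not a verified
quote of the code):

    #include "ompi/runtime/ompi_module_exchange.h"
    #include "ompi/mca/btl/tcp/btl_tcp.h"

    /* sketch: the "btl.tcp.x.y" modex entry is this packed array of
     * address/port structs, one per exported interface */
    static int publish_tcp_addrs(mca_btl_tcp_addr_t *addrs, size_t naddrs)
    {
        return ompi_modex_send(&mca_btl_tcp_component.super.btl_version,
                               addrs, naddrs * sizeof(mca_btl_tcp_addr_t));
    }

So the address and port should be in there; they just have to be unpacked
as mca_btl_tcp_addr_t rather than parsed as a string.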

I'm thinking that I have to (somehow) update the endpoint somewhere, but I
don't know from where I can do this, or whether there is a function that
allows me to do that kind of update.
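
The only update function I have found so far is the
orte_grpcomm_base_update_modex_entries() one you pointed at, so on the
receive side I imagine something like the sketch below. Again, the tag is
my hypothetical one, and I'm not sure whether refreshing the modex is
enough or whether the BTL endpoints already created for that proc also
need to be invalidated.

    #include "opal/dss/dss.h"
    #include "orte/mca/rml/rml.h"
    #include "orte/mca/grpcomm/base/base.h"

    /* sketch of the matching receive side: unpack the name of the proc
     * that moved, then hand the rest of the buffer to the existing
     * database-update routine */
    static void modex_update_recv(int status, orte_process_name_t *sender,
                                  opal_buffer_t *buffer, orte_rml_tag_t tag,
                                  void *cbdata)
    {
        orte_process_name_t proc_name;
        int32_t n = 1;

        if (ORTE_SUCCESS != opal_dss.unpack(buffer, &proc_name, &n,
                                            ORTE_NAME)) {
            return;
        }
        orte_grpcomm_base_update_modex_entries(&proc_name, buffer);
    }

    /* registered once during startup, e.g.:
     *   orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
     *                           ORTE_RML_TAG_MODEX_UPDATE,
     *                           ORTE_RML_NON_PERSISTENT,
     *                           modex_update_recv, NULL);
     */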

Thanks again.

Hugo

2011/6/3 Ralph Castain <rhc_at_[hidden]>

> Are you talking about an MPI communication? If so, then you need to update
> every proc's modex info for the proc that moved - this is something stored
> in each MPI proc's memory, so it isn't something that you can just get from
> the daemon on-demand. You'll have to provide the update to every single proc
> directly so that it has the info if/when it should decide to send an MPI
> message to the proc that moved.
>
> This is why we do a modex upon restart - sending the change to every MPI
> proc is hardly scalable without a collective operation.
>
> See the modex database interface in
> orte/mca/grpcomm/base/grpcomm_base_modex.c. You'll have to create new code
> to send/recv an update message, but the code to update the database entry
> exists.
>
>
> On Jun 2, 2011, at 7:52 AM, Hugo Meyer wrote:
>
> Hello again.
>
> My current problem is that I don't know where the struct is that holds the
> information used to send messages to the procs.
>
> Something like:
>
> Rank   URI
> 0      21222:tcp:192.168.1.1:1250
> 1      21223:tcp:192.168.1.2:1250
> ...    ...
>
>
> What I need is to update it when I move a process from its original node.
> Is there something like this?
>
> Thanks a lot.
>
> Hugo
>
> 2011/5/31 Hugo Meyer <meyer.hugo_at_[hidden]>
>
>> Hello @ll.
>>
>> I need some help restarting communication with a process that I have
>> restored on a different node. My situation is as follows:
>>
>> A process fails and is successfully restored on another node from a
>> previous checkpoint that I sent there. Now, when a process tries to send
>> a message to this restored process, the send will fail, or at least it
>> will be stuck in ompi_request_wait_completion.
>> So, when this happens, I have to send a message to the sender's daemon,
>> which will have the URI of the node where the process has been restored;
>> the daemon answers the proc with this URI, and the proc updates its info.
>>
>> So, I need to know where in the code I can intercept this send attempt
>> and then send the message to the proc's daemon, and where and how I can
>> update this info so the message goes to the right place (same rank, but
>> a new URI).
>>
>> I have to do it this way to avoid a collective communication.
>>
>> If you could give me a hand with this, that would be great.
>>
>> Best regards.
>>
>> Hugo
>>
>