Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fake Modex
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-06-13 09:00:26


I don't think this will help much, but I can tell you how we handled
this for the coordinated C/R functionality.

When we added automatic recovery and process migration using
coordinated checkpoints to the Open MPI trunk (spring/summer 2010) we
were able to take advantage of the coordinated nature of the activity.
Since all processes were doing the recovery together (with possibly
only a subset of the processes actually restarting - in the case of
process migration) we were able to flush the modex and repost
connection information to all processes that wanted it. The restarted
processes will pull the updated modex information, and the existing
processes (if any) will pull the modex information from the restarted
processes once it is posted. The coordinated nature of the recovery
activity made it easy to define a point in time at which the modex was
accurate - similar to MPI_Init.
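
To make the flush-and-repost idea concrete, here is a minimal toy model in
plain C. It is only a sketch under stated assumptions: the table stands in
for the modex, a short string stands in for a BTL's opaque contact blob, and
every name prefixed with toy_ is hypothetical. None of this is Open MPI code
or an Open MPI API.

```c
#include <assert.h>
#include <string.h>

/* Toy model only: none of these names are Open MPI APIs. */
#define TOY_MAX_PROCS 8
#define TOY_ADDR_LEN  32

typedef struct {
    int  valid;                 /* 1 once the owning rank has posted */
    char addr[TOY_ADDR_LEN];    /* stand-in for an opaque contact blob */
} toy_entry_t;

typedef struct {
    int         nprocs;
    toy_entry_t entries[TOY_MAX_PROCS];
} toy_modex_t;

/* Drop every entry; all processes do this at the agreed recovery point. */
static void toy_modex_flush(toy_modex_t *m)
{
    assert(m != NULL);
    for (int i = 0; i < m->nprocs; i++) {
        m->entries[i].valid = 0;
        m->entries[i].addr[0] = '\0';
    }
}

/* Set up an empty table for nprocs ranks. Returns 0 on success, -1 on error. */
static int toy_modex_init(toy_modex_t *m, int nprocs)
{
    assert(m != NULL);
    if (nprocs < 1 || nprocs > TOY_MAX_PROCS) {
        return -1;
    }
    m->nprocs = nprocs;
    toy_modex_flush(m);
    return 0;
}

/* Publish (or republish) one rank's contact string. Returns 0 on success. */
static int toy_modex_post(toy_modex_t *m, int rank, const char *addr)
{
    assert(m != NULL && addr != NULL);
    if (rank < 0 || rank >= m->nprocs || strlen(addr) >= TOY_ADDR_LEN) {
        return -1;
    }
    strcpy(m->entries[rank].addr, addr);
    m->entries[rank].valid = 1;
    return 0;
}

/* Return the posted contact string, or NULL if the rank has not posted. */
static const char *toy_modex_lookup(const toy_modex_t *m, int rank)
{
    assert(m != NULL);
    if (rank < 0 || rank >= m->nprocs || !m->entries[rank].valid) {
        return NULL;
    }
    return m->entries[rank].addr;
}

/* Coordinated recovery: flush, then every rank reposts its current address.
 * After this returns 0, the table is accurate for all ranks at once. */
static int toy_coordinated_recover(toy_modex_t *m,
                                   const char *const current_addrs[])
{
    toy_modex_flush(m);
    for (int r = 0; r < m->nprocs; r++) {
        if (toy_modex_post(m, r, current_addrs[r]) != 0) {
            return -1;
        }
    }
    return 0;
}
```

The point of the sketch is that toy_coordinated_recover() has a single exit
where every entry is fresh, which is the "point in time" described above. A
lookup between the flush and the repost returns NULL rather than a stale
address.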

It sounds like you are trying to do something less coordinated in
nature. So you will most likely need to extend the modex, since I do
not think it has good support for sending updated contact information
(and invalidating old contact information) in the current trunk.
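
One possible shape for such an extension, purely as an illustration, is to
attach a version number to each peer's cached contact information so that
stale updates can be rejected and known-bad entries can be invalidated
without a collective. The sketch below assumes a per-peer cache entry and a
string in place of the opaque blob; all toy_ names are hypothetical and this
is not how the trunk modex works today.

```c
#include <assert.h>
#include <string.h>

/* Toy model only: none of these names are Open MPI APIs. */
#define TOY_BLOB_LEN 32

typedef struct {
    unsigned version;            /* 0 means nothing has been posted yet */
    int      valid;              /* cleared once the peer is known to have moved */
    char     blob[TOY_BLOB_LEN]; /* stand-in for an opaque contact blob */
} toy_contact_t;

/* Start with an empty, invalid cache entry. */
static void toy_contact_init(toy_contact_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
}

/* Mark the cached info unusable, e.g. after a connection attempt fails. */
static void toy_contact_invalidate(toy_contact_t *c)
{
    assert(c != NULL);
    c->valid = 0;
}

/* Apply an update only if it is newer than what is cached.
 * Returns 1 if applied, 0 if stale or duplicate, -1 if malformed. */
static int toy_contact_update(toy_contact_t *c, unsigned version,
                              const char *blob)
{
    assert(c != NULL && blob != NULL);
    if (version == 0 || strlen(blob) >= TOY_BLOB_LEN) {
        return -1;
    }
    if (version <= c->version) {
        return 0;   /* an old address must not revive an invalidated entry */
    }
    strcpy(c->blob, blob);
    c->version = version;
    c->valid = 1;
    return 1;
}

/* Return the cached blob, or NULL if there is nothing usable. */
static const char *toy_contact_get(const toy_contact_t *c)
{
    assert(c != NULL);
    return c->valid ? c->blob : NULL;
}
```

In this model a restarted process would post its new contact information
with a higher version, and a peer that had invalidated its cached entry
would accept only that newer post. A late copy of the old information
would be ignored.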

George should know this code path better than I do, so he might be
able to help a bit more. For their uncoordinated C/R approach they
would have had to deal with this when restarting processes mid-run
without halting other processes. So maybe you can use a similar
approach.

-- Josh

On Sat, Jun 4, 2011 at 10:55 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> On Jun 4, 2011, at 5:21 AM, Hugo Meyer wrote:
>
> Thanks for your replies.
>> After doing that, the MPI_Init procedure calls grpcomm.modex to distribute
>> the data across all procs in the job. Unfortunately, being a collective, all
>> procs must participate. In your case, you'll have to find a different way to
>> do it. Upon receipt, each proc updates its own modex db to include the new
>> info.
>> Look in orte/mca/grpcomm/bad/grpcomm_bad_module.c at the modex function and
>> follow that code thru the grpcomm/base functions to see how the modex info
>> is retrieved, passed, and decoded on the far end.
> I will take a look at this, Ralph, and let you know how it goes. But today,
> while I was looking at the code with a colleague, he suggested that I try
> to catch the error when sending data through the btl_tcp_endpoint, more
> precisely in mca_btl_tcp_frag_send, at the point where we write to the fd
> of the socket. I've tried this, but when a process moves and tries to send
> a message, or someone tries to send a message to it, I cannot catch the
> moment of the failure in mca_btl_tcp_frag_send, and I don't know why. It is
> supposed to fail when someone tries to send. Is there any other place where
> this is caught? If I do it this way, I suppose I can reset connections on
> demand. What do you think of this? Is it a good idea? And after I detect
> this failure, I will try to update the modex db of that process from here.
> Is that OK?
>
> I'm no expert on the tcp btl - perhaps George can answer?
> The run-time has no visibility into MPI connections, and has no
> understanding of the modex contents. So if a proc detects that it cannot
> make the btl connection, I guess it could send an orte message to the proc
> it's trying to reach, and have that proc return a copy of its modex data?
> I guess that could work. You may be running into the MPI layer's own
> attempts to ensure comm success via retry. I know you won't get a send
> failure just because the socket is closed - it'll keep retrying the
> connection for a while before giving up.
>
>
> Thanks
> Hugo
>
>
> 2011/6/3 Jeff Squyres <jsquyres_at_[hidden]>
>>
>> On Jun 3, 2011, at 10:12 AM, Ralph Castain wrote:
>>
>> > When an MPI proc calls MPI_Init, each btl pushes its contact info into
>> > the modex database - one example is the btl.tcp.1.7 info you found there.
>> > That entry is for the TCP btl, which is probably what you are looking for.
>> > There is no way for you to edit that data - each btl encodes it in its own
>> > way and then adds it to the modex.
>>
>> More specifically, whatever each entity puts into the modex is a blob that
>> is only readable by other entities just like itself.  For example, what one
>> TCP BTL puts in the modex can really only be read by another TCP BTL. From
>> the modex's point of view, what the TCP BTL puts in there is an opaque
>> binary blob.
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey