Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r31577 - trunk/ompi/mca/rte/base
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-01 09:43:48

The problem we'll have with BTLs in opal is going to revolve around ompi_process_name_t, and it will occur in a number of places. I've been trying to grok George's statement about accessors and can't figure out a clean way to make that work IF every RTE gets to define the process name a different way.

For example, suppose I define ompi_process_name_t to be a string. I can hash the string down to an opal_identifier_t, but that is a structureless 64-bit value - there is no concept of a jobid or vpid in it. So if you now want to extract a jobid for that identifier, the only way you can do it is to "up-call" back to the RTE to parse it.
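As a rough illustration of that point (a sketch, not actual OMPI code): the names example_identifier_t and example_hash_name below are hypothetical, and FNV-1a is just one possible choice of hash. Once the string has gone through it, the 64-bit result has no field from which a jobid or vpid could be read back.

```c
#include <stdint.h>

/* Hypothetical stand-in for opal_identifier_t: an opaque 64-bit value. */
typedef uint64_t example_identifier_t;

/*
 * Hash a string-valued process name down to 64 bits using FNV-1a.
 * FNV-1a is an assumption made for this sketch; any 64-bit hash would
 * show the same thing: the output carries no jobid/vpid structure.
 */
example_identifier_t example_hash_name(const char *name)
{
    example_identifier_t h = 14695981039346656037ULL; /* FNV-1a 64-bit offset basis */
    const unsigned char *p;

    for (p = (const unsigned char *)name; *p != '\0'; ++p) {
        h ^= (example_identifier_t)*p;
        h *= 1099511628211ULL;                        /* FNV-1a 64-bit prime */
    }
    return h;
}
```

Two names that differ only in their rank component hash to unrelated values, so nothing in the identifier itself says they belong to the same job.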

This means that every RTE would have to initialize OPAL with a registration of its opal_identifier parser function(s), which seems like a really ugly solution.
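To make the registration idea concrete, here is a minimal sketch of what such an up-call could look like. Every name in it (example_register_parser, example_get_jobid, example_rte_parse) is hypothetical, and the 32/32 bit split in the sample parser is only an assumption about how one particular RTE might lay out its identifier.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t example_identifier_t;

/* Signature of a parser that an RTE would hand down to the lower layer. */
typedef int (*example_parse_fn_t)(example_identifier_t id,
                                  uint32_t *jobid, uint32_t *vpid);

/* Lower-layer state: the single parser registered by the active RTE. */
static example_parse_fn_t example_parser = NULL;

/* Called by the RTE during initialization. */
void example_register_parser(example_parse_fn_t fn)
{
    example_parser = fn;
}

/* Lower-layer accessor: it can only answer by up-calling into the RTE. */
int example_get_jobid(example_identifier_t id, uint32_t *jobid)
{
    uint32_t vpid_unused;

    if (NULL == example_parser) {
        return -1; /* no RTE has registered a parser yet */
    }
    return example_parser(id, jobid, &vpid_unused);
}

/* One possible RTE-side parser: jobid in the upper 32 bits, vpid in the lower. */
int example_rte_parse(example_identifier_t id, uint32_t *jobid, uint32_t *vpid)
{
    *jobid = (uint32_t)(id >> 32);
    *vpid = (uint32_t)(id & 0xffffffffu);
    return 0;
}
```

Every accessor in the lower layer ends up depending on a callback that may not have been registered yet, which is a large part of why this feels ugly.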

Maybe it is time to shift the process identifier down to the opal layer? We could define opal_identifier_t to include the required jobid/vpid, perhaps adding a void* so someone can put whatever they want in it.

Note that I'm not wild about extending the identifier size beyond 64 bits, as the memory footprint is a growing concern, and I still haven't seen any real use-case proposed for extending it.
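A sketch of the two layouts in question, assuming 32-bit jobid and vpid fields. The type and function names are hypothetical and this is not the existing opal_identifier_t definition. The first struct fits in 64 bits; adding the void* pushes it past that, to 16 bytes on a typical 64-bit platform, which is exactly the footprint trade-off.

```c
#include <stdbool.h>
#include <stdint.h>

/* Structured identifier that still fits in 64 bits. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} example_name_t;

/* Same fields plus an opaque slot for RTE-specific data. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    void *rte_data; /* hypothetical field; the lower layer never interprets it */
} example_name_ext_t;

/* The lower layer can compare names and read the jobid without any up-call. */
bool example_name_equal(example_name_t a, example_name_t b)
{
    return a.jobid == b.jobid && a.vpid == b.vpid;
}

bool example_same_job(example_name_t a, example_name_t b)
{
    return a.jobid == b.jobid;
}
```

With one such name stored per peer, the extra pointer roughly doubles the per-peer cost of the identifier on 64-bit systems.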

On May 1, 2014, at 3:41 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> On Apr 30, 2014, at 10:01 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>> Why do you need the ompi_process_name_t? Isn’t the opal_identifier_t enough to dig for the info of the peer into the opal_db?
> At the moment, I use the ompi_process_name_t for RML sends/receives in the usnic BTL. I know this will have to change when the BTLs move down to OPAL (when is that going to happen, BTW?). So my future use case may be somewhat moot.
> More detail
> ===========
> "Why does the usnic BTL use RML sends/receives?", you ask.
> The reason is rooted in the fact that the usnic BTL uses an unreliable, connectionless transport under the covers. Some customers had network misconfigurations that resulted in usnic traffic not flowing properly (e.g., MTU mismatches in the network). But since we don't have a connection-oriented underlying API that will eventually timeout/fail to connect/etc. when there's a problem with the network configuration, we added a "connection validation" service in the usnic BTL that fires up in a thread in the local rank 0 on each server. This thread provides service to all the MPI processes on its server.
> In short: the service thread sends UDP pings and ACKs to peer service threads on other servers (upon demand/upon first send between servers) to verify network connectivity. If the pings eventually fail/timeout (i.e., don't get ACKs back), the service thread does a show_help and kills the job.
> There are more details, but that's the gist of it.
> This basically gives us the ability to highlight problems in the network and kill the MPI job rather than spin infinitely while trying to deliver MPI/BTL messages that will never reach their peer.
> Since this is really a server-to-server network connectivity issue (vs. an MPI peer-to-peer connectivity issue), we only need to have one service thread for a whole server. The other MPI procs on the server use RML to talk to it. E.g., "Please ping the server where MPI proc X lives," and so on. This seemed better than having a service thread in each MPI process.
> We've thought a bit about what to do when the BTLs move down to OPAL (since they won't be able to use RML any more), but don't have a final solution yet... We do still want to be able to utilize this capability even after the BTL move.
> --
> Jeff Squyres
> jsquyres_at_[hidden]