Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r31577 - trunk/ompi/mca/rte/base
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-05-01 06:41:35


On Apr 30, 2014, at 10:01 PM, George Bosilca <bosilca_at_[hidden]> wrote:

> Why do you need the ompi_process_name_t? Isn’t the opal_identifier_t enough to dig for the info of the peer into the opal_db?

At the moment, I use the ompi_process_name_t for RML sends/receives in the usnic BTL. I know this will have to change when the BTLs move down to OPAL (when is that going to happen, BTW?). So my future use case may be somewhat moot.
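To make the distinction concrete, here's a rough illustration (made-up names below, not the actual OMPI/OPAL headers): the structured name is what RML addressing wants today, while the flat 64-bit identifier is all you need for opal_db lookups.

/* Illustration only: hypothetical stand-ins for ompi_process_name_t
   and opal_identifier_t, not the real definitions. */
#include <stdint.h>

typedef struct {
    uint32_t jobid;   /* which job the peer belongs to */
    uint32_t vpid;    /* rank within that job */
} my_process_name_t;  /* stand-in for ompi_process_name_t */

typedef uint64_t my_identifier_t;   /* stand-in for opal_identifier_t */

/* same 64 bits, just viewed two different ways */
static inline my_identifier_t name_to_id(my_process_name_t n)
{
    return ((uint64_t)n.jobid << 32) | (uint64_t)n.vpid;
}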

More detail
===========

"Why does the usnic BTL use RML sends/receives?", you ask.

The reason is rooted in the fact that the usnic BTL uses an unreliable, connectionless transport under the covers. Some customers had network misconfigurations that resulted in usnic traffic not flowing properly (e.g., MTU mismatches in the network). But since we don't have a connection-oriented underlying API that will eventually timeout/fail to connect/etc. when there's a problem with the network configuration, we added a "connection validation" service in the usnic BTL that fires up in a thread in the local rank 0 process on each server. This thread provides service to all the MPI processes on its server.
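In pseudo-code terms, the startup looks roughly like this; the names are hypothetical and the real code gets the local rank from the runtime, so treat this as a sketch of the idea rather than our actual implementation.

/* Sketch of the "one agent per server" startup. */
#include <pthread.h>

static pthread_t cc_agent_thread;

/* runs the connectivity-checking service loop for every MPI process
   on this host (the ping/ACK loop itself is sketched further below) */
static void *cc_agent_main(void *arg)
{
    return NULL;
}

static int connectivity_checker_init(int my_local_rank)
{
    if (my_local_rank != 0) {
        /* non-agent processes just remember how to reach the agent via RML */
        return 0;
    }
    /* only the local rank 0 on each server runs the agent thread */
    return pthread_create(&cc_agent_thread, NULL, cc_agent_main, NULL);
}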

In short: the service thread sends UDP pings and ACKs to peer service threads on other servers (upon demand/upon first send between servers) to verify network connectivity. If the pings eventually fail/timeout (i.e., don't get ACKs back), the service thread does a show_help and kills the job.
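A stripped-down sketch of that loop with plain POSIX UDP sockets is below; the retry count, timeout, one-byte message format, and abort hook are illustrative assumptions, not our actual wire protocol.

/* Ping a peer agent and wait for an ACK, retrying on timeout. */
#include <netinet/in.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>

#define CC_NUM_RETRIES 5
#define CC_TIMEOUT_MS  1000

/* returns true if the peer agent ACKed one of our pings */
static bool ping_peer_agent(int sock, const struct sockaddr_in *peer)
{
    const char ping = 'P';
    char ack;

    for (int attempt = 0; attempt < CC_NUM_RETRIES; ++attempt) {
        sendto(sock, &ping, sizeof(ping), 0,
               (const struct sockaddr *)peer, sizeof(*peer));

        struct pollfd pfd = { .fd = sock, .events = POLLIN };
        if (poll(&pfd, 1, CC_TIMEOUT_MS) > 0) {
            if (recvfrom(sock, &ack, sizeof(ack), 0, NULL, NULL) > 0 &&
                ack == 'A') {
                return true;   /* connectivity verified */
            }
        }
        /* timed out or got garbage: retry */
    }
    /* all retries exhausted: the real code would show_help() the
       relevant network details and then kill the job */
    return false;
}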

There are more details, but that's the gist of it.

This basically gives us the ability to highlight problems in the network and kill the MPI job, rather than spinning infinitely trying to deliver MPI/BTL messages that will never arrive.

Since this is really a server-to-server network connectivity issue (vs. an MPI peer-to-peer connectivity issue), we only need to have one service thread for a whole server. The other MPI procs on the server use RML to talk to it. E.g., "Please ping the server where MPI proc X lives," and so on. This seemed better than having a service thread in each MPI process.
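The request payload for that RML message is tiny; it's something along these lines (again, hypothetical field names and layout, not our actual format):

/* Hypothetical request a non-agent MPI process packs into an RML
   message to the local rank-0 agent. */
#include <stdint.h>

typedef struct {
    uint32_t requester_vpid;   /* who is asking (so the agent can reply) */
    uint32_t target_vpid;      /* MPI proc X whose server should be pinged */
    uint32_t target_ipv4;      /* usnic address of X's server, network order */
    uint16_t target_udp_port;  /* agent port on that server */
} cc_ping_request_t;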

We've thought a bit about what to do when the BTLs move down to OPAL (since they won't be able to use RML any more), but don't have a final solution yet... We do still want to be able to utilize this capability even after the BTL move.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/