Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Network connection check
From: Durga Choudhury (dpchoudh_at_[hidden])
Date: 2009-07-23 10:41:45

The 'system' call forks a separate process to run the command. If I
remember correctly, forking within MPI can lead to undefined behavior.
Can someone on the Open MPI development team clarify?

What I don't understand is: why is your TCP network so unstable that
you are worried about reachability? For MPI to run well, the nodes should
be connected through a local switch with a high-bandwidth interconnect,
not dispersed across the internet. Perhaps you should look at the
underlying cause of the network instability. If your network is actually
stable, then your problem is only theoretical.

Also, keep in mind that TCP itself offers a keepalive mechanism. Three
parameters may be specified: the period of inactivity after which the
first probe is sent, the interval between probes, and the number of
unanswered probes after which the connection is dropped. Typing
'sysctl -a' will print all kernel parameters, including these (I don't
remember the names off the top of my head). However, you say that you
*don't* want to drop the connection; you simply want to know about
connectivity. What you can do, without causing 'undefined' MPI
behaviour, is implement a similar mechanism in your MPI application.
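
To make the "similar mechanism" concrete, here is a minimal sketch of an
application-level liveness tracker built on the same three parameters as
TCP keepalive. On Linux those appear in the 'sysctl -a' output as
net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and
net.ipv4.tcp_keepalive_probes. The class name PeerMonitor, its method
names and the return strings are hypothetical, chosen only for
illustration. The sketch deliberately contains no MPI calls: it assumes
the application sends its own small probe message whenever poll() asks
for one and calls heard_from_peer() whenever anything arrives from the
peer.

```python
import time


class PeerMonitor:
    """Hypothetical liveness tracker modelled on TCP keepalive."""

    def __init__(self, idle_time, probe_interval, max_probes,
                 clock=time.monotonic):
        self.idle_time = idle_time            # silence before the first probe (s)
        self.probe_interval = probe_interval  # gap between unanswered probes (s)
        self.max_probes = max_probes          # unanswered probes before giving up
        self.clock = clock                    # injectable so the logic can be tested
        self.last_heard = clock()
        self.probes_sent = 0
        self.last_probe = None

    def heard_from_peer(self):
        """Call whenever any message (data or probe reply) arrives."""
        self.last_heard = self.clock()
        self.probes_sent = 0
        self.last_probe = None

    def poll(self):
        """Return 'ok' (nothing to do now), 'probe' (send a probe now),
        or 'unreachable' (all probes went unanswered)."""
        now = self.clock()
        if self.probes_sent == 0:
            # Still within the allowed idle period: no probe needed yet.
            if now - self.last_heard < self.idle_time:
                return "ok"
        else:
            # A probe is outstanding: wait one interval before acting again.
            if now - self.last_probe < self.probe_interval:
                return "ok"
            if self.probes_sent >= self.max_probes:
                return "unreachable"
        self.probes_sent += 1
        self.last_probe = now
        return "probe"
```

The peer is only reported as unreachable; nothing is torn down, which
matches the stated goal of learning about connectivity without dropping
the connection. In an MPI program, poll() would typically be called from
the same loop that already polls outstanding non-blocking requests.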


On Thu, Jul 23, 2009 at 10:25 AM, vipin kumar<vipinkumar41_at_[hidden]> wrote:
> Thank you all Jeff, Jody, Prentice and Bogdan for your invaluable
> clarification, solution and suggestion,
>> Open MPI should return a failure if TCP connectivity is lost, even with a
>> non-blocking point-to-point operation.  The failure should be returned in
>> the call to MPI_TEST (and friends).
> even if MPI_TEST is a local operation?
>>  So I'm not sure your timeout has meaning here -- if you reach the
>> timeout, I think it simply means that the MPI communication has not
>> completed yet.  It does not necessarily mean that the MPI communication has
>> failed.
> You are absolutely correct, but the job should be done before the timeout
> expires. That's the reason I am using TIMEOUT.
> So the conclusion is :
>> MPI doesn't provide any standard way to check reachability and/or health
>> of a peer process.
> That's what I wanted to confirm. And to find out the solution, if any, or
> any alternative.
> So now I think I should go for Jody's approach:
>> How about you start your MPI program from a shell script that does the
>> following:
>> 1. Reads a text file containing the names of all the possible candidates
>>  for MPI nodes
>> 2. Loops through the list of names from (1) and pings each machine to
>> see if it's alive. If the host is pingable, then write its name to a
>> different text file which will be used as the machine file for the
>> mpirun command
>> 3. Call mpirun using the machine file generated in (2).
> I am assuming processes have been launched successfully.
> --
> Vipin K.
> Research Engineer,
> C-DOTB, India
> _______________________________________________
> users mailing list
> users_at_[hidden]
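
Jody's three steps quoted above could be sketched roughly as follows. The
original suggestion was a shell script; this version uses Python and is
only an illustration. The function names and file paths are hypothetical,
the ping flags ('-c 1 -W <seconds>') assume the Linux iputils version of
ping, and '--hostfile' is the Open MPI mpirun option for passing a machine
file. The reachability check and the launcher are passed in as parameters
so the filtering logic can be exercised without touching the network.

```python
import subprocess


def ping_once(host, timeout_s=1):
    """Return True if a single ping to host succeeds.

    Assumes Linux iputils flags: -c (count) and -W (timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def read_candidates(path):
    """Step 1: read candidate host names, skipping blanks and # comments."""
    with open(path) as f:
        names = (line.strip() for line in f)
        return [n for n in names if n and not n.startswith("#")]


def write_hostfile(path, hosts):
    """Step 2 (second half): write the reachable hosts, one per line."""
    with open(path, "w") as f:
        for host in hosts:
            f.write(host + "\n")


def launch(candidates_path, hostfile_path, program_args,
           is_alive=ping_once, run=subprocess.run):
    """Steps 1-3: filter the candidates, write the hostfile, call mpirun."""
    hosts = [h for h in read_candidates(candidates_path) if is_alive(h)]
    if not hosts:
        raise RuntimeError("no candidate host answered the ping")
    write_hostfile(hostfile_path, hosts)
    return run(["mpirun", "--hostfile", hostfile_path, *program_args])
```

A call such as launch("candidates.txt", "hosts.txt", ["./my_mpi_app"])
(all three names are placeholders) would perform the whole sequence. This
only checks reachability before launch; it says nothing about nodes that
become unreachable while the job is running.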