On Thu, 26 Oct 2006 15:11:46 -0600, George Bosilca <bosilca_at_[hidden]>
> The Open MPI behavior is the same independently of the network used
> for the job. At least the behavior dictated by our internal message
> passing layer.
Which is one of the things I like about Open MPI.
> There is nothing (that has a reasonable cost) we can do about this.
Nor do I think anything should be done. In all honesty, I think
it's a good thing that TCP & Myrinet have such a long timeout. It makes
administration a bit less scary; if you accidentally unplug the network
cable from the wrong node during maintenance, neither the MPI nor the
administrator loses a job.
I'm also confident that both TCP & Myrinet would throw an error when they
time out; I just haven't felt the need to verify it. (And at some 20-odd
minutes for Myrinet, it takes a bit of attention span. The last time I
tried it, I forgot about it for 3-4 hours.)
> If none are available, then Open
> MPI is supposed to abort the job. For your particular run, did you have
> Ethernet between the nodes? If yes, I'm quite sure the MPI run
> wasn't stopped ... it continued using the TCP device (if not disabled
> by hand at mpirun time).
This brings up an interesting question: The job was simply Intel's MPI
benchmark (IMB), which is fairly chatty (i.e., lots of screen output).
On the first try, I used '--mca btl ^gm,^mx' to start the job. Ethernet
was connected (eth0=10/100, eth1=gigabit), but after the IB cable was
disconnected, everything stopped. The link lights (ethernet & IB) were
not blinking, nor do any of the system monitors show much TCP traffic;
certainly not the sort of traffic one would expect from an IMB run.
I've also tried using '--mca btl openib,sm,self,tcp' (specifically adding
TCP) and didn't see any difference; the job still got 'stuck' as soon as
the IB cable was removed. I'll let that job (i.e., --mca btl
tcp,openib,sm,self) continue to run overnight to see if it ever wakes up.
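For the archives, a sketch of the BTL-selection forms discussed above. This is a non-runnable command sketch: the process count and the IMB binary name are placeholders, and I'm assuming (per the Open MPI docs) that the '^' prefix negates the entire comma-separated list that follows it, rather than being repeated per component.

```shell
# Exclusion form: '^' at the start negates the whole list (excludes gm and mx):
mpirun --mca btl ^gm,mx -np 16 ./IMB-MPI1

# Inclusion form: name the allowed BTLs explicitly, with tcp as a fallback:
mpirun --mca btl tcp,openib,sm,self -np 16 ./IMB-MPI1
```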
> --mca btl ^tcp (or --mca btl openib,sm,self).
I get messages that something is amiss with the IB fabric (as
expected). However, the job does *not* abort. Every (MPI) process on
every node in the job is still active and consuming 100% of its CPU.
> PS: There are several internal message passing modules available for
> Open MPI. The default one looks more for performance than
> reliability. If reliability is what you need, you should use the DR
> PML. For this, you can specify --mca pml dr at mpirun time. This (DR)
> PML has data reliability and timeouts (Open MPI internal timeouts that
> are configurable), allowing it to recover faster from a network failure.
I don't have such a component. Hopefully it's just the version of Open
MPI I'm using (1.1), or a ./configure option I didn't use. (If it should
be in 1.1, I'll take a deeper look and can provide things like the
config.log, etc. I just don't want to flood the list at the moment.)
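A quick way to check whether a given build includes the DR PML before trying it at run time. Another non-runnable command sketch; the grep output shown is illustrative, not copied from a real build.

```shell
# List the PML components compiled into this Open MPI installation:
ompi_info | grep "MCA pml"
# A build with DR would show a line resembling:  MCA pml: dr (...)

# If it's listed, request it when launching the job:
mpirun --mca pml dr -np 16 ./IMB-MPI1
```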