Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI hangs on multiple nodes
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2011-09-20 08:34:51

>> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't
>> happen until the third iteration. I take that to mean that the basic
>> communication works, but that something is saturating. Is there some notion
>> of buffer size somewhere in the MPI system that could explain this?
>
> Hmm. This is not a good sign; it somewhat indicates a problem with your OS.
> Based on this email and your prior emails, I'm guessing you're using TCP for
> communication, and that the problem is based on inter-node communication
> (e.g., the problem would occur even if you only run 1 process per machine,
> but does not occur if you run all N processes on a single machine, per your #4,

I agree with Jeff here. Open MPI establishes connections lazily, on first use, and round-robins across the available interfaces.
So the first few communications could succeed because they use interfaces that can reach the other node, while the third iteration picks an interface that, for some reason, cannot establish the connection.
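
To make that reasoning concrete, here is a toy model in Python. It is not Open MPI's actual code, only a sketch of the idea: connections are opened on first use, and each message takes the next interface in round-robin order. The interface names (eth0, eth1, virbr0) and the simulate() function are made up for illustration.

```python
from itertools import cycle


def simulate(interfaces, reachable, n_messages):
    """Toy model of lazy connection setup with round-robin interface choice.

    interfaces: ordered list of local interface names (hypothetical).
    reachable:  set of interfaces that can actually reach the peer node.
    Returns a list of (message_number, interface, outcome) tuples and stops
    at the first connection attempt that cannot complete.
    """
    connected = set()          # interfaces with an established connection
    next_iface = cycle(interfaces)
    events = []
    for msg in range(1, n_messages + 1):
        iface = next(next_iface)
        if iface in connected:
            events.append((msg, iface, "sent"))
        elif iface in reachable:
            # Lazy connect: the connection is only opened on first use.
            connected.add(iface)
            events.append((msg, iface, "connected, sent"))
        else:
            # In a real job this is where the run would appear to freeze.
            events.append((msg, iface, "connect hangs"))
            break
    return events


if __name__ == "__main__":
    for event in simulate(["eth0", "eth1", "virbr0"], {"eth0", "eth1"}, 5):
        print(event)
```

With two working interfaces and one that cannot reach the peer, the first two messages go through and the third one stalls, which matches the "works until the third iteration" symptom.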

One flag that may help is --mca btl_base_verbose 20, like this:

mpirun --mca btl_base_verbose 20 connectivity_c

It will dump out a bunch of stuff, but there will be a few lines that look like this:

[dt:09880] btl: tcp: attempting to connect() to [[58627,1],1] address on port 1025
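
If the verbose output is long, a small filter can pull out just those connect lines. The sketch below is a hypothetical helper, not part of Open MPI. Its pattern is modeled only on the single sample line above, so other Open MPI versions may format the message differently. It treats the address field as optional, since the sample shows none between "address" and "on port".

```python
import re

# Pattern based on the one sample line in this message (an assumption,
# not a documented log format).
CONNECT_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+btl:\s*tcp:\s+"
    r"attempting to connect\(\) to\s+(?P<peer>\[\[\d+,\d+\],\d+\])\s+"
    r"address(?:\s+(?P<address>\S+))?\s+on port\s+(?P<port>\d+)"
)


def parse_connect_attempts(log_text):
    """Return one dict per 'attempting to connect()' line found in log_text."""
    attempts = []
    for line in log_text.splitlines():
        match = CONNECT_RE.match(line.strip())
        if match:
            attempts.append({
                "host": match.group("host"),
                "pid": match.group("pid"),          # kept as text, e.g. "09880"
                "peer": match.group("peer"),
                "address": match.group("address"),  # None if not present
                "port": int(match.group("port")),
            })
    return attempts


if __name__ == "__main__":
    import sys
    # Reads saved mpirun output from standard input.
    for attempt in parse_connect_attempts(sys.stdin.read()):
        print(attempt)
```

Comparing the address and port of each attempt against the interfaces on both nodes should show which connection is the one that never completes.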

