Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] example program "ring" hangs when running across multiple hardware nodes (SOLVED)
From: Jed O. Kaplan (jedokaplan_at_[hidden])
Date: 2013-07-05 10:44:10


Dear Gus,

Thanks for your help - your clue solved my problem!

The ultimate solution was to limit MPI communications to the local,
unrouted subnet. I made this the default behavior for all users of my
cluster by adding the following line to the bottom of my
$prefix/etc/openmpi-mca-params.conf file:

btl_tcp_if_include = 10.0.0.0/8
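
For anyone who cannot edit the system-wide file, the same restriction can probably be applied per session or per run instead. The sketch below assumes Open MPI's usual convention of reading MCA parameters from environment variables named OMPI_MCA_<param>, and it reuses the 10.0.0.0/8 subnet from the line above. The hostfile name and process count in the commented command are placeholders, not values from this thread.

```shell
# Per-session alternative to editing openmpi-mca-params.conf:
# Open MPI picks up MCA parameters from OMPI_MCA_<param> environment variables.
export OMPI_MCA_btl_tcp_if_include=10.0.0.0/8

# Per-run alternative, passing the parameter on the command line
# (your_hostfile and -np 4 are placeholders; adjust for your cluster):
#   mpiexec -mca btl_tcp_if_include 10.0.0.0/8 -hostfile your_hostfile -np 4 ./ring_c

# Show what this shell will hand to Open MPI.
echo "btl_tcp_if_include = ${OMPI_MCA_btl_tcp_if_include}"
```

A value given on the mpiexec command line should take precedence over the environment variable, which in turn should take precedence over the configuration file, so the file remains a sensible place for the cluster-wide default.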

Thanks again - what a relief!

Jed

On Fri, Jul 5, 2013, at 01:25 AM, Gustavo Correa wrote:
> Hi Jed
>
> You could try selecting only the Ethernet interface that matches your
> nodes' IP addresses, which seems to be en2.
>
> The en1 interface seems to have an external IP address.
> I am not sure about en3, but it is odd that it has a
> different IP address from en2 while being in the same subnet.
> I wonder if this may be the reason for the program hanging.
>
> You may need to check the ifconfig output on all nodes for a consistent
> set of interfaces/IP addresses, and tailor your mpiexec command line and
> your hostfile accordingly.
>
> Say, something like this:
>
> mpiexec -mca btl_tcp_if_include en2 -hostfile your_hostfile -np 43 ./ring_c
>
> See this FAQ (actually, all of them are very informative):
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>
> I hope this helps,
> Gus Correa