
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI hangs across multiple nodes.
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-06 17:27:34


Open MPI requires that there be no TCP firewall between hosts that are
used in a single parallel job -- it uses random TCP ports between peers.
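
If the firewall cannot be removed entirely, a common workaround is to pin Open MPI's TCP traffic to a fixed port range with MCA parameters and open only that range between the nodes. The parameter names below are what I would expect for the 1.3-series TCP BTL and OOB -- verify them against your own build with ompi_info before relying on them:

```shell
# Show the TCP-related MCA parameters this build actually supports
# (check the parameter names used below against this output):
ompi_info --param btl tcp
ompi_info --param oob tcp

# Pin the BTL (MPI traffic) and OOB (runtime wire-up) to fixed ranges,
# then open ports 10000-10199 in the firewall on every node:
mpirun -np 4 -host node1,node2 \
    --mca btl tcp,self \
    --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 \
    --mca oob_tcp_port_min_v4 10100 --mca oob_tcp_port_range_v4 100 \
    ./your_program
```

Note that both the BTL and the OOB need reachable ports: the OOB is used to launch and wire up the job, so a firewall that blocks it produces exactly the silent hang described below.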

On Feb 5, 2009, at 2:39 AM, Robertson Burgess wrote:

> I have checked with IT. It is TCP. I have been told that there's a
> firewall on the nodes. Should I open some ports on the firewall, and
> if so, which ones?
>
> Robertson
>
>>>> Robertson Burgess 5/02/2009 5:09 pm >>>
> Thank you for your help.
> I tried the command
> mpirun -np 4 -host node1,node2 -mca btl tcp,self random
> but still got the same result.
>
> I'm fairly sure that the communication between the nodes is TCP, but
> I'm not certain; I've emailed IT support to ask, but have yet to hear
> back from them.
> Other than that I'm running the latest release of OMPI (1.3) and I
> installed it on both nodes. And yes they are in the same absolute
> paths.
> My configuration was very standard:
>
> shell$ gunzip -c openmpi-1.3.tar.gz | tar xf -
> shell$ cd openmpi-1.3
> shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/home/bburgess/bin/bin
> shell$ make all install
>
> Again, thank you for your help. I'll have to investigate whether my
> assumption that my connections are TCP is correct. When I was
> setting it up at first, and before I'd configured the nodes to log
> into each other without a password, I did get the message
>
> user@node.newcastle.edu.au's password:
>
> In my log files, so it did at least seem to be reaching the other
> node. Does that mean that my connections are working, or could there
> be more to it than that?
>
> Robertson Burgess
>
>
> Message: 2
> Date: Wed, 4 Feb 2009 15:37:44 +0200
> From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
> Subject: Re: [OMPI users] OpenMPI hangs across multiple nodes.
> To: Open MPI Users <users_at_[hidden]>
>
> What kind of communication do you have between the nodes - tcp or
> openib (IB/iWARP)?
> You can try:
>
> mpirun -np 4 -host node1,node2 -mca btl tcp,self random
>
>
>
> On Wed, Feb 4, 2009 at 1:21 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Could you tell us which version of OpenMPI you are using, and how
>> it was
>> configured?
>>
>> Did you install the OMPI libraries and binaries on both nodes? Are
>> they in
>> the same absolute path locations?
>>
>> Thanks
>> Ralph
>>
>>
>> On Feb 3, 2009, at 3:46 PM, Robertson Burgess wrote:
>>
>>> Dear users,
>>> I am quite new to OpenMPI, I have compiled it on two nodes, each
>>> node with
>>> 8 CPU cores. The two nodes are identical. The code I am using
>>> works in
>>> parallel across the 8 cores on a single node. However, whenever I
>>> try to run
>>> across both nodes, OpenMPI simply hangs. There is no output
>>> whatsoever, when
>>> I run it in background, outputting to a log file, the log file is
>>> always
>>> empty. The cores do not appear to be doing anything at all, either
>>> on the
>>> host node or on the remote node. This happens whether I am running
>>> my code,
>>> or even when I tell it to run a process that doesn't even exist,
>>> for instance
>>>
>>> mpirun -np 4 -host node1,node2 random
>>>
>>> Simply results in the terminal hanging, so all I can do is close the
>>> terminal and open up a new one.
>>>
>>> mpirun -np 4 -host node1,node2 random >& log.log &
>>>
>>> simply produces an empty log.log file.
>>>
>>> I am running Redhat Linux on the systems, and compiled OpenMPI
>>> with the
>>> Intel Compilers 10.1. As I've said, it works fine on one node. I
>>> have set up
>>> both nodes such that they can log into each other via ssh without
>>> the need
>>> for a password, and I have altered my .bashrc file so the PATH and
>>> LD_LIBRARY_PATH include the appropriate folders.
>>> I have looked through the FAQ and mailing lists, but I was unable
>>> to find
>>> anything that really matched my problem. Any help would be greatly
>>> appreciated.
>>>
>>> Sincerely,
>>> Robertson Burgess
>>> University of Newcastle
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
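
One note on the ssh password prompt mentioned above: getting that prompt only shows that ssh can reach the other node. mpirun also needs passwordless login and a remote *non-interactive* shell that can find the Open MPI binaries. A quick sanity check (the hostnames here stand in for your own):

```shell
# Should print the remote hostname with no password prompt:
ssh node2 hostname

# Non-interactive remote shells may not source the same startup files
# as a login shell; confirm Open MPI's runtime daemon is found remotely:
ssh node2 which orted
```

If `which orted` comes back empty, the PATH/LD_LIBRARY_PATH settings in .bashrc are not reaching the non-interactive shell that mpirun uses, which can also cause a silent hang.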

-- 
Jeff Squyres
Cisco Systems