Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI hangs across multiple nodes.
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-11 22:19:45


We plan to release a version soon that will use static ports, which
should help with this problem as the IT folks will only have to open
specified ports that they can select.

Unfortunately, that isn't possible with the current version :-/

Ralph
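
For anyone reading this thread later: one rough way to check whether a firewall sits between the nodes is to test whether a plain TCP connection to a given port can be opened at all. The function below is only a sketch under stated assumptions. `check_port` is a hypothetical helper name (it is not part of Open MPI), and it assumes bash (for its /dev/tcp redirection) and the coreutils `timeout` command are installed on the nodes.

```shell
# check_port HOST PORT
# Hypothetical helper, not part of Open MPI.
# Returns 0 if a TCP connection to HOST:PORT can be opened within 3 seconds,
# 2 on wrong usage, and some other non-zero value otherwise.
check_port() {
    if [ "$#" -ne 2 ]; then
        echo "usage: check_port HOST PORT" >&2
        return 2
    fi
    # bash treats /dev/tcp/HOST/PORT specially in redirections and opens a
    # TCP socket; timeout bounds the wait when a firewall silently drops
    # the packets instead of refusing the connection.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example (run on node1; host names taken from the thread):
#   check_port node2 22 && echo "port 22 reachable"
```

A failure also occurs when nothing is listening on the chosen port, so to probe a high port a listener would have to be started on the far node first. Since this version of Open MPI picks its ports at random, one reachable port does not show that a job will run; the check only helps to narrow down whether traffic is being filtered.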

On Feb 11, 2009, at 7:49 PM, Robertson Burgess wrote:

> My apologies for not changing the subject to something suitable just
> then.
>
> Thank you for that. I have not yet been able to get the IT department
> to help me with disabling the firewalls, but hopefully that is the
> problem. Sorry for the late response; I was hoping the IT department
> would be faster.
>
> Robertson
>
> Message: 2
> Date: Fri, 6 Feb 2009 17:27:34 -0500
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] OpenMPI hangs across multiple nodes.
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <8BA0E4A5-FA7C-430B-8731-231ED6E672BE_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> Open MPI requires that there be no TCP firewall between hosts that are
> used in a single parallel job -- it uses random TCP ports between
> peers.
>
>
> On Feb 5, 2009, at 2:39 AM, Robertson Burgess wrote:
>
>> I have checked with IT. It is TCP. I have been told that there's a
>> firewall on the nodes. Should I open some ports on the firewall, and
>> if so, which ones?
>>
>> Robertson
>>
>>>>> Robertson Burgess 5/02/2009 5:09 pm >>>
>> Thank you for your help.
>> I tried the command
>> mpirun -np 4 -host node1,node2 -mca btl tcp,self random
>> but still got the same result.
>>
>> I'm fairly sure that the communication between the nodes is TCP, but
>> I'm not certain. I've emailed IT support to ask them, but have yet to
>> hear back from them.
>> Other than that I'm running the latest release of OMPI (1.3) and I
>> installed it on both nodes. And yes they are in the same absolute
>> paths.
>> My configuration was very standard:
>>
>> shell$ gunzip -c openmpi-1.3.tar.gz | tar xf -
>> shell$ cd openmpi-1.3
>> shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/home/bburgess/bin/bin
>> shell$ make all install
>>
>> Again, thank you for your help. I'll have to investigate whether my
>> assumption about my connections being TCP is correct. When I was
>> setting it up at first, and before I'd configured the nodes to log
>> into each other without a password, I did get the message
>>
>> user@ node.newcastle.edu.au's password:
>>
>> in my log files, so it did at least seem to be reaching the other
>> node. Does that mean that my connections are working, or could there
>> be more to it than that?
>>
>> Robertson Burgess
>>
>>
>> Message: 2
>> Date: Wed, 4 Feb 2009 15:37:44 +0200
>> From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>> Subject: Re: [OMPI users] OpenMPI hangs across multiple nodes.
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID:
>> <453d39990902040537o45137abbh2f12db423d971eb4_at_[hidden]>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> What kind of communication do you have between the nodes: TCP or
>> openib (IB/iWARP)?
>> You can try
>>
>> mpirun -np 4 -host node1,node2 -mca btl tcp,self random
>>
>>
>>
>> On Wed, Feb 4, 2009 at 1:21 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> Could you tell us which version of OpenMPI you are using, and how it
>>> was configured?
>>>
>>> Did you install the OMPI libraries and binaries on both nodes? Are they
>>> in the same absolute path locations?
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Feb 3, 2009, at 3:46 PM, Robertson Burgess wrote:
>>>
>>>> Dear users,
>>>> I am quite new to OpenMPI. I have compiled it on two nodes, each with
>>>> 8 CPU cores. The two nodes are identical. The code I am using works in
>>>> parallel across the 8 cores on a single node. However, whenever I try
>>>> to run across both nodes, OpenMPI simply hangs. There is no output
>>>> whatsoever; when I run it in the background, outputting to a log file,
>>>> the log file is always empty. The cores do not appear to be doing
>>>> anything at all, either on the host node or on the remote node. This
>>>> happens whether I am running my own code or telling it to run a process
>>>> that doesn't even exist, for instance
>>>>
>>>> mpirun -np 4 -host node1,node2 random
>>>>
>>>> simply results in the terminal hanging, so all I can do is close the
>>>> terminal and open a new one.
>>>>
>>>> mpirun -np 4 -host node1,node2 random >& log.log &
>>>>
>>>> simply produces an empty log.log file.
>>>>
>>>> I am running Red Hat Linux on the systems, and compiled OpenMPI with
>>>> the Intel Compilers 10.1. As I've said, it works fine on one node. I
>>>> have set up both nodes so that they can log into each other via ssh
>>>> without a password, and I have altered my .bashrc file so that PATH and
>>>> LD_LIBRARY_PATH include the appropriate folders.
>>>> I have looked through the FAQ and mailing lists, but I was unable to
>>>> find anything that really matched my problem. Any help would be greatly
>>>> appreciated.
>>>>
>>>> Sincerely,
>>>> Robertson Burgess
>>>> University of Newcastle
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems