
Subject: Re: [OMPI users] tcp connectivity OS X and 1.3.3
From: Jody Klymak (jklymak_at_[hidden])
Date: 2009-08-12 14:51:58


Hi Ralph,

That gives me something more to work with...

On Aug 12, 2009, at 9:44 AM, Ralph Castain wrote:

> I believe TCP works fine, Jody, as it is used on Macs fairly widely.
> I suspect this is something funny about your installation.
>
> One thing I have found is that you can get this error message when
> you have multiple NICs installed, each with a different subnet, and
> the procs try to connect across different ones. Do you by chance
> have multiple NICs?

The head node has two active NICs:
en0: public
en1: private

The server nodes only have one active connection:
en0: private
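
In case it matters, I'm just reading the addresses and netmasks off ifconfig on each machine, e.g. on the head node (I'm assuming en1 sits on the same 192.168.2.x subnet as the compute nodes):

++++++++++
# en0 carries the public address, en1 the private one
ifconfig en0 | grep "inet "
ifconfig en1 | grep "inet "
++++++++++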

>
> Have you tried telling OMPI which TCP interface to use? You can do
> so with -mca btl_tcp_if_include eth0 (or whatever you want to use).

If I try this, I get the same results (though I need to use "en0"
rather than "eth0" on my machine)...
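
For concreteness, the invocation looks roughly like this (the hostfile name and test executable are just placeholders for what I'm actually running):

++++++++++
mpirun -np 2 --hostfile hosts \
    --mca btl_tcp_if_include en0 \
    ./connectivity_c
++++++++++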

If I include -mca btl_base_verbose 30, I get the following for n=2:

++++++++++
[xserve03.local:00841] select: init of component tcp returned success
Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
Done MPI init
[xserve02.local:01094] btl: tcp: attempting to connect() to address
192.168.2.103 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
Connectivity test on 2 processes PASSED.
++++++++++

If I try n=3, the job hangs and I have to kill it:

++++++++++
Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
[xserve02.local:01110] btl: tcp: attempting to connect() to address
192.168.2.103 on port 4
Done MPI init
Done MPI init
checking connection between rank 1 on xserve03.local and rank 2
[xserve03.local:00860] btl: tcp: attempting to connect() to address
192.168.2.102 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
checking connection between rank 0 on xserve02.local and rank 2
Done checking connection between rank 0 on xserve02.local and rank 2
mpirun: killing job...
++++++++++

Those IP addresses are correct; I have no idea whether port 4 makes
sense. Sometimes I get port 260. Should xserve03 and xserve02 be
trying to use the same port for these comms?
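
If they are supposed to agree on a known range, I could also try pinning the listening ports explicitly. My understanding from ompi_info is that the 1.3 series has MCA parameters for this, something like the following (the parameter names are from memory, so treat them as an assumption):

++++++++++
# pin the TCP BTL to a fixed 100-port window starting at 10000
# (btl_tcp_port_min_v4 / btl_tcp_port_range_v4 -- names are my assumption)
mpirun -np 3 --hostfile hosts \
    --mca btl_tcp_if_include en0 \
    --mca btl_tcp_port_min_v4 10000 \
    --mca btl_tcp_port_range_v4 100 \
    ./connectivity_c
++++++++++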

Thanks, Jody

>
>
> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jklymak_at_[hidden]> wrote:
>
> On Aug 11, 2009, at 6:55 PM, Gus Correa wrote:
>
>
> Did you wipe off the old directories before reinstalling?
>
> Check.
>
> I prefer to install on a NFS mounted directory,
>
> Check
>
>
> Have you tried to ssh from node to node on all possible pairs?
>
> check - fixed this today, works fine with the spawning user...
>
> How could you roll back to 1.1.5,
> now that you overwrote the directories?
>
> Oh, I still have it on another machine off the cluster in /usr/local/
> openmpi. It will take just 5 minutes to reinstall.
>
> Launching jobs with Torque is much better than
> using bare-bones mpirun.
>
> And you don't want to stay behind with the OpenMPI versions
> and improvements either.
>
> Sure, but I'd like the jobs to be able to run at all...
>
> Is there any sense in rolling back to 1.2.3, since that is known
> to work with OS X? (It's the one that comes with 10.5.) My only guess
> at this point is that other OS X users are using non-TCP communication,
> and the TCP stuff just doesn't work in 1.3.3.
>
> Thanks, Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
>
>
>
>

--
Jody Klymak
http://web.uvic.ca/~jklymak/