Hi Ralph,
That gives me something more to work with...
On Aug 12, 2009, at 9:44 AM, Ralph Castain wrote:
> I believe TCP works fine, Jody, as it is used on Macs fairly widely.
> I suspect this is something funny about your installation.
>
> One thing I have found is that you can get this error message when
> you have multiple NICs installed, each with a different subnet, and
> the procs try to connect across different ones. Do you by chance
> have multiple NICs?
The head node has two active NICs:
en0: public
en1: private
The server nodes have only one connection:
en0: private
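
A quick way to sanity-check the subnet question is sketched below. It takes two of the private addresses that show up in the verbose output further down and tests whether they fall in the same network. The /24 prefix length is an assumption (the actual netmask is not stated anywhere in this thread), the 203.0.113.10 address is only a documentation placeholder standing in for a public interface, and the helper name is made up for illustration.

```python
import ipaddress


def same_subnet(addr_a, addr_b, prefix_len=24):
    """Return True if both IPv4 addresses lie in the same network.

    prefix_len=24 is an assumed netmask (255.255.255.0); substitute
    the value reported by ifconfig on the actual nodes.
    """
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


# The two private addresses that appear in the verbose output.
print(same_subnet("192.168.2.102", "192.168.2.103"))

# A placeholder public address (documentation range), for comparison only.
print(same_subnet("192.168.2.102", "203.0.113.10"))
```

If every pair of nodes passes this check on the private interface, a cross-subnet connection attempt becomes a less likely explanation.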
>
> Have you tried telling OMPI which TCP interface to use? You can do
> so with -mca btl_tcp_if_include eth0 (or whatever you want to use).
If I try this, I get the same results (though I need to use "en0" on
my machine).
If I include -mca btl_base_verbose 30, I get the following for n=2:
++++++++++
[xserve03.local:00841] select: init of component tcp returned success
Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
Done MPI init
[xserve02.local:01094] btl: tcp: attempting to connect() to address
192.168.2.103 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
Connectivity test on 2 processes PASSED.
++++++++++
If I try n=3, the job hangs and I have to kill it:
++++++++++
Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
[xserve02.local:01110] btl: tcp: attempting to connect() to address
192.168.2.103 on port 4
Done MPI init
Done MPI init
checking connection between rank 1 on xserve03.local and rank 2
[xserve03.local:00860] btl: tcp: attempting to connect() to address
192.168.2.102 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
checking connection between rank 0 on xserve02.local and rank 2
Done checking connection between rank 0 on xserve02.local and rank 2
mpirun: killing job...
++++++++++
Those IP addresses are correct; I have no idea whether port 4 makes sense.
Sometimes I get port 260. Should xserve03 and xserve02 be trying to
use the same port for these comms?
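
One possible reading of those port numbers, offered as a guess and not confirmed against the Open MPI source: if the verbose message prints the port in network byte order without converting it, then on a little-endian (Intel) host the values 4 and 260 would really be ports 1024 and 1025. The sketch below just swaps the two bytes of a 16-bit value to show the arithmetic; the function name is hypothetical.

```python
def swap16(port):
    """Swap the two bytes of a 16-bit port number.

    This mimics reading a network-byte-order (big-endian) port as if it
    were a little-endian host integer. Whether the log message actually
    does this is an assumption, not something verified here.
    """
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return ((port & 0xFF) << 8) | (port >> 8)


for logged in (4, 260):
    print(logged, "->", swap16(logged))
```

If that reading is right, both nodes would simply be using the first couple of ports at or above 1024, and seeing the same number on two different hosts would not indicate a conflict.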
Thanks, Jody
>
>
> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jklymak_at_[hidden]> wrote:
>
> On Aug 11, 2009, at 18:55 PM, Gus Correa wrote:
>
>
> Did you wipe off the old directories before reinstalling?
>
> Check.
>
> I prefer to install on a NFS mounted directory,
>
> Check
>
>
> Have you tried to ssh from node to node on all possible pairs?
>
> check - fixed this today, works fine with the spawning user...
>
> How could you roll back to 1.1.5,
> now that you overwrote the directories?
>
> Oh, I still have it on another machine off the cluster in
> /usr/local/openmpi. Will take just 5 minutes to reinstall.
>
> Launching jobs with Torque is way much better than
> using barebones mpirun.
>
> And you don't want to stay behind with the OpenMPI versions
> and improvements either.
>
> Sure, but I'd like the jobs to be able to run at all..
>
> Is there any sense in rolling back to 1.2.3, since that is known
> to work with OS X (it's the one that comes with 10.5)? My only guess
> at this point is that other OS X users are using non-TCP/IP communication,
> and the TCP stuff just doesn't work in 1.3.3.
>
> Thanks, Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
--
Jody Klymak
http://web.uvic.ca/~jklymak/