
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] tcp connectivity OS X and 1.3.3
From: Jody Klymak (jklymak_at_[hidden])
Date: 2009-08-12 15:46:52


On Aug 12, 2009, at 12:31 PM, Ralph Castain wrote:

> Well, it is getting better! :-)
>
> On your cmd line, what btl's are you specifying? You should try -mca
> btl sm,tcp,self for this to work. Reason: sometimes systems block
> tcp loopback on the node. What I see below indicates that inter-node
> comm was fine, but the two procs that share a node couldn't
> communicate. Including shared memory should remove that problem.

It looks like sm,tcp,self are being initialized automatically - this
repeats for each node:

[xserve03.local:01008] mca: base: components_open: Looking for btl components
[xserve03.local:01008] mca: base: components_open: opening btl components
[xserve03.local:01008] mca: base: components_open: found loaded component self
[xserve03.local:01008] mca: base: components_open: component self has no register function
[xserve03.local:01008] mca: base: components_open: component self open function successful
[xserve03.local:01008] mca: base: components_open: found loaded component sm
[xserve03.local:01008] mca: base: components_open: component sm has no register function
[xserve03.local:01008] mca: base: components_open: component sm open function successful
[xserve03.local:01008] mca: base: components_open: found loaded component tcp
[xserve03.local:01008] mca: base: components_open: component tcp has no register function
[xserve03.local:01008] mca: base: components_open: component tcp open function successful
[xserve03.local:01008] select: initializing btl component self
[xserve03.local:01008] select: init of component self returned success
[xserve03.local:01008] select: initializing btl component sm
[xserve03.local:01008] select: init of component sm returned success
[xserve03.local:01008] select: initializing btl component tcp
[xserve03.local:01008] select: init of component tcp returned success

I should have reminded you of the command line:

/usr/local/openmpi/bin/mpirun -n 3 -mca btl_base_verbose 30 -mca btl_tcp_if_include en0 --bynode -host xserve02,xserve03 connectivity_c >& connectivity_c3_2host.txt
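
(If it turns out I do need to force the list, I assume the command would just gain the flag Ralph suggested, i.e. something like:

/usr/local/openmpi/bin/mpirun -n 3 -mca btl sm,tcp,self -mca btl_base_verbose 30 -mca btl_tcp_if_include en0 --bynode -host xserve02,xserve03 connectivity_c >& connectivity_c3_2host.txt )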

So I think ranks 0 and 2 are on xserve02 and rank 1 is on xserve03, in
which case I still think it is inter-node tcp communication that is failing...
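
(That placement is what I'd expect from --bynode, which should assign ranks round-robin across the host list, so roughly:

   rank 0 -> xserve02
   rank 1 -> xserve03
   rank 2 -> xserve02

and it matches the hostnames printed in the output below.)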

Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
Done MPI init
[xserve02.local:01382] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
Done MPI init
checking connection between rank 1 on xserve03.local and rank 2
[xserve03.local:01008] btl: tcp: attempting to connect() to address 192.168.2.102 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
checking connection between rank 0 on xserve02.local and rank 2
Done checking connection between rank 0 on xserve02.local and rank 2
mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1008 on node xserve03 exited on signal 0 (Signal 0).
--------------------------------------------------------------------------
mpirun: clean termination accomplished
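
For context, I believe connectivity_c just does a send/recv between every pair of ranks in turn, so a hang right after the rank 0 <-> rank 2 check points at the rank 1 <-> rank 2 (xserve03 -> xserve02) pair. A rough sketch of that kind of pairwise check (my own sketch, not the actual Open MPI example source):

/* Pairwise connectivity sketch: each pair (i, j) with i < j exchanges a
 * small message.  If a particular pair hangs, the link between those two
 * ranks (and hence those two nodes) is the suspect. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i, j, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Done MPI init\n");

    for (i = 0; i < size; i++) {
        for (j = i + 1; j < size; j++) {
            if (rank == i) {
                printf("checking connection between rank %d and rank %d\n", i, j);
                MPI_Send(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("Done checking connection between rank %d and rank %d\n", i, j);
            } else if (rank == j) {
                MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            }
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("Connectivity test on %d processes PASSED.\n", size);
    MPI_Finalize();
    return 0;
}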

Thanks, Jody

>
> The port numbers are fine and can be different or the same - it is
> totally random. The procs exchange their respective port info during
> wireup.
>
>
> On Wed, Aug 12, 2009 at 12:51 PM, Jody Klymak <jklymak_at_[hidden]> wrote:
> Hi Ralph,
>
> That gives me something more to work with...
>
>
> On Aug 12, 2009, at 9:44 AM, Ralph Castain wrote:
>
>> I believe TCP works fine, Jody, as it is used on Macs fairly
>> widely. I suspect this is something funny about your installation.
>>
>> One thing I have found is that you can get this error message when
>> you have multiple NICs installed, each with a different subnet, and
>> the procs try to connect across different ones. Do you by chance
>> have multiple NICs?
>
> The head node has two active NICs:
> en0: public
> en1: private
>
> The server nodes only have one active NIC:
> en0: private
>
>>
>> Have you tried telling OMPI which TCP interface to use? You can do
>> so with -mca btl_tcp_if_include eth0 (or whatever you want to use).
>
> If I try this, I get the same results (though I need to use "en0"
> on my machine)...
>
> If I include -mca btl_base_verbose 30 I get for n=2:
>
> ++++++++++
> [xserve03.local:00841] select: init of component tcp returned success
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> Done MPI init
> [xserve02.local:01094] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> Connectivity test on 2 processes PASSED.
> ++++++++++
>
> If I try n=3 the job hangs and I have to kill:
>
> ++++++++++
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> [xserve02.local:01110] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
> Done MPI init
> Done MPI init
> checking connection between rank 1 on xserve03.local and rank 2
> [xserve03.local:00860] btl: tcp: attempting to connect() to address 192.168.2.102 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> checking connection between rank 0 on xserve02.local and rank 2
> Done checking connection between rank 0 on xserve02.local and rank 2
> mpirun: killing job...
> ++++++++++
>
> Those IP addresses are correct; no idea if port 4 makes sense.
> Sometimes I get port 260. Should xserve03 and xserve02 be trying to
> use the same port for these comms?
>
>
> Thanks, Jody
>
>
>
>>
>>
>> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jklymak_at_[hidden]>
>> wrote:
>>
>> On Aug 11, 2009, at 18:55, Gus Correa wrote:
>>
>>
>> Did you wipe off the old directories before reinstalling?
>>
>> Check.
>>
>> I prefer to install on a NFS mounted directory,
>>
>> Check
>>
>>
>> Have you tried to ssh from node to node on all possible pairs?
>>
>> check - fixed this today, works fine with the spawning user...
>>
>> How could you roll back to 1.1.5,
>> now that you overwrote the directories?
>>
>> Oh, I still have it on another machine off the cluster in
>> /usr/local/openmpi. Will take just 5 minutes to reinstall.
>>
>> Launching jobs with Torque is much better than
>> using barebones mpirun.
>>
>> And you don't want to stay behind with the OpenMPI versions
>> and improvements either.
>>
>> Sure, but I'd like the jobs to be able to run at all..
>>
>> Is there any sense in rolling back to 1.2.3 since that is known
>> to work with OS X (it's the one that comes with 10.5)? My only
>> guess at this point is that other OS X users are using non-TCP/IP
>> communication, and the tcp stuff just doesn't work in 1.3.3.
>>
>> Thanks, Jody
>>

--
Jody Klymak
http://web.uvic.ca/~jklymak/