
Subject: Re: [OMPI users] tcp connectivity OS X and 1.3.3
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-12 22:09:11


Hmmm...well, I'm going to ask our TCP friends for some help here.

Meantime, I do see one thing that stands out. Port 4 is an awfully low
port number that usually sits in the reserved range. I checked the
/etc/services file on my Mac, and port 4 is commented out as unassigned,
which should mean it is okay to use.

Still, that is an unusual number. The default minimum port number is
1024, so I'm puzzled how you wound up down there. Of course, it could
just be an error in the print statement, but let's try moving it to be
safe. Set

-mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32

and see what happens.
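
For example, tacked onto the command line from your earlier message (same
interface and hosts; adjust as needed), that would look something like:

  /usr/local/openmpi/bin/mpirun -n 3 -mca btl_base_verbose 30 \
      -mca btl_tcp_if_include en0 \
      -mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 \
      --bynode -host xserve02,xserve03 connectivity_c

That should restrict the TCP BTL's listening ports to roughly
36900-36931, well clear of the reserved range.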

Ralph

On Aug 12, 2009, at 1:46 PM, Jody Klymak wrote:

>
> On Aug 12, 2009, at 12:31 PM, Ralph Castain wrote:
>
>> Well, it is getting better! :-)
>>
>> On your cmd line, what btl's are you specifying? You should try
>> -mca btl sm,tcp,self for this to work. Reason: sometimes systems
>> block tcp loopback on the node. What I see below indicates that
>> inter-node comm was fine, but the two procs that share a node
>> couldn't communicate. Including shared memory should remove that
>> problem.
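>>
>> For example, something like this (using your hosts and test program;
>> adjust as needed):
>>
>>   mpirun -n 3 -mca btl sm,tcp,self --bynode -host xserve02,xserve03 connectivity_c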
>
> It looks like sm,tcp,self are being initialized automatically - this
> repeats for each node:
>
> [xserve03.local:01008] mca: base: components_open: Looking for btl components
> [xserve03.local:01008] mca: base: components_open: opening btl components
> [xserve03.local:01008] mca: base: components_open: found loaded component self
> [xserve03.local:01008] mca: base: components_open: component self has no register function
> [xserve03.local:01008] mca: base: components_open: component self open function successful
> [xserve03.local:01008] mca: base: components_open: found loaded component sm
> [xserve03.local:01008] mca: base: components_open: component sm has no register function
> [xserve03.local:01008] mca: base: components_open: component sm open function successful
> [xserve03.local:01008] mca: base: components_open: found loaded component tcp
> [xserve03.local:01008] mca: base: components_open: component tcp has no register function
> [xserve03.local:01008] mca: base: components_open: component tcp open function successful
> [xserve03.local:01008] select: initializing btl component self
> [xserve03.local:01008] select: init of component self returned success
> [xserve03.local:01008] select: initializing btl component sm
> [xserve03.local:01008] select: init of component sm returned success
> [xserve03.local:01008] select: initializing btl component tcp
> [xserve03.local:01008] select: init of component tcp returned success
>
> I should have reminded you of the command line:
>
> /usr/local/openmpi/bin/mpirun -n 3 -mca btl_base_verbose 30 -mca btl_tcp_if_include en0 --bynode -host xserve02,xserve03 connectivity_c >& connectivity_c3_2host.txt
>
> So I think ranks 0 and 2 are on xserve02 and rank 1 is on xserve03,
> in which case I still think it is the tcp communication that is failing...
>
>
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> Done MPI init
> [xserve02.local:01382] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
> Done MPI init
> checking connection between rank 1 on xserve03.local and rank 2
> [xserve03.local:01008] btl: tcp: attempting to connect() to address 192.168.2.102 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> checking connection between rank 0 on xserve02.local and rank 2
> Done checking connection between rank 0 on xserve02.local and rank 2
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 1008 on node xserve03 exited on signal 0 (Signal 0).
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> Thanks, Jody
>
>
>>
>> The port numbers are fine and can be different or the same - it is
>> totally random. The procs exchange their respective port info
>> during wireup.
>>
>>
>> On Wed, Aug 12, 2009 at 12:51 PM, Jody Klymak <jklymak_at_[hidden]>
>> wrote:
>> Hi Ralph,
>>
>> That gives me something more to work with...
>>
>>
>> On Aug 12, 2009, at 9:44 AM, Ralph Castain wrote:
>>
>>> I believe TCP works fine, Jody, as it is used on Macs fairly
>>> widely. I suspect this is something funny about your installation.
>>>
>>> One thing I have found is that you can get this error message when
>>> you have multiple NICs installed, each with a different subnet,
>>> and the procs try to connect across different ones. Do you by
>>> chance have multiple NICs?
>>
>> The head node has two active NICs:
>> en0: public
>> en1: private
>>
>> The server nodes only have one connection:
>> en0: private
>>
>>>
>>> Have you tried telling OMPI which TCP interface to use? You can do
>>> so with -mca btl_tcp_if_include eth0 (or whatever you want to use).
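>>>
>>> For instance (hostnames here are placeholders; interface names differ
>>> by platform, and on OS X they are typically en0/en1 rather than eth0):
>>>
>>>   mpirun -n 2 -mca btl_tcp_if_include eth0 -host hostA,hostB ./connectivity_c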
>>
>> If I try this, I get the same results (though I need to use "en0"
>> on my machine)...
>>
>> If I include -mca btl_base_verbose 30 I get for n=2:
>>
>> ++++++++++
>> [xserve03.local:00841] select: init of component tcp returned success
>> Done MPI init
>> checking connection between rank 0 on xserve02.local and rank 1
>> Done MPI init
>> [xserve02.local:01094] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
>> Done checking connection between rank 0 on xserve02.local and rank 1
>> Connectivity test on 2 processes PASSED.
>> ++++++++++
>>
>> If I try n=3, the job hangs and I have to kill it:
>>
>> ++++++++++
>> Done MPI init
>> checking connection between rank 0 on xserve02.local and rank 1
>> [xserve02.local:01110] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
>> Done MPI init
>> Done MPI init
>> checking connection between rank 1 on xserve03.local and rank 2
>> [xserve03.local:00860] btl: tcp: attempting to connect() to address 192.168.2.102 on port 4
>> Done checking connection between rank 0 on xserve02.local and rank 1
>> checking connection between rank 0 on xserve02.local and rank 2
>> Done checking connection between rank 0 on xserve02.local and rank 2
>> mpirun: killing job...
>> ++++++++++
>>
>> Those IP addresses are correct; no idea if port 4 makes sense.
>> Sometimes I get port 260. Should xserve03 and xserve02 be trying
>> to use the same port for these comms?
>>
>>
>> Thanks, Jody
>>
>>
>>
>>>
>>>
>>> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jklymak_at_[hidden]>
>>> wrote:
>>>
>>> On Aug 11, 2009, at 18:55 PM, Gus Correa wrote:
>>>
>>>
>>> Did you wipe off the old directories before reinstalling?
>>>
>>> Check.
>>>
>>> I prefer to install on an NFS-mounted directory,
>>>
>>> Check.
>>>
>>>
>>> Have you tried to ssh from node to node on all possible pairs?
>>>
>>> Check - fixed this today; works fine with the spawning user...
>>>
>>> How could you roll back to 1.1.5,
>>> now that you overwrote the directories?
>>>
>>> Oh, I still have it on another machine off the cluster in
>>> /usr/local/openmpi. Will take just 5 minutes to reinstall.
>>>
>>> Launching jobs with Torque is much better than
>>> using barebones mpirun.
>>>
>>> And you don't want to fall behind on the OpenMPI versions
>>> and improvements either.
>>>
>>> Sure, but I'd like the jobs to be able to run at all...
>>>
>>> Is there any sense in rolling back to 1.2.3, since that is known
>>> to work with OS X (it's the one that comes with 10.5)? My only
>>> guess at this point is that other OS X users are using non-TCP/IP
>>> communication, and the tcp stuff just doesn't work in 1.3.3.
>>>
>>> Thanks, Jody
>>>
>>> --
>>> Jody Klymak
>>> http://web.uvic.ca/~jklymak/
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/
>>
>>
>>
>>
>>
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
>
>
>
>