
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp connectivity OS X and 1.3.3
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-08-13 11:28:48


Agreed -- ports 4 and 260 are in the reserved port range. Are
you running as root, perchance?
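
In case it's useful, here is the combined suggestion from this thread
as a single command line -- just a sketch, reusing the hostnames and
connectivity_c test from Jody's earlier mail, with the shared-memory/
TCP/self BTLs, TCP restricted to en0, and the TCP port range forced
well above the reserved range:

  mpirun -n 3 --bynode -host xserve02,xserve03 \
      -mca btl sm,tcp,self \
      -mca btl_tcp_if_include en0 \
      -mca btl_tcp_port_min_v4 36900 \
      -mca btl_tcp_port_range_v4 32 \
      -mca btl_base_verbose 30 \
      connectivity_c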

On Aug 12, 2009, at 10:09 PM, Ralph Castain wrote:

> Hmmm...well, I'm going to ask our TCP friends for some help here.
>
> Meantime, I do see one thing that stands out. Port 4 is an awfully
> low port number that usually sits in the reserved range. I checked
> the /etc/services file on my Mac, and it was commented out as
> unassigned, which should mean it was okay.
>
> Still, that is an unusual number. The default minimum port number is
> 1024, so I'm puzzled how you wound up down there. Of course, it could
> just be an error in the print statement, but let's try moving it to
> be safe. Set
>
> -mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32
>
> and see what happens.
>
> Ralph
>
> On Aug 12, 2009, at 1:46 PM, Jody Klymak wrote:
>
>>
>> On Aug 12, 2009, at 12:31 PM, Ralph Castain wrote:
>>
>>> Well, it is getting better! :-)
>>>
>>> On your cmd line, what BTLs are you specifying? You should try
>>> -mca btl sm,tcp,self for this to work. Reason: sometimes systems
>>> block tcp loopback on the node. What I see below indicates that
>>> inter-node comm was fine, but the two procs that share a node
>>> couldn't communicate. Including shared memory should remove that
>>> problem.
>>
>> It looks like sm,tcp,self are being initialized automatically -
>> this repeats for each node:
>>
>> [xserve03.local:01008] mca: base: components_open: Looking for btl
>> components
>> [xserve03.local:01008] mca: base: components_open: opening btl
>> components
>> [xserve03.local:01008] mca: base: components_open: found loaded
>> component self
>> [xserve03.local:01008] mca: base: components_open: component self
>> has no register function
>> [xserve03.local:01008] mca: base: components_open: component self
>> open function successful
>> [xserve03.local:01008] mca: base: components_open: found loaded
>> component sm
>> [xserve03.local:01008] mca: base: components_open: component sm has
>> no register function
>> [xserve03.local:01008] mca: base: components_open: component sm
>> open function successful
>> [xserve03.local:01008] mca: base: components_open: found loaded
>> component tcp
>> [xserve03.local:01008] mca: base: components_open: component tcp
>> has no register function
>> [xserve03.local:01008] mca: base: components_open: component tcp
>> open function successful
>> [xserve03.local:01008] select: initializing btl component self
>> [xserve03.local:01008] select: init of component self returned
>> success
>> [xserve03.local:01008] select: initializing btl component sm
>> [xserve03.local:01008] select: init of component sm returned success
>> [xserve03.local:01008] select: initializing btl component tcp
>> [xserve03.local:01008] select: init of component tcp returned success
>>
>> I should have reminded you of the command line:
>>
>> /usr/local/openmpi/bin/mpirun -n 3 -mca btl_base_verbose 30 -mca
>> btl_tcp_if_include en0 --bynode -host xserve02,xserve03
>> connectivity_c >& connectivity_c3_2host.txt
>>
>> So I think ranks 0 and 2 are on xserve02 and rank 1 is on xserve03,
>> in which case I still think it is tcp communication...
>>
>>
>> Done MPI init
>> checking connection between rank 0 on xserve02.local and rank 1
>> Done MPI init
>> [xserve02.local:01382] btl: tcp: attempting to connect() to address
>> 192.168.2.103 on port 4
>> Done MPI init
>> checking connection between rank 1 on xserve03.local and rank 2
>> [xserve03.local:01008] btl: tcp: attempting to connect() to address
>> 192.168.2.102 on port 4
>> Done checking connection between rank 0 on xserve02.local and rank 1
>> checking connection between rank 0 on xserve02.local and rank 2
>> Done checking connection between rank 0 on xserve02.local and rank 2
>> mpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 1008 on node xserve03
>> exited on signal 0 (Signal 0).
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>>
>> Thanks, Jody
>>
>>
>>>
>>> The port numbers are fine and can be different or the same - it is
>>> totally random. The procs exchange their respective port info
>>> during wireup.
>>>
>>>
>>> On Wed, Aug 12, 2009 at 12:51 PM, Jody Klymak <jklymak_at_[hidden]>
>>> wrote:
>>> Hi Ralph,
>>>
>>> That gives me something more to work with...
>>>
>>>
>>> On Aug 12, 2009, at 9:44 AM, Ralph Castain wrote:
>>>
>>>> I believe TCP works fine, Jody, as it is used on Macs fairly
>>>> widely. I suspect this is something funny about your installation.
>>>>
>>>> One thing I have found is that you can get this error message
>>>> when you have multiple NICs installed, each with a different
>>>> subnet, and the procs try to connect across different ones. Do
>>>> you by chance have multiple NICs?
>>>
>>> The head node has two active NICs:
>>> en0: public
>>> en1: private
>>>
>>> The server nodes only have one connection
>>> en0:private
>>>
>>>>
>>>> Have you tried telling OMPI which TCP interface to use? You can
>>>> do so with -mca btl_tcp_if_include eth0 (or whatever you want to
>>>> use).
>>>
>>> If I try this, I get the same results (though I need to use "en0"
>>> on my machine)...
>>>
>>> If I include -mca btl_base_verbose 30 I get for n=2:
>>>
>>> ++++++++++
>>> [xserve03.local:00841] select: init of component tcp returned
>>> success
>>> Done MPI init
>>> checking connection between rank 0 on xserve02.local and rank 1
>>> Done MPI init
>>> [xserve02.local:01094] btl: tcp: attempting to connect() to
>>> address 192.168.2.103 on port 4
>>> Done checking connection between rank 0 on xserve02.local and rank 1
>>> Connectivity test on 2 processes PASSED.
>>> ++++++++++
>>>
>>> If I try n=3 the job hangs and I have to kill:
>>>
>>> ++++++++++
>>> Done MPI init
>>> checking connection between rank 0 on xserve02.local and rank 1
>>> [xserve02.local:01110] btl: tcp: attempting to connect() to
>>> address 192.168.2.103 on port 4
>>> Done MPI init
>>> Done MPI init
>>> checking connection between rank 1 on xserve03.local and rank 2
>>> [xserve03.local:00860] btl: tcp: attempting to connect() to
>>> address 192.168.2.102 on port 4
>>> Done checking connection between rank 0 on xserve02.local and rank 1
>>> checking connection between rank 0 on xserve02.local and rank 2
>>> Done checking connection between rank 0 on xserve02.local and rank 2
>>> mpirun: killing job...
>>> ++++++++++
>>>
>>> Those IP addresses are correct; no idea if port 4 makes sense.
>>> Sometimes I get port 260. Should xserve03 and xserve02 be trying
>>> to use the same port for these comms?
>>>
>>>
>>> Thanks, Jody
>>>
>>>
>>>
>>>>
>>>>
>>>> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jklymak_at_[hidden]>
>>>> wrote:
>>>>
>>>> On Aug 11, 2009, at 18:55 PM, Gus Correa wrote:
>>>>
>>>>
>>>> Did you wipe off the old directories before reinstalling?
>>>>
>>>> Check.
>>>>
>>>> I prefer to install on a NFS mounted directory,
>>>>
>>>> Check
>>>>
>>>>
>>>> Have you tried to ssh from node to node on all possible pairs?
>>>>
>>>> check - fixed this today, works fine with the spawning user...
>>>>
>>>> How could you roll back to 1.1.5,
>>>> now that you overwrote the directories?
>>>>
>>>> Oh, I still have it on another machine off the cluster in
>>>> /usr/local/openmpi. Will take just 5 minutes to reinstall.
>>>>
>>>> Launching jobs with Torque is much better than using a
>>>> bare-bones mpirun.
>>>>
>>>> And you don't want to stay behind with the OpenMPI versions
>>>> and improvements either.
>>>>
>>>> Sure, but I'd like the jobs to be able to run at all..
>>>>
>>>> Is there any sense in rolling back to 1.2.3, since that is
>>>> known to work with OS X (it's the one that comes with 10.5)? My
>>>> only guess at this point is that other OS X users are using
>>>> non-TCP/IP communication, and the tcp stuff just doesn't work
>>>> in 1.3.3.
>>>>
>>>> Thanks, Jody
>>>>
>>>> --
>>>> Jody Klymak
>>>> http://web.uvic.ca/~jklymak/
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jody Klymak
>>> http://web.uvic.ca/~jklymak/
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/
>>
>>
>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
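
For the archives: the "checking connection between rank X ... and rank
Y" lines quoted above come from a pairwise connectivity test. A minimal
sketch of that kind of check is below (hypothetical code, not the
actual connectivity_c source shipped with Open MPI); the point is that
a hang right after "attempting to connect()" implicates the TCP wireup
between those two ranks rather than the test itself.

/* connectivity_sketch.c -- hypothetical pairwise connectivity check,
 * loosely modeled on the output quoted above; NOT the real
 * connectivity_c from the Open MPI examples. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, peer, len, token = 0;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Done MPI init\n");

    /* Every pair of ranks exchanges one token: the lower rank sends
     * first and waits for the echo, the higher rank echoes back.
     * A hang here means the two ranks cannot open a BTL connection. */
    for (peer = 0; peer < size; peer++) {
        if (peer == rank)
            continue;
        if (rank < peer) {
            printf("checking connection between rank %d on %s and rank %d\n",
                   rank, host, peer);
            MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Done checking connection between rank %d on %s and rank %d\n",
                   rank, host, peer);
        } else {
            MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        }
    }

    /* Only claim success once every rank has finished its pairs. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("Connectivity test on %d processes PASSED.\n", size);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with the mpirun options discussed in this
thread, a hang at a particular "checking connection" line narrows the
problem to that specific pair of hosts.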

-- 
Jeff Squyres
jsquyres_at_[hidden]