Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-09-24 12:46:03


Have you tried running without limiting the port range?
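A minimal sketch of what that retry could look like, assuming the mpirun command line quoted further down in this thread. The helper names `strip_mca_params` and `swap16` are hypothetical, and the misnamed preconnect parameter that Rolf flagged is left out. The second helper checks one possible reading of the port mismatch noted earlier in the thread (36900 requested, 9360 in the debug line): the two values are byte-swapped forms of each other, which may only mean the debug message prints the port in network byte order. The thread does not confirm this.

```python
def strip_mca_params(argv, names):
    """Return a copy of argv without the '--mca NAME VALUE' triples in names."""
    out = []
    i = 0
    while i < len(argv):
        is_mca_flag = argv[i] in ("--mca", "-mca")
        if is_mca_flag and i + 2 < len(argv) and argv[i + 1] in names:
            i += 3  # skip the flag, the parameter name, and its value
            continue
        out.append(argv[i])
        i += 1
    return out


def swap16(port):
    """Swap the two bytes of a 16-bit port number."""
    return ((port & 0xFF) << 8) | (port >> 8)


# Command line as quoted in the thread (minus the misnamed preconnect option).
ORIGINAL = [
    "/usr/local/bin/mpirun",
    "--mca", "btl_tcp_port_min_v4", "36900",
    "-mca", "btl_tcp_port_range_v4", "32",
    "--mca", "btl_base_verbose", "30",
    "--mca", "btl", "tcp,self",
    "-np", "2", "-hetero",
    "-H", "localhost,10.11.14.205",
    "/tmp/hello",
]
PORT_PARAMS = {"btl_tcp_port_min_v4", "btl_tcp_port_range_v4"}

if __name__ == "__main__":
    # Command to try without any port-range restriction.
    print(" ".join(strip_mca_params(ORIGINAL, PORT_PARAMS)))
    # Byte-swapped form of the requested minimum port.
    print(swap16(36900))
```

If the byte-order reading is right, the port in the debug line is consistent with the requested range, and dropping the range is simply a way to rule the setting out.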

On Sep 24, 2009, at 12:39 PM, Pallab Datta wrote:

> Hi All,
>
> Yes I can ping and ssh from apex-backpack to my Mac (fuji.local).
> I fixed the wireless broadcast to reflect the same on both ends
> (10.11.14.255) but still the problem persists.
>
> I have tried other wireless adapters as well, but no luck so far.
> Please let me know what can be done...
> regards, pallab
>
>> (putting this back on the list where others can reply as well, and if
>> we solve it, the solution will be google-ized)
>>
>> According to your debug output:
>>
>>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>>> 10.11.14.203 on port 9360
>>
>> It *is* trying to connect to the right IP address. Are you able to
>> ping to .203 from apex-backpack?
>>
>> I also notice that your ethernet configuration does not exactly match
>> between linux and osx:
>>
>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>> inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>>
>> wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
>> inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>>
>>
>> On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:
>>
>>> There is no firewall running between the machines. I tried using the IP
>>> address instead of localhost but it gave me the same output. MPI is not
>>> even timing out; it just hangs indefinitely. :(
>>>
>>> I have disabled the ethernet interface on the Linux box, keeping only the
>>> wireless up. On the Mac I only have the ethernet turned on. My Mac is an
>>> 8-core Mac Pro.
>>>
>>> Please help me debug this..
>>> thanks in advance, regards,
>>> pallab
>>>
>>>
>>>> (only replying to users list)
>>>>
>>>> Some suggestions:
>>>>
>>>> - MPI seems to start up, but the additional TCP connections required for
>>>> MPI connections seem to be failing / timing out / hitting some other error.
>>>> - Are you running firewalls between your machines? If so, can you
>>>> disable them?
>>>> - I see that you're specifying "--mca btl_tcp_port_min_v4 36900" but
>>>> one of the debug lines reads:
>>>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>>>> 10.11.14.203 on port 9360
>>>> - Try not using the name "localhost", but rather the IP address of the
>>>> local machine
>>>>
>>>>
>>>> On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:
>>>>
>>>>> The following is the ifconfig output for the Mac and the Linux box,
>>>>> respectively:
>>>>>
>>>>> fuji:openmpi-1.3.3 pallabdatta$ ifconfig
>>>>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>>>>> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>>>>> inet 127.0.0.1 netmask 0xff000000
>>>>> inet6 ::1 prefixlen 128
>>>>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>>>>> stf0: flags=0<> mtu 1280
>>>>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>>>> inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
>>>>> inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>>>>> ether 00:1f:5b:3d:ea:ac
>>>>> media: autoselect (100baseTX <full-duplex>) status: active
>>>>> supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>>>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>>>> ether 00:1f:5b:3d:ea:ad
>>>>> media: autoselect status: inactive
>>>>> supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>>>> fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
>>>>> lladdr 00:22:41:ff:fe:ed:7d:a8
>>>>> media: autoselect <full-duplex> status: inactive
>>>>> supported media: autoselect <full-duplex>
>>>>>
>>>>>
>>>>> LINUX:
>>>>> ====
>>>>> pallabdatta_at_apex-backpack:~/backpack/src$ ifconfig
>>>>> lo Link encap:Local Loopback
>>>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>>>> inet6 addr: ::1/128 Scope:Host
>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>>> RX packets:116 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:0
>>>>> RX bytes:11788 (11.7 KB) TX bytes:11788 (11.7 KB)
>>>>>
>>>>> wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
>>>>> inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>>>>> inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:1000
>>>>> RX bytes:5459312 (5.4 MB) TX bytes:7264193 (7.2 MB)
>>>>>
>>>>> wmaster0 Link encap:UNSPEC HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:1000
>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>>>>>
>>>>> The Mac is a Mac Pro with two 2.26GHz Quad-Core Intel Xeon processors and
>>>>> the Linux box is Ubuntu Server Edition 9.04. The Mac has the ethernet
>>>>> interface to connect to the network and the Linux box connects via a
>>>>> wireless adapter (IOGEAR).
>>>>>
>>>>> Please help me fix this issue in any way you can. It really needs to work
>>>>> for our project.
>>>>> thanks in advance,
>>>>> regards,
>>>>> pallab
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> My other concern was the following, but I am not sure it applies here.
>>>>>> If you have multiple interfaces on the node, and they are on the same
>>>>>> subnet, then you cannot actually select what IP address to go out of.
>>>>>> You can only select the IP address you want to connect to. In these
>>>>>> cases, I have seen a hang because we think we are selecting an IP
>>>>>> address to go out of, but it actually goes out the other one.
>>>>>> Perhaps you can send the User's list the output from "ifconfig" on each
>>>>>> of the machines, which would show all the interfaces. You need to get the
>>>>>> right arguments for ifconfig depending on the OS you are running on.
>>>>>>
>>>>>> One thought is to make sure the ethernet interface is marked down on both
>>>>>> boxes if that is possible.
>>>>>>
>>>>>> Pallab Datta wrote:
>>>>>>> Any suggestions on how to debug this further?
>>>>>>> Do you think I need to enable any other option besides heterogeneous at
>>>>>>> the configure prompt?
>>>>>>>
>>>>>>>
>>>>>>>> The --enable-heterogeneous should do the trick. And to answer the
>>>>>>>> previous question, yes, put both of the interfaces in the include
>>>>>>>> list.
>>>>>>>>
>>>>>>>> --mca btl_tcp_if_include en0,wlan0
>>>>>>>>
>>>>>>>> If that does not work, then I may have one other thought why it
>>>>>>>> might
>>>>>>>> not work although perhaps not a solution.
>>>>>>>>
>>>>>>>> Rolf
>>>>>>>>
>>>>>>>> Pallab Datta wrote:
>>>>>>>>
>>>>>>>>> Hi Rolf,
>>>>>>>>>
>>>>>>>>> Do I need to configure openmpi with some specific options apart from
>>>>>>>>> --enable-heterogeneous?
>>>>>>>>> I am currently using
>>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static --enable-shared --enable-debug
>>>>>>>>>
>>>>>>>>> on both ends. Is the above correct? Please let me know.
>>>>>>>>> thanks and regards,
>>>>>>>>> pallab
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi:
>>>>>>>>>> I assume if you wait several minutes then your program will actually
>>>>>>>>>> time out, yes? I guess I have two suggestions. First, can you run a
>>>>>>>>>> non-MPI job using the wireless? Something like hostname? Secondly, you
>>>>>>>>>> may want to specify the specific interfaces you want it to use on the
>>>>>>>>>> two machines. You can do that via the "--mca btl_tcp_if_include"
>>>>>>>>>> run-time parameter. Just list the ones that you expect it to use.
>>>>>>>>>>
>>>>>>>>>> Also, this is not right: "--mca OMPI_mca_mpi_preconnect_all 1". It
>>>>>>>>>> should be "--mca mpi_preconnect_mpi 1" if you want to do the connection
>>>>>>>>>> during MPI_Init.
>>>>>>>>>>
>>>>>>>>>> Rolf
>>>>>>>>>>
>>>>>>>>>> Pallab Datta wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The following is the error dump
>>>>>>>>>>>
>>>>>>>>>>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: Looking for btl components
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: opening btl components
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component self
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self has no register function
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self open function successful
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component tcp
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp has no register function
>>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp open function successful
>>>>>>>>>>> [fuji.local:01316] select: initializing btl component self
>>>>>>>>>>> [fuji.local:01316] select: init of component self returned success
>>>>>>>>>>> [fuji.local:01316] select: initializing btl component tcp
>>>>>>>>>>> [fuji.local:01316] select: init of component tcp returned success
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: Looking for btl components
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: opening btl components
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component self
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self has no register function
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self open function successful
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component tcp
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp has no register function
>>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp open function successful
>>>>>>>>>>> [apex-backpack:04753] select: initializing btl component self
>>>>>>>>>>> [apex-backpack:04753] select: init of component self returned success
>>>>>>>>>>> [apex-backpack:04753] select: initializing btl component tcp
>>>>>>>>>>> [apex-backpack:04753] select: init of component tcp returned success
>>>>>>>>>>> Process 0 on fuji.local out of 2
>>>>>>>>>>> Process 1 on apex-backpack out of 2
>>>>>>>>>>> [apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu
>>>>>>>>>>>> Server 9.04 and a Macintosh. I have configured openmpi with the
>>>>>>>>>>>> following options:
>>>>>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared --enable-static
>>>>>>>>>>>>
>>>>>>>>>>>> When both the machines are connected to the network via ethernet
>>>>>>>>>>>> cables, openmpi works fine.
>>>>>>>>>>>>
>>>>>>>>>>>> But when I switch the Linux box to a wireless adapter, I can reach
>>>>>>>>>>>> (ping) the Macintosh, but openmpi hangs on a hello world program.
>>>>>>>>>>>>
>>>>>>>>>>>> I ran:
>>>>>>>>>>>>
>>>>>>>>>>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/back
>>>>>>>>>>>>
>>>>>>>>>>>> It hangs on a send/receive call between the two ends. All my firewalls
>>>>>>>>>>>> are turned off at the Macintosh end. Please help ASAP.
>>>>>>>>>>>> regards,
>>>>>>>>>>>> pallab
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> users mailing list
>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> =========================
>>>>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>>>>> 781-442-3043
>>>>>>>>>> =========================
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> =========================
>>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>>> 781-442-3043
>>>>>>>> =========================
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> =========================
>>>>>> rolf.vandevaart_at_[hidden]
>>>>>> 781-442-3043
>>>>>> =========================
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>>
>>>>
>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>>
>>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
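
For reference, the interface mismatch discussed in this thread can be checked with a short calculation. This is a sketch only, using Python's standard `ipaddress` module and the addresses and masks from the ifconfig output quoted above; the helper name `subnet_info` is hypothetical. The two netmasks are the same value in different notations (0xfffff000 is 255.255.240.0), so both hosts sit on the same /20 network, and the broadcast address that follows from that mask is 10.11.15.255, the value en0 reported, not the 10.11.14.255 shown on wlan0. Whether the broadcast setting has anything to do with the hang is not established in the thread.

```python
import ipaddress


def subnet_info(ip, netmask):
    """Return (network, broadcast) for an IPv4 address and its netmask.

    The netmask may be dotted-quad (Linux ifconfig style) or a hex
    string such as '0xfffff000' (Mac OS X ifconfig style).
    """
    if netmask.lower().startswith("0x"):
        netmask = str(ipaddress.IPv4Address(int(netmask, 16)))
    iface = ipaddress.IPv4Interface(f"{ip}/{netmask}")
    return iface.network, iface.network.broadcast_address


# Values taken from the ifconfig output quoted in this thread.
mac_net, mac_bcast = subnet_info("10.11.14.203", "0xfffff000")
linux_net, linux_bcast = subnet_info("10.11.14.205", "255.255.240.0")

if __name__ == "__main__":
    print("Mac   en0  :", mac_net, "broadcast", mac_bcast)
    print("Linux wlan0:", linux_net, "broadcast", linux_bcast)
    print("same subnet:", mac_net == linux_net)
```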