Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-09-24 10:32:04


(putting this back on the list where others can reply as well, and if
we solve it, the solution will be google-ized)

According to your debug output:

>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>> 10.11.14.203 on port 9360

It *is* trying to connect to the right IP address. Are you able to
ping to .203 from apex-backpack?
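
Ping only exercises ICMP, so a TCP-level check can be a useful second
test. The following is a minimal sketch, assuming Python 3 is available
on apex-backpack; the helper name can_connect is hypothetical and not
part of Open MPI. It simply reports whether a TCP connect() to a given
host and port completes within a timeout.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) completes in time."""
    try:
        # create_connection resolves the host and performs the connect()
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        return False
```

For example, can_connect("10.11.14.203", 9360) could be tried from
apex-backpack while the MPI job is still hanging. The port is only open
while a process is listening on it, so a False result after the job has
exited says nothing about the network.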

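I also notice that your Ethernet configuration does not exactly match
between Linux and OS X. The netmasks agree (0xfffff000 is
255.255.240.0), but the broadcast addresses differ (10.11.15.255 on
the Mac vs. 10.11.14.255 on the Linux box):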

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
         inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255

wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
           inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
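
To make the comparison concrete, the broadcast address implied by each
address/netmask pair can be computed with Python's standard ipaddress
module. This is only a sketch; the helper name broadcast_for is
hypothetical, and the inputs are copied from the two ifconfig outputs
above.

```python
import ipaddress


def broadcast_for(addr: str, netmask: str) -> str:
    """Broadcast address implied by an IPv4 address and its netmask."""
    # OS X ifconfig prints the netmask in hex (e.g. 0xfffff000)
    if netmask.lower().startswith("0x"):
        netmask = str(ipaddress.IPv4Address(int(netmask, 16)))
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    return str(iface.network.broadcast_address)


print(broadcast_for("10.11.14.203", "0xfffff000"))     # Mac, en0
print(broadcast_for("10.11.14.205", "255.255.240.0"))  # Linux, wlan0
```

Both calls yield 10.11.15.255, so the Bcast value of 10.11.14.255
configured on wlan0 is not the one its own netmask implies. Whether that
mismatch has anything to do with the hang is not established here.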

On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:

> There is no firewall running between the machines. I tried using the
> IP address instead of localhost, but it gave me the same output. MPI
> is not even timing out; it just keeps hanging forever. :(
>
> I have disabled the Ethernet interface on the Linux box, keeping only
> the wireless up. On the Mac I only have the Ethernet turned on. My Mac
> is an 8-core Mac Pro.
>
> Please help me debug this.
> Thanks in advance, regards,
> pallab
>
>
>> (only replying to users list)
>>
>> Some suggestions:
>>
>> - MPI seems to start up, but the additional TCP connections required
>> for MPI connections seem to be failing / timing out / hitting some
>> other error.
>> - Are you running firewalls between your machines? If so, can you
>> disable them?
>> - I see that you're specifying "--mca btl_tcp_port_min_v4 36900" but
>> one of the debug lines reads:
>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>> 10.11.14.203 on port 9360
>> - Try not using the name "localhost", but rather the IP address of
>> the local machine
>>
>>
>> On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:
>>
>>> The following are the ifconfig for both the Mac and the Linux
>>> respectively:
>>>
>>> fuji:openmpi-1.3.3 pallabdatta$ ifconfig
>>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>>> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>>> inet 127.0.0.1 netmask 0xff000000
>>> inet6 ::1 prefixlen 128
>>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>>> stf0: flags=0<> mtu 1280
>>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>> inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
>>> inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>>> ether 00:1f:5b:3d:ea:ac
>>> media: autoselect (100baseTX <full-duplex>) status: active
>>> supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP
>>> <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP
>>> <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX
>>> <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX
>>> <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT
>>> <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>> ether 00:1f:5b:3d:ea:ad
>>> media: autoselect status: inactive
>>> supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP
>>> <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP
>>> <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX
>>> <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX
>>> <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT
>>> <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>> fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
>>> lladdr 00:22:41:ff:fe:ed:7d:a8
>>> media: autoselect <full-duplex> status: inactive
>>> supported media: autoselect <full-duplex>
>>>
>>>
>>> LINUX:
>>> ====
>>> pallabdatta_at_apex-backpack:~/backpack/src$ ifconfig
>>> lo Link encap:Local Loopback
>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>> inet6 addr: ::1/128 Scope:Host
>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>> RX packets:116 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:0
>>> RX bytes:11788 (11.7 KB) TX bytes:11788 (11.7 KB)
>>>
>>> wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
>>> inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>>> inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:5459312 (5.4 MB) TX bytes:7264193 (7.2 MB)
>>>
>>> wmaster0 Link encap:UNSPEC HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>>>
>>> The Mac is a Mac Pro with two 2.26GHz Quad-Core Intel Xeons, and the
>>> Linux box runs Ubuntu Server Edition 9.04. The Mac uses its Ethernet
>>> interface to connect to the network, and the Linux box connects via a
>>> wireless adapter (IOGEAR).
>>>
>>> Please help me fix this issue in any way you can. It really needs to
>>> work for our project.
>>> thanks in advance,
>>> regards,
>>> pallab
>>>
>>>
>>>
>>>
>>>
>>>> My other concern was the following, but I am not sure it applies
>>>> here. If you have multiple interfaces on the node, and they are on
>>>> the same subnet, then you cannot actually select which IP address to
>>>> go out of. You can only select the IP address you want to connect to.
>>>> In these cases, I have seen a hang because we think we are selecting
>>>> an IP address to go out of, but it actually goes out the other one.
>>>> Perhaps you can send the users list the output from "ifconfig" on
>>>> each of the machines, which would show all the interfaces. You need
>>>> to get the right arguments for ifconfig depending on the OS you are
>>>> running on.
>>>>
>>>> One thought is to make sure the Ethernet interface is marked down on
>>>> both boxes, if that is possible.
>>>>
>>>> Pallab Datta wrote:
>>>>> Any suggestions on how to debug this further?
>>>>> Do you think I need to enable any other option besides
>>>>> heterogeneous at the configure prompt?
>>>>>
>>>>>
>>>>>> The --enable-heterogeneous option should do the trick. And to
>>>>>> answer the previous question, yes, put both of the interfaces in
>>>>>> the include list.
>>>>>>
>>>>>> --mca btl_tcp_if_include en0,wlan0
>>>>>>
>>>>>> If that does not work, then I have one other thought about why it
>>>>>> might be failing, although perhaps not a solution.
>>>>>>
>>>>>> Rolf
>>>>>>
>>>>>> Pallab Datta wrote:
>>>>>>
>>>>>>> Hi Rolf,
>>>>>>>
>>>>>>> Do I need to configure Open MPI with some specific options apart
>>>>>>> from --enable-heterogeneous?
>>>>>>> I am currently using
>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>>>>>> --disable-static --enable-shared --enable-debug
>>>>>>>
>>>>>>> on both ends. Is the above correct? Please let me know.
>>>>>>> thanks and regards,
>>>>>>> pallab
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi:
>>>>>>>> I assume that if you wait several minutes then your program will
>>>>>>>> actually time out, yes? I guess I have two suggestions. First,
>>>>>>>> can you run a non-MPI job using the wireless? Something like
>>>>>>>> hostname? Secondly, you may want to specify the specific
>>>>>>>> interfaces you want it to use on the two machines. You can do
>>>>>>>> that via the "--mca btl_tcp_if_include" run-time parameter. Just
>>>>>>>> list the ones that you expect it to use.
>>>>>>>>
>>>>>>>> Also, this is not right: "--mca OMPI_mca_mpi_preconnect_all 1".
>>>>>>>> It should be "--mca mpi_preconnect_mpi 1" if you want to do the
>>>>>>>> connection during MPI_Init.
>>>>>>>>
>>>>>>>> Rolf
>>>>>>>>
>>>>>>>> Pallab Datta wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> The following is the error dump
>>>>>>>>>
>>>>>>>>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello
>>>>>>>>> [fuji.local:01316] mca: base: components_open: Looking for btl components
>>>>>>>>> [fuji.local:01316] mca: base: components_open: opening btl components
>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component self
>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self has no register function
>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self open function successful
>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component tcp
>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp has no register function
>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp open function successful
>>>>>>>>> [fuji.local:01316] select: initializing btl component self
>>>>>>>>> [fuji.local:01316] select: init of component self returned success
>>>>>>>>> [fuji.local:01316] select: initializing btl component tcp
>>>>>>>>> [fuji.local:01316] select: init of component tcp returned success
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: Looking for btl components
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: opening btl components
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component self
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self has no register function
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self open function successful
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component tcp
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp has no register function
>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp open function successful
>>>>>>>>> [apex-backpack:04753] select: initializing btl component self
>>>>>>>>> [apex-backpack:04753] select: init of component self returned success
>>>>>>>>> [apex-backpack:04753] select: initializing btl component tcp
>>>>>>>>> [apex-backpack:04753] select: init of component tcp returned success
>>>>>>>>> Process 0 on fuji.local out of 2
>>>>>>>>> Process 1 on apex-backpack out of 2
>>>>>>>>> [apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I am trying to run Open MPI 1.3.3 between a Linux box running
>>>>>>>>>> Ubuntu Server 9.04 and a Macintosh. I have configured Open MPI
>>>>>>>>>> with the following options:
>>>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>>>>>>>>> --disable-shared --enable-static
>>>>>>>>>>
>>>>>>>>>> When both machines are connected to the network via Ethernet
>>>>>>>>>> cables, Open MPI works fine.
>>>>>>>>>>
>>>>>>>>>> But when I switch the Linux box to a wireless adapter, I can
>>>>>>>>>> reach (ping) the Macintosh, but Open MPI hangs on a hello world
>>>>>>>>>> program.
>>>>>>>>>>
>>>>>>>>>> I ran:
>>>>>>>>>>
>>>>>>>>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
>>>>>>>>>> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
>>>>>>>>>> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H
>>>>>>>>>> localhost,10.11.14.205 /tmp/back
>>>>>>>>>>
>>>>>>>>>> It hangs on a send/receive call between the two ends. All my
>>>>>>>>>> firewalls are turned off at the Macintosh end. PLEASE HELP ASAP.
>>>>>>>>>> regards,
>>>>>>>>>> pallab
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> =========================
>>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>>> 781-442-3043
>>>>>>>> =========================
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>>
>>>>>> =========================
>>>>>> rolf.vandevaart_at_[hidden]
>>>>>> 781-442-3043
>>>>>> =========================
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> =========================
>>>> rolf.vandevaart_at_[hidden]
>>>> 781-442-3043
>>>> =========================
>>>>
>>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>>
>>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]