Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
From: Pallab Datta (datta_at_[hidden])
Date: 2009-09-24 12:39:29


Hi All,

Yes, I can ping and ssh from apex-backpack to my Mac (fuji.local).
I set the broadcast address to the same value on both ends
(10.11.14.255), but the problem still persists.

I have tried other wireless adapters as well, but no luck so far.
Please let me know what can be done.
regards, pallab

> (putting this back on the list where others can reply as well, and if
> we solve it, the solution will be google-ized)
>
> According to your debug output:
>
>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>> 10.11.14.203 on port 9360
>
> It *is* trying to connect to the right IP address. Are you able to
> ping to .203 from apex-backpack?
>
> I also notice that your ethernet configuration does not exactly match
> between linux and osx:
>
> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>
> wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
> inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>
>
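The broadcast mismatch pointed out above can be checked by deriving the broadcast address from each address and netmask. The following is a minimal Python sketch using only the standard `ipaddress` module; the helper name `expected_broadcast` is invented for illustration, and the values are copied from the ifconfig output in this thread (0xfffff000 and 255.255.240.0 are the same /20 mask). Whether the mismatch has anything to do with the hang is not established here.

```python
import ipaddress

def expected_broadcast(addr: str, netmask: str) -> str:
    """Return the broadcast address implied by an IPv4 address and netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    return str(iface.network.broadcast_address)

# Values copied from the ifconfig output quoted in this thread.
mac_bcast = expected_broadcast("10.11.14.203", "255.255.240.0")    # en0 on the Mac
linux_bcast = expected_broadcast("10.11.14.205", "255.255.240.0")  # wlan0 on Linux

print("Mac en0:    ", mac_bcast)
print("Linux wlan0:", linux_bcast)
```

Both addresses fall in 10.11.0.0/20, so the sketch reports 10.11.15.255 for each. Under that reading, the Mac's original setting matches the netmask and the Linux value of 10.11.14.255 is the one that differs.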
> On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:
>
>> There is no firewall running between the machines. I tried using the IP
>> address instead of localhost, but it gave me the same output. MPI is not
>> even timing out; it just keeps hanging forever. :(
>>
>> I have disabled the ethernet interface on the Linux box, keeping only the
>> wireless up. On the Mac I only have the ethernet turned on. My Mac is an
>> 8-core Mac Pro.
>>
>> Please help me debug this..
>> thanks in advance, regards,
>> pallab
>>
>>
>>> (only replying to users list)
>>>
>>> Some suggestions:
>>>
>>> - MPI seems to start up, but the additional TCP connections required for
>>> MPI connections seem to be failing / timing out / hitting some other error.
>>> - Are you running firewalls between your machines? If so, can you
>>> disable them?
>>> - I see that you're specifying "--mca btl_tcp_port_min_v4 36900", but
>>> one of the debug lines reads:
>>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>>> 10.11.14.203 on port 9360
>>> - Try not using the name "localhost", but rather the IP address of the
>>> local machine.
>>>
>>>
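One possible reading of the port discrepancy noted above, offered as an assumption rather than a confirmed diagnosis: the debug line may be printing the port in network byte order without converting it back. The arithmetic is easy to check. The Python sketch below (the helper name `swap16` is made up for illustration) swaps the two bytes of the requested port and compares the result with the port shown in the debug output.

```python
def swap16(port: int) -> int:
    """Swap the two bytes of a 16-bit port number."""
    return int.from_bytes(port.to_bytes(2, "big"), "little")

requested = 36900  # value passed to btl_tcp_port_min_v4 in this thread
reported = 9360    # port shown in the "attempting to connect()" debug line

print(hex(requested), "->", hex(swap16(requested)))
print("byte-swapped request equals reported port:", swap16(requested) == reported)
```

36900 is 0x9024 and 9360 is 0x2490, so the two values are byte-swapped forms of each other. If the assumption holds, the connection attempt is using the requested port and only the printed number is misleading.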
>>> On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:
>>>
>>>> The following are the ifconfig for both the Mac and the Linux
>>>> respectively:
>>>>
>>>> fuji:openmpi-1.3.3 pallabdatta$ ifconfig
>>>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>>>> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>>>> inet 127.0.0.1 netmask 0xff000000
>>>> inet6 ::1 prefixlen 128
>>>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>>>> stf0: flags=0<> mtu 1280
>>>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>>> inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
>>>> inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>>>> ether 00:1f:5b:3d:ea:ac
>>>> media: autoselect (100baseTX <full-duplex>) status: active
>>>> supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP
>>>> <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP
>>>> <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX
>>>> <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX
>>>> <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT
>>>> <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>>> ether 00:1f:5b:3d:ea:ad
>>>> media: autoselect status: inactive
>>>> supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP
>>>> <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP
>>>> <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX
>>>> <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX
>>>> <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT
>>>> <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>>> fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
>>>> lladdr 00:22:41:ff:fe:ed:7d:a8
>>>> media: autoselect <full-duplex> status: inactive
>>>> supported media: autoselect <full-duplex>
>>>>
>>>>
>>>> LINUX:
>>>> ====
>>>> pallabdatta_at_apex-backpack:~/backpack/src$ ifconfig
>>>> lo Link encap:Local Loopback
>>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>>> inet6 addr: ::1/128 Scope:Host
>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>> RX packets:116 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:0
>>>> RX bytes:11788 (11.7 KB) TX bytes:11788 (11.7 KB)
>>>>
>>>> wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
>>>> inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>>>> inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:5459312 (5.4 MB) TX bytes:7264193 (7.2 MB)
>>>>
>>>> wmaster0 Link encap:UNSPEC HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>>>>
>>>> The Mac is a Mac Pro with two 2.26GHz quad-core Intel Xeon processors, and
>>>> the Linux box runs Ubuntu Server Edition 9.04. The Mac uses its ethernet
>>>> interface to connect to the network, and the Linux box connects via a
>>>> wireless adapter (IOGEAR).
>>>>
>>>> Please help me find a way to fix this issue. It really needs to work for
>>>> our project.
>>>> thanks in advance,
>>>> regards,
>>>> pallab
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> My other concern was the following, but I am not sure it applies here.
>>>>> If you have multiple interfaces on the node, and they are on the same
>>>>> subnet, then you cannot actually select which IP address to go out of.
>>>>> You can only select the IP address you want to connect to. In these
>>>>> cases, I have seen a hang because we think we are selecting an IP
>>>>> address to go out of, but the traffic actually goes out the other one.
>>>>> Perhaps you can send the users list the output from "ifconfig" on each
>>>>> of the machines, which would show all the interfaces. You need to get the
>>>>> right arguments for ifconfig depending on the OS you are running on.
>>>>>
>>>>> One thought is to make sure the ethernet interface is marked down on both
>>>>> boxes, if that is possible.
>>>>>
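The condition described above (two interfaces on one node sharing a subnet) can be checked mechanically from ifconfig output. This is a rough Python sketch, not anything Open MPI itself does: the function name `shared_subnets` is invented, and the `eth0` address in the second case is hypothetical, added only to show what a conflict would look like.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List

def shared_subnets(interfaces: Dict[str, str]) -> Dict[str, List[str]]:
    """Group interfaces by IPv4 network; keep networks with two or more."""
    by_net = defaultdict(list)
    for name, addr_with_mask in interfaces.items():
        net = ipaddress.ip_interface(addr_with_mask).network
        by_net[str(net)].append(name)
    return {net: names for net, names in by_net.items() if len(names) > 1}

# Linux box as posted in this thread: wlan0 is the only non-loopback IPv4 interface.
linux_posted = {"wlan0": "10.11.14.205/255.255.240.0"}

# Hypothetical case: the wired port is also up on the same subnet
# (the eth0 address is invented for illustration).
linux_both_up = {
    "wlan0": "10.11.14.205/255.255.240.0",
    "eth0": "10.11.14.210/255.255.240.0",
}

print(shared_subnets(linux_posted))
print(shared_subnets(linux_both_up))
```

With the interfaces as posted, no subnet is shared on either machine, which suggests this particular concern may not apply once the wired interface on the Linux box is down.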
>>>>> Pallab Datta wrote:
>>>>>> Any suggestions on how to debug this further?
>>>>>> Do you think I need to enable any other option besides heterogeneous at
>>>>>> the configure prompt?
>>>>>>
>>>>>>
>>>>>>> The --enable-heterogeneous option should do the trick. And to answer the
>>>>>>> previous question, yes, put both of the interfaces in the include list.
>>>>>>>
>>>>>>> --mca btl_tcp_if_include en0,wlan0
>>>>>>>
>>>>>>> If that does not work, then I may have one other thought about why it
>>>>>>> might not work, although perhaps not a solution.
>>>>>>>
>>>>>>> Rolf
>>>>>>>
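For reference, the include-list suggestion above can be folded into the full command line. The Python sketch below only assembles and prints the command; it does not run anything. The helper `build_mpirun` is invented for illustration, every flag comes from messages in this thread, and using the Mac's IP address in place of "localhost" is an assumption.

```python
import shlex

def build_mpirun(hosts, program, interfaces, nprocs=2):
    """Assemble an mpirun argument list from options mentioned in this thread."""
    return [
        "mpirun",
        "--mca", "btl", "tcp,self",
        "--mca", "btl_tcp_if_include", ",".join(interfaces),
        "--mca", "btl_base_verbose", "30",
        "-np", str(nprocs),
        "-hetero",
        "-H", ",".join(hosts),
        program,
    ]

# Hosts and interface names as they appear in the thread.
argv = build_mpirun(
    hosts=["10.11.14.203", "10.11.14.205"],
    program="/tmp/hello",
    interfaces=["en0", "wlan0"],
)
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)
```

Building the argument list in one place makes it easier to keep the same options on every attempt while changing one parameter at a time.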
>>>>>>> Pallab Datta wrote:
>>>>>>>
>>>>>>>> Hi Rolf,
>>>>>>>>
>>>>>>>> Do I need to configure Open MPI with some specific options apart from
>>>>>>>> --enable-heterogeneous?
>>>>>>>> I am currently using
>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static --enable-shared --enable-debug
>>>>>>>>
>>>>>>>> on both ends. Is the above correct? Please let me know.
>>>>>>>> thanks and regards,
>>>>>>>> pallab
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi:
>>>>>>>>> I assume that if you wait several minutes then your program will actually
>>>>>>>>> time out, yes? I guess I have two suggestions. First, can you run a
>>>>>>>>> non-MPI job using the wireless? Something like hostname? Secondly, you
>>>>>>>>> may want to specify the specific interfaces you want it to use on the
>>>>>>>>> two machines. You can do that via the "--mca btl_tcp_if_include"
>>>>>>>>> run-time parameter. Just list the ones that you expect it to use.
>>>>>>>>>
>>>>>>>>> Also, this is not right: "--mca OMPI_mca_mpi_preconnect_all 1". It
>>>>>>>>> should be "--mca mpi_preconnect_mpi 1" if you want to do the connection
>>>>>>>>> during MPI_Init.
>>>>>>>>>
>>>>>>>>> Rolf
>>>>>>>>>
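Alongside the non-MPI test suggested above, it can help to check raw TCP reachability, since ping only exercises ICMP. The following is a small Python sketch of a TCP probe with a timeout. It is a different check from launching `hostname` through mpirun, the function name `can_connect` is invented, and the loopback listener exists only so the example runs on its own.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-contained demonstration: listen on an ephemeral loopback port,
# then probe it. On real machines, replace the host with the peer's IP
# address and use a port where something is known to be listening.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

print("loopback probe succeeded:", can_connect("127.0.0.1", demo_port))
listener.close()
```

Run from apex-backpack against 10.11.14.203, a probe like this would show whether TCP connections over the wireless link complete at all, independent of Open MPI. A listener has to be running on the target port for the result to mean anything.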
>>>>>>>>> Pallab Datta wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The following is the error dump
>>>>>>>>>>
>>>>>>>>>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: Looking for btl components
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: opening btl components
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component self
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self has no register function
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self open function successful
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component tcp
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp has no register function
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp open function successful
>>>>>>>>>> [fuji.local:01316] select: initializing btl component self
>>>>>>>>>> [fuji.local:01316] select: init of component self returned success
>>>>>>>>>> [fuji.local:01316] select: initializing btl component tcp
>>>>>>>>>> [fuji.local:01316] select: init of component tcp returned success
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: Looking for btl components
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: opening btl components
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component self
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self has no register function
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self open function successful
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component tcp
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp has no register function
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp open function successful
>>>>>>>>>> [apex-backpack:04753] select: initializing btl component self
>>>>>>>>>> [apex-backpack:04753] select: init of component self returned success
>>>>>>>>>> [apex-backpack:04753] select: initializing btl component tcp
>>>>>>>>>> [apex-backpack:04753] select: init of component tcp returned success
>>>>>>>>>> Process 0 on fuji.local out of 2
>>>>>>>>>> Process 1 on apex-backpack out of 2
>>>>>>>>>> [apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu
>>>>>>>>>>> Server 9.04 and a Macintosh. I have configured Open MPI with the
>>>>>>>>>>> following options:
>>>>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared --enable-static
>>>>>>>>>>>
>>>>>>>>>>> When both machines are connected to the network via ethernet cables,
>>>>>>>>>>> Open MPI works fine.
>>>>>>>>>>>
>>>>>>>>>>> But when I switch the Linux box to a wireless adapter, I can reach (ping)
>>>>>>>>>>> the Macintosh, but Open MPI hangs on a hello world program.
>>>>>>>>>>>
>>>>>>>>>>> I ran:
>>>>>>>>>>>
>>>>>>>>>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/back
>>>>>>>>>>>
>>>>>>>>>>> It hangs on a send/receive call between the two ends. All my firewalls
>>>>>>>>>>> are turned off at the Macintosh end. Please help ASAP.
>>>>>>>>>>> regards,
>>>>>>>>>>> pallab
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> =========================
>>>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>>>> 781-442-3043
>>>>>>>>> =========================
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> =========================
>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>> 781-442-3043
>>>>>>> =========================
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> =========================
>>>>> rolf.vandevaart_at_[hidden]
>>>>> 781-442-3043
>>>>> =========================
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>>
>>>
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
>
>