
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] example program "ring" hangs when running across multiple hardware nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-07-04 19:58:11


You also might want to check that you don't have any firewalls between those nodes. This is a typical cause of what you describe.
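A quick way to check for a firewall from one node is to probe a TCP port on another node and distinguish a refused connection (packets get through, just no listener) from a silent timeout (packets likely being dropped). A minimal sketch, assuming Python is available on the nodes; the 10.0.0.31 address and the port numbers are just examples taken from this thread's hostfile:

```python
import socket

def probe(host, port, timeout=2.0):
    """Classify a TCP probe of host:port.

    'open'     -> something is listening
    'refused'  -> host reachable, port closed (no firewall drop)
    'filtered' -> timed out; a firewall may be silently dropping packets
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    finally:
        s.close()

if __name__ == "__main__":
    # Example: probe a few arbitrary high ports on one compute node.
    for port in (1024, 4096, 16384):
        print(port, probe("10.0.0.31", port))
```

If probes come back "filtered" rather than "refused", a firewall between the nodes is the likely culprit.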

On Jul 4, 2013, at 4:25 PM, Gustavo Correa <gus_at_[hidden]> wrote:

> Hi Jed
>
> You could try selecting only the ethernet interface that matches your node's
> IP addresses, which seems to be en2.
>
> The en0 interface seems to carry the external IP.
> Not sure about en3, but it is awkward that it has a
> different IP than en2 while sitting on the same subnet.
> I wonder if this may be the reason the program hangs.
>
> You may need to check the ifconfig output on all nodes for a consistent set of
> interfaces/IP addresses, and tailor your mpiexec command line and your hostfile
> accordingly.
>
> Say, something like this:
>
> mpiexec -mca btl_tcp_if_include en2 -hostfile your_hostfile -np 43 ./ring_c
>
> See this FAQ (actually, all of them are very informative):
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
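To make Gus's check concrete: whether an interface address belongs to the hostfile's subnet can be tested directly. A small sketch using the addresses from the ifconfig output quoted below in this thread; the 10.0.0.0/8 network follows from the 0xff000000 netmask shown there:

```python
import ipaddress

# Interface addresses as reported by ifconfig on grkapsrv2; the
# hostfile addresses (10.0.0.21, 10.0.0.31, ...) live in 10.0.0.0/8.
interfaces = {
    "en0": "128.178.107.85",
    "en1": "10.0.0.2",
    "en2": "10.0.0.21",
    "en3": "10.0.0.22",
}
cluster = ipaddress.ip_network("10.0.0.0/8")

matching = [name for name, addr in interfaces.items()
            if ipaddress.ip_address(addr) in cluster]
print(matching)  # ['en1', 'en2', 'en3'] -- three interfaces on one subnet
```

Three interfaces on the same subnet is exactly the situation where restricting Open MPI with btl_tcp_if_include (or btl_tcp_if_exclude) tends to help.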
>
> I hope this helps,
> Gus Correa
>
>
>
> On Jul 4, 2013, at 6:37 PM, Jed O. Kaplan wrote:
>
>> Dear openmpi gurus,
>>
>> I am running openmpi 1.7.2 on a homogeneous cluster of Apple Xserves
>> running OS X 10.6.8. My hardware nodes are connected through four
>> gigabit ethernet connections; I have no infiniband or other high-speed
>> interconnect. The problem I describe below is the same if I use openmpi
>> 1.6.5. My openmpi installation is compiled with Intel icc and ifort. See
>> the attached result of ompi_info --all for more details on my
>> installation and runtime parameters, and other diagnostic information
>> below.
>>
>> I noticed that communication across hardware nodes hangs in
>> one of my own programs. I thought this was the fault of my own bad
>> programming, so I tried some of the example programs
>> distributed with the openmpi source code. The "ring_*" examples, in
>> whichever of the APIs (C, C++, Fortran, etc.), show the same faulty
>> behavior that I noticed in my own program: if I run the program on a
>> single hardware node (with multiple processes) it works fine. As soon as
>> I run it across hardware nodes, it hangs. Below you will find
>> an example of the program output and other diagnostic information.
>>
>> This problem has really frustrated me. Unfortunately I am not
>> experienced enough with openmpi to debug this further myself.
>>
>> Thank you in advance for any help you can give me!
>>
>> Jed Kaplan
>>
>> --- DETAILS OF MY PROBLEM ---
>>
>> -- this run works because it is only on one hardware node --
>>
>> jkaplan_at_grkapsrv2:~/openmpi_examples > mpirun --prefix /usr/local
>> --hostfile arvehosts.txt -np 3 ring_c
>> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> Process 1 exiting
>> Process 2 exiting
>>
>> -- this run hangs when running over two hardware nodes --
>>
>> jkaplan_at_grkapsrv2:~/openmpi_examples > mpirun --prefix /usr/local
>> --hostfile arvehosts.txt -np 4 ring_c
>> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> ... hangs forever ...
>> ^CKilled by signal 2.
>>
>> -- here is what my hostfile looks like --
>>
>> jkaplan_at_grkapsrv2:~/openmpi_examples > cat arvehosts.txt
>> #host file for ARVE group mac servers
>>
>> 10.0.0.21 slots=3
>> 10.0.0.31 slots=8
>> 10.0.0.41 slots=8
>> 10.0.0.51 slots=8
>> 10.0.0.61 slots=8
>> 10.0.0.71 slots=8
>>
>> -- results of ifconfig - this looks pretty much the same on all of my
>> servers, with different ip addresses of course --
>>
>> jkaplan_at_grkapsrv2:~/openmpi_examples > ifconfig
>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>> inet6 ::1 prefixlen 128
>> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>> inet 127.0.0.1 netmask 0xff000000
>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>> stf0: flags=0<> mtu 1280
>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>> ether 00:24:36:f3:dc:fc
>> inet6 fe80::224:36ff:fef3:dcfc%en0 prefixlen 64 scopeid 0x4
>> inet 128.178.107.85 netmask 0xffffff00 broadcast 128.178.107.255
>> media: autoselect (1000baseT <full-duplex>)
>> status: active
>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>> ether 00:24:36:f3:dc:fa
>> inet6 fe80::224:36ff:fef3:dcfa%en1 prefixlen 64 scopeid 0x5
>> inet 10.0.0.2 netmask 0xff000000 broadcast 10.255.255.255
>> media: autoselect (1000baseT <full-duplex,flow-control>)
>> status: active
>> en2: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>> ether 00:24:36:f5:ba:4e
>> inet6 fe80::224:36ff:fef5:ba4e%en2 prefixlen 64 scopeid 0x6
>> inet 10.0.0.21 netmask 0xff000000 broadcast 10.255.255.255
>> media: autoselect (1000baseT <full-duplex,flow-control>)
>> status: active
>> en3: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>> ether 00:24:36:f5:ba:4f
>> inet6 fe80::224:36ff:fef5:ba4f%en3 prefixlen 64 scopeid 0x7
>> inet 10.0.0.22 netmask 0xff000000 broadcast 10.255.255.255
>> media: autoselect (1000baseT <full-duplex,flow-control>)
>> status: active
>> fw0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 4078
>> lladdr 04:1e:64:ff:fe:f8:aa:d2
>> media: autoselect <full-duplex>
>> status: inactive
>> <ompi_info.txt>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users