Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-02-02 00:14:06


Alex,

You should try to limit the Ethernet devices that Open MPI uses during
the run. Please add "--mca btl_tcp_if_exclude eth1,ib0,ib1" to
your mpirun command line and give it a try.
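
For example, combined with the command line from your earlier message it
would look something like this (adjust the interface list to match what
each node actually has):

   mpirun -hostfile ~/testdir/hosts --mca btl tcp,self \
       --mca btl_tcp_if_exclude eth1,ib0,ib1 ~/testdir/hello

If that fixes it, it should also be possible to make the setting
persistent by putting a line like "btl_tcp_if_exclude = eth1,ib0,ib1"
into $HOME/.openmpi/mca-params.conf.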

   Thanks,
     george.

On Feb 1, 2007, at 10:29 PM, Alex Tumanov wrote:

> On 2/1/07, Galen Shipman <gshipman_at_[hidden]> wrote:
>> What does ifconfig report on both nodes?
>
> Hi Galen,
>
> On headnode:
> # ifconfig
> eth0 Link encap:Ethernet HWaddr 00:11:43:EF:5D:6C
> inet addr:10.1.1.11 Bcast:10.1.1.255 Mask:255.255.255.0
> inet6 addr: fe80::211:43ff:feef:5d6c/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:279965 errors:0 dropped:0 overruns:0 frame:0
> TX packets:785652 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:28422663 (27.1 MiB) TX bytes:999981228 (953.6 MiB)
> Base address:0xecc0 Memory:dfae0000-dfb00000
>
> eth1 Link encap:Ethernet HWaddr 00:11:43:EF:5D:6D
> inet addr:<public IP> Bcast:172.25.238.255 Mask:255.255.255.0
> inet6 addr: fe80::211:43ff:feef:5d6d/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:1763252 errors:0 dropped:0 overruns:0 frame:0
> TX packets:133260 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:1726135418 (1.6 GiB) TX bytes:40990369 (39.0 MiB)
> Base address:0xdcc0 Memory:df8e0000-df900000
>
> ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> inet addr:20.1.0.11 Bcast:20.1.0.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> RX packets:9746 errors:0 dropped:0 overruns:0 frame:0
> TX packets:9746 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:128
> RX bytes:576988 (563.4 KiB) TX bytes:462432 (451.5 KiB)
>
> ib1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> inet addr:30.5.0.11 Bcast:30.5.0.255 Mask:255.255.255.0
> UP BROADCAST MULTICAST MTU:2044 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:128
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>
> on COMPUTE node:
>
> # ifconfig
> eth0 Link encap:Ethernet HWaddr 00:11:43:D1:C0:80
> inet addr:10.1.1.254 Bcast:10.1.1.255 Mask:255.255.255.0
> inet6 addr: fe80::211:43ff:fed1:c080/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:145725 errors:0 dropped:0 overruns:0 frame:0
> TX packets:85136 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:46506800 (44.3 MiB) TX bytes:14722190 (14.0 MiB)
> Base address:0xbcc0 Memory:df7e0000-df800000
>
> ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> inet addr:20.1.0.254 Bcast:20.1.0.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> RX packets:9773 errors:0 dropped:0 overruns:0 frame:0
> TX packets:9773 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:128
> RX bytes:424624 (414.6 KiB) TX bytes:617676 (603.1 KiB)
>
> ib1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> inet addr:30.5.0.254 Bcast:30.5.0.255 Mask:255.255.255.0
> UP BROADCAST MULTICAST MTU:2044 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:128
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>
>
> Additionally, I've discovered that this problem is specific to either
> Dell hardware or Gig-E, because I cannot reproduce it in my VMware
> cluster. Output of lspci for ethernet devices:
> [headnode]# lspci |grep -i "ether"; ssh -x compute-0-0 'lspci |grep -i ether'
> 06:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit
> Ethernet Controller (rev 05)
> 07:08.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit
> Ethernet Controller (rev 05)
> 07:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit
> Ethernet Controller (rev 05)
>
> i.e. the headnode has two Gig-E interfaces and the compute node has
> one, and they are all the same model.
>
> Thanks,
> Alex.
>
> On 2/1/07, Galen Shipman <gshipman_at_[hidden]> wrote:
>> What does ifconfig report on both nodes?
>>
>> - Galen
>>
>> On Feb 1, 2007, at 2:50 PM, Alex Tumanov wrote:
>>
>>> Hi,
>>>
>>> I have continued my own investigation and recompiled Open MPI with
>>> only the bare-bones functionality, i.e. no support for any
>>> interconnects other than Ethernet:
>>> # rpmbuild --rebuild --define="configure_options
>>> --prefix=/opt/openmpi/1.1.4" --define="install_in_opt 1"
>>> --define="mflags all" openmpi-1.1.4-1.src.rpm
>>>
>>> The error detailed in my previous message persisted, which eliminates
>>> the possibility of interconnect support interfering with Ethernet
>>> support. Here's an excerpt from ompi_info:
>>> # ompi_info
>>> Open MPI: 1.1.4
>>> Open MPI SVN revision: r13362
>>> Open RTE: 1.1.4
>>> Open RTE SVN revision: r13362
>>> OPAL: 1.1.4
>>> OPAL SVN revision: r13362
>>> Prefix: /opt/openmpi/1.1.4
>>> Configured architecture: x86_64-redhat-linux-gnu
>>> . . .
>>> Thread support: posix (mpi: no, progress: no)
>>> Internal debug support: no
>>> MPI parameter check: runtime
>>> . . .
>>> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.4)
>>> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.4)
>>> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>>>
>>> Again, to replicate the error, I ran
>>> # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
>>> In this case, you can even omit the runtime mca param
>>> specifications:
>>> # mpirun -hostfile ~/testdir/hosts ~/testdir/hello
>>>
>>> Thanks for reading this. I hope I've provided enough information.
>>>
>>> Sincerely,
>>> Alex.
>>>
>>> On 2/1/07, Alex Tumanov <atumanov_at_[hidden]> wrote:
>>>> Hello,
>>>>
>>>> I have tried a very basic test on a 2-node "cluster" consisting of
>>>> two Dell boxes. One of them is a dual-CPU Intel(R) Xeon(TM) 2.80GHz
>>>> machine with 1GB of RAM, and the slave node is a quad-CPU Intel(R)
>>>> Xeon(TM) 3.40GHz machine with 2GB of RAM. Both have InfiniBand cards
>>>> and Gig-E. The slave node is connected directly to the headnode.
>>>>
>>>> Open MPI version 1.1.4 was compiled with support for the following
>>>> BTLs: openib, mx, gm, and mvapi. I got it to work over openib but,
>>>> ironically, the same trivial hello world job fails over tcp (please
>>>> see the log below). I found that the same problem was already
>>>> discussed on this list here:
>>>> http://www.open-mpi.org/community/lists/users/2006/06/1347.php
>>>> The discussion mentioned that there could be something wrong with the
>>>> TCP setup of the nodes. Unfortunately, it was taken offline. Could
>>>> someone help me with this?
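>>>>
>>>> For reference, the hello program is just the textbook MPI hello
>>>> world, roughly along these lines (a sketch, not the exact source):
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     int rank, size, len;
>>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     MPI_Get_processor_name(name, &len);
>>>>     printf("Hello from Alex' MPI test program\n");
>>>>     printf("Process %d on %s out of %d\n", rank, name, size);
>>>>     /* the backtrace in the log below points into MPI_Finalize */
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }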
>>>>
>>>> Thanks,
>>>> Alex.
>>>>
>>>> # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
>>>> Hello from Alex' MPI test program
>>>> Process 0 on headnode out of 2
>>>> Hello from Alex' MPI test program
>>>> Process 1 on compute-0-0.local out of 2
>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>> Failing at addr:0xdebdf8
>>>> [0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
>>>> [1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
>>>> [2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
>>>> [3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a)
>>>> [0x2a95880d7a]
>>>> [4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf)
>>>> [0x2a9588303f]
>>>> [5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
>>>> [6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so
>>>> (mca_btl_tcp_component_close+0x34f)
>>>> [0x2a988ee8ef]
>>>> [7] func:/opt/openmpi/1.1.4/lib/libopal.so.0
>>>> (mca_base_components_close+0xde)
>>>> [0x2a95872e1e]
>>>> [8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close
>>>> +0xe9)
>>>> [0x2a955e5159]
>>>> [9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9)
>>>> [0x2a955e5029]
>>>> [10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so
>>>> (mca_pml_ob1_component_close+0x25)
>>>> [0x2a97f4dc55]
>>>> [11] func:/opt/openmpi/1.1.4/lib/libopal.so.0
>>>> (mca_base_components_close+0xde)
>>>> [0x2a95872e1e]
>>>> [12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close
>>>> +0x69)
>>>> [0x2a955ea3e9]
>>>> [13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize
>>>> +0xfe)
>>>> [0x2a955ab57e]
>>>> [14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
>>>> [15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb)
>>>> [0x3d1951c3fb]
>>>> [16] func:/root/testdir/hello [0x4008ca]
>>>> *** End of error message ***
>>>> mpirun noticed that job rank 0 with PID 15573 on node "dr11.local"
>>>> exited on signal 11.
>>>> 2 additional processes aborted (not shown)
>>>>