From: Alex Tumanov (atumanov_at_[hidden])
Date: 2007-02-01 22:29:02


On 2/1/07, Galen Shipman <gshipman_at_[hidden]> wrote:
> What does ifconfig report on both nodes?

Hi Galen,

On headnode:
# ifconfig
eth0 Link encap:Ethernet HWaddr 00:11:43:EF:5D:6C
          inet addr:10.1.1.11 Bcast:10.1.1.255 Mask:255.255.255.0
          inet6 addr: fe80::211:43ff:feef:5d6c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:279965 errors:0 dropped:0 overruns:0 frame:0
          TX packets:785652 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28422663 (27.1 MiB) TX bytes:999981228 (953.6 MiB)
          Base address:0xecc0 Memory:dfae0000-dfb00000

eth1 Link encap:Ethernet HWaddr 00:11:43:EF:5D:6D
          inet addr:<public IP> Bcast:172.25.238.255 Mask:255.255.255.0
          inet6 addr: fe80::211:43ff:feef:5d6d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1763252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:133260 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1726135418 (1.6 GiB) TX bytes:40990369 (39.0 MiB)
          Base address:0xdcc0 Memory:df8e0000-df900000

ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:20.1.0.11 Bcast:20.1.0.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
          RX packets:9746 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9746 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:576988 (563.4 KiB) TX bytes:462432 (451.5 KiB)

ib1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:30.5.0.11 Bcast:30.5.0.255 Mask:255.255.255.0
          UP BROADCAST MULTICAST MTU:2044 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

on COMPUTE node:

# ifconfig
eth0 Link encap:Ethernet HWaddr 00:11:43:D1:C0:80
          inet addr:10.1.1.254 Bcast:10.1.1.255 Mask:255.255.255.0
          inet6 addr: fe80::211:43ff:fed1:c080/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:145725 errors:0 dropped:0 overruns:0 frame:0
          TX packets:85136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:46506800 (44.3 MiB) TX bytes:14722190 (14.0 MiB)
          Base address:0xbcc0 Memory:df7e0000-df800000

ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:20.1.0.254 Bcast:20.1.0.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
          RX packets:9773 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9773 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:424624 (414.6 KiB) TX bytes:617676 (603.1 KiB)

ib1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:30.5.0.254 Bcast:30.5.0.255 Mask:255.255.255.0
          UP BROADCAST MULTICAST MTU:2044 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Additionally, the problem appears to be specific to either the Dell
hardware or the Gig-E NICs, since I cannot reproduce it in my VMware
cluster. Here is the lspci output for the Ethernet devices on both nodes:
[headnode]# lspci |grep -i "ether"; ssh -x compute-0-0 'lspci |grep -i ether'
06:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
07:08.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
07:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)

i.e., the headnode has two Gig-E interfaces and the compute node has one, and all three are the same Intel controller.
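
For what it's worth, the TCP BTL can be told which interfaces to use via
the btl_tcp_if_include / btl_tcp_if_exclude MCA parameters, so one
diagnostic step (just a guess based on the interface layout above, not a
confirmed fix) would be to limit it to the private Ethernet link that both
nodes share:

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self --mca btl_tcp_if_include eth0 ~/testdir/hello

or, equivalently, to exclude the interfaces it should not try to pair
across the nodes (lo has to be listed explicitly once btl_tcp_if_exclude
is overridden):

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,eth1,ib0,ib1 ~/testdir/hello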

Thanks,
Alex.

On 2/1/07, Galen Shipman <gshipman_at_[hidden]> wrote:
> What does ifconfig report on both nodes?
>
> - Galen
>
> On Feb 1, 2007, at 2:50 PM, Alex Tumanov wrote:
>
> > Hi,
> >
> > I have kept doing my own investigation and recompiled OpenMPI to have
> > only the barebones functionality with no support for any interconnects
> > other than ethernet:
> > # rpmbuild --rebuild --define="configure_options
> > --prefix=/opt/openmpi/1.1.4" --define="install_in_opt 1"
> > --define="mflags all" openmpi-1.1.4-1.src.rpm
> >
> > The error detailed in my previous message persisted, which eliminates
> > the possibility of interconnect support interfering with ethernet
> > support. Here's an excerpt from ompi_info:
> > # ompi_info
> > Open MPI: 1.1.4
> > Open MPI SVN revision: r13362
> > Open RTE: 1.1.4
> > Open RTE SVN revision: r13362
> > OPAL: 1.1.4
> > OPAL SVN revision: r13362
> > Prefix: /opt/openmpi/1.1.4
> > Configured architecture: x86_64-redhat-linux-gnu
> > . . .
> > Thread support: posix (mpi: no, progress: no)
> > Internal debug support: no
> > MPI parameter check: runtime
> > . . .
> > MCA btl: self (MCA v1.0, API v1.0, Component v1.1.4)
> > MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.4)
> > MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> >
> > Again, to replicate the error, I ran
> > # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
> > In this case, you can even omit the runtime mca param specifications:
> > # mpirun -hostfile ~/testdir/hosts ~/testdir/hello
> >
> > Thanks for reading this. I hope I've provided enough information.
> >
> > Sincerely,
> > Alex.
> >
> > On 2/1/07, Alex Tumanov <atumanov_at_[hidden]> wrote:
> >> Hello,
> >>
> >> I have tried a very basic test on a 2 node "cluster" consisting of 2
> >> dell boxes. One of them is dual CPU Intel(R) Xeon(TM) CPU 2.80GHz
> >> with
> >> 1GB of RAM and the slave node is quad-CPU Intel(R) Xeon(TM) CPU
> >> 3.40GHz with 2GB of RAM. Both have Infiniband cards and Gig-E. The
> >> slave node is connected directly to the headnode.
> >>
> >> OpenMPI version 1.1.4 was compiled with support for the following
> >> btl's: openib,mx,gm, and mvapi. I got it to work over openib, but,
> >> ironically, the same trivial hello world job fails over tcp (please
> >> see the log below). I found that the same problem was already
> >> discussed on this list here:
> >> http://www.open-mpi.org/community/lists/users/2006/06/1347.php
> >> The discussion mentioned that there could be something wrong with the
> >> TCP setup of the nodes. Unfortunately it was taken offline. Could
> >> someone help me with this?
> >>
> >> Thanks,
> >> Alex.
> >>
> >> # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
> >> Hello from Alex' MPI test program
> >> Process 0 on headnode out of 2
> >> Hello from Alex' MPI test program
> >> Process 1 on compute-0-0.local out of 2
> >> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> >> Failing at addr:0xdebdf8
> >> [0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
> >> [1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
> >> [2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
> >> [3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a)
> >> [0x2a95880d7a]
> >> [4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf)
> >> [0x2a9588303f]
> >> [5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
> >> [6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so
> >> (mca_btl_tcp_component_close+0x34f)
> >> [0x2a988ee8ef]
> >> [7] func:/opt/openmpi/1.1.4/lib/libopal.so.0
> >> (mca_base_components_close+0xde)
> >> [0x2a95872e1e]
> >> [8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close+0xe9)
> >> [0x2a955e5159]
> >> [9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9)
> >> [0x2a955e5029]
> >> [10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so
> >> (mca_pml_ob1_component_close+0x25)
> >> [0x2a97f4dc55]
> >> [11] func:/opt/openmpi/1.1.4/lib/libopal.so.0
> >> (mca_base_components_close+0xde)
> >> [0x2a95872e1e]
> >> [12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close+0x69)
> >> [0x2a955ea3e9]
> >> [13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize+0xfe)
> >> [0x2a955ab57e]
> >> [14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
> >> [15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d1951c3fb]
> >> [16] func:/root/testdir/hello [0x4008ca]
> >> *** End of error message ***
> >> mpirun noticed that job rank 0 with PID 15573 on node "dr11.local"
> >> exited on signal 11.
> >> 2 additional processes aborted (not shown)
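
For completeness, the source of ~/testdir/hello was not posted; judging
from the output above it is presumably a minimal MPI hello world along the
lines of the sketch below (a hypothetical reconstruction, built with
"mpicc -o hello hello.c"). The relevant point is that both ranks print
their messages and the segfault only occurs afterwards, inside
MPI_Finalize(), while mca_btl_tcp_component_close() is tearing down the
TCP BTL.

/* hello.c -- hypothetical reconstruction of ~/testdir/hello */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from Alex' MPI test program\n");
    printf("Process %d on %s out of %d\n", rank, name, size);

    /* the backtrace above shows the crash happening in here:
     * ompi_mpi_finalize -> mca_btl_base_close -> mca_btl_tcp_component_close */
    MPI_Finalize();
    return 0;
}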