Open MPI User's Mailing List Archives

Subject: [OMPI users] Problem with openmpi version 1.3b1 beta1
From: Allan Menezes (amenezes007_at_[hidden])
Date: 2008-10-31 18:54:48


Date: Fri, 31 Oct 2008 09:34:52 -0600
From: Ralph Castain <rhc_at_[hidden]>
Subject: Re: [OMPI users] users Digest, Vol 1052, Issue 1
To: Open MPI Users <users_at_[hidden]>
Message-ID: <0CF28492-B13E-4F82-AC43-C1580F0794D1_at_[hidden]>
Content-Type: text/plain; charset="us-ascii"; Format="flowed";
        DelSp="yes"

It looks like the daemon isn't seeing the other interface address on
host x2. Can you ssh to x2 and send the contents of ifconfig -a?

Ralph

On Oct 31, 2008, at 9:18 AM, Allan Menezes wrote:

>users-request_at_[hidden] wrote:
>
>
>>
>>----------------------------------------------------------------------
>>
>>Message: 1
>>Date: Fri, 31 Oct 2008 02:06:09 -0400
>>From: Allan Menezes <amenezes007_at_[hidden]>
>>Subject: [OMPI users] Openmpi ver1.3beta1
>>To: users_at_[hidden]
>>Message-ID: <BLU0-SMTP224B5E356302AC7AA4481088200_at_phx.gbl>
>>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>>Hi,
>> I built Open MPI version 1.3b1 with the following configure command:
>>./configure --prefix=/opt/openmpi13b1 --enable-mpi-threads
>>--with-threads=posix --disable-ipv6
>>I have six nodes, x1..6.
>>I distributed /opt/openmpi13b1 with scp to all other nodes from the
>>head node.
>>When I run the following command:
>>mpirun --prefix /opt/openmpi13b1 --host x1 hostname
>>it works on x1, printing out the hostname of x1.
>>But when I type
>>mpirun --prefix /opt/openmpi13b1 --host x2 hostname
>>it hangs and does not give me any output.
>>I have a 6-node Intel quad-core cluster with OSCAR and PCI Express
>>gigabit Ethernet for eth0.
>>Can somebody advise?
>>Thank you very much.
>>Allan Menezes
>>
>>
>>------------------------------
>>
>>Message: 2
>>Date: Fri, 31 Oct 2008 02:41:59 -0600
>>From: Ralph Castain <rhc_at_[hidden]>
>>Subject: Re: [OMPI users] Openmpi ver1.3beta1
>>To: Open MPI Users <users_at_[hidden]>
>>Message-ID: <E8AF5AAF-99CB-4EFC-AA97-5385CE333AD2_at_[hidden]>
>>Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>>When you typed the --host x1 command, were you sitting on x1?
>>Likewise, when you typed the --host x2 command, were you not on
>>host x2?
>>
>>If the answer to both questions is "yes", then my guess is that
>>something is preventing you from launching a daemon on host x2. Try
>>adding --leave-session-attached to your cmd line and see if any error
>>messages appear. And check the FAQ for tips on how to setup for ssh
>>launch (I'm assuming that is what you are using).
>>
>>http://www.open-mpi.org/faq/?category=rsh
>>
>>Ralph
>>
>>On Oct 31, 2008, at 12:06 AM, Allan Menezes wrote:
>>
>>
>>
>>
>Hi Ralph,
> Yes, that is true; I tried both commands on x1, and version 1.2.8
>works on the same setup without a problem.
>Here is the output with --leave-session-attached added:
>[allan_at_x1 ~]$ mpiexec --prefix /opt/openmpi13b2 --leave-session-
>attached -host x2 hostname
>[x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0]
>mca_oob_tcp_peer_try_connect: connect to 192.168.0.198:0 failed:
>Network is unreachable (101)
>[x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0]
>mca_oob_tcp_peer_try_connect: connect to 192.168.122.1:0 failed:
>Network is unreachable (101)
>[x2.brampton.net:02236] [[1354,0],1] routed:binomial: Connection to
>lifeline [[1354,0],0] lost
>--------------------------------------------------------------------------
>A daemon (pid 7665) died unexpectedly with status 1 while attempting
>to launch so we are aborting.
>
>There may be more information reported by the environment (see above).
>
>This may be because the daemon was unable to find all the needed
>shared
>libraries on the remote node. You may set your LD_LIBRARY_PATH to
>have the
>location of the shared libraries on the remote nodes and this will
>automatically be forwarded to the remote nodes.
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>mpiexec noticed that the job aborted, but has no info as to the
>process
>that caused that situation.
>--------------------------------------------------------------------------
>mpiexec: clean termination accomplished
>
>[allan_at_x1 ~]$
>However, my main eth0 IP is 192.168.1.1 and my internet gateway is
>192.168.0.1.
>Any solutions?
>Allan Menezes
Hi Ralph,
     It works for Open MPI version 1.2.8; why should it not work for
version 1.3?
Yes, I can ssh to x2 from x1 and to x1 from x2.
Here is the ifconfig -a for x1:
eth0 Link encap:Ethernet HWaddr 00:1B:21:02:89:DA
          inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe02:89da/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:44906 errors:0 dropped:0 overruns:0 frame:0
          TX packets:77644 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3309896 (3.1 MiB) TX bytes:101134505 (96.4 MiB)
          Memory:feae0000-feb00000

eth1 Link encap:Ethernet HWaddr 00:0E:0C:BC:AB:6D
          inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:febc:ab6d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:124 errors:0 dropped:0 overruns:0 frame:0
          TX packets:133 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7440 (7.2 KiB) TX bytes:10027 (9.7 KiB)

eth2 Link encap:Ethernet HWaddr 00:1B:FC:A0:A7:92
          inet addr:192.168.7.1 Bcast:192.168.7.255 Mask:255.255.255.0
          inet6 addr: fe80::21b:fcff:fea0:a792/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:159 errors:0 dropped:0 overruns:0 frame:0
          TX packets:158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:10902 (10.6 KiB) TX bytes:13691 (13.3 KiB)
          Interrupt:17

eth4 Link encap:Ethernet HWaddr 00:0E:0C:B9:50:A3
          inet addr:192.168.0.198 Bcast:192.168.0.255 Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:feb9:50a3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:25111 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11633 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:24133775 (23.0 MiB) TX bytes:833868 (814.3 KiB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:28973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28973 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1223211 (1.1 MiB) TX bytes:1223211 (1.1 MiB)

pan0 Link encap:Ethernet HWaddr CA:00:CE:02:90:90
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

sit0 Link encap:IPv6-in-IPv4
          NOARP MTU:1480 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

virbr0 Link encap:Ethernet HWaddr EA:6D:E7:85:8D:E7
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          inet6 addr: fe80::e86d:e7ff:fe85:8de7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:5083 (4.9 KiB)

Here is the ifconfig -a for x2:
eth0 Link encap:Ethernet HWaddr 00:1B:21:02:DE:E9
          inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe02:dee9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:565 errors:0 dropped:0 overruns:0 frame:0
          TX packets:565 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:181079 (176.8 KiB) TX bytes:106650 (104.1 KiB)
          Memory:feae0000-feb00000

eth1 Link encap:Ethernet HWaddr 00:0E:0C:BC:B1:7D
          inet addr:192.168.3.2 Bcast:192.168.3.255 Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:febc:b17d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:660 (660.0 b) TX bytes:1136 (1.1 KiB)

eth2 Link encap:Ethernet HWaddr 00:1F:C6:27:1C:79
          inet addr:192.168.7.2 Bcast:192.168.7.255 Mask:255.255.255.0
          inet6 addr: fe80::21f:c6ff:fe27:1c79/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:506 (506.0 b) TX bytes:1094 (1.0 KiB)
          Interrupt:17

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:1604 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1604 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:140216 (136.9 KiB) TX bytes:140216 (136.9 KiB)

sit0 Link encap:IPv6-in-IPv4
          NOARP MTU:1480 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
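Comparing the two dumps: the addresses the x2 daemon failed to connect back to (192.168.0.198 and 192.168.122.1) belong to x1's eth4 and virbr0 interfaces, and x2 has no interface on either subnet. One way to compare address lists between hosts is to pull the IPv4 addresses out of this old net-tools `ifconfig` format; a sketch, assuming the field layout shown in the dumps above:

```shell
# Pull the IPv4 addresses out of old-style (net-tools) ifconfig output,
# as shown in the dumps above. Usage: ifconfig -a | list_inet_addrs
list_inet_addrs() {
    grep -o 'inet addr:[0-9.]*' | cut -d: -f2
}
```

Run on x1 this lists 192.168.0.198 and 192.168.122.1, which are absent from x2's list; that is consistent with the "Network is unreachable" errors in the log above.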

Any help would be appreciated!
Allan Menezes
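Since the unreachable addresses belong to interfaces that exist only on x1 (eth4 and the libvirt bridge virbr0), one possible workaround is to restrict Open MPI's out-of-band channel and TCP transport to the interface all six nodes share. This is a sketch, not a verified fix for this cluster; it assumes eth0's 192.168.1.x subnet is the common one:

```shell
# Restrict both the runtime's OOB channel and MPI's TCP BTL to eth0,
# the interface assumed (from the ifconfig dumps) to be on every node.
mpirun --prefix /opt/openmpi13b1 \
       --mca oob_tcp_if_include eth0 \
       --mca btl_tcp_if_include eth0 \
       --host x2 hostname
```

Excluding the problem interfaces instead (`--mca oob_tcp_if_exclude virbr0,eth4`) is an alternative when the shared interface name differs between nodes.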