
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-11-09 22:41:18


Sorry for the delay in replying -- it's a crazy week here preparing for
SC next week.

I'm double-checking the code, and I don't see any obvious problems with
the btl_tcp_if_include handling.

Can you also specify that you want OMPI's "out of band" communication
to use a specific network? For example:

> mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0
> -np 2 a.out
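
If it's easier, the same settings can also go in ~/.openmpi/mca-params.conf
so they apply to every run -- just a sketch, along the lines of the file you
already have on zelda01:

   btl_tcp_if_include=eth0
   oob_tcp_include=eth0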

With the segvs, do you get meaningful core dumps? Can you send backtraces?
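
In case it helps, here's a rough sketch of how to grab one (assuming a bash
shell, that gdb is installed, and that the kernel drops core files as "core"
or "core.<pid>" in the working directory):

   # remove the core-size limit before launching mpiexec
   ulimit -c unlimited
   # load the core from the crashed rank and print the backtrace
   gdb /home/humphrey/a.out core.<pid>
   (gdb) bt full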

On Nov 8, 2005, at 3:02 PM, Marty Humphrey wrote:

> It's taken me a while, but I've simplified the experiment...
>
> In a nutshell, I'm seeing strange behavior on my multi-NIC box when I
> attempt to execute "mpiexec -d --mca btl_tcp_if_include eth0 -np 2 a.out".
> I have three different observed behaviors:
>
> [1] MPI worker rank 0 displays the banner and then just hangs (apparently
> trying to exchange MPI messages, which don't get delivered)
>
> 2 PE'S AS A 2 BY 1 GRID
>
> [2] it starts progressing (spitting out domain-specific messages):
>
> 2 PE'S AS A 2 BY 1 GRID
>
> HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
> HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
> HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
> HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
> HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
>
> [3] I get failure pretty quickly, with the line "mpiexec noticed that job
> rank 1 with PID 20425 on node "localhost" exited on signal 11."
>
> Here's the output of "ifconfig":
>
> [humphrey_at_zelda01 humphrey]$ /sbin/ifconfig
> eth0 Link encap:Ethernet HWaddr 00:11:43:DC:EA:EE
> inet addr:130.207.252.131 Bcast:130.207.252.255
> Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:2441905 errors:0 dropped:0 overruns:0 frame:0
> TX packets:112786 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:197322445 (188.1 Mb) TX bytes:32906750 (31.3 Mb)
> Base address:0xecc0 Memory:dfae0000-dfb00000
>
> eth2 Link encap:Ethernet HWaddr 00:11:95:C7:28:82
> inet addr:10.0.0.11 Bcast:10.0.0.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:11598757 errors:0 dropped:0 overruns:0 frame:0
> TX packets:7224590 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:3491651158 (3329.8 Mb) TX bytes:1916674000 (1827.8
> Mb)
> Interrupt:77 Base address:0xcc00
>
> ipsec0 Link encap:Ethernet HWaddr 00:11:43:DC:EA:EE
> inet addr:130.207.252.131 Mask:255.255.255.0
> UP RUNNING NOARP MTU:16260 Metric:1
> RX packets:40113 errors:0 dropped:40113 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:10
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:4742 errors:0 dropped:0 overruns:0 frame:0
> TX packets:4742 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:2369841 (2.2 Mb) TX bytes:2369841 (2.2 Mb)
>
> This is with openmpi-1.1a1r8038.
>
> Here is the output of a hanging invocation....
>
> ----- begin hanging invocation ----
> [humphrey_at_zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
> -np 2
> a.out
> [zelda01.localdomain:20455] procdir: (null)
> [zelda01.localdomain:20455] jobdir: (null)
> [zelda01.localdomain:20455] unidir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> [zelda01.localdomain:20455] top:
> openmpi-sessions-humphrey_at_zelda01.localdomain_0
> [zelda01.localdomain:20455] tmp: /tmp
> [zelda01.localdomain:20455] connect_uni: contact info read
> [zelda01.localdomain:20455] connect_uni: connection not allowed
> [zelda01.localdomain:20455] [0,0,0] setting up session dir with
> [zelda01.localdomain:20455] tmpdir /tmp
> [zelda01.localdomain:20455] universe default-universe-20455
> [zelda01.localdomain:20455] user humphrey
> [zelda01.localdomain:20455] host zelda01.localdomain
> [zelda01.localdomain:20455] jobid 0
> [zelda01.localdomain:20455] procid 0
> [zelda01.localdomain:20455] procdir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20455/
> 0/0
> [zelda01.localdomain:20455] jobdir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20455/
> 0
> [zelda01.localdomain:20455] unidir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20455
> [zelda01.localdomain:20455] top:
> openmpi-sessions-humphrey_at_zelda01.localdomain_0
> [zelda01.localdomain:20455] tmp: /tmp
> [zelda01.localdomain:20455] [0,0,0] contact_file
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20455/
> universe-setup.txt
> [zelda01.localdomain:20455] [0,0,0] wrote setup file
> [zelda01.localdomain:20455] pls:rsh: local csh: 0, local bash: 1
> [zelda01.localdomain:20455] pls:rsh: assuming same remote shell as
> local
> shell
> [zelda01.localdomain:20455] pls:rsh: remote csh: 0, remote bash: 1
> [zelda01.localdomain:20455] pls:rsh: final template argv:
> [zelda01.localdomain:20455] pls:rsh: ssh <template> orted --debug
> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> <template> --universe
> humphrey_at_zelda01.localdomain:default-universe-20455
> --nsreplica
> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> 130.207.252.1
> 31:35465" --gprreplica
> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> 130.207.252.1
> 31:35465" --mpi-call-yield 0
> [zelda01.localdomain:20455] pls:rsh: launching on node localhost
> [zelda01.localdomain:20455] pls:rsh: oversubscribed -- setting
> mpi_yield_when_idle to 1 (1 2)
> [zelda01.localdomain:20455] pls:rsh: localhost is a LOCAL node
> [zelda01.localdomain:20455] pls:rsh: executing: orted --debug
> --bootproxy 1
> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
> --universe
> humphrey_at_zelda01.localdomain:default-universe-20455 --nsreplica
> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> 130.207.252.1
> 31:35465" --gprreplica
> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> 130.207.252.1
> 31:35465" --mpi-call-yield 1
> [zelda01.localdomain:20456] [0,0,1] setting up session dir with
> [zelda01.localdomain:20456] universe default-universe-20455
> [zelda01.localdomain:20456] user humphrey
> [zelda01.localdomain:20456] host localhost
> [zelda01.localdomain:20456] jobid 0
> [zelda01.localdomain:20456] procid 1
> [zelda01.localdomain:20456] procdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/0/1
> [zelda01.localdomain:20456] jobdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/0
> [zelda01.localdomain:20456] unidir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
> [zelda01.localdomain:20456] top: openmpi-sessions-humphrey_at_localhost_0
> [zelda01.localdomain:20456] tmp: /tmp
> [zelda01.localdomain:20457] [0,1,1] setting up session dir with
> [zelda01.localdomain:20457] universe default-universe-20455
> [zelda01.localdomain:20457] user humphrey
> [zelda01.localdomain:20457] host localhost
> [zelda01.localdomain:20457] jobid 1
> [zelda01.localdomain:20457] procid 1
> [zelda01.localdomain:20457] procdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1/1
> [zelda01.localdomain:20457] jobdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1
> [zelda01.localdomain:20457] unidir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
> [zelda01.localdomain:20457] top: openmpi-sessions-humphrey_at_localhost_0
> [zelda01.localdomain:20457] tmp: /tmp
> [zelda01.localdomain:20458] [0,1,0] setting up session dir with
> [zelda01.localdomain:20458] universe default-universe-20455
> [zelda01.localdomain:20458] user humphrey
> [zelda01.localdomain:20458] host localhost
> [zelda01.localdomain:20458] jobid 1
> [zelda01.localdomain:20458] procid 0
> [zelda01.localdomain:20458] procdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1/0
> [zelda01.localdomain:20458] jobdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1
> [zelda01.localdomain:20458] unidir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
> [zelda01.localdomain:20458] top: openmpi-sessions-humphrey_at_localhost_0
> [zelda01.localdomain:20458] tmp: /tmp
> [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
> state =
> 0x3)
> [zelda01.localdomain:20455] Info: Setting up debugger process table for
> applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20457)
> (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20458)
> [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
> state =
> 0x4)
> [zelda01.localdomain:20458] [0,1,0] ompi_mpi_init completed
> [zelda01.localdomain:20457] [0,1,1] ompi_mpi_init completed
>
> 2 PE'S AS A 2 BY 1 GRID
> ------ end hanging invocation -----
>
> Here's the roughly 1-in-20 invocation that started working...
>
> ------- begin non-hanging invocation -----
> [humphrey_at_zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
> -np 2
> a.out
> [zelda01.localdomain:20659] procdir: (null)
> [zelda01.localdomain:20659] jobdir: (null)
> [zelda01.localdomain:20659] unidir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> [zelda01.localdomain:20659] top:
> openmpi-sessions-humphrey_at_zelda01.localdomain_0
> [zelda01.localdomain:20659] tmp: /tmp
> [zelda01.localdomain:20659] connect_uni: contact info read
> [zelda01.localdomain:20659] connect_uni: connection not allowed
> [zelda01.localdomain:20659] [0,0,0] setting up session dir with
> [zelda01.localdomain:20659] tmpdir /tmp
> [zelda01.localdomain:20659] universe default-universe-20659
> [zelda01.localdomain:20659] user humphrey
> [zelda01.localdomain:20659] host zelda01.localdomain
> [zelda01.localdomain:20659] jobid 0
> [zelda01.localdomain:20659] procid 0
> [zelda01.localdomain:20659] procdir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20659/
> 0/0
> [zelda01.localdomain:20659] jobdir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20659/
> 0
> [zelda01.localdomain:20659] unidir:
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20659
> [zelda01.localdomain:20659] top:
> openmpi-sessions-humphrey_at_zelda01.localdomain_0
> [zelda01.localdomain:20659] tmp: /tmp
> [zelda01.localdomain:20659] [0,0,0] contact_file
> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> -20659/
> universe-setup.txt
> [zelda01.localdomain:20659] [0,0,0] wrote setup file
> [zelda01.localdomain:20659] pls:rsh: local csh: 0, local bash: 1
> [zelda01.localdomain:20659] pls:rsh: assuming same remote shell as
> local
> shell
> [zelda01.localdomain:20659] pls:rsh: remote csh: 0, remote bash: 1
> [zelda01.localdomain:20659] pls:rsh: final template argv:
> [zelda01.localdomain:20659] pls:rsh: ssh <template> orted --debug
> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> <template> --universe
> humphrey_at_zelda01.localdomain:default-universe-20659
> --nsreplica
> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> 130.207.252.1
> 31:35654" --gprreplica
> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> 130.207.252.1
> 31:35654" --mpi-call-yield 0
> [zelda01.localdomain:20659] pls:rsh: launching on node localhost
> [zelda01.localdomain:20659] pls:rsh: oversubscribed -- setting
> mpi_yield_when_idle to 1 (1 2)
> [zelda01.localdomain:20659] pls:rsh: localhost is a LOCAL node
> [zelda01.localdomain:20659] pls:rsh: executing: orted --debug
> --bootproxy 1
> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
> --universe
> humphrey_at_zelda01.localdomain:default-universe-20659 --nsreplica
> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> 130.207.252.1
> 31:35654" --gprreplica
> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> 130.207.252.1
> 31:35654" --mpi-call-yield 1
> [zelda01.localdomain:20660] [0,0,1] setting up session dir with
> [zelda01.localdomain:20660] universe default-universe-20659
> [zelda01.localdomain:20660] user humphrey
> [zelda01.localdomain:20660] host localhost
> [zelda01.localdomain:20660] jobid 0
> [zelda01.localdomain:20660] procid 1
> [zelda01.localdomain:20660] procdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/0/1
> [zelda01.localdomain:20660] jobdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/0
> [zelda01.localdomain:20660] unidir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
> [zelda01.localdomain:20660] top: openmpi-sessions-humphrey_at_localhost_0
> [zelda01.localdomain:20660] tmp: /tmp
> [zelda01.localdomain:20661] [0,1,1] setting up session dir with
> [zelda01.localdomain:20661] universe default-universe-20659
> [zelda01.localdomain:20661] user humphrey
> [zelda01.localdomain:20661] host localhost
> [zelda01.localdomain:20661] jobid 1
> [zelda01.localdomain:20661] procid 1
> [zelda01.localdomain:20661] procdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1/1
> [zelda01.localdomain:20661] jobdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1
> [zelda01.localdomain:20661] unidir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
> [zelda01.localdomain:20661] top: openmpi-sessions-humphrey_at_localhost_0
> [zelda01.localdomain:20661] tmp: /tmp
> [zelda01.localdomain:20662] [0,1,0] setting up session dir with
> [zelda01.localdomain:20662] universe default-universe-20659
> [zelda01.localdomain:20662] user humphrey
> [zelda01.localdomain:20662] host localhost
> [zelda01.localdomain:20662] jobid 1
> [zelda01.localdomain:20662] procid 0
> [zelda01.localdomain:20662] procdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1/0
> [zelda01.localdomain:20662] jobdir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1
> [zelda01.localdomain:20662] unidir:
> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
> [zelda01.localdomain:20662] top: openmpi-sessions-humphrey_at_localhost_0
> [zelda01.localdomain:20662] tmp: /tmp
> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> state =
> 0x3)
> [zelda01.localdomain:20659] Info: Setting up debugger process table for
> applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20661)
> (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20662)
> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> state =
> 0x4)
> [zelda01.localdomain:20662] [0,1,0] ompi_mpi_init completed
> [zelda01.localdomain:20661] [0,1,1] ompi_mpi_init completed
>
> 2 PE'S AS A 2 BY 1 GRID
>
> HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
> HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
> HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
> HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
> HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
> HALO2A NPES,N = 2 64 TIME = 0.000011 SECONDS
> mpiexec: killing job...
> Interrupt
> Interrupt
> [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> empty
> - deleting
> [zelda01.localdomain:20660] sess_dir_finalize: job session dir not
> empty -
> leaving
> [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> empty
> - deleting
> [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
> empty -
> deleting
> [zelda01.localdomain:20660] sess_dir_finalize: univ session dir not
> empty -
> leaving
> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> state =
> 0xa)
> [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
> =
> ORTE_PROC_STATE_ABORTED)
> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> state =
> 0x9)
> 2 processes killed (possibly by Open MPI)
> [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
> =
> ORTE_PROC_STATE_TERMINATED)
> [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> empty
> - deleting
> [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
> empty -
> deleting
> [zelda01.localdomain:20660] sess_dir_finalize: found univ session dir
> empty
> - deleting
> [zelda01.localdomain:20660] sess_dir_finalize: found top session dir
> empty -
> deleting
> [zelda01.localdomain:20659] sess_dir_finalize: found proc session dir
> empty
> - deleting
> [zelda01.localdomain:20659] sess_dir_finalize: found job session dir
> empty -
> deleting
> [zelda01.localdomain:20659] sess_dir_finalize: found univ session dir
> empty
> - deleting
> [zelda01.localdomain:20659] sess_dir_finalize: top session dir not
> empty -
> leaving
> [humphrey_at_zelda01 humphrey]$
> -------- end non-hanging invocation ------
>
> Any thoughts?
>
> -- Marty
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On
>> Behalf Of Jeff Squyres
>> Sent: Tuesday, November 01, 2005 2:17 PM
>> To: Open MPI Users
>> Subject: Re: [O-MPI users] can't get openmpi to run across two
>> multi-NIC machines
>>
>> On Nov 1, 2005, at 12:02 PM, Marty Humphrey wrote:
>>
>>> wukong: eth0 (152.48.249.102, no MPI traffic), eth1 (128.109.34.20, yes
>>> MPI traffic)
>>> zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12, no
>>> MPI traffic)
>>>
>>> on wukong, I have :
>>> [humphrey_at_wukong ~]$ more ~/.openmpi/mca-params.conf
>>> btl_tcp_if_include=eth1
>>> on zelda01, I have :
>>> [humphrey_at_zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
>>> btl_tcp_if_include=eth0
>>
>> Just to make sure I'm reading this right -- 128.109.34.20 is supposed
>> to be routable to 130.207.252.131, right? Can you ssh directly from
>> one machine to the other? (I'm guessing that you can, since OMPI was
>> able to start processes.) Can you ping one machine from the other?
>>
>> Most importantly -- can you open arbitrary TCP ports between the two
>> machines? (i.e., not just well-known ports like 22 [ssh], etc.)
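>>
>> (One quick way to check -- just a sketch, assuming netcat is installed
>> and picking an arbitrary unused high port, say 12345: run "nc -l -p 12345"
>> on zelda01, then "telnet 130.207.252.131 12345" from wukong. If the
>> connection doesn't open, something in between is filtering the port.)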
>>
>> --
>> {+} Jeff Squyres
>> {+} The Open MPI Project
>> {+} http://www.open-mpi.org/
>>
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/