Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-11-13 14:52:27
Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines


To close this thread for the web archives...

We iterated on this quite a bit off the list and fixed a pair of bugs;
those fixes didn't make it into RC5. Many thanks to Marty for his
patience in helping us fix this!

For those who care, the bugs were:

- The shared memory btl had a problem if mmap() returned different
addresses for the same shared memory segment in different processes
(see the first sketch below).
- The TCP btl does subnet mask checking to help determine which IP
addresses to hook up amongst peers (remember that Open MPI can utilize
multiple TCP interfaces in a single job); there was a bug that prevented
arbitrary connections between addresses whose subnets do not match (see
the second sketch below).
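
To make the first bug concrete, here is a minimal, self-contained sketch
(not the actual Open MPI code; the segment name and layout are invented
for illustration) of why a shared-memory transport has to store
base-relative offsets instead of raw pointers: mmap() may place the same
segment at different addresses in different processes, so a pointer
written by one process is meaningless in another.

/* Sketch only: offsets, not pointers, in shared memory. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_NAME "/offset_demo"   /* made-up name for this sketch */
#define SEG_SIZE 4096

int main(void)
{
    /* Create and size a POSIX shared memory segment. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SEG_SIZE);
    char *base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    /* Record where the payload lives as an OFFSET from the segment
     * base, not as an absolute pointer; absolute pointers are only
     * meaningful in the address space that created them. */
    size_t *payload_offset = (size_t *) base;
    char   *payload        = base + sizeof(size_t);
    *payload_offset = sizeof(size_t);
    strcpy(payload, "hello from the parent");

    if (fork() == 0) {
        /* The child maps the same segment; the kernel is free to pick
         * a different base address here.  Resolving the stored offset
         * against the child's own base still finds the right data. */
        char *child_base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        printf("child read: %s\n", child_base + *(size_t *) child_base);
        _exit(0);
    }

    wait(NULL);
    shm_unlink(SEG_NAME);
    return 0;
}

(Error checking is omitted to keep the sketch short; on older Linux
systems shm_open may need linking with -lrt.)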

Both bugs have been fixed on the SVN trunk and the v1.0 branch, and the
fixes are in this morning's nightly snapshot tarballs.
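
For anyone curious about the second bug, the subnet test in question
amounts to masking two IPv4 addresses and comparing the results,
roughly as in the sketch below (again, an illustration, not the
btl_tcp source). The essence of the fix is that a failed match, such
as the 128.109.34.20 / 130.207.252.131 pair later in this thread, must
not by itself prevent a connection attempt.

#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>

/* Returns 1 if the two dotted-quad addresses fall in the same subnet
 * under the given netmask, 0 otherwise. */
static int same_subnet(const char *a, const char *b, const char *mask)
{
    struct in_addr ia, ib, im;
    inet_pton(AF_INET, a, &ia);
    inet_pton(AF_INET, b, &ib);
    inet_pton(AF_INET, mask, &im);
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}

int main(void)
{
    /* Different subnets, yet the hosts in this thread can reach each
     * other, so the check can only be a hint, not a veto. */
    printf("same subnet: %d\n",
           same_subnet("128.109.34.20", "130.207.252.131",
                       "255.255.255.0"));
    return 0;
}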

On Nov 10, 2005, at 9:02 AM, Marty Humphrey wrote:

> Here's a core I'm getting...
>
> [humphrey_at_zelda01 humphrey]$ mpiexec --mca btl_tcp_if_include eth0
> --mca
> oob_tcp_include eth0 -np 2 a.out
> mpiexec noticed that job rank 1 with PID 20028 on node "localhost"
> exited on
> signal 11.
> 1 process killed (possibly by Open MPI)
>
> [humphrey_at_zelda01 humphrey]$ gdb a.out core.20028
> GNU gdb Red Hat Linux (6.3.0.0-1.62rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and
> you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for
> details.
> This GDB was configured as "i386-redhat-linux-gnu"...Using host
> libthread_db
> library "/lib/tls/libthread_db.so.1".
>
> Core was generated by `a.out'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from
> /home/humphrey/ompi-install/lib/libmpi.so.0...done.
> Loaded symbols for /home/humphrey/ompi-install/lib/libmpi.so.0
> Reading symbols from
> /home/humphrey/ompi-install/lib/liborte.so.0...done.
> Loaded symbols for /home/humphrey/ompi-install/lib/liborte.so.0
> Reading symbols from
> /home/humphrey/ompi-install/lib/libopal.so.0...done.
> Loaded symbols for /home/humphrey/ompi-install/lib/libopal.so.0
> Reading symbols from /lib/libutil.so.1...done.
> Loaded symbols for /lib/libutil.so.1
> Reading symbols from /lib/libnsl.so.1...done.
> Loaded symbols for /lib/libnsl.so.1
> Reading symbols from /lib/libdl.so.2...done.
> Loaded symbols for /lib/libdl.so.2
> Reading symbols from /usr/lib/libaio.so.1...done.
> Loaded symbols for /usr/lib/libaio.so.1
> Reading symbols from /usr/lib/libg2c.so.0...done.
> Loaded symbols for /usr/lib/libg2c.so.0
> Reading symbols from /lib/tls/libm.so.6...done.
> Loaded symbols for /lib/tls/libm.so.6
> Reading symbols from /lib/libgcc_s.so.1...done.
> Loaded symbols for /lib/libgcc_s.so.1
> Reading symbols from /lib/tls/libpthread.so.0...done.
> Loaded symbols for /lib/tls/libpthread.so.0
> Reading symbols from /lib/tls/libc.so.6...done.
> Loaded symbols for /lib/tls/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_paffinity_linux.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_paffinity_linux.so
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ns_proxy.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ns_proxy.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ns_replica.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ns_replica.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rml_oob.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rml_oob.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_oob_tcp.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_oob_tcp.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_gpr_null.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_gpr_null.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_gpr_proxy.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_gpr_proxy.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_gpr_replica.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_gpr_replica.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rmgr_proxy.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rmgr_proxy.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rmgr_urm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rmgr_urm.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rds_hostfile.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rds_hostfile.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rds_resfile.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rds_resfile.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_dash_host.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_dash_host.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_hostfile.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_hostfile.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_localhost.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_localhost.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_slurm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ras_slurm.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rmaps_round_robin.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rmaps_round_robin.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_fork.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_fork.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_proxy.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_proxy.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_rsh.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_rsh.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_slurm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_pls_slurm.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_iof_proxy.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_iof_proxy.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_allocator_basic.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_allocator_basic.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_allocator_bucket.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_allocator_bucket.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_rcache_rb.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_rcache_rb.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_mpool_sm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_mpool_sm.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/libmca_common_sm.so.0...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/libmca_common_sm.so.0
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_pml_ob1.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_pml_ob1.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_bml_r2.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_bml_r2.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_btl_self.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_btl_self.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_btl_sm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_btl_sm.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_btl_tcp.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_btl_tcp.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ptl_self.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ptl_self.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ptl_sm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ptl_sm.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_ptl_tcp.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_ptl_tcp.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_basic.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_basic.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_hierarch.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_hierarch.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_self.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_self.so
> Reading symbols from
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_sm.so...done.
> Loaded symbols for
> /home/humphrey/ompi-install/lib/openmpi/mca_coll_sm.so
> #0 0x009c4cbd in mca_btl_sm_add_procs_same_base_addr (btl=0x9c97c0,
> nprocs=2, procs=0x8c28628, peers=0x8c28660,
> reachability=0xbfffde80) at btl_sm.c:412
> 412 mca_btl_sm_component.sm_ctl_header->segment_header.
> (gdb) bt
> #0 0x009c4cbd in mca_btl_sm_add_procs_same_base_addr (btl=0x9c97c0,
> nprocs=2, procs=0x8c28628, peers=0x8c28660,
> reachability=0xbfffde80) at btl_sm.c:412
> #1 0x005e7245 in mca_bml_r2_add_procs (nprocs=2, procs=0x8c28628,
> bml_endpoints=0x8c28608, reachable=0xbfffde80) at bml_r2.c:220
> #2 0x00323671 in mca_pml_ob1_add_procs (procs=0x8c285f8, nprocs=2) at
> pml_ob1.c:131
> #3 0x00ed6e81 in ompi_mpi_init (argc=0, argv=0x0, requested=0,
> provided=0xbfffdf2c) at runtime/ompi_mpi_init.c:396
> #4 0x00f00c62 in PMPI_Init (argc=0xbfffdf60, argv=0xbfffdf5c) at
> pinit.c:71
> #5 0x00f2b23b in mpi_init_f (ierr=0x8052580) at pinit_f.c:65
> #6 0x08049362 in MAIN__ () at Halo.f:19
> #7 0x0804b7e6 in main ()
> (gdb) up
> #1 0x005e7245 in mca_bml_r2_add_procs (nprocs=2, procs=0x8c28628,
> bml_endpoints=0x8c28608, reachable=0xbfffde80) at bml_r2.c:220
> 220 rc = btl->btl_add_procs(btl, n_new_procs, new_procs,
> btl_endpoints, reachable);
> (gdb) up
> #2 0x00323671 in mca_pml_ob1_add_procs (procs=0x8c285f8, nprocs=2) at
> pml_ob1.c:131
> 131 rc = mca_bml.bml_add_procs(
> (gdb) up
> #3 0x00ed6e81 in ompi_mpi_init (argc=0, argv=0x0, requested=0,
> provided=0xbfffdf2c) at runtime/ompi_mpi_init.c:396
> 396 ret = MCA_PML_CALL(add_procs(procs, nprocs));
> (gdb) up
> #4 0x00f00c62 in PMPI_Init (argc=0xbfffdf60, argv=0xbfffdf5c) at
> pinit.c:71
> 71 err = ompi_mpi_init(*argc, *argv, required, &provided);
> (gdb) up
> #5 0x00f2b23b in mpi_init_f (ierr=0x8052580) at pinit_f.c:65
> 65 *ierr = OMPI_INT_2_FINT(MPI_Init( &argc, &argv ));
> (gdb) up
> #6 0x08049362 in MAIN__ () at Halo.f:19
> 19 CALL MPI_INIT(MPIERR)
> Current language: auto; currently fortran
> (gdb)
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On
>> Behalf Of Jeff Squyres
>> Sent: Wednesday, November 09, 2005 10:41 PM
>> To: Open MPI Users
>> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
>>
>> Sorry for the delay in replying -- it's a crazy week here preparing
>> for
>> SC next week.
>>
>> I'm double checking the code, and I don't see any obvious problems
>> with
>> the btl tcp include stuff.
>>
>> Can you also specify that you want OMPI's "out of band" communication
>> to use a specific network?
>>
>>> mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0
>>> -np 2 a.out
>>
>> With the segv's, do you get meaningful core dumps? Can you send
>> backtraces?
>>
>>
>>
>> On Nov 8, 2005, at 3:02 PM, Marty Humphrey wrote:
>>
>>> It's taken me a while, but I've simplified the experiment...
>>>
>>> In a nutshell, I'm seeing strange behavior in my multi-NIC box when I
>>> attempt to execute " mpiexec -d --mca btl_tcp_if_include eth0 -np 2
>>> a.out".
>>> I have three different observed behaviors:
>>>
>>> [1] mpi worker rank 0 displays the banner and then just hangs
>>> (apparently
>>> trying to exchange MPI messages, which don't get delivered)
>>>
>>> 2 PE'S AS A 2 BY 1 GRID
>>>
>>> [2] it starts progressing (spitting out domain-specific msgs):
>>>
>>> 2 PE'S AS A 2 BY 1 GRID
>>>
>>> HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
>>> HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
>>> HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
>>> HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
>>> HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
>>>
>>> [3] I get failure pretty quickly, with the line " mpiexec noticed
>>> that
>>> job
>>> rank 1 with PID 20425 on node "localhost" exited on signal 11."
>>>
>>> Here's the output of "ifconfig":
>>>
>>> [humphrey_at_zelda01 humphrey]$ /sbin/ifconfig
>>> eth0 Link encap:Ethernet HWaddr 00:11:43:DC:EA:EE
>>> inet addr:130.207.252.131 Bcast:130.207.252.255
>>> Mask:255.255.255.0
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:2441905 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:112786 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:197322445 (188.1 Mb) TX bytes:32906750 (31.3 Mb)
>>> Base address:0xecc0 Memory:dfae0000-dfb00000
>>>
>>> eth2 Link encap:Ethernet HWaddr 00:11:95:C7:28:82
>>> inet addr:10.0.0.11 Bcast:10.0.0.255 Mask:255.255.255.0
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:11598757 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:7224590 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:3491651158 (3329.8 Mb) TX bytes:1916674000
>>> (1827.8
>>> Mb)
>>> Interrupt:77 Base address:0xcc00
>>>
>>> ipsec0 Link encap:Ethernet HWaddr 00:11:43:DC:EA:EE
>>> inet addr:130.207.252.131 Mask:255.255.255.0
>>> UP RUNNING NOARP MTU:16260 Metric:1
>>> RX packets:40113 errors:0 dropped:40113 overruns:0 frame:0
>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:10
>>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>>>
>>> lo Link encap:Local Loopback
>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>> RX packets:4742 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:4742 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:0
>>> RX bytes:2369841 (2.2 Mb) TX bytes:2369841 (2.2 Mb)
>>>
>>> This is with openmpi-1.1a1r8038 .
>>>
>>> Here is the output of a hanging invocation....
>>>
>>> ----- begin hanging invocation ----
>>> [humphrey_at_zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
>>> -np 2
>>> a.out
>>> [zelda01.localdomain:20455] procdir: (null)
>>> [zelda01.localdomain:20455] jobdir: (null)
>>> [zelda01.localdomain:20455] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> [zelda01.localdomain:20455] top:
>>> openmpi-sessions-humphrey_at_zelda01.localdomain_0
>>> [zelda01.localdomain:20455] tmp: /tmp
>>> [zelda01.localdomain:20455] connect_uni: contact info read
>>> [zelda01.localdomain:20455] connect_uni: connection not allowed
>>> [zelda01.localdomain:20455] [0,0,0] setting up session dir with
>>> [zelda01.localdomain:20455] tmpdir /tmp
>>> [zelda01.localdomain:20455] universe default-universe-20455
>>> [zelda01.localdomain:20455] user humphrey
>>> [zelda01.localdomain:20455] host zelda01.localdomain
>>> [zelda01.localdomain:20455] jobid 0
>>> [zelda01.localdomain:20455] procid 0
>>> [zelda01.localdomain:20455] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20455/
>>> 0/0
>>> [zelda01.localdomain:20455] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20455/
>>> 0
>>> [zelda01.localdomain:20455] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20455
>>> [zelda01.localdomain:20455] top:
>>> openmpi-sessions-humphrey_at_zelda01.localdomain_0
>>> [zelda01.localdomain:20455] tmp: /tmp
>>> [zelda01.localdomain:20455] [0,0,0] contact_file
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20455/
>>> universe-setup.txt
>>> [zelda01.localdomain:20455] [0,0,0] wrote setup file
>>> [zelda01.localdomain:20455] pls:rsh: local csh: 0, local bash: 1
>>> [zelda01.localdomain:20455] pls:rsh: assuming same remote shell as
>>> local
>>> shell
>>> [zelda01.localdomain:20455] pls:rsh: remote csh: 0, remote bash: 1
>>> [zelda01.localdomain:20455] pls:rsh: final template argv:
>>> [zelda01.localdomain:20455] pls:rsh: ssh <template> orted --debug
>>> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0
>>> --nodename
>>> <template> --universe
>>> humphrey_at_zelda01.localdomain:default-universe-20455
>>> --nsreplica
>>> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
>>> 130.207.252.1
>>> 31:35465" --gprreplica
>>> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
>>> 130.207.252.1
>>> 31:35465" --mpi-call-yield 0
>>> [zelda01.localdomain:20455] pls:rsh: launching on node localhost
>>> [zelda01.localdomain:20455] pls:rsh: oversubscribed -- setting
>>> mpi_yield_when_idle to 1 (1 2)
>>> [zelda01.localdomain:20455] pls:rsh: localhost is a LOCAL node
>>> [zelda01.localdomain:20455] pls:rsh: executing: orted --debug
>>> --bootproxy 1
>>> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
>>> --universe
>>> humphrey_at_zelda01.localdomain:default-universe-20455 --nsreplica
>>> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
>>> 130.207.252.1
>>> 31:35465" --gprreplica
>>> "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
>>> 130.207.252.1
>>> 31:35465" --mpi-call-yield 1
>>> [zelda01.localdomain:20456] [0,0,1] setting up session dir with
>>> [zelda01.localdomain:20456] universe default-universe-20455
>>> [zelda01.localdomain:20456] user humphrey
>>> [zelda01.localdomain:20456] host localhost
>>> [zelda01.localdomain:20456] jobid 0
>>> [zelda01.localdomain:20456] procid 1
>>> [zelda01.localdomain:20456] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/0/1
>>> [zelda01.localdomain:20456] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/0
>>> [zelda01.localdomain:20456] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
>>> [zelda01.localdomain:20456] top:
>>> openmpi-sessions-humphrey_at_localhost_0
>>> [zelda01.localdomain:20456] tmp: /tmp
>>> [zelda01.localdomain:20457] [0,1,1] setting up session dir with
>>> [zelda01.localdomain:20457] universe default-universe-20455
>>> [zelda01.localdomain:20457] user humphrey
>>> [zelda01.localdomain:20457] host localhost
>>> [zelda01.localdomain:20457] jobid 1
>>> [zelda01.localdomain:20457] procid 1
>>> [zelda01.localdomain:20457] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1/1
>>> [zelda01.localdomain:20457] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1
>>> [zelda01.localdomain:20457] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
>>> [zelda01.localdomain:20457] top:
>>> openmpi-sessions-humphrey_at_localhost_0
>>> [zelda01.localdomain:20457] tmp: /tmp
>>> [zelda01.localdomain:20458] [0,1,0] setting up session dir with
>>> [zelda01.localdomain:20458] universe default-universe-20455
>>> [zelda01.localdomain:20458] user humphrey
>>> [zelda01.localdomain:20458] host localhost
>>> [zelda01.localdomain:20458] jobid 1
>>> [zelda01.localdomain:20458] procid 0
>>> [zelda01.localdomain:20458] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1/0
>>> [zelda01.localdomain:20458] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1
>>> [zelda01.localdomain:20458] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
>>> [zelda01.localdomain:20458] top:
>>> openmpi-sessions-humphrey_at_localhost_0
>>> [zelda01.localdomain:20458] tmp: /tmp
>>> [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
>>> state =
>>> 0x3)
>>> [zelda01.localdomain:20455] Info: Setting up debugger process table
>>> for
>>> applications
>>> MPIR_being_debugged = 0
>>> MPIR_debug_gate = 0
>>> MPIR_debug_state = 1
>>> MPIR_acquired_pre_main = 0
>>> MPIR_i_am_starter = 0
>>> MPIR_proctable_size = 2
>>> MPIR_proctable:
>>> (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20457)
>>> (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20458)
>>> [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
>>> state =
>>> 0x4)
>>> [zelda01.localdomain:20458] [0,1,0] ompi_mpi_init completed
>>> [zelda01.localdomain:20457] [0,1,1] ompi_mpi_init completed
>>>
>>> 2 PE'S AS A 2 BY 1 GRID
>>> ------ end hanging invocation -----
>>>
>>> Here's the 1-in-approximately-20 that started working...
>>>
>>> ------- begin non-hanging invocation -----
>>> [humphrey_at_zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
>>> -np 2
>>> a.out
>>> [zelda01.localdomain:20659] procdir: (null)
>>> [zelda01.localdomain:20659] jobdir: (null)
>>> [zelda01.localdomain:20659] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> [zelda01.localdomain:20659] top:
>>> openmpi-sessions-humphrey_at_zelda01.localdomain_0
>>> [zelda01.localdomain:20659] tmp: /tmp
>>> [zelda01.localdomain:20659] connect_uni: contact info read
>>> [zelda01.localdomain:20659] connect_uni: connection not allowed
>>> [zelda01.localdomain:20659] [0,0,0] setting up session dir with
>>> [zelda01.localdomain:20659] tmpdir /tmp
>>> [zelda01.localdomain:20659] universe default-universe-20659
>>> [zelda01.localdomain:20659] user humphrey
>>> [zelda01.localdomain:20659] host zelda01.localdomain
>>> [zelda01.localdomain:20659] jobid 0
>>> [zelda01.localdomain:20659] procid 0
>>> [zelda01.localdomain:20659] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20659/
>>> 0/0
>>> [zelda01.localdomain:20659] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20659/
>>> 0
>>> [zelda01.localdomain:20659] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20659
>>> [zelda01.localdomain:20659] top:
>>> openmpi-sessions-humphrey_at_zelda01.localdomain_0
>>> [zelda01.localdomain:20659] tmp: /tmp
>>> [zelda01.localdomain:20659] [0,0,0] contact_file
>>> /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
>>> -20659/
>>> universe-setup.txt
>>> [zelda01.localdomain:20659] [0,0,0] wrote setup file
>>> [zelda01.localdomain:20659] pls:rsh: local csh: 0, local bash: 1
>>> [zelda01.localdomain:20659] pls:rsh: assuming same remote shell as
>>> local
>>> shell
>>> [zelda01.localdomain:20659] pls:rsh: remote csh: 0, remote bash: 1
>>> [zelda01.localdomain:20659] pls:rsh: final template argv:
>>> [zelda01.localdomain:20659] pls:rsh: ssh <template> orted --debug
>>> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0
>>> --nodename
>>> <template> --universe
>>> humphrey_at_zelda01.localdomain:default-universe-20659
>>> --nsreplica
>>> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
>>> 130.207.252.1
>>> 31:35654" --gprreplica
>>> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
>>> 130.207.252.1
>>> 31:35654" --mpi-call-yield 0
>>> [zelda01.localdomain:20659] pls:rsh: launching on node localhost
>>> [zelda01.localdomain:20659] pls:rsh: oversubscribed -- setting
>>> mpi_yield_when_idle to 1 (1 2)
>>> [zelda01.localdomain:20659] pls:rsh: localhost is a LOCAL node
>>> [zelda01.localdomain:20659] pls:rsh: executing: orted --debug
>>> --bootproxy 1
>>> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
>>> --universe
>>> humphrey_at_zelda01.localdomain:default-universe-20659 --nsreplica
>>> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
>>> 130.207.252.1
>>> 31:35654" --gprreplica
>>> "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
>>> 130.207.252.1
>>> 31:35654" --mpi-call-yield 1
>>> [zelda01.localdomain:20660] [0,0,1] setting up session dir with
>>> [zelda01.localdomain:20660] universe default-universe-20659
>>> [zelda01.localdomain:20660] user humphrey
>>> [zelda01.localdomain:20660] host localhost
>>> [zelda01.localdomain:20660] jobid 0
>>> [zelda01.localdomain:20660] procid 1
>>> [zelda01.localdomain:20660] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/0/1
>>> [zelda01.localdomain:20660] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/0
>>> [zelda01.localdomain:20660] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
>>> [zelda01.localdomain:20660] top:
>>> openmpi-sessions-humphrey_at_localhost_0
>>> [zelda01.localdomain:20660] tmp: /tmp
>>> [zelda01.localdomain:20661] [0,1,1] setting up session dir with
>>> [zelda01.localdomain:20661] universe default-universe-20659
>>> [zelda01.localdomain:20661] user humphrey
>>> [zelda01.localdomain:20661] host localhost
>>> [zelda01.localdomain:20661] jobid 1
>>> [zelda01.localdomain:20661] procid 1
>>> [zelda01.localdomain:20661] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1/1
>>> [zelda01.localdomain:20661] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1
>>> [zelda01.localdomain:20661] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
>>> [zelda01.localdomain:20661] top:
>>> openmpi-sessions-humphrey_at_localhost_0
>>> [zelda01.localdomain:20661] tmp: /tmp
>>> [zelda01.localdomain:20662] [0,1,0] setting up session dir with
>>> [zelda01.localdomain:20662] universe default-universe-20659
>>> [zelda01.localdomain:20662] user humphrey
>>> [zelda01.localdomain:20662] host localhost
>>> [zelda01.localdomain:20662] jobid 1
>>> [zelda01.localdomain:20662] procid 0
>>> [zelda01.localdomain:20662] procdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1/0
>>> [zelda01.localdomain:20662] jobdir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1
>>> [zelda01.localdomain:20662] unidir:
>>> /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
>>> [zelda01.localdomain:20662] top:
>>> openmpi-sessions-humphrey_at_localhost_0
>>> [zelda01.localdomain:20662] tmp: /tmp
>>> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
>>> state =
>>> 0x3)
>>> [zelda01.localdomain:20659] Info: Setting up debugger process table
>>> for
>>> applications
>>> MPIR_being_debugged = 0
>>> MPIR_debug_gate = 0
>>> MPIR_debug_state = 1
>>> MPIR_acquired_pre_main = 0
>>> MPIR_i_am_starter = 0
>>> MPIR_proctable_size = 2
>>> MPIR_proctable:
>>> (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20661)
>>> (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20662)
>>> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
>>> state =
>>> 0x4)
>>> [zelda01.localdomain:20662] [0,1,0] ompi_mpi_init completed
>>> [zelda01.localdomain:20661] [0,1,1] ompi_mpi_init completed
>>>
>>> 2 PE'S AS A 2 BY 1 GRID
>>>
>>> HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
>>> HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
>>> HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
>>> HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
>>> HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
>>> HALO2A NPES,N = 2 64 TIME = 0.000011 SECONDS
>>> mpiexec: killing job...
>>> Interrupt
>>> Interrupt
>>> [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
>>> empty
>>> - deleting
>>> [zelda01.localdomain:20660] sess_dir_finalize: job session dir not
>>> empty -
>>> leaving
>>> [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
>>> empty
>>> - deleting
>>> [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
>>> empty -
>>> deleting
>>> [zelda01.localdomain:20660] sess_dir_finalize: univ session dir not
>>> empty -
>>> leaving
>>> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
>>> state =
>>> 0xa)
>>> [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1,
>>> state
>>> =
>>> ORTE_PROC_STATE_ABORTED)
>>> [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
>>> state =
>>> 0x9)
>>> 2 processes killed (possibly by Open MPI)
>>> [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1,
>>> state
>>> =
>>> ORTE_PROC_STATE_TERMINATED)
>>> [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
>>> empty
>>> - deleting
>>> [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
>>> empty -
>>> deleting
>>> [zelda01.localdomain:20660] sess_dir_finalize: found univ session dir
>>> empty
>>> - deleting
>>> [zelda01.localdomain:20660] sess_dir_finalize: found top session dir
>>> empty -
>>> deleting
>>> [zelda01.localdomain:20659] sess_dir_finalize: found proc session dir
>>> empty
>>> - deleting
>>> [zelda01.localdomain:20659] sess_dir_finalize: found job session dir
>>> empty -
>>> deleting
>>> [zelda01.localdomain:20659] sess_dir_finalize: found univ session dir
>>> empty
>>> - deleting
>>> [zelda01.localdomain:20659] sess_dir_finalize: top session dir not
>>> empty -
>>> leaving
>>> [humphrey_at_zelda01 humphrey]$
>>> -------- end non-hanging invocation ------
>>>
>>> Any thoughts?
>>>
>>> -- Marty
>>>
>>>> -----Original Message-----
>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>> On
>>>> Behalf Of Jeff Squyres
>>>> Sent: Tuesday, November 01, 2005 2:17 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
>>>>
>>>> On Nov 1, 2005, at 12:02 PM, Marty Humphrey wrote:
>>>>
>>>>> wukong: eth0 (152.48.249.102, no MPI traffic), eth1
>>>>> (128.109.34.20,yes
>>>>> MPI
>>>>> traffic)
>>>>> zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12,
>>>>> no
>>>>> MPI
>>>>> traffic)
>>>>>
>>>>> on wukong, I have :
>>>>> [humphrey_at_wukong ~]$ more ~/.openmpi/mca-params.conf
>>>>> btl_tcp_if_include=eth1
>>>>> on zelda01, I have :
>>>>> [humphrey_at_zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
>>>>> btl_tcp_if_include=eth0
>>>>
>>>> Just to make sure I'm reading this right -- 128.109.34.20 is
>>>> supposed
>>>> to be routable to 130.207.252.131, right? Can you ssh directly from
>>>> one machine to the other? (I'm guessing that you can because OMPI
>>>> was
>>>> able to start processes) Can you ping one machine from the other?
>>>>
>>>> Most importantly -- can you open arbitrary TCP ports between the two
>>>> machines? (i.e., not just well-known ports like 22 [ssh], etc.)
>>>>
>>>> --
>>>> {+} Jeff Squyres
>>>> {+} The Open MPI Project
>>>> {+} http://www.open-mpi.org/
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> --
>> {+} Jeff Squyres
>> {+} The Open MPI Project
>> {+} http://www.open-mpi.org/
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/