From: Marty Humphrey (humphrey_at_[hidden])
Date: 2005-11-10 08:50:14


I'm not seeing any cores -- I'll see if there's anything stopping them from
being produced.
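
(Roughly what I'm checking, assuming bash and the stock Linux core handling --
the limit has to be raised in the shell that actually launches mpiexec:)

  # allow core files in the shell that will launch the job
  ulimit -c unlimited
  ulimit -c                           # should now report "unlimited"

  # see where the kernel writes core files ("core" means the process's cwd)
  cat /proc/sys/kernel/core_pattern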

I've attached gdb to one of the hanging "a.out" processes (this is with the
"mpiexec" invocation that includes "--mca oob_tcp_include eth0").

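(For reference, the run was launched with the command you suggested, and the
attach itself was along these lines -- the pid below is just a placeholder:)

  # launch with both the BTL and the OOB traffic pinned to eth0
  mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0 -np 2 a.out

  # find the hung ranks and attach to one of them
  ps -ef | grep a.out
  gdb /home/humphrey/a.out <pid>      # or: gdb -p <pid>
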
(gdb) bt
#0  0x001e3007 in sched_yield () from /lib/tls/libc.so.6
#1  0x00512b55 in opal_progress () at runtime/opal_progress.c:306
#2  0x00298f57 in opal_condition_wait (c=0xbf5120, m=0xbf5180) at ../../../../opal/threads/condition.h:74
#3  0x00298877 in mca_pml_ob1_recv (addr=0x0, count=0, datatype=0xbe8ce0, src=-1, tag=-16, comm=0xbf0440, status=0x0) at pml_ob1_irecv.c:102
#4  0x002b7ff2 in mca_coll_basic_barrier_intra_lin (comm=0xbf0440) at coll_basic_barrier.c:70
#5  0x00b4dbfa in PMPI_Barrier (comm=0xbf0440) at pbarrier.c:52
#6  0x00b8d257 in mpi_barrier_f (comm=0x804b92c, ierr=0x8052580) at pbarrier_f.c:66
#7  0x080494a8 in MAIN__ () at Halo.f:32
#8  0x0804b7e6 in main ()
(gdb) up
#1  0x00512b55 in opal_progress () at runtime/opal_progress.c:306
306         sched_yield();
(gdb) up
#2  0x00298f57 in opal_condition_wait (c=0xbf5120, m=0xbf5180) at ../../../../opal/threads/condition.h:74
74          opal_progress();
(gdb) up
#3  0x00298877 in mca_pml_ob1_recv (addr=0x0, count=0, datatype=0xbe8ce0, src=-1, tag=-16, comm=0xbf0440, status=0x0) at pml_ob1_irecv.c:102
102         opal_condition_wait(&ompi_request_cond, &ompi_request_lock);
(gdb) up
#4  0x002b7ff2 in mca_coll_basic_barrier_intra_lin (comm=0xbf0440) at coll_basic_barrier.c:70
70          err = MCA_PML_CALL(recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE,
(gdb) up
#5  0x00b4dbfa in PMPI_Barrier (comm=0xbf0440) at pbarrier.c:52
52          err = comm->c_coll.coll_barrier(comm);
(gdb) up
#6  0x00b8d257 in mpi_barrier_f (comm=0x804b92c, ierr=0x8052580) at pbarrier_f.c:66
66          *ierr = OMPI_INT_2_FINT(MPI_Barrier(c_comm));
(gdb) up
#7  0x080494a8 in MAIN__ () at Halo.f:32
32          CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
Current language: auto; currently fortran
(gdb)

Does this help? (By the way, I upgraded to "openmpi-1.1a1r8084" before running
this experiment.)

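(A quick way to double-check which build is actually being picked up --
assuming the new install's bin directory is first in the PATH:)

  which mpiexec
  ompi_info | grep "Open MPI:"        # should report 1.1a1r8084
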
Thanks for your help,
Marty

> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Jeff Squyres
> Sent: Wednesday, November 09, 2005 10:41 PM
> To: Open MPI Users
> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
>
> Sorry for the delay in replying -- it's a crazy week here preparing for
> SC next week.
>
> I'm double checking the code, and I don't see any obvious problems with
> the btl tcp include stuff.
>
> Can you also specify that you want OMPI's "out of band" communication
> to use a specific network?
>
> > mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0
> > -np 2 a.out
>
> With the segv's, do you get meaningful core dumps? Can you send
> backtraces?
>
>
>
> On Nov 8, 2005, at 3:02 PM, Marty Humphrey wrote:
>
> > It's taken me a while, but I've simplified the experiment...
> >
> > In a nutshell, I'm seeing strange behavior in my multi-NIC box when I
> > attempt to execute " mpiexec -d --mca btl_tcp_if_include eth0 -np 2
> > a.out".
> > I have three different observed behaviors:
> >
> > [1] mpi worker rank 0 displays the banner and then just hangs
> > (apparently
> > trying to exchange MPI messages, which don't get delivered)
> >
> > 2 PE'S AS A 2 BY 1 GRID
> >
> > [2] it starts progressing (spitting out domain-specific msgs):
> >
> > 2 PE'S AS A 2 BY 1 GRID
> >
> > HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
> > HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
> >
> > [3] I get failure pretty quickly, with the line " mpiexec noticed that
> > job
> > rank 1 with PID 20425 on node "localhost" exited on signal 11."
> >
> > Here's the output of "ifconfig":
> >
> > [humphrey_at_zelda01 humphrey]$ /sbin/ifconfig
> > eth0 Link encap:Ethernet HWaddr 00:11:43:DC:EA:EE
> > inet addr:130.207.252.131 Bcast:130.207.252.255
> > Mask:255.255.255.0
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:2441905 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:112786 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:197322445 (188.1 Mb) TX bytes:32906750 (31.3 Mb)
> > Base address:0xecc0 Memory:dfae0000-dfb00000
> >
> > eth2 Link encap:Ethernet HWaddr 00:11:95:C7:28:82
> > inet addr:10.0.0.11 Bcast:10.0.0.255 Mask:255.255.255.0
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:11598757 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:7224590 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:3491651158 (3329.8 Mb) TX bytes:1916674000 (1827.8
> > Mb)
> > Interrupt:77 Base address:0xcc00
> >
> > ipsec0 Link encap:Ethernet HWaddr 00:11:43:DC:EA:EE
> > inet addr:130.207.252.131 Mask:255.255.255.0
> > UP RUNNING NOARP MTU:16260 Metric:1
> > RX packets:40113 errors:0 dropped:40113 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:10
> > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
> >
> > lo Link encap:Local Loopback
> > inet addr:127.0.0.1 Mask:255.0.0.0
> > UP LOOPBACK RUNNING MTU:16436 Metric:1
> > RX packets:4742 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:4742 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:2369841 (2.2 Mb) TX bytes:2369841 (2.2 Mb)
> >
> > This is with openmpi-1.1a1r8038 .
> >
> > Here is the output of a hanging invocation....
> >
> > ----- begin hanging invocation ----
> > [humphrey_at_zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
> > -np 2
> > a.out
> > [zelda01.localdomain:20455] procdir: (null)
> > [zelda01.localdomain:20455] jobdir: (null)
> > [zelda01.localdomain:20455] unidir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > [zelda01.localdomain:20455] top:
> > openmpi-sessions-humphrey_at_zelda01.localdomain_0
> > [zelda01.localdomain:20455] tmp: /tmp
> > [zelda01.localdomain:20455] connect_uni: contact info read
> > [zelda01.localdomain:20455] connect_uni: connection not allowed
> > [zelda01.localdomain:20455] [0,0,0] setting up session dir with
> > [zelda01.localdomain:20455] tmpdir /tmp
> > [zelda01.localdomain:20455] universe default-universe-20455
> > [zelda01.localdomain:20455] user humphrey
> > [zelda01.localdomain:20455] host zelda01.localdomain
> > [zelda01.localdomain:20455] jobid 0
> > [zelda01.localdomain:20455] procid 0
> > [zelda01.localdomain:20455] procdir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20455/
> > 0/0
> > [zelda01.localdomain:20455] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20455/
> > 0
> > [zelda01.localdomain:20455] unidir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20455
> > [zelda01.localdomain:20455] top:
> > openmpi-sessions-humphrey_at_zelda01.localdomain_0
> > [zelda01.localdomain:20455] tmp: /tmp
> > [zelda01.localdomain:20455] [0,0,0] contact_file
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20455/
> > universe-setup.txt
> > [zelda01.localdomain:20455] [0,0,0] wrote setup file
> > [zelda01.localdomain:20455] pls:rsh: local csh: 0, local bash: 1
> > [zelda01.localdomain:20455] pls:rsh: assuming same remote shell as
> > local
> > shell
> > [zelda01.localdomain:20455] pls:rsh: remote csh: 0, remote bash: 1
> > [zelda01.localdomain:20455] pls:rsh: final template argv:
> > [zelda01.localdomain:20455] pls:rsh: ssh <template> orted --debug
> > --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> > <template> --universe
> > humphrey_at_zelda01.localdomain:default-universe-20455
> > --nsreplica
> > "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> > 130.207.252.1
> > 31:35465" --gprreplica
> > "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> > 130.207.252.1
> > 31:35465" --mpi-call-yield 0
> > [zelda01.localdomain:20455] pls:rsh: launching on node localhost
> > [zelda01.localdomain:20455] pls:rsh: oversubscribed -- setting
> > mpi_yield_when_idle to 1 (1 2)
> > [zelda01.localdomain:20455] pls:rsh: localhost is a LOCAL node
> > [zelda01.localdomain:20455] pls:rsh: executing: orted --debug
> > --bootproxy 1
> > --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
> > --universe
> > humphrey_at_zelda01.localdomain:default-universe-20455 --nsreplica
> > "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> > 130.207.252.1
> > 31:35465" --gprreplica
> > "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
> > 130.207.252.1
> > 31:35465" --mpi-call-yield 1
> > [zelda01.localdomain:20456] [0,0,1] setting up session dir with
> > [zelda01.localdomain:20456] universe default-universe-20455
> > [zelda01.localdomain:20456] user humphrey
> > [zelda01.localdomain:20456] host localhost
> > [zelda01.localdomain:20456] jobid 0
> > [zelda01.localdomain:20456] procid 1
> > [zelda01.localdomain:20456] procdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/0/1
> > [zelda01.localdomain:20456] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/0
> > [zelda01.localdomain:20456] unidir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
> > [zelda01.localdomain:20456] top: openmpi-sessions-humphrey_at_localhost_0
> > [zelda01.localdomain:20456] tmp: /tmp
> > [zelda01.localdomain:20457] [0,1,1] setting up session dir with
> > [zelda01.localdomain:20457] universe default-universe-20455
> > [zelda01.localdomain:20457] user humphrey
> > [zelda01.localdomain:20457] host localhost
> > [zelda01.localdomain:20457] jobid 1
> > [zelda01.localdomain:20457] procid 1
> > [zelda01.localdomain:20457] procdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1/1
> > [zelda01.localdomain:20457] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1
> > [zelda01.localdomain:20457] unidir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
> > [zelda01.localdomain:20457] top: openmpi-sessions-humphrey_at_localhost_0
> > [zelda01.localdomain:20457] tmp: /tmp
> > [zelda01.localdomain:20458] [0,1,0] setting up session dir with
> > [zelda01.localdomain:20458] universe default-universe-20455
> > [zelda01.localdomain:20458] user humphrey
> > [zelda01.localdomain:20458] host localhost
> > [zelda01.localdomain:20458] jobid 1
> > [zelda01.localdomain:20458] procid 0
> > [zelda01.localdomain:20458] procdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1/0
> > [zelda01.localdomain:20458] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455/1
> > [zelda01.localdomain:20458] unidir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20455
> > [zelda01.localdomain:20458] top: openmpi-sessions-humphrey_at_localhost_0
> > [zelda01.localdomain:20458] tmp: /tmp
> > [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x3)
> > [zelda01.localdomain:20455] Info: Setting up debugger process table for
> > applications
> > MPIR_being_debugged = 0
> > MPIR_debug_gate = 0
> > MPIR_debug_state = 1
> > MPIR_acquired_pre_main = 0
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20457)
> > (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20458)
> > [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x4)
> > [zelda01.localdomain:20458] [0,1,0] ompi_mpi_init completed
> > [zelda01.localdomain:20457] [0,1,1] ompi_mpi_init completed
> >
> > 2 PE'S AS A 2 BY 1 GRID
> > ------ end hanging invocation -----
> >
> > Here's the 1-in-approximately-20 that started working...
> >
> > ------- begin non-hanging invocation -----
> > [humphrey_at_zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
> > -np 2
> > a.out
> > [zelda01.localdomain:20659] procdir: (null)
> > [zelda01.localdomain:20659] jobdir: (null)
> > [zelda01.localdomain:20659] unidir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > [zelda01.localdomain:20659] top:
> > openmpi-sessions-humphrey_at_zelda01.localdomain_0
> > [zelda01.localdomain:20659] tmp: /tmp
> > [zelda01.localdomain:20659] connect_uni: contact info read
> > [zelda01.localdomain:20659] connect_uni: connection not allowed
> > [zelda01.localdomain:20659] [0,0,0] setting up session dir with
> > [zelda01.localdomain:20659] tmpdir /tmp
> > [zelda01.localdomain:20659] universe default-universe-20659
> > [zelda01.localdomain:20659] user humphrey
> > [zelda01.localdomain:20659] host zelda01.localdomain
> > [zelda01.localdomain:20659] jobid 0
> > [zelda01.localdomain:20659] procid 0
> > [zelda01.localdomain:20659] procdir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20659/
> > 0/0
> > [zelda01.localdomain:20659] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20659/
> > 0
> > [zelda01.localdomain:20659] unidir:
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20659
> > [zelda01.localdomain:20659] top:
> > openmpi-sessions-humphrey_at_zelda01.localdomain_0
> > [zelda01.localdomain:20659] tmp: /tmp
> > [zelda01.localdomain:20659] [0,0,0] contact_file
> > /tmp/openmpi-sessions-humphrey_at_zelda01.localdomain_0/default-universe
> > -20659/
> > universe-setup.txt
> > [zelda01.localdomain:20659] [0,0,0] wrote setup file
> > [zelda01.localdomain:20659] pls:rsh: local csh: 0, local bash: 1
> > [zelda01.localdomain:20659] pls:rsh: assuming same remote shell as
> > local
> > shell
> > [zelda01.localdomain:20659] pls:rsh: remote csh: 0, remote bash: 1
> > [zelda01.localdomain:20659] pls:rsh: final template argv:
> > [zelda01.localdomain:20659] pls:rsh: ssh <template> orted --debug
> > --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> > <template> --universe
> > humphrey_at_zelda01.localdomain:default-universe-20659
> > --nsreplica
> > "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> > 130.207.252.1
> > 31:35654" --gprreplica
> > "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> > 130.207.252.1
> > 31:35654" --mpi-call-yield 0
> > [zelda01.localdomain:20659] pls:rsh: launching on node localhost
> > [zelda01.localdomain:20659] pls:rsh: oversubscribed -- setting
> > mpi_yield_when_idle to 1 (1 2)
> > [zelda01.localdomain:20659] pls:rsh: localhost is a LOCAL node
> > [zelda01.localdomain:20659] pls:rsh: executing: orted --debug
> > --bootproxy 1
> > --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
> > --universe
> > humphrey_at_zelda01.localdomain:default-universe-20659 --nsreplica
> > "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> > 130.207.252.1
> > 31:35654" --gprreplica
> > "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
> > 130.207.252.1
> > 31:35654" --mpi-call-yield 1
> > [zelda01.localdomain:20660] [0,0,1] setting up session dir with
> > [zelda01.localdomain:20660] universe default-universe-20659
> > [zelda01.localdomain:20660] user humphrey
> > [zelda01.localdomain:20660] host localhost
> > [zelda01.localdomain:20660] jobid 0
> > [zelda01.localdomain:20660] procid 1
> > [zelda01.localdomain:20660] procdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/0/1
> > [zelda01.localdomain:20660] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/0
> > [zelda01.localdomain:20660] unidir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
> > [zelda01.localdomain:20660] top: openmpi-sessions-humphrey_at_localhost_0
> > [zelda01.localdomain:20660] tmp: /tmp
> > [zelda01.localdomain:20661] [0,1,1] setting up session dir with
> > [zelda01.localdomain:20661] universe default-universe-20659
> > [zelda01.localdomain:20661] user humphrey
> > [zelda01.localdomain:20661] host localhost
> > [zelda01.localdomain:20661] jobid 1
> > [zelda01.localdomain:20661] procid 1
> > [zelda01.localdomain:20661] procdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1/1
> > [zelda01.localdomain:20661] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1
> > [zelda01.localdomain:20661] unidir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
> > [zelda01.localdomain:20661] top: openmpi-sessions-humphrey_at_localhost_0
> > [zelda01.localdomain:20661] tmp: /tmp
> > [zelda01.localdomain:20662] [0,1,0] setting up session dir with
> > [zelda01.localdomain:20662] universe default-universe-20659
> > [zelda01.localdomain:20662] user humphrey
> > [zelda01.localdomain:20662] host localhost
> > [zelda01.localdomain:20662] jobid 1
> > [zelda01.localdomain:20662] procid 0
> > [zelda01.localdomain:20662] procdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1/0
> > [zelda01.localdomain:20662] jobdir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659/1
> > [zelda01.localdomain:20662] unidir:
> > /tmp/openmpi-sessions-humphrey_at_localhost_0/default-universe-20659
> > [zelda01.localdomain:20662] top: openmpi-sessions-humphrey_at_localhost_0
> > [zelda01.localdomain:20662] tmp: /tmp
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x3)
> > [zelda01.localdomain:20659] Info: Setting up debugger process table for
> > applications
> > MPIR_being_debugged = 0
> > MPIR_debug_gate = 0
> > MPIR_debug_state = 1
> > MPIR_acquired_pre_main = 0
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20661)
> > (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20662)
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x4)
> > [zelda01.localdomain:20662] [0,1,0] ompi_mpi_init completed
> > [zelda01.localdomain:20661] [0,1,1] ompi_mpi_init completed
> >
> > 2 PE'S AS A 2 BY 1 GRID
> >
> > HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
> > HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
> > HALO2A NPES,N = 2 64 TIME = 0.000011 SECONDS
> > mpiexec: killing job...
> > Interrupt
> > Interrupt
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: job session dir not
> > empty -
> > leaving
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: univ session dir not
> > empty -
> > leaving
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0xa)
> > [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
> > =
> > ORTE_PROC_STATE_ABORTED)
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x9)
> > 2 processes killed (possibly by Open MPI)
> > [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
> > =
> > ORTE_PROC_STATE_TERMINATED)
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found univ session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found top session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found job session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found univ session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: top session dir not
> > empty -
> > leaving
> > [humphrey_at_zelda01 humphrey]$
> > -------- end non-hanging invocation ------
> >
> > Any thoughts?
> >
> > -- Marty
> >
> >> -----Original Message-----
> >> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> >> On
> >> Behalf Of Jeff Squyres
> >> Sent: Tuesday, November 01, 2005 2:17 PM
> >> To: Open MPI Users
> >> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
> >>
> >> On Nov 1, 2005, at 12:02 PM, Marty Humphrey wrote:
> >>
> >>> wukong: eth0 (152.48.249.102, no MPI traffic), eth1 (128.109.34.20, yes MPI traffic)
> >>> zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12, no MPI traffic)
> >>>
> >>> on wukong, I have :
> >>> [humphrey_at_wukong ~]$ more ~/.openmpi/mca-params.conf
> >>> btl_tcp_if_include=eth1
> >>> on zelda01, I have :
> >>> [humphrey_at_zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
> >>> btl_tcp_if_include=eth0
> >>
> >> Just to make sure I'm reading this right -- 128.109.34.20 is supposed
> >> to be routable to 130.207.252.131, right? Can you ssh directly from
> >> one machine to the other? (I'm guessing that you can because OMPI was
> >> able to start processes) Can you ping one machine from the other?
> >>
> >> Most importantly -- can you open arbitrary TCP ports between the two
> >> machines? (i.e., not just well-known ports like 22 [ssh], etc.)
> >>
> >> --
> >> {+} Jeff Squyres
> >> {+} The Open MPI Project
> >> {+} http://www.open-mpi.org/
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users