Open MPI Development Mailing List Archives

From: Greg Watson (gwatson_at_[hidden])
Date: 2005-12-16 10:47:24


Jeff,

I finally worked out why I couldn't reproduce the problem. You're not
going to like it though.

As before, this is running on FC4 and I'm using 1.0.1r8453 (the 1.0.1
release version).
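
(x.c isn't shown here; judging from the "my tid is N" lines in the
output below, it's essentially the usual init/print-rank/finalize
test, along the lines of:)

#include <stdio.h>
#include <mpi.h>

/* minimal sketch of the test case: each process prints its rank */
int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("my tid is %d\n", rank);
    MPI_Finalize();
    return 0;
}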

First test:

$ ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x

[localhost.localdomain:10085] [0,0,0] setting up session dir with
[localhost.localdomain:10085] universe default-universe
[localhost.localdomain:10085] user greg
[localhost.localdomain:10085] host localhost.localdomain
[localhost.localdomain:10085] jobid 0
[localhost.localdomain:10085] procid 0
[localhost.localdomain:10085] procdir: /tmp/openmpi-sessions-
greg_at_localhost.localdomain_0/default-universe/0/0
[localhost.localdomain:10085] jobdir: /tmp/openmpi-sessions-
greg_at_localhost.localdomain_0/default-universe/0
[localhost.localdomain:10085] unidir: /tmp/openmpi-sessions-
greg_at_localhost.localdomain_0/default-universe
[localhost.localdomain:10085] top: openmpi-sessions-
greg_at_localhost.localdomain_0
[localhost.localdomain:10085] tmp: /tmp
[localhost.localdomain:10085] [0,0,0] contact_file /tmp/openmpi-
sessions-greg_at_localhost.localdomain_0/default-universe/universe-
setup.txt
[localhost.localdomain:10085] [0,0,0] wrote setup file
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1,
state = 0x1)
[localhost.localdomain:10085] pls:rsh: local csh: 0, local bash: 1
[localhost.localdomain:10085] pls:rsh: assuming same remote shell as
local shell
[localhost.localdomain:10085] pls:rsh: remote csh: 0, remote bash: 1
[localhost.localdomain:10085] pls:rsh: final template argv:
[localhost.localdomain:10085] pls:rsh: ssh <template> orted --
debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --
nodename <template> --universe greg_at_localhost.localdomain:default-
universe --nsreplica "0.0.0;tcp://10.0.1.103:32818" --gprreplica
"0.0.0;tcp://10.0.1.103:32818" --mpi-call-yield 0
[localhost.localdomain:10085] pls:rsh: launching on node localhost
[localhost.localdomain:10085] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[localhost.localdomain:10085] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1,
state = 0xa)
mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
on signal 11.
[localhost.localdomain:10085] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10085] spawn: in job_state_callback(jobid = 1,
state = 0x9)
[localhost.localdomain:10085] ERROR: A daemon on node localhost
failed to start as expected.
[localhost.localdomain:10085] ERROR: There may be more information
available from
[localhost.localdomain:10085] ERROR: the remote shell (see above).
[localhost.localdomain:10085] The daemon received a signal 11 (with
core).
1 additional process aborted (not shown)
[localhost.localdomain:10085] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found job session
dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found univ session
dir empty - deleting
[localhost.localdomain:10085] sess_dir_finalize: found top session
dir empty - deleting

Here's the stack trace from the core file:

#0 0x00e93fe8 in orte_pls_rsh_launch ()
    from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
#1 0x0023c642 in orte_rmgr_urm_spawn ()
    from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
#2 0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
#3 0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
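
(That's just from pointing gdb at the installed mpirun and the core
file, roughly:

$ gdb /usr/local/ompi/bin/mpirun <core file>
(gdb) bt

There's no line info for the .so frames since this build wasn't
compiled with -g.)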

Now reconfigure with debugging enabled:

$ CFLAGS=-g ./configure --with-devel-headers --prefix=/usr/local/ompi
$ make
$ make install
$ mpicc -o x x.c
$ mpirun -d -np 2 ./x

[localhost.localdomain:10166] [0,0,0] setting up session dir with
[localhost.localdomain:10166] universe default-universe
[localhost.localdomain:10166] user greg
[localhost.localdomain:10166] host localhost.localdomain
[localhost.localdomain:10166] jobid 0
[localhost.localdomain:10166] procid 0
[localhost.localdomain:10166] procdir: /tmp/openmpi-sessions-
greg_at_localhost.localdomain_0/default-universe/0/0
[localhost.localdomain:10166] jobdir: /tmp/openmpi-sessions-
greg_at_localhost.localdomain_0/default-universe/0
[localhost.localdomain:10166] unidir: /tmp/openmpi-sessions-
greg_at_localhost.localdomain_0/default-universe
[localhost.localdomain:10166] top: openmpi-sessions-
greg_at_localhost.localdomain_0
[localhost.localdomain:10166] tmp: /tmp
[localhost.localdomain:10166] [0,0,0] contact_file /tmp/openmpi-
sessions-greg_at_localhost.localdomain_0/default-universe/universe-
setup.txt
[localhost.localdomain:10166] [0,0,0] wrote setup file
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x1)
[localhost.localdomain:10166] pls:rsh: local csh: 0, local bash: 1
[localhost.localdomain:10166] pls:rsh: assuming same remote shell as
local shell
[localhost.localdomain:10166] pls:rsh: remote csh: 0, remote bash: 1
[localhost.localdomain:10166] pls:rsh: final template argv:
[localhost.localdomain:10166] pls:rsh: ssh <template> orted --
debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --
nodename <template> --universe greg_at_localhost.localdomain:default-
universe --nsreplica "0.0.0;tcp://10.0.1.103:32820" --gprreplica
"0.0.0;tcp://10.0.1.103:32820" --mpi-call-yield 0
[localhost.localdomain:10166] pls:rsh: launching on node localhost
[localhost.localdomain:10166] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[localhost.localdomain:10166] pls:rsh: localhost is a LOCAL node
[localhost.localdomain:10166] pls:rsh: executing: orted --debug --
bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
localhost --universe greg_at_localhost.localdomain:default-universe --
nsreplica "0.0.0;tcp://10.0.1.103:32820" --gprreplica "0.0.0;tcp://
10.0.1.103:32820" --mpi-call-yield 1
[localhost.localdomain:10167] [0,0,1] setting up session dir with
[localhost.localdomain:10167] universe default-universe
[localhost.localdomain:10167] user greg
[localhost.localdomain:10167] host localhost
[localhost.localdomain:10167] jobid 0
[localhost.localdomain:10167] procid 1
[localhost.localdomain:10167] procdir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe/0/1
[localhost.localdomain:10167] jobdir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe/0
[localhost.localdomain:10167] unidir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe
[localhost.localdomain:10167] top: openmpi-sessions-greg_at_localhost_0
[localhost.localdomain:10167] tmp: /tmp
[localhost.localdomain:10169] [0,1,1] setting up session dir with
[localhost.localdomain:10169] universe default-universe
[localhost.localdomain:10169] user greg
[localhost.localdomain:10169] host localhost
[localhost.localdomain:10169] jobid 1
[localhost.localdomain:10169] procid 1
[localhost.localdomain:10169] procdir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe/1/1
[localhost.localdomain:10169] jobdir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe/1
[localhost.localdomain:10169] unidir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe
[localhost.localdomain:10169] top: openmpi-sessions-greg_at_localhost_0
[localhost.localdomain:10169] tmp: /tmp
[localhost.localdomain:10170] [0,1,0] setting up session dir with
[localhost.localdomain:10170] universe default-universe
[localhost.localdomain:10170] user greg
[localhost.localdomain:10170] host localhost
[localhost.localdomain:10170] jobid 1
[localhost.localdomain:10170] procid 0
[localhost.localdomain:10170] procdir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe/1/0
[localhost.localdomain:10170] jobdir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe/1
[localhost.localdomain:10170] unidir: /tmp/openmpi-sessions-
greg_at_localhost_0/default-universe
[localhost.localdomain:10170] top: openmpi-sessions-greg_at_localhost_0
[localhost.localdomain:10170] tmp: /tmp
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x3)
[localhost.localdomain:10166] Info: Setting up debugger process table
for applications
   MPIR_being_debugged = 0
   MPIR_debug_gate = 0
   MPIR_debug_state = 1
   MPIR_acquired_pre_main = 0
   MPIR_i_am_starter = 0
   MPIR_proctable_size = 2
   MPIR_proctable:
     (i, host, exe, pid) = (0, localhost, ./x, 10169)
     (i, host, exe, pid) = (1, localhost, ./x, 10170)
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x4)
[localhost.localdomain:10170] [0,1,0] ompi_mpi_init completed
[localhost.localdomain:10169] [0,1,1] ompi_mpi_init completed
my tid is 0
my tid is 1
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x7)
[localhost.localdomain:10166] spawn: in job_state_callback(jobid = 1,
state = 0x8)
[localhost.localdomain:10167] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10170] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10169] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10170] sess_dir_finalize: job session dir not
empty - leaving
[localhost.localdomain:10167] sess_dir_finalize: proc session dir not
empty - leaving
[localhost.localdomain:10167] orted: job_state_callback(jobid = 1,
state = ORTE_PROC_STATE_TERMINATED)
[localhost.localdomain:10167] sess_dir_finalize: found proc session
dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found job session
dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found univ session
dir empty - deleting
[localhost.localdomain:10167] sess_dir_finalize: found top session
dir empty - deleting

So it looks like you're doing something that breaks with the default
optimization (I presume -O3) but works when the build uses CFLAGS=-g
instead.

FC4 is using gcc 4.0.0 20050519.
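
If it would help narrow things down, I can rebuild with optimization
kept but symbols added, so the backtrace from the failing build has
line numbers, e.g. something like:

$ CFLAGS="-O3 -g" ./configure --with-devel-headers --prefix=/usr/local/ompi

or step the optimization down (-O2, -O1) and see where it stops
segfaulting. (Those are just guesses at what would be useful; I
haven't run them yet.)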

Suggestions on how to proceed would be appreciated.

Greg
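
P.S. To make sure I'm following the scenario you describe below: the
tricky case is when each process's own mmap() of the same shared
region comes back at a different virtual address, so raw pointers
into the region can't be exchanged between processes and everything
has to be expressed as offsets from the local base. A toy
illustration of just that behaviour (nothing to do with the actual
sm btl code):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    const char *path = "/tmp/mmap-demo";
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 4096) != 0) {
        perror("setup");
        return 1;
    }

    pid_t pid = fork();

    /* parent and child each call mmap() on the same file; the kernel
     * is free to place the mapping at a different virtual address in
     * each address space (on many systems they happen to match) */
    void *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    printf("pid %d sees the shared region at %p\n",
           (int) getpid(), base);

    if (pid > 0) {
        waitpid(pid, NULL, 0);
        unlink(path);
    }
    return 0;
}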

On Dec 1, 2005, at 9:19 AM, Jeff Squyres wrote:

> On Dec 1, 2005, at 10:58 AM, Greg Watson wrote:
>
>> @#$%^& it! I can't get the problem to manifest for either branch now.
>
> Well, that's good for me. :-)
>
> FWIW, the problem existed on systems that could/would return different
> addresses in different processes from mmap() for memory that was
> common
> to all of them. E.g., if processes A and B share common memory Z, A
> would get virtual address M for Z, and B would get virtual address N
> (as opposed to both A and B getting virtual address M).
>
> Here's the history of what happened...
>
> We had code paths for that situation in the sm btl (i.e., when A and B
> get different addresses for the same shared memory), but
> unbeknownst to
> us, mmap() on most systems seems to return the same value in A and B
> (this could be a side-effect of typical MPI testing memory usage
> patterns... I don't know).
>
> But FC3 and FC4 consistently did not seem to follow this pattern --
> they would return different values from mmap() in different processes.
> Unfortunately, we did not do any testing on FC3 or FC4 systems until a
> few weeks before SC, and discovered that some of our
> previously-unknowingly-untested sm bootstrap code paths had some bugs.
> I fixed all of those and brought [almost all of] them over to the 1.0
> release branch. I missed one patch in v1.0, but it will be
> included in
> v1.0.1, to be released shortly.
>
> So I'd be surprised if you were still seeing this bug in either branch
> -- as far as I know, I fixed all the issues. More specifically, if
> you
> see this behavior, it will probably be in *both* branches.
>
> Let me know if you run across it again. Thanks!
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel