Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r19991
From: Tim Mattox (timattox_at_[hidden])
Date: 2008-11-13 11:54:09


I'm not 100% sure, but this looks like the changeset that caused all
of IU's trunk MTT runs last night to segfault... yes, all. :-(

Here's the magnitude of the problem:
http://www.open-mpi.org/mtt/index.php?do_redir=883

Note how pretty much everything was passing for 1.4a1r19979,
and everything failing for 1.4a1r19991.

I am not sure why there are only results from Absoft and IU. Maybe the
Sun MTT runs from last night just haven't finished yet.

For a more manageable sample, take a look at these MTT results, where you
can click on the "details" button to see the various segfaults:
http://www.open-mpi.org/mtt/index.php?do_redir=884

Most of the segfaults involve mca_iof_hnp.so and look something like this:

======================
[odin093:06882] *** Process received signal ***
[odin093:06882] Signal: Segmentation fault (11)
[odin093:06882] Signal code: Address not mapped (1)
[odin093:06882] Failing at address: 0x8
[odin093:06882] [ 0] /lib64/libpthread.so.0 [0x2aaaaba4ee70]
[odin093:06882] [ 1] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_3/installs/TqMo/install/lib/openmpi/mca_iof_hnp.so [0x2aaaadc1c3fd]
[odin093:06882] [ 2] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_3/installs/TqMo/install/lib/libopen-pal.so.0 [0x2aaaaaf29b0b]
[odin093:06882] [ 3] mpirun [0x4033e3]
[odin093:06882] [ 4] mpirun [0x402b13]
[odin093:06882] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabc788b4]
[odin093:06882] [ 6] mpirun [0x402a49]
[odin093:06882] *** End of error message ***
======================

But a few of them don't have mca_iof_hnp.so in the stack trace, so
I could be wrong about which changeset caused this:

======================
[odin090:12437] *** Process received signal ***
[odin090:12437] Signal: Segmentation fault (11)
[odin090:12437] Signal code: Address not mapped (1)
[odin090:12437] Failing at address: 0x4
[odin090:12437] [ 0] [0xffffe600]
[odin090:12437] [ 1] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0 [0xf7f5b118]
[odin090:12437] [ 2] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xf7f5b367]
[odin090:12437] [ 3] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xf7f5b38e]
[odin090:12437] [ 4] mpirun [0x804a8f8]
[odin090:12437] [ 5] mpirun [0x8049f36]
[odin090:12437] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0xf7d8ddec]
[odin090:12437] [ 7] mpirun [0x8049e61]
[odin090:12437] *** End of error message ***
======================

On Wed, Nov 12, 2008 at 6:32 PM, <rhc_at_[hidden]> wrote:
> Author: rhc
> Date: 2008-11-12 18:32:01 EST (Wed, 12 Nov 2008)
> New Revision: 19991
> URL: https://svn.open-mpi.org/trac/ompi/changeset/19991
>
> Log:
> Fix the iof race conditions wrt proc termination. This is comprised of two sections:
>
> 1. modify the iof to track when a proc actually closes all of its open iof output pipes. When this occurs, notify the odls that the proc's iof is complete. This is done via a zero-time event so that we can step out of the read event before processing the notification.
>
> 2. in the odls, modify the waitpid callback so it only flags that it was called. Add a function to receive the iof-complete notification, and a function that checks for both iof complete and waitpid callback before declaring a proc fully terminated. This ensures that we read and deliver -all- of the IO prior to declaring the job complete.
>
> Also modified the odls call to orte_iof.close (and the component's implementation) so it only closes stdin, leaving the other io channels alone. This fixes the other half of the known problem.
>
> This should fix the ticket on this subject, but I'll wait to close it pending further testing in the trunk.
>
> Text files modified:
>   trunk/orte/mca/iof/base/base.h                   |  30 +++-
>   trunk/orte/mca/iof/base/iof_base_open.c          |  32 ++++
>   trunk/orte/mca/iof/hnp/iof_hnp.c                 |  98 +++++++------
>   trunk/orte/mca/iof/hnp/iof_hnp.h                 |   2
>   trunk/orte/mca/iof/hnp/iof_hnp_component.c       |  14 -
>   trunk/orte/mca/iof/hnp/iof_hnp_read.c            |  62 +++++++-
>   trunk/orte/mca/iof/orted/iof_orted.c             |  85 +++++++----
>   trunk/orte/mca/iof/orted/iof_orted.h             |   2
>   trunk/orte/mca/iof/orted/iof_orted_component.c   |   6
>   trunk/orte/mca/iof/orted/iof_orted_read.c        |  39 +++++
>   trunk/orte/mca/odls/base/base.h                  |   5
>   trunk/orte/mca/odls/base/odls_base_default_fns.c | 280 +++++++++++++++++++++++++--------------
>   trunk/orte/mca/odls/base/odls_base_open.c        |   2
>   trunk/orte/mca/odls/base/odls_private.h          |   2
>   trunk/orte/mca/odls/odls_types.h                 |   3
>   trunk/orte/runtime/orte_wait.c                   |  17 ++
>   trunk/orte/runtime/orte_wait.h                   |  33 ++++
>   17 files changed, 491 insertions(+), 221 deletions(-)
>
> Modified: trunk/orte/mca/iof/base/base.h
>
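
Item 2 of the quoted log says a proc is declared fully terminated only
after both the waitpid callback and the iof-complete notification have
been seen. A minimal sketch of that two-flag check follows. It is an
illustration only, not the actual odls code: proc_state_t, on_waitpid,
on_iof_complete, and check_proc_complete are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process record; the real odls child structure differs. */
typedef struct {
    bool waitpid_fired;  /* set by the waitpid callback */
    bool iof_complete;   /* set when all iof output pipes are reported closed */
    bool terminated;     /* set once both conditions hold */
} proc_state_t;

/* Declare the proc terminated only when both events have been seen. */
static bool check_proc_complete(proc_state_t *p)
{
    assert(p != NULL);
    if (p->waitpid_fired && p->iof_complete && !p->terminated) {
        p->terminated = true;
    }
    return p->terminated;
}

/* waitpid callback: only records that it was called, then re-checks. */
static bool on_waitpid(proc_state_t *p)
{
    assert(p != NULL);
    p->waitpid_fired = true;
    return check_proc_complete(p);
}

/* iof-complete notification: records pipe closure, then re-checks. */
static bool on_iof_complete(proc_state_t *p)
{
    assert(p != NULL);
    p->iof_complete = true;
    return check_proc_complete(p);
}
```

Whichever event arrives second is the one that marks the proc terminated,
so the relative order of waitpid and pipe closure does not matter.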

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox_at_[hidden] || timattox_at_[hidden]
    I'm a bright... http://www.the-brights.net/