Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] mca_btl_openib_post_srr() posts to an uncreated SRQ when ibv_resize_cq() has failed
From: Nadia Derbey (Nadia.Derbey_at_[hidden])
Date: 2009-11-26 10:38:16


On Mon, 2009-10-26 at 15:06 -0700, Paul H. Hargrove wrote:
> Retrying w/ fewer CQ entries as Jeff describes is a good idea to help
> ensure that EINVAL actually does signify that the count exceeds the max
> (instead of just assuming this is so). If it actually was signifying
> some other error case, then one would probably not want to continue.

Sorry for the delay, but I had many other things to do...

You'll find a patch proposal attached, ready for review.

The only part I'm not sure about is the following hunk:

@@ -496,7 +540,13 @@ int mca_btl_openib_add_procs(
         peers[i] = endpoint;
     }
 
- return mca_btl_openib_size_queues(openib_btl, nprocs);
+ rc = mca_btl_openib_size_queues(openib_btl, nprocs);
+ if (OMPI_SUCCESS != rc) {
+ mca_btl_openib_del_procs(btl, nprocs, ompi_procs, peers);
+ opal_bitmap_clear_all_bits(reachable);
+ }
+
+ return rc;

I don't know whether there's a "less violent" way of undoing things.

Anyway, things work well with the patch applied.

You'll also find attached:
1. the output without the patch applied
2. the output with the patch applied
3. the output with the patch applied + an emulation of an EINVAL that
keeps being returned.

Comments would be welcome.

Regards,
Nadia

>
> -Paul
>
> Jeff Squyres wrote:
> > Thanks for the analysis!
> >
> > We've argued about btl_r2_add_btls() before -- IIRC, the consensus is
> > that we want it to be able to continue even if a BTL fails. So I
> > *think* that your #1 answer is better.
> >
> > However, we might want to try a little harder if EINVAL is returned --
> > perhaps decrease the number of CQ entries and try again until either
> > we have too few CQ entries to be useful (e.g., 0 or some higher number
> > that is still "too small"), or fail the BTL altogether...?
> >
> > On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:
> >
> >> Hi,
> >>
> >> Yesterday I had to analyze a SIGSEGV occurring after the following
> >> message had been output:
> >> [.... adjust_cq] cannot resize completion queue, error: 22
> >>
> >>
> >> What I found is the following:
> >>
> >> When ibv_resize_cq() fails to resize a CQ (in my case it returned
> >> EINVAL), adjust_cq() returns an error and create_srq() is not called by
> >> mca_btl_openib_size_queues().
> >>
> >> Note: One of our infiniband specialists told me that EINVAL was returned
> >> in that case because we were asking for more CQ entries than the max
> >> available.
> >>
> >> mca_bml_r2_add_btls() goes on executing.
> >>
> >> Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
> >> ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
> >> (remember that create_srq() has not been previously called).
> >>
> >> Since all the QPs have been successfully created, qp_create_all() then
> >> calls:
> >> mca_btl_openib_endpoint_post_recvs()
> >> --> mca_btl_openib_post_srr()
> >> --> ibv_post_srq_recv() on a NULL SRQ
> >> ==> SIGSEGV
> >>
> >>
> >> If I'm not wrong in the analysis above, we have the choice between 2
> >> solutions to fix this problem:
> >>
> >> 1. if EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
> >> as the ENOSYS case: do not return an error, since the CQ has
> >> successfully been created (maybe with fewer entries than needed, but
> >> it is there).
> >>
> >> Doing this we assume that EINVAL will always be the symptom of a "too
> >> many entries asked for" error from the IB stack. I don't have the
> >> answer...
> >> + I don't know whether this would imply degraded performance.
> >>
> >> 2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
> >> btl_add_procs().
> >>
> >> FYI I tested solution #1 and it worked...
> >>
> >> Any suggestion or comment would be welcome.
> >>
> >> Regards,
> >> Nadia
> >>
> >> --
> >> Nadia Derbey <Nadia.Derbey_at_[hidden]>
> >>

>

-- 
Nadia Derbey <Nadia.Derbey_at_[hidden]>


[derbeyn_at_inti0 ~]$ salloc -n 16 -N 2 -p Zeus mpirun --mca btl openib,self /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 -npmin 16 sendrecv
salloc: Granted job allocation 90732
[inti42][[4571,1],13][../../../../../ompi/mca/btl/openib/btl_openib.c:201:adjust_cq] cannot resize completion queue, error: 22
[inti41][[4571,1],6][../../../../../ompi/mca/btl/openib/btl_openib.c:201:adjust_cq] cannot resize completion queue, error: 22
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Thu Nov 26 15:52:27 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-128.el5.Bull.3
# Version : #1 SMP Fri Feb 13 10:09:19 CET 2009

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 16777216
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv
[inti41:06482] *** Process received signal ***
[inti41:06482] Signal: Segmentation fault (11)
[inti41:06482] Signal code: Address not mapped (1)
[inti41:06482] Failing at address: (nil)
[inti41:06482] [ 0] /lib64/libpthread.so.0 [0x305d00de60]
[inti41:06482] [ 1] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2aac3d401597]
[inti41:06482] [ 2] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2aac3d409e2c]
[inti41:06482] [ 3] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2aac3d4134c5]
[inti41:06482] [ 4] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_rml_oob.so [0x2aac3b1868a1]
[inti41:06482] [ 5] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2aac3b3901a0]
[inti41:06482] [ 6] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2aac3b3914ca]
[inti41:06482] [ 7] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libopen-pal.so.0 [0x2aac3a908fcb]
[inti41:06482] [ 8] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aac3a8f57fe]
[inti41:06482] [ 9] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libmpi.so.0 [0x2aac3a418035]
[inti41:06482] [10] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2aac3e67ed55]
[inti41:06482] [11] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2aac3e67eed7]
[inti41:06482] [12] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2aac3e674d7f]
[inti41:06482] [13] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_sync.so [0x2aac3e4712d9]
[inti41:06482] [14] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libmpi.so.0(MPI_Bcast+0x171) [0x2aac3a4241b1]
[inti41:06482] [15] /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1(IMB_basic_input+0x956) [0x4042d6]
[inti41:06482] [16] /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1(main+0x6b) [0x402eab]
[inti41:06482] [17] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305c41d8a4]
[inti41:06482] [18] /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 [0x402d89]
[inti41:06482] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 6482 on node inti41 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 90732


[derbeyn_at_inti0 ~]$ salloc -n 16 -N 2 -p Zeus mpirun --mca btl openib,self /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 -npmin 16 sendrecv
salloc: Granted job allocation 90737
--------------------------------------------------------------------------
WARNING: Could not resize CQ to the size originally asked for.

  Local host: inti41
  Device name: mthca0
  Size asked for: 9344
  Actual CQ size: 7008

This may result in lower performance.
--------------------------------------------------------------------------
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Thu Nov 26 15:55:03 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-128.el5.Bull.3
# Version : #1 SMP Fri Feb 13 10:09:19 CET 2009

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 16777216
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 16
#-----------------------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
            0 1000 30.91 31.07 30.99 0.00
            1 1000 30.15 30.32 30.24 0.06
            2 1000 29.79 30.04 29.96 0.13
            4 1000 29.38 29.56 29.47 0.26
            8 1000 39.45 39.60 39.55 0.39
           16 1000 29.22 29.38 29.32 1.04
           32 1000 29.44 29.97 29.85 2.04
           64 1000 39.91 41.17 40.51 2.97
          128 1000 38.99 39.62 39.47 6.16
          256 1000 28.58 28.81 28.72 16.95
          512 1000 29.67 29.85 29.77 32.72
         1024 1000 42.02 42.18 42.07 46.31
         2048 1000 46.98 47.27 47.16 82.64
         4096 1000 47.91 48.25 48.12 161.93
         8192 1000 88.36 88.62 88.49 176.31
        16384 1000 254.80 255.03 254.96 122.53
        32768 1000 360.07 361.14 360.73 173.06
        65536 640 561.28 574.89 571.20 217.43
[inti0:22534] 3 more processes have sent help message help-mpi-btl-openib.txt / CQ resized lower
[inti0:22534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
       131072 320 1054.02 1077.62 1069.75 231.99
       262144 160 2098.44 2128.30 2119.77 234.93
       524288 80 4234.15 4258.82 4250.09 234.81
      1048576 40 8390.82 8463.80 8435.44 236.30
      2097152 20 16565.85 16796.04 16712.02 238.15
      4194304 10 32637.41 33899.59 33395.27 235.99
      8388608 5 61666.01 68091.20 65520.32 234.98
     16777216 2 129991.41 138666.51 134618.62 230.77
salloc: Relinquishing job allocation 90737


[derbeyn_at_inti0 ~]$ salloc -n 16 -N 2 -p Zeus mpirun --mca btl openib,self /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 -npmin 16 sendrecv
salloc: Granted job allocation 90741
[inti42][[6518,1],11][../../../../../ompi/mca/btl/openib/btl_openib.c:220:adjust_cq] cannot resize completion queue, error: 22
[inti41][[6518,1],4][../../../../../ompi/mca/btl/openib/btl_openib.c:220:adjust_cq] cannot resize completion queue, error: 22
[inti41][[6518,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:788:rml_recv_cb] can't find suitable endpoint for this peer

--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 6569 on
node inti41 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 90741

This result has been obtained by applying the following patch, which
emulates an unconditional EINVAL.

btl/openib: emulate persisting EINVAL

diff -r bd820c9c0415 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c Thu Nov 26 15:53:22 2009 +0100
+++ b/ompi/mca/btl/openib/btl_openib.c Thu Nov 26 15:59:03 2009 +0100
@@ -214,6 +214,7 @@ static int adjust_cq(mca_btl_openib_devi
             while (EINVAL == abs(rc) && cq_size > old_cq_size) {
                 cq_size = old_cq_size + ((cq_size - old_cq_size) / 2);
                 rc = ibv_resize_cq(device->ib_cq[cq], cq_size);
+rc = EINVAL;
             }
             if (rc) {
                 BTL_ERROR(("cannot resize completion queue, error: %d", rc));