Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes
From: V. Ram (vramml0_at_[hidden])
Date: 2011-12-14 15:00:30


Open MPI InfiniBand gurus and/or Mellanox: could I please get some
assistance with this? Any suggestions on tunables or debugging
parameters to try?

Thank you very much.

On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> Hello,
>
> We are running a cluster that has a good number of older nodes with
> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
> module).
>
> These adapters are all at firmware level 4.8.917.
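
(The device name and firmware level can be confirmed with ibv_devinfo;
roughly the equivalent verbs calls, in case anyone wants to script the
check across nodes, would be:)

    /* Sketch: print each HCA's device name and firmware level, roughly
     * what ibv_devinfo reports.  Error handling trimmed for brevity. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);

        if (!devs)
            return 1;

        for (int i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            struct ibv_device_attr attr;

            if (ctx && ibv_query_device(ctx, &attr) == 0)
                printf("%s: firmware %s\n",
                       ibv_get_device_name(devs[i]), attr.fw_ver);
            if (ctx)
                ibv_close_device(ctx);
        }

        ibv_free_device_list(devs);
        return 0;
    }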
>
> The Open MPI version in use is 1.5.3, on kernel 2.6.39, x86-64. Jobs
> are launched/managed using Slurm 2.2.7. The IB software and drivers
> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
> in use are all from this OFED version.
>
> On nodes with the mthca hardware *only*, we get frequent but
> intermittent job startup failures, with messages like:
>
> /////////////////////////////////
>
> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
> RETRY EXCEEDED ERROR status
> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>
> --------------------------------------------------------------------------
> The OpenFabrics "receiver not ready" retry count on a per-peer
> connection between two MPI processes has been exceeded. In general,
> this should not happen because Open MPI uses flow control on per-peer
> connections to ensure that receivers are always ready when data is
> sent.
>
> [further standard error text snipped...]
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
> Local host: compute-c3-07
> Local device: mthca0
> Peer host: compute-c3-01
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
>
> /////////////////////////////////
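
For anyone less familiar with the openib BTL internals: as far as I can
tell, the handle_wc line above is the verbs completion queue reporting a
completion-with-error, and "status number 13" corresponds to
IBV_WC_RNR_RETRY_EXC_ERR, i.e. the HCA gave up after exhausting the RNR
(receiver-not-ready) retry budget configured on that queue pair. A
minimal sketch in plain libibverbs, not Open MPI's actual code (the cq
and qp handles are placeholders), of where those pieces live:

    /* Sketch only: plain libibverbs, not the openib BTL code itself. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Drain one completion and decode an error like the one above. */
    static void check_one_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status != IBV_WC_SUCCESS) {
            /* IBV_WC_RNR_RETRY_EXC_ERR has numeric value 13: the
             * "status number 13" printed by handle_wc(). */
            fprintf(stderr, "wr_id 0x%llx failed: %s (%d), vendor_err %u\n",
                    (unsigned long long) wc.wr_id,
                    ibv_wc_status_str(wc.status), wc.status, wc.vendor_err);
        }
    }

    /* The retry budget itself is a queue-pair attribute set when the RC
     * connection is brought up (RTR -> RTS); rnr_retry == 7 means
     * "retry indefinitely".  The values below are illustrative only. */
    static int move_qp_to_rts(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr = { 0 };

        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;  /* ACK timeout: 4.096 us * 2^14 */
        attr.retry_cnt     = 7;   /* transport (ACK) retries */
        attr.rnr_retry     = 7;   /* RNR NAK retries; 7 == infinite */
        attr.sq_psn        = 0;
        attr.max_rd_atomic = 1;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                             IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
    }

If I understand correctly, the openib BTL exposes the corresponding
knobs as btl_openib_ib_* MCA parameters (e.g. btl_openib_ib_rnr_retry),
which is part of why I'm asking about tunables above.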
>
> During these job runs, I have monitored the InfiniBand performance
> counters on the endpoints and on the switch. None of the telltale
> counters on any of these ports change during these failed job
> startups.
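
For reference, the per-port error counters can be read straight out of
sysfs on the nodes (or with perfquery); a rough sketch of the kind of
polling I mean, with the device name and port number hard-coded for the
mthca HCAs here:

    /* Rough sketch: dump a few of the standard per-port IB error
     * counters that the kernel exposes under sysfs, e.g.
     * /sys/class/infiniband/mthca0/ports/1/counters/. */
    #include <stdio.h>

    int main(void)
    {
        const char *counters[] = {
            "symbol_error", "link_error_recovery", "link_downed",
            "port_rcv_errors", "port_xmit_discards",
            "local_link_integrity_errors",
            "excessive_buffer_overrun_errors",
        };
        char path[256], buf[64];

        for (size_t i = 0; i < sizeof(counters) / sizeof(counters[0]); i++) {
            snprintf(path, sizeof(path),
                     "/sys/class/infiniband/mthca0/ports/1/counters/%s",
                     counters[i]);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;   /* counter not exposed by this HCA/driver */
            if (fgets(buf, sizeof(buf), f))
                printf("%-32s %s", counters[i], buf);
            fclose(f);
        }
        return 0;
    }

None of these move when the failures occur.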
>
> ibdiagnet works fine and properly enumerates the fabric and the
> related performance counters, both from the affected nodes and from
> other nodes attached to the IB switch. The IB connectivity itself
> seems fine from these nodes.
>
> Other nodes with different HCAs use the same InfiniBand fabric
> continuously without any issue, so I don't think it's the fabric/switch.
>
> I'm at a loss for what to do next to try to find the root cause of
> this issue. I suspect something related to the mthca support or
> drivers, but how can I track this down further?
>
> Thank you,
>
> V. Ram.
