Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-12-15 21:28:14


Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test cluster, and I don't see these kinds of problems.

Mellanox -- any ideas?

On Dec 15, 2011, at 7:24 PM, V. Ram wrote:

> Hi Terry,
>
> Thanks so much for the response. My replies are in-line below.
>
> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
>> IIRC, RNRs are usually due to the receiving side not having a segment
>> registered and ready to receive data on a QP. The openib BTL goes
>> through a big dance and does its own flow control to make sure this
>> doesn't happen.
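
At the verbs level, the RNR machinery described above comes down to two
QP attributes, set when a reliable-connected QP is moved to RTR and then
RTS. A minimal sketch follows; the helper names and values are
illustrative only, not Open MPI's defaults, and a real transition must
also supply the other mandatory RTR/RTS attributes:

    #include <infiniband/verbs.h>

    /* min_rnr_timer is advertised by the receiver on the INIT->RTR
     * transition and tells the sender how long to back off after an RNR
     * NAK.  rnr_retry is set by the sender on RTR->RTS and caps how many
     * RNR NAKs it will tolerate: 7 means "retry forever", anything
     * smaller can complete with IBV_WC_RNR_RETRY_EXC_ERR -- the
     * "status number 13" quoted further down in this thread. */
    static void add_rnr_rtr_attr(struct ibv_qp_attr *attr, int *mask)
    {
        attr->min_rnr_timer = 12;      /* IB-spec encoded delay, not ms */
        *mask |= IBV_QP_MIN_RNR_TIMER;
    }

    static void add_rnr_rts_attr(struct ibv_qp_attr *attr, int *mask)
    {
        attr->rnr_retry = 7;           /* 7 == infinite RNR retries */
        *mask |= IBV_QP_RNR_RETRY;
    }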
>>
>> So when this happens, are both the sending and receiving nodes using
>> mthcas to communicate?
>
> Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
> The mlx4-based nodes are using the same core switch as the mthca nodes.
>
>> By any chance, is it a particular node (or pair of nodes) that this
>> seems to happen with?
>
> No. I've got 40 nodes total with this hardware configuration, and the
> problem has been seen on most or all of them at one time or another.
> Based on the limited set of parameters I can observe, it doesn't seem
> to depend on the number of nodes involved in the job.
>
> The problem is intermittent, but when it does occur, it is at job
> launch, and it happens on the majority of launches.
>
> Thanks,
>
> V. Ram
>
>> --td
>>>
>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
>>> assistance with this? Any suggestions on tunables or debugging
>>> parameters to try?
>>>
>>> Thank you very much.
>>>
>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
>>>> Hello,
>>>>
>>>> We are running a cluster that has a good number of older nodes with
>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
>>>> module).
>>>>
>>>> These adapters are all at firmware level 4.8.917.
>>>>
>>>> The Open MPI in use is 1.5.3, on kernel 2.6.39, x86-64. Jobs are
>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
>>>> in use are all from this OFED version.
>>>>
>>>> On nodes with the mthca hardware *only*, we get frequent but
>>>> intermittent job startup failures, with messages like:
>>>>
>>>> /////////////////////////////////
>>>>
>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
>>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
>>>> RETRY EXCEEDED ERROR status
>>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>>>>
>>>> --------------------------------------------------------------------------
>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
>>>> connection between two MPI processes has been exceeded. In general,
>>>> this should not happen because Open MPI uses flow control on per-peer
>>>> connections to ensure that receivers are always ready when data is
>>>> sent.
>>>>
>>>> [further standard error text snipped...]
>>>>
>>>> Below is some information about the host that raised the error and the
>>>> peer to which it was connected:
>>>>
>>>> Local host: compute-c3-07
>>>> Local device: mthca0
>>>> Peer host: compute-c3-01
>>>>
>>>> You may need to consult with your system administrator to get this
>>>> problem fixed.
>>>> --------------------------------------------------------------------------
>>>>
>>>> /////////////////////////////////
>>>>
>>>> During these job runs, I have monitored the InfiniBand performance
>>>> counters on the endpoints and switch. No telltale counters for any of
>>>> these ports change during these failed job initiations.
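
For anyone reproducing this, a small sketch of reading those per-port
counters straight from sysfs (the same values perfquery reports). The
device name mthca0 and port number 1 are assumptions taken from the
error output above; adjust for the node being checked:

    #include <stdio.h>

    int main(void)
    {
        /* A few of the standard per-port IB error counters. */
        static const char *counters[] = {
            "port_rcv_errors", "port_xmit_discards", "symbol_error",
            "link_error_recovery", "local_link_integrity_errors",
            "excessive_buffer_overrun_errors",
        };
        char path[256];
        unsigned long long val;
        size_t i;

        for (i = 0; i < sizeof(counters) / sizeof(counters[0]); i++) {
            snprintf(path, sizeof(path),
                     "/sys/class/infiniband/mthca0/ports/1/counters/%s",
                     counters[i]);
            FILE *f = fopen(path, "r");
            if (!f) {
                perror(path);
                continue;
            }
            if (fscanf(f, "%llu", &val) == 1)
                printf("%-32s %llu\n", counters[i], val);
            fclose(f);
        }
        return 0;
    }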
>>>>
>>>> ibdiagnet works fine and properly enumerates the fabric and the
>>>> related performance counters, both from the affected nodes and from
>>>> other nodes attached to the IB switch. The IB connectivity itself
>>>> seems fine from these nodes.
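
The same kind of sanity check can be scripted on the affected nodes by
querying each local HCA port through libibverbs, roughly a subset of
what ibv_devinfo prints. A sketch (link with -libverbs):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0, i;
        struct ibv_device **devs = ibv_get_device_list(&num);

        if (!devs) {
            perror("ibv_get_device_list");
            return 1;
        }
        for (i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            struct ibv_device_attr dattr;
            uint8_t p;

            if (!ctx)
                continue;
            if (ibv_query_device(ctx, &dattr) == 0) {
                /* Report state, LID and active MTU (enum code) per port. */
                for (p = 1; p <= dattr.phys_port_cnt; p++) {
                    struct ibv_port_attr pattr;
                    if (ibv_query_port(ctx, p, &pattr) == 0)
                        printf("%s port %u: state=%s lid=%u active_mtu=%d\n",
                               ibv_get_device_name(devs[i]), (unsigned) p,
                               ibv_port_state_str(pattr.state),
                               (unsigned) pattr.lid, (int) pattr.active_mtu);
                }
            }
            ibv_close_device(ctx);
        }
        ibv_free_device_list(devs);
        return 0;
    }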
>>>>
>>>> Other nodes with different HCAs use the same InfiniBand fabric
>>>> continuously without any issue, so I don't think it's the fabric/switch.
>>>>
>>>> I'm at a loss for what to do next to try to find the root cause of
>>>> the issue. I suspect something related to the mthca support/drivers,
>>>> but how can I track this down further?
>>>>
>>>> Thank you,
>>>>
>>>> V. Ram.
>
> --
> http://www.fastmail.fm - One of many happy users:
> http://www.fastmail.fm/docs/quotes.html
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/