Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes
From: Yevgeny Kliteynik (kliteyn_at_[hidden])
Date: 2011-12-18 04:39:16


On 16-Dec-11 4:28 AM, Jeff Squyres wrote:
> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test cluster, and I don't see these kinds of problems.
>
> Mellanox -- any ideas?

So if I understand it right, you have a mixed cluster: some
machines with ConnectX family HCAs (mlx4) and some with InfiniHost
HCAs (mthca), and the problem arises only on machines with mthca.

When exactly do you see this RNR problem:
 - when all the participating nodes are mthcas?
 - when the MPI job runs on both types of HCAs?

-- YK
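
Answering the two questions above requires knowing which HCA family each participating node has. Here is a minimal sketch for checking that on a single node, assuming the standard Linux sysfs directory /sys/class/infiniband and the usual device naming (mthca0, mlx4_0). The function names are hypothetical and not part of Open MPI or OFED.

```python
import os

SYSFS_IB = "/sys/class/infiniband"  # assumed standard sysfs location for IB devices


def hca_family(device_name):
    """Map a device name such as 'mthca0' or 'mlx4_0' to its driver family."""
    for prefix in ("mthca", "mlx4"):
        if device_name.startswith(prefix):
            return prefix
    return "other"


def local_hca_families(sysfs_dir=SYSFS_IB):
    """Return {device_name: family} for the HCAs visible on this node."""
    if not os.path.isdir(sysfs_dir):
        return {}  # no IB stack loaded, or not a Linux host
    return {name: hca_family(name) for name in sorted(os.listdir(sysfs_dir))}


if __name__ == "__main__":
    for name, family in local_hca_families().items():
        print(name, family)
```

Running this on every node of a failing job (for example through the batch system) would show whether the job was mthca-only or mixed.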

 
>
> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
>
>> Hi Terry,
>>
>> Thanks so much for the response. My replies are in-line below.
>>
>> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
>>> IIRC, RNR's are usually due to the receiving side not having a segment
>>> registered and ready to receive data on a QP. The btl does go through a
>>> big dance and does its own flow control to make sure this doesn't happen.
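
To illustrate the idea of sender-side flow control described above, here is a toy model of a credit-based scheme, in which the sender only transmits when it knows the receiver has a buffer posted. It is a sketch of the general technique under that assumption, not the openib BTL's actual code, and all class and method names are made up for the example.

```python
from collections import deque


class Receiver:
    def __init__(self, posted_buffers):
        self.free = posted_buffers  # receive buffers currently posted
        self.inbox = deque()

    def deliver(self, msg):
        """Accept msg if a buffer is posted; otherwise report 'not ready'."""
        if self.free == 0:
            return False  # the situation that would lead to an RNR NAK
        self.free -= 1
        self.inbox.append(msg)
        return True

    def consume(self):
        """Process one message, repost its buffer, and hand back one credit."""
        self.inbox.popleft()
        self.free += 1
        return 1


class Sender:
    def __init__(self, credits):
        self.credits = credits  # sends the peer can absorb right now
        self.pending = deque()

    def send(self, msg, receiver):
        """Queue msg, then push out as many queued messages as credits allow."""
        self.pending.append(msg)
        self.flush(receiver)

    def flush(self, receiver):
        while self.pending and self.credits > 0:
            delivered = receiver.deliver(self.pending.popleft())
            if not delivered:
                raise RuntimeError("receiver not ready despite available credit")
            self.credits -= 1
```

With two posted buffers and two credits, a third send stays queued on the sender until the receiver consumes a message and returns a credit, so the receiver is never hit while it has no buffer posted.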
>>>
>>> So when this happens, are both the sending and receiving nodes using
>>> mthcas to communicate?
>>
>> Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
>> The mlx4-based nodes are using the same core switch as the mthca nodes.
>>
>>> By any chance is it a particular node (or pair of nodes) this seems to
>>> happen with?
>>
>> No. I've got 40 nodes total with this hardware configuration, and the
>> problem has been seen on most/all nodes at one time or another. It
>> doesn't seem, based on the limited number of observable parameters I'm
>> aware of, to be dependent on the number of nodes involved.
>>
>> It is an intermittent problem, but when it happens, it happens at job
>> launch, and it does occur most of the time.
>>
>> Thanks,
>>
>> V. Ram
>>
>>> --td
>>>>
>>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
>>>> assistance with this? Any suggestions on tunables or debugging
>>>> parameters to try?
>>>>
>>>> Thank you very much.
>>>>
>>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
>>>>> Hello,
>>>>>
>>>>> We are running a cluster that has a good number of older nodes with
>>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
>>>>> module).
>>>>>
>>>>> These adapters are all at firmware level 4.8.917.
>>>>>
>>>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
>>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
>>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
>>>>> in use are all from this OFED version.
>>>>>
>>>>> On nodes with the mthca hardware *only*, we get frequent but
>>>>> intermittent job startup failures, with messages like:
>>>>>
>>>>> /////////////////////////////////
>>>>>
>>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
>>>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
>>>>> RETRY EXCEEDED ERROR status
>>>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
>>>>> connection between two MPI processes has been exceeded. In general,
>>>>> this should not happen because Open MPI uses flow control on per-peer
>>>>> connections to ensure that receivers are always ready when data is
>>>>> sent.
>>>>>
>>>>> [further standard error text snipped...]
>>>>>
>>>>> Below is some information about the host that raised the error and the
>>>>> peer to which it was connected:
>>>>>
>>>>> Local host: compute-c3-07
>>>>> Local device: mthca0
>>>>> Peer host: compute-c3-01
>>>>>
>>>>> You may need to consult with your system administrator to get this
>>>>> problem fixed.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> /////////////////////////////////
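
When many such failures accumulate, it can help to pull the fields out of the handle_wc line and tabulate which host pairs are affected. Here is a minimal sketch, assuming the message layout is exactly as quoted above (other Open MPI versions may format it differently). The function name and the regular expression are my own.

```python
import re

# Pattern written against the single error line quoted in this thread.
PATTERN = re.compile(
    r"from (?P<local>\S+) to: (?P<peer>\S+) error polling (?P<cq>\w+) CQ "
    r"with status (?P<status>.+?) status number (?P<status_number>\d+) "
    r"for wr_id (?P<wr_id>[0-9a-fA-F]+) opcode (?P<opcode>\d+) "
    r"vendor error (?P<vendor_error>\d+) qp_idx (?P<qp_idx>\d+)"
)


def parse_handle_wc_error(text):
    """Return the fields of one handle_wc error as a dict, or None."""
    flat = " ".join(text.split())  # undo line wrapping added by mail clients
    match = PATTERN.search(flat)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = """[[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
RETRY EXCEEDED ERROR status
number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0"""
    print(parse_handle_wc_error(sample))
```

The status number 13 in the sample corresponds to IBV_WC_RNR_RETRY_EXC_ERR in libibverbs' enum ibv_wc_status, which agrees with the status text in the message.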
>>>>>
>>>>> During these job runs, I have monitored the InfiniBand performance
>>>>> counters on the endpoints and switch. No telltale counters for any of
>>>>> these ports change during these failed job initiations.
>>>>>
>>>>> ibdiagnet works fine and properly enumerates the fabric and related
>>>>> performance counters, both from the affected nodes, as well as other
>>>>> nodes attached to the IB switch. The IB connectivity itself seems fine
>>>>> from these nodes.
>>>>>
>>>>> Other nodes with different HCAs use the same InfiniBand fabric
>>>>> continuously without any issue, so I don't think it's the fabric/switch.
>>>>>
>>>>> I'm at a loss for what to do next to try and find the root cause of the
>>>>> issue. I suspect something perhaps having to do with the mthca
>>>>> support/drivers, but how can I track this down further?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> V. Ram.
>>
>
>