Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2011-12-19 07:29:43


On 12/19/2011 2:10 AM, V. Ram wrote:
> On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote:
>> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI
>> test cluster, and I don't see these kinds of problems.
> Can I ask what version of OFED you're using, or what version of OFED the
> IB software stack is coming from?
>
Just to set expectations here: Jeff is on vacation until January, so he
might not reply to this anytime soon.

--td
> Thank you.
>
> V. Ram
>
>> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
>>
>>> Hi Terry,
>>>
>>> Thanks so much for the response. My replies are in-line below.
>>>
>>> On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
>>>> IIRC, RNRs are usually due to the receiving side not having a segment
>>>> registered and ready to receive data on a QP. The openib BTL goes through a
>>>> big dance and does its own flow control to make sure this doesn't happen.
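
As a side note for anyone chasing this: the flow control described above is
driven by openib BTL MCA parameters. A quick way to see what a given build
defaults to, assuming ompi_info is in your PATH, is roughly:

    # List the openib BTL parameters and pick out the RNR / receive-queue knobs.
    # Parameter names are from the 1.5.x series; check against your own build.
    ompi_info --param btl openib | egrep -i 'rnr|receive_queues|srq'

The exact parameter set varies between releases, so treat that pattern as a
starting point rather than an exhaustive list.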
>>>>
>>>> So when this happens, are both the sending and receiving nodes using
>>>> mthcas to communicate with each other?
>>> Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
>>> The mlx4-based nodes are using the same core switch as the mthca nodes.
>>>
>>>> By any chance is it a particular node (or pair of nodes) this seems to
>>>> happen with?
>>> No. I've got 40 nodes total with this hardware configuration, and the
>>> problem has been seen on most/all nodes at one time or another. It
>>> doesn't seem, based on the limited number of observable parameters I'm
>>> aware of, to be dependent on the number of nodes involved.
>>>
>>> It is an intermittent problem, but when it happens, it happens at job
>>> launch, and it does occur most of the time.
>>>
>>> Thanks,
>>>
>>> V. Ram
>>>
>>>> --td
>>>>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
>>>>> assistance with this? Any suggestions on tunables or debugging
>>>>> parameters to try?
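
One hedged sketch of tunables that are sometimes tried for RNR trouble with the
openib BTL. The parameter names exist in the 1.5.x series, but the values are
only examples, whether they affect the per-peer queues in your
btl_openib_receive_queues layout depends on that layout, and ./your_mpi_app is
a placeholder for the real job:

    # Retry longer on RNR NAKs and advertise a larger minimum RNR timer.
    # An rnr_retry of 7 means "retry indefinitely" in InfiniBand terms.
    mpirun --mca btl openib,sm,self \
           --mca btl_openib_ib_rnr_retry 7 \
           --mca btl_openib_ib_min_rnr_timer 25 \
           --mca btl_base_verbose 30 \
           ./your_mpi_app

If an unlimited RNR retry merely hides the failures, that at least points at
posted-receive timing / flow control rather than at the fabric itself.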
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
>>>>>> Hello,
>>>>>>
>>>>>> We are running a cluster that has a good number of older nodes with
>>>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
>>>>>> module).
>>>>>>
>>>>>> These adapters are all at firmware level 4.8.917.
>>>>>>
>>>>>> The Open MPI version in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
>>>>>> launched/managed using Slurm 2.2.7. The IB software and drivers
>>>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
>>>>>> in use are all from this OFED version.
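
For anyone wanting to repeat that check, a rough sketch with the standard OFED
utilities looks like the following; exact output formats vary by release:

    # Installed OFED release.
    ofed_info -s
    # Confirm the loaded mthca module comes from the OFED tree, not the stock kernel.
    modinfo ib_mthca | egrep -i 'filename|version'
    # Firmware and device identity as the verbs layer sees them.
    ibv_devinfo | egrep -i 'hca_id|fw_ver|board_id'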
>>>>>>
>>>>>> On nodes with the mthca hardware *only*, we get frequent but
>>>>>> intermittent job startup failures, with messages like:
>>>>>>
>>>>>> /////////////////////////////////
>>>>>>
>>>>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
>>>>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
>>>>>> RETRY EXCEEDED ERROR status
>>>>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> The OpenFabrics "receiver not ready" retry count on a per-peer
>>>>>> connection between two MPI processes has been exceeded. In general,
>>>>>> this should not happen because Open MPI uses flow control on per-peer
>>>>>> connections to ensure that receivers are always ready when data is
>>>>>> sent.
>>>>>>
>>>>>> [further standard error text snipped...]
>>>>>>
>>>>>> Below is some information about the host that raised the error and the
>>>>>> peer to which it was connected:
>>>>>>
>>>>>> Local host: compute-c3-07
>>>>>> Local device: mthca0
>>>>>> Peer host: compute-c3-01
>>>>>>
>>>>>> You may need to consult with your system administrator to get this
>>>>>> problem fixed.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> /////////////////////////////////
>>>>>>
>>>>>> During these job runs, I have monitored the InfiniBand performance
>>>>>> counters on the endpoints and switch. No telltale counters for any of
>>>>>> these ports change during these failed job initiations.
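
A hedged sketch of one way to watch those counters while reproducing the
failure, using perfquery from infiniband-diags; <LID> and <PORT> are
placeholders for the endpoints and the switch ports they hang off of:

    # Extended port counters; capture before and after a failing launch and diff.
    perfquery -x <LID> <PORT>
    # Counters worth watching for this kind of stall: RcvErrors, XmtDiscards,
    # RcvRemotePhysErrors, LinkDowned (exact names vary slightly by version).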
>>>>>>
>>>>>> ibdiagnet works fine and properly enumerates the fabric and related
>>>>>> performance counters, both from the affected nodes and from other
>>>>>> nodes attached to the IB switch. The IB connectivity itself seems fine
>>>>>> from these nodes.
>>>>>>
>>>>>> Other nodes with different HCAs use the same InfiniBand fabric
>>>>>> continuously without any issue, so I don't think it's the fabric/switch.
>>>>>>
>>>>>> I'm at a loss for what to do next to try to find the root cause of the
>>>>>> issue. I suspect something having to do with the mthca
>>>>>> support/drivers, but how can I track this down further?
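
A rough sketch of the next layer of digging on an affected mthca node, assuming
the usual OFED utilities are installed; none of this is Open MPI specific:

    # Driver complaints logged around the time of a failed launch.
    dmesg | egrep -i 'mthca|infiniband'
    # Link state, rate, and physical state as the driver reports them.
    ibstat mthca0
    # Firmware version exposed through sysfs, to cross-check against 4.8.917.
    cat /sys/class/infiniband/mthca0/fw_ver

Cranking up btl_base_verbose on a single failing run (as in the mpirun sketch
earlier) may also show which queue-pair setup path is in use when the RNR
error fires.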
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> V. Ram.
>>> --
>>> http://www.fastmail.fm - One of many happy users:
>>> http://www.fastmail.fm/docs/quotes.html
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


