Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes
From: V. Ram (vramml0_at_[hidden])
Date: 2011-12-19 02:10:54


On Thu, Dec 15, 2011, at 09:28 PM, Jeff Squyres wrote:
> Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI
> test cluster, and I don't see these kinds of problems.

Can I ask what version of OFED you're using, or which OFED release your
IB software stack comes from?
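
In case it's quicker: assuming a reasonably standard OFED install, something
along these lines should show it (ofed_info may not be present on every
system, so treat these as suggestions rather than a recipe):

    # report the installed OFED release, if ofed_info is available
    ofed_info -s

    # confirm which ib_mthca module is actually loaded
    modinfo ib_mthca | grep -i -E 'filename|version'

    # driver/firmware view of the adapter itself
    ibv_devinfo -d mthca0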

Thank you.

V. Ram

> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
>
> > Hi Terry,
> >
> > Thanks so much for the response. My replies are in-line below.
> >
> > On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
> >> IIRC, RNRs are usually due to the receiving side not having a segment
> >> registered and ready to receive data on a QP. The btl does go through a
> >> big dance and does its own flow control to make sure this doesn't happen.
> >>
> >> So when this happens, are both the sending and receiving nodes using
> >> mthcas to communicate?
> >
> > Yes. For the newer nodes using onboard mlx4, this issue doesn't arise.
> > The mlx4-based nodes are using the same core switch as the mthca nodes.
> >
> >> By any chance is it a particular node (or pair of nodes) this seems to
> >> happen with?
> >
> > No. I've got 40 nodes total with this hardware configuration, and the
> > problem has been seen on most/all nodes at one time or another. It
> > doesn't seem, based on the limited number of observable parameters I'm
> > aware of, to be dependent on the number of nodes involved.
> >
> > It is an intermittent problem, but when it happens, it happens at job
> > launch, and it does occur most of the time.
> >
> > Thanks,
> >
> > V. Ram
> >
> >> --td
> >>>
> >>> Open MPI InfiniBand gurus and/or Mellanox: could I please get some
> >>> assistance with this? Any suggestions on tunables or debugging
> >>> parameters to try?
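
[As an aside, since the question of tunables keeps coming up: the openib BTL
exposes the QP retry attributes as MCA parameters, so one thing worth trying
-- purely a guess on my part, not a known fix -- is to raise the RNR retry
settings explicitly. The parameter names below should match what ompi_info
reports for 1.5.x (double-check with the first command); ./a.out is just a
placeholder for the real job:

    # list the openib BTL parameters related to retries/timeouts
    ompi_info --param btl openib | grep -E 'rnr|retry|timeout'

    # example launch; rnr_retry 7 means "retry indefinitely" in the verbs spec
    mpirun --mca btl_openib_ib_rnr_retry 7 \
           --mca btl_openib_ib_min_rnr_timer 25 \
           --mca btl_openib_ib_retry_count 7 ./a.out
]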
> >>>
> >>> Thank you very much.
> >>>
> >>> On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> >>>> Hello,
> >>>>
> >>>> We are running a cluster that has a good number of older nodes with
> >>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
> >>>> module).
> >>>>
> >>>> These adapters are all at firmware level 4.8.917.
> >>>>
> >>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64. Jobs are
> >>>> launched/managed using Slurm 2.2.7. The IB software and drivers
> >>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
> >>>> in use are all from this OFED version.
> >>>>
> >>>> On nodes with the mthca hardware *only*, we get frequent but
> >>>> intermittent job startup failures, with messages like:
> >>>>
> >>>> /////////////////////////////////
> >>>>
> >>>> [[19373,1],54][btl_openib_component.c:3320:handle_wc] from compute-c3-07
> >>>> to: compute-c3-01 error polling LP CQ with status RECEIVER NOT READY
> >>>> RETRY EXCEEDED ERROR status
> >>>> number 13 for wr_id 2a25c200 opcode 128 vendor error 135 qp_idx 0
> >>>>
> >>>> --------------------------------------------------------------------------
> >>>> The OpenFabrics "receiver not ready" retry count on a per-peer
> >>>> connection between two MPI processes has been exceeded. In general,
> >>>> this should not happen because Open MPI uses flow control on per-peer
> >>>> connections to ensure that receivers are always ready when data is
> >>>> sent.
> >>>>
> >>>> [further standard error text snipped...]
> >>>>
> >>>> Below is some information about the host that raised the error and the
> >>>> peer to which it was connected:
> >>>>
> >>>> Local host: compute-c3-07
> >>>> Local device: mthca0
> >>>> Peer host: compute-c3-01
> >>>>
> >>>> You may need to consult with your system administrator to get this
> >>>> problem fixed.
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> /////////////////////////////////
> >>>>
> >>>> During these job runs, I have monitored the InfiniBand performance
> >>>> counters on the endpoints and switch. No telltale counters for any of
> >>>> these ports change during these failed job initiations.
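
[For anyone trying to reproduce this: a counter check of this sort can be
done with perfquery from infiniband-diags; the LID and port below are
placeholders for each HCA/switch port of interest:

    # dump the PortCounters for one port (symbol errors, rcv errors, discards, ...)
    perfquery <lid> <port>

    # or sweep the fabric for non-zero error counters
    ibcheckerrors
]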
> >>>>
> >>>> ibdiagnet works fine and properly enumerates the fabric and related
> >>>> performance counters, both from the affected nodes and from other
> >>>> nodes attached to the IB switch. The IB connectivity itself seems fine
> >>>> from these nodes.
> >>>>
> >>>> Other nodes with different HCAs use the same InfiniBand fabric
> >>>> continuously without any issue, so I don't think it's the fabric/switch.
> >>>>
> >>>> I'm at a loss for what to do next to try and find the root cause of the
> >>>> issue. I suspect something perhaps having to do with the mthca
> >>>> support/drivers, but how can I track this down further?
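
[A generic next step that may or may not help here: rerun one failing case
with the BTL verbosity raised and compare the two endpoints' logs to see
which side stalls first. The btl list and output file below are only
examples:

    mpirun --mca btl openib,sm,self --mca btl_base_verbose 100 \
           ./a.out 2>&1 | tee openib-debug.log
]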
> >>>>
> >>>> Thank you,
> >>>>
> >>>> V. Ram.
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
