
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenIB Error in ibv_create_srq
From: Allen Barnett (allen_at_[hidden])
Date: 2010-08-03 10:01:58


Hi: In answer to my own question: by studying the file
mca-btl-openib-device-params.ini, I discovered that this option in
OMPI 1.4.2:

-mca btl_openib_receive_queues P,65536,256,192,128

was sufficient to prevent OMPI from trying to create shared receive
queues and allowed my application to run to completion using the IB
hardware.

I guess my question now is: what do these numbers mean? Presumably the
sizes (or counts?) of the buffers to allocate? Are there limits on these
values, or a recommended way to tune them?
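For anyone landing on this thread later, here is how I currently read
that parameter. The field names below are my interpretation from the
comments in mca-btl-openib-device-params.ini, not an authoritative
description, and my_app / -np 16 are placeholders:

```shell
# btl_openib_receive_queues spec (my reading, unverified):
#   P      -> per-peer receive queues only ("S" would request a shared
#             receive queue, which is what fails on this hardware)
#   65536  -> receive buffer size in bytes
#   256    -> number of receive buffers to post
#   192    -> low watermark at which more buffers are re-posted
#   128    -> credit window size
#
# Restricting to the openib/sm/self BTLs makes an openib failure fatal
# instead of silently falling back to TCP:
mpirun -np 16 \
    -mca btl openib,sm,self \
    -mca btl_openib_receive_queues P,65536,256,192,128 \
    ./my_app
```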

Thanks,
Allen

On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
> Hi Terry:
> It is indeed the case that the openib BTL has not been initialized. I
> ran with your tcp-disabled MCA option and it aborted in MPI_Init.
>
> The OFED stack is what's included in RHEL4. It appears to be made up of
> the RPMs:
> openib-1.4-1.el4
> opensm-3.2.5-1.el4
> libibverbs-1.1.2-1.el4
>
> How can I determine whether SRQ is supported? Is there an MCA option to
> disable it? (Our in-house cluster has more recent Mellanox IB hardware
> and is running this same IB stack, and OMPI 1.4.2 works OK there, so I
> suspect SRQ is supported by the OpenFabrics stack. Perhaps.)
>
> Thanks,
> Allen
>
> On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
> > My guess, from the message below saying "(openib) BTL failed to
> > initialize", is that the code is probably running over TCP. To prove
> > this conclusively, you can restrict Open MPI to only the openib, sm,
> > and self BTLs, eliminating the tcp BTL: add "-mca btl openib,sm,self"
> > to the mpirun line. I believe that with that specification the code
> > will abort rather than run to completion.
> >
> > What version of the OFED stack are you using? I wonder if srq is
> > supported on your system or not?
> >
> > --td
> >
> > Allen Barnett wrote:
> > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
> > > on a cluster of machines running RHEL4 with the standard OFED stack. The
> > > HCAs are identified as:
> > >
> > > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
> > > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
> > >
> > > ibv_devinfo says that one port on the HCAs is active but the other is
> > > down:
> > >
> > > hca_id: mthca0
> > >     fw_ver:         3.0.2
> > >     node_guid:      0006:6a00:9800:4c78
> > >     sys_image_guid: 0006:6a00:9800:4c78
> > >     vendor_id:      0x066a
> > >     vendor_part_id: 23108
> > >     hw_ver:         0xA1
> > >     phys_port_cnt:  2
> > >     port: 1
> > >         state:      active (4)
> > >         max_mtu:    2048 (4)
> > >         active_mtu: 2048 (4)
> > >         sm_lid:     1
> > >         port_lid:   26
> > >         port_lmc:   0x00
> > >
> > >     port: 2
> > >         state:      down (1)
> > >         max_mtu:    2048 (4)
> > >         active_mtu: 512 (2)
> > >         sm_lid:     0
> > >         port_lid:   0
> > >         port_lmc:   0x00
> > >
> > >
> > > When the OMPI application is run, it prints the error message:
> > >
> > > --------------------------------------------------------------------
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > create an internal queue. This typically indicates a failed
> > > OpenFabrics installation, faulty hardware, or that Open MPI is
> > > attempting to use a feature that is not supported on your hardware
> > > (i.e., is a shared receive queue specified in the
> > > btl_openib_receive_queues MCA parameter with a device that does not
> > > support it?). The failure occured here:
> > >
> > > Local host: machine001.lan
> > > OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
> > > Function: ibv_create_srq()
> > > Error: Invalid argument (errno=22)
> > > Device: mthca0
> > >
> > > You may need to consult with your system administrator to get this
> > > problem fixed.
> > > --------------------------------------------------------------------
> > >
> > > The full log of a run with "btl_openib_verbose 1" is attached. My
> > > application appears to run to completion, but I can't tell if it's just
> > > running on TCP and not using the IB hardware.
> > >
> > > I would appreciate any suggestions on how to proceed to fix this error.
> > >
> > > Thanks,
> > > Allen
> >
>

-- 
Allen Barnett
Transpire, Inc
E-Mail: allen_at_[hidden]
Skype:  allenbarnett
Ph:     518-887-2930