Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB Error in ibv_create_srq
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-08-04 12:03:28


Allen Barnett wrote:
> Thanks for the pointer!
>
> Do you know if these sizes are dependent on the hardware?
>
They can be; the following file sets the defaults for some known cards:

ompi/mca/btl/openib/mca-btl-openib-device-params.ini
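For reference, entries in that file are keyed by the adapter's PCI vendor and part IDs; a rough illustration of the shape of an entry (the section name and numeric values below are placeholders, not copied from the shipped file):

  [Example HCA]
  vendor_id = 0x1234
  vendor_part_id = 23108
  use_eager_rdma = 1
  mtu = 2048
  receive_queues = P,65536,256,192,128

A receive_queues line in a device's entry is where a per-device default such as the P,65536,256,192,128 string Allen used (quoted below) would typically come from.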

--td
> Thanks,
> Allen
>
> On Tue, 2010-08-03 at 10:29 -0400, Terry Dontje wrote:
>
>> Sorry, I didn't see your prior question; glad you found the
>> btl_openib_receive_queues parameter. There is no FAQ entry for
>> this, but I found the following in the openib BTL help file that spells
>> out the parameters for per-peer receive queues (i.e., a receive queue
>> setting with "P" as the first argument).
>>
>> Per-peer receive queues require between 2 and 5 parameters:
>>
>> 1. Buffer size in bytes (mandatory)
>> 2. Number of buffers (mandatory)
>> 3. Low buffer count watermark (optional; defaults to num_buffers / 2)
>> 4. Credit window size (optional; defaults to low_watermark / 2)
>> 5. Number of buffers reserved for credit messages (optional;
>>    defaults to (num_buffers * 2 - 1) / credit_window)
>>
>> Example: P,128,256,128,16
>> - 128 byte buffers
>> - 256 buffers to receive incoming MPI messages
>> - When the number of available buffers reaches 128, re-post 128 more
>> buffers to reach a total of 256
>> - If the number of available credits reaches 16, send an explicit
>> credit message to the sender
>> - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
>> reserved for explicit credit messages
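As a concrete invocation, that example string is passed on the mpirun command line just like the setting discussed elsewhere in this thread (process count and program name below are placeholders):

  mpirun -np 4 -mca btl_openib_receive_queues P,128,256,128,16 ./my_app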
>>
>> --td
>> Allen Barnett wrote:
>>
>>> Hi: In response to my own question, by studying the file
>>> mca-btl-openib-device-params.ini, I discovered that this option in
>>> OMPI-1.4.2:
>>>
>>> -mca btl_openib_receive_queues P,65536,256,192,128
>>>
>>> was sufficient to prevent OMPI from trying to create shared receive
>>> queues and allowed my application to run to completion using the IB
>>> hardware.
>>>
>>> I guess my question now is: What do these numbers mean? Presumably the
>>> size (or counts?) of buffers to allocate? Are there limits or a way to
>>> tune these values?
>>>
>>> Thanks,
>>> Allen
>>>
>>> On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
>>>
>>>
>>>> Hi Terry:
>>>> It is indeed the case that the openib BTL has not been initialized. I
>>>> ran with your tcp-disabled MCA option and it aborted in MPI_Init.
>>>>
>>>> The OFED stack is what's included in RHEL4. It appears to be made up of
>>>> the RPMs:
>>>> openib-1.4-1.el4
>>>> opensm-3.2.5-1.el4
>>>> libibverbs-1.1.2-1.el4
>>>>
>>>> How can I determine if srq is supported? Is there an MCA option to
>>>> defeat it? (Our in-house cluster has more recent Mellanox IB hardware
>>>> and is running this same IB stack and ompi 1.4.2 works OK, so I suspect
>>>> srq is supported by the OpenFabrics stack. Perhaps.)
>>>>
>>>> Thanks,
>>>> Allen
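One way to answer the "is srq supported" question above, outside of Open MPI, is to ask libibverbs for the device attributes; a max_srq of 0 generally indicates no SRQ support. A minimal sketch (first device only; typically built with something like "gcc srq_check.c -o srq_check -libverbs"):

  #include <stdio.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      int n;
      struct ibv_device **devs = ibv_get_device_list(&n);
      if (!devs || n == 0) {
          fprintf(stderr, "no IB devices found\n");
          return 1;
      }

      /* Open the first HCA and query its capability limits. */
      struct ibv_context *ctx = ibv_open_device(devs[0]);
      struct ibv_device_attr attr;
      if (!ctx || ibv_query_device(ctx, &attr)) {
          fprintf(stderr, "could not query %s\n",
                  ibv_get_device_name(devs[0]));
          return 1;
      }

      printf("%s: max_srq=%d max_srq_wr=%d max_srq_sge=%d\n",
             ibv_get_device_name(devs[0]), attr.max_srq,
             attr.max_srq_wr, attr.max_srq_sge);

      ibv_close_device(ctx);
      ibv_free_device_list(devs);
      return 0;
  }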
>>>>
>>>> On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
>>>>
>>>>
>>>>> My guess, from the message below saying "(openib) BTL failed to
>>>>> initialize", is that the code is probably running over tcp. To prove
>>>>> this conclusively, you can restrict the run to the openib, sm, and
>>>>> self BTLs, eliminating the tcp BTL, by adding "-mca btl openib,sm,self"
>>>>> to the mpirun line. I believe that with that specification the code
>>>>> will abort rather than run to completion.
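For example (placeholder process count and executable name):

  mpirun -np 4 -mca btl openib,sm,self ./my_app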
>>>>>
>>>>> What version of the OFED stack are you using? I wonder if srq is
>>>>> supported on your system or not?
>>>>>
>>>>> --td
>>>>>
>>>>> Allen Barnett wrote:
>>>>>
>>>>>
>>>>>> Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
>>>>>> on a cluster of machines running RHEL4 with the standard OFED stack. The
>>>>>> HCAs are identified as:
>>>>>>
>>>>>> 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
>>>>>> 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
>>>>>>
>>>>>> ibv_devinfo says that one port on the HCAs is active but the other is
>>>>>> down:
>>>>>>
>>>>>> hca_id: mthca0
>>>>>> fw_ver: 3.0.2
>>>>>> node_guid: 0006:6a00:9800:4c78
>>>>>> sys_image_guid: 0006:6a00:9800:4c78
>>>>>> vendor_id: 0x066a
>>>>>> vendor_part_id: 23108
>>>>>> hw_ver: 0xA1
>>>>>> phys_port_cnt: 2
>>>>>> port: 1
>>>>>> state: active (4)
>>>>>> max_mtu: 2048 (4)
>>>>>> active_mtu: 2048 (4)
>>>>>> sm_lid: 1
>>>>>> port_lid: 26
>>>>>> port_lmc: 0x00
>>>>>>
>>>>>> port: 2
>>>>>> state: down (1)
>>>>>> max_mtu: 2048 (4)
>>>>>> active_mtu: 512 (2)
>>>>>> sm_lid: 0
>>>>>> port_lid: 0
>>>>>> port_lmc: 0x00
>>>>>>
>>>>>>
>>>>>> When the OMPI application is run, it prints the error message:
>>>>>>
>>>>>> --------------------------------------------------------------------
>>>>>> The OpenFabrics (openib) BTL failed to initialize while trying to
>>>>>> create an internal queue. This typically indicates a failed
>>>>>> OpenFabrics installation, faulty hardware, or that Open MPI is
>>>>>> attempting to use a feature that is not supported on your hardware
>>>>>> (i.e., is a shared receive queue specified in the
>>>>>> btl_openib_receive_queues MCA parameter with a device that does not
>>>>>> support it?). The failure occured here:
>>>>>>
>>>>>> Local host: machine001.lan
>>>>>> OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
>>>>>> Function: ibv_create_srq()
>>>>>> Error: Invalid argument (errno=22)
>>>>>> Device: mthca0
>>>>>>
>>>>>> You may need to consult with your system administrator to get this
>>>>>> problem fixed.
>>>>>> --------------------------------------------------------------------
>>>>>>
>>>>>> The full log of a run with "btl_openib_verbose 1" is attached. My
>>>>>> application appears to run to completion, but I can't tell if it's just
>>>>>> running on TCP and not using the IB hardware.
>>>>>>
>>>>>> I would appreciate any suggestions on how to proceed to fix this error.
>>>>>>
>>>>>> Thanks,
>>>>>> Allen
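As an aside, the failing call can be exercised directly with a few lines of libibverbs code, which may help separate an Open MPI configuration problem from a driver or firmware limitation. A rough sketch (the SRQ sizes are arbitrary examples, not the values Open MPI requests):

  #include <stdio.h>
  #include <errno.h>
  #include <string.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      int n;
      struct ibv_device **devs = ibv_get_device_list(&n);
      if (!devs || n == 0) {
          fprintf(stderr, "no IB devices found\n");
          return 1;
      }

      struct ibv_context *ctx = ibv_open_device(devs[0]);
      struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
      if (!pd) {
          fprintf(stderr, "open_device/alloc_pd failed\n");
          return 1;
      }

      /* Try to create a small shared receive queue; EINVAL (errno=22),
         as in the btl_openib.c:250 failure above, suggests the
         device/driver combination does not support SRQs. */
      struct ibv_srq_init_attr init = { 0 };
      init.attr.max_wr  = 512;   /* arbitrary example size */
      init.attr.max_sge = 1;

      struct ibv_srq *srq = ibv_create_srq(pd, &init);
      if (!srq)
          printf("ibv_create_srq failed: %s (errno=%d)\n",
                 strerror(errno), errno);
      else {
          printf("ibv_create_srq succeeded\n");
          ibv_destroy_srq(srq);
      }

      ibv_dealloc_pd(pd);
      ibv_close_device(ctx);
      ibv_free_device_list(devs);
      return 0;
  }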
>>>>>>
>
>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


